Fix decoding of non-ascii field names in ignored source #132018

jordan-powers · 2025-07-28T14:58:53Z

When encoding an ignored source entry, we use String.length() to get the length of the encoded field name. This will only work when the UTF-8 encoding has only ascii characters, with 1 byte per character.

The solution is to use the actual length of the encoded field name byte[] array.

I added a test that encodes a field in _ignored_source with a random unicode key. Without this fix, it fails with this stack trace:

java.lang.IllegalArgumentException: Can't decode [9c bd f2 8c b0 91 f1 80 ac 9c 53 f0 9e ad af f0 ae a2 b3 f1 9a 86 bd f1 a7 b5 86 f3 ac b0 82 e7 b5 b0 f1 9f 8f b3 f3 a7 b7 94 f2 91 9f bd f1 a0 80 af]
        at __randomizedtesting.SeedInfo.seed([1:80DEBC9A7DDCB0FE]:0)
        at org.elasticsearch.index.mapper.XContentDataHelper.decodeAndWrite(XContentDataHelper.java:109)
        at org.elasticsearch.index.mapper.XContentDataHelper.writeMerged(XContentDataHelper.java:196)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader$FieldWriter$IgnoredSource.writeTo(ObjectMapper.java:1230)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader.write(ObjectMapper.java:1141)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.write(SourceLoader.java:240)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.source(SourceLoader.java:199)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$LeafWithMetrics.source(SourceLoader.java:162)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider$Leaf.getSource(ConcurrentSegmentSourceProvider.java:79)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider.getSource(ConcurrentSegmentSourceProvider.java:58)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:862)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:816)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:62)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:54)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.testIgnoredStringFullUnicode(IgnoredSourceFieldMapperTests.java:134)

This is an alternative solution to the same issue as #131950. I think this solution is better because it can read old data written prior to this change (since this solution only changes the decoding logic, and the encoding logic stays the same).

elasticsearchmachine · 2025-07-28T14:59:19Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-07-28T14:59:20Z

Hi @jordan-powers, I've created a changelog YAML for you.

martijnvg

LGTM 👍

When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length.

elasticsearchmachine · 2025-07-28T16:43:51Z

💚 Backport successful

Status	Branch	Result
✅	9.1
✅	8.19
✅	8.18

When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length.

…32030) When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length.

…32029) When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length.

…32031) When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length.

jordan-powers · 2025-07-28T17:53:38Z

💚 All backports created successfully

Status	Branch	Result
✅	9.0

Questions ?

Please refer to the Backport tool documentation

When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length. (cherry picked from commit 178c0c9)

In #132018, we found an issue with non-ascii strings in ignored source. This issue was not exposed by the data generation tests because these tests use ascii-only strings. This patch updates the tests to instead use full unicode strings by default.

…32034) When encoding an ignored source entry, we write the string length of the field name, not the encoded byte count; however, the decode logic treats this encoded value as the byte length. This patch updates the decode logic to instead properly treat the value as the string length. (cherry picked from commit 178c0c9)

…-tracking * upstream/main: (26 commits) Add release notes for v9.1.0 release (elastic#131953) Unmute multi_node generative tests (elastic#132021) Avoid re-enqueueing merge tasks (elastic#132020) Fix file entitlements for shared data dir (elastic#131748) ES|QL brute force l2_norm vector function (elastic#132025) Make ES|QL SAMPLE not a pipeline breaker (elastic#132014) Speed up tail computation in MemorySegmentES91OSQVectorsScorer (elastic#132001) Remove deprecated usages in `TransportPutFollowAction` (elastic#132038) Simulate impact of shard movement using shard-level write load (elastic#131406) Remove RemoteClusterService.getConnections() method (elastic#131948) Fix off by one in ValuesBytesRefAggregator (elastic#132032) Use unicode strings in data generation by default (elastic#132028) Adding index.refresh_interval as a data stream setting (elastic#131482) [ES|QL] Add more Min/MaxOverTime CSV tests (elastic#131070) Restrict remote ENRICH after FORK (elastic#131945) Fix decoding of non-ascii field names in ignored source (elastic#132018) [docs] Use centrally maintained version variables (elastic#131939) Configurable Inference timeout during Query time (elastic#131551) ESQL: Allow pruning columns added by InlineJoin (elastic#131204) ESQL: Fail `profile` on text response formats (elastic#128627) ...

jordan-powers added 2 commits July 28, 2025 07:35

Add test

6473997

Interpret nameSize as charcount instead of bytecount

fc1cd67

jordan-powers requested a review from martijnvg July 28, 2025 14:58

jordan-powers self-assigned this Jul 28, 2025

jordan-powers added >bug auto-backport Automatically create backport pull requests when merged Team:StorageEngine :StorageEngine/Mapping The storage related side of mappings v9.2.0 v9.1.1 v8.19.1 v8.18.5 labels Jul 28, 2025

Update docs/changelog/132018.yaml

abd3713

jordan-powers mentioned this pull request Jul 28, 2025

Fix encoding of non-ascii field names in ignored source #131950

Closed

martijnvg approved these changes Jul 28, 2025

View reviewed changes

jordan-powers mentioned this pull request Jul 28, 2025

Use UTF strings in data generation #132028

Merged

jordan-powers enabled auto-merge (squash) July 28, 2025 16:37

jordan-powers merged commit 178c0c9 into elastic:main Jul 28, 2025
33 checks passed

This was referenced Jul 28, 2025

[9.1] Fix decoding of non-ascii field names in ignored source (#132018) #132029

Merged

[8.19] Fix decoding of non-ascii field names in ignored source (#132018) #132030

Merged

jordan-powers mentioned this pull request Jul 28, 2025

[8.18] Fix decoding of non-ascii field names in ignored source (#132018) #132031

Merged

jordan-powers deleted the fix-utf-encoding-ignored-source-2 branch July 28, 2025 16:54

jordan-powers added the v9.0.5 label Jul 28, 2025

jordan-powers mentioned this pull request Jul 28, 2025

[9.0] Fix decoding of non-ascii field names in ignored source (#132018) #132034

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix decoding of non-ascii field names in ignored source #132018

Fix decoding of non-ascii field names in ignored source #132018

Uh oh!

jordan-powers commented Jul 28, 2025

Uh oh!

elasticsearchmachine commented Jul 28, 2025

Uh oh!

elasticsearchmachine commented Jul 28, 2025

Uh oh!

martijnvg left a comment

Uh oh!

Uh oh!

elasticsearchmachine commented Jul 28, 2025

Uh oh!

jordan-powers commented Jul 28, 2025

Uh oh!

Uh oh!

Fix decoding of non-ascii field names in ignored source #132018

Fix decoding of non-ascii field names in ignored source #132018

Uh oh!

Conversation

jordan-powers commented Jul 28, 2025

Uh oh!

elasticsearchmachine commented Jul 28, 2025

Uh oh!

elasticsearchmachine commented Jul 28, 2025

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

elasticsearchmachine commented Jul 28, 2025

💚 Backport successful

Uh oh!

jordan-powers commented Jul 28, 2025

💚 All backports created successfully

Questions ?

Uh oh!

Uh oh!