Skip to content

Fix decoding of non-ascii field names in ignored source #132018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jordan-powers
Copy link
Contributor

When encoding an ignored source entry, we use String.length() to get the length of the encoded field name. This will only work when the UTF-8 encoding has only ascii characters, with 1 byte per character.

The solution is to use the actual length of the encoded field name byte[] array.

I added a test that encodes a field in _ignored_source with a random unicode key. Without this fix, it fails with this stack trace:

java.lang.IllegalArgumentException: Can't decode [9c bd f2 8c b0 91 f1 80 ac 9c 53 f0 9e ad af f0 ae a2 b3 f1 9a 86 bd f1 a7 b5 86 f3 ac b0 82 e7 b5 b0 f1 9f 8f b3 f3 a7 b7 94 f2 91 9f bd f1 a0 80 af]
        at __randomizedtesting.SeedInfo.seed([1:80DEBC9A7DDCB0FE]:0)
        at org.elasticsearch.index.mapper.XContentDataHelper.decodeAndWrite(XContentDataHelper.java:109)
        at org.elasticsearch.index.mapper.XContentDataHelper.writeMerged(XContentDataHelper.java:196)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader$FieldWriter$IgnoredSource.writeTo(ObjectMapper.java:1230)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader.write(ObjectMapper.java:1141)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.write(SourceLoader.java:240)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.source(SourceLoader.java:199)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$LeafWithMetrics.source(SourceLoader.java:162)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider$Leaf.getSource(ConcurrentSegmentSourceProvider.java:79)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider.getSource(ConcurrentSegmentSourceProvider.java:58)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:862)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:816)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:62)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:54)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.testIgnoredStringFullUnicode(IgnoredSourceFieldMapperTests.java:134)

This is an alternative solution to the same issue as #131950. I think this solution is better because it can read old data written prior to this change (since this solution only changes the decoding logic, and the encoding logic stays the same).

@jordan-powers jordan-powers requested a review from martijnvg July 28, 2025 14:58
@jordan-powers jordan-powers self-assigned this Jul 28, 2025
@jordan-powers jordan-powers added >bug auto-backport Automatically create backport pull requests when merged Team:StorageEngine :StorageEngine/Mapping The storage related side of mappings v9.2.0 v9.1.1 v8.19.1 v8.18.5 labels Jul 28, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @jordan-powers, I've created a changelog YAML for you.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@jordan-powers jordan-powers enabled auto-merge (squash) July 28, 2025 16:37
@jordan-powers jordan-powers merged commit 178c0c9 into elastic:main Jul 28, 2025
33 checks passed
jordan-powers added a commit to jordan-powers/elasticsearch that referenced this pull request Jul 28, 2025
When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
9.1
8.19
8.18

jordan-powers added a commit to jordan-powers/elasticsearch that referenced this pull request Jul 28, 2025
When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
jordan-powers added a commit to jordan-powers/elasticsearch that referenced this pull request Jul 28, 2025
When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
@jordan-powers jordan-powers deleted the fix-utf-encoding-ignored-source-2 branch July 28, 2025 16:54
elasticsearchmachine pushed a commit that referenced this pull request Jul 28, 2025
…32030)

When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
elasticsearchmachine pushed a commit that referenced this pull request Jul 28, 2025
…32029)

When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
elasticsearchmachine pushed a commit that referenced this pull request Jul 28, 2025
…32031)

When encoding an ignored source entry, we write the string length of the 
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.
@jordan-powers
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
9.0

Questions ?

Please refer to the Backport tool documentation

jordan-powers added a commit to jordan-powers/elasticsearch that referenced this pull request Jul 28, 2025
When encoding an ignored source entry, we write the string length of the
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.

(cherry picked from commit 178c0c9)
jordan-powers added a commit that referenced this pull request Jul 28, 2025
In #132018, we found an issue with non-ascii strings in ignored source. 
This issue was not exposed by the data generation tests because these tests
use ascii-only strings. This patch updates the tests to instead use full
unicode strings by default.
elasticsearchmachine pushed a commit that referenced this pull request Jul 28, 2025
…32034)

When encoding an ignored source entry, we write the string length of the
field name, not the encoded byte count; however, the decode logic treats
this encoded value as the byte length. This patch updates the decode logic
to instead properly treat the value as the string length.

(cherry picked from commit 178c0c9)
szybia added a commit to szybia/elasticsearch that referenced this pull request Jul 29, 2025
…-tracking

* upstream/main: (26 commits)
  Add release notes for v9.1.0 release (elastic#131953)
  Unmute multi_node generative tests (elastic#132021)
  Avoid re-enqueueing merge tasks (elastic#132020)
  Fix file entitlements for shared data dir (elastic#131748)
  ES|QL brute force l2_norm vector function (elastic#132025)
  Make ES|QL SAMPLE not a pipeline breaker (elastic#132014)
  Speed up tail computation in MemorySegmentES91OSQVectorsScorer (elastic#132001)
  Remove deprecated usages in `TransportPutFollowAction` (elastic#132038)
  Simulate impact of shard movement using shard-level write load (elastic#131406)
  Remove RemoteClusterService.getConnections() method (elastic#131948)
  Fix off by one in ValuesBytesRefAggregator (elastic#132032)
  Use unicode strings in data generation by default (elastic#132028)
  Adding index.refresh_interval as a data stream setting (elastic#131482)
  [ES|QL] Add more Min/MaxOverTime CSV tests (elastic#131070)
  Restrict remote ENRICH after FORK (elastic#131945)
  Fix decoding of non-ascii field names in ignored source (elastic#132018)
  [docs] Use centrally maintained version variables (elastic#131939)
  Configurable Inference timeout during Query time (elastic#131551)
  ESQL: Allow pruning columns added by InlineJoin (elastic#131204)
  ESQL: Fail `profile` on text response formats (elastic#128627)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants