Fix encoding of non-ascii field names in ignored source #131950


Conversation

Contributor

@jordan-powers jordan-powers commented Jul 25, 2025

When encoding an ignored source entry, we use String.length() to compute the length prefix of the encoded field name. This is only correct when the name is pure ASCII: String.length() counts UTF-16 code units, while the name is stored as UTF-8 bytes, and any non-ASCII character encodes to more than one byte.

The solution is to use the actual length of the encoded field name byte[] array.
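The mismatch is easy to reproduce in isolation: String.length() counts UTF-16 code units, while getBytes(StandardCharsets.UTF_8).length counts encoded bytes. A minimal sketch (the field names here are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        String ascii = "field";
        String unicode = "fiëld"; // 'ë' needs 2 bytes in UTF-8

        // For pure ASCII the two lengths agree...
        System.out.println(ascii.length());                                  // 5
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);   // 5

        // ...but they diverge as soon as a character needs more than one byte.
        System.out.println(unicode.length());                                // 5
        System.out.println(unicode.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```

Supplementary characters (like those in the stack trace below, which encode to 4 UTF-8 bytes each) widen the gap further, since they also occupy two UTF-16 code units.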

I added a test that encodes a field in _ignored_source with a random unicode key. Without this fix, it fails with this stack trace:

java.lang.IllegalArgumentException: Can't decode [9c bd f2 8c b0 91 f1 80 ac 9c 53 f0 9e ad af f0 ae a2 b3 f1 9a 86 bd f1 a7 b5 86 f3 ac b0 82 e7 b5 b0 f1 9f 8f b3 f3 a7 b7 94 f2 91 9f bd f1 a0 80 af]
        at __randomizedtesting.SeedInfo.seed([1:80DEBC9A7DDCB0FE]:0)
        at org.elasticsearch.index.mapper.XContentDataHelper.decodeAndWrite(XContentDataHelper.java:109)
        at org.elasticsearch.index.mapper.XContentDataHelper.writeMerged(XContentDataHelper.java:196)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader$FieldWriter$IgnoredSource.writeTo(ObjectMapper.java:1230)
        at org.elasticsearch.index.mapper.ObjectMapper$SyntheticSourceFieldLoader.write(ObjectMapper.java:1141)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.write(SourceLoader.java:240)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$SyntheticLeaf.source(SourceLoader.java:199)
        at org.elasticsearch.index.mapper.SourceLoader$Synthetic$LeafWithMetrics.source(SourceLoader.java:162)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider$Leaf.getSource(ConcurrentSegmentSourceProvider.java:79)
        at org.elasticsearch.search.lookup.ConcurrentSegmentSourceProvider.getSource(ConcurrentSegmentSourceProvider.java:58)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:862)
        at org.elasticsearch.index.mapper.MapperServiceTestCase.syntheticSource(MapperServiceTestCase.java:816)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:62)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.getSyntheticSourceWithFieldLimit(IgnoredSourceFieldMapperTests.java:54)
        at org.elasticsearch.index.mapper.IgnoredSourceFieldMapperTests.testIgnoredStringFullUnicode(IgnoredSourceFieldMapperTests.java:134)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @jordan-powers, I've created a changelog YAML for you.

@jordan-powers jordan-powers added v9.1.1 v8.19.1 v9.0.5 v8.18.5 auto-backport Automatically create backport pull requests when merged labels Jul 25, 2025
Member

@martijnvg martijnvg left a comment


Can you confirm my thinking in the comment I left? Otherwise LGTM.

@@ -170,7 +170,7 @@ static byte[] encode(NameValue values) {

byte[] nameBytes = values.name.getBytes(StandardCharsets.UTF_8);
byte[] bytes = new byte[4 + nameBytes.length + values.value.length];
-        ByteUtils.writeIntLE(values.name.length() + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
+        ByteUtils.writeIntLE(nameBytes.length + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
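For context, here is a self-contained sketch of the fixed encode path, using java.nio.ByteBuffer in place of the ByteUtils helper; the PARENT_OFFSET_IN_NAME_OFFSET value below is a made-up stand-in (the real constant lives elsewhere and may differ):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class EncodeSketch {
    // Hypothetical shift factor, standing in for the real PARENT_OFFSET_IN_NAME_OFFSET.
    static final int PARENT_OFFSET_IN_NAME_OFFSET = 1 << 16;

    static byte[] encode(String name, int parentOffset, byte[] value) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + nameBytes.length + value.length)
            .order(ByteOrder.LITTLE_ENDIAN);
        // The length prefix must be the UTF-8 byte count, not name.length():
        // the decoder uses it to slice the name back out of the byte array.
        buf.putInt(nameBytes.length + PARENT_OFFSET_IN_NAME_OFFSET * parentOffset);
        buf.put(nameBytes);
        buf.put(value);
        return buf.array();
    }
}
```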
Member


Just double checking: there is no need for an index version check here, given that decode isn't updated in this change. In other words, indexing new documents into indices with an older index version would not trigger the error described in the PR description.

@jordan-powers
Contributor Author

I realized that I can update decode to handle the old format. That way we won't lose non-ASCII data written in the old format. I've opened another PR: #132018

@jordan-powers
Contributor Author

Closed in favor of the solution in #132018

@jordan-powers jordan-powers deleted the fix-utf-encoding-ignored-source branch July 28, 2025 16:54
Labels
auto-backport Automatically create backport pull requests when merged >bug :StorageEngine/Mapping The storage related side of mappings Team:StorageEngine v8.18.5 v8.19.1 v9.0.5 v9.1.1 v9.2.0

3 participants