Fix encoding of non-ascii field names in ignored source #131950
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine)

Hi @jordan-powers, I've created a changelog YAML for you.
Can you confirm my thinking in the comment I left? Otherwise LGTM.
```diff
@@ -170,7 +170,7 @@ static byte[] encode(NameValue values) {

         byte[] nameBytes = values.name.getBytes(StandardCharsets.UTF_8);
         byte[] bytes = new byte[4 + nameBytes.length + values.value.length];
-        ByteUtils.writeIntLE(values.name.length() + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
+        ByteUtils.writeIntLE(nameBytes.length + PARENT_OFFSET_IN_NAME_OFFSET * values.parentOffset, bytes, 0);
```
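The failure mode behind this diff can be reproduced in isolation: if the length header stores `String.length()` (UTF-16 code units) while the buffer holds UTF-8 bytes, the decoder reads too few bytes and truncates the name. The sketch below is a deliberately simplified stand-in, not Elasticsearch's actual encoder; the helper names are hypothetical and the parent-offset handling from the real code is omitted.

```java
import java.nio.charset.StandardCharsets;

public class EncodeBugDemo {
    // Simplified encoder mirroring the buggy line: the 4-byte little-endian
    // header stores the char count instead of the UTF-8 byte count.
    static byte[] encodeBuggy(String name, byte[] value) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[4 + nameBytes.length + value.length];
        int header = name.length();  // BUG: should be nameBytes.length
        out[0] = (byte) header;
        out[1] = (byte) (header >> 8);
        out[2] = (byte) (header >> 16);
        out[3] = (byte) (header >> 24);
        System.arraycopy(nameBytes, 0, out, 4, nameBytes.length);
        System.arraycopy(value, 0, out, 4 + nameBytes.length, value.length);
        return out;
    }

    // Decoder that trusts the header as a byte count, as the real decode does.
    static String decodeName(byte[] encoded) {
        int nameLen = (encoded[0] & 0xFF)
                | ((encoded[1] & 0xFF) << 8)
                | ((encoded[2] & 0xFF) << 16)
                | ((encoded[3] & 0xFF) << 24);
        return new String(encoded, 4, nameLen, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" is 4 chars but 5 UTF-8 bytes: the decoder reads only 4 bytes,
        // splitting the 2-byte 'é' and producing a truncated, corrupted name.
        System.out.println(decodeName(encodeBuggy("café", new byte[] {1, 2})));

        // ASCII-only names round-trip fine, which is why the bug went unnoticed.
        System.out.println(decodeName(encodeBuggy("field", new byte[] {1, 2})));
    }
}
```

For ASCII names the two counts coincide, so the corruption only surfaces once a field name contains a multi-byte character.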
Just double checking: there is no need for an index version check here, given that decode isn't updated in this change? In other words, indexing new documents into indices with an older index version would not trigger the error described in the PR description.
I realized that I can update the decode to handle the old format. This way we won't lose non-ascii data written in the old format. I've opened another PR: #132018

Closed in favor of the solution in #132018
When encoding an ignored source entry, we use `String.length()` to get the length of the encoded field name. This only works when the UTF-8 encoding contains nothing but ASCII characters, i.e. 1 byte per character. The fix is to use the actual length of the encoded field name `byte[]` array.
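The mismatch can be demonstrated directly in Java; the string values below are arbitrary examples, not taken from the PR.

```java
import java.nio.charset.StandardCharsets;

public class FieldNameLength {
    public static void main(String[] args) {
        String ascii = "field";
        String nonAscii = "café";  // 'é' encodes to two UTF-8 bytes

        // For ASCII-only names the char count and the byte count agree...
        System.out.println(ascii.length());                                 // 5
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);  // 5

        // ...but for non-ASCII names String.length() undercounts the bytes.
        System.out.println(nonAscii.length());                                 // 4
        System.out.println(nonAscii.getBytes(StandardCharsets.UTF_8).length); // 5
    }
}
```

`String.length()` counts UTF-16 code units, not encoded bytes, so any field name outside the ASCII range makes the stored length header smaller than the data actually written.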
I added a test that encodes a field in `_ignored_source` with a random Unicode key. Without this fix, it fails with this stack trace: