-
Notifications
You must be signed in to change notification settings - Fork 62
Fix handling of non-BMP Unicode strings (jsonnet strings should operate over codepoints, not UTF-16 code units) #500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
if (int >= v.value.length) | ||
Error.fail(s"string bounds error: $int not within [0, ${v.value.length})", pos) | ||
Val.Str(pos, new String(Array(v.value(int)))) | ||
val unicodeLength = v.value.codePointCount(0, v.value.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract v.value
to a local varaiable.
i += 1 | ||
while (i < str.length) { | ||
val codePoint = str.codePointAt(i) | ||
chars(charIndex) = Val.Str(pos, Character.toString(codePoint)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a String.toString(codePoint)
in java 11 :(, but those two are the same anyway.
sjsonnet/src/sjsonnet/Util.scala
Outdated
case _ => | ||
val range = start until end by step | ||
new String(range.dropWhile(_ < 0).takeWhile(_ < s.length).map(s).toArray) | ||
val result = new java.lang.StringBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
init the StringBuilder's size
sjsonnet/src/sjsonnet/Util.scala
Outdated
if (Character.isSurrogate(c)) { | ||
// Handle surrogate pair | ||
val cp = s.codePointAt(sIdx) | ||
if (rel % step == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use nextInclude += step , codepointIndex==nextInclude
to avoid the %
, which can be slow
@JoshRosen, what if we record if the string contains |
Due to checks performed at its callers, one of the branches in codePointOffsetsToStringIndices was unreachable and untested. It's clearer (and likely more performant) to eliminate this method and inline specialized versions of its logic at its former callsites.
e344d47
to
60454bc
Compare
Based on some quick toy benchmarks with test files, I don't think that this would be worth it: the overall perf. hit doesn't seem significant and the affected code probably isn't one of the top bottlenecks in regular evaluation. |
I will find time to run benchmarks against one of our pathological cases this week or next. |
TL;DR: sjsonnet string operations (length, indexing, ordering) were operating on UTF-16 code units (i.e. Java / Scala chars) but this is inconsistent with the jsonnet language reference, which defines jsonnet strings as sequences of Unicode codepoints.
As a consequence, sjsonnet returned incorrect results (w.r.t. the go/c++ reference implementations) for several operations involving strings containing Unicode characters with codepoints outside of the Basic Multilingual Plane.
For example, the string "🌍" consists of a single codepoint (
\u1F30D
, "earth globe europe-africa") and its UTF-16 representation is the surrogate pair\uD83C\uDF0D
. In Java, this string has.length == 2
but.codePointCount == 1
. Sjsonnet was returningstd.length("🌍") == 2
instead of1
. Similar issues affected string indexing, slicing, and sorting.google/go-jsonnet#600 is a related past issue in go-jsonnet.
This PR aims to holistically update string operations and
std
functions to operate over codepoints:std.codepoint
: now handles non-BMP characters; previously, errored with a complaint about a too-long input string.[]
operatorstd.substr
std.length
std.findSubstr
std.stringChars
,std.map
std.flatMap
<
, <=,
>,
>=`)Evaluator.compare
std.manfiestTomlEx
,std.manifestJson
std.objectFields
,std.objectFieldsAll
Performance considerations
Since most jsonnet inputs probably contain only ASCII or Latin-1 characters, it's important that we don't regress performance too much.
In some cases, Java's compact string optimizations might hide most of the perf. impact. For example,
String.codePointCount
has a O(1) fastpath for Latin-1 strings and similar shortcuts exist forcodePointAt
. The biggest impact is likely to be in sorting / comparison: I didn't notice huge differences in end-to-end benchmarks sorting lots of strings with long common prefixes, so I wager that this comparison performance is unlikely to be a significant issue in real practice.Design decision: preserving existing handling of unpaired surrogates
There are pre-existing small discrepancies in how c++ and go jsonnet handle unpaired surrogates:
std.char()
maps surrogates' codepoints to the replacement character (e.g.std.codepoint(std.char(55296)) == 65533
), whereas c++ jsonnet preserves them as unpaired surrogates.In this PR, I have (somewhat arbitrarily) chosen to preserve sjsonnet's existing permissive behavior and have added new test cases to cover it.