Skip to content

Conversation

JoshRosen
Copy link
Contributor

@JoshRosen JoshRosen commented Sep 3, 2025

TL;DR: sjsonnet string operations (length, indexing, ordering) were operating on UTF-16 code units (i.e. Java / Scala chars) but this is inconsistent with the jsonnet language reference, which defines jsonnet strings as sequences of Unicode codepoints.

As a consequence, sjsonnet returned incorrect results (w.r.t. the go/c++ reference implementations) for several operations involving strings containing Unicode characters with codepoints outside of the Basic Multilingual Plane.

For example, the string "🌍" consists of a single codepoint (\u1F30D, "earth globe europe-africa") and its UTF-16 representation is the surrogate pair \uD83C\uDF0D. In Java, this string has .length == 2 but .codePointCount == 1. Sjsonnet was returning std.length("🌍") == 2 instead of 1. Similar issues affected string indexing, slicing, and sorting.

google/go-jsonnet#600 is a related past issue in go-jsonnet.


This PR aims to holistically update string operations and std functions to operate over codepoints:

  • std.codepoint: now handles non-BMP characters; previously, errored with a complaint about a too-long input string.
  • Slicing and indexing:
    • [] operator
    • std.substr
    • std.length
    • std.findSubstr
  • Iteration:
    • std.stringChars,
    • std.map
    • std.flatMap
  • Comparison / ordering / sorting:
    • binary operations (<, <=, >, >=`)
    • Evaluator.compare
    • Materializer object field sorting
    • std.manfiestTomlEx, std.manifestJson
    • std.objectFields, std.objectFieldsAll

Performance considerations

Since most jsonnet inputs probably contain only ASCII or Latin-1 characters, it's important that we don't regress performance too much.

In some cases, Java's compact string optimizations might hide most of the perf. impact. For example, String.codePointCount has a O(1) fastpath for Latin-1 strings and similar shortcuts exist for codePointAt. The biggest impact is likely to be in sorting / comparison: I didn't notice huge differences in end-to-end benchmarks sorting lots of strings with long common prefixes, so I wager that this comparison performance is unlikely to be a significant issue in real practice.

Design decision: preserving existing handling of unpaired surrogates

There are pre-existing small discrepancies in how c++ and go jsonnet handle unpaired surrogates:

  • Both reject unpaired surrogates in string escapes, whereas sjsonnet permits them.
  • In go-jsonnet, std.char() maps surrogates' codepoints to the replacement character (e.g. std.codepoint(std.char(55296)) == 65533), whereas c++ jsonnet preserves them as unpaired surrogates.

In this PR, I have (somewhat arbitrarily) chosen to preserve sjsonnet's existing permissive behavior and have added new test cases to cover it.

if (int >= v.value.length)
Error.fail(s"string bounds error: $int not within [0, ${v.value.length})", pos)
Val.Str(pos, new String(Array(v.value(int))))
val unicodeLength = v.value.codePointCount(0, v.value.length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract v.value to a local varaiable.

i += 1
while (i < str.length) {
val codePoint = str.codePointAt(i)
chars(charIndex) = Val.Str(pos, Character.toString(codePoint))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a String.toString(codePoint) in java 11 :(, but those two are the same anyway.

case _ =>
val range = start until end by step
new String(range.dropWhile(_ < 0).takeWhile(_ < s.length).map(s).toArray)
val result = new java.lang.StringBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init the StringBuilder's size

if (Character.isSurrogate(c)) {
// Handle surrogate pair
val cp = s.codePointAt(sIdx)
if (rel % step == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use nextInclude += step , codepointIndex==nextInclude to avoid the %, which can be slow

@He-Pin
Copy link
Contributor

He-Pin commented Sep 4, 2025

@JoshRosen, what if we record if the string contains only ASCII or Latin-1 characters after parsing? Would that help the later evaluation?

Due to checks performed at its callers, one of the branches in codePointOffsetsToStringIndices
was unreachable and untested. It's clearer (and likely more performant)
to eliminate this method and inline specialized versions of its logic at
its former callsites.
@JoshRosen JoshRosen marked this pull request as ready for review September 9, 2025 01:44
@JoshRosen
Copy link
Contributor Author

@JoshRosen, what if we record if the string contains only ASCII or Latin-1 characters after parsing? Would that help the later evaluation?

Based on some quick toy benchmarks with test files, I don't think that this would be worth it: the overall perf. hit doesn't seem significant and the affected code probably isn't one of the top bottlenecks in regular evaluation.

@stephenamar-db
Copy link
Collaborator

stephenamar-db commented Sep 10, 2025

I will find time to run benchmarks against one of our pathological cases this week or next.

@stephenamar-db stephenamar-db self-requested a review September 10, 2025 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants