Fix handling of non-BMP Unicode strings (jsonnet strings should operate over codepoints, not UTF-16 code units) #500

JoshRosen · 2025-09-03T19:31:32Z

TL;DR: sjsonnet string operations (length, indexing, ordering) were operating on UTF-16 code units (i.e. Java / Scala chars) but this is inconsistent with the jsonnet language reference, which defines jsonnet strings as sequences of Unicode codepoints.

As a consequence, sjsonnet returned incorrect results (w.r.t. the go/c++ reference implementations) for several operations involving strings containing Unicode characters with codepoints outside of the Basic Multilingual Plane.

For example, the string "🌍" consists of a single codepoint (U+1F30D, "earth globe europe-africa") and its UTF-16 representation is the surrogate pair \uD83C\uDF0D. In Java, this string has .length() == 2 but .codePointCount(0, length) == 1. Sjsonnet was returning std.length("🌍") == 2 instead of 1. Similar issues affected string indexing, slicing, and sorting.
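To make the discrepancy concrete, here is a minimal sketch of the difference between counting UTF-16 code units and counting codepoints on the JVM. The object name and the `codePointLength` helper are made up for illustration and are not taken from the sjsonnet sources.

```scala
object CodePointLengthDemo {
  // Hypothetical helper: counts Unicode codepoints instead of UTF-16 code units.
  def codePointLength(s: String): Int = s.codePointCount(0, s.length)

  def main(args: Array[String]): Unit = {
    val globe = "\uD83C\uDF0D" // U+1F30D written as its surrogate pair
    println(globe.length)                     // 2 (UTF-16 code units)
    println(codePointLength(globe))           // 1 (codepoints)
    println(globe.codePointAt(0).toHexString) // 1f30d
  }
}
```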

google/go-jsonnet#600 is a related past issue in go-jsonnet.

This PR aims to holistically update string operations and std functions to operate over codepoints:

std.codepoint: now handles non-BMP characters; previously, it errored with a complaint about a too-long input string.
Slicing and indexing:
- [] operator
- std.substr
- std.length
- std.findSubstr
Iteration:
- std.stringChars
- std.map
- std.flatMap
Comparison / ordering / sorting:
- binary operations (<, <=, >, >=)
- Evaluator.compare
- Materializer object field sorting
- std.manifestTomlEx, std.manifestJson
- std.objectFields, std.objectFieldsAll
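Ordering is the least obvious item in this list, because UTF-16 code unit order and codepoint order disagree whenever a BMP character above the surrogate range is compared with a non-BMP character. The following is a rough sketch of a codepoint-wise comparator, not the code in this PR; `compareByCodePoint` is a hypothetical name.

```scala
object CodePointCompareDemo {
  // Hypothetical comparator: orders strings by Unicode codepoint rather than
  // by UTF-16 code unit (which is what String.compareTo does).
  def compareByCodePoint(a: String, b: String): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val ca = a.codePointAt(i)
      val cb = b.codePointAt(i)
      if (ca != cb) return Integer.compare(ca, cb)
      // Equal codepoints occupy the same number of chars in both strings,
      // so a single index is enough.
      i += Character.charCount(ca)
    }
    // One string is a prefix of the other: the shorter one sorts first.
    Integer.compare(a.length, b.length)
  }
}
```

For instance, U+FF5E sorts after "🌍" under `String.compareTo` (0xFF5E > 0xD83C) but before it under the comparator above (0xFF5E < 0x1F30D).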

Performance considerations

Since most jsonnet inputs probably contain only ASCII or Latin-1 characters, it's important that we don't regress performance too much.

In some cases, Java's compact string optimizations might hide most of the performance impact. For example, String.codePointCount has an O(1) fast path for Latin-1 strings, and similar shortcuts exist for codePointAt. The biggest impact is likely to be in sorting / comparison: I didn't notice huge differences in end-to-end benchmarks sorting lots of strings with long common prefixes, so I wager that this comparison performance is unlikely to be a significant issue in real practice.

Design decision: preserving existing handling of unpaired surrogates

There are pre-existing small discrepancies in how c++ and go jsonnet handle unpaired surrogates:

- Both reject unpaired surrogates in string escapes, whereas sjsonnet permits them.
- In go-jsonnet, std.char() maps surrogates' codepoints to the replacement character (e.g. std.codepoint(std.char(55296)) == 65533), whereas c++ jsonnet preserves them as unpaired surrogates.

In this PR, I have (somewhat arbitrarily) chosen to preserve sjsonnet's existing permissive behavior and have added new test cases to cover it.
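For context, the JVM's codepoint APIs already tolerate unpaired surrogates, which is part of why a permissive approach is easy to keep. The sketch below only demonstrates the behavior of `String.codePointAt` and `String.codePointCount`; it is not sjsonnet's std.char or std.codepoint implementation, and the names are hypothetical.

```scala
object UnpairedSurrogateDemo {
  // 55296 == 0xD800: a high surrogate with no low surrogate after it.
  val lone: String = 55296.toChar.toString

  // Hypothetical helper: decodes a string into its codepoints. An unpaired
  // surrogate is reported as its own codepoint rather than being replaced.
  def codePoints(s: String): List[Int] = {
    val out = List.newBuilder[Int]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)
      out += cp
      i += Character.charCount(cp)
    }
    out.result()
  }
}
```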

sjsonnet/src/sjsonnet/Util.scala

He-Pin · 2025-09-04T05:21:31Z

sjsonnet/src/sjsonnet/Evaluator.scala

-        if (int >= v.value.length)
-          Error.fail(s"string bounds error: $int not within [0, ${v.value.length})", pos)
-        Val.Str(pos, new String(Array(v.value(int))))
+        val unicodeLength = v.value.codePointCount(0, v.value.length)


Extract v.value into a local variable.
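A minimal sketch of what codepoint-based indexing with a local variable might look like, assuming a plain String in place of the evaluator's Val.Str and an exception in place of Error.fail. The function name `codePointCharAt` is hypothetical.

```scala
object CodePointIndexDemo {
  // Hypothetical stand-in for the evaluator's string indexing branch.
  def codePointCharAt(value: String, index: Int): String = {
    // `value` is read once into a local (here, the parameter) and reused below.
    val unicodeLength = value.codePointCount(0, value.length)
    if (index < 0 || index >= unicodeLength)
      throw new IndexOutOfBoundsException(
        s"string bounds error: $index not within [0, $unicodeLength)"
      )
    // Translate the codepoint offset into a UTF-16 index, then read one codepoint.
    val start = value.offsetByCodePoints(0, index)
    new String(Character.toChars(value.codePointAt(start)))
  }
}
```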

He-Pin · 2025-09-04T05:31:07Z

sjsonnet/src/sjsonnet/Std.scala

-      i += 1
+    while (i < str.length) {
+      val codePoint = str.codePointAt(i)
+      chars(charIndex) = Val.Str(pos, Character.toString(codePoint))


There is a Character.toString(codePoint) overload in Java 11 :(, but those two are the same anyway.
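For reference, a rough sketch of codepoint-wise iteration that does not depend on the Java 11 `Character.toString(int)` overload. It swaps in `new String(Character.toChars(codePoint))`, which is available on older JDKs, and returns plain Strings instead of Val.Str values; the function name mirrors std.stringChars but the code is illustrative only.

```scala
object StringCharsDemo {
  // Hypothetical sketch: splits a string into one-codepoint strings.
  def stringChars(str: String): Array[String] = {
    val chars = new Array[String](str.codePointCount(0, str.length))
    var i = 0         // index into UTF-16 code units
    var charIndex = 0 // index into the output array (codepoints)
    while (i < str.length) {
      val codePoint = str.codePointAt(i)
      // Pre-Java-11 alternative to Character.toString(codePoint).
      chars(charIndex) = new String(Character.toChars(codePoint))
      i += Character.charCount(codePoint)
      charIndex += 1
    }
    chars
  }
}
```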

He-Pin · 2025-09-04T05:36:23Z

sjsonnet/src/sjsonnet/Util.scala

        case _ =>
-          val range = start until end by step
-          new String(range.dropWhile(_ < 0).takeWhile(_ < s.length).map(s).toArray)
+          val result = new java.lang.StringBuilder()


Initialize the StringBuilder with an explicit capacity.

He-Pin · 2025-09-04T05:40:23Z

sjsonnet/src/sjsonnet/Util.scala

+            if (Character.isSurrogate(c)) {
+              // Handle surrogate pair
+              val cp = s.codePointAt(sIdx)
+              if (rel % step == 0) {


Use nextInclude += step and check codepointIndex == nextInclude to avoid the %, which can be slow.
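The two suggestions above (a pre-sized builder and a running `nextInclude` counter instead of a modulus) could be combined roughly as follows. This is a sketch under simplifying assumptions: non-negative `start`, positive `step`, and a hypothetical function name. It is not the code that landed in Util.scala.

```scala
object CodePointSliceDemo {
  // Hypothetical sketch: takes every `step`-th codepoint in [start, end).
  def sliceByCodePoint(s: String, start: Int, end: Int, step: Int): String = {
    require(start >= 0 && step > 0, "sketch assumes start >= 0 and step > 0")
    // s.length is an upper bound on the result size, so the builder never grows.
    val result = new java.lang.StringBuilder(s.length)
    var sIdx = 0                  // index into UTF-16 code units
    var codepointIndex = 0        // index into codepoints
    var nextInclude: Long = start // Long so that `+= step` cannot overflow
    while (sIdx < s.length && codepointIndex < end) {
      val cp = s.codePointAt(sIdx)
      if (codepointIndex == nextInclude) {
        result.appendCodePoint(cp)
        nextInclude += step // replaces the `rel % step == 0` check
      }
      sIdx += Character.charCount(cp)
      codepointIndex += 1
    }
    result.toString
  }
}
```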

He-Pin · 2025-09-04T07:01:18Z

@JoshRosen, what if we record if the string contains only ASCII or Latin-1 characters after parsing? Would that help the later evaluation?

Due to checks performed at its callers, one of the branches in codePointOffsetsToStringIndices was unreachable and untested. It's clearer (and likely more performant) to eliminate this method and inline specialized versions of its logic at its former callsites.

JoshRosen · 2025-09-09T01:46:55Z

> @JoshRosen, what if we record if the string contains only ASCII or Latin-1 characters after parsing? Would that help the later evaluation?

Based on some quick toy benchmarks with test files, I don't think that this would be worth it: the overall perf. hit doesn't seem significant and the affected code probably isn't one of the top bottlenecks in regular evaluation.

stephenamar-db · 2025-09-10T21:37:15Z

I will find time to run benchmarks against one of our pathological cases this week or next.

JoshRosen added 8 commits September 3, 2025 10:52

Add regression tests, plus test of current unpaired surrogate behaviors.

8096716

Fix std.length

4e01cf7

Fix codepoint.

e01705a

Fix slicing, indexing, substr

742220a

Fix stringChars (and, by extension, map).

21050ea

Fix string comparison / ordering

0571d6e

Use Java 11 Character.toString(codepoint: int); minor import cleanup

f48762e

Fixes to flatMap and findSubstr (WIP)

1d461dd

JoshRosen commented Sep 3, 2025 on sjsonnet/src/sjsonnet/Util.scala (resolved)

scalafmt

2d60994

He-Pin reviewed Sep 4, 2025


He-Pin mentioned this pull request Sep 4, 2025

chore: Rewrite escape1 to unicodeEscape #502

Open

JoshRosen added 5 commits September 8, 2025 16:34

Extract v.value into a local

9497d57

Pre-size string builder and avoid modulus.

603aed9

scalafmt

733b6e9

Test cleanups.

8d5a95f

JoshRosen force-pushed the unicode-fixes branch from e344d47 to 60454bc Compare September 9, 2025 01:28

JoshRosen marked this pull request as ready for review September 9, 2025 01:44

stephenamar-db self-requested a review September 10, 2025 21:37
