docs(clp-s): Add documentation for the lossless float representation feature. #1298
# What does it take to losslessly retain JSON floating-point numbers?

Since our goal is to losslessly retain floating-point numbers that come from JSON input, it is worth
taking a look at what kinds of floating-point numbers can appear in JSON.

The [JSON specification][json_spec] treats fields matching the following grammar as number values:

```text
number = [ minus ] int [ frac ] [ exp ]
decimal-point = '.'
digit1-9 = '1'-'9'
e = 'e' | 'E'
exp = e [ minus | plus ] ( digit1-9 | zero )+
frac = decimal-point ( digit1-9 | zero )+
int = zero | ( digit1-9 ( digit1-9 | zero )* )
minus = '-'
plus = '+'
zero = '0'
```

For our purposes, floating-point numbers are numbers which match this grammar and have either a
fraction, an exponent, or both.

Note that this grammar restricts which floating-point numbers are allowed in a few ways:
- NaN and +/- Infinity are not allowed
- The exponent must contain at least one digit
- The fractional part of a number must contain at least one digit
- The integer part of a number can only start with '0' if the entire integer part is '0'
- The integer part of a number must contain at least one digit
- Positive numbers cannot begin with an explicit '+'
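
The grammar and the restrictions above can be checked mechanically. The following is a minimal Python sketch, not code taken from clp-s: the regular expression is a transcription of the grammar, and the helper name `is_json_float` is an assumption introduced here for illustration.

```python
import re

# Transcription of the grammar: number = [ minus ] int [ frac ] [ exp ]
JSON_NUMBER = re.compile(
    r"-?"                         # [ minus ]
    r"(?:0|[1-9][0-9]*)"          # int: a lone zero, or digits without a leading zero
    r"(?P<frac>\.[0-9]+)?"        # [ frac ]: decimal point plus at least one digit
    r"(?P<exp>[eE][+-]?[0-9]+)?"  # [ exp ]: marker, optional sign, at least one digit
)


def is_json_float(text: str) -> bool:
    """Return True if `text` is a JSON number with a fraction, an exponent, or both."""
    match = JSON_NUMBER.fullmatch(text)
    if match is None:
        return False
    return match.group("frac") is not None or match.group("exp") is not None
```

With this helper, `1.5`, `-0.0`, and `1e10` are accepted, while `16` (an integer), `+1.0`, `01.5`, `.5`, `1.`, `1e`, and `NaN` are rejected.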

What the grammar doesn't do is place any restrictions on how a given floating-point number should
be written, or on whether the floating-point numbers have to correspond to values from a standard
such as IEEE-754 binary64.

Since the first point is less abstract, we'll explain it first, with an example. Say we're trying to
represent the number `16` as a floating-point number using **3** digits of precision. We might write
it as:
- `16.0`; or
- `16.0e0`; or
- `1.60e1`

Note that for scientific-notation representations of a number we can shift the decimal point and
change the exponent arbitrarily to represent the same number in infinitely many ways. For example:
- `0.160e2`; and
- `0.00000000160e10`

are both valid representations of `16` using 3 significant digits.

Likewise, we can come up with infinitely many representations of `16` by choosing to represent
arbitrarily many significant digits.
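
As a quick check, the following Python sketch parses the representations of `16` discussed above and collects the distinct values they denote. It relies only on the built-in `float` constructor; the last entry, with extra trailing zeros, is an additional example added here.

```python
# Textual representations of the number 16 from the examples above, plus one
# with many additional significant digits.
representations = [
    "16.0",
    "16.0e0",
    "1.60e1",
    "0.160e2",
    "0.00000000160e10",
    "16.000000000000000000",
]

# Collect the distinct binary64 values that these strings parse to.
values = {float(text) for text in representations}
print(values)
```

The set contains a single value, so the parsed number alone cannot tell us which of the original strings to reproduce.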

The point about whether the original values correspond to IEEE-754 is a bit abstract, but is
important for understanding our approach for losslessly storing floating-point numbers.

It is probably easiest to show an example. Of the numbers:
- `1.2345678901234567`
- `1.2345678901234568`
- `1.2345678901234569`

we know that only the first and third numbers correspond to IEEE-754 binary64 floating-point numbers.
The reason we can tell is that the [IEEE-754 specification][ieee754] requires that when converting a
floating-point number to a decimal string at a given precision, the decimal string must be the
nearest decimal representation at that precision. Likewise, when converting a decimal string to a
floating-point number, the standard requires that the string be converted to the nearest
floating-point value. If you use any standards-compliant implementation to turn
`1.2345678901234568` into a floating-point number, and back to a decimal string at the same
precision, you will find that it has been rounded to `1.2345678901234567`.
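
The round trip described above can be tried directly. This is a small Python sketch that assumes an interpreter whose string-to-float and float-to-string conversions are correctly rounded, as CPython's are on common platforms; the helper name `round_trip` is hypothetical.

```python
def round_trip(text: str) -> str:
    """Parse `text` as binary64, then print it with the same number of fractional digits."""
    fractional_digits = len(text.split(".")[1])
    return f"{float(text):.{fractional_digits}f}"


for text in ["1.2345678901234567", "1.2345678901234568"]:
    print(text, "->", round_trip(text))
```

The first string survives the round trip unchanged, while the second comes back as `1.2345678901234567`, matching the rounding described above.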

Overall the implications here are that:
- For any given number there are many possible representations (infinitely many in fact); and
- Not all floating-point numbers that are valid JSON correspond to values from a standard like
  IEEE-754

In practice though, we expect to deal mostly with standard machine-generated data. This means that
most inputs _do_ correspond to IEEE-754 binary64 floating-point numbers, and that of the infinitely
many ways of representing a number only a few will be common.

Our approach then is to store most floating-point numbers as an IEEE-754 binary64 floating-point
number alongside some formatting information, with the fallback of storing the number as a string
when that doesn't work.

# Retaining floating-point format information

To losslessly retain the string representation of floating-point numbers we use two encoding
strategies:
* `FormattedFloat`: similar to `DateString`, we store formatting information about a floating-point
  number alongside its IEEE-754 binary64 representation.
* `DictionaryFloat`: we store the full string representation of the floating-point number in the
  variable dictionary, and encode numbers as their corresponding variable dictionary IDs.

Generally we prefer storing floating-point numbers as `FormattedFloat` over `DictionaryFloat`
because:
* We can directly compare against the stored IEEE-754 binary64 float at query time instead of having
  to first parse the string representation of a floating-point number.
* We avoid bloating the variable dictionary with non-repetitive floating-point strings.

Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
converted from a binary64 floating-point number. As a result, at parsing time, we check if a given
floating-point number is representable as a `FormattedFloat`, and if it isn't, we encode it as a
`DictionaryFloat`.

## High-level `FormattedFloat` specification

Each `FormattedFloat` node contains:

- The double value in IEEE-754 binary64 format.
- A 2-byte little-endian *format* field encoding the necessary output formatting information so
  that, upon decompression, the value can be reproduced exactly as the original text.

Note that the lowest 5 bits of the 2-byte field are currently unused and reserved: encoders must
write them as 0, and decoders must ignore them (treat as “don’t care”) for forward compatibility.

From MSB to LSB, the 2-byte format field contains the following sections:
- [Scientific notation marker](#scientific-notation-marker) (2 bits)
- [Exponent sign](#exponent-sign) (2 bits)
- [Exponent digits](#exponent-digits) (2 bits)
- [Digits from first non-zero to end of number](#digits-from-first-non-zero-to-end-of-number) (5 bits)
- Reserved for future use (5 bits)
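
To make the bit layout concrete, here is a minimal Python sketch that packs and unpacks the 16-bit format field in the MSB-to-LSB order listed above. The function names `pack_format` and `unpack_format` and the use of plain integers for the sub-field codes are assumptions made for illustration; the actual clp-s implementation may be structured differently.

```python
# Field widths from MSB to LSB, as listed above.
EXPONENT_SIGN_BITS = 2
EXPONENT_DIGITS_BITS = 2
NUM_DIGITS_BITS = 5
RESERVED_BITS = 5


def pack_format(scientific: int, exponent_sign: int, exponent_digits: int, num_digits: int) -> int:
    """Pack the four sub-field codes into a 16-bit integer; reserved bits are written as 0."""
    value = scientific
    value = (value << EXPONENT_SIGN_BITS) | exponent_sign
    value = (value << EXPONENT_DIGITS_BITS) | exponent_digits
    value = (value << NUM_DIGITS_BITS) | num_digits
    return value << RESERVED_BITS


def unpack_format(value: int) -> tuple[int, int, int, int]:
    """Split a 16-bit format field into its sub-field codes, ignoring the reserved bits."""
    value >>= RESERVED_BITS
    num_digits = value & 0b11111
    value >>= NUM_DIGITS_BITS
    exponent_digits = value & 0b11
    value >>= EXPONENT_DIGITS_BITS
    exponent_sign = value & 0b11
    value >>= EXPONENT_SIGN_BITS
    scientific = value & 0b11
    return scientific, exponent_sign, exponent_digits, num_digits


# Example: `1.60e+01` uses `e` (0b01), a `+` sign (0b01), 2 exponent digits (stored as 1),
# and 3 digits from the first non-zero digit onwards (stored as 2).
field = pack_format(0b01, 0b01, 0b01, 2)
on_disk = field.to_bytes(2, "little")  # the field is stored as 2 little-endian bytes
```

Because `unpack_format` shifts the reserved bits away before reading anything else, a decoder written this way ignores whatever a future encoder places in them.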

To clarify the floating-point formats that `FormattedFloat` can represent, we describe them in text
here:
* For non-scientific numbers we accept:
  * Any number that has at most 16 digits after the first non-zero digit
  * Or at most 1 zero before the decimal point and 16 zeroes after the decimal point, if the number
    is a zero
* For scientific numbers we accept:
  * Single-digit numbers with no decimal point, followed by an exponent
  * Or numbers with **1** digit preceding the decimal point and up to 16 digits following the
    decimal point, followed by an exponent
  * Where zero cannot be the digit before the decimal point, unless every digit in the number is
    zero
  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
  * With at most **4** exponent digits, which can be left-padded with `0`

With the added restrictions that:
* The floating-point number follows the JSON grammar for floating-point numbers.
* There exists an IEEE-754 binary64 number for which the string is the closest decimal
  representation at the given precision.
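
One way to read the rules above is as a round-trip test: parse the text into a binary64 value, print that value back at the same precision, and accept the text only if the output matches. The Python sketch below follows that reading. It is not the clp-s implementation: the patterns, the function names, and the use of Python's correctly rounded formatting as the "closest decimal representation" check are all assumptions, as is the choice to reject a zero written with a non-zero exponent (such as `0e5`), which the text above does not address.

```python
import math
import re

MAX_DIGITS = 17  # 1 leading digit plus at most 16 following digits

# Hypothetical patterns for the two accepted shapes.
PLAIN = re.compile(r"-?(?P<int>0|[1-9][0-9]*)\.(?P<frac>[0-9]+)")
SCIENTIFIC = re.compile(
    r"(?P<mantissa>-?[0-9](?:\.(?P<frac>[0-9]{1,16}))?)[eE](?P<exp>[+-]?[0-9]{1,4})"
)


def is_plain_representable(text: str) -> bool:
    match = PLAIN.fullmatch(text)
    if match is None:
        return False
    digits = match.group("int") + match.group("frac")
    # Skip leading zeros, except for a zero value, where every digit counts.
    significant = digits.lstrip("0") or digits
    if len(significant) > MAX_DIGITS:
        return False
    # Printing the parsed double at the same precision must give back the text.
    precision = len(match.group("frac"))
    return f"{float(text):.{precision}f}" == text


def is_scientific_representable(text: str) -> bool:
    match = SCIENTIFIC.fullmatch(text)
    if match is None:
        return False
    value = float(text)
    if not math.isfinite(value):
        return False
    precision = len(match.group("frac") or "")
    # Python prints e.g. "1.60e+01"; compare the mantissa text and the exponent value.
    mantissa, exponent = f"{value:.{precision}e}".split("e")
    return mantissa == match.group("mantissa") and int(exponent) == int(match.group("exp"))


def is_formatted_float_candidate(text: str) -> bool:
    return is_plain_representable(text) or is_scientific_representable(text)
```

Under these assumptions, `16.0`, `1.60e1`, and `1E+0007` pass, whereas `0.160e2` (leading zero in a non-zero number), `16.0e0` (two digits before the decimal point), and `1.2345678901234568` (not the closest decimal to any binary64 value) would fall back to `DictionaryFloat`.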

These restrictions really correspond to "canonical" representations of floating-point numbers with
up to 17 digits of precision. This means that our formatting scheme can always represent numbers
produced by format specifiers such as `%f`, `%e`, and `%g`, so long as they don't produce more than
17 significant digits, and the underlying number isn't NaN or +/- Infinity.

### Scientific notation marker

Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted
by `E` or `e`.

- `00`: Not scientific
- `01`: Scientific notation using `e`
- `11`: Scientific notation using `E`

`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
checks cleaner.

### Exponent sign

Records whether the exponent has a sign:

- `00`: No sign
- `01`: `+`
- `10`: `-`

`11` is unused by the current implementation.

For example, exponents of `0` may appear as `0`, `+0`, or `-0`, and these two bits can record the
format correctly.

### Exponent digits

Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, both
have three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
support exponents left-padded with `0`.

The stored value is **actual digits − 1**, since there is always at least one digit
(e.g., `00` → 1 digit). The two-bit mapping is:

- `00` → 1 digit
- `01` → 2 digits
- `10` → 3 digits
- `11` → 4 digits
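
The three exponent-related sub-fields (scientific notation marker, exponent sign, and exponent digits) can all be derived from the exponent text of a number. The following Python sketch shows one way to do so; the name `encode_exponent` and the lookup tables are hypothetical and only mirror the bit values listed in the sections above.

```python
import re

# Exponent text: marker, optional sign, then 1 to 4 digits.
EXPONENT = re.compile(r"(?P<marker>[eE])(?P<sign>[+-]?)(?P<digits>[0-9]{1,4})")

SCIENTIFIC_MARKER = {"e": 0b01, "E": 0b11}
EXPONENT_SIGN = {"": 0b00, "+": 0b01, "-": 0b10}


def encode_exponent(exponent_text: str) -> tuple[int, int, int]:
    """Return (marker, sign, digits) codes for exponent text such as `E-07`."""
    match = EXPONENT.fullmatch(exponent_text)
    if match is None:
        raise ValueError(f"not a supported exponent: {exponent_text!r}")
    digit_count = len(match.group("digits"))
    return (
        SCIENTIFIC_MARKER[match.group("marker")],
        EXPONENT_SIGN[match.group("sign")],
        digit_count - 1,  # stored as actual digits - 1
    )
```

For instance, `E+007` yields the codes `0b11`, `0b01`, and `0b10`: an uppercase marker, an explicit `+`, and three exponent digits.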

### Digits from first non-zero to end of number

This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the decimal point and the exponent). Examples:

- `123456789.1234567000` → **19** (from first `1` to last `0`)
- `1.234567890E16` → **10** (from first `1` to `0` before exponent)
- `0.000000123000` → **6** (from first `1` to last `0`)
- `0.00` → **3** (counts all zeros for zero value)

Per the [JSON specification][json_spec], the integer part of a floating-point number cannot be
empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
as **number of digits from the first non-zero digit to the end of the number − 1**; for the numeric
value zero we store **total number of digits − 1**.

In addition, according to IEEE-754, only 17 significant decimal digits are needed to represent all
binary64 floating-point numbers without precision loss. As a result, we currently allow a maximum of
**17 digits**. Because the stored value is **digits − 1**, the maximum encoded value is 16, which
requires 5 bits.
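
The counting rule and the stored value can be expressed in a few lines. This Python sketch uses hypothetical helper names (`count_digits`, `encode_digit_count`) and assumes the input already matches the JSON number grammar.

```python
MAX_DIGITS = 17  # maximum digit count supported by the current format


def count_digits(text: str) -> int:
    """Count digits from the first non-zero digit to the end of the significand."""
    # Drop the exponent, if any, then keep only the digit characters
    # (this also drops the sign and the decimal point).
    significand = text.replace("E", "e").split("e")[0]
    digits = "".join(ch for ch in significand if ch.isdigit())
    # For a zero value every digit is counted; otherwise leading zeros are skipped.
    significant = digits.lstrip("0")
    return len(significant) if significant else len(digits)


def encode_digit_count(text: str) -> int:
    """Return the value stored in the 5-bit field: the digit count minus one."""
    count = count_digits(text)
    if not 1 <= count <= MAX_DIGITS:
        raise ValueError("digit count outside the range supported by the current format")
    return count - 1
```

Applied to the examples above, `count_digits` returns 19, 10, 6, and 3 respectively. The largest value `encode_digit_count` can return is 16, which fits in 5 bits, while the 19-digit example is rejected.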

We could support representing binary64 numbers with up to 32 significant digits, and we may choose
to do so in the future, but this is explicitly not supported in the current version of the format.
The rationale for not doing so now is that as the number of digits increases beyond 17, the
likelihood that the number corresponds to a valid IEEE-754 binary64 float decreases.

[json_spec]: https://datatracker.ietf.org/doc/html/rfc7159
[ieee754]: https://ieeexplore.ieee.org/document/4610935/