docs(clp-s): Add documentation for the lossless float representation feature. #1298
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,119 @@ | ||||||||||
# Retaining Floating-Point Format Information | ||||||||||
|
||||||||||
To losslessly retain the string representation of floating-point numbers, we use two encoding
strategies:
* `FormattedFloat`: similar to `DateString`, we store formatting information about a floating point | ||||||||||
number alongside its IEEE-754 binary64 representation. | ||||||||||
* `DictionaryFloat`: we store the full string representation of the floating point number in the | ||||||||||
variable dictionary, and encode numbers as their corresponding variable dictionary IDs. | ||||||||||
|
||||||||||
Generally we prefer storing floating point numbers as `FormattedFloat` over `DictionaryFloat` | ||||||||||
because: | ||||||||||
* We can directly compare against the stored IEEE-754 binary64 float at query time instead of having | ||||||||||
to first parse the string representation of a floating point number. | ||||||||||
* We avoid bloating the variable dictionary with non-repetitive floating point strings. | ||||||||||
|
||||||||||
Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
converted from a binary64 floating-point number. As a result, at parsing time, we check whether a
given floating-point number is representable as a `FormattedFloat`, and if it isn't, we encode it
as a `DictionaryFloat`.
|
||||||||||
## High-Level `FormattedFloat` Specification | ||||||||||
|
||||||||||
Each `FormattedFloat` node contains: | ||||||||||
|
||||||||||
- The double value in IEEE-754 binary64 format. | ||||||||||
- A 2-byte *format* field encoding the output formatting information needed so that, upon
  decompression, the value can be reproduced to match the original text as closely as possible.
  The sub-fields shown below occupy 11 of the 16 bits; the remaining 5 bits are currently reserved.
  Encoders must write them as 0, and decoders must ignore them (treat as “don’t care”) for forward
  compatibility.
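
To make the idea of reproducing the original text concrete, here is a minimal sketch of how a decoder might rebuild a string from the stored double and the already-unpacked format sub-fields (whose encodings are described in the sections below). The function name `reconstruct` and its parameters are hypothetical, and the sketch relies on Python's correctly rounded `format` rather than on anything taken from the actual implementation.

```python
import math


def reconstruct(value: float, sci: int, exp_sign: int, exp_digits: int, num_digits: int) -> str:
    """Rebuild the text of a float from its binary64 value and format sub-fields.

    `exp_digits` and `num_digits` are the stored values, i.e. actual count - 1.
    Hypothetical helper: assumes the fields came from a valid FormattedFloat.
    """
    sign = "-" if math.copysign(1.0, value) < 0 else ""
    # Render the magnitude with exactly num_digits + 1 significant digits.
    mantissa, _, exp = format(abs(value), f".{num_digits}e").partition("e")
    exponent = int(exp)
    if sci & 0b01:  # lowest bit set: scientific notation
        exp_char = "E" if sci & 0b10 else "e"
        sign_char = {0b00: "", 0b01: "+", 0b10: "-"}[exp_sign]
        exp_text = str(abs(exponent)).zfill(exp_digits + 1)
        return sign + mantissa + exp_char + sign_char + exp_text
    # Non-scientific: derive the number of fractional digits from the digit count.
    # For zero, one digit sits before the decimal point and the rest after it.
    frac_digits = num_digits if value == 0 else num_digits - exponent
    return sign + format(abs(value), f".{frac_digits}f")
```

For example, `reconstruct(1.23456789e16, 0b11, 0b00, 1, 9)` rebuilds the trailing zero and the upper-case `E` that a plain double-to-string conversion would lose.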
|
||||||||||
```text
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+
| Scientific Notation Marker (2 bits) | Exponent Sign (2 bits) | Exponent Digits (2 bits) | Digits from First Non-Zero to End of Number (5 bits) |
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+
```
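
As a rough illustration, the four sub-fields could be packed into the 2-byte field as sketched below. The bit positions used here (sub-fields packed from the most significant bit downward in the order drawn above, with the five reserved bits at the low end) are an assumption made for this sketch, as are the names `pack_format` and `unpack_format`; the text above fixes only the field widths.

```python
# Hypothetical bit positions: sub-fields are packed from the most significant
# bit downward in diagram order; the low 5 bits are treated as reserved.
SCI_SHIFT, SIGN_SHIFT, EXP_DIGITS_SHIFT, NUM_DIGITS_SHIFT = 14, 12, 10, 5


def pack_format(sci: int, exp_sign: int, exp_digits: int, num_digits: int) -> int:
    """Pack the four sub-fields into 16 bits; reserved bits are written as 0."""
    if not (0 <= sci < 4 and 0 <= exp_sign < 4 and 0 <= exp_digits < 4):
        raise ValueError("2-bit sub-field out of range")
    if not 0 <= num_digits < 32:
        raise ValueError("5-bit sub-field out of range")
    return (
        (sci << SCI_SHIFT)
        | (exp_sign << SIGN_SHIFT)
        | (exp_digits << EXP_DIGITS_SHIFT)
        | (num_digits << NUM_DIGITS_SHIFT)
    )


def unpack_format(fmt: int) -> tuple[int, int, int, int]:
    """Extract the four sub-fields, ignoring whatever the reserved bits hold."""
    return (
        (fmt >> SCI_SHIFT) & 0b11,
        (fmt >> SIGN_SHIFT) & 0b11,
        (fmt >> EXP_DIGITS_SHIFT) & 0b11,
        (fmt >> NUM_DIGITS_SHIFT) & 0b11111,
    )
```

Masking on the way out means a decoder written this way keeps working even if a future encoder starts setting reserved bits.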
|
||||||||||
To clarify the floating-point formats that `FormattedFloat` can represent, we describe them in text
here:
* For non-scientific numbers we accept:
  * Any number that has at most 16 digits after the first non-zero digit
  * Or at most 1 zero before the decimal point and 16 zeroes after it, if the number is a zero
* For scientific numbers we accept:
  * Single-digit numbers with no decimal point, followed by an exponent
  * Or numbers with **1** digit preceding the decimal point and up to 16 digits following it,
    followed by an exponent
    * Where zero cannot be the digit before the decimal point, unless the number is a zero
  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
    * With at most **4** exponent digits, which must be left-padded with `0`
|
||||||||||
With the added restrictions that: | ||||||||||
* The floating point number follows the JSON grammar for floating point numbers. | ||||||||||
* There exists an IEEE-754 binary64 number for which the string is the closest decimal | ||||||||||
representation at the given precision. | ||||||||||
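
The rules and restrictions above can be sketched as a single predicate. This is only an illustration under stated assumptions: the name `is_formatted_float` is hypothetical, and the last restriction is approximated by parsing the text to a double, re-printing it at the same precision with Python's correctly rounded `format`, and comparing against the input.

```python
import math
import re

# JSON number grammar; a float must carry a fraction and/or an exponent.
_NUMBER_RE = re.compile(
    r"-?(?P<int>0|[1-9][0-9]*)(?:\.(?P<frac>[0-9]+))?(?:[eE](?P<exp>[+-]?[0-9]+))?"
)
MAX_DIGITS = 17      # digits from the first non-zero digit to the end
MAX_EXP_DIGITS = 4


def is_formatted_float(text: str) -> bool:
    """Hypothetical check for whether `text` fits the FormattedFloat rules."""
    m = _NUMBER_RE.fullmatch(text)
    if m is None or (m["frac"] is None and m["exp"] is None):
        return False
    int_part, frac = m["int"], m["frac"] or ""
    value = float(text)
    if math.isinf(value):  # exponent too large for binary64
        return False
    if m["exp"] is not None:
        if len(int_part) != 1 or len(frac) > MAX_DIGITS - 1:
            return False
        if int_part == "0" and frac.strip("0"):  # leading 0 only for zero
            return False
        if len(m["exp"].lstrip("+-")) > MAX_EXP_DIGITS:
            return False
        # Round-trip check: same mantissa digits and same exponent value.
        mantissa, _, exp = format(abs(value), f".{len(frac)}e").partition("e")
        expected = int_part + ("." + frac if frac else "")
        return mantissa == expected and int(exp) == int(m["exp"])
    digits = int_part + frac
    if len(digits.lstrip("0") or digits) > MAX_DIGITS:
        return False
    # Round-trip check at the same number of fractional digits.
    return format(abs(value), f".{len(frac)}f") == text.lstrip("-")
```

A number that fails this check, such as `9007199254740993.0` (which no binary64 value prints as at that precision), would fall back to `DictionaryFloat`.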
|
||||||||||
### Scientific Notation Marker | ||||||||||
|
||||||||||
Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted | ||||||||||
by `E` or `e`. | ||||||||||
|
||||||||||
- `00`: Not scientific | ||||||||||
- `01`: Scientific notation using `e` | ||||||||||
- `11`: Scientific notation using `E` | ||||||||||
|
||||||||||
`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition | ||||||||||
checks cleaner. | ||||||||||
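
A short sketch of how the marker values might be tested in code; the constant and function names are illustrative only and are not taken from the implementation.

```python
# Marker values from the list above; the names are hypothetical.
NOT_SCIENTIFIC = 0b00
SCIENTIFIC_LOWER_E = 0b01
SCIENTIFIC_UPPER_E = 0b11


def is_scientific(marker: int) -> bool:
    """The lowest bit alone tells us whether an exponent is present."""
    return (marker & 0b01) != 0


def exponent_char(marker: int) -> str:
    """For a scientific marker, the upper bit selects `E` over `e`."""
    return "E" if marker & 0b10 else "e"
```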
|
||||||||||
### Exponent Sign | ||||||||||
|
||||||||||
Records whether the exponent has a sign: | ||||||||||
|
||||||||||
- `00`: No sign | ||||||||||
- `01`: `+` | ||||||||||
- `10`: `-` | ||||||||||
|
||||||||||
`11` is unused by the current implementation. | ||||||||||
|
||||||||||
For example, an exponent of `0` may appear as `0`, `+0`, or `-0`, and these two bits record which of
the three forms was used.
|
||||||||||
### Exponent Digits | ||||||||||
|
||||||||||
Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are
both three digits, two bits are enough to represent the digit count.
|
||||||||||
The stored value is **actual digits − 1**, since there is always at least one digit | ||||||||||
(e.g., `00` → 1 digit). The two-bit mapping is: | ||||||||||
|
||||||||||
- `00` → 1 digit | ||||||||||
- `01` → 2 digits | ||||||||||
- `10` → 3 digits | ||||||||||
- `11` → 4 digits | ||||||||||
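
The exponent sign and exponent digit sub-fields can be derived together from the exponent text. The following is a minimal sketch, assuming the exponent has already been split from the mantissa; `encode_exponent_format` and `render_exponent` are hypothetical helper names.

```python
EXP_SIGN_BITS = {"": 0b00, "+": 0b01, "-": 0b10}
EXP_SIGN_CHARS = {bits: char for char, bits in EXP_SIGN_BITS.items()}


def encode_exponent_format(exponent_text: str) -> tuple[int, int]:
    """Map exponent text such as '+05' to (sign bits, digit count - 1)."""
    sign = exponent_text[0] if exponent_text[:1] in ("+", "-") else ""
    digits = exponent_text[len(sign):]
    if not (digits.isascii() and digits.isdigit() and 1 <= len(digits) <= 4):
        raise ValueError(f"unsupported exponent: {exponent_text!r}")
    return EXP_SIGN_BITS[sign], len(digits) - 1


def render_exponent(exponent: int, sign_bits: int, digits_minus_one: int) -> str:
    """Rebuild the exponent text, left-padding the magnitude with zeros."""
    return EXP_SIGN_CHARS[sign_bits] + str(abs(exponent)).zfill(digits_minus_one + 1)
```

Storing the sign separately from the exponent's value is what lets forms like `-0` and `+05` survive a round trip.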
|
||||||||||
### Digits from First Non-Zero to End of Number | ||||||||||
|
||||||||||
This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the decimal point and the exponent). If the value is zero, every digit
is counted. Examples:
|
||||||||||
- `123456789.1234567000` → **19** (from first `1` to last `0`; this exceeds the current 17-digit limit, so such a number would be stored as a `DictionaryFloat`)
- `1.234567890E16` → **10** (from first `1` to `0` before exponent) | ||||||||||
- `0.000000123000` → **6** (from first `1` to last `0`) | ||||||||||
- `0.00` → **3** (counts all zeros for zero value) | ||||||||||
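
The counting rule illustrated by these examples can be sketched as follows; `count_digits` is a hypothetical helper name, and the input is assumed to already follow the JSON number grammar.

```python
import re


def count_digits(text: str) -> int:
    """Count digits from the first non-zero digit to the end of the mantissa.

    The sign, the decimal point, and the exponent are excluded. For a zero
    value there is no non-zero digit, so every digit is counted.
    """
    mantissa = re.split(r"[eE]", text.lstrip("-"))[0]
    digits = mantissa.replace(".", "")
    significant = digits.lstrip("0")
    return len(significant) if significant else len(digits)
```

The value written into the 5-bit field would then be `count_digits(text) - 1`.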
|
||||||||||
Per the [JSON grammar][json_grammar], the integer part of a floating-point number cannot be empty,
so the minimum number of digits is **1**. To take advantage of this fact, we store this field as
**(number of digits from the first non-zero digit to the end of the number) - 1**.
|
||||||||||
In addition, according to IEEE-754, 17 significant decimal digits are enough to represent any
binary64 floating-point number without precision loss. As a result, we currently allow a maximum
of **17 digits** in this field. Unfortunately, even with the minus-one encoding, storing the
maximum encoded value of 16 still requires 5 bits.
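
Both claims in this paragraph can be checked with a few lines of Python. This is a small demonstration rather than part of the format: `0.1 + 0.2` is just one convenient value that needs all 17 digits.

```python
x = 0.1 + 0.2  # a binary64 value whose shortest exact round-trip needs 17 digits

# 17 significant digits always recover the original binary64 value ...
round_trips_with_17 = float(format(x, ".17g")) == x
# ... while 16 digits are not enough for this particular value.
round_trips_with_16 = float(format(x, ".16g")) == x

# The field stores (digit count - 1), so the largest stored value is 16,
# which is 0b10000 and therefore does not fit in 4 bits.
max_stored = 17 - 1
bits_needed = max_stored.bit_length()
```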
|
||||||||||
Because a 5-bit field can hold stored values from 0 to 31, we could support representing binary64
numbers with up to 32 significant digits, and we may choose to do so in the future, but this is
explicitly not supported in the current version of the format. The rationale for not doing so now
is that as the number of digits increases beyond 17, the likelihood that the number corresponds to
a valid IEEE-754 binary64 float decreases.
|
||||||||||
[json_grammar]: https://www.crockford.com/mckeeman.html |