Open
Changes from 25 commits
27 commits
150497b
Add documentation for the lossless float representation feature.
gibber9809 Sep 10, 2025
f0c4036
Small edits
gibber9809 Sep 10, 2025
359f295
Apply rabbit comments
gibber9809 Sep 10, 2025
5b89380
Apply suggestions from code review
gibber9809 Sep 11, 2025
8a0bb5f
Address review comments
gibber9809 Sep 11, 2025
cc49e38
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 11, 2025
6297adc
Describe fields in float format using list with links
gibber9809 Sep 19, 2025
049ed6f
Change casing for all section headers
gibber9809 Sep 19, 2025
585ab66
Add a lot of motivation
gibber9809 Sep 19, 2025
aa008af
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 19, 2025
b4d624f
Address rabbit comments
gibber9809 Sep 19, 2025
304919e
Some edits
gibber9809 Sep 19, 2025
df4e32b
More edits
gibber9809 Sep 19, 2025
1b76e6c
Apply one more rabbit comment
gibber9809 Sep 19, 2025
288ce21
Apply suggestions from code review
gibber9809 Sep 22, 2025
03f2d38
Apply suggestions from code review
gibber9809 Sep 22, 2025
b864ab5
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
a945bd2
Improve presentation of json number grammar
gibber9809 Sep 22, 2025
e16c73d
Clarify decription for encoding the number of digits
gibber9809 Sep 22, 2025
c1a4903
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 22, 2025
9dbeb9d
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
4909490
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 22, 2025
3a46b3a
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
0c85740
Update docs to reflect no-retain-float-format argument change
gibber9809 Sep 23, 2025
c098515
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 23, 2025
da06d72
Address rabbit comment
gibber9809 Sep 23, 2025
016d8e4
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
226 changes: 226 additions & 0 deletions docs/src/dev-docs/design-retain-float-format.md
@@ -0,0 +1,226 @@
# Lossless JSON float storage

Since our goal is to losslessly retain floating-point numbers that come from JSON input, it is worth
taking a look at what kinds of floating-point numbers can appear in JSON.

The [JSON specification][json_spec] treats fields that match the following grammar as number values:

🧹 Nitpick

Optionally cite ECMA‑404 alongside RFC 8259.

Adds the concise syntax reference many implementers expect.

Apply this diff:

-The [JSON specification][json_spec] treats fields that match the following grammar as number values:
+The [JSON specification][json_spec] (see also [ECMA‑404][ecma404]) treats fields that match the
+following grammar as number values:

And append this reference:

 [json_spec]: https://datatracker.ietf.org/doc/html/rfc8259
 [ieee754]: https://ieeexplore.ieee.org/document/4610935/
+[ecma404]: https://www.ecma-international.org/publications-and-standards/standards/ecma-404/

Also applies to: 225-227

🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 6 (and also
apply the same change to lines 225-227), the JSON spec reference is incomplete;
add an additional citation to ECMA-404 alongside RFC 8259 by appending a brief
parenthetical or bracketed reference to ECMA-404 after the existing RFC 8259
mention so the sentence reads as referencing both RFC 8259 and ECMA-404;
replicate the same edit at lines 225-227 to ensure both places cite ECMA-404.


```text
number = [ minus ] int [ frac ] [ exp ]
int = zero | ( digit1-9 digit* )
frac = decimal-point digit+
exp = e [ minus | plus ] digit+
digit = zero | digit1-9
digit1-9 = '1'-'9'
zero = '0'
e = 'e' | 'E'
decimal-point = '.'
minus = '-'
plus = '+'
```

For our purposes, floating-point numbers are numbers which match this grammar and have either a
fraction, an exponent, or both.

Note that this restricts which floating‑point numbers are allowed in a few ways:

- NaN and +/- Infinity are not allowed
- The exponent must contain at least one digit
- The fractional part of a number must contain at least one digit
- The integer part of a number can only start with '0' if the entire integer part is '0'
- The integer part of a number must contain at least one digit
- Positive numbers cannot begin with an explicit '+'
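
As a rough illustration of the grammar and the restrictions above, the following sketch transcribes the rules into a Python regular expression. The names `JSON_NUMBER` and `is_json_float` are hypothetical and are not taken from the clp-s code base.

```python
import re

# Regex transcription of the JSON number grammar; group names mirror the rule names.
# [0-9] is used instead of \d so that only ASCII digits are accepted.
JSON_NUMBER = re.compile(
    r"(?P<minus>-)?"
    r"(?P<int>0|[1-9][0-9]*)"
    r"(?P<frac>\.[0-9]+)?"
    r"(?P<exp>[eE][+-]?[0-9]+)?"
)


def is_json_float(text: str) -> bool:
    """Return True if text is a JSON number with a fraction, an exponent, or both."""
    match = JSON_NUMBER.fullmatch(text)
    if match is None:
        return False
    return match.group("frac") is not None or match.group("exp") is not None
```

With this sketch, `16.0` and `1e5` are classified as floating-point numbers, `16` is a valid number but not a float, and `NaN`, `+1.0`, `1.`, `.5`, `01.5`, and `1e` are all rejected.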

There are two important points to note about this grammar:
1. It doesn't place any restrictions on how a given floating-point number _should_ be written
2. It doesn't dictate whether or not floating-point numbers have to correspond to values from a
standard such as IEEE-754 binary64.

Since point 1 is less abstract than point 2, we'll explain it first. Say we're trying to represent
the number `16` as a floating-point number using **3** digits of precision. We might write it as:

- `16.0`; or
- `16.0e0`; or
- `1.60e1`

Note that for scientific-notation representations of a number, we can shift the decimal point and
adjust the exponent to represent the same number in infinitely many ways. For example:

- `0.160e2`; and
- `0.00000000160e10`

are both valid representations of `16` using 3 significant digits.

Likewise, we can come up with infinitely many representations of `16` (at different precisions) by
choosing to represent arbitrarily many significant digits.
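
A quick way to see this, sketched here in Python rather than in the project's own code, is to parse several spellings and observe that they all collapse to a single binary64 value. The last entry in the list is an extra spelling added for illustration.

```python
# Different JSON spellings of the same number.
spellings = [
    "16.0",
    "16.0e0",
    "1.60e1",
    "0.160e2",
    "0.00000000160e10",
    "16.000000",  # same value written with more digits of precision
]

# Parsing discards the formatting: every spelling yields the same double.
values = {float(s) for s in spellings}
print(values)
```

Once the strings are parsed, nothing in the resulting value records which spelling was used, which is why formatting information has to be stored separately.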

Point 2 is a bit more abstract, but it is important for understanding our approach to losslessly
storing floating-point numbers.

It's most easily demonstrated with an example. Of the numbers:

- `1.2345678901234567`
- `1.2345678901234568`
- `1.2345678901234570`

we know that only the first number corresponds to an IEEE-754 binary64 floating-point number.
The reason we can tell is that the [IEEE-754 specification][ieee754] requires that when converting a
floating-point number to a decimal string at a given precision, the decimal string must correspond
to the nearest decimal representation. Likewise, when converting a decimal string to a
floating-point number, the standard requires that the number be converted to the nearest
floating-point representation. If you use any standards-compliant implementation to turn
`1.2345678901234568` into a binary64 floating-point number, and back to a decimal string, you will
find that it has been rounded to `1.2345678901234567`.
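
The round trip described above can be tried directly. This is a minimal sketch that assumes the input has exactly one integer digit and 16 fractional digits, so that Python's `.16f` format prints it at the same precision. The helper name `round_trips_at_17_digits` is hypothetical.

```python
def round_trips_at_17_digits(text: str) -> bool:
    """Parse text to the nearest binary64, print it with 16 fractional digits, and compare.

    Only meaningful for inputs shaped like d.dddddddddddddddd (17 significant digits).
    """
    return format(float(text), ".16f") == text


for candidate in ("1.2345678901234567", "1.2345678901234568"):
    print(candidate, "->", format(float(candidate), ".16f"))
```

A string that fails this kind of check cannot be reproduced from a binary64 value plus a precision, so it has to be kept as text instead.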

Overall the implications here are that:
- For any given number there are many possible representations (infinitely many in fact); and
- Not all floating-point numbers that are valid JSON correspond to values from a standard like
IEEE-754

In practice though, we know that most of the time we should be dealing with very standard
machine-generated data. This means that most inputs _do_ correspond to IEEE-754 binary64
floating-point numbers, and that of the infinitely many ways of representing a number only a few
will be common.

Our approach, then, is to store most floating-point numbers as an IEEE-754 binary64 floating-point
number alongside some formatting information, falling back to storing the number as a string when
that doesn't work.

## Retaining floating-point format information

To losslessly retain the string representation of floating-point numbers we use two encoding
strategies:

- `FormattedFloat`: similar to `DateString`, we store formatting information about a floating-point
number alongside its IEEE-754 binary64 representation.
- `DictionaryFloat`: we store the full string representation of the floating-point number in the
variable dictionary, and encode numbers as their corresponding variable dictionary IDs.

Generally we prefer storing floating-point numbers as `FormattedFloat` over `DictionaryFloat`
because:

- `FormattedFloat` allows us to directly compare against the stored IEEE-754 binary64 float at query
time instead of having to first parse the string representation of a floating-point number.
- Fewer `DictionaryFloat` encodings means that we can avoid bloating the variable dictionary with
non-repetitive floating-point strings.

Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
converted from a binary64 floating-point number. As a result, at parsing time, we check if a given
floating-point number is representable as a `FormattedFloat`, and if it isn't, we encode it as a
`DictionaryFloat`.

## High-level `FormattedFloat` specification

Each `FormattedFloat` node contains:

- The double value in IEEE-754 binary64 format.
- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
that, upon decompression, the value can be decompressed exactly to the original text.

Comment on lines +120 to +123

🧹 Nitpick

Wording: “reconstructed exactly” vs “decompressed exactly”

We’re reconstructing the original text; suggest clearer terminology.

-- A 2-byte little-endian _format_ field encoding the necessary output formatting information so
-  that, upon decompression, the value can be decompressed exactly to the original text.
+- A 2‑byte little‑endian _format_ field encoding the necessary output formatting information so
+  that, upon decompression, the original text can be reconstructed exactly.
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 124 to 127, the
wording uses "decompressed exactly" which is ambiguous; replace it with
"reconstructed exactly" (or "reconstructed to the original text") for clarity,
updating the sentence to say the 2-byte little-endian format field encodes
output formatting info so that, upon decompression, the value can be
reconstructed exactly to the original text.

Note that the unused lowest 5 bits of the 2‑byte field are currently reserved. Encoders must write
them as 0, and decoders must ignore them (treat as “don’t care”) for forward compatibility.

From MSB to LSB, the 2-byte format field contains the following sections:

- [Scientific notation marker](#scientific-notation-marker) (2 bits)
- [Exponent sign](#exponent-sign) (2 bits)
- [Exponent digits](#exponent-digits) (2 bits)
- [Digits from first non-zero to end of number](#digits-from-first-non-zero-to-end-of-number) (5 bits)
- Reserved for future use (5 bits)
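
To make the layout concrete, here is a sketch of packing and unpacking the field in Python. The shift amounts follow from reading the list above from MSB to LSB. The function names and the tuple-based interface are assumptions made for illustration, not the clp-s implementation.

```python
def pack_format(sci_marker: int, exp_sign: int, exp_digits: int, digit_count: int) -> int:
    """Pack the four sections into a 16-bit field; the 5 reserved low bits stay 0."""
    for value, width in ((sci_marker, 2), (exp_sign, 2), (exp_digits, 2), (digit_count, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
    return (sci_marker << 14) | (exp_sign << 12) | (exp_digits << 10) | (digit_count << 5)


def unpack_format(field: int):
    """Return (sci_marker, exp_sign, exp_digits, digit_count); reserved bits are ignored."""
    return (
        (field >> 14) & 0b11,
        (field >> 12) & 0b11,
        (field >> 10) & 0b11,
        (field >> 5) & 0b11111,
    )


# "1.234567890E16": 'E' marker (11), no exponent sign (00),
# 2 exponent digits (stored as 01), 10 digits (stored as 9).
field = pack_format(0b11, 0b00, 0b01, 9)
serialized = field.to_bytes(2, "little")
```

Masking on the way out, rather than checking that the reserved bits are zero, matches the rule that decoders must ignore those bits.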

The floating-point formats that `FormattedFloat` can represent are described below:

- Numbers not written in scientific notation are accepted if **either** of the following is true:
  1. The number is non-zero and has at most **17** digits, counting from the first non-zero digit
     to the end of the number.
  2. The number is a zero written with at most **one** zero before the decimal point and at most
     **16** zeros after the decimal point.
- Numbers written in scientific notation are accepted if all **four** of the following are true:
  1. The significand is either
     * a **single** digit with no decimal point, followed by an exponent; or
     * **one** digit before the decimal point and up to **16** digits after the decimal point,
       followed by an exponent.
  2. The digit before the decimal point is not zero unless every digit of the significand is zero.
  3. The exponent uses `e` or `E`, optionally followed by `+` or `-`.
  4. The exponent has at most **4** digits (left-padding with `0` is allowed).

With the added restrictions that:

- The floating-point number follows the JSON grammar for floating-point numbers.
- There exists an IEEE-754 binary64 number for which the string is the closest decimal
representation at the given precision.

These restrictions really correspond to "canonical" representations of floating-point numbers with
up to 17 digits of precision. This means that our formatting scheme can always represent numbers
produced by format specifiers such as `%f`, `%e`, and `%g`, so long as they don't use too many
digits of precision, and the underlying number isn't NaN, or +/- Infinity.
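
The syntactic part of these rules can be sketched as a pair of regular expressions. This is an illustrative Python sketch only: `fits_formatted_float_syntax` is a hypothetical name, and the check deliberately ignores the second restriction (that the string is the closest decimal representation of some binary64 value), which needs a separate round-trip test.

```python
import re

# Fixed notation: JSON integer part, decimal point, one or more fractional digits.
FIXED = re.compile(r"-?(0|[1-9][0-9]*)\.([0-9]+)")
# Scientific notation: one leading digit, up to 16 fractional digits, 1-4 exponent digits.
SCIENTIFIC = re.compile(r"-?([0-9])(?:\.([0-9]{1,16}))?[eE][+-]?[0-9]{1,4}")


def fits_formatted_float_syntax(text: str) -> bool:
    """Check only the shape rules listed above, not the binary64 round trip."""
    match = FIXED.fullmatch(text)
    if match:
        digits = match.group(1) + match.group(2)
        significant = digits.lstrip("0")
        if significant:
            # Non-zero: at most 17 digits from the first non-zero digit onwards.
            return len(significant) <= 17
        # Zero: the grammar already limits the integer part to a single '0'.
        return len(match.group(2)) <= 16
    match = SCIENTIFIC.fullmatch(text)
    if match:
        lead, frac = match.group(1), match.group(2) or ""
        # A leading zero is only allowed when the whole significand is zero.
        return lead != "0" or set(frac) <= {"0"}
    return False
```

Under this sketch, the output of `"%e" % 16.0`, `"%f" % 16.0`, and `"%g" % 1e-5` is accepted, while `0.160e2`, `16.0e0`, and `123456789.1234567000` are rejected.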

### Scientific notation marker

Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted
by `E` or `e`.

- `00`: Not scientific
- `01`: Scientific notation using `e`
- `11`: Scientific notation using `E`

`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
checks cleaner.
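
A short Python sketch of why leaving `10` unused is convenient: with the three values above, testing the lowest bit is enough to tell scientific from non-scientific notation. The constant and function names are hypothetical.

```python
NOT_SCIENTIFIC = 0b00
SCIENTIFIC_LOWER_E = 0b01
SCIENTIFIC_UPPER_E = 0b11


def is_scientific(marker: int) -> bool:
    """The lowest bit is set for both scientific-notation markers."""
    return bool(marker & 0b01)


def exponent_char(marker: int) -> str:
    """Return the exponent character recorded by a scientific-notation marker."""
    if marker == SCIENTIFIC_LOWER_E:
        return "e"
    if marker == SCIENTIFIC_UPPER_E:
        return "E"
    raise ValueError(f"marker {marker:02b} does not denote scientific notation")
```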

### Exponent sign

Records whether the exponent has a sign:

- `00`: No sign
- `01`: `+`
- `10`: `-`

`11` is unused by the current implementation.

For example, an exponent of `0` may appear as `0`, `+0`, or `-0`, and these two bits record which
form was used.

### Exponent digits

Since the maximum and minimum decimal exponents for a double (`308` and `-324` respectively) are
both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
support exponents left-padded with `0`.

The stored value is **one less than the number of actual digits**, since there is always at least
one digit (e.g., `00` → 1 digit). The two-bit mapping is:

- `00` → 1 digit
- `01` → 2 digits
- `10` → 3 digits
- `11` → 4 digits
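
Taken together with the exponent sign bits, the encoding of an exponent can be sketched as follows. The input is the text that follows `e` or `E`. The name `encode_exponent` and its return shape are assumptions for illustration.

```python
EXP_SIGN_BITS = {"": 0b00, "+": 0b01, "-": 0b10}


def encode_exponent(exponent: str):
    """Return (sign_bits, digit_bits) for exponent text such as '16', '+05', or '-0324'."""
    sign = exponent[0] if exponent and exponent[0] in "+-" else ""
    digits = exponent[len(sign):]
    if not (digits.isascii() and digits.isdigit()) or not 1 <= len(digits) <= 4:
        raise ValueError(f"unsupported exponent: {exponent!r}")
    # The digit count is stored as one less than the actual count.
    return EXP_SIGN_BITS[sign], len(digits) - 1
```

For instance, `16` encodes to `(0b00, 0b01)` and `-0324` to `(0b10, 0b11)`, while a five-digit exponent is rejected and would have to fall back to `DictionaryFloat`.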

### Digits from first non-zero to end of number

This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the exponent). Examples:

Comment on lines +202 to +204

🧹 Nitpick

Explicitly state the decimal point is excluded from the count.

Removes a small ambiguity in implementer interpretation.

Apply this diff:

-This counts the digits from the first non-zero digit up to the last digit of the integer or
-fractional part (excluding the exponent). Examples:
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 202 to 204, the
description of how digits are counted is ambiguous about whether the decimal
point is counted; update the sentence to explicitly state that the decimal point
is excluded from the count. Modify the wording so it clearly reads that digits
are counted from the first non-zero digit to the last digit of the integer or
fractional part, excluding both the exponent and the decimal point, and add a
short clarifying example if desired.

- `123456789.1234567000` → **19** (from first `1` to last `0`)
- `1.234567890E16` → **10** (from first `1` to `0` before exponent)
- `0.000000123000` → **6** (from first `1` to last `0`)
- `0.00` → **3** (counts all zeros for zero value)

Per the [JSON specification][json_spec], the integer part of a floating-point number cannot be
empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
as **one less than the actual number of digits from the first non-zero digit to the end of the
number**. The numeric value zero has no non-zero digit, so for it we count every digit and store
**one less than that count**.

As well, according to IEEE-754, only 17 decimal significant digits are needed to represent all
binary64 floating-point numbers without precision loss. As a result, we currently allow a maximum of
**17 digits**. Because the stored value is **digits - 1** the maximum encoded value is 16, which
requires 5 bits.
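
The counting rule and the stored value can be sketched in a few lines of Python. The sketch assumes its input already matches the JSON float grammar, and the function names are hypothetical.

```python
def count_digits(text: str) -> int:
    """Count digits from the first non-zero digit to the end of the significand.

    The sign, the decimal point, and the exponent are excluded. For a zero value,
    every digit is counted.
    """
    significand = text.lstrip("-")
    for marker in ("e", "E"):
        significand = significand.split(marker)[0]
    digits = significand.replace(".", "")
    significant = digits.lstrip("0")
    return len(significant) if significant else len(digits)


def encode_digit_count(text: str) -> int:
    """Return the 5-bit stored value (count - 1), or raise if more than 17 digits."""
    count = count_digits(text)
    if not 1 <= count <= 17:
        raise ValueError(f"{text!r} has {count} digits; at most 17 are supported")
    return count - 1
```

Applied to the examples above, `count_digits` gives 19, 10, 6, and 3 respectively. The first of those exceeds the 17-digit limit, so `encode_digit_count` raises for it.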

We could support representing binary64 numbers with up to 32 significant digits, and we may choose
to do so in the future, but this is explicitly not supported in the current version of the format.
The rationale for not doing so now is that the likelihood that a number corresponds to a valid
IEEE-754 binary64 float decreases as the number of digits increases beyond 17.

[json_spec]: https://datatracker.ietf.org/doc/html/rfc8259
[ieee754]: https://ieeexplore.ieee.org/document/4610935/
1 change: 1 addition & 0 deletions docs/src/dev-docs/index.md
@@ -99,4 +99,5 @@ design-project-structure
design-kv-ir-streams/index
design-metadata-db
design-parsing-wildcard-queries
design-retain-float-format
:::
6 changes: 6 additions & 0 deletions docs/src/user-docs/core-clp-s.md
@@ -24,6 +24,8 @@ Usage:
where `size` is the total size of the dictionaries and encoded messages in an archive.
* This option acts as a soft limit on memory usage for compression, decompression, and search.
* This option significantly affects compression ratio.
* `--no-retain-float-format` specifies that clp-s should not store extra information to
losslessly decompress floats.
* `--structurize-arrays` specifies that arrays should be fully parsed and array entries should be
encoded into dedicated columns.
* `--auth <s3|none>` specifies the authentication method that should be used for network requests
@@ -152,6 +154,10 @@ compressed data:**
* The order of log events is not preserved.
* The input directory structure is not preserved and during decompression all files are written to
the same file.
* For floating-point numbers:
* KV-IR inputs currently don't support preserving the original printed float formats.
* Comparisons against floating-point numbers at query time treat each stored number as if it were
the nearest representable double-precision value, which can potentially lose precision.
* In addition, there are a few limitations, related to querying arrays, described in the search
syntax [reference](reference-json-search-syntax).
