Changes from 10 commits
Commits
Show all changes
27 commits
150497b
Add documentation for the lossless float representation feature.
gibber9809 Sep 10, 2025
f0c4036
Small edits
gibber9809 Sep 10, 2025
359f295
Apply rabbit comments
gibber9809 Sep 10, 2025
5b89380
Apply suggestions from code review
gibber9809 Sep 11, 2025
8a0bb5f
Address review comments
gibber9809 Sep 11, 2025
cc49e38
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 11, 2025
6297adc
Describe fields in float format using list with links
gibber9809 Sep 19, 2025
049ed6f
Change casing for all section headers
gibber9809 Sep 19, 2025
585ab66
Add a lot of motivation
gibber9809 Sep 19, 2025
aa008af
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 19, 2025
b4d624f
Address rabbit comments
gibber9809 Sep 19, 2025
304919e
Some edits
gibber9809 Sep 19, 2025
df4e32b
More edits
gibber9809 Sep 19, 2025
1b76e6c
Apply one more rabbit comment
gibber9809 Sep 19, 2025
288ce21
Apply suggestions from code review
gibber9809 Sep 22, 2025
03f2d38
Apply suggestions from code review
gibber9809 Sep 22, 2025
b864ab5
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
a945bd2
Improve presentation of json number grammar
gibber9809 Sep 22, 2025
e16c73d
Clarify decription for encoding the number of digits
gibber9809 Sep 22, 2025
c1a4903
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 22, 2025
9dbeb9d
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
4909490
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 22, 2025
3a46b3a
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
0c85740
Update docs to reflect no-retain-float-format argument change
gibber9809 Sep 23, 2025
c098515
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 23, 2025
da06d72
Address rabbit comment
gibber9809 Sep 23, 2025
016d8e4
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
216 changes: 216 additions & 0 deletions docs/src/dev-docs/design-retain-float-format.md
@@ -0,0 +1,216 @@
# What does it take to losslessly retain JSON floating-point numbers?

Since our goal is to losslessly retain floating-point numbers that come from JSON input, it is worth
taking a look at what kinds of floating-point numbers can appear in JSON.

The [JSON specification][json_spec] treats fields matching the following grammar as number values:
```text
number = [ minus ] int [ frac ] [ exp ]
decimal-point = '.'
digit1-9 = '1'-'9'
e = 'e' | 'E'
exp = e [ minus | plus ] ( digit1-9 | zero )+
frac = decimal-point ( digit1-9 | zero )+
int = zero | ( digit1-9 ( digit1-9 | zero )* )
minus = '-'
plus = '+'
zero = '0'
```

For our purposes, floating-point numbers are numbers which match this grammar and have either a
fraction, an exponent, or both.

Note that this restricts which floating-point numbers are allowed in a few ways:
- NaN and +/- Infinity are not allowed
- The exponent must contain at least one digit
- The fractional part of a number must contain at least one digit
- The integer part of a number can only start with '0' if the entire integer part is '0'
- The integer part of a number must contain at least one digit
- Positive numbers cannot begin with an explicit '+'
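
As a rough illustration of the grammar and the restrictions listed above, the following sketch checks whether a string is a JSON floating-point number. The names `JSON_NUMBER` and `is_json_float` are made up for this example and are not part of clp-s.

```python
import re

# Regular expression mirroring the JSON number grammar quoted above.
# [0-9] is used instead of \d so that only ASCII digits match.
JSON_NUMBER = re.compile(
    r"-?(?:0|[1-9][0-9]*)"        # [ minus ] int
    r"(?P<frac>\.[0-9]+)?"        # [ frac ]
    r"(?P<exp>[eE][+-]?[0-9]+)?"  # [ exp ]
)


def is_json_float(text: str) -> bool:
    """Return True if text is a JSON number with a fraction, an exponent, or both."""
    match = JSON_NUMBER.fullmatch(text)
    if match is None:
        return False
    return match.group("frac") is not None or match.group("exp") is not None


print(is_json_float("16.0"))   # valid: has a fraction
print(is_json_float("+16.0"))  # invalid: explicit '+'
print(is_json_float("16"))     # valid JSON number, but an integer
```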

What it doesn't do is place any restrictions on how a given floating-point number should be written,
or whether the floating-point numbers have to correspond to values from a standard such as IEEE-754
binary64.

Since the first point is less abstract, we'll explain it first, with an example. Say we're trying to
represent the number `16` as a floating point number using **3** digits of precision. We might write
it as:
- `16.0`; or
- `16.0e0`; or
- `1.60e1`
Note that for scientific-notation representations of a number we can shift the decimal point and
adjust the exponent arbitrarily to represent the same number in infinitely many ways. For example:
- `0.160e2`; and
- `0.00000000160e10`
are both valid representations of `16` using 3 significant digits.

Likewise, we can come up with infinitely many representations of `16` by choosing to write
arbitrarily many significant digits.
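
To make this concrete, here is a small sketch showing that several distinct spellings all parse to the same double. It assumes Python's `float`, which is an IEEE-754 binary64 value on common platforms; the last spelling is an extra example with more significant digits.

```python
# Several different JSON spellings of the same value.
spellings = ["16.0", "16.0e0", "1.60e1", "0.160e2", "0.00000000160e10", "16.000000"]

# Parsing discards the formatting: every spelling yields the same double.
values = {float(s) for s in spellings}
print(values)
```

Storing only the parsed double would therefore make it impossible to tell which spelling appeared in the input.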

The point about whether the original values correspond to IEEE-754 is a bit abstract, but is
important for understanding our approach for losslessly storing floating-point numbers.

It is probably easiest to show an example. Of the numbers:
- `1.2345678901234567`
- `1.2345678901234568`
- `1.2345678901234570`
we know that only the first and third number correspond to IEEE-754 binary64 floating-point numbers.
The reason we can tell is that the [IEEE-754 specification][ieee754] requires that when converting a
floating-point number to a decimal string at a given precision, the decimal string must correspond
to the nearest decimal representation. Likewise, when converting a decimal string to a
floating-point number, the standard requires that the number be converted to the nearest
floating-point representation. If you use any standards-compliant implementation to turn
`1.2345678901234568` into a floating-point number, and back to a decimal string, you will find that
it has been rounded to `1.2345678901234567`.
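
The round trip described above can be sketched as follows. This assumes Python's `float` is an IEEE-754 binary64 value and relies on Python's correctly rounded parsing and formatting; `round_trips` is a hypothetical helper that only handles non-scientific input.

```python
def round_trips(text: str) -> bool:
    """Check whether re-printing float(text) with the same number of
    fractional digits reproduces text exactly (non-scientific input only)."""
    frac_digits = len(text.split(".")[1])
    return format(float(text), f".{frac_digits}f") == text


print(round_trips("1.2345678901234567"))
print(round_trips("1.2345678901234568"))
print(format(float("1.2345678901234568"), ".16f"))
```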

Overall the implications here are that:
- For any given number there are many possible representations (infinitely many in fact); and
- Not all floating-point numbers that are valid JSON correspond to values from a standard like
IEEE-754

In practice though, we know that most of the time we should be dealing with very standard
machine-generated data. This means that most inputs _do_ correspond to IEEE-754 binary64
floating-point numbers in practice, and that of the infinitely many ways of representing a number
only a few will be common.

Our approach then is to store most floating-point numbers as an IEEE-754 binary64 floating-point
number alongside some formatting information, with the fallback of storing the number as a string
when that doesn't work.

# Retaining floating-point format information

To losslessly retain the string representation of floating point numbers we use two encoding
strategies:
* `FormattedFloat`: similar to `DateString`, we store formatting information about a floating point
number alongside its IEEE-754 binary64 representation.
* `DictionaryFloat`: we store the full string representation of the floating point number in the
variable dictionary, and encode numbers as their corresponding variable dictionary IDs.

Generally we prefer storing floating point numbers as `FormattedFloat` over `DictionaryFloat`
because:
* We can directly compare against the stored IEEE-754 binary64 float at query time instead of having
to first parse the string representation of a floating point number.
* We avoid bloating the variable dictionary with non-repetitive floating point strings.

Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
converted from a binary64 floating point number. As a result, at parsing time, we check if a given
floating point number is representable as a `FormattedFloat`, and if it isn't, we encode it as a
`DictionaryFloat`.

## High-level `FormattedFloat` specification

Each `FormattedFloat` node contains:

- The double value in IEEE-754 binary64 format.
- A 2-byte little-endian *format* field encoding the output formatting information needed so
that, upon decompression, the value can be reproduced exactly as the original text.

Note that the lowest 5 bits of the 2‑byte field are currently unused and reserved: encoders must
write them as 0, and decoders must ignore them (treat as “don’t care”) for forward compatibility.

From MSB to LSB, the 2-byte format field contains the following sections:
- [Scientific notation marker](#scientific-notation-marker) (2 bits)
- [Exponent sign](#exponent-sign) (2 bits)
- [Exponent digits](#exponent-digits) (2 bits)
- [Digits from first non-zero to end of number](#digits-from-first-non-zero-to-end-of-number) (5 bits)
- Reserved for future use (5 bits)
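
The layout above can be sketched as a pair of pack and unpack helpers. The bit positions are an interpretation of the "MSB to LSB" ordering in the list, and the names `FloatFormat`, `pack_format`, and `unpack_format` are hypothetical; the actual clp-s implementation may differ.

```python
import struct
from typing import NamedTuple


class FloatFormat(NamedTuple):
    sci_marker: int          # 2 bits (bits 15-14)
    exp_sign: int            # 2 bits (bits 13-12)
    exp_digits_minus_1: int  # 2 bits (bits 11-10)
    digits_minus_1: int      # 5 bits (bits 9-5)


def pack_format(fmt: FloatFormat) -> bytes:
    """Pack the fields into a 2-byte little-endian word; reserved bits are 0."""
    word = (
        (fmt.sci_marker << 14)
        | (fmt.exp_sign << 12)
        | (fmt.exp_digits_minus_1 << 10)
        | (fmt.digits_minus_1 << 5)
    )
    return struct.pack("<H", word)


def unpack_format(data: bytes) -> FloatFormat:
    """Unpack the fields, ignoring the 5 reserved low bits."""
    (word,) = struct.unpack("<H", data)
    return FloatFormat(
        (word >> 14) & 0b11,
        (word >> 12) & 0b11,
        (word >> 10) & 0b11,
        (word >> 5) & 0b11111,
    )
```

For example, `pack_format(FloatFormat(0b11, 0b01, 0b10, 9))` yields the word `0xD920`, stored as the bytes `20 D9`.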

To clarify the floating point formats that `FormattedFloat` can represent, we describe them in text
here:
* For non-scientific numbers we accept:
  * Any number that has at most 16 digits after the first non-zero digit
  * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
* For scientific numbers we accept:
  * Single digit numbers with no decimal, followed by an exponent
  * Or numbers with **1** digit preceding the decimal and up to 16 digits following the
    decimal, followed by an exponent
  * Where zero can not be the digit before the decimal, unless every digit in the number is zero
  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
    * With at most **4** exponent digits, which can be left-padded with `0`

With the added restrictions that:
* The floating point number follows the JSON grammar for floating point numbers.
* There exists an IEEE-754 binary64 number for which the string is the closest decimal
representation at the given precision.

These restrictions really correspond to "canonical" representations of floating point numbers with
up to 17 digits of precision. This means that our formatting scheme can always represent numbers
produced by format specifiers such as '%f', '%e', and '%g', so long as they don't use too many
digits of precision, and the underlying number isn't NaN, or +/- Infinity.
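
To illustrate why a double plus a handful of formatting fields is enough for these formats, here is a minimal decoding sketch. The function `render_float` and its parameters are hypothetical and use plain strings and integers rather than the bit-level encoding; it assumes the input satisfied the restrictions above (in particular, non-scientific numbers always have a fractional part).

```python
def render_float(value: float, digits: int, sci_marker: str = "",
                 exp_sign: str = "", exp_digits: int = 1) -> str:
    """Rebuild the text of a float from its value and formatting fields.

    digits     -- digits from the first non-zero digit to the end of the number
    sci_marker -- "" (not scientific), "e", or "E"
    exp_sign   -- "", "+", or "-" (only used for scientific notation)
    exp_digits -- number of exponent digits, left-padded with "0"
    """
    # Round to `digits` significant digits; Python's "e" format is correctly rounded.
    mantissa, exponent_text = format(value, f".{digits - 1}e").split("e")
    exponent = int(exponent_text)
    if sci_marker:
        return f"{mantissa}{sci_marker}{exp_sign}{abs(exponent):0{exp_digits}d}"
    # Non-scientific: the decimal exponent tells us how many of the counted
    # digits fall after the decimal point.
    frac_digits = digits - 1 - exponent
    return format(value, f".{frac_digits}f")


print(render_float(1.0, 4, "e", "+", 2))
print(render_float(float("0.000000123000"), 6))
```

An encoder could use the same routine as its acceptance test: if rendering the parsed double with the extracted fields does not reproduce the input text, the number falls back to `DictionaryFloat`.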

### Scientific notation marker

Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted
by `E` or `e`.

- `00`: Not scientific
- `01`: Scientific notation using `e`
- `11`: Scientific notation using `E`

`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
checks cleaner.

### Exponent sign

Records whether the exponent has a sign:

- `00`: No sign
- `01`: `+`
- `10`: `-`

`11` is unused by the current implementation.

For example, exponents of `0` may appear as `0`, `+0`, or `-0`, and these two bits can record the
format correctly.

### Exponent digits

Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are
both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
support exponents left-padded with `0`.

The stored value is **actual digits − 1**, since there is always at least one digit
(e.g., `00` → 1 digit). The two-bit mapping is:

- `00` → 1 digit
- `01` → 2 digits
- `10` → 3 digits
- `11` → 4 digits
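
The three exponent-related fields described so far could be extracted from the input text roughly as follows. This is a sketch: `exponent_fields` is a hypothetical helper, and writing zeros into the sign and digit-count fields for non-scientific numbers is an assumption, since the text above does not specify their contents in that case.

```python
import re

SCI_MARKER = {"e": 0b01, "E": 0b11}
EXP_SIGN = {"": 0b00, "+": 0b01, "-": 0b10}


def exponent_fields(text: str) -> tuple[int, int, int]:
    """Return (marker, sign, exponent digits - 1) for a JSON number."""
    match = re.search(r"([eE])([+-]?)([0-9]+)$", text)
    if match is None:
        return 0b00, 0b00, 0b00  # not scientific (assumed all-zero fields)
    marker, sign, digits = match.groups()
    if len(digits) > 4:
        raise ValueError("more than 4 exponent digits cannot be encoded")
    return SCI_MARKER[marker], EXP_SIGN[sign], len(digits) - 1


marker, sign, exp_digits = exponent_fields("1.000e+00")
is_scientific = bool(marker & 0b01)  # lowest marker bit acts as the flag
```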

### Digits from first non-zero to end of number

This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the exponent). Examples:

Comment on lines +202 to +204

🧹 Nitpick

Explicitly state the decimal point is excluded from the count.

Removes a small ambiguity in implementer interpretation.

Apply this diff:

-This counts the digits from the first non-zero digit up to the last digit of the integer or
-fractional part (excluding the exponent). Examples:
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:

- `123456789.1234567000` → **19** (from first `1` to last `0`)
- `1.234567890E16` → **10** (from first `1` to `0` before exponent)
- `0.000000123000` → **6** (from first `1` to last `0`)
- `0.00` → **3** (counts all zeros for zero value)

Per the [JSON specification][json_spec], the integer part of a floating point number cannot be
empty, so the minimum number of digits is **1**. To take advantage of this fact, we store this field
as **number of digits from the first non-zero digit to the end of the number - 1**; for the numeric
value zero we store **total number of digits - 1**.

As well, according to IEEE-754, only 17 decimal significant digits are needed to represent all
binary64 floating point numbers without precision loss. As a result, we currently allow a maximum of
**17 digits**. Because the stored value is **digits - 1** the maximum encoded value is 16, which
requires 5 bits.
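
The counting rule and the **digits - 1** encoding can be sketched as below, using the examples listed earlier in this section. `count_digits` and `encode_digits` are hypothetical names, and the input is assumed to already be a valid JSON floating-point number.

```python
def count_digits(text: str) -> int:
    """Count digits from the first non-zero digit to the end of the number,
    excluding the sign, the decimal point, and the exponent."""
    body = text.lstrip("-").lower().split("e")[0].replace(".", "")
    stripped = body.lstrip("0")
    # For a zero value every digit is counted, zeros included.
    return len(stripped) if stripped else len(body)


def encode_digits(text: str) -> int:
    """Return the 5-bit stored value (digits - 1), or raise if out of range."""
    digits = count_digits(text)
    if not 1 <= digits <= 17:
        raise ValueError(f"{digits} digits cannot be stored in a FormattedFloat")
    return digits - 1


for example in ["123456789.1234567000", "1.234567890E16", "0.000000123000", "0.00"]:
    print(example, count_digits(example))
```

Note that the first example has 19 digits, so under the current 17-digit limit it would be rejected by `encode_digits` and stored as a `DictionaryFloat` instead.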

We could support representing binary64 numbers with up to 32 significant digits, and we may choose
to do so in the future, but this is explicitly not supported in the current version of the format.
The rationale for not doing so now is that as the number of digits increases beyond 17, the
likelihood that the number corresponds to a valid IEEE-754 binary64 float decreases.

[json_spec]: https://datatracker.ietf.org/doc/html/rfc7159
[ieee754]: https://ieeexplore.ieee.org/document/4610935/
1 change: 1 addition & 0 deletions docs/src/dev-docs/index.md
@@ -99,4 +99,5 @@ design-project-structure
design-kv-ir-streams/index
design-metadata-db
design-parsing-wildcard-queries
design-retain-float-format
:::
22 changes: 22 additions & 0 deletions docs/src/user-docs/core-clp-s.md
@@ -26,6 +26,9 @@ Usage:
where `size` is the total size of the dictionaries and encoded messages in an archive.
* This option acts as a soft limit on memory usage for compression, decompression, and search.
* This option significantly affects compression ratio.
* `--retain-float-format` specifies that floating-point numbers should be stored with extra
metadata to preserve their textual representation after decompression. This feature is currently
not supported when ingesting KV-IR.
* `--structurize-arrays` specifies that arrays should be fully parsed and array entries should be
encoded into dedicated columns.
* `--auth <s3|none>` specifies the authentication method that should be used for network requests
@@ -76,6 +79,21 @@ AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
/mnt/logs/log1.json
```

:::{tip}
Use the `--retain-float-format` flag during compression to preserve the original textual format of
floating-point numbers. Internally, this switches to a different encoding for floating-point
numbers that always retains their original formats. For example, values like `1.000e+00` or
`0.000000012300` will be decompressed unchanged.
:::

**Enable retaining float numbers' formats:**

```shell
./clp-s c \
--retain-float-format \
/mnt/data/archives1 \
/mnt/logs/log1.json
```

## Decompression

Usage:
@@ -154,6 +172,10 @@ compressed data:**
* The order of log events is not preserved.
* The input directory structure is not preserved and during decompression all files are written to
the same file.
* When using the `--retain-float-format` flag:
  * KV-IR inputs currently don't support preserving the original printed float formats.
  * Comparisons against floating point numbers at query time treat each stored number as if it were
    the nearest representable double-precision value, which can potentially lose precision.
* In addition, there are a few limitations, related to querying arrays, described in the search
syntax [reference](reference-json-search-syntax).
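
As a small illustration of the precision caveat for query-time comparisons, the sketch below shows two decimal strings that differ as text but share the same nearest double. It uses Python's `float` (binary64 on common platforms) and `decimal.Decimal` purely for demonstration and does not reflect how clp-s evaluates queries internally.

```python
from decimal import Decimal

a, b = "0.3", "0.30000000000000001"

# As exact decimals, the two values differ.
differ_as_decimals = Decimal(a) != Decimal(b)

# As doubles, both round to the same nearest representable value,
# so a comparison on the double cannot tell them apart.
equal_as_doubles = float(a) == float(b)

print(differ_as_decimals, equal_as_doubles)
```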
