27 commits
150497b
Add documentation for the lossless float representation feature.
gibber9809 Sep 10, 2025
f0c4036
Small edits
gibber9809 Sep 10, 2025
359f295
Apply rabbit comments
gibber9809 Sep 10, 2025
5b89380
Apply suggestions from code review
gibber9809 Sep 11, 2025
8a0bb5f
Address review comments
gibber9809 Sep 11, 2025
cc49e38
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 11, 2025
6297adc
Describe fields in float format using list with links
gibber9809 Sep 19, 2025
049ed6f
Change casing for all section headers
gibber9809 Sep 19, 2025
585ab66
Add a lot of motivation
gibber9809 Sep 19, 2025
aa008af
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 19, 2025
b4d624f
Address rabbit comments
gibber9809 Sep 19, 2025
304919e
Some edits
gibber9809 Sep 19, 2025
df4e32b
More edits
gibber9809 Sep 19, 2025
1b76e6c
Apply one more rabbit comment
gibber9809 Sep 19, 2025
288ce21
Apply suggestions from code review
gibber9809 Sep 22, 2025
03f2d38
Apply suggestions from code review
gibber9809 Sep 22, 2025
b864ab5
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
a945bd2
Improve presentation of json number grammar
gibber9809 Sep 22, 2025
e16c73d
Clarify decription for encoding the number of digits
gibber9809 Sep 22, 2025
c1a4903
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 22, 2025
9dbeb9d
Update docs/src/dev-docs/design-retain-float-format.md
gibber9809 Sep 22, 2025
4909490
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 22, 2025
3a46b3a
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
0c85740
Update docs to reflect no-retain-float-format argument change
gibber9809 Sep 23, 2025
c098515
Merge remote-tracking branch 'upstream/main' into retain-float-format…
gibber9809 Sep 23, 2025
da06d72
Address rabbit comment
gibber9809 Sep 23, 2025
016d8e4
Merge branch 'main' into retain-float-format-docs
gibber9809 Sep 23, 2025
119 changes: 119 additions & 0 deletions docs/src/dev-docs/design-retain-float-format.md
@@ -0,0 +1,119 @@
# Retaining Floating-Point Format Information

To losslessly retain the string representation of floating point numbers, we use two encoding
strategies:
* `FormattedFloat`: similar to `DateString`, we store formatting information about a floating point
number alongside its IEEE-754 binary64 representation.
* `DictionaryFloat`: we store the full string representation of the floating point number in the
variable dictionary, and encode numbers as their corresponding variable dictionary IDs.

Generally we prefer storing floating point numbers as `FormattedFloat` over `DictionaryFloat`
because:
* We can directly compare against the stored IEEE-754 binary64 float at query time instead of having
to first parse the string representation of a floating point number.
* We avoid bloating the variable dictionary with non-repetitive floating point strings.

Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
converted from a binary64 floating point number. As a result, at parsing time, we check if a given
floating point number is representable as a `FormattedFloat`, and if it isn't, we encode it as a
`DictionaryFloat`.

## High-Level `FormattedFloat` Specification

Each `FormattedFloat` node contains:

- The double value in IEEE-754 binary64 format.
- A 2-byte *format* field encoding the necessary output formatting information so that, upon
  decompression, the value can be reproduced to match the original text as closely as possible.
  The four subfields shown below occupy 11 bits; the remaining 5 bits of the 2-byte field are
  currently reserved. Encoders must write them as 0, and decoders must ignore them (treat as
  "don't care") for forward compatibility.

```text
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+
| Scientific Notation Marker (2 bits) | Exponent Sign (2 bits) | Exponent Digits (2 bits) | Digits from First Non-Zero to End of Number (5 bits) |
+-------------------------------------+------------------------+--------------------------+------------------------------------------------------+
```

To clarify the floating point formats that `FormattedFloat` can represent, we describe them in text
here:
* For non-scientific numbers we accept:
  * Any number that has at most 16 digits after the first non-zero digit
  * Or at most 1 zero before the decimal point and 16 zeroes after it, if the number is a zero
* For scientific numbers we accept:
  * Single digit numbers with no decimal point, followed by an exponent
  * Or numbers with **1** digit preceding the decimal point and up to 16 digits following it,
    followed by an exponent
  * Where zero cannot be the digit before the decimal point, unless the number is a zero
  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
  * With at most **4** exponent digits, which must be left-padded with `0`

With the added restrictions that:
* The floating point number follows the JSON grammar for floating point numbers.
* There exists an IEEE-754 binary64 number for which the string is the closest decimal
representation at the given precision.
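
The second restriction can be checked by parsing the string, taking the exact value of the
resulting binary64 number, and rounding it back to the same number of significant digits. The
Python sketch below shows one way to do this with the standard `decimal` module. The helper name
`round_trips` and the choice of round-half-even are assumptions for illustration; the actual
parser may perform this check differently.

```python
from decimal import ROUND_HALF_EVEN, Context, Decimal


def round_trips(text: str, num_digits: int) -> bool:
    """Return True if `text` is what the nearest binary64 value rounds to at `num_digits`.

    `num_digits` is the number of digits from the first non-zero digit to the end of
    the number (excluding the decimal point and exponent).
    """
    original = Decimal(text)  # exact decimal value of the input text
    value = float(text)  # nearest binary64 value
    if original.is_zero():
        return value == 0.0
    exact = Decimal(value)  # exact decimal expansion of the binary64 value
    rounded = Context(prec=num_digits, rounding=ROUND_HALF_EVEN).plus(exact)
    return rounded == original
```

For instance, `round_trips("0.1", 1)` holds, while `round_trips("0.10000000000000000001", 20)` does
not, because the nearest binary64 value differs from the text within the first 20 significant
digits. In this sketch, a string like the latter would fall back to `DictionaryFloat`.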

### Scientific Notation Marker

Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted
by `E` or `e`.

- `00`: Not scientific
- `01`: Scientific notation using `e`
- `11`: Scientific notation using `E`

`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
checks cleaner.

### Exponent Sign

Records whether the exponent has a sign:

- `00`: No sign
- `01`: `+`
- `10`: `-`

`11` is unused by the current implementation.

For example, exponents of `0` may appear as `0`, `+0`, or `-0`, and these two bits can record the
format correctly.

### Exponent Digits

Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are
both three digits, two bits are enough to represent the digit count.

The stored value is **actual digits − 1**, since there is always at least one digit
(e.g., `00` → 1 digit). The two-bit mapping is:

- `00` → 1 digit
- `01` → 2 digits
- `10` → 3 digits
- `11` → 4 digits

### Digits from First Non-Zero to End of Number

This counts the digits from the first non-zero digit up to the last digit of the integer or
fractional part (excluding the exponent). Examples:

Comment on lines +202 to +204
Contributor


🧹 Nitpick

Explicitly state the decimal point is excluded from the count.

Removes a small ambiguity in implementer interpretation.

Apply this diff:

-This counts the digits from the first non-zero digit up to the last digit of the integer or
-fractional part (excluding the exponent). Examples:
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:
🤖 Prompt for AI Agents
In docs/src/dev-docs/design-retain-float-format.md around lines 202 to 204, the
description of how digits are counted is ambiguous about whether the decimal
point is counted; update the sentence to explicitly state that the decimal point
is excluded from the count. Modify the wording so it clearly reads that digits
are counted from the first non-zero digit to the last digit of the integer or
fractional part, excluding both the exponent and the decimal point, and add a
short clarifying example if desired.

- `123456789.1234567000` → **19** (from first `1` to last `0`)
- `1.234567890E16` → **10** (from first `1` to `0` before exponent)
- `0.000000123000` → **6** (from first `1` to last `0`)
- `0.00` → **3** (counts all zeros for zero value)
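
The counting rule used in these examples can be sketched in Python as follows. The function name
`count_digits` is hypothetical, and the sketch assumes its input already follows the JSON number
grammar.

```python
def count_digits(text: str) -> int:
    """Count digits from the first non-zero digit to the end of the number.

    The sign, the decimal point, and the exponent are excluded. For a zero value,
    every digit is counted.
    """
    mantissa = text.lstrip("-")
    for marker in ("e", "E"):
        mantissa = mantissa.split(marker)[0]  # drop the exponent, if any
    digits = mantissa.replace(".", "")  # the decimal point is not a digit
    significant = digits.lstrip("0")  # skip leading zeros
    return len(significant) if significant else len(digits)
```

Applied to the four examples above, this returns 19, 10, 6, and 3 respectively.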

Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty,
so the minimum number of digits is **1**. To take advantage of this fact, we store this field as
**(number of digits from the first non-zero digit to the end of the number) − 1**.

In addition, according to IEEE-754, 17 significant decimal digits are enough to represent all
binary64 floating point numbers without precision loss. As a result, we currently choose to allow
a maximum value of **17 digits** in this field. Unfortunately, even with our encoding scheme, this
requires 5 bits to store the maximum encoded value of 16.
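
Both claims are easy to check with a short Python sketch. The helper name
`round_trips_at_precision` is made up for this example; it relies only on Python's correctly
rounded float formatting and parsing.

```python
def round_trips_at_precision(value: float, significant_digits: int) -> bool:
    """Format `value` with the given number of significant digits and parse it back."""
    text = f"{value:.{significant_digits - 1}e}"
    return float(text) == value


# 17 significant digits always recover the original binary64 value.
samples = [0.1, 1 / 3, 0.1 + 0.2, 5e-324, 1.7976931348623157e308]
seventeen_is_enough = all(round_trips_at_precision(v, 17) for v in samples)

# 16 significant digits are not always enough: 0.1 + 0.2 collapses to 0.3.
sixteen_can_fail = not round_trips_at_precision(0.1 + 0.2, 16)

# The largest stored value is 17 - 1 = 16, which needs 5 bits; 4 bits stop at 15.
bits_needed = (17 - 1).bit_length()
```

Here `bits_needed` evaluates to 5, which is why storing the count minus one does not let the field
fit in 4 bits.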

We could support representing binary64 numbers with up to 32 significant digits, and we may choose
to do so in the future, but this is explicitly not supported in the current version of the format.
The rationale for not doing so now is that as the number of digits increases beyond 17, the
likelihood that the number corresponds to a valid IEEE-754 binary64 float decreases.

[json_grammar]: https://www.crockford.com/mckeeman.html
1 change: 1 addition & 0 deletions docs/src/dev-docs/index.md
@@ -83,4 +83,5 @@ design-project-structure
design-kv-ir-streams/index
design-metadata-db
design-parsing-wildcard-queries
design-retain-float-format
:::
20 changes: 20 additions & 0 deletions docs/src/user-docs/core-clp-s.md
@@ -26,6 +26,9 @@ Usage:
where `size` is the total size of the dictionaries and encoded messages in an archive.
* This option acts as a soft limit on memory usage for compression, decompression, and search.
* This option significantly affects compression ratio.
* `--retain-float-format` specifies that floating point numbers should be stored with format
information so that their original text format is reproduced after decompression. This feature is
currently not supported when ingesting KV-IR.
* `--structurize-arrays` specifies that arrays should be fully parsed and array entries should be
encoded into dedicated columns.
* `--auth <s3|none>` specifies the authentication method that should be used for network requests
@@ -76,6 +79,21 @@ AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
/mnt/logs/log1.json
```

:::{tip}
Use the `--retain-float-format` flag during compression to preserve the original text of floating
point numbers. Internally, clp-s switches to a different encoding for floating point numbers that
always retains their original formats. For example, values like `1.000e+00` or `0.000000012300`
will be decompressed unchanged.
:::

**Enable retaining float numbers' formats:**

```shell
./clp-s c \
--retain-float-format \
/mnt/data/archives1 \
/mnt/logs/log1.json
```

## Decompression

Usage:
@@ -154,6 +172,8 @@ compressed data:**
* The order of log events is not preserved.
* The input directory structure is not preserved and during decompression all files are written to
the same file.
* When using the `--retain-float-format` flag:
  * KV-IR inputs currently don't support preserving the original printed float formats.
* In addition, there are a few limitations, related to querying arrays, described in the search
syntax [reference](reference-json-search-syntax).
