y-scope · gibber9809 · Sep 10, 2025 · Sep 10, 2025 · Sep 10, 2025 · Sep 11, 2025
@@ -0,0 +1,122 @@
+# Retaining Floating-Point Format Information
+
+To losslessly retain the string representation of floating point numbers we use two encoding
+strategies:
+* `FormattedFloat`: similar to `DateString`, we store formatting information about a floating point
+number alongside its IEEE-754 binary64 representation.
+* `DictionaryFloat`: we store the full string representation of the floating point number in the
+variable dictionary, and encode numbers as their corresponding variable dictionary IDs.
+
+Generally we prefer storing floating point numbers as `FormattedFloat` over `DictionaryFloat`
+because:
+* We can directly compare against the stored IEEE-754 binary64 float at query time instead of having
+to first parse the string representation of a floating point number.
+* We avoid bloating the variable dictionary with non-repetitive floating point strings.
+
+Unfortunately, even though `FormattedFloat` is designed to represent most common encodings of
+IEEE-754 binary64 floats, we cannot guarantee that our input follows a common format or was
+converted from a binary64 floating point number. As a result, at parsing time, we check if a given
+floating point number is representable as a `FormattedFloat`, and if it isn't, we encode it as a
+`DictionaryFloat`.
+
+## High-Level `FormattedFloat` Specification
+
+Each `FormattedFloat` node contains:
+
+- The double value in IEEE-754 binary64 format.
+- A 2-byte little-endian *format* field encoding the output formatting information needed to
+  reproduce the original text exactly upon decompression.
+
+Note that the lowest 5 bits of the 2-byte field are currently unused and reserved: encoders must
+write them as 0, and decoders must ignore them (treat as “don’t care”) for forward compatibility.
+
+```text
++-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+
+| Scientific Notation Marker (2 bits) | Exponent Sign (2 bits) | Exponent Digits (2 bits) | Digits from First Non-Zero to End of Number (5 bits) | Reserved (5 bits) |
++-------------------------------------+------------------------+--------------------------+------------------------------------------------------+-------------------+
+MSB                                                                                                                                                                LSB
+```
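A minimal sketch of how the layout above could map to shifts and masks, assuming the 16-bit field is assembled as an unsigned integer and then written little-endian. The constant and function names are hypothetical and are not taken from the clp-s source; the example values follow the field encodings described in the subsections below.

```python
import struct

# Bit offsets read off the diagram (MSB to LSB). Names are illustrative only.
SCI_MARKER_SHIFT = 14   # 2 bits: scientific notation marker
EXP_SIGN_SHIFT = 12     # 2 bits: exponent sign
EXP_DIGITS_SHIFT = 10   # 2 bits: exponent digits - 1
NUM_DIGITS_SHIFT = 5    # 5 bits: digits from first non-zero to end - 1
# Bits 4..0 are reserved and written as 0.


def pack_format(sci_marker, exp_sign, exp_digits, num_digits):
    """Combine the four raw field values into one 16-bit integer."""
    for value, width in ((sci_marker, 2), (exp_sign, 2), (exp_digits, 2), (num_digits, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
    return (
        (sci_marker << SCI_MARKER_SHIFT)
        | (exp_sign << EXP_SIGN_SHIFT)
        | (exp_digits << EXP_DIGITS_SHIFT)
        | (num_digits << NUM_DIGITS_SHIFT)
    )


def unpack_format(field):
    """Split a 16-bit integer into its four fields, ignoring the reserved bits."""
    return (
        (field >> SCI_MARKER_SHIFT) & 0b11,
        (field >> EXP_SIGN_SHIFT) & 0b11,
        (field >> EXP_DIGITS_SHIFT) & 0b11,
        (field >> NUM_DIGITS_SHIFT) & 0b11111,
    )


def to_bytes(field):
    """Serialize the field as 2 little-endian bytes."""
    return struct.pack("<H", field)


# `1.000e+00`: marker `e` (01), sign `+` (01), 2 exponent digits (01), 4 digits (stored as 3).
example = pack_format(0b01, 0b01, 0b01, 3)  # 0x5460
```

Because `unpack_format` masks each field separately, a decoder written this way ignores whatever is stored in the reserved bits.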
+
+To clarify the floating point formats that `FormattedFloat` can represent, we describe them in text
+here:
+* For non-scientific numbers we accept:
+  * Any number that has at most 16 digits after the first non-zero digit
+  * Or at most 1 zero before the decimal and 16 zeroes after the decimal, if the number is a zero
+* For scientific numbers we accept:
+  * Single digit numbers with no decimal, followed by an exponent
+  * Or numbers with **1** digit preceding the decimal and up to 16 digits following the
+    decimal, followed by an exponent
+  * Where zero cannot be the digit before the decimal, unless every digit in the number is zero
+  * And where the exponent is specified by `e` or `E` optionally followed by `+` or `-`
+  * With at most **4** exponent digits, which can be left-padded with `0`
+
+With the added restrictions that:
+* The floating point number follows the JSON grammar for floating point numbers.
+* There exists an IEEE-754 binary64 number for which the string is the closest decimal
+  representation at the given precision.
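One way to test the last restriction is to parse the string as a binary64 value, print it back with the same number of significant digits, and compare the digits. The sketch below is an illustration under that assumption, not the check clp-s actually performs; the function name is hypothetical, and it relies on Python's `float()` and `format()` being correctly rounded.

```python
def is_closest_representation(text):
    """Return True if `text` survives a parse-and-reprint at its own precision."""
    # Isolate the digits of the integer and fractional parts.
    mantissa = text.lstrip("-")
    for marker in ("e", "E"):
        mantissa = mantissa.split(marker)[0]
    digits = mantissa.replace(".", "")
    significant = digits.lstrip("0")

    value = abs(float(text))
    if not significant:
        # All digits are zero, so the value must parse to zero.
        return value == 0.0

    # Reprint with the same number of significant digits and compare them.
    reprinted = format(value, f".{len(significant) - 1}e")
    reprinted_digits = reprinted.split("e")[0].replace(".", "")
    return reprinted_digits == significant
```

For instance, `0.10000000000000004` fails this check: the nearest binary64 value prints as `1.0000000000000003e-01` at 17 significant digits, so the string would have to be stored as a `DictionaryFloat`.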
+
+### Scientific Notation Marker
+
+Indicates whether the number is in scientific notation, and if so, whether the exponent is denoted
+by `E` or `e`.
+
+- `00`: Not scientific
+- `01`: Scientific notation using `e`
+- `11`: Scientific notation using `E`
+
+`10` is unused so that the lowest bit can act as a simple “scientific” flag, making condition
+checks cleaner.
+
+### Exponent Sign
+
+Records whether the exponent has a sign:
+
+- `00`: No sign
+- `01`: `+`
+- `10`: `-`
+
+`11` is unused by the current implementation.
+
+For example, exponents of `0` may appear as `0`, `+0`, or `-0`, and these two bits can record the
+format correctly.
+
+### Exponent Digits
+
+Since the maximum and minimum decimal exponents for a double, `308` and `-324` respectively, are
+both three digits, two bits are enough to represent the digit count. We allow up to 4 digits to
+support exponents left-padded with `0`.
+
+The stored value is **actual digits − 1**, since there is always at least one digit
+(e.g., `00` → 1 digit). The two-bit mapping is:
+
+- `00` → 1 digit
+- `01` → 2 digits
+- `10` → 3 digits
+- `11` → 4 digits
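The three exponent-related fields can be derived from the exponent text alone. The following sketch assumes the input already follows the JSON number grammar; the function name is hypothetical, and returning zeros for the sign and digit fields of non-scientific numbers is an assumption rather than something the specification states.

```python
import re

# Exponent marker, optional sign, and 1 to 4 digits.
_EXPONENT = re.compile(r"([eE])([+-]?)([0-9]{1,4})")

_SIGN_BITS = {"": 0b00, "+": 0b01, "-": 0b10}


def exponent_fields(text):
    """Return (scientific marker, exponent sign, exponent digits - 1)."""
    position = max(text.find("e"), text.find("E"))
    if position < 0:
        # Not scientific; the other two fields are left as zero (assumption).
        return (0b00, 0b00, 0b00)

    match = _EXPONENT.fullmatch(text[position:])
    if match is None:
        raise ValueError(f"exponent of {text!r} is not representable")

    marker, sign, digits = match.groups()
    marker_bits = 0b01 if marker == "e" else 0b11
    return (marker_bits, _SIGN_BITS[sign], len(digits) - 1)
```

With this mapping, `1.000e+00` yields `(0b01, 0b01, 0b01)` and `5E-0324` yields `(0b11, 0b10, 0b11)`, while an exponent with five or more digits is rejected.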
+
+### Digits from First Non-Zero to End of Number
+
+This counts the digits from the first non-zero digit up to the last digit of the integer or
+fractional part (excluding the decimal point and the exponent). Examples:
+
+- `123456789.1234567000` → **19** (from first `1` to last `0`; shown only to illustrate counting,
+  since 19 exceeds the 17-digit limit described below)
+- `1.234567890E16` → **10** (from first `1` to `0` before exponent)
+- `0.000000123000` → **6** (from first `1` to last `0`)
+- `0.00` → **3** (counts all zeros for zero value)
+
+Per the [JSON grammar][json_grammar], the integer part of a floating point number cannot be empty,
+so the minimum number of digits is **1**. To take advantage of this fact, we store this field as
+**number of digits from the first non-zero digit to the end of the number - 1**; for the numeric
+value zero, which has no non-zero digit, we store **total number of digits - 1**.
+
+In addition, 17 significant decimal digits are enough to represent any IEEE-754 binary64 floating
+point number without precision loss. As a result, we currently allow a maximum of **17 digits**.
+Because the stored value is **digits - 1**, the maximum encoded value is 16, which requires 5 bits.
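The counting rule and the 17-digit limit can be sketched as follows. The function names are hypothetical, and returning `None` for values that do not fit is an illustrative choice rather than the behavior of the actual encoder.

```python
MAX_DIGITS = 17  # current limit described above


def count_format_digits(text):
    """Count digits from the first non-zero digit to the end of the number."""
    # Drop the sign and the exponent, then the decimal point.
    mantissa = text.lstrip("-")
    for marker in ("e", "E"):
        mantissa = mantissa.split(marker)[0]
    digits = mantissa.replace(".", "")

    significant = digits.lstrip("0")
    # For a zero value there is no non-zero digit, so every digit is counted.
    return len(significant) if significant else len(digits)


def encode_digit_field(text):
    """Return the stored 5-bit value (count - 1), or None if over the limit."""
    count = count_format_digits(text)
    return count - 1 if count <= MAX_DIGITS else None
```

Applied to the examples above, `count_format_digits` returns 19, 10, 6, and 3 respectively, and `encode_digit_field("123456789.1234567000")` returns `None` because 19 digits exceed the limit.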
+
+Since a 5-bit field can hold values up to 31, we could support representing numbers with up to 32
+significant digits, and we may choose to do so in the future, but this is explicitly not supported
+in the current version of the format. The rationale for not doing so now is that as the number of
+digits increases beyond 17, the likelihood that the string was produced from an IEEE-754 binary64
+float decreases.
+
+[json_grammar]: https://www.crockford.com/mckeeman.html
@@ -83,4 +83,5 @@ design-project-structure
 design-kv-ir-streams/index
 design-metadata-db
 design-parsing-wildcard-queries
+design-retain-float-format
 :::
@@ -26,6 +26,9 @@ Usage:
     where `size` is the total size of the dictionaries and encoded messages in an archive.
     * This option acts as a soft limit on memory usage for compression, decompression, and search.
     * This option significantly affects compression ratio.
+  * `--retain-float-format` specifies that floating-point numbers should be stored with extra
+    metadata to preserve their textual representation after decompression. This feature is currently
+    not supported when ingesting KV-IR.
   * `--structurize-arrays` specifies that arrays should be fully parsed and array entries should be
     encoded into dedicated columns.
   * `--auth <s3|none>` specifies the authentication method that should be used for network requests
@@ -76,6 +79,21 @@ AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
     /mnt/logs/log1.json
 ```
 
+:::{tip}
+Use the `--retain-float-format` flag during compression to preserve the original text of
+floating-point numbers. Internally, `clp-s` switches to a different encoding for floating-point
+numbers that always retains their original formats. For example, values like `1.000e+00` or
+`0.000000012300` will be decompressed unchanged.
+:::
+
+**Compress while retaining the original formats of floating-point numbers:**
+
+```shell
+./clp-s c \
+    --retain-float-format \
+    /mnt/data/archives1 \
+    /mnt/logs/log1.json
+```
+
 ## Decompression
 
 Usage:
@@ -154,6 +172,10 @@ compressed data:**
 * The order of log events is not preserved.
 * The input directory structure is not preserved and during decompression all files are written to
   the same file.
+* When using the `--retain-float-format` flag:
+  * KV-IR inputs currently don't support preserving the original printed float formats.
+  * Comparisons against floating point numbers at query time treat each stored number as the
+    nearest representable double-precision value, which can lose precision.
 * In addition, there are a few limitations, related to querying arrays, described in the search
   syntax [reference](reference-json-search-syntax).