
Gzip Decompression Failure Due to 100MB Limit in Fluent Bit 3.0.7 #9058


Open
aydosman opened this issue Jul 8, 2024 · 28 comments · May be fixed by #10204

Comments

@aydosman

aydosman commented Jul 8, 2024

Bug Report

I'm encountering an issue with Fluent Bit where the gzip decompression fails due to exceeding the maximum decompression size of 100MB. Below are the relevant error logs and configurations for both the collector and aggregator.

To Reproduce

Example log message

[2024/07/08 08:05:26] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:05:26] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:05:52] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:05:52] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:06:08] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:06:08] [error] [input:forward:forward.0] gzip uncompress failure
[2024/07/08 08:06:20] [error] [gzip] maximum decompression size is 100MB
[2024/07/08 08:06:20] [error] [input:forward:forward.0] gzip uncompress failure

Steps to reproduce the problem

Set up Fluent Bit with the provided collector and aggregator configurations.

Monitor the logs for gzip decompression errors.

Expected behavior

Fluent Bit should handle the gzip decompression without exceeding the maximum decompression size limit.

Screenshots

N/A

Your Environment

Version used: Fluent Bit 3.0.7

Configuration:

Collector Configuration:

[SERVICE]
    daemon false
    log_level warn
    storage.path /var/fluent-bit/state/flb-storage/
    storage.sync normal
    storage.max_chunks_up 32
    storage.backlog.mem_limit 32MB
    storage.metrics true
    storage.delete_irrecoverable_chunks true
    http_server true
    http_listen 0.0.0.0
    http_Port 2020

[INPUT]
    name tail
    path /var/log/containers/*.log
    tag_regex (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    tag kube.<namespace_name>.<pod_name>.<container_name>
    read_from_head true
    multiline.parser cri
    skip_long_lines true
    skip_empty_lines true
    buffer_chunk_size 32KB
    buffer_max_size 32KB
    db /var/fluent-bit/state/flb-storage/tail-containers.db
    db.sync normal
    db.locking true
    db.journal_mode wal
    storage.type filesystem

[OUTPUT]
    name forward
    match *
    host fluent-bit-aggregator.observability.svc.cluster.local
    port 24224
    compress gzip
    workers 2
    retry_limit false
    storage.total_limit_size 16GB

Aggregator Configuration:

[SERVICE]
    daemon false
    log_level warn
    storage.path /fluent-bit/data
    storage.sync full
    storage.backlog.mem_limit 128M
    storage.metrics true
    storage.delete_irrecoverable_chunks true
    storage.max_chunks_up 64
    http_server true
    http_listen 0.0.0.0
    http_Port 2020

[INPUT]
    name forward
    listen 0.0.0.0
    port 24224
    buffer_chunk_size 1M
    buffer_max_size 4M
    storage.type filesystem

[OUTPUT]
    name loki
    match *
    host loki-gateway.logging.svc.cluster.local
    port 80
    line_format json
    auto_kubernetes_labels false
    label_keys $cluster, $namespace, $app
    storage.total_limit_size 16GB

Environment name and version (e.g. Kubernetes? What version?)

Kubernetes 1.30, 1.29, 1.28

Server type and version

AKS/EKS

Operating System and version

Ubuntu, AL2, AL2023 and BottlerocketOS

Filters and plugins

See above

Additional context

This issue persists across all Fluent Bit instances with the same configuration. Both the collector and the aggregator run the same Fluent Bit version (3.0.7). The rate of records processed is consistently around 800 per second, so the load is not particularly heavy. Any guidance or a solution to resolve this issue would be greatly appreciated.

@edsiper
Member

edsiper commented Jul 8, 2024

Just curious, what's the use case where one payload might expand to over 100MB?

Today that's a hard limit; we would need to extend it per component. Besides that, is in_forward being used in other areas of your use case?
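
For background on where that 100MB ceiling bites: gzip stores the uncompressed size of each member (ISIZE, modulo 2^32) little-endian in the last 4 bytes, and a decoder can reject a payload by comparing that field against a limit before inflating. Below is a minimal sketch of such a check, with an illustrative constant and helper name rather than Fluent Bit's actual internals:

#include <stddef.h>
#include <stdint.h>

#define MAX_INFLATE_SIZE (100 * 1000 * 1000)   /* illustrative 100MB cap */

/* Read the ISIZE field from the trailing 4 bytes of a gzip member and
 * reject the member if the advertised uncompressed size exceeds the cap. */
static int check_gzip_isize(const uint8_t *buf, size_t len, uint32_t *isize)
{
    if (len < 18) {             /* 10-byte header + 8-byte footer minimum */
        return -1;
    }
    *isize = (uint32_t) buf[len - 4]
             | ((uint32_t) buf[len - 3] << 8)
             | ((uint32_t) buf[len - 2] << 16)
             | ((uint32_t) buf[len - 1] << 24);
    if (*isize > MAX_INFLATE_SIZE) {
        return -1;              /* the "maximum decompression size is 100MB" case */
    }
    return 0;
}

This also explains why a mis-detected member boundary (discussed later in this thread) produces the 100MB error: if the 4 bytes preceding the supposed boundary are ordinary compressed data rather than a real footer, the "size" read from them is effectively random.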

@aydosman
Author

aydosman commented Jul 9, 2024

Could it be down to back pressure on the collector side? Let me try to prove that; I'll run some simulations and provide all the related metrics.

is in_forward being used in other areas of your use case?

Not at this time

@mirko-lazarevic
Contributor

@edsiper I experience the same issue with Fluent Bit 3.0.4; however, using the same configuration with Fluent Bit 2.2.2 we don't encounter this error. I believe, although I might be wrong, that the error was introduced with this change: #8665

FYI: @cosmo0920

@stevehipwell

@edsiper has this been investigated?

@cosmo0920
Contributor

Hi, in #9139 I'm trying to add full validation of concatenated gzip streams in forwarded payloads. Would you mind testing that patch?

@stevehipwell

@cosmo0920 is there an OCI image built as part of the PR?

@cosmo0920
Contributor

No. I tried to generate PR-specific images, but no luck.

@stevehipwell

Has this been fixed in v3.1.5?

@aydosman
Author

fb version 3.1.5 – bug still exists
fb version 3.1.6 – bug still exists

To test a theory we had, that the data we send/persist or the databases on the collector and aggregator might have somehow been corrupted, these versions were also tested on fresh new cloud nodes.

@cosmo0920
Contributor


Do you have reproducible steps?

@aydosman
Author


The configuration shown above has not changed; only the Fluent Bit container image version has been updated. Let me know if you need anything else.

@aydosman
Author

Any update on this issue?

@ksauzz
Contributor

ksauzz commented Oct 28, 2024

Hi, how can we move this issue forward? I feel we cannot upgrade our fluent-bit aggregator to the 3.x series until this issue is fixed.
Of course, we can disable the gzip payload as a workaround...
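
For reference, that workaround amounts to removing the compress option from the collector's forward output so payloads travel uncompressed; everything else in that output block from the configuration above stays the same. A sketch of the adjusted block, at the cost of more network traffic between collectors and the aggregator:

[OUTPUT]
    name forward
    match *
    host fluent-bit-aggregator.observability.svc.cluster.local
    port 24224
    # compress gzip   <- removed so the aggregator never has to gunzip payloads
    workers 2
    retry_limit false
    storage.total_limit_size 16GB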

@stevehipwell

CC @patrick-stephens 👀

@patrick-stephens
Contributor

@cosmo0920 did we provide a configuration around this for Core?

@cosmo0920
Contributor

cosmo0920 commented Oct 29, 2024

No, we didn't. We just perform decompression of the compressed buffers.

@stevehipwell

@cosmo0920 the original bug still exists, is there a fix planned?

@ksauzz
Contributor

ksauzz commented Nov 26, 2024

For further insight, I tried to reproduce the issue with a minimal setup in a local environment, but I couldn't observe gzip concatenation. How can I reproduce gzip concatenation in in_forward? Locally, the concatenated gzip payload count is always 0.

[2024/11/26 09:23:18] [debug] [input:forward:forward.0] concatenated gzip payload count is 0

@stevehipwell

CC @aydosman

@ksauzz
Contributor

ksauzz commented Nov 27, 2024

I finally succeeded in reproducing the issue with fluent-bit 3.0.4, 3.1.9 and 3.2.0 locally. I prepared a git repo for testing.

Thank you.

CC @cosmo0920

Environment:

  • fluent-bit aggregator x 1
  • nginx as reverse proxy for TLS termination x 1
  • fluent-bit collector x 3 or 4

Aggregator's error logs

It seems gzip decompression fails immediately after a concatenated gzip payload appears.

[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:104] read()=114688 pre_len=131072 now_len=245760
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:68] handshake status = 3
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:104] read()=19418 pre_len=245760 now_len=265178
[2024/11/27 06:12:04] [debug] [input:forward:forward.0] concatenated gzip payload count is 1
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_prot.c:1569] [gzip decompression] loop = 0, len = 220861, original_len = 265124
[2024/11/27 06:12:04] [error] [gzip] maximum decompression size is 100MB
[2024/11/27 06:12:04] [error] [input:forward:forward.0] gzip uncompress failure
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:68] handshake status = 3
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:104] read()=49152 pre_len=0 now_len=49152
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:68] handshake status = 3
[2024/11/27 06:12:04] [trace] [input:forward:forward.0 at /src/fluent-bit/plugins/in_forward/fw_conn.c:104] read()=65536 pre_len=49152 now_len=114688

To reproduce

  1. git clone --branch flb-9058 https://github.com/ksauzz/fluent-bit-sandbox.git
  2. cd fluent-bit-sandbox
  3. sudo journalctl -n 1000000 > ./logs/messages to generate test logs for collectors
  4. run ./aggregator.sh in a terminal
  5. run ./nginx.sh in another terminal
  6. run ./collector.sh in another terminal 3 or 4 times for multiple fluent-bit collectors

You can change the version of fluent-bit by specifying VERSION=x.x.x like

VERSION=3.2.0 ./aggregator.sh

@ksauzz
Contributor

ksauzz commented Dec 24, 2024

Hi, any update?

I think multiple root causes could exist. It seems one was fixed by #9139 in 3.1.5, but another one still exists and causes the error at a low frequency.

@aydosman
Author

aydosman commented Feb 2, 2025

Hi everyone, it's been a little while since the last update on this issue. Is there any update?

@Tangui0232

I spent some time digging into this issue and have found the likely culprit.

The issue has to do with the assumption, made in the handling of concatenated gzip payloads (#8665), that the gzip body cannot contain a valid gzip header.

Concatenated gzip payloads are appended directly to each other without any length context, so the forward input plugin searches the binary data it receives for valid gzip headers and assumes they mark member boundaries. However, since the gzip body can contain byte sequences that look like valid headers, those are falsely flagged as concatenated gzip payloads, which results in gzip body content being read as the decompression size (from the gzip footer).

flb_gzip_count is the relevant function here. Line 823 scans for valid headers:

size_t flb_gzip_count(const char *data, size_t len, size_t **out_borders, size_t border_count)

I'm not sure of the best course of action from here. To me this seems like a protocol issue. Some additional context (e.g. gzip payload length) is probably required rather than simply mashing the gzip payloads together.
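
To illustrate the failure mode, here is a simplified sketch of that kind of scan (not the actual flb_gzip_count implementation): it counts every occurrence of the gzip magic bytes, including occurrences that happen to appear inside compressed payload data.

#include <stddef.h>
#include <stdint.h>

/* Count positions that look like the start of a gzip member: magic bytes
 * 0x1f 0x8b followed by the deflate method byte 0x08. Compressed body
 * bytes can also contain this sequence, so matches inside a member's data
 * are miscounted as member borders. */
static size_t naive_gzip_member_scan(const uint8_t *data, size_t len)
{
    size_t i;
    size_t count = 0;

    for (i = 0; i + 2 < len; i++) {
        if (data[i] == 0x1f && data[i + 1] == 0x8b && data[i + 2] == 0x08) {
            count++;
        }
    }
    return count;
}

With members appended back to back and no length framing, a scan like this has no reliable way to tell a real header from a lookalike inside the deflate stream.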

@Tangui0232

Tangui0232 commented Apr 8, 2025

Example unit test demonstrating the issue:

/* Headers assumed from Fluent Bit's internal test suite (tests/internal/gzip.c). */
#include <fluent-bit/flb_gzip.h>
#include "flb_tests_internal.h"

void test_not_concat()
{
    size_t border_count = 0;
    /* Arbitrary payload; its compressed form happens to contain a byte
     * sequence that looks like a gzip header. */
    char data[] = {
        0x06, 0x03, 0x00, 0x00, 0x07, 0x01, 0x05, 0x04, 0x07, 0x07, 0x02, 0x03, 0x00, 0x01, 0x00, 0x04, 0x06, 0x02,
        0x02, 0x02, 0x06, 0x02, 0x00, 0x06, 0x04, 0x06, 0x00, 0x06, 0x07, 0x00, 0x03, 0x05, 0x03, 0x04, 0x06, 0x03,
        0x05, 0x03, 0x07, 0x05, 0x02, 0x01, 0x00, 0x02, 0x02, 0x00, 0x06, 0x01, 0x03, 0x00, 0x03, 0x01, 0x02, 0x03,
        0x07, 0x07, 0x01, 0x07, 0x05, 0x01, 0x00, 0x00, 0x06, 0x03, 0x04, 0x04, 0x06, 0x02, 0x07, 0x05, 0x07, 0x02,
        0x06, 0x07, 0x04, 0x01, 0x00, 0x03, 0x02, 0x03, 0x03, 0x05, 0x04, 0x06, 0x00, 0x03, 0x05, 0x02, 0x02, 0x02,
        0x03, 0x02, 0x02, 0x01, 0x06, 0x07, 0x06, 0x04, 0x01, 0x05
    };
    size_t len = sizeof(data);
    void *compressed = NULL;
    size_t compressed_len = 0;

    flb_gzip_compress(&data, len, &compressed, &compressed_len);
    border_count = flb_gzip_count((char *) compressed, compressed_len, NULL, 0);

    /* Only one gzip member exists, so ideally this would be 0; the assertion
     * documents the buggy behaviour (a false border is detected). */
    TEST_CHECK(border_count == 1);
}
Test compress...                                [ OK ]
Test count...                                   [ OK ]
Test not_overflow...                            [ OK ]
Test not_concat...                              [ OK ]
SUCCESS: All unit tests have passed.

This is a custom test case I created. The border_count is expected to be 0 when there is only one gzip payload, but it's 1 in this case.

@edsiper edsiper linked a pull request Apr 11, 2025 that will close this issue
@edsiper
Member

edsiper commented Apr 11, 2025

Hi folks, I have submitted a potential solution in this PR: #10204

Would you please give that branch a try?

@ksauzz
Contributor

ksauzz commented Apr 11, 2025

@edsiper Thank you for the fix!
I tested the fix, but unfortunately it seems to fail to process gzip payloads...

Error logs

aggregator-1  | [2025/04/11 06:03:40] [error] [gzip] no valid gzip members found
aggregator-1  | [2025/04/11 06:03:40] [error] [input:forward:forward.0] gzip uncompress failure

To reproduce

  1. build docker image with the fix.
  2. run fluent-bit collectors and aggregator
git clone https://github.com/ksauzz/fluent-bit-sandbox.git
cd fluent-bit-sandbox
sudo journalctl -n 1000000 > ./logs/messages   # prepare logs to be forwarded
VERSION=<docker image tag for the fix> docker compose up       # start nginx for TLS, fluent-bit collectors, and fluent-bit aggregator

@Tangui0232

@edsiper Thanks for the fix. Unfortunately, I don't think this addresses the underlying issue, as we are still searching for gzip headers to determine the boundaries between gzip payloads.

I did some investigation into how fluentd handles this: it uses the unused field from Zlib::GzipReader to know where the next gzip payload begins:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/compressable.rb#L103

I put together a branch based on those changes that seems to be working for me (the avail_in member of the mz_stream holds the information we need to know where the deflate gzip body ends):
https://github.com/fluent/fluent-bit/compare/master...Tangui0232:fluent-bit:fix-concatenated_gzip_payloads?expand=1

This commit provides a good view of the changes, since it's a diff against the old code and has a description of the idea behind them:
1e5ccad

Feel free to use these changes in any way. If you want me to create a PR, I can do that as well. I haven't looked in detail at everything that needs to be done to create a PR, so I'm sure the code is not up to standards, but it should demonstrate the idea.
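
To make the idea concrete, here is a minimal sketch of walking concatenated gzip members by relying on the decoder's remaining-input counter instead of scanning for headers. It uses stock zlib (inflateInit2 with 15 + 16 to enable the gzip wrapper) purely for illustration; Fluent Bit bundles miniz, so the real change works against mz_stream's avail_in as described above:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Decompress possibly concatenated gzip data, reporting each member.
 * After inflate() returns Z_STREAM_END, strm.avail_in says how much input
 * is left, i.e. where the next member starts; no header scanning needed. */
static int walk_gzip_members(const unsigned char *data, size_t len)
{
    unsigned char out[16384];
    z_stream strm;
    int ret;
    int member = 0;

    memset(&strm, 0, sizeof(strm));
    strm.next_in  = (unsigned char *) data;
    strm.avail_in = (uInt) len;

    if (inflateInit2(&strm, 15 + 16) != Z_OK) {   /* gzip decoding */
        return -1;
    }

    while (strm.avail_in > 0) {
        size_t member_out = 0;

        do {
            strm.next_out  = out;
            strm.avail_out = sizeof(out);
            ret = inflate(&strm, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) {
                inflateEnd(&strm);
                return -1;                        /* corrupt or truncated member */
            }
            member_out += sizeof(out) - strm.avail_out;
        } while (ret != Z_STREAM_END);

        printf("member %d: %zu bytes decompressed, %u input bytes remaining\n",
               ++member, member_out, strm.avail_in);

        /* Reset the decoder and continue if another member follows. */
        if (strm.avail_in > 0 && inflateReset(&strm) != Z_OK) {
            inflateEnd(&strm);
            return -1;
        }
    }

    inflateEnd(&strm);
    return 0;
}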

@edsiper
Member

edsiper commented Apr 23, 2025

@Tangui0232 thanks for helping with this! Do you think you can submit a PR on top of my test branch gzip-concatenated?
