Gzip Decompression Failure Due to 100MB Limit in Fluent Bit 3.0.7 #9058
Comments
Just curious, what's the use case where one payload might expand to over 100MB? Today that's a hard limit; we would need to extend it per component. Besides in_forward, is it being used in other areas in your use case?
Could it be down to back pressure on the collector side? Let me try to prove that. I'll run some simulations and provide all the related metrics.
Not at this time.
@edsiper I experience the same issue with Fluent Bit version 3.0.4; however, using the same configuration with Fluent Bit version 2.2.2 we don't encounter this error. I believe, although I might be wrong, that the error was introduced with this change: #8665. FYI: @cosmo0920
@edsiper has this been investigated?
Hi, I'm trying to add full validation of concatenated gzip streams of forwarded payloads in #9139. Would you mind testing that patch?
@cosmo0920 is there an OCI image built as part of the PR?
No. I tried to generate PR-specific images, but no luck.
Has this been fixed in v3.1.5? |
Fluent Bit version 3.1.5 – the bug still exists. To add to this, and to rule out a theory we had that the data we send/persist and the DBs on the collector and aggregator side might have somehow been corrupted, these were tested on fresh new cloud nodes.
Do you have reproduction steps?
The configuration shown above has not changed; only the Fluent Bit container image version has been updated. Let me know if you need anything else.
Any update on this issue?
Hi, how can we proceed with this issue? I feel we cannot upgrade our fluent-bit aggregator to the 3.x series until this issue is fixed.
CC @patrick-stephens 👀
@cosmo0920 did we provide a configuration around this for Core?
No, we didn't. We just perform decompression operations on compressed buffers.
@cosmo0920 the original bug still exists; is there a fix planned?
For further insight, I tried to reproduce the issue with a minimal setup on a local environment, but I couldn't observe gzip concatenation. How can I reproduce gzip concatenation in in_forward? On my local setup, the concatenated gzip payload count is always 0.
CC @aydosman |
I finally succeeded in reproducing the issue with fluent-bit 3.0.4, 3.1.9 and 3.2.0 locally. I prepared a git repo for testing. Thank you. CC @cosmo0920 Environment:
Aggregator's error logs: it seems gzip decompression failed immediately after a concatenated gzip payload appeared.
To reproduce
You can change the version of fluent-bit by specifying
Hi, any update? I think multiple root causes could exist. It seems one was fixed by #9139 in 3.1.5, but another one still exists that causes the error at a low frequency.
Hi everyone, it's been a little while since the last update on this issue. Is there any update? |
I spent some time digging into this issue and have found the likely culprit. The problem stems from an assumption made in the handling of concatenated gzip payloads (#8665): that a gzip body cannot contain a valid gzip header. Concatenated gzip payloads are appended directly to each other without any length context, so the forward input plugin searches the binary data it receives for valid gzip headers and assumes each match marks a payload boundary. However, since a gzip body can contain valid header sequences, those matches are falsely flagged as concatenated gzip payloads, resulting in gzip body content being read as the decompression size (from the gzip footer). flb_gzip_count is the relevant function here; line 823 scans for valid headers (line 803 in 3f4a024).
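For context on the footer mentioned above: per RFC 1952, the last eight bytes of a gzip member are the CRC32 and ISIZE (the uncompressed length mod 2^32). A minimal sketch of reading that field, assuming the member boundary is already known, shows why a misdetected boundary yields a garbage size:

```c
/* Sketch: read the ISIZE field (uncompressed size mod 2^32) from the last
 * four bytes of a gzip member, per RFC 1952. If the member "end" was
 * misdetected and actually falls inside compressed body data, these four
 * bytes are arbitrary and the returned size is garbage. */
#include <stdint.h>
#include <stddef.h>

static uint32_t gzip_isize(const unsigned char *member, size_t len)
{
    const unsigned char *p;

    if (len < 18) {        /* 10-byte header + 8-byte footer minimum */
        return 0;
    }
    p = member + len - 4;  /* ISIZE is stored little-endian */
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}
```

An arbitrary four-byte value read this way will exceed 100MB most of the time, which matches the error being reported even for small records.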
I'm not sure of the best course of action from here. To me this seems like a protocol issue: some additional context (e.g. the gzip payload length) is probably required rather than simply concatenating the gzip payloads together.
Example unit test demonstrating the issue:
This is a custom test case I created. The border_count is expected to be 0 when there is only one gzip payload, but it's 1 in this case. |
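The commenter's actual test isn't reproduced here, but a rough, hypothetical sketch of the same idea is easy to build with zlib: compress a buffer whose plaintext contains the gzip magic bytes at compression level 0, so the bytes pass through verbatim in stored blocks, then run a naive header scan over the single resulting gzip payload:

```c
/* Hypothetical demo (not the original test from this thread): one gzip
 * member whose *body* contains the gzip magic bytes. Level 0 stores the
 * input verbatim, so the embedded 1f 8b 08 sequence survives into the
 * compressed stream and a naive header scan reports a false border.
 * Build with: cc demo.c -lz */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    unsigned char in[64], out[256];
    z_stream s;
    size_t clen, i;
    int borders = 0;

    memset(in, 'A', sizeof(in));
    in[20] = 0x1f; in[21] = 0x8b; in[22] = 0x08;   /* magic inside the body */

    memset(&s, 0, sizeof(s));
    /* windowBits 16+15 selects gzip framing; level 0 emits stored blocks */
    if (deflateInit2(&s, 0, Z_DEFLATED, 16 + 15, 8, Z_DEFAULT_STRATEGY) != Z_OK) {
        return 1;
    }
    s.next_in = in;   s.avail_in = sizeof(in);
    s.next_out = out; s.avail_out = sizeof(out);
    deflate(&s, Z_FINISH);
    clen = sizeof(out) - s.avail_out;
    deflateEnd(&s);

    /* Naive boundary scan, similar in spirit to what is described above:
     * count magic sequences found after the real header at offset 0. */
    for (i = 1; i + 2 < clen; i++) {
        if (out[i] == 0x1f && out[i + 1] == 0x8b && out[i + 2] == 0x08) {
            borders++;
        }
    }
    printf("compressed %zu bytes, false borders: %d (expected 0)\n",
           clen, borders);
    return 0;
}
```

A scan like this reports a false border for a single, perfectly valid payload, mirroring the border_count mismatch described above.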
Hi folks, I have submitted a potential solution in this PR: #10204. Would you please give that branch a try?
@edsiper Thank you for the fix! Error logs:
To reproduce
@edsiper Thanks for the fix. Unfortunately I don't think this addresses the underlying issue, as we are still searching for gzip headers to determine the boundaries between gzip payloads. I did some investigation into how fluentd handles this: it uses the unused field from Zlib::GzipReader to know where the next gzip payload begins. I put together a branch based on those changes that seems to be working for me (the in_avail member of the mz_stream holds the information we need to know the end of the deflated gzip body). This commit provides a good view of the changes, since it's a diff against the old code and has a description of the idea behind them. Feel free to use these changes in any way. If you want me to create a PR, I can do that as well. I haven't looked in detail at everything that needs to be done to create a PR, so I'm sure the code is not up to standards, but it should demonstrate the idea.
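A minimal sketch of that idea, written against plain zlib rather than the miniz wrapper used by the branch above (zlib's avail_in plays the same role): decode one member at a time and let the library report where it ended, instead of scanning for header bytes:

```c
/* Sketch: decompress a buffer that may contain several concatenated gzip
 * members. When inflate() returns Z_STREAM_END, avail_in says how many
 * input bytes were NOT consumed, i.e. exactly where the next member
 * starts, so no header scanning is needed. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

static int inflate_concatenated(const unsigned char *in, size_t in_len)
{
    unsigned char out[4096];
    z_stream s;
    int ret, members = 0;

    memset(&s, 0, sizeof(s));
    if (inflateInit2(&s, 16 + 15) != Z_OK) {   /* 16+15: expect gzip framing */
        return -1;
    }
    s.next_in  = (unsigned char *) in;
    s.avail_in = (uInt) in_len;

    while (s.avail_in > 0) {
        do {
            s.next_out  = out;
            s.avail_out = sizeof(out);
            ret = inflate(&s, Z_NO_FLUSH);
            if (ret != Z_OK && ret != Z_STREAM_END) {
                inflateEnd(&s);
                return -1;                     /* corrupt or truncated input */
            }
            /* (sizeof(out) - s.avail_out) decompressed bytes are ready here */
        } while (ret != Z_STREAM_END);

        members++;
        /* s.avail_in now marks the start of the next member, if any */
        if (s.avail_in > 0 && inflateReset(&s) != Z_OK) {
            inflateEnd(&s);
            return -1;
        }
    }
    inflateEnd(&s);
    printf("decoded %d gzip member(s)\n", members);
    return 0;
}
```

Fed the raw forwarded buffer, this walks each concatenated member using the library's own bookkeeping, so bytes inside a compressed body can never be mistaken for a boundary.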
@Tangui0232 thanks for helping on this! Do you think you can submit a PR on top of my test branch gzip-concatenated?
Bug Report
I'm encountering an issue with Fluent Bit where gzip decompression fails because it exceeds the maximum decompression size of 100MB. Below are the relevant error logs and configurations for both the collector and the aggregator.
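For illustration only (the names below are hypothetical, not Fluent Bit's actual internals), the error message implies a guard of roughly this shape, which a bogus footer value can trip even for small records:

```c
/* Hypothetical sketch of a maximum-expansion guard: the expected
 * uncompressed size is taken from the gzip footer and rejected if it
 * exceeds a fixed ceiling (100 MB here). If the footer bytes were read
 * from the wrong offset, the value is garbage and the guard fires. */
#include <stdint.h>

#define MAX_DECOMPRESS_SIZE (100u * 1024u * 1024u)   /* assumed 100 MB cap */

static int check_expansion(uint32_t isize_from_footer)
{
    if (isize_from_footer > MAX_DECOMPRESS_SIZE) {
        return -1;   /* refuse to allocate and decompress */
    }
    return 0;
}
```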
To Reproduce
Example log message
Steps to reproduce the problem
Set up Fluent Bit with the provided collector and aggregator configurations.
Monitor the logs for gzip decompression errors.
Expected behavior
Fluent Bit should handle the gzip decompression without exceeding the maximum decompression size limit.
Screenshots
N/A
Your Environment
Version used: Fluent Bit 3.0.7
Configuration:
Collector Configuration:
Aggregator Configuration:
Environment name and version (e.g. Kubernetes? What version?)
Kubernetes 1.30, 1.29, 1.28
Server type and version
AKS/EKS
Operating System and version
Ubuntu, AL2, AL2023 and BottlerocketOS
Filters and plugins
See above
Additional context
This issue persists across all Fluent Bit instances with the same configuration. Both the collector and the aggregator are using the same Fluent Bit version (3.0.7). The rate of records processed per second is consistently around 800, so not excessive. Any guidance or a solution to resolve this issue would be greatly appreciated.