fix: trim user-function error log; move full PollContext to DEBUG (fixes #640)#919
Open
Vishal Kumar Singh (singhvishalkr) wants to merge 1 commit intoconfluentinc:masterfrom
Conversation
The ERROR path in AbstractParallelEoSStreamProcessor#runUserFunction used to log the entire PollContextInternal (which is @tostring'd via Lombok and transitively prints every ConsumerRecord in the batch) on every user-function failure. With reasonable batch sizes this easily overflows the line length budget of log shippers (Filebeat, Fluent Bit, journald) and ends up truncated. It also leaks record payloads into error logs, which is awkward when those payloads contain sensitive data -- users were already working around this by masking the log pattern at log4j level. Keep a concise ERROR line (partitions + record count) and move the full PollContext dump to DEBUG so the signal stays useful without polluting error logs. The existing DEBUG-only PCRetriableException branch is preserved. Fixes confluentinc#640
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Fixes #640.
AbstractParallelEoSStreamProcessor#runUserFunctioncurrently logs the entirePollContextInternalon every user-function failure:Two problems with this:
Truncation.
PollContextInternalis@ToStringd via Lombok, which transitively prints everyConsumerRecordin the batch. With realisticbatchSizesettings the resulting line easily exceeds the per-line budget of common log shippers (Filebeat, Fluent Bit, journald, Docker JSON driver), so the ERROR gets cut off and the stack trace below it becomes harder to correlate.Sensitive data in error logs. The
@ToStringexpansion serialises record keys and values, which can include user payloads — exactly the concern raised by Vinh Nguyen (@ndqvinh2109) in the thread above. Users were already working around this at log4j level by masking the log pattern.Modifications
Split the message into two tiers:
partitions=[...]andrecords=<count>— usingPollContext#getByTopicPartitionMap().keySet()andPollContext#size(). That's enough to correlate the failure to the right topic-partition without dragging record payloads into the error log.log.isDebugEnabled()): the fullPollContextdump, so operators troubleshooting a specific failure can still see the original context by bumping the logger toDEBUG.The existing
PCRetriableExceptionbranch (which already only logs atDEBUG) is preserved with the shorter message, since it never went to ERROR anyway.Added a short inline comment pointing at this issue so the rationale doesn't get lost.
Verifying this change
Flow is unchanged (same retry / mailbox / throw semantics), only log output differs. A unit test that exercises
runUserFunctionwith a throwing user function should now emit an ERROR line containingpartitions=andrecords=and no record payload, and (if DEBUG is enabled) a second DEBUG line with the full context.Happy to add an explicit test under
parallel-consumer-core/src/testif you'd prefer to lock this in. Noting that my previous PR #918 for the similar #631 was already along these lines.Does this pull request introduce a user-facing change?