Skip to content

fix: trim user-function error log; move full PollContext to DEBUG (fixes #640)#919

Open
Vishal Kumar Singh (singhvishalkr) wants to merge 1 commit intoconfluentinc:masterfrom
singhvishalkr:fix-640-verbose-error-log-user-function
Open

fix: trim user-function error log; move full PollContext to DEBUG (fixes #640)#919
Vishal Kumar Singh (singhvishalkr) wants to merge 1 commit intoconfluentinc:masterfrom
singhvishalkr:fix-640-verbose-error-log-user-function

Conversation

@singhvishalkr
Copy link
Copy Markdown

Motivation

Fixes #640.

AbstractParallelEoSStreamProcessor#runUserFunction currently logs the entire PollContextInternal on every user-function failure:

String msg = msg("Exception caught in user function running stage, registering WC as failed, returning to" +
        " mailbox. Context: {}", context, e);
if (cause instanceof PCRetriableException) {
    log.debug("Explicit ... caught, logging at DEBUG only. " + msg, e);
} else {
    log.error(msg, e);
}

Two problems with this:

  1. Truncation. PollContextInternal is @ToStringd via Lombok, which transitively prints every ConsumerRecord in the batch. With realistic batchSize settings the resulting line easily exceeds the per-line budget of common log shippers (Filebeat, Fluent Bit, journald, Docker JSON driver), so the ERROR gets cut off and the stack trace below it becomes harder to correlate.

  2. Sensitive data in error logs. The @ToString expansion serialises record keys and values, which can include user payloads — exactly the concern raised by Vinh Nguyen (@ndqvinh2109) in the thread above. Users were already working around this at log4j level by masking the log pattern.

Modifications

Split the message into two tiers:

  • ERROR (always emitted): a concise summary — partitions=[...] and records=<count> — using PollContext#getByTopicPartitionMap().keySet() and PollContext#size(). That's enough to correlate the failure to the right topic-partition without dragging record payloads into the error log.
  • DEBUG (only when log.isDebugEnabled()): the full PollContext dump, so operators troubleshooting a specific failure can still see the original context by bumping the logger to DEBUG.

The existing PCRetriableException branch (which already only logs at DEBUG) is preserved with the shorter message, since it never went to ERROR anyway.

Added a short inline comment pointing at this issue so the rationale doesn't get lost.

Verifying this change

Flow is unchanged (same retry / mailbox / throw semantics), only log output differs. A unit test that exercises runUserFunction with a throwing user function should now emit an ERROR line containing partitions= and records= and no record payload, and (if DEBUG is enabled) a second DEBUG line with the full context.

Happy to add an explicit test under parallel-consumer-core/src/test if you'd prefer to lock this in. Noting that my previous PR #918 for the similar #631 was already along these lines.

Does this pull request introduce a user-facing change?

  • Yes — log line for user-function failures is now shorter; full context moves to DEBUG.

The ERROR path in AbstractParallelEoSStreamProcessor#runUserFunction
used to log the entire PollContextInternal (which is @tostring'd via
Lombok and transitively prints every ConsumerRecord in the batch) on
every user-function failure. With reasonable batch sizes this easily
overflows the line length budget of log shippers (Filebeat, Fluent Bit,
journald) and ends up truncated. It also leaks record payloads into
error logs, which is awkward when those payloads contain sensitive
data -- users were already working around this by masking the log
pattern at log4j level.

Keep a concise ERROR line (partitions + record count) and move the
full PollContext dump to DEBUG so the signal stays useful without
polluting error logs. The existing DEBUG-only PCRetriableException
branch is preserved.

Fixes confluentinc#640
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error log to verbose in AbstractParallelEoSStreamProcessor

1 participant