ZOOKEEPER-4952: Reduce the GC overhead of Prometheus reject exception handling #2279
Conversation
public void rejectedExecution(final Runnable r, final ThreadPoolExecutor e) {
    if (!maxQueueSizeExceeded) {
        maxQueueSizeExceeded = true;
        LOG.warn("Prometheus metrics queue size exceeded the max");
Why drop the rate limit logger?
I would expect rejections to be periodic events, not a one-time event.
The rate limit logger was dropped based on the following two considerations:
- To avoid the unnecessary overhead of logging the same warn message repeatedly, even when it's rate limited.
- It's important to know whether rejection happens, but I don't see much benefit in knowing when the rejection stops and restarts.
I am ok with either way, so let me know if there is a strong preference to have it back. @kezhuw
> It's important to know whether rejection happens, but I don't see much benefit in knowing when the rejection stops and restarts.

My concern is that people could miss the rejection if we log it only once, as it could be too far from the inspection time. ZooKeeper servers in production are supposed to run for a long time, and logs may be truncated or archived periodically. If we log it only once at the first occurrence, it would be hard or even impossible to notice that the metrics backend is overloaded now.

> To avoid the unnecessary overhead of logging the same warn message repeatedly, even when it's rate limited.

Is it possible to optimize this, given that the log string is a constant?

I personally think the rejection path deserves a rate-limited log, as rejections could happen briefly and periodically due to system load.
Yeah, let me just add the rate limit logger.
- Enhanced `RateLogger` to reduce string concatenation overhead.
- Added the rate logger back for visibility when the max queue size is exceeded (a rough sketch of the idea follows below).
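For context, here is a minimal sketch of the idea being discussed: a rejection callback that logs a constant message at most once per interval. The class, field, and method names are illustrative assumptions, not the actual ZooKeeper `RateLogger` implementation.

```java
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch only: repeated rejections neither flood the log nor build
// new strings on every call.
class RateLimitedRejectionLogger {
    private static final Logger LOG = LoggerFactory.getLogger(RateLimitedRejectionLogger.class);

    // Constant message: no per-rejection string concatenation.
    private static final String MSG = "Prometheus metrics queue size exceeded the max";

    private final long intervalMillis;
    private final AtomicLong lastLoggedAt = new AtomicLong();

    RateLimitedRejectionLogger(long interval, TimeUnit unit) {
        this.intervalMillis = unit.toMillis(interval);
    }

    void onRejected(Runnable task, ThreadPoolExecutor executor) {
        long now = System.currentTimeMillis();
        long last = lastLoggedAt.get();
        // Warn again only once the rate-limit window has elapsed since the last warning.
        if (now - last >= intervalMillis && lastLoggedAt.compareAndSet(last, now)) {
            LOG.warn(MSG);
        }
        // The rejected task is dropped; no RejectedExecutionException is created or thrown.
    }
}
```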
Is it possible to add the perf code to tests so others can evaluate it? I saw three in our codebase.
Any test that performs some write operation can be used if we set the max queue size to a very small number (e.g. 1); however, the challenge is how to measure and compare the perf improvement. This is not something that can be simply done by printing out some duration like the sample tests do. I would be very happy to provide any clarification on the perf improvement, but personally I don't think it's possible to write a meaningful perf test for it in the current code base, and I don't feel it's really necessary either.
I am ok with this, please go ahead!
(Force-pushed from 1d59100 to f5d0ea5, then from f5d0ea5 to be0ca7b.)
Hi @kezhuw, the rate logger has been added back, with `RateLogger` optimized to reduce string concatenation. Would you mind taking a look at it? Thanks.
LGTM
Thanks @kezhuw, I will go ahead and merge it.
@kezhuw Do you know if we need to open a separate JIRA and PR if we want to include this in the upcoming 3.10.0 release?
Hi @li4wang, master will be 3.10.0.
Implemented `PrometheusRejectedExecutionHandler` and overrode the `rejectedExecution()` API to fix the large memory footprint and long GC pauses caused by RejectedExecutionException handling. No RejectedExecutionException is thrown anymore. This enhancement increased read throughput by 140% according to our perf comparison test results.
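To make the description concrete, a rough sketch of the overall approach follows. The handler and factory names here are hypothetical; only the `PrometheusRejectedExecutionHandler` class name and the `rejectedExecution()` method come from the PR itself, and the real implementation (including the rate-limited logging discussed above) lives in the ZooKeeper code base.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: unlike the default AbortPolicy, which allocates and throws a
// RejectedExecutionException (object plus stack trace) for every rejection, this handler
// drops the task and logs instead, so bursts of rejections generate no extra garbage.
class DroppingRejectionHandler implements RejectedExecutionHandler {
    private static final Logger LOG = LoggerFactory.getLogger(DroppingRejectionHandler.class);

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
        // No exception is constructed here; a rate-limited logger could be used instead
        // of a plain warn, as discussed earlier in this conversation.
        LOG.warn("Prometheus metrics queue size exceeded the max");
    }
}

class MetricsExecutorFactory {
    // Hypothetical factory showing where such a handler would be installed on the
    // executor that processes queued metric samples.
    static ThreadPoolExecutor newMetricsExecutor(int workerThreads, int maxQueueSize) {
        return new ThreadPoolExecutor(
                workerThreads, workerThreads,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(maxQueueSize),
                new DroppingRejectionHandler());
    }
}
```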