
Time framed thread-pool utilization #131898


Open
wants to merge 11 commits into
base: main
from


mhl-b
Contributor

@mhl-b mhl-b commented Jul 25, 2025

This PR changes thread-pool utilization reporting intervals from dynamic (based on the caller's polling frequency) to static (based on fixed time frames).

Originally there was a single consumer of the utilization metric that ran on a fixed interval, so we calculated utilization on the fly by comparing the previous invocation time with the current one. And it worked, kind of... With growing demand for utilization metrics from shard balancing and allocation, we now have new consumers, and this poll mechanism does not scale well: the thread pool needs to create a separate tracker for each caller.
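To make the scaling problem concrete, here is a minimal sketch of poll-based utilization. The class `PolledUtilizationTracker` and its methods are hypothetical names for illustration, not the code this PR replaces. It assumes utilization is the execution time accumulated since the last poll divided by the elapsed wall time multiplied by the pool size.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration of poll-based tracking; not the actual Elasticsearch code.
final class PolledUtilizationTracker {
    private final int poolSize;
    private final LongAdder totalExecutionNanos = new LongAdder();
    private long lastPollNanos;
    private long lastTotalExecutionNanos;

    PolledUtilizationTracker(int poolSize, long startNanos) {
        this.poolSize = poolSize;
        this.lastPollNanos = startNanos;
    }

    // Called by worker threads when a task finishes.
    void recordTask(long executionNanos) {
        totalExecutionNanos.add(executionNanos);
    }

    // Utilization since the previous poll. Every call resets the baseline,
    // so the measured interval depends entirely on who polled last.
    synchronized double pollUtilization(long nowNanos) {
        long total = totalExecutionNanos.sum();
        long elapsed = nowNanos - lastPollNanos;
        long executed = total - lastTotalExecutionNanos;
        lastPollNanos = nowNanos;
        lastTotalExecutionNanos = total;
        if (elapsed <= 0) {
            return 0.0;
        }
        return (double) executed / ((double) elapsed * poolSize);
    }
}
```

In this sketch, a second consumer polling right after the first one sees an empty interval, because the first poll already moved the baseline. That is why each caller would need its own tracker instance.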

This PR introduces total execution time measurements for a thread pool per time frame. The default frame duration is 30 seconds, but it can be configured per thread pool via TaskTrackingConfig. All past time frames have accurate data, because we only update the current frame. That means the utilization metric can be stale by up to one frame duration in the worst case. It's a tradeoff between freshness and accuracy/efficiency. Utilization is not a "real-time" metric, and no immediate action is taken on a single utilization reading.
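The frame idea can be sketched roughly as follows. This is a simplified illustration, not the FramedTimeTracker from this PR: the class name `FramedUtilizationSketch`, its methods, and the injected clock are all assumptions, and it attributes a task's whole execution time to the frame in which the task finishes rather than splitting it across frame boundaries.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of frame-based tracking; names and details are assumptions.
final class FramedUtilizationSketch {
    private final LongSupplier clockNanos;
    private final long frameNanos;
    private final int poolSize;
    private long currentFrame;
    private long currentFrameBusyNanos;
    private long previousFrameBusyNanos;

    FramedUtilizationSketch(LongSupplier clockNanos, long frameNanos, int poolSize) {
        this.clockNanos = clockNanos;
        this.frameNanos = frameNanos;
        this.poolSize = poolSize;
        this.currentFrame = clockNanos.getAsLong() / frameNanos;
    }

    // Move to the frame that contains "now". If more than one frame has
    // passed, the pool was idle during the frame that just ended.
    private void rollFrameIfNeeded() {
        long frame = clockNanos.getAsLong() / frameNanos;
        if (frame == currentFrame) {
            return;
        }
        previousFrameBusyNanos = (frame == currentFrame + 1) ? currentFrameBusyNanos : 0L;
        currentFrameBusyNanos = 0L;
        currentFrame = frame;
    }

    // Simplification: the whole task time lands in the frame where the task ends.
    synchronized void recordTask(long executionNanos) {
        rollFrameIfNeeded();
        currentFrameBusyNanos += executionNanos;
    }

    // Utilization of the last completed frame. Reading does not reset any
    // state, so any number of consumers can call this independently.
    synchronized double previousFrameUtilization() {
        rollFrameIfNeeded();
        return (double) previousFrameBusyNanos / ((double) frameNanos * poolSize);
    }
}
```

With a frame of 100 ns and a pool size of 2, recording 100 ns of execution in the first frame makes every reader in the second frame see a utilization of 0.5, no matter how often or in what order they ask. The value only changes when the clock crosses into the next frame, which is the staleness bound described above.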

A new FramedTimeTracker encapsulates the frame-tracking logic and comes with its own set of tests. I also added a JMH benchmark, as advised by Nick, because of the use of synchronized methods and the risk of high contention. So far I cannot see any slowdown in the benchmark, but I would like to hear feedback on it. The numbers look odd but are consistent across multiple runs: the tracking thread pool is somehow faster than the non-tracking one at larger pool sizes.

Latest results from my machine:

Benchmark                           (poolSize)  (tasksNum)  (trackUtilization)  (utilizationIntervalMs)  Mode  Cnt     Score   Error  Units
ThreadPoolUtilizationBenchmark.run           4     1000000               false                       10  avgt        108.888          ms/op
ThreadPoolUtilizationBenchmark.run           4     1000000                true                       10  avgt        420.603          ms/op
ThreadPoolUtilizationBenchmark.run           8     1000000               false                       10  avgt        780.527          ms/op
ThreadPoolUtilizationBenchmark.run           8     1000000                true                       10  avgt        478.278          ms/op
ThreadPoolUtilizationBenchmark.run          16     1000000               false                       10  avgt       1123.840          ms/op
ThreadPoolUtilizationBenchmark.run          16     1000000                true                       10  avgt        506.619          ms/op

@mhl-b mhl-b force-pushed the framed-thread-pool-utilization branch from 23fe464 to 8345da4 Compare July 26, 2025 05:35
@mhl-b mhl-b changed the title framed-time-tracker Time framed thread-pool utilization Jul 26, 2025
@mhl-b mhl-b marked this pull request as ready for review July 26, 2025 05:59
@mhl-b mhl-b requested a review from a team as a code owner July 26, 2025 05:59
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jul 26, 2025
@mhl-b mhl-b added >enhancement :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Coordination Meta label for Distributed Coordination team and removed needs:triage Requires assignment of a team area label labels Jul 26, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @mhl-b, I've created a changelog YAML for you.

Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0

2 participants