Added Prometheus client to get model server metrics #64

Open · wants to merge 13 commits into main

Conversation


@aish1331 commented on May 1, 2025

Issue: #17

Description

Added a Prometheus client that connects to a Prometheus instance to fetch model server metrics for the benchmark. This assumes Prometheus is already scraping the model server's metrics.

Engine tested: vLLM

Metrics fetched from vLLM server:

Queue Size
TTFT
TPOT
Input Token Count
Output Token Count
Successful Request Count
Request Latency

If metrics_client is not set in config.yml, report generation falls back to the benchmark's own request metrics.
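
For context, a minimal sketch of how a Prometheus-backed client can derive throughput numbers like the ones in the report below from the server's counters over the benchmark window. The URL and the vllm:* metric names here are assumptions, not necessarily what this PR uses:

# Minimal sketch of a Prometheus-backed metrics lookup (not the PR's actual client).
# Assumes Prometheus at PROMETHEUS_URL already scrapes the vLLM server; the
# vllm:* metric names are the usual ones but should be verified against /metrics.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption


def query_prometheus(promql: str) -> float:
    """Run an instant PromQL query and return the first sample value (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def throughput_summary(duration_s: int) -> dict[str, float]:
    """Average rates over the benchmark window (prompt/output tokens, requests, queue length)."""
    window = f"{duration_s}s"
    return {
        "prompt_tokens_per_second": query_prometheus(f"rate(vllm:prompt_tokens_total[{window}])"),
        "output_tokens_per_second": query_prometheus(f"rate(vllm:generation_tokens_total[{window}])"),
        "requests_per_second": query_prometheus(f"rate(vllm:request_success_total[{window}])"),
        "avg_queue_length": query_prometheus(f"avg_over_time(vllm:num_requests_waiting[{window}])"),
    }

The latency metrics (TTFT, TPOT, request latency) would typically come from the corresponding histogram series, e.g. via histogram_quantile() over the _bucket metrics, but that is omitted from this sketch.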

TODO
Fetch KV cache usage metric from vLLM server via Prometheus.
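
Not part of this PR, but for reference, the TODO could likely be served by a query along these lines. The gauge name vllm:gpu_cache_usage_perc is my assumption of vLLM's KV-cache usage metric and should be checked against the server's /metrics output:

# Hypothetical follow-up for the TODO: average KV-cache usage over the run window.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # assumed Prometheus address
    params={"query": "avg_over_time(vllm:gpu_cache_usage_perc[120s])"},
    timeout=10,
)
result = resp.json()["data"]["result"]
kv_cache_usage = float(result[0]["value"][1]) if result else 0.0
print(f"avg_kv_cache_usage: {kv_cache_usage:.2%}")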

Below are the logs, along with the report generated, for a 2-stage benchmark run:

As of this PR, metrics are reported for the complete run, i.e. metrics across all stages are aggregated.

Using configuration from: config.yml
Stage 0 - run started
Stage 0 - run completed
Stage 1 - run started
Stage 1 - run completed
Waiting for 20 seconds for Prometheus to collect metrics...


Generating Report ..
--------------------------------------------------
Request Summary
--------------------------------------------------
prompt_tokens_per_second: 85.52936651769419
output_tokens_per_second: 8.609513454470223
requests_per_second: 0.46737358752838354
avg_request_latency: 0.0852452470969997
median_request_latency: 0.12763349153101444
p90_request_latency: 0.13852914329618216
p99_request_latency: 0.14108252413570882
avg_time_to_first_token: 0.0
median_time_to_first_token: 0.0
p90_time_to_first_token: 0.0
p99_time_to_first_token: 0.0
avg_time_per_output_token: 0.0
median_time_per_output_token: 0.0
p90_time_per_output_token: 0.0
p99_time_per_output_token: 0.0
total_requests: 19
avg_prompt_tokens: 183
avg_output_tokens: 18
avg_queue_length: 0
--------------------------------------------------
Metrics Client Summary
--------------------------------------------------
prompt_tokens_per_second: 94.227051529898
output_tokens_per_second: 9.866008932737817
requests_per_second: 0.49996666888874075
avg_request_latency: 0.08623366355895996
median_request_latency: 0.15
p90_request_latency: 0.27
p99_request_latency: 0.297
avg_time_to_first_token: 0.006058311462402343
median_time_to_first_token: 0.007321428571428572
p90_time_to_first_token: 0.009464285714285715
p99_time_to_first_token: 0.009946428571428571
avg_time_per_output_token: 0.004279663193029653
median_time_per_output_token: 0.005
p90_time_per_output_token: 0.009
p99_time_per_output_token: 0.0099
total_requests: 20
avg_prompt_tokens: 115
avg_output_tokens: 11
avg_queue_length: 0

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) labels on May 1, 2025.
@k8s-ci-robot
Contributor

Welcome @aish1331!

It looks like this is your first PR to kubernetes-sigs/inference-perf 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/inference-perf has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) on May 1, 2025.
@k8s-ci-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 1, 2025.
@achandrasekar
Contributor

/assign @SachinVarghese


@SachinVarghese left a comment


Looks great! Minor changes requested

metrics_client: MetricsClient | None = None
if config.metrics_client:
    if config.metrics_client.type == MetricsClientType.PROMETHEUS:
        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)

This will simplify the client code

Suggested change:
-        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
+        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client.prometheus)
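
To illustrate why passing the nested config simplifies the client, one possible shape for the config models. The class and field names below (other than MetricsClientType) are hypothetical, not the PR's actual definitions:

# Hypothetical config shapes: with a nested prometheus section, the Prometheus
# client only needs its own sub-config (URL, scrape interval), not the whole
# metrics_client block.
from dataclasses import dataclass
from enum import Enum


class MetricsClientType(str, Enum):
    PROMETHEUS = "prometheus"


@dataclass
class PrometheusClientConfig:
    url: str = "http://localhost:9090"
    scrape_interval: int = 15  # seconds


@dataclass
class MetricsClientConfig:
    type: MetricsClientType = MetricsClientType.PROMETHEUS
    prometheus: PrometheusClientConfig | None = None

The runner can then pass config.metrics_client.prometheus straight through, keeping Prometheus-specific settings out of the generic client-selection code.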

        # assumption: all metrics clients have metrics exported in Prometheus format
        raise NotImplementedError

    @abstractmethod

methods repeated?

                time_per_request=3,
            )
        )
        print("Processing request - " + str(payload.data))

we can keep the sleep statement here to mock the processing


And let's retain the behaviour of collecting request metrics in this client.
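
Putting both comments together, a rough sketch of what the mock client's request path might keep doing (all names here are hypothetical stand-ins, not the project's actual classes):

import time
from dataclasses import dataclass, field


@dataclass
class RequestMetric:  # hypothetical stand-in for the project's per-request metric type
    prompt_tokens: int
    output_tokens: int
    time_per_request: float


@dataclass
class MockModelServerClient:  # hypothetical mock client
    request_metrics: list[RequestMetric] = field(default_factory=list)

    def process_request(self, prompt: str) -> None:
        time.sleep(3)  # keep the sleep to mock real processing time
        print("Processing request - " + prompt)
        # retain request-metric collection so the Request Summary still works
        self.request_metrics.append(
            RequestMetric(prompt_tokens=len(prompt.split()), output_tokens=16, time_per_request=3)
        )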

@@ -11,3 +11,8 @@ tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
# metrics_client:

Let's uncomment these and let the test runner fall back in case the Prometheus server is unreachable.
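
A sketch of what that fallback could look like around the metrics-client call (hypothetical; the PR's actual error handling may differ):

import logging

logger = logging.getLogger(__name__)


def collect_with_fallback(metrics_client, runtime_parameters):
    """Return the metrics-client summary, or None so the report falls back to request metrics."""
    if metrics_client is None:
        return None
    try:
        return metrics_client.collect_model_server_metrics(runtime_parameters)
    except Exception as exc:  # e.g. connection errors when Prometheus is unreachable
        logger.warning("Metrics client unavailable, falling back to request metrics: %s", exc)
        return None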

Comment on lines +30 to +81
request_metrics = runtime_parameters.model_server_client.get_request_metrics()
if len(request_metrics) > 0:
    total_prompt_tokens = sum([x.prompt_tokens for x in request_metrics])
    total_output_tokens = sum([x.output_tokens for x in request_metrics])
    if runtime_parameters.duration > 0:
        prompt_tokens_per_second = total_prompt_tokens / runtime_parameters.duration
        output_tokens_per_second = total_output_tokens / runtime_parameters.duration
        requests_per_second = len(request_metrics) / runtime_parameters.duration
    else:
        prompt_tokens_per_second = 0.0
        output_tokens_per_second = 0.0
        requests_per_second = 0.0

    request_summary = ModelServerMetrics(
        total_requests=len(request_metrics),
        requests_per_second=requests_per_second,
        prompt_tokens_per_second=prompt_tokens_per_second,
        output_tokens_per_second=output_tokens_per_second,
        avg_prompt_tokens=int(statistics.mean([x.prompt_tokens for x in request_metrics])),
        avg_output_tokens=int(statistics.mean([x.output_tokens for x in request_metrics])),
        avg_request_latency=statistics.mean([x.time_per_request for x in request_metrics]),
        median_request_latency=statistics.median([x.time_per_request for x in request_metrics]),
        p90_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=10)[8],
        p99_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=100)[98],
        avg_time_to_first_token=0.0,
        median_time_to_first_token=0.0,
        p90_time_to_first_token=0.0,
        p99_time_to_first_token=0.0,
        median_time_per_output_token=0.0,
        p90_time_per_output_token=0.0,
        p99_time_per_output_token=0.0,
        avg_time_per_output_token=0.0,
        avg_queue_length=0,
    )
-    for field_name, value in summary:
    print("-" * 50)
    print("Request Summary")
    print("-" * 50)
    for field_name, value in request_summary:
        print(f"{field_name}: {value}")
else:
-    print("Report generation failed - no metrics collected")
    print("Report generation failed - no request metrics collected")

if self.metrics_client is not None:
    print("-" * 50)
    print("Metrics Client Summary")
    print("-" * 50)
    metric_client_summary = self.metrics_client.collect_model_server_metrics(runtime_parameters)
    if metric_client_summary is not None:
        for field_name, value in metric_client_summary:
            print(f"{field_name}: {value}")
    else:
        print("Report generation failed - no metrics client summary collected")

Can we move the calculation relevant parts of the code here to the base class so that it can be reused by various report types?
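
For example, the aggregation could sit in a small base-class helper that any report type calls; a rough sketch, reusing the same statistics calls as the PR (the helper name and dict return type are hypothetical):

# Hypothetical base-class helper: turn raw per-request metrics into the summary
# fields used by the reports, so every report type can reuse it.
import statistics


def summarize_request_metrics(request_metrics, duration: float) -> dict:
    latencies = [m.time_per_request for m in request_metrics]
    prompt_tokens = [m.prompt_tokens for m in request_metrics]
    output_tokens = [m.output_tokens for m in request_metrics]
    return {
        "total_requests": len(request_metrics),
        "requests_per_second": len(request_metrics) / duration if duration > 0 else 0.0,
        "prompt_tokens_per_second": sum(prompt_tokens) / duration if duration > 0 else 0.0,
        "output_tokens_per_second": sum(output_tokens) / duration if duration > 0 else 0.0,
        "avg_prompt_tokens": int(statistics.mean(prompt_tokens)),
        "avg_output_tokens": int(statistics.mean(output_tokens)),
        "avg_request_latency": statistics.mean(latencies),
        "median_request_latency": statistics.median(latencies),
        "p90_request_latency": statistics.quantiles(latencies, n=10)[8],
        "p99_request_latency": statistics.quantiles(latencies, n=100)[98],
    }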

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aish1331
Once this PR has been reviewed and has the lgtm label, please ask for approval from sachinvarghese. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
