Added Prometheus client to get model server metrics #64

Open · wants to merge 13 commits into main

Conversation


@aish1331 commented on May 1, 2025

Issue: #17

Description

Added a Prometheus client that connects to a Prometheus instance to fetch model server metrics for the benchmark. This assumes Prometheus is already scraping the model server's metrics.

Engine tested: vLLM

Metrics fetched from vLLM server:

Queue Size
TTFT
TPOT
Input Token Count
Output Token Count
Successful Request Count
Request Latency

If metrics_client is not set in config.yml, report generation falls back to the benchmark's own request metrics.
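
For context, a minimal sketch of how a Prometheus-backed client can derive throughput numbers like the ones in the report below from the server's counters over the benchmark window. The URL and the vllm:* metric names here are assumptions, not necessarily what this PR uses:

# Minimal sketch of a Prometheus-backed metrics lookup (not the PR's actual client).
# Assumes Prometheus at PROMETHEUS_URL already scrapes the vLLM server; the
# vllm:* metric names are the usual ones but should be verified against /metrics.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumption


def query_prometheus(promql: str) -> float:
    """Run an instant PromQL query and return the first sample value (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def throughput_summary(duration_s: int) -> dict[str, float]:
    """Average rates over the benchmark window (prompt/output tokens, requests, queue length)."""
    window = f"{duration_s}s"
    return {
        "prompt_tokens_per_second": query_prometheus(f"rate(vllm:prompt_tokens_total[{window}])"),
        "output_tokens_per_second": query_prometheus(f"rate(vllm:generation_tokens_total[{window}])"),
        "requests_per_second": query_prometheus(f"rate(vllm:request_success_total[{window}])"),
        "avg_queue_length": query_prometheus(f"avg_over_time(vllm:num_requests_waiting[{window}])"),
    }

The latency metrics (TTFT, TPOT, request latency) would typically come from the corresponding histogram series, e.g. via histogram_quantile() over the _bucket metrics, but that is omitted from this sketch.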

TODO
Fetch KV cache usage metric from vLLM server via Prometheus.
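
Not part of this PR, but for reference, the TODO could likely be served by a query along these lines. The gauge name vllm:gpu_cache_usage_perc is my assumption of vLLM's KV-cache usage metric and should be checked against the server's /metrics output:

# Hypothetical follow-up for the TODO: average KV-cache usage over the run window.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # assumed Prometheus address
    params={"query": "avg_over_time(vllm:gpu_cache_usage_perc[120s])"},
    timeout=10,
)
result = resp.json()["data"]["result"]
kv_cache_usage = float(result[0]["value"][1]) if result else 0.0
print(f"avg_kv_cache_usage: {kv_cache_usage:.2%}")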

Below are the logs, along with the report generated, for a 2-stage benchmark run:

As of this PR, metrics are reported for the complete run, i.e. metrics across all stages are aggregated.

Using configuration from: config.yml
Stage 0 - run started
Stage 0 - run completed
Stage 1 - run started
Stage 1 - run completed
Waiting for 20 seconds for Prometheus to collect metrics...


Generating Report ..
--------------------------------------------------
Request Summary
--------------------------------------------------
prompt_tokens_per_second: 85.52936651769419
output_tokens_per_second: 8.609513454470223
requests_per_second: 0.46737358752838354
avg_request_latency: 0.0852452470969997
median_request_latency: 0.12763349153101444
p90_request_latency: 0.13852914329618216
p99_request_latency: 0.14108252413570882
avg_time_to_first_token: 0.0
median_time_to_first_token: 0.0
p90_time_to_first_token: 0.0
p99_time_to_first_token: 0.0
avg_time_per_output_token: 0.0
median_time_per_output_token: 0.0
p90_time_per_output_token: 0.0
p99_time_per_output_token: 0.0
total_requests: 19
avg_prompt_tokens: 183
avg_output_tokens: 18
avg_queue_length: 0
--------------------------------------------------
Metrics Client Summary
--------------------------------------------------
prompt_tokens_per_second: 94.227051529898
output_tokens_per_second: 9.866008932737817
requests_per_second: 0.49996666888874075
avg_request_latency: 0.08623366355895996
median_request_latency: 0.15
p90_request_latency: 0.27
p99_request_latency: 0.297
avg_time_to_first_token: 0.006058311462402343
median_time_to_first_token: 0.007321428571428572
p90_time_to_first_token: 0.009464285714285715
p99_time_to_first_token: 0.009946428571428571
avg_time_per_output_token: 0.004279663193029653
median_time_per_output_token: 0.005
p90_time_per_output_token: 0.009
p99_time_per_output_token: 0.0099
total_requests: 20
avg_prompt_tokens: 115
avg_output_tokens: 11
avg_queue_length: 0

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) labels on May 1, 2025.
@k8s-ci-robot
Contributor

Welcome @aish1331!

It looks like this is your first PR to kubernetes-sigs/inference-perf 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/inference-perf has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) on May 1, 2025.
@k8s-ci-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 1, 2025.
@achandrasekar
Contributor

/assign @SachinVarghese


@SachinVarghese left a comment


Looks great! Minor changes requested

metrics_client: MetricsClient | None = None
if config.metrics_client:
    if config.metrics_client.type == MetricsClientType.PROMETHEUS:
        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)

This will simplify the client code

Suggested change:
-        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
+        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client.prometheus)
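
To illustrate why passing the nested config simplifies the client, one possible shape for the config models. The class and field names below (other than MetricsClientType) are hypothetical, not the PR's actual definitions:

# Hypothetical config shapes: with a nested prometheus section, the Prometheus
# client only needs its own sub-config (URL, scrape interval), not the whole
# metrics_client block.
from dataclasses import dataclass
from enum import Enum


class MetricsClientType(str, Enum):
    PROMETHEUS = "prometheus"


@dataclass
class PrometheusClientConfig:
    url: str = "http://localhost:9090"
    scrape_interval: int = 15  # seconds


@dataclass
class MetricsClientConfig:
    type: MetricsClientType = MetricsClientType.PROMETHEUS
    prometheus: PrometheusClientConfig | None = None

The runner can then pass config.metrics_client.prometheus straight through, keeping Prometheus-specific settings out of the generic client-selection code.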

        # assumption: all metrics clients have metrics exported in Prometheus format
        raise NotImplementedError

    @abstractmethod

methods repeated?

                time_per_request=3,
            )
        )
        print("Processing request - " + str(payload.data))

we can keep the sleep statement here to mock the processing


And let's retain the behaviour of collecting request metrics in this client.
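
Putting both comments together, a rough sketch of what the mock client's request path might keep doing (all names here are hypothetical stand-ins, not the project's actual classes):

import time
from dataclasses import dataclass, field


@dataclass
class RequestMetric:  # hypothetical stand-in for the project's per-request metric type
    prompt_tokens: int
    output_tokens: int
    time_per_request: float


@dataclass
class MockModelServerClient:  # hypothetical mock client
    request_metrics: list[RequestMetric] = field(default_factory=list)

    def process_request(self, prompt: str) -> None:
        time.sleep(3)  # keep the sleep to mock real processing time
        print("Processing request - " + prompt)
        # retain request-metric collection so the Request Summary still works
        self.request_metrics.append(
            RequestMetric(prompt_tokens=len(prompt.split()), output_tokens=16, time_per_request=3)
        )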

@@ -11,3 +11,8 @@ tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
data:
  type: shareGPT
# metrics_client:

Let's uncomment these and let the test runner fall back in case the Prometheus server is unreachable.
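
A sketch of what that fallback could look like around the metrics-client call (hypothetical; the PR's actual error handling may differ):

import logging

logger = logging.getLogger(__name__)


def collect_with_fallback(metrics_client, runtime_parameters):
    """Return the metrics-client summary, or None so the report falls back to request metrics."""
    if metrics_client is None:
        return None
    try:
        return metrics_client.collect_model_server_metrics(runtime_parameters)
    except Exception as exc:  # e.g. connection errors when Prometheus is unreachable
        logger.warning("Metrics client unavailable, falling back to request metrics: %s", exc)
        return None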

Comment on lines +30 to +81
request_metrics = runtime_parameters.model_server_client.get_request_metrics()
if len(request_metrics) > 0:
    total_prompt_tokens = sum([x.prompt_tokens for x in request_metrics])
    total_output_tokens = sum([x.output_tokens for x in request_metrics])
    if runtime_parameters.duration > 0:
        prompt_tokens_per_second = total_prompt_tokens / runtime_parameters.duration
        output_tokens_per_second = total_output_tokens / runtime_parameters.duration
        requests_per_second = len(request_metrics) / runtime_parameters.duration
    else:
        prompt_tokens_per_second = 0.0
        output_tokens_per_second = 0.0
        requests_per_second = 0.0

    request_summary = ModelServerMetrics(
        total_requests=len(request_metrics),
        requests_per_second=requests_per_second,
        prompt_tokens_per_second=prompt_tokens_per_second,
        output_tokens_per_second=output_tokens_per_second,
        avg_prompt_tokens=int(statistics.mean([x.prompt_tokens for x in request_metrics])),
        avg_output_tokens=int(statistics.mean([x.output_tokens for x in request_metrics])),
        avg_request_latency=statistics.mean([x.time_per_request for x in request_metrics]),
        median_request_latency=statistics.median([x.time_per_request for x in request_metrics]),
        p90_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=10)[8],
        p99_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=100)[98],
        avg_time_to_first_token=0.0,
        median_time_to_first_token=0.0,
        p90_time_to_first_token=0.0,
        p99_time_to_first_token=0.0,
        median_time_per_output_token=0.0,
        p90_time_per_output_token=0.0,
        p99_time_per_output_token=0.0,
        avg_time_per_output_token=0.0,
        avg_queue_length=0,
    )
-    for field_name, value in summary:
    print("-" * 50)
    print("Request Summary")
    print("-" * 50)
    for field_name, value in request_summary:
        print(f"{field_name}: {value}")
else:
-    print("Report generation failed - no metrics collected")
    print("Report generation failed - no request metrics collected")

if self.metrics_client is not None:
    print("-" * 50)
    print("Metrics Client Summary")
    print("-" * 50)
    metric_client_summary = self.metrics_client.collect_model_server_metrics(runtime_parameters)
    if metric_client_summary is not None:
        for field_name, value in metric_client_summary:
            print(f"{field_name}: {value}")
    else:
        print("Report generation failed - no metrics client summary collected")

Can we move the calculation relevant parts of the code here to the base class so that it can be reused by various report types?
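
For example, the aggregation could sit in a small base-class helper that any report type calls; a rough sketch, reusing the same statistics calls as the PR (the helper name and dict return type are hypothetical):

# Hypothetical base-class helper: turn raw per-request metrics into the summary
# fields used by the reports, so every report type can reuse it.
import statistics


def summarize_request_metrics(request_metrics, duration: float) -> dict:
    latencies = [m.time_per_request for m in request_metrics]
    prompt_tokens = [m.prompt_tokens for m in request_metrics]
    output_tokens = [m.output_tokens for m in request_metrics]
    return {
        "total_requests": len(request_metrics),
        "requests_per_second": len(request_metrics) / duration if duration > 0 else 0.0,
        "prompt_tokens_per_second": sum(prompt_tokens) / duration if duration > 0 else 0.0,
        "output_tokens_per_second": sum(output_tokens) / duration if duration > 0 else 0.0,
        "avg_prompt_tokens": int(statistics.mean(prompt_tokens)),
        "avg_output_tokens": int(statistics.mean(output_tokens)),
        "avg_request_latency": statistics.mean(latencies),
        "median_request_latency": statistics.median(latencies),
        "p90_request_latency": statistics.quantiles(latencies, n=10)[8],
        "p99_request_latency": statistics.quantiles(latencies, n=100)[98],
    }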

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aish1331
Once this PR has been reviewed and has the lgtm label, please ask for approval from sachinvarghese. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
