Added Prometheus client to get model server metrics #64
base: main
Conversation
…ameters, modified config
…f into prometheus-metrics-client
…ent in config, removed MockMetricsClient from the path
…f into prometheus-metrics-client
Welcome @aish1331!
…f into prometheus-metrics-client
/assign @SachinVarghese
Looks great! Minor changes requested
metrics_client: MetricsClient | None = None
if config.metrics_client:
    if config.metrics_client.type == MetricsClientType.PROMETHEUS:
        metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
This will simplify the client code
Suggested change:
- metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client)
+ metrics_client = PrometheusMetricsClient(metrics_client_config=config.metrics_client.prometheus)
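For context, a minimal sketch of why the narrower config simplifies the client. The field names on the Prometheus sub-config (url, scrape_interval) are assumptions for illustration, not the PR's actual schema:

from dataclasses import dataclass

@dataclass
class PrometheusClientConfig:
    # hypothetical fields; the PR's actual schema may differ
    url: str = "http://localhost:9090"
    scrape_interval: float = 15.0

class PrometheusMetricsClient:
    def __init__(self, metrics_client_config: PrometheusClientConfig):
        # With only the sub-config passed in, the client never has to
        # re-check metrics_client.type or unwrap the outer config object.
        self.base_url = metrics_client_config.url
        self.scrape_interval = metrics_client_config.scrape_interval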
# assumption: all metrics clients have metrics exported in Prometheus format
raise NotImplementedError

@abstractmethod
methods repeated?
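If the duplication is real, the intended shape is each abstract method declared exactly once on the base class. A sketch, with the method name taken from the PR's usage but the signature assumed:

from abc import ABC, abstractmethod

class MetricsClient(ABC):
    @abstractmethod
    def collect_model_server_metrics(self, runtime_parameters):
        """Fetch and aggregate model server metrics for the benchmark window.

        Declared once here; concrete clients such as PrometheusMetricsClient
        override it rather than re-declaring it.
        """
        ...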
        time_per_request=3,
    )
)
print("Processing request - " + str(payload.data))
we can keep the sleep statement here to mock the processing
And let's retain the behaviour of collecting request metrics in this client.
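A rough sketch of the mock client with both suggestions applied, keeping the sleep and the per-request metric collection. The class and type names are assumptions that mirror the fields used by the report code; the token counts are illustrative:

import time
from dataclasses import dataclass

@dataclass
class RequestMetric:  # assumed name; fields mirror the report code's usage
    prompt_tokens: int
    output_tokens: int
    time_per_request: float

class MockModelServerClient:  # assumed name for the PR's mock client
    def __init__(self):
        self.request_metrics: list[RequestMetric] = []

    def process_request(self, payload):
        time.sleep(3)  # keep the sleep to mock processing time
        # retain the behaviour of collecting request metrics
        self.request_metrics.append(
            RequestMetric(prompt_tokens=10, output_tokens=20, time_per_request=3)
        )
        print("Processing request - " + str(payload.data))

    def get_request_metrics(self) -> list[RequestMetric]:
        return self.request_metrics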
@@ -11,3 +11,8 @@ tokenizer:
   pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
 data:
   type: shareGPT
+# metrics_client:
Let's uncomment these and let the test runner fall back in case the Prometheus server is unreachable.
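What the uncommented block could look like. Only the type field and the .prometheus sub-config are visible in this PR, so the sub-keys are assumptions; the runner would catch connection errors and fall back to the benchmark's own request metrics:

metrics_client:
  type: prometheus
  prometheus:
    url: http://localhost:9090  # assumed key; point at the Prometheus server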
request_metrics = runtime_parameters.model_server_client.get_request_metrics()
if len(request_metrics) > 0:
    total_prompt_tokens = sum([x.prompt_tokens for x in request_metrics])
    total_output_tokens = sum([x.output_tokens for x in request_metrics])
    if runtime_parameters.duration > 0:
        prompt_tokens_per_second = total_prompt_tokens / runtime_parameters.duration
        output_tokens_per_second = total_output_tokens / runtime_parameters.duration
        requests_per_second = len(request_metrics) / runtime_parameters.duration
    else:
        prompt_tokens_per_second = 0.0
        output_tokens_per_second = 0.0
        requests_per_second = 0.0

    request_summary = ModelServerMetrics(
        total_requests=len(request_metrics),
        requests_per_second=requests_per_second,
        prompt_tokens_per_second=prompt_tokens_per_second,
        output_tokens_per_second=output_tokens_per_second,
        avg_prompt_tokens=int(statistics.mean([x.prompt_tokens for x in request_metrics])),
        avg_output_tokens=int(statistics.mean([x.output_tokens for x in request_metrics])),
        avg_request_latency=statistics.mean([x.time_per_request for x in request_metrics]),
        median_request_latency=statistics.median([x.time_per_request for x in request_metrics]),
        p90_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=10)[8],
        p99_request_latency=statistics.quantiles([x.time_per_request for x in request_metrics], n=100)[98],
        avg_time_to_first_token=0.0,
        median_time_to_first_token=0.0,
        p90_time_to_first_token=0.0,
        p99_time_to_first_token=0.0,
        median_time_per_output_token=0.0,
        p90_time_per_output_token=0.0,
        p99_time_per_output_token=0.0,
        avg_time_per_output_token=0.0,
        avg_queue_length=0,
    )
    print("-" * 50)
    print("Request Summary")
    print("-" * 50)
    for field_name, value in request_summary:
        print(f"{field_name}: {value}")
else:
    print("Report generation failed - no request metrics collected")

if self.metrics_client is not None:
    print("-" * 50)
    print("Metrics Client Summary")
    print("-" * 50)
    metric_client_summary = self.metrics_client.collect_model_server_metrics(runtime_parameters)
    if metric_client_summary is not None:
        for field_name, value in metric_client_summary:
            print(f"{field_name}: {value}")
    else:
        print("Report generation failed - no metrics client summary collected")
Can we move the calculation-relevant parts of the code to the base class so they can be reused by various report types?
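A sketch of that refactor: lift the aggregation into a base-class helper that any report type can call. The class and method names are illustrative, and a plain dict stands in for ModelServerMetrics to keep the sketch self-contained:

import statistics

class BaseReport:
    @staticmethod
    def summarize_request_metrics(request_metrics, duration):
        # Aggregate per-request metrics into the summary fields used above;
        # statistics.quantiles needs at least two samples.
        latencies = [x.time_per_request for x in request_metrics]
        if duration > 0:
            rps = len(request_metrics) / duration
            prompt_tps = sum(x.prompt_tokens for x in request_metrics) / duration
            output_tps = sum(x.output_tokens for x in request_metrics) / duration
        else:
            rps = prompt_tps = output_tps = 0.0
        return {
            "total_requests": len(request_metrics),
            "requests_per_second": rps,
            "prompt_tokens_per_second": prompt_tps,
            "output_tokens_per_second": output_tps,
            "avg_request_latency": statistics.mean(latencies),
            "median_request_latency": statistics.median(latencies),
            "p90_request_latency": statistics.quantiles(latencies, n=10)[8],
            "p99_request_latency": statistics.quantiles(latencies, n=100)[98],
        }

Each report type could then format the same summary however it likes without duplicating the math.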
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aish1331. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Issue #17
Description
Added a Prometheus client that connects to a Prometheus instance to fetch metrics for the benchmark. This assumes the Prometheus instance is already scraping the model server's metrics.
Engine tested: vLLM
Metrics fetched from vLLM server (a query sketch follows the list):
Queue Size
TTFT (time to first token)
TPOT (time per output token)
Input Token Count
Output Token Count
Successful Request Count
Request Latency
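For illustration, a hedged sketch of pulling one of these metrics through Prometheus's HTTP query API. The metric name matches vLLM's exported vllm:time_to_first_token_seconds histogram, while the server URL and the 5m window are assumptions; the PR's actual queries may differ:

import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed address

def query_prometheus(promql: str):
    # Prometheus instant-query endpoint; returns a list of sample results.
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Mean TTFT over the last 5 minutes: rate of histogram sum over rate of count.
ttft = query_prometheus(
    "rate(vllm:time_to_first_token_seconds_sum[5m])"
    " / rate(vllm:time_to_first_token_seconds_count[5m])"
)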
If metrics_client is not passed in config.yml, report generation falls back to the benchmark's own request metrics.
TODO
Fetch KV cache usage metric from vLLM server via Prometheus.
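This TODO could plausibly reuse the query_prometheus helper sketched above; vllm:gpu_cache_usage_perc is vLLM's KV cache usage gauge (a fraction between 0 and 1), though the averaging window is again an assumption:

# Sketch only: average KV cache usage over the last 5 minutes.
kv_cache_usage = query_prometheus("avg_over_time(vllm:gpu_cache_usage_perc[5m])")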
Following are the logs with the report generated for a 2-stage benchmark run.
As of this PR, metrics for the complete run are reported, i.e. metrics for all the stages are aggregated.