[benchmark] add max-concurrency in result table #21095

panpan0000 · 2025-07-17T05:33:06Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
[] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

test CLI

(venv) root@vllm-lmcache-dev-0:/peter# OPENAI_API_KEY=$API_KEY  python3 vllm/benchmarks/benchmark_serving.py --backend openai-chat    \
 --base-url  https://api.groq.com/openai      --endpoint "/v1/chat/completions"    \
 --dataset-name=sharegpt     --dataset-path=/mnt/models/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json     \
--model "moonshotai/Kimi-K2-Instruct"     --trust_remote_code      \
--served-model-name "moonshotai/kimi-k2-instruct"    \
--num-prompts 10    \
--max-concurrency 3    \
--save-result    --temperature 0.7    --save-detailed

Test Result

Namespace(backend='openai-chat', base_url='https://api.groq.com/openai', host='127.0.0.1', port=8000, endpoint='/v1/chat/completions', dataset_name='sharegpt', dataset_path='/mnt/models/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=3, model='moonshotai/Kimi-K2-Instruct', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=True, disable_tqdm=False, profile=False, save_result=True, save_detailed=True, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=0.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=0.7, tokenizer_mode='auto', served_model_name='moonshotai/kimi-k2-instruct', lora_modules=None, ramp_up_strategy=None, ramp_up_start_rps=None, ramp_up_end_rps=None)
WARNING 07-17 05:34:37 [tokenizer.py:262] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf RPS.
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 3
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:08<00:00,  1.19it/s]
============ Serving Benchmark Result ============
Successful requests:                     10
Maximum request concurrency:             3         <--------------------  what's new
Benchmark duration (s):                  8.38
Total input tokens:                      1366
Total generated tokens:                  2667
Request throughput (req/s):              1.19
Output token throughput (tok/s):         318.37
Total Token throughput (tok/s):          481.44
---------------Time to First Token----------------
Mean TTFT (ms):                          698.56
Median TTFT (ms):                        645.54
P99 TTFT (ms):                           1008.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.21
Median TPOT (ms):                        4.53
P99 TPOT (ms):                           17.95
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.22
Median ITL (ms):                         0.09
P99 ITL (ms):                            31.18
==================================================
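As a quick sanity check on the table above, the throughput rows can be recomputed from the totals and the benchmark duration. This is only a sketch: the variable names are illustrative, and it assumes each throughput figure is simply a count divided by the duration in seconds. The recomputed values agree with the printed ones to within a fraction of a percent, presumably because the printed duration is rounded to two decimals.

```python
# Figures copied from the result table above.
duration_s = 8.38
num_requests = 10
input_tokens = 1366
output_tokens = 2667


def throughput(count, seconds):
    """Return items processed per second."""
    return count / seconds


# Assumption: each throughput row is count / duration.
request_tput = throughput(num_requests, duration_s)
output_tput = throughput(output_tokens, duration_s)
total_tput = throughput(input_tokens + output_tokens, duration_s)

print(f"Request throughput (req/s):      {request_tput:.2f}")
print(f"Output token throughput (tok/s): {output_tput:.2f}")
print(f"Total token throughput (tok/s):  {total_tput:.2f}")
```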

(Optional) Documentation Update

github-actions · 2025-07-17T05:33:15Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request adds the max-concurrency setting to the benchmark results table. The implementation is straightforward, but I've identified a high-severity correctness issue where the benchmark could report a concurrency limit of 0 while actually running with unlimited concurrency. My review includes suggestions to mitigate this misleading output by making the reporting logic consistent with the actual execution logic.

benchmarks/benchmark_serving.py

benchmarks/benchmark_serving_structured_output.py
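The concern raised in the review can be sketched as follows. This is a hypothetical helper, not the code in this PR: `concurrency_row` is an illustrative name, and the sketch assumes the benchmark enforces a limit only when `max_concurrency` is truthy, so that `None` and `0` both mean "unlimited". Under that assumption, the row is printed only when a limit is actually applied.

```python
def concurrency_row(max_concurrency):
    """Return the result-table row for the concurrency limit, or None.

    Assumption: the benchmark enforces a limit only when the value is
    truthy, so None and 0 both mean "no limit" and produce no row.
    """
    if not max_concurrency:
        return None
    # Same two-column layout as the other rows in the result table.
    return "{:<40} {:<10}".format("Maximum request concurrency:", max_concurrency)


for value in (3, 0, None):
    row = concurrency_row(value)
    print(row if row is not None else f"(no row printed for {value!r})")
```

Using the same truthiness test for reporting as for execution keeps the table from showing a limit of 0 while the run is in fact unbounded.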

NickLucche

I don't have a strong opinion on adding this argument, I would just argue request-rate is probably as important.

We will need to edit vllm/benchmarks/serve.py too, as we'll want to move to vllm bench eventually.

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

panpan0000 · 2025-07-18T04:42:55Z

> I don't have a strong opinion on adding this argument, I would just argue request-rate is probably as important.
>
> We will need to edit vllm/benchmarks/serve.py too, as we'll want to move to vllm bench eventually.

Thanks, @NickLucche.

Correct: when request-rate is lower than max-concurrency, the effective concurrency that the LLM server sees is roughly min(request-rate, max-concurrency).

Digging deeper, many other settings also affect the instantaneous concurrency, such as ramp-up-strategy and burstiness. Users who set those parameters should already be familiar with their consequences.

So, back to my initial motivation:
Most people run the benchmark suite with simple options, for example only --num-prompts 100 --max-concurrency 10.
They will see different TTFT/TPOT for different parameter combinations.
It would therefore be better for the stdout result table to include the key arguments that affect the result most (not every argument). Currently it shows only the total request count (Successful requests), which is not enough.

I have also added request-rate as you suggested. Thanks again for your time, @NickLucche.
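The interaction between request-rate and max-concurrency discussed in this comment can be explored with a toy model. The sketch below is not the benchmark's code: `peak_in_flight` and its parameters are hypothetical, and it assumes evenly spaced arrivals, a fixed per-request latency, and first-come-first-served admission under the cap. In this model the peak number of in-flight requests is roughly the request rate multiplied by the latency, capped at max-concurrency, which matches min(request-rate, max-concurrency) when each request takes about one second.

```python
import heapq
import math


def peak_in_flight(num_requests, request_rate, max_concurrency, latency_s):
    """Peak number of simultaneously running requests in a toy model.

    Requests arrive every 1/request_rate seconds (all at t=0 when the rate
    is infinite), each runs for latency_s seconds, and at most
    max_concurrency run at once (None means no cap).
    """
    interval = 0.0 if math.isinf(request_rate) else 1.0 / request_rate
    running = []  # min-heap of finish times of requests currently running
    now = 0.0
    peak = 0
    for i in range(num_requests):
        # A request cannot start before it arrives or before earlier ones start.
        now = max(now, i * interval)
        while running and running[0] <= now:
            heapq.heappop(running)  # retire requests that have finished
        if max_concurrency is not None and len(running) >= max_concurrency:
            now = heapq.heappop(running)  # wait for the earliest free slot
            while running and running[0] <= now:
                heapq.heappop(running)
        heapq.heappush(running, now + latency_s)
        peak = max(peak, len(running))
    return peak


print(peak_in_flight(10, math.inf, 3, 1.0))  # cap is the binding limit
print(peak_in_flight(20, 2.0, 10, 1.0))      # arrival rate is the binding limit
print(peak_in_flight(20, 2.0, 10, 3.0))      # longer requests overlap more
```

Because either setting can be the binding limit, showing both in the result table makes a run easier to interpret.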

NickLucche

Lgtm, thanks for contributing

mergify bot added performance Performance-related issues structured-output labels Jul 17, 2025

github-project-automation bot added this to Structured Output Jul 17, 2025

gemini-code-assist bot reviewed Jul 17, 2025

NickLucche suggested changes Jul 17, 2025

panpan0000 added 2 commits July 18, 2025 12:42

[benchmark] add max-concurrency in result table

3a03965

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

address NickLucche comments

c992bf8

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

panpan0000 force-pushed the benchmark-result branch from 7483853 to c992bf8 Compare July 18, 2025 04:42

NickLucche approved these changes Jul 18, 2025

DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 21, 2025
