[NV] add minimaxm2.5-fp4-b200-trt#1722
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
|
@claude add PR desc |
|
Claude finished @hshrivastava-droid's task in 47s —— View job Adding PR Description
PR description has been updated with a structured summary covering the three changed files: the new |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 1af0b14.
    CAPTURE_BATCH_LIST+=( $(seq 768 64 $CONC))
fi
CAPTURE_BATCH_LIST=$(printf "%s, " "${CAPTURE_BATCH_LIST[@]}")
MAX_CAPTURE_TOKENS=$(( CONC < 16 ? 4096 : MAX_NUM_TOKENS ))
Wrong CUDA capture token cap
Medium Severity
torch_compile_config.capture_num_tokens is sized from an early MAX_CAPTURE_TOKENS (min(16384, CONC*ISL)), while a later assignment caps low concurrency at 4096 but is never used. For CONC under 16 (e.g. 8k/1k at conc 4–8), capture lists can reach CONC*ISL or 16384 instead of 4096.
Reviewed by Cursor Bugbot for commit 1af0b14.
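One way to address the finding above is to decide the capture cap once, before anything that sizes the capture list reads it. The following is a minimal sketch, not the script from this PR: CONC and MAX_NUM_TOKENS mirror names in the review comment, while the helper functions capture_cap and capture_list, the default values, and the power-of-two ladder are assumptions made for illustration. It also assumes the 4096 cap for CONC under 16 is the intended behaviour.

```shell
#!/usr/bin/env bash
set -euo pipefail

# capture_cap CONC MAX_NUM_TOKENS -> prints the capture token cap.
# Hypothetical helper: low concurrency (CONC < 16) is capped at 4096,
# otherwise the cap is MAX_NUM_TOKENS.
capture_cap() {
    local conc=$1 max_num_tokens=$2
    echo $(( conc < 16 ? 4096 : max_num_tokens ))
}

# capture_list CAP -> prints a ", "-separated power-of-two ladder up to CAP.
# The ladder shape is an assumption; the real script may size it differently.
capture_list() {
    local cap=$1 n=1
    local -a out=()
    while (( n <= cap )); do
        out+=("$n")
        n=$(( n * 2 ))
    done
    # Join on "," and then widen to ", " so no trailing separator is left.
    local IFS=,
    local joined="${out[*]}"
    echo "${joined//,/, }"
}

# Illustrative defaults only.
CONC=${CONC:-8}
MAX_NUM_TOKENS=${MAX_NUM_TOKENS:-16384}

# Compute the cap first, then build the list from that final value.
MAX_CAPTURE_TOKENS=$(capture_cap "$CONC" "$MAX_NUM_TOKENS")
echo "capture_num_tokens: [$(capture_list "$MAX_CAPTURE_TOKENS")]"
```

Because the cap is computed in one place, a later reassignment cannot silently go unused, which is the failure mode the review describes.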
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27437755655 |
|
@functionstackx could you please review this? |
|
@claude add PR description |
|
Claude finished @hshrivastava-droid's task in 54s —— View job Adding PR Description
PR description has been updated with a structured summary covering the three changed files: the new |
|
@claude launch b200 to exclusive b200-dgxc_09 |
|
Claude finished @hshrivastava-droid's task in 1m 51s —— View job Launching B200 Benchmark on Exclusive Runner
Launched Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27448473459 Command used: |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27455174586 |


Summary
Adds a MiniMax-M2.5 NVFP4 single-node benchmark on B200 using TensorRT-LLM (tensorrt-llm/release:1.3.0rc18).
Changes
- nvidia-master.yaml: new config key minimaxm2.5-fp4-b200-trt
  - Image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc18
  - Model: nvidia/MiniMax-M2.5-NVFP4
- minimaxm2.5_fp4_b200_trt.sh: new benchmark script
  - trtllm-serve via mpirun with PyTorch backend
  - Serving throughput benchmark (run_benchmark_serving)
  - lm-eval when RUN_EVAL=true
- perf-changelog.yaml: minimaxm2.5-fp4-b200-trt config addition and image version.
Context
This sits alongside the existing minimaxm2.5-fp4-b200-vllm entry, adding a TensorRT-LLM comparison point for the same model/precision/SKU combination.
Note
Low Risk
Additive benchmark configuration and shell script only; no changes to production serving, auth, or shared runtime logic.
Overview
Adds a TensorRT-LLM single-node benchmark path for MiniMax-M2.5 NVFP4 on B200, alongside the existing vLLM entry for the same model/SKU.
nvidia-master.yaml introduces minimaxm2.5-fp4-b200-trt (tensorrt-llm/release:1.3.0rc18) with fixed-seq-len sweeps at 1k/1k and 8k/1k over TP/EP and optional DP attention concurrency ranges.
minimaxm2.5_fp4_b200_trt.sh wires the run: it writes a runtime YAML (CUDA graphs, MoE/NVFP4 settings, optional attention DP), starts trtllm-serve via mpirun, runs the standard serving throughput benchmark, and optionally runs lm-eval when RUN_EVAL=true.
perf-changelog.yaml records the new config key and image.
Reviewed by Cursor Bugbot for commit df5a3a2. Bugbot is set up for automated code reviews on this repo.
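The flow described in the overview can be sketched as a dry run that writes a runtime YAML, assembles the server command, and lists the steps without launching anything. This is an illustration under stated assumptions, not the contents of minimaxm2.5_fp4_b200_trt.sh: the variable names (TP, EP_SIZE, DP_ATTENTION, PORT, EXTRA_CONFIG), their defaults, and the two YAML keys shown are chosen for the example, and the real script's MoE/NVFP4 settings and run_benchmark_serving arguments are not visible in this conversation, so they are left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative defaults; the real values come from the benchmark config.
MODEL=${MODEL:-nvidia/MiniMax-M2.5-NVFP4}
TP=${TP:-4}
EP_SIZE=${EP_SIZE:-1}
DP_ATTENTION=${DP_ATTENTION:-false}
RUN_EVAL=${RUN_EVAL:-false}
PORT=${PORT:-8000}
EXTRA_CONFIG=$(mktemp)

# Step 1: write the runtime YAML (a small subset of options, for illustration).
cat > "$EXTRA_CONFIG" <<EOF
cuda_graph_config:
  enable_padding: true
enable_attention_dp: ${DP_ATTENTION}
EOF

# Step 2: assemble the server command as an array so arguments stay quoted.
# It is only printed here; a real run would execute it in the background.
SERVER_CMD=(
    mpirun -n 1
    trtllm-serve "$MODEL"
    --backend pytorch
    --tp_size "$TP"
    --ep_size "$EP_SIZE"
    --port "$PORT"
    --extra_llm_api_options "$EXTRA_CONFIG"
)

# Steps 3 and 4: the throughput benchmark always runs; lm-eval is optional.
STEPS=("write runtime YAML" "start server" "run serving benchmark")
if [ "$RUN_EVAL" = "true" ]; then
    STEPS+=("run lm-eval")
fi

echo "server command: ${SERVER_CMD[*]}"
printf 'step: %s\n' "${STEPS[@]}"
```

Setting RUN_EVAL=true before running the sketch adds the lm-eval step to the printed list, mirroring the optional evaluation described in the overview.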