[NV] add minimaxm2.5-fp4-b200-trt by hshrivastava-droid · Pull Request #1722 · SemiAnalysisAI/InferenceX

hshrivastava-droid · 2026-06-12T19:20:22Z

Summary

Adds a single-node MiniMax-M2.5 NVFP4 benchmark on B200 using TensorRT-LLM (tensorrt-llm/release:1.3.0rc18).

Changes

`nvidia-master.yaml` — new config key `minimaxm2.5-fp4-b200-trt`

Image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc18
Model: nvidia/MiniMax-M2.5-NVFP4
Runner: B200 (single-node)
Scenarios:
- 1k/1k — 6 search-space entries covering TP 1–8, EP 1–8, with and without DP attention, concurrency 4–1024
- 8k/1k — 5 search-space entries covering TP 1–8, EP 1–4, with and without DP attention, concurrency 4–1024
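
The concurrency ranges above are easiest to read as a doubling ladder. This is a minimal sketch that assumes each search-space entry sweeps concurrency in powers of two from a start value to an end value. The helper name `conc_ladder` is hypothetical and is not part of the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: expand a concurrency range into a doubling sweep.
# Assumes the sweep doubles from start to end (inclusive); not taken from the PR.
conc_ladder() {
  local start=$1 end=$2
  local conc=$start
  local out=()
  while (( conc <= end )); do
    out+=("$conc")
    conc=$(( conc * 2 ))
  done
  # Join the ladder with single spaces.
  echo "${out[*]}"
}

# Usage: the full 4 to 1024 range from the scenarios above.
conc_ladder 4 1024
```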

`minimaxm2.5_fp4_b200_trt.sh` — new benchmark script

- Generates a TRT-LLM runtime YAML at launch (CUDA graphs, MoE backend, optional attention DP, FP8 KV cache, NVFP4 GEMM backends)
- Launches trtllm-serve via mpirun with the PyTorch backend
- Runs the standard serving benchmark (`run_benchmark_serving`)
- Optionally runs lm-eval when `RUN_EVAL=true`
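
The runtime-YAML step can be sketched as follows. This is a minimal sketch that assumes TensorRT-LLM's LLM API option names (`cuda_graph_config`, `moe_config.backend`, `kv_cache_config.dtype`, `enable_attention_dp`). The function name, its arguments, and the `TRTLLM` MoE backend value are illustrative assumptions, not the contents of `minimaxm2.5_fp4_b200_trt.sh`. NVFP4 GEMM backend settings are left out because their exact keys are not shown in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: write a TRT-LLM runtime options file at launch.
# Key names follow the LLM API options; values and argument names are assumptions.
write_runtime_yaml() {
  local out=$1 max_batch=$2 dp_attention=$3
  cat > "$out" <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: ${max_batch}
moe_config:
  backend: TRTLLM
kv_cache_config:
  dtype: fp8
EOF
  # Attention DP is optional, so only emit the key when requested.
  if [[ "$dp_attention" == "true" ]]; then
    echo "enable_attention_dp: true" >> "$out"
  fi
}

# Usage: write a config with attention DP enabled and print it.
cfg=$(mktemp)
write_runtime_yaml "$cfg" 256 true
cat "$cfg"
```

Emitting the optional key conditionally keeps the non-DP configs free of the attention DP setting entirely, rather than writing `false`.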

`perf-changelog.yaml`

Documents the new minimaxm2.5-fp4-b200-trt config addition and image version.

Context

This sits alongside the existing minimaxm2.5-fp4-b200-vllm entry, adding a TensorRT-LLM comparison point for the same model/precision/SKU combination.

Note

Low Risk
Additive benchmark configuration and shell script only; no changes to production serving, auth, or shared runtime logic.

Overview
Adds a TensorRT-LLM single-node benchmark path for MiniMax-M2.5 NVFP4 on B200, alongside the existing vLLM entry for the same model/SKU.

nvidia-master.yaml introduces minimaxm2.5-fp4-b200-trt (tensorrt-llm/release:1.3.0rc18) with fixed-sequence-length sweeps at 1k/1k and 8k/1k over TP/EP, optional DP attention, and a range of concurrencies.

minimaxm2.5_fp4_b200_trt.sh wires up the run: it writes a runtime YAML (CUDA graphs, MoE/NVFP4 settings, optional attention DP), starts trtllm-serve via mpirun, runs the standard serving throughput benchmark, and optionally runs lm-eval when RUN_EVAL=true.
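
The launch and optional-eval steps might be wired roughly like this. The sketch assumes standard `trtllm-serve` flags (`--backend pytorch`, `--tp_size`, `--ep_size`, `--extra_llm_api_options`, `--port`) and a single-rank OpenMPI `mpirun` wrapper. The function names and the exact `RUN_EVAL` check are hypothetical. It prints the command instead of running it, because the real script's arguments are not reproduced in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build the trtllm-serve launch command (printed, not run).
build_serve_cmd() {
  local model=$1 tp=$2 ep=$3 cfg=$4 port=$5
  local -a cmd
  cmd=(mpirun -n 1 --oversubscribe --allow-run-as-root
    trtllm-serve "$model"
    --backend pytorch
    --tp_size "$tp" --ep_size "$ep"
    --extra_llm_api_options "$cfg"
    --port "$port")
  echo "${cmd[*]}"
}

# Hypothetical gate: lm-eval runs only when RUN_EVAL is exactly "true".
should_run_eval() {
  [[ "${RUN_EVAL:-false}" == "true" ]]
}

# Usage: print a TP4/EP4 launch line, then check the eval gate.
build_serve_cmd nvidia/MiniMax-M2.5-NVFP4 4 4 /tmp/runtime.yaml 8888
if should_run_eval; then echo "would run lm-eval"; fi
```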

perf-changelog.yaml records the new config key and image.

Reviewed by Cursor Bugbot for commit df5a3a2. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-12T19:20:33Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

hshrivastava-droid · 2026-06-12T19:21:08Z

@claude add PR desc

Klaud-Cold · 2026-06-12T19:21:32Z

Claude finished @hshrivastava-droid's task in 47s.

Adding PR Description

Read changed files to understand the PR content
Generate and update PR description

PR description has been updated with a structured summary covering the three changed files: the new minimaxm2.5-fp4-b200-trt config in nvidia-master.yaml, the new minimaxm2.5_fp4_b200_trt.sh benchmark script, and the perf-changelog.yaml entry.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1af0b14.

cursor · 2026-06-12T19:22:17Z

+    CAPTURE_BATCH_LIST+=( $(seq 768 64 $CONC))
+fi
+CAPTURE_BATCH_LIST=$(printf "%s, " "${CAPTURE_BATCH_LIST[@]}")
+MAX_CAPTURE_TOKENS=$(( CONC < 16 ? 4096 : MAX_NUM_TOKENS ))


Wrong CUDA capture token cap

Medium Severity

`torch_compile_config.capture_num_tokens` is sized from an earlier `MAX_CAPTURE_TOKENS` value (`min(16384, CONC*ISL)`), while the later assignment that caps low concurrency at 4096 is never used. For CONC below 16 (e.g., 8k/1k at conc 4–8), the capture token list can therefore reach `min(16384, CONC*ISL)` instead of 4096.
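
One way to resolve the ordering problem is to compute the final cap once, before anything reads it, and then size the capture list from that value. This is a minimal sketch. It assumes the intended behavior is `min(16384, CONC*ISL)` with an extra 4096 ceiling when CONC is below 16, and that capture sizes double from 64 up to the cap. The doubling scheme and the helper names are hypothetical and are not the PR's actual fix.

```bash
#!/usr/bin/env bash
# Hypothetical fix sketch: settle the cap first, then build the list from it.
max_capture_tokens() {
  local conc=$1 isl=$2
  local cap=$(( conc * isl < 16384 ? conc * isl : 16384 ))
  # Low-concurrency ceiling, applied before any consumer reads the value.
  if (( conc < 16 )); then
    cap=$(( cap < 4096 ? cap : 4096 ))
  fi
  echo "$cap"
}

# Hypothetical list builder: powers of two from 64 up to the cap, joined with
# ", " and with the trailing separator that printf "%s, " leaves stripped off.
capture_token_list() {
  local cap=$1 n=64
  local sizes=()
  while (( n <= cap )); do
    sizes+=("$n")
    n=$(( n * 2 ))
  done
  local joined
  joined=$(printf "%s, " "${sizes[@]}")
  echo "${joined%, }"
}

# Usage: 8k/1k at conc 4 is capped at 4096 tokens.
cap=$(max_capture_tokens 4 8192)
echo "capture_num_tokens: [$(capture_token_list "$cap")]"
```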

jgangani

LGTM

github-actions · 2026-06-12T21:21:03Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27437755655
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27437755655

hshrivastava-droid · 2026-06-12T22:27:27Z

@functionstackx could you please review this?

hshrivastava-droid · 2026-06-12T22:27:49Z

@claude add PR description

Klaud-Cold · 2026-06-12T22:28:18Z

Claude finished @hshrivastava-droid's task in 54s.

Adding PR Description

Read changed files to understand the PR content
Generate and update PR description

PR description has been updated with a structured summary covering the three changed files: the new minimaxm2.5-fp4-b200-trt config in nvidia-master.yaml, the new minimaxm2.5_fp4_b200_trt.sh benchmark script, and the perf-changelog.yaml entry.

hshrivastava-droid · 2026-06-12T23:12:38Z

@claude launch b200 on exclusive runner b200-dgxc_09

Klaud-Cold · 2026-06-12T23:13:01Z

Claude finished @hshrivastava-droid's task in 1m 51s.

Launching B200 Benchmark on Exclusive Runner

Read config and workflow to understand runner targeting
Trigger e2e-tests workflow on b200-dgxc_09
Report run link

Launched minimaxm2.5-fp4-b200-trt benchmark exclusively on runner b200-dgxc_09 with conc=4 across all sequence lengths.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27448473459

Command used:

test-config --config-files .github/configs/nvidia-master.yaml \
  --config-keys minimaxm2.5-fp4-b200-trt \
  --runner-node-filter b200-dgxc_09 \
  --conc 4

github-actions · 2026-06-13T00:01:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27437755655
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27437755655

github-actions · 2026-06-13T03:27:30Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27437755655
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27437755655

github-actions · 2026-06-13T04:54:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27455174586
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27455174586

add minimax2.5 b200

1af0b14

hshrivastava-droid requested a review from a team June 12, 2026 19:20

hshrivastava-droid requested review from jgangani and kedarpotdar-nv as code owners June 12, 2026 19:20

github-project-automation Bot added this to InferenceMAX Board Jun 12, 2026

hshrivastava-droid added the full-sweep-enabled label Jun 12, 2026

cursor Bot reviewed Jun 12, 2026

kedarpotdar-nv approved these changes Jun 12, 2026

jgangani approved these changes Jun 12, 2026

functionstackx approved these changes Jun 12, 2026

hshrivastava-droid changed the title ~~[WIP][NV] add minimaxm2.5-fp4-b200-trt~~ [NV] add minimaxm2.5-fp4-b200-trt Jun 12, 2026

functionstackx approved these changes Jun 12, 2026

Merge branch 'main' into nv/minimaxm2.5-fp4-b200-trt-v2

df5a3a2

hshrivastava-droid merged commit 1bf0693 into main Jun 13, 2026
10 checks passed

hshrivastava-droid deleted the nv/minimaxm2.5-fp4-b200-trt-v2 branch June 13, 2026 03:28

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 13, 2026

Oseltamivir mentioned this pull request Jun 18, 2026

Add failed-ingest recovery and pre-reuse changelog validation [skip-sweep] #1821

Merged

5 participants
