
[NV] add minimaxm2.5-fp4-b200-trt#1722

Merged
hshrivastava-droid merged 2 commits into main from nv/minimaxm2.5-fp4-b200-trt-v2
Jun 13, 2026

Conversation

@hshrivastava-droid

@hshrivastava-droid hshrivastava-droid commented Jun 12, 2026

Collaborator

Summary

Adds a single-node MiniMax-M2.5 NVFP4 benchmark on B200 using TensorRT-LLM (tensorrt-llm/release:1.3.0rc18).

Changes

nvidia-master.yaml — new config key minimaxm2.5-fp4-b200-trt

  • Image: nvcr.io#nvidia/tensorrt-llm/release:1.3.0rc18
  • Model: nvidia/MiniMax-M2.5-NVFP4
  • Runner: B200 (single-node)
  • Scenarios:
    • 1k/1k — 6 search-space entries covering TP 1–8, EP 1–8, with and without DP attention, concurrency 4–1024
    • 8k/1k — 5 search-space entries covering TP 1–8, EP 1–4, with and without DP attention, concurrency 4–1024
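As a rough illustration of what a concurrency range of 4–1024 could expand to, here is a minimal sketch. It assumes the sweep doubles at each step, which the description does not state; `conc_sweep` is a hypothetical helper, not something defined in the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list concurrency values from $1 up to $2, doubling each step.
# The doubling step is an assumption; the PR only states the range 4-1024.
conc_sweep() {
  local c="$1" end="$2"
  local out=()
  while (( c <= end )); do
    out+=("$c")
    c=$(( c * 2 ))
  done
  echo "${out[*]}"
}

conc_sweep 4 1024   # prints: 4 8 16 32 64 128 256 512 1024
```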

minimaxm2.5_fp4_b200_trt.sh — new benchmark script

  • Generates a TRT-LLM runtime YAML at launch (CUDA graphs, MoE backend, optional attention DP, FP8 KV cache, NVFP4 GEMM backends)
  • Launches trtllm-serve via mpirun with PyTorch backend
  • Runs the standard serving benchmark (run_benchmark_serving)
  • Optionally runs lm-eval when RUN_EVAL=true
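The first step above, generating the runtime YAML, might look roughly like the sketch below. The function names are hypothetical and the YAML key names are assumptions modeled on TensorRT-LLM's LLM API options; they are not copied from minimaxm2.5_fp4_b200_trt.sh. The NVFP4 GEMM backend settings and the mpirun launch are left out because their exact form is not shown in this PR conversation.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical and the YAML keys are assumptions,
# not taken from the PR's script.

# Join arguments with ", " without leaving a trailing separator.
join_csv() {
  local out="" item
  for item in "$@"; do
    out+="${out:+, }${item}"
  done
  printf '%s' "$out"
}

# Print a runtime YAML to stdout.
# $1 = "true" or "false" for attention DP; remaining args = CUDA graph batch sizes.
write_runtime_yaml() {
  local dp_attention="$1"
  shift
  cat <<EOF
cuda_graph_config:
  enable_padding: true
  batch_sizes: [$(join_csv "$@")]
enable_attention_dp: ${dp_attention}
kv_cache_config:
  dtype: fp8
moe_config:
  backend: TRTLLM
EOF
}

# Example: attention DP enabled, four capture batch sizes.
write_runtime_yaml true 1 2 4 8
```

Writing to stdout keeps the generator easy to inspect; a real script would redirect the output to a file and pass that file to the server at launch.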

perf-changelog.yaml

  • Documents the new minimaxm2.5-fp4-b200-trt config addition and image version.

Context

This sits alongside the existing minimaxm2.5-fp4-b200-vllm entry, adding a TensorRT-LLM comparison point for the same model/precision/SKU combination.


Note

Low Risk
Additive benchmark configuration and shell script only; no changes to production serving, auth, or shared runtime logic.

Overview
Adds a TensorRT-LLM single-node benchmark path for MiniMax-M2.5 NVFP4 on B200, alongside the existing vLLM entry for the same model/SKU.

nvidia-master.yaml introduces minimaxm2.5-fp4-b200-trt (tensorrt-llm/release:1.3.0rc18) with fixed-seq-len sweeps at 1k/1k and 8k/1k over TP/EP settings, optional DP attention, and concurrency ranges.

minimaxm2.5_fp4_b200_trt.sh wires the run: it writes a runtime YAML (CUDA graphs, MoE/NVFP4 settings, optional attention DP), starts trtllm-serve via mpirun, runs the standard serving throughput benchmark, and optionally runs lm-eval when RUN_EVAL=true.

perf-changelog.yaml records the new config key and image.

Reviewed by Cursor Bugbot for commit df5a3a2. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


@hshrivastava-droid

Collaborator Author

@claude add PR desc

@Klaud-Cold

Klaud-Cold commented Jun 12, 2026

Collaborator

Claude finished @hshrivastava-droid's task in 47s.


Adding PR Description

  • Read changed files to understand the PR content
  • Generate and update PR description

PR description has been updated with a structured summary covering the three changed files: the new minimaxm2.5-fp4-b200-trt config in nvidia-master.yaml, the new minimaxm2.5_fp4_b200_trt.sh benchmark script, and the perf-changelog.yaml entry.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 1af0b14. Configure here.

CAPTURE_BATCH_LIST+=( $(seq 768 64 $CONC))
fi
CAPTURE_BATCH_LIST=$(printf "%s, " "${CAPTURE_BATCH_LIST[@]}")
MAX_CAPTURE_TOKENS=$(( CONC < 16 ? 4096 : MAX_NUM_TOKENS ))


Wrong CUDA capture token cap

Medium Severity

torch_compile_config.capture_num_tokens is sized from an early MAX_CAPTURE_TOKENS (min(16384, CONC*ISL)), while a later assignment caps low concurrency at 4096 but is never used. For CONC under 16 (e.g. 8k/1k at conc 4–8), capture lists can reach CONC*ISL or 16384 instead of 4096.
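One way to avoid the dead assignment described above is to compute the cap in a single place, before anything consumes it. The sketch below is only an illustration: `capture_token_cap` is a hypothetical helper, and combining the two rules (min(16384, CONC*ISL), then a 4096 ceiling when CONC is under 16) is an assumption about the intended behavior, which the review comment does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR's script.
# Args: $1 = concurrency (CONC), $2 = input sequence length (ISL).
# Assumed intent: cap = min(16384, CONC*ISL), further limited to 4096 when CONC < 16.
capture_token_cap() {
  local conc="$1" isl="$2"
  local cap=$(( conc * isl ))
  if (( cap > 16384 )); then
    cap=16384
  fi
  if (( conc < 16 && cap > 4096 )); then
    cap=4096
  fi
  echo "$cap"
}

# Compute once, then use this value wherever the capture list is sized.
MAX_CAPTURE_TOKENS="$(capture_token_cap 4 8192)"
echo "$MAX_CAPTURE_TOKENS"   # prints: 4096
```

With this shape, the 8k/1k case at concurrency 4 resolves to 4096, and higher-concurrency cases keep the 16384 ceiling.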


@jgangani jgangani left a comment

Collaborator


LGTM


@hshrivastava-droid

Collaborator Author

@functionstackx could you please review this?

@hshrivastava-droid

Collaborator Author

@claude add PR description

@hshrivastava-droid hshrivastava-droid changed the title from "[WIP][NV] add minimaxm2.5-fp4-b200-trt" to "[NV] add minimaxm2.5-fp4-b200-trt" on Jun 12, 2026
@Klaud-Cold

Klaud-Cold commented Jun 12, 2026

Collaborator

Claude finished @hshrivastava-droid's task in 54s.


Adding PR Description

  • Read changed files to understand the PR content
  • Generate and update PR description

PR description has been updated with a structured summary covering the three changed files: the new minimaxm2.5-fp4-b200-trt config in nvidia-master.yaml, the new minimaxm2.5_fp4_b200_trt.sh benchmark script, and the perf-changelog.yaml entry.

@hshrivastava-droid

Collaborator Author

@claude launch b200 exclusively on b200-dgxc_09

@Klaud-Cold

Klaud-Cold commented Jun 12, 2026

Collaborator

Claude finished @hshrivastava-droid's task in 1m 51s.


Launching B200 Benchmark on Exclusive Runner

  • Read config and workflow to understand runner targeting
  • Trigger e2e-tests workflow on b200-dgxc_09
  • Report run link

Launched minimaxm2.5-fp4-b200-trt benchmark exclusively on runner b200-dgxc_09 with conc=4 across all sequence lengths.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27448473459

Command used:

test-config --config-files .github/configs/nvidia-master.yaml \
  --config-keys minimaxm2.5-fp4-b200-trt \
  --runner-node-filter b200-dgxc_09 \
  --conc 4


@hshrivastava-droid hshrivastava-droid merged commit 1bf0693 into main Jun 13, 2026
10 checks passed
@hshrivastava-droid hshrivastava-droid deleted the nv/minimaxm2.5-fp4-b200-trt-v2 branch June 13, 2026 03:28

5 participants