Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)" #21334


Merged: 1 commit into vllm-project:main on Jul 22, 2025

Conversation

@minosfuture (Contributor) commented on Jul 21, 2025

Purpose

This reverts commit 9fb2d22 to fix #21322

Test Plan

  1. pytest -v -s tests/models/multimodal/generation/test_maverick.py
  2. lm_eval maverick

Test Result

  1. UT passed
  2. lm_eval result:

local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter           | n-shot | Metric      | Value | Stderr   |
|-------|---------|------------------|--------|-------------|-------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.93  | ± 0.0181 |
|       |         | strict-match     | 5      | exact_match | 0.92  | ± 0.0192 |

(Optional) Documentation Update

…UTLASS MoE (vllm-project#20762)"

This reverts commit 9fb2d22.

Signed-off-by: Ming Yang <minos.future@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the performance (Performance-related issues) label on Jul 21, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request reverts a previous performance improvement to fix a correctness issue. The changes are mostly removing code related to the optimization. However, I've identified a critical issue where the revert breaks CUDA graph compatibility by creating new tensors inside a function that can be captured by a CUDA graph. This will cause benchmarks and potentially other features relying on CUDA graphs to fail. I've provided detailed comments and code suggestions across multiple files to address this by re-introducing the practice of passing stride tensors as arguments, which was the behavior before the original performance-enhancing change.
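
For context on the CUDA graph point above: the concern is that allocating the stride tensors with torch.full inside the MoE kernel wrapper means new tensors are created at call time inside a region that may be captured by a CUDA graph, whereas allocating them once up front and passing them in keeps the captured region allocation-free. A minimal, self-contained sketch of the difference, with hypothetical names and sizes (not vLLM code):

import torch

# Hypothetical sizes, for illustration only.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_experts, hidden_dim = 8, 4096

def gemm_alloc_inside(a: torch.Tensor) -> torch.Tensor:
    # CUDA-graph-unfriendly: a fresh tensor is allocated on every call,
    # inside the region that would be captured by the graph.
    ab_strides1 = torch.full((num_experts, ), hidden_dim,
                             device=a.device, dtype=torch.int64)
    return a * ab_strides1[0]  # stand-in for the real kernel launches

# CUDA-graph-friendly: allocate the stride tensor once, outside the captured
# region, and pass it in so graph replay reuses the same buffer.
ab_strides1 = torch.full((num_experts, ), hidden_dim,
                         device=device, dtype=torch.int64)

def gemm_strides_passed(a: torch.Tensor, strides: torch.Tensor) -> torch.Tensor:
    return a * strides[0]  # stand-in for the real kernel launches

a = torch.ones(4, device=device)
assert torch.equal(gemm_alloc_inside(a), gemm_strides_passed(a, ab_strides1))

This mirrors the pre-#20762 behavior that the suggestions below reintroduce: the strides are computed once (e.g. at weight-loading time) and threaded through as arguments.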

@@ -207,10 +207,6 @@ def run_8_bit(moe_tensors: MOETensors8Bit,
'topk_ids': topk_ids,
'w1_scale': moe_tensors.w1_scale,
'w2_scale': moe_tensors.w2_scale,

critical

To align with the proposed fix for CUDA graph compatibility, the stride tensors need to be passed to cutlass_moe_fp8 for testing.

        'w2_scale': moe_tensors.w2_scale,
        'ab_strides1': moe_tensors.ab_strides1,
        'ab_strides2': moe_tensors.ab_strides2,
        'c_strides1': moe_tensors.c_strides1,
        'c_strides2': moe_tensors.c_strides2,

@@ -444,11 +440,6 @@ def test_run_cutlass_moe_fp8(
expert_map[start:end] = list(range(num_local_experts))
expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda")

critical

The stride tensors need to be created for the test to be consistent with the proposed fix for CUDA graph compatibility.

        expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda")

        ab_strides1 = torch.full((e, ), k, device="cuda", dtype=torch.int64)
        ab_strides2 = torch.full((e, ), n, device="cuda", dtype=torch.int64)
        c_strides1 = torch.full((e, ), 2 * n, device="cuda", dtype=torch.int64)
        c_strides2 = torch.full((e, ), k, device="cuda", dtype=torch.int64)

Comment on lines +451 to +452
a1q_scale, None, workspace13, workspace2, None, mt.a.dtype,
per_act_token, per_out_channel, False)

critical

The stride tensors should be passed to run_cutlass_moe_fp8 to align with the proposed fix for CUDA graph compatibility.

            a1q_scale, None, ab_strides1, ab_strides2, c_strides1, c_strides2,
            workspace13, workspace2, None, mt.a.dtype, per_act_token,
            per_out_channel, False)

Comment on lines 126 to 131
    experts = CutlassExpertsFp8(num_local_experts,
                                out_dtype,
                                per_act_token,
                                per_out_ch,
                                ab_strides1,
                                ab_strides2,
                                c_strides1,
                                c_strides2,
                                num_dispatchers=num_dispatchers,
                                use_batched_format=True)

critical

The stride tensors need to be created and passed to CutlassExpertsFp8 for the test to be consistent with the proposed fix for CUDA graph compatibility. You'll also need to re-introduce intermediate_dim which was removed in this PR.

    intermediate_dim = w2.shape[2]
    ab_strides1 = torch.full((num_local_experts, ),
                             hidden_dim,
                             device="cuda",
                             dtype=torch.int64)
    ab_strides2 = torch.full((num_local_experts, ),
                             intermediate_dim,
                             device="cuda",
                             dtype=torch.int64)
    c_strides1 = torch.full((num_local_experts, ),
                            2 * intermediate_dim,
                            device="cuda",
                            dtype=torch.int64)
    c_strides2 = torch.full((num_local_experts, ),
                            hidden_dim,
                            device="cuda",
                            dtype=torch.int64)

    experts = CutlassExpertsFp8(num_local_experts,
                                out_dtype,
                                per_act_token,
                                per_out_ch,
                                ab_strides1,
                                ab_strides2,
                                c_strides1,
                                c_strides2,
                                num_dispatchers=num_dispatchers,
                                use_batched_format=True)

@@ -34,10 +35,6 @@ def run_cutlass_moe_fp8(
w2_scale: Optional[torch.Tensor],
a1q_scale: Optional[torch.Tensor],
a2_scale: Optional[torch.Tensor],

critical

To maintain CUDA graph compatibility, the stride tensors should be passed as arguments instead of being created inside this function. Please add them back to the function signature.

Suggested change
     a2_scale: Optional[torch.Tensor],
+    ab_strides1: torch.Tensor,
+    ab_strides2: torch.Tensor,
+    c_strides1: torch.Tensor,
+    c_strides2: torch.Tensor,

@@ -329,10 +332,6 @@ def cutlass_moe_fp8(
topk_ids: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,

critical

The stride tensors should be passed as arguments here as well to maintain CUDA graph compatibility.

    w2_scale: torch.Tensor,
    ab_strides1: torch.Tensor,
    ab_strides2: torch.Tensor,
    c_strides1: torch.Tensor,
    c_strides2: torch.Tensor,

@@ -403,10 +391,6 @@ def cutlass_moe_fp8(
out_dtype=a.dtype,
per_act_token_quant=per_act_token,
per_out_ch_quant=per_out_ch,

critical

The stride tensors should be passed to the CutlassExpertsFp8 constructor.

            per_out_ch_quant=per_out_ch,
            ab_strides1=ab_strides1,
            ab_strides2=ab_strides2,
            c_strides1=c_strides1,
            c_strides2=c_strides2,

@@ -859,21 +859,6 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.w13_weight_scale = torch.nn.Parameter(max_w13_scales,
                                                    requires_grad=False)


critical

The stride tensors should be pre-computed and stored here to be passed to the MoE kernel. This is necessary for CUDA graph compatibility.

Suggested change
        device = layer.w13_weight.device
        # ab_strides1 and c_strides2 are the same
        self.ab_strides1_c_strides2 = torch.full((layer.local_num_experts, ),
                                                 layer.hidden_size,
                                                 device=device,
                                                 dtype=torch.int64)
        self.ab_strides2 = torch.full((layer.local_num_experts, ),
                                      layer.intermediate_size_per_partition,
                                      device=device,
                                      dtype=torch.int64)
        self.c_strides1 = torch.full((layer.local_num_experts, ),
                                     2 * layer.intermediate_size_per_partition,
                                     device=device,
                                     dtype=torch.int64)

@@ -896,10 +881,6 @@ def select_gemm_impl(
moe.in_dtype,
self.input_quant.strategy == QuantizationStrategy.TOKEN,
self.weight_quant.strategy == QuantizationStrategy.CHANNEL,

critical

The stride tensors should be passed to the CutlassExpertsFp8 constructor.

            self.weight_quant.strategy == QuantizationStrategy.CHANNEL,
            ab_strides1=self.ab_strides1_c_strides2,
            ab_strides2=self.ab_strides2,
            c_strides1=self.c_strides1,
            c_strides2=self.ab_strides1_c_strides2,

@@ -968,10 +948,6 @@ def apply(
expert_map=None if self.disable_expert_map else expert_map,
w1_scale=layer.w13_weight_scale,
w2_scale=layer.w2_weight_scale,

critical

The stride tensors should be passed to cutlass_moe_fp8.

                w2_scale=layer.w2_weight_scale,
                ab_strides1=self.ab_strides1_c_strides2,
                ab_strides2=self.ab_strides2,
                c_strides1=self.c_strides1,
                c_strides2=self.ab_strides1_c_strides2,

@houseroad (Collaborator) left a comment


Thanks for reverting the original PR to help recover the trunk health. This will unblock our code sync as well.

@houseroad (Collaborator)

cc: @ElizaWszola, @tlrmchlsmth, @mgoin, @robertgshaw2-redhat: this is blocking our internal work, so we need to revert for now to unblock. Sorry about the inconvenience, and we're happy to help land the fixed version. If a forward fix is easy to land, we are happy to switch to that as well. :-)

@houseroad enabled auto-merge (squash) on July 21, 2025 22:04
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jul 21, 2025
@houseroad added the llama (Related to Llama models) label on Jul 21, 2025
@mgoin added this to the v0.10.0 milestone on Jul 22, 2025
@mgoin (Member) left a comment


Okay, let's revert for now. Thanks for identifying this.

@simon-mo disabled auto-merge on July 22, 2025 04:48
@simon-mo merged commit e7b2042 into vllm-project:main on Jul 22, 2025
109 of 111 checks passed
minosfuture added a commit to minosfuture/vllm that referenced this pull request Jul 22, 2025
minosfuture added a commit to minosfuture/vllm that referenced this pull request Jul 23, 2025
…CUTLASS MoE (vllm-project#20762) (vllm-project#21334)

This reverts commit e7b2042.

The original PR vllm-project#20762 is:

Authored-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: Ming Yang <minos.future@gmail.com>
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
…UTLASS MoE (vllm-project#20762) (vllm-project#21334)

Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: qizixi <qizixi@meta.com>
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
Labels
llama (Related to Llama models), performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Llama4 Maverick runtime error (shuffle_rows)
4 participants