Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)" #21334


Merged: 1 commit into vllm-project:main on Jul 22, 2025

Conversation

@minosfuture (Contributor) commented on Jul 21, 2025

Purpose

This reverts commit 9fb2d22 to fix #21322

Test Plan

  1. pytest -v -s tests/models/multimodal/generation/test_maverick.py
  2. lm_eval maverick

Test Result

  1. UT passed
  2. lm_eval result:

local-chat-completions (model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=32), gen_kwargs: (None), limit: 200.0, num_fewshot: 5, batch_size: 1

| Tasks | Version | Filter           | n-shot | Metric      | Value | Stderr   |
|-------|---------|------------------|--------|-------------|-------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.93  | ± 0.0181 |
|       |         | strict-match     | 5      | exact_match | 0.92  | ± 0.0192 |

(Optional) Documentation Update

…UTLASS MoE (vllm-project#20762)"

This reverts commit 9fb2d22.

Signed-off-by: Ming Yang <minos.future@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the performance (Performance-related issues) label on Jul 21, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request reverts a previous performance improvement to fix a correctness issue. The changes are mostly removing code related to the optimization. However, I've identified a critical issue where the revert breaks CUDA graph compatibility by creating new tensors inside a function that can be captured by a CUDA graph. This will cause benchmarks and potentially other features relying on CUDA graphs to fail. I've provided detailed comments and code suggestions across multiple files to address this by re-introducing the practice of passing stride tensors as arguments, which was the behavior before the original performance-enhancing change.
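
For context on the CUDA graph point above: the concern is that allocating the stride tensors with torch.full inside the MoE kernel wrapper means new tensors are created at call time inside a region that may be captured by a CUDA graph, whereas allocating them once up front and passing them in keeps the captured region allocation-free. A minimal, self-contained sketch of the difference, with hypothetical names and sizes (not vLLM code):

import torch

# Hypothetical sizes, for illustration only.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_experts, hidden_dim = 8, 4096

def gemm_alloc_inside(a: torch.Tensor) -> torch.Tensor:
    # CUDA-graph-unfriendly: a fresh tensor is allocated on every call,
    # inside the region that would be captured by the graph.
    ab_strides1 = torch.full((num_experts, ), hidden_dim,
                             device=a.device, dtype=torch.int64)
    return a * ab_strides1[0]  # stand-in for the real kernel launches

# CUDA-graph-friendly: allocate the stride tensor once, outside the captured
# region, and pass it in so graph replay reuses the same buffer.
ab_strides1 = torch.full((num_experts, ), hidden_dim,
                         device=device, dtype=torch.int64)

def gemm_strides_passed(a: torch.Tensor, strides: torch.Tensor) -> torch.Tensor:
    return a * strides[0]  # stand-in for the real kernel launches

a = torch.ones(4, device=device)
assert torch.equal(gemm_alloc_inside(a), gemm_strides_passed(a, ab_strides1))

This mirrors the pre-#20762 behavior that the suggestions below reintroduce: the strides are computed once (e.g. at weight-loading time) and threaded through as arguments.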

@@ -207,10 +207,6 @@ def run_8_bit(moe_tensors: MOETensors8Bit,
'topk_ids': topk_ids,
'w1_scale': moe_tensors.w1_scale,
'w2_scale': moe_tensors.w2_scale,

critical

To align with the proposed fix for CUDA graph compatibility, the stride tensors need to be passed to cutlass_moe_fp8 for testing.

        'w2_scale': moe_tensors.w2_scale,
        'ab_strides1': moe_tensors.ab_strides1,
        'ab_strides2': moe_tensors.ab_strides2,
        'c_strides1': moe_tensors.c_strides1,
        'c_strides2': moe_tensors.c_strides2,

@@ -444,11 +440,6 @@ def test_run_cutlass_moe_fp8(
expert_map[start:end] = list(range(num_local_experts))
expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda")

critical

The stride tensors need to be created for the test to be consistent with the proposed fix for CUDA graph compatibility.

        expert_map = torch.tensor(expert_map, dtype=torch.int32, device="cuda")

        ab_strides1 = torch.full((e, ), k, device="cuda", dtype=torch.int64)
        ab_strides2 = torch.full((e, ), n, device="cuda", dtype=torch.int64)
        c_strides1 = torch.full((e, ), 2 * n, device="cuda", dtype=torch.int64)
        c_strides2 = torch.full((e, ), k, device="cuda", dtype=torch.int64)

Comment on lines +451 to +452
a1q_scale, None, workspace13, workspace2, None, mt.a.dtype,
per_act_token, per_out_channel, False)

critical

The stride tensors should be passed to run_cutlass_moe_fp8 to align with the proposed fix for CUDA graph compatibility.

            a1q_scale, None, ab_strides1, ab_strides2, c_strides1, c_strides2,
            workspace13, workspace2, None, mt.a.dtype, per_act_token,
            per_out_channel, False)

Comment on lines 126 to 131
    experts = CutlassExpertsFp8(num_local_experts,
                                out_dtype,
                                per_act_token,
                                per_out_ch,
                                ab_strides1,
                                ab_strides2,
                                c_strides1,
                                c_strides2,
                                num_dispatchers=num_dispatchers,
                                use_batched_format=True)

critical

The stride tensors need to be created and passed to CutlassExpertsFp8 for the test to be consistent with the proposed fix for CUDA graph compatibility. You'll also need to re-introduce intermediate_dim which was removed in this PR.

    intermediate_dim = w2.shape[2]
    ab_strides1 = torch.full((num_local_experts, ),
                             hidden_dim,
                             device="cuda",
                             dtype=torch.int64)
    ab_strides2 = torch.full((num_local_experts, ),
                             intermediate_dim,
                             device="cuda",
                             dtype=torch.int64)
    c_strides1 = torch.full((num_local_experts, ),
                            2 * intermediate_dim,
                            device="cuda",
                            dtype=torch.int64)
    c_strides2 = torch.full((num_local_experts, ),
                            hidden_dim,
                            device="cuda",
                            dtype=torch.int64)

    experts = CutlassExpertsFp8(num_local_experts,
                                out_dtype,
                                per_act_token,
                                per_out_ch,
                                ab_strides1,
                                ab_strides2,
                                c_strides1,
                                c_strides2,
                                num_dispatchers=num_dispatchers,
                                use_batched_format=True)

@@ -34,10 +35,6 @@ def run_cutlass_moe_fp8(
w2_scale: Optional[torch.Tensor],
a1q_scale: Optional[torch.Tensor],
a2_scale: Optional[torch.Tensor],

critical

To maintain CUDA graph compatibility, the stride tensors should be passed as arguments instead of being created inside this function. Please add them back to the function signature.

Suggested change
     a2_scale: Optional[torch.Tensor],
+    ab_strides1: torch.Tensor,
+    ab_strides2: torch.Tensor,
+    c_strides1: torch.Tensor,
+    c_strides2: torch.Tensor,

@@ -329,10 +332,6 @@ def cutlass_moe_fp8(
topk_ids: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,

critical

The stride tensors should be passed as arguments here as well to maintain CUDA graph compatibility.

    w2_scale: torch.Tensor,
    ab_strides1: torch.Tensor,
    ab_strides2: torch.Tensor,
    c_strides1: torch.Tensor,
    c_strides2: torch.Tensor,

@@ -403,10 +391,6 @@ def cutlass_moe_fp8(
out_dtype=a.dtype,
per_act_token_quant=per_act_token,
per_out_ch_quant=per_out_ch,

critical

The stride tensors should be passed to the CutlassExpertsFp8 constructor.

            per_out_ch_quant=per_out_ch,
            ab_strides1=ab_strides1,
            ab_strides2=ab_strides2,
            c_strides1=c_strides1,
            c_strides2=c_strides2,

@@ -859,21 +859,6 @@ def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        layer.w13_weight_scale = torch.nn.Parameter(max_w13_scales,
                                                    requires_grad=False)


critical

The stride tensors should be pre-computed and stored here to be passed to the MoE kernel. This is necessary for CUDA graph compatibility.

Suggested change
        device = layer.w13_weight.device
        # ab_strides1 and c_strides2 are the same
        self.ab_strides1_c_strides2 = torch.full((layer.local_num_experts, ),
                                                 layer.hidden_size,
                                                 device=device,
                                                 dtype=torch.int64)
        self.ab_strides2 = torch.full((layer.local_num_experts, ),
                                      layer.intermediate_size_per_partition,
                                      device=device,
                                      dtype=torch.int64)
        self.c_strides1 = torch.full((layer.local_num_experts, ),
                                     2 * layer.intermediate_size_per_partition,
                                     device=device,
                                     dtype=torch.int64)

@@ -896,10 +881,6 @@ def select_gemm_impl(
moe.in_dtype,
self.input_quant.strategy == QuantizationStrategy.TOKEN,
self.weight_quant.strategy == QuantizationStrategy.CHANNEL,

critical

The stride tensors should be passed to the CutlassExpertsFp8 constructor.

            self.weight_quant.strategy == QuantizationStrategy.CHANNEL,
            ab_strides1=self.ab_strides1_c_strides2,
            ab_strides2=self.ab_strides2,
            c_strides1=self.c_strides1,
            c_strides2=self.ab_strides1_c_strides2,

@@ -968,10 +948,6 @@ def apply(
expert_map=None if self.disable_expert_map else expert_map,
w1_scale=layer.w13_weight_scale,
w2_scale=layer.w2_weight_scale,

critical

The stride tensors should be passed to cutlass_moe_fp8.

                w2_scale=layer.w2_weight_scale,
                ab_strides1=self.ab_strides1_c_strides2,
                ab_strides2=self.ab_strides2,
                c_strides1=self.c_strides1,
                c_strides2=self.ab_strides1_c_strides2,

@houseroad (Collaborator) left a comment


Thanks for reverting the original PR to help recover the trunk health. This will unblock our code sync as well.

@houseroad (Collaborator)

cc: @ElizaWszola, @tlrmchlsmth, @mgoin, @robertgshaw2-redhat: this is blocking our internal work, so we need to revert for now to unblock. Sorry about the inconvenience, and we're happy to help land the fixed version. If a forward fix is easy to land, we are happy to switch to that as well. :-)

@houseroad enabled auto-merge (squash) on July 21, 2025 22:04
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Jul 21, 2025
@houseroad added the llama (Related to Llama models) label on Jul 21, 2025
@mgoin added this to the v0.10.0 milestone on Jul 22, 2025
@mgoin (Member) left a comment


Okay, let's revert for now. Thanks for identifying this.

@simon-mo disabled auto-merge on July 22, 2025 04:48
@simon-mo merged commit e7b2042 into vllm-project:main on Jul 22, 2025
109 of 111 checks passed
minosfuture added a commit to minosfuture/vllm that referenced this pull request Jul 22, 2025
minosfuture added a commit to minosfuture/vllm that referenced this pull request Jul 23, 2025
…CUTLASS MoE (vllm-project#20762) (vllm-project#21334)

This reverts commit e7b2042.

The original PR vllm-project#20762 is:

Authored-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: Ming Yang <minos.future@gmail.com>
zixi-qi pushed a commit to zixi-qi/vllm that referenced this pull request Jul 23, 2025
…UTLASS MoE (vllm-project#20762) (vllm-project#21334)

Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: qizixi <qizixi@meta.com>
LyrisZhong pushed a commit to LyrisZhong/vllm that referenced this pull request Jul 23, 2025
Labels
llama (Related to Llama models), performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Llama4 Maverick runtime error (shuffle_rows)
4 participants