
Conversation

linfeng-yuan
Collaborator

@linfeng-yuan linfeng-yuan commented Sep 1, 2025

What this PR does / why we need it?

This PR fixes two problems that occur when multistream_moe is enabled in torchair graph mode:

  1. check against TorchairAscendW8A8DynamicFusedMoEMethod instead of the incorrect AscendW8A8DynamicFusedMoEMethod
  2. chunk mc2_mask regardless of whether replace_allreduce is True or False in the forward function of TorchairAscendFusedMoE (a sketch of both fixes follows this list)
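
A minimal sketch of the intended behavior, using simplified helper names (anything not named in the list above is illustrative; the real forward path in torchair_fused_moe.py also handles padding and the shared-expert case):

```python
import torch


def uses_torchair_w8a8_dynamic(quant_method) -> bool:
    # Fix 1 (illustrative check): the multistream MoE path must match the
    # torchair-specific class TorchairAscendW8A8DynamicFusedMoEMethod, not the
    # similarly named AscendW8A8DynamicFusedMoEMethod.
    return type(quant_method).__name__ == "TorchairAscendW8A8DynamicFusedMoEMethod"


def chunk_moe_inputs(hidden_states, router_logits, mc2_mask,
                     tp_size, tp_rank, replace_allreduce):
    # Fix 2 (illustrative ordering): mc2_mask is chunked per TP rank whether or
    # not replace_allreduce is set, so every downstream branch sees a mask whose
    # shape matches its slice of tokens.
    if tp_size > 1:
        mc2_mask = torch.tensor_split(mc2_mask, tp_size, dim=0)[tp_rank]
    if not replace_allreduce and tp_size > 1:
        hidden_states = torch.tensor_split(hidden_states, tp_size, dim=0)[tp_rank]
        router_logits = torch.tensor_split(router_logits, tp_size, dim=0)[tp_rank]
    return hidden_states, router_logits, mc2_mask
```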

Does this PR introduce any user-facing change?

No.

How was this patch tested?

End-to-end vLLM serving with multistream_moe enabled (DP32 TP1 EP32); see the illustrative launch sketch below.
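
For reference, a sketch of how such a setup could be launched through vLLM's offline Python API, assuming vllm-ascend's torchair_graph_config keys (the key names and model choice here are assumptions, not taken from this PR; the actual test used online serving across DP32 TP1 EP32):

```python
from vllm import LLM

# Illustrative only: enable torchair graph mode with multistream MoE through
# vllm-ascend's additional_config. Exact config keys may differ by version.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed model for illustration
    tensor_parallel_size=1,
    enable_expert_parallel=True,
    additional_config={
        "torchair_graph_config": {
            "enabled": True,
            "enable_multistream_moe": True,
        },
    },
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```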

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses recompilation and multistream MoE issues in torchair graph mode. The changes involve refactoring import paths for quantization methods, adjusting logic for tensor padding and splitting in torchair_fused_moe, and adding a running_in_graph flag in torchair_mla. My review focuses on improving the clarity and efficiency of the modified logic in torchair_fused_moe.py. I've identified a redundant conditional check and a duplicate function call, and provided a suggestion to refactor the code for better readability and performance.

Comment on lines 1157 to 1207
            if tp_size > 1:
                tp_rank = get_tensor_model_parallel_rank()
                if not self.enable_shared_expert_dp:
                    chunk_hidden_states = torch.tensor_split(hidden_states,
                                                             tp_size,
                                                             dim=0)
                    chunk_router_logits = torch.tensor_split(router_logits,
                                                             tp_size,
                                                             dim=0)
                    hidden_states = chunk_hidden_states[tp_rank]
                    router_logits = chunk_router_logits[tp_rank]

                chunk_mc2_mask = torch.tensor_split(mc2_mask, tp_size, dim=0)
                mc2_mask = chunk_mc2_mask[tp_rank]
            if not replace_allreduce:
                if fused_moe_state in {FusedMoEState.MC2}:
                    padding_size = forward_context.padded_num_tokens
                else:
                    # TODO: Determine if we can remove the padding
                    padding_size = tp_size
                if num_tokens < padding_size and not self.enable_shared_expert_dp:
                    hidden_states = nn.functional.pad(
                        hidden_states, (0, 0, 0, padding_size - num_tokens))
                    router_logits = nn.functional.pad(
                        router_logits, (0, 0, 0, padding_size - num_tokens))
                if tp_size > 1:
                    tp_rank = get_tensor_model_parallel_rank()
                    if not self.enable_shared_expert_dp:
                        chunk_hidden_states = torch.tensor_split(hidden_states,
                                                                 tp_size,
                                                                 dim=0)
                        chunk_router_logits = torch.tensor_split(router_logits,
                                                                 tp_size,
                                                                 dim=0)
                        hidden_states = chunk_hidden_states[tp_rank]
                        router_logits = chunk_router_logits[tp_rank]

Severity: high

The logic in this block can be simplified to improve readability and remove redundant operations. Specifically:

  1. The if not replace_allreduce: check at line 1161 is redundant because it's already part of the outer condition at line 1156.
  2. get_tensor_model_parallel_rank() is called twice if tp_size > 1, which is inefficient. It should be called only once and the result should be reused.

I suggest refactoring this block to remove the nested if and the duplicate function call.

            if tp_size > 1:
                tp_rank = get_tensor_model_parallel_rank()
                chunk_mc2_mask = torch.tensor_split(mc2_mask, tp_size, dim=0)
                mc2_mask = chunk_mc2_mask[tp_rank]

            if fused_moe_state in {FusedMoEState.MC2}:
                padding_size = forward_context.padded_num_tokens
            else:
                # TODO: Determine if we can remove the padding
                padding_size = tp_size
            if num_tokens < padding_size and not self.enable_shared_expert_dp:
                hidden_states = nn.functional.pad(
                    hidden_states, (0, 0, 0, padding_size - num_tokens))
                router_logits = nn.functional.pad(
                    router_logits, (0, 0, 0, padding_size - num_tokens))
            if tp_size > 1:
                if not self.enable_shared_expert_dp:
                    chunk_hidden_states = torch.tensor_split(hidden_states,
                                                             tp_size,
                                                             dim=0)
                    chunk_router_logits = torch.tensor_split(router_logits,
                                                             tp_size,
                                                             dim=0)
                    hidden_states = chunk_hidden_states[tp_rank]
                    router_logits = chunk_router_logits[tp_rank]


github-actions bot commented Sep 1, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@linfeng-yuan linfeng-yuan force-pushed the fix_torchair_recompiles_and_multistream branch from a2f7755 to 84ea6ff on September 1, 2025 12:51

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 56.25000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.49%. Comparing base (0df059f) to head (84ea6ff).

Files with missing lines | Patch % | Lines
vllm_ascend/torchair/ops/torchair_fused_moe.py | 53.33% | 7 Missing ⚠️

❌ Your patch check has failed because the patch coverage (56.25%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2681      +/-   ##
==========================================
- Coverage   73.49%   73.49%   -0.01%     
==========================================
  Files         151      151              
  Lines       21927    21931       +4     
==========================================
+ Hits        16116    16118       +2     
- Misses       5811     5813       +2     
Flag | Coverage Δ
unittests | 73.49% <56.25%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@linfeng-yuan linfeng-yuan force-pushed the fix_torchair_recompiles_and_multistream branch 2 times, most recently from 398f9d5 to 074c9a8 on September 3, 2025 05:39
@jianzs
Collaborator

jianzs commented Sep 16, 2025

@linfeng-yuan This PR seems important. How's the progress?

…orchair graph mode

Signed-off-by: linfeng-yuan <1102311262@qq.com>
@linfeng-yuan linfeng-yuan force-pushed the fix_torchair_recompiles_and_multistream branch from 074c9a8 to a7a37d6 on September 17, 2025 08:51
@linfeng-yuan linfeng-yuan changed the title from "[bugfix][torchair] fix recompiles and multistream_moe problems with torchair graph mode" to "[bugfix][torchair] fix multistream_moe problems with torchair graph mode" on Sep 17, 2025
@linfeng-yuan linfeng-yuan changed the title from "[bugfix][torchair] fix multistream_moe problems with torchair graph mode" to "[bugfix][torchair] fix multistream_moe problems in torchair graph mode" on Sep 17, 2025
@linfeng-yuan
Collaborator Author

linfeng-yuan commented Sep 17, 2025

@linfeng-yuan This PR seems important. How's the progress?

I've rebased the code and CI has passed. Currently, the multistream_moe with tp_size > 1 scenario still hits a shape-mismatch problem with the v1 scheduler; @hust17yixuan is working on this and will submit another PR.

@jianzs
Collaborator

jianzs commented Sep 17, 2025

@linfeng-yuan This PR seems important. How's the progress?

I've rebased the code and CI has passed. Currently, the multistream_moe with tp_size > 1 scenario still hits a shape-mismatch problem with the v1 scheduler; @hust17yixuan is working on this and will submit another PR.

We've found this problem too. We can discuss solutions if needed.

@jianzs jianzs added the ready (read for review) and ready-for-test (start test by label for PR) labels on Sep 18, 2025
@jianzs
Collaborator

jianzs commented Sep 18, 2025

@linfeng-yuan ready to merge? cc @Yikun @wangxiyuan

@jianzs jianzs merged commit 79a910e into vllm-project:main Sep 18, 2025
37 of 46 checks passed