Conversation


@hwhaokun commented Sep 24, 2025

What this PR does / why we need it?

The AllgatherEP path under Torchair currently uses npu_grouped_matmul_finalize_routing, which introduces the aclnnFloorDivide operator and a bubble caused by D2H synchronization.
This PR therefore replaces the fused call with separate grouped_matmul + finalize_routing operators.
before: [screenshot]
after: [screenshot]
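In code terms, the change boils down to the two-op pattern below. This is only a sketch mirroring the diff quoted in the review comments further down; tensor names such as w2, w2_scale, pertoken_scale, expert_tokens, topk_weights, expanded_x_idx, and topk_ids come from that diff, and the surrounding MoE setup is assumed.

    # Sketch of the replacement: the fused npu_grouped_matmul_finalize_routing call
    # is split into a grouped matmul followed by a separate finalize-routing step,
    # so the aclnnFloorDivide op and its D2H synchronization are no longer triggered.
    hidden_states = torch_npu.npu_grouped_matmul(
        [hidden_states],
        [w2],
        scale=[w2_scale],
        per_token_scale=[pertoken_scale],
        bias=None,
        group_list=expert_tokens,
        split_item=3,
        output_dtype=torch.bfloat16,
        group_type=0,
        group_list_type=group_list_type)[0]
    final_hidden_states = torch_npu.npu_moe_finalize_routing(
        hidden_states.unsqueeze(1),
        None, None, None,
        topk_weights.to(torch.bfloat16),
        expanded_x_idx,
        topk_ids,
        drop_pad_mode=3)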

Does this PR introduce any user-facing change?

No

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request replaces the npu_grouped_matmul_finalize_routing operator with a sequence of npu_grouped_matmul and npu_moe_finalize_routing calls. This change is intended to improve performance by avoiding an aclnnFloorDivide operator and a D2H synchronization bubble. My review focuses on improving the code quality of these changes. I've identified the use of several magic numbers that should be replaced with named constants to improve readability and maintainability. I also found a redundant type conversion and suggest using keyword arguments for better clarity in one of the new function calls.

Comment on lines +607 to +617
hidden_states = torch_npu.npu_grouped_matmul(
    [hidden_states],
    [w2],
    scale=[w2_scale],
    per_token_scale=[pertoken_scale],
    bias=None,
    group_list=expert_tokens,
    split_item=3,
    output_dtype=torch.bfloat16,
    group_type=0,
    group_list_type=group_list_type)[0]

high

This change introduces several magic numbers for operator parameters, such as group_list_type = 1, row_idx_type = 0 (lines 559-560), and in the npu_grouped_matmul call: split_item=3, group_type=0. Another magic number drop_pad_mode=3 is used in the subsequent npu_moe_finalize_routing call.

Using magic numbers makes the code difficult to read and maintain, as their meaning is not clear without consulting external documentation. This increases the risk of introducing bugs during future modifications.

It is strongly recommended to define these values as named constants with descriptive names. For example:

# At the top of the file or function
GROUP_LIST_TYPE_COUNT = 1
ROW_IDX_TYPE_MODE_0 = 0  # Replace with a descriptive name
GROUP_TYPE_DEFAULT = 0
SPLIT_ITEM_SOME_MODE = 3 # Replace with a descriptive name

# In the function
group_list_type = GROUP_LIST_TYPE_COUNT
row_idx_type = ROW_IDX_TYPE_MODE_0
# ... then in the call
torch_npu.npu_grouped_matmul(..., split_item=SPLIT_ITEM_SOME_MODE, group_type=GROUP_TYPE_DEFAULT, ...)

This will make the code self-documenting and safer to modify.
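As a small alternative sketch of the same idea (not part of this PR; the member names are placeholders, not documented mode names), the values can also be grouped into IntEnum classes so they stay self-documenting while still passing plain integers to torch_npu:

    from enum import IntEnum

    class GroupListType(IntEnum):
        # Placeholder member; 1 matches the GROUP_LIST_TYPE_COUNT suggestion above.
        COUNT = 1

    class GroupedMatmulMode(IntEnum):
        # Placeholder names for the split_item / group_type values used in this PR;
        # consult the torch_npu documentation for their exact semantics.
        SPLIT_ITEM_MODE_3 = 3
        GROUP_TYPE_DEFAULT = 0

    group_list_type = GroupListType.COUNT  # IntEnum members are ints, so call sites stay unchanged
    # torch_npu.npu_grouped_matmul(..., split_item=GroupedMatmulMode.SPLIT_ITEM_MODE_3,
    #                              group_type=GroupedMatmulMode.GROUP_TYPE_DEFAULT,
    #                              group_list_type=group_list_type, ...)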

Comment on lines +619 to +627
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None,
    None,
    None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)

high

The call to npu_moe_finalize_routing can be improved for clarity and performance.

  1. Redundant Type Conversion: The hidden_states tensor is already of type torch.bfloat16 due to the output_dtype=torch.bfloat16 parameter in the preceding npu_grouped_matmul call. Therefore, the .to(torch.bfloat16) on hidden_states is redundant and can be removed.

  2. Clarity of unsqueeze(1): The unsqueeze(1) operation is not self-explanatory. Adding a brief comment explaining why this is required would significantly improve code readability.

  3. Keyword Arguments: Using keyword arguments instead of positional arguments would make the code more readable and less prone to errors, especially given the multiple None arguments.

Suggested change
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None,
    None,
    None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1),  # unsqueeze(1) is required by this op with drop_pad_mode=3
    skip1=None,
    skip2=None,
    bias=None,
    scales=topk_weights.to(torch.bfloat16),
    expanded_src_to_dst_row=expanded_x_idx,
    export_for_source_row=topk_ids,
    drop_pad_mode=3)

Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
