Conversation


@hwhaokun commented Sep 24, 2025

What this PR does / why we need it?

The AllgatherEP path under Torchair currently uses npu_grouped_matmul_finalize_routing, which introduces the aclnnFloorDivide operator and a bubble caused by D2H synchronization.
This PR therefore replaces the fused call with separate grouped_matmul + finalize_routing operators.
before: [screenshot]
after: [screenshot]
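In code terms, the change boils down to the two-op pattern below. This is only a sketch mirroring the diff quoted in the review comments further down; tensor names such as w2, w2_scale, pertoken_scale, expert_tokens, topk_weights, expanded_x_idx, and topk_ids come from that diff, and the surrounding MoE setup is assumed.

    # Sketch of the replacement: the fused npu_grouped_matmul_finalize_routing call
    # is split into a grouped matmul followed by a separate finalize-routing step,
    # so the aclnnFloorDivide op and its D2H synchronization are no longer triggered.
    hidden_states = torch_npu.npu_grouped_matmul(
        [hidden_states],
        [w2],
        scale=[w2_scale],
        per_token_scale=[pertoken_scale],
        bias=None,
        group_list=expert_tokens,
        split_item=3,
        output_dtype=torch.bfloat16,
        group_type=0,
        group_list_type=group_list_type)[0]
    final_hidden_states = torch_npu.npu_moe_finalize_routing(
        hidden_states.unsqueeze(1),
        None, None, None,
        topk_weights.to(torch.bfloat16),
        expanded_x_idx,
        topk_ids,
        drop_pad_mode=3)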

Does this PR introduce any user-facing change?

No

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request replaces the npu_grouped_matmul_finalize_routing operator with a sequence of npu_grouped_matmul and npu_moe_finalize_routing calls. This change is intended to improve performance by avoiding an aclnnFloorDivide operator and a D2H synchronization bubble. My review focuses on improving the code quality of these changes. I've identified the use of several magic numbers that should be replaced with named constants to improve readability and maintainability. I also found a redundant type conversion and suggest using keyword arguments for better clarity in one of the new function calls.

Comment on lines +607 to +617
hidden_states = torch_npu.npu_grouped_matmul(
    [hidden_states],
    [w2],
    scale=[w2_scale],
    per_token_scale=[pertoken_scale],
    bias=None,
    group_list=expert_tokens,
    split_item=3,
    output_dtype=torch.bfloat16,
    group_type=0,
    group_list_type=group_list_type)[0]

high

This change introduces several magic numbers for operator parameters, such as group_list_type = 1, row_idx_type = 0 (lines 559-560), and in the npu_grouped_matmul call: split_item=3, group_type=0. Another magic number drop_pad_mode=3 is used in the subsequent npu_moe_finalize_routing call.

Using magic numbers makes the code difficult to read and maintain, as their meaning is not clear without consulting external documentation. This increases the risk of introducing bugs during future modifications.

It is strongly recommended to define these values as named constants with descriptive names. For example:

# At the top of the file or function
GROUP_LIST_TYPE_COUNT = 1
ROW_IDX_TYPE_MODE_0 = 0  # Replace with a descriptive name
GROUP_TYPE_DEFAULT = 0
SPLIT_ITEM_SOME_MODE = 3 # Replace with a descriptive name

# In the function
group_list_type = GROUP_LIST_TYPE_COUNT
row_idx_type = ROW_IDX_TYPE_MODE_0
# ... then in the call
torch_npu.npu_grouped_matmul(..., split_item=SPLIT_ITEM_SOME_MODE, group_type=GROUP_TYPE_DEFAULT, ...)

This will make the code self-documenting and safer to modify.
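As a small alternative sketch of the same idea (not part of this PR; the member names are placeholders, not documented mode names), the values can also be grouped into IntEnum classes so they stay self-documenting while still passing plain integers to torch_npu:

    from enum import IntEnum

    class GroupListType(IntEnum):
        # Placeholder member; 1 matches the GROUP_LIST_TYPE_COUNT suggestion above.
        COUNT = 1

    class GroupedMatmulMode(IntEnum):
        # Placeholder names for the split_item / group_type values used in this PR;
        # consult the torch_npu documentation for their exact semantics.
        SPLIT_ITEM_MODE_3 = 3
        GROUP_TYPE_DEFAULT = 0

    group_list_type = GroupListType.COUNT  # IntEnum members are ints, so call sites stay unchanged
    # torch_npu.npu_grouped_matmul(..., split_item=GroupedMatmulMode.SPLIT_ITEM_MODE_3,
    #                              group_type=GroupedMatmulMode.GROUP_TYPE_DEFAULT,
    #                              group_list_type=group_list_type, ...)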

Comment on lines +619 to +627
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None,
    None,
    None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)

high

The call to npu_moe_finalize_routing can be improved for clarity and performance.

  1. Redundant Type Conversion: The hidden_states tensor is already of type torch.bfloat16 due to the output_dtype=torch.bfloat16 parameter in the preceding npu_grouped_matmul call. Therefore, the .to(torch.bfloat16) on hidden_states is redundant and can be removed.

  2. Clarity of unsqueeze(1): The unsqueeze(1) operation is not self-explanatory. Adding a brief comment explaining why this is required would significantly improve code readability.

  3. Keyword Arguments: Using keyword arguments instead of positional arguments would make the code more readable and less prone to errors, especially given the multiple None arguments.

Suggested change
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None,
    None,
    None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1),  # unsqueeze(1) is required by this op with drop_pad_mode=3
    skip1=None,
    skip2=None,
    bias=None,
    scales=topk_weights.to(torch.bfloat16),
    expanded_src_to_dst_row=expanded_x_idx,
    export_for_source_row=topk_ids,
    drop_pad_mode=3)

Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
