[feat]remove npu_grouped_matmul_finalize_routing on torchair #3160
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
- If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request replaces the npu_grouped_matmul_finalize_routing operator with a sequence of npu_grouped_matmul and npu_moe_finalize_routing calls. This change is intended to improve performance by avoiding an aclnnFloorDivide operator and a D2H synchronization bubble. My review focuses on improving the code quality of these changes. I've identified the use of several magic numbers that should be replaced with named constants to improve readability and maintainability. I also found a redundant type conversion and suggest using keyword arguments for better clarity in one of the new function calls.
```diff
+hidden_states = torch_npu.npu_grouped_matmul(
     [hidden_states],
     [w2],
     scale=[w2_scale],
+    per_token_scale=[pertoken_scale],
     bias=None,
-    pertoken_scale=pertoken_scale.view(-1),
-    group_list=expert_tokens,
-    shared_input=share_input,
-    logit=sorted_topk_weight.to(torch.float32),
-    row_index=row_index,
-    output_bs=batch_size).to(torch.bfloat16)
+    group_list=expert_tokens,
+    split_item=3,
+    output_dtype=torch.bfloat16,
+    group_type=0,
+    group_list_type=group_list_type)[0]
```
This change introduces several magic numbers for operator parameters, such as group_list_type = 1 and row_idx_type = 0 (lines 559-560), and, in the npu_grouped_matmul call, split_item=3 and group_type=0. Another magic number, drop_pad_mode=3, is used in the subsequent npu_moe_finalize_routing call.
Using magic numbers makes the code difficult to read and maintain, as their meaning is not clear without consulting external documentation. This increases the risk of introducing bugs during future modifications.
It is strongly recommended to define these values as named constants with descriptive names. For example:
```python
# At the top of the file or function
GROUP_LIST_TYPE_COUNT = 1
ROW_IDX_TYPE_MODE_0 = 0  # Replace with a descriptive name
GROUP_TYPE_DEFAULT = 0
SPLIT_ITEM_SOME_MODE = 3  # Replace with a descriptive name

# In the function
group_list_type = GROUP_LIST_TYPE_COUNT
row_idx_type = ROW_IDX_TYPE_MODE_0

# ... then in the call
torch_npu.npu_grouped_matmul(..., split_item=SPLIT_ITEM_SOME_MODE, group_type=GROUP_TYPE_DEFAULT, ...)
```
This will make the code self-documenting and safer to modify.
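As a hedged alternative sketch (not taken from the torch_npu documentation), the same mode values could also be grouped into `enum.IntEnum` classes so that each numeric mode carries a name and can be reused across call sites; the member names below are illustrative placeholders and should be checked against the operator documentation:

```python
from enum import IntEnum


class GroupListType(IntEnum):
    """Illustrative names for the group_list_type modes used in this file."""
    CUMSUM = 0  # placeholder name; verify against the operator documentation
    COUNT = 1   # placeholder name; verify against the operator documentation


class SplitItem(IntEnum):
    """Illustrative name for the split_item mode passed to npu_grouped_matmul."""
    MODE_3 = 3  # placeholder name; verify against the operator documentation


# Usage sketch (IntEnum members are ints, so they can be passed directly):
# torch_npu.npu_grouped_matmul(..., split_item=SplitItem.MODE_3,
#                              group_list_type=GroupListType.COUNT, ...)
```

Either way (module-level constants or enums), the reviewer's point stands: naming each mode once keeps every call site self-documenting.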
```python
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None,
    None,
    None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)
```
The call to npu_moe_finalize_routing can be improved for clarity and performance.

- Redundant Type Conversion: The hidden_states tensor is already of type torch.bfloat16 due to the output_dtype=torch.bfloat16 parameter in the preceding npu_grouped_matmul call. Therefore, the .to(torch.bfloat16) on hidden_states is redundant and can be removed.
- Clarity of unsqueeze(1): The unsqueeze(1) operation is not self-explanatory. Adding a brief comment explaining why this is required would significantly improve code readability.
- Keyword Arguments: Using keyword arguments instead of positional arguments would make the code more readable and less prone to errors, especially given the multiple None arguments.
Suggested change:
```diff
-final_hidden_states = torch_npu.npu_moe_finalize_routing(
-    hidden_states.unsqueeze(1).to(torch.bfloat16),
-    None,
-    None,
-    None,
-    topk_weights.to(torch.bfloat16),
-    expanded_x_idx,
-    topk_ids,
-    drop_pad_mode=3)
+final_hidden_states = torch_npu.npu_moe_finalize_routing(
+    hidden_states.unsqueeze(1),  # unsqueeze(1) is required by this op with drop_pad_mode=3
+    skip1=None,
+    skip2=None,
+    bias=None,
+    scales=topk_weights.to(torch.bfloat16),
+    expanded_src_to_dst_row=expanded_x_idx,
+    export_for_source_row=topk_ids,
+    drop_pad_mode=3)
```
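On the first point above, a minimal CPU-only sketch in plain PyTorch (no torch_npu involved) illustrating that `.to()` on a tensor that already has the target dtype returns the tensor itself, so dropping the redundant cast is a readability cleanup rather than a behavioural change:

```python
import torch

x = torch.ones(4, dtype=torch.bfloat16)
y = x.to(torch.bfloat16)  # dtype (and device) already match, so no conversion happens

assert y is x  # PyTorch returns the same tensor object when no cast is needed
```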
What this PR does / why we need it?
The AllgatherEP path under Torchair currently uses the npu_grouped_matmul_finalize_routing operator, which introduces the aclnnFloorDivide operator and a bubble caused by D2H synchronization.


Therefore, this PR replaces it with the npu_grouped_matmul + npu_moe_finalize_routing operator sequence.
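For reference, a condensed sketch of the new two-operator sequence, mirroring the diff quoted in the review above (the tensor names such as w2, w2_scale, pertoken_scale, expert_tokens, expanded_x_idx come from the PR; this only runs in an Ascend NPU environment with torch_npu installed):

```python
import torch
import torch_npu  # Ascend extension; assumes an NPU environment

# 1) Grouped matmul over the per-expert token groups
#    (replaces the fused npu_grouped_matmul_finalize_routing call).
hidden_states = torch_npu.npu_grouped_matmul(
    [hidden_states],
    [w2],
    scale=[w2_scale],
    per_token_scale=[pertoken_scale],
    bias=None,
    group_list=expert_tokens,
    split_item=3,
    output_dtype=torch.bfloat16,
    group_type=0,
    group_list_type=group_list_type)[0]

# 2) Combine the per-expert outputs back into per-token hidden states.
final_hidden_states = torch_npu.npu_moe_finalize_routing(
    hidden_states.unsqueeze(1).to(torch.bfloat16),
    None, None, None,
    topk_weights.to(torch.bfloat16),
    expanded_x_idx,
    topk_ids,
    drop_pad_mode=3)
```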
before:
after:
Does this PR introduce any user-facing change?
No
How was this patch tested?