[Refactor][MOE] remove redundant code. #2597
Conversation
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review

This pull request effectively refactors the MoE implementation by removing redundant code related to `alltoall_buffer` and `alltoall_seq`, and by separating the `apply_mlp` logic into its own file. These changes significantly improve the code structure and maintainability.

My main feedback is regarding a new instance of code duplication that seems to have been introduced. The function `fused_experts_moge` in `vllm_ascend/ops/common_fused_moe.py` is identical to `torchair_fused_experts_moge` in the torchair ops file. Consolidating these would further improve the codebase.

Overall, this is a great cleanup effort.
def fused_experts_moge(
    hidden_states: torch.Tensor,
    w1: torch.Tensor,
    w2: torch.Tensor,
    moe_parallel_config: FusedMoEParallelConfig,
    topk_weights: torch.Tensor,
    topk_ids: torch.Tensor,
    top_k: int,
    global_num_experts: int,
    expert_map: torch.Tensor = None,
    apply_router_weight_on_input: bool = False,
) -> torch.Tensor:
    """
    Args:
        hidden_states: Hidden states of shape (num_tokens, hidden_size).
        w1: Expert weights1 of shape (num_experts, intermediate_size * 2, hidden_size).
        w2: Expert weights2 of shape (num_experts, hidden_size, intermediate_size).
        topk_weights: Routing weights of shape (num_tokens, top_k).
        topk_ids: Selected expert IDs of shape (num_tokens, top_k).
        top_k: Number of experts to select.
        expert_map: Expert mapping of shape (num_experts,).

    Returns:
        hidden_states: Hidden states after routing.
    """
    ep_size = moe_parallel_config.ep_size
    local_num_experts = global_num_experts // ep_size
    local_num_group = top_k // ep_size

    if apply_router_weight_on_input:
        assert (topk_weights.dim() == 2
                ), "`topk_weights` should be in shape (num_tokens, topk)"
        _, topk = topk_weights.shape
        assert (
            topk == 1
        ), "Only support topk=1 when `apply_router_weight_on_input` is True"
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)

    # Sort the flattened (token, expert) pairs by expert id so that tokens
    # routed to the same expert become contiguous for the grouped matmul.
    bsz, _ = hidden_states.shape
    flatten_topk_ids = topk_ids.view(-1)
    sorted_topk_ids = torch.argsort(flatten_topk_ids.float())
    sorted_topk_ids = sorted_topk_ids.to(torch.int32)
    sorted_hidden_states = hidden_states.index_select(
        0, sorted_topk_ids // local_num_group)

    # Per-expert token counts -> cumulative group boundaries (group_list), plus
    # the routing weights reordered to match the sorted tokens.
    experts_id = torch.arange(0,
                              local_num_experts,
                              dtype=topk_ids.dtype,
                              device=topk_ids.device)
    num_tokens_per_expert = (flatten_topk_ids.unsqueeze(-1) == experts_id).to(
        torch.float32).sum(0)
    topk_scales = topk_weights.view(-1).index_select(
        0, sorted_topk_ids).unsqueeze(-1)
    group_list = num_tokens_per_expert.cumsum(dim=0).to(torch.int64)

    # Gate/up projection as a single grouped matmul over all local experts.
    w1 = w1.transpose(1, 2)
    gate_up_out = torch_npu.npu_grouped_matmul(
        x=[sorted_hidden_states],
        weight=[w1],
        split_item=2,
        group_list_type=0,
        group_type=0,
        group_list=group_list,
    )[0]

    # SwiGLU activation (310P needs the fp32 -> fp16 round trip), then scale by
    # the per-token routing weights.
    if is_310p():
        gate_up_out = torch_npu.npu_swiglu(gate_up_out.to(torch.float32)).to(
            torch.float16)
    else:
        gate_up_out = torch_npu.npu_swiglu(gate_up_out)
    gate_up_out *= topk_scales

    # Down projection back to hidden_size, again as one grouped matmul.
    w2 = w2.transpose(1, 2)
    down_out_list = torch_npu.npu_grouped_matmul(
        x=[gate_up_out],
        weight=[w2],
        split_item=2,
        group_list_type=0,
        group_type=0,
        group_list=group_list,
    )[0]

    # Undo the expert sort and sum the top-k expert outputs for each token.
    unsorted_topk_ids = torch.argsort(sorted_topk_ids.float()).to(torch.int32)
    unsorted_hidden_states = down_out_list.index_select(0, unsorted_topk_ids)
    final_hidden_states = unsorted_hidden_states.reshape(
        bsz, top_k // ep_size, -1).sum(1)

    return final_hidden_states
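For illustration, a minimal invocation sketch of the function above. It assumes an Ascend NPU environment with `torch_npu` installed; the shapes, dtypes, and the stand-in parallel config are placeholders rather than values from this PR.

```python
# Hypothetical usage sketch; requires an Ascend NPU and torch_npu.
from types import SimpleNamespace

import torch
import torch_npu  # noqa: F401  (registers the "npu" device)

num_tokens, hidden_size, intermediate_size = 8, 128, 256
global_num_experts, top_k = 4, 2

hidden_states = torch.randn(num_tokens, hidden_size, dtype=torch.float16, device="npu")
w1 = torch.randn(global_num_experts, intermediate_size * 2, hidden_size,
                 dtype=torch.float16, device="npu")
w2 = torch.randn(global_num_experts, hidden_size, intermediate_size,
                 dtype=torch.float16, device="npu")
topk_weights = torch.rand(num_tokens, top_k, dtype=torch.float16, device="npu")
topk_ids = torch.randint(0, global_num_experts, (num_tokens, top_k),
                         dtype=torch.int32, device="npu")

# Stand-in for FusedMoEParallelConfig; only ep_size is read by the function above.
moe_parallel_config = SimpleNamespace(ep_size=1)

out = fused_experts_moge(
    hidden_states, w1, w2,
    moe_parallel_config=moe_parallel_config,
    topk_weights=topk_weights, topk_ids=topk_ids,
    top_k=top_k, global_num_experts=global_num_experts,
)
assert out.shape == (num_tokens, hidden_size)
```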
This function `fused_experts_moge` is identical to `torchair_fused_experts_moge` in `vllm_ascend/torchair/ops/torchair_fused_moe.py`. To avoid code duplication and improve maintainability, consider using this common function in the torchair implementation as well. This would involve removing `torchair_fused_experts_moge` and updating its call site to use this function.
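A sketch of the suggested consolidation (module paths follow the files named in this review; the compatibility alias is optional and only illustrative):

```python
# vllm_ascend/torchair/ops/torchair_fused_moe.py (sketch)
# Re-use the common implementation instead of keeping a duplicate copy.
from vllm_ascend.ops.common_fused_moe import fused_experts_moge

# Optional shim so existing call sites keep working while they are migrated:
torchair_fused_experts_moge = fused_experts_moge
```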
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
* `expert_tensor_parallel_size` in `additional_config` is removed now, and the EP and TP are aligned with vLLM now. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
* Add `VLLM_ASCEND_MLA_PA` in environ variables; use this to enable the MLA paged attention operator for DeepSeek MLA decode.
* Add `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` in environ variables to enable the `MatmulAllReduce` fusion kernel when tensor parallel is enabled. This feature is supported on A2, and eager mode will get better performance.
* Add `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` in environ variables to enable MoE all2all seq; this provides a basic framework on the basis of alltoall for easy expansion.
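For reference, the flags above are plain environment variables read by vllm_ascend, so they have to be set before the engine starts; a hedged sketch (using "1" as the enabling value is an assumption):

```python
# Illustrative only; the accepted values are an assumption, not taken from this PR.
import os

os.environ["VLLM_ASCEND_MLA_PA"] = "1"
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"
```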
do not remove release note
ok
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2597      +/-   ##
==========================================
+ Coverage   72.39%   73.34%   +0.95%
==========================================
  Files         146      149       +3
  Lines       21621    21451     -170
==========================================
+ Hits        15652    15733      +81
+ Misses       5969     5718     -251

View full report in Codecov by Sentry.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
### What this PR does / why we need it?
As vllm sets `cudagraph_mode` to `NONE` before `check_and_update_config` in the post init of `VllmConfig` (https://github.com/vllm-project/vllm/blob/5da4f5d857933329aaca779e3a81f1385c84e34a/vllm/config/__init__.py#L3630), `cudagraph_mode` is never `None` at that point, so we must remove this check and add it back once the related adaptation in vllm is done.

Part of vllm-project#2577; will add the e2e test on applying reply after the CI refactor is done.

### How was this patch tested?
CI passed with existing tests.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@f48a9af

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This patch also supports v0.10.1.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- test 0.10.1: vllm-project#2583

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@321938e

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
The determination of attention state, padding, and other forward metadata has been moved to an earlier stage within the input preparation process. This change enables us to utilize a single all-reduce operation, maximizing synchronization efficiency as early as possible.

The logic for synchronizing metadata (such as the number of tokens, prefill status, and DBO status) across data parallel (DP) ranks has now been unified and simplified.

For performance, the all-reduce operation has been switched from the `gloo` backend to the `npu` backend, which results in a reduction of several milliseconds per step (**approximately 10% performance gain for TPOT!**).

Additionally, the multi-DP server hang issue has been resolved, ensuring no more hangs occur when `num_requests < dp_size`. Alas, a relief.

Finally, the miscalculated memory usage issue has been addressed by removing the unnecessary `DummyCommImpl`, allowing the system to use the real communication method when determining available memory.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Maybe we should add a test case for the multi-DP online server? @MengqingCao

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@c5d004a

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
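A minimal sketch of the single all-reduce metadata sync described in the commit above; the packing and names are illustrative, not the actual code of that change.

```python
import torch
import torch.distributed as dist

def sync_dp_metadata(num_tokens: int, with_prefill: bool, enable_dbo: bool,
                     dp_group, device):
    # Pack the per-rank scalars into one tensor so a single collective
    # (on the device communicator rather than gloo) synchronizes everything.
    packed = torch.tensor([num_tokens, int(with_prefill), int(enable_dbo)],
                          dtype=torch.int32, device=device)
    dist.all_reduce(packed, op=dist.ReduceOp.MAX, group=dp_group)
    max_num_tokens, any_prefill, any_dbo = packed.tolist()
    return max_num_tokens, bool(any_prefill), bool(any_dbo)
```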
### What this PR does / why we need it?
Add e2e CI test for A3.

### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@11a7faf

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
This PR introduces LMhead tensor model parallel to decrease memory consumption and improve TPOT performance. It supports both eager mode and graph mode.

In a deepseek r1 w8a8 PD disaggregated Decode instance, using pure DP with lmhead_tensor_parallel_size = 8, we get a 1 ms TPOT optimization and save 1.48 GB of NPU memory per rank.

performance data:
<img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" />

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.

| Name | Effect | Required | Type | Constraints |
| :--- | :----- | :------- | :--- | :---------- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces | No | int | Default value is None; once this value is set, the feature is enabled. vocab_size must be divisible by this value. |

example
`--additional_config={"lmhead_tensor_parallel_size": 8}`

### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@de533ab

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zhangzihang <zzh_201018@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
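A hedged offline usage sketch for the option described above; the model name is illustrative, and the exact way `additional_config` reaches the engine may differ by entry point.

```python
from vllm import LLM

# Assumption: additional_config is forwarded to the vllm-ascend platform config.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # illustrative model name
    additional_config={"lmhead_tensor_parallel_size": 8},
)
```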
… (vllm-project#2626)

### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of `maybe_padded_num_tokens` under all circumstances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for daily test for TorchAir.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@006477e

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
… (vllm-project#2627)

### What this PR does / why we need it?
In a mixed-precision scenario, `quant_config` is not None, but MoE needs to perform unquantized computation; however, quantized computation is currently being used. Therefore, we move the `with_quant` logic into `forward` to avoid misjudging in mixed-precision scenarios.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@98ac0cb

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
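A rough sketch of the idea described in that commit: decide the quantized vs. unquantized path per layer at forward time instead of from the global `quant_config`. Class and attribute names are made up for illustration only.

```python
import torch

class MoEForwardDispatchSketch(torch.nn.Module):
    def __init__(self, w13_weight: torch.Tensor, w2_weight: torch.Tensor):
        super().__init__()
        self.w13_weight = torch.nn.Parameter(w13_weight, requires_grad=False)
        self.w2_weight = torch.nn.Parameter(w2_weight, requires_grad=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # In mixed-precision checkpoints some MoE layers keep float weights even
        # though a model-level quant_config is set, so inspect the actual
        # weights here rather than trusting the global config.
        with_quant = self.w13_weight.dtype in (torch.int8, torch.int32)
        if with_quant:
            ...  # quantized grouped-matmul path
        else:
            ...  # unquantized path, e.g. fused_experts_moge above
        return hidden_states
```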
…n when requests are canceled. (vllm-project#2426)

### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are canceled.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By CI

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@006477e

---------

Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Force-pushed from 9931a84 to 5d0528f.
There is a lot of redundant MoE-related code here, and the structure is not very clear.
We did the following things: