
Conversation

weijinqian0 (Collaborator) commented Aug 28, 2025

There is a lot of redundant MoE-related code here, and the structure is not very clear.
We did the following (a rough sketch of the resulting layout follows this list):

  1. Moved the relatively independent `apply_mlp` code into a separate file.
  2. Removed the `alltoall_buffer` and `alltoall_seq` environment variables.
  3. Removed the code related to `alltoall_buffer` and `alltoall_seq`, retaining the sole `TokenDispatcher` subclass.
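
A rough sketch of the layout after this refactor (the new module path is inferred from the Codecov report later in this thread; the `apply_mlp` signature is an assumption for illustration only):

```python
# vllm_ascend/ops/layers/moe_mlp.py: new home for the MLP application logic.
# The signature below is an assumption, not the actual interface.
import torch


def apply_mlp(sorted_hidden_states: torch.Tensor,
              w1: torch.Tensor,
              w2: torch.Tensor,
              group_list: torch.Tensor) -> torch.Tensor:
    """Grouped gate_up matmul, swiglu, then down matmul for the routed tokens."""
    ...

# vllm_ascend/ops/common_fused_moe.py keeps the single remaining
# TokenDispatcher subclass and would import the helper instead of inlining it:
#   from vllm_ascend.ops.layers.moe_mlp import apply_mlp
```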

weijinqian_v1 added 4 commits August 28, 2025 14:13
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request effectively refactors the MoE implementation by removing redundant code related to alltoall_buffer and alltoall_seq, and by separating the apply_mlp logic into its own file. These changes significantly improve the code structure and maintainability.

My main feedback is regarding a new instance of code duplication that seems to have been introduced. The function fused_experts_moge in vllm_ascend/ops/common_fused_moe.py is identical to torchair_fused_experts_moge in the torchair ops file. Consolidating these would further improve the codebase.

Overall, this is a great cleanup effort.

Comment on lines +99 to +188
```python
def fused_experts_moge(
    hidden_states: torch.Tensor,
    w1: torch.Tensor,
    w2: torch.Tensor,
    moe_parallel_config: FusedMoEParallelConfig,
    topk_weights: torch.Tensor,
    topk_ids: torch.Tensor,
    top_k: int,
    global_num_experts: int,
    expert_map: torch.Tensor = None,
    apply_router_weight_on_input: bool = False,
) -> torch.Tensor:
    """
    Args:
        hidden_states: Hidden states of shape (num_tokens, hidden_size).
        w1: Expert weights1 of shape (num_experts, intermediate_size * 2, hidden_size).
        w2: Expert weights2 of shape (num_experts, hidden_size, intermediate_size).
        topk_weights: Routing weights of shape (num_tokens, top_k).
        topk_ids: Selected expert IDs of shape (num_tokens, top_k).
        top_k: Number of experts to select.
        expert_map: Expert mapping of shape (num_experts,).

    Returns:
        hidden_states: Hidden states after routing.
    """
    ep_size = moe_parallel_config.ep_size
    local_num_experts = global_num_experts // ep_size
    local_num_group = top_k // ep_size

    if apply_router_weight_on_input:
        assert (topk_weights.dim() == 2
                ), "`topk_weights` should be in shape (num_tokens, topk)"
        _, topk = topk_weights.shape
        assert (
            topk == 1
        ), "Only support topk=1 when `apply_router_weight_on_input` is True"
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)

    bsz, _ = hidden_states.shape
    flatten_topk_ids = topk_ids.view(-1)
    sorted_topk_ids = torch.argsort(flatten_topk_ids.float())
    sorted_topk_ids = sorted_topk_ids.to(torch.int32)
    sorted_hidden_states = hidden_states.index_select(
        0, sorted_topk_ids // local_num_group)

    experts_id = torch.arange(0,
                              local_num_experts,
                              dtype=topk_ids.dtype,
                              device=topk_ids.device)
    num_tokens_per_expert = (flatten_topk_ids.unsqueeze(-1) == experts_id).to(
        torch.float32).sum(0)
    topk_scales = topk_weights.view(-1).index_select(
        0, sorted_topk_ids).unsqueeze(-1)
    group_list = num_tokens_per_expert.cumsum(dim=0).to(torch.int64)

    w1 = w1.transpose(1, 2)
    gate_up_out = torch_npu.npu_grouped_matmul(
        x=[sorted_hidden_states],
        weight=[w1],
        split_item=2,
        group_list_type=0,
        group_type=0,
        group_list=group_list,
    )[0]

    if is_310p():
        gate_up_out = torch_npu.npu_swiglu(gate_up_out.to(torch.float32)).to(
            torch.float16)
    else:
        gate_up_out = torch_npu.npu_swiglu(gate_up_out)
    gate_up_out *= topk_scales

    w2 = w2.transpose(1, 2)
    down_out_list = torch_npu.npu_grouped_matmul(
        x=[gate_up_out],
        weight=[w2],
        split_item=2,
        group_list_type=0,
        group_type=0,
        group_list=group_list,
    )[0]

    unsorted_topk_ids = torch.argsort(sorted_topk_ids.float()).to(torch.int32)
    unsorted_hidden_states = down_out_list.index_select(0, unsorted_topk_ids)
    final_hidden_states = unsorted_hidden_states.reshape(
        bsz, top_k // ep_size, -1).sum(1)

    return final_hidden_states
```

gemini-code-assist bot (Contributor) commented (severity: high)

This function fused_experts_moge is identical to torchair_fused_experts_moge in vllm_ascend/torchair/ops/torchair_fused_moe.py. To avoid code duplication and improve maintainability, consider using this common function in the torchair implementation as well. This would involve removing torchair_fused_experts_moge and updating its call site to use this function.
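
A minimal sketch of the consolidation (module paths are the ones named in this review; keeping a transitional alias rather than updating all call sites at once is an assumption):

```python
# vllm_ascend/torchair/ops/torchair_fused_moe.py (sketch)
# Reuse the common implementation instead of keeping a duplicate copy.
from vllm_ascend.ops.common_fused_moe import fused_experts_moge

# Transitional alias so existing torchair call sites keep working until they
# are updated to call fused_experts_moge directly, as suggested above.
torchair_fused_experts_moge = fused_experts_moge
```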

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
* `expert_tensor_parallel_size` in `additional_config` is removed now, and EP and TP are now aligned with vLLM. [#1681](https://github.com/vllm-project/vllm-ascend/pull/1681)
* Add `VLLM_ASCEND_MLA_PA` to the environment variables; use this to enable the MLA paged attention operator for DeepSeek MLA decode.
* Add `VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE` to the environment variables to enable the `MatmulAllReduce` fusion kernel when tensor parallelism is enabled. This feature is supported on A2, and eager mode gets better performance.
* Add `VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ` to the environment variables to control whether MoE all2all seq is enabled; this provides a basic framework on top of alltoall for easy extension (a usage sketch follows this list).
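
For reference, these flags are ordinary environment variables; a minimal usage sketch (flag names taken from the notes above, the enabling value "1" and the placement before initialization are assumptions):

```python
import os

# Hedged sketch: set the optional flags described in the release notes above.
# These must be exported before vLLM Ascend is initialized; using "1" as the
# enabling value is an assumption here.
os.environ["VLLM_ASCEND_MLA_PA"] = "1"
os.environ["VLLM_ASCEND_ENABLE_MATMUL_ALLREDUCE"] = "1"
os.environ["VLLM_ASCEND_ENABLE_MOE_ALL2ALL_SEQ"] = "1"
```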
Collaborator

do not remove release note

weijinqian0 (Collaborator, Author)

ok

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
weijinqian_v1 added 5 commits August 29, 2025 09:32
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
codecov bot commented Aug 29, 2025

Codecov Report

❌ Patch coverage is 97.47899% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.34%. Comparing base (cf96366) to head (3a2ec56).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| :--- | :--- | :--- |
| vllm_ascend/ops/layers/moe_mlp.py | 95.74% | 2 Missing ⚠️ |
| vllm_ascend/ops/common_fused_moe.py | 96.96% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2597      +/-   ##
==========================================
+ Coverage   72.39%   73.34%   +0.95%     
==========================================
  Files         146      149       +3     
  Lines       21621    21451     -170     
==========================================
+ Hits        15652    15733      +81     
+ Misses       5969     5718     -251     
| Flag | Coverage Δ |
| :--- | :--- |
| unittests | 73.34% <97.47%> (+0.95%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.


This pull request has conflicts, please resolve those before we can evaluate the pull request.

MengqingCao and others added 10 commits August 29, 2025 18:02
### What this PR does / why we need it?
As vLLM sets `cudagraph_mode` to `NONE` before
`check_and_update_config` in the post-init of `VllmConfig`
(https://github.com/vllm-project/vllm/blob/5da4f5d857933329aaca779e3a81f1385c84e34a/vllm/config/__init__.py#L3630),
`cudagraph_mode` is never `None` at that point, so we must remove this
check and add it back once the related adaptation in vLLM is done.
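
For illustration, a minimal sketch of why such a guard is always dead after vLLM's post-init (the attribute path and platform hook name follow the description above and are assumptions here):

```python
def check_and_update_config(vllm_config) -> None:
    # Sketch only: VllmConfig's post-init already assigns a concrete
    # cudagraph_mode (e.g. NONE) before this platform hook runs, so the
    # guard below can never be True; this commit removes the dead branch.
    if vllm_config.compilation_config.cudagraph_mode is None:  # always False
        pass  # the removed check lived here
```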

Part of vllm-project#2577; the e2e test on applying reply will be added
after the CI refactor is done.

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@f48a9af

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
This patch also supports v0.10.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- CI passed
- test 0.10.1: vllm-project#2583
- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@321938e

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.

The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.

For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in a
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).
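
A hedged sketch of the single device-side all-reduce described above (the packed field layout, the function name, and the use of MAX as the reduction are assumptions; only the gloo-to-npu switch and the single-op idea come from the description):

```python
import torch
import torch.distributed as dist


def sync_dp_forward_metadata(num_tokens: int, with_prefill: bool, enable_dbo: bool,
                             device_group: dist.ProcessGroup):
    """Pack per-rank forward metadata and sync it with one device all-reduce.

    Sketch only: one all-reduce on the NPU/HCCL group replaces several
    gloo (CPU) all-reduces.
    """
    packed = torch.tensor([num_tokens, int(with_prefill), int(enable_dbo)],
                          dtype=torch.int32, device="npu")  # requires torch_npu
    # MAX yields the padded token count and "any rank prefilling / any rank in DBO".
    dist.all_reduce(packed, op=dist.ReduceOp.MAX, group=device_group)
    return int(packed[0].item()), bool(packed[1].item()), bool(packed[2].item())
```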

Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. At last, a relief.

Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Maybe we should add a test case for the multi-DP online server?
@MengqingCao

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@c5d004a

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
Add e2e ci test for A3

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@11a7faf

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
This PR introduces LMHead tensor model parallelism to reduce memory
consumption and improve TPOT performance. It supports both eager mode and
graph mode.

In a DeepSeek R1 W8A8 PD-disaggregated Decode instance using pure DP, with
lmhead_tensor_parallel_size = 8, we get a 1 ms TPOT improvement and save
1.48 GB of NPU memory per rank.

performance data:

![performance data](https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0)

### Does this PR introduce _any_ user-facing change?
This PR introduces one new config in `additional_config`.

| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces | No | int | Default value is None; once this value is set, the feature is enabled. vocab_size must be divisible by this value. |

Example:

`--additional_config={"lmhead_tensor_parallel_size": 8}`
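
An equivalent offline-inference sketch (assuming the `LLM` entry point forwards `additional_config` to the engine the same way the CLI flag above does; the model path is a placeholder):

```python
from vllm import LLM

# Hedged sketch: enable LMHead tensor parallelism for an offline run.
# Whether LLM() accepts additional_config exactly like the CLI flag is an
# assumption; the model path below is a placeholder.
llm = LLM(
    model="path/to/deepseek-r1-w8a8",
    additional_config={"lmhead_tensor_parallel_size": 8},
)
```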

### How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@de533ab

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zhangzihang <zzh_201018@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…roject#2626)

### What this PR does / why we need it?
It is confirmed that `num_input_tokens` must be assigned the value of
`maybe_padded_num_tokens` under all circumstances.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Waiting for daily test for TorchAir.
- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@006477e

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…project#2627)

### What this PR does / why we need it?
In a mixed-precision scenario, quant_config is not None, but MoE needs
to perform unquantized computation; however, quantized computation is
currently being used. Therefore, we move the with_quant logic into
forward to avoid misjudging mixed-precision scenarios.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@98ac0cb

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
…n when requests are canceled. (vllm-project#2426)

### What this PR does / why we need it?
Resolve the issue of waiting queue accumulation when requests are
canceled.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@006477e

---------

Signed-off-by: wangxiaoteng666 <wangxiaoteng@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
github-actions bot added the documentation (Improvements or additions to documentation) and module:quantization labels Aug 29, 2025
weijinqian0 closed this Sep 8, 2025
weijinqian0 deleted the main_remove_code branch September 8, 2025 01:11