Optimization of TP4 Parallelism in DeepSeek MLP Dense Layers #1738


Open

wants to merge 33 commits into base: main
Conversation

@zhanghw0354 (Contributor) commented Jul 11, 2025

### What this PR does / why we need it?

DeepSeek model inference is accelerated through TP4-optimized partitioning of the MLP dense layers, i.e. the dense MLP layers run under a dedicated tensor-parallel size of 4.
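For intuition, here is a minimal conceptual sketch of how a dense MLP layer is split under tensor parallelism (column-parallel gate/up projection, row-parallel down projection, one all-reduce). It is illustrative only; the shard and group names are assumptions, not the code changed by this PR:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def mlp_forward_tp4(x, gate_up_shard, down_shard, mlp_tp_group):
    """Conceptual tensor-parallel dense MLP forward (not the PR's actual code).

    With an MLP tensor-parallel size of 4, each rank holds a 1/4 column slice
    of the merged gate/up projection and a 1/4 row slice of the down
    projection; a single all-reduce over the MLP TP group combines outputs.
    """
    # Column-parallel: every rank computes its slice of the intermediate dim.
    gate, up = (x @ gate_up_shard).chunk(2, dim=-1)
    activated = F.silu(gate) * up
    # Row-parallel: each rank produces a partial sum over the hidden dim.
    out = activated @ down_shard
    # Combine partial results across the 4-way MLP tensor-parallel group.
    dist.all_reduce(out, group=mlp_tp_group)
    return out
```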

### Does this PR introduce any user-facing change?

No user-facing changes are involved at this stage.

### How was this patch tested?

CI passed with newly added and existing tests.

zhanghw0354 and others added 30 commits June 30, 2025 11:24
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…project#1426)

### What this PR does / why we need it?
Add guidance on how to implement and register new models.

Modified based on PR vllm-project#1126; thanks to @linfeng-yuan for the contribution.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…sed (vllm-project#1482)

### What this PR does / why we need it?

- Fix the vLLM EPLB break (vllm-project/vllm@e9fd658) by temporarily reverting load_weights to the [v0.9.1 version](vllm-project/vllm@07b8fae).

- Fix transformers>=4.53.0 image processor break
Related: vllm-project#1470

- Mirror torch_npu requirements to pyproject.toml

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
… V1 (vllm-project#1483)

### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to run accuracy tests on V1.

### Does this PR introduce _any_ user-facing change?
Yes, prompt logprobs output is now supported.

### How was this patch tested?
CI passed with accuracy test.

Accuracy was tested with lm_eval, which uses prompt logprobs:
```bash
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```
After this PR, the accuracy results of `Qwen/Qwen2.5-7B-Instruct` on V1 are:
```bash
|           Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network|      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                            |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```
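For reference, prompt logprobs can also be requested through vLLM's offline API; a minimal sketch (the prompt and model below are just examples):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", max_model_len=4096)
# prompt_logprobs=1 asks for the top-1 logprob of every prompt token.
params = SamplingParams(max_tokens=8, prompt_logprobs=1)
outputs = llm.generate(["The capital of France is"], params)
# With this PR, V1 populates per-token logprobs for the prompt as well.
print(outputs[0].prompt_logprobs)
```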

Closes: vllm-project#1043

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Fix the version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers 4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages), Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`

Fixes https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642
### Does this PR introduce _any_ user-facing change?
Fix broken build

### How was this patch tested?
CI passed with existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…0I Duo (vllm-project#1478)

### What this PR does / why we need it?
This PR fixes a bug in broadcasting over cpu_group when running DP. The `broadcast310p` patch takes effect for both the cpu_group and the device group, but it is only needed for the device group. A wrapper is therefore added so that cpu_group uses native torch broadcast, which resolves the bug.
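A minimal sketch of the idea, assuming the cpu_group is backed by gloo while the device group uses hccl; the names below are illustrative, not the actual vllm-ascend code:

```python
import torch.distributed as dist

def make_broadcast_wrapper(patched_broadcast_310p, native_broadcast=dist.broadcast):
    """Route broadcasts so only device-group calls go through the 310P patch."""

    def broadcast(tensor, src, group=None, async_op=False):
        # Assumption: cpu_group uses the gloo backend; device groups use hccl.
        if dist.get_backend(group) == "gloo":
            # cpu_group keeps the native torch broadcast behavior.
            return native_broadcast(tensor, src, group=group, async_op=async_op)
        # Device group: use the existing broadcast310p-patched implementation.
        return patched_broadcast_310p(tensor, src, group=group, async_op=async_op)

    return broadcast
```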

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310p runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…oject#1463)

### What this PR does / why we need it?
In this PR, we support the H2P communication optimization when running PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather` in place of `all_reduce` to improve performance:

Original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce

Now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter

Besides, because `reduce_scatter` requires a token count that is divisible by the group size, we need to pad the sequences based on `max_tokens_across_dp`.
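A rough sketch of that padding step, assuming hidden states are laid out as [num_tokens, hidden_size]; names such as `pad_for_reduce_scatter` are illustrative, not the actual implementation:

```python
import torch

def pad_for_reduce_scatter(hidden_states: torch.Tensor,
                           max_tokens_across_dp: int,
                           group_size: int) -> torch.Tensor:
    """Pad the token dimension so reduce_scatter can split it evenly."""
    # Round the DP-wide maximum token count up to a multiple of the group size.
    padded_len = -(-max_tokens_across_dp // group_size) * group_size
    num_pad = padded_len - hidden_states.shape[0]
    if num_pad > 0:
        pad = hidden_states.new_zeros((num_pad, *hidden_states.shape[1:]))
        hidden_states = torch.cat([hidden_states, pad], dim=0)
    return hidden_states
```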

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using
PanguProMoE-72B.

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model. Unlike the original grouped top-k, it filters for the top experts that are allocated the most tokens, so fewer experts need to be loaded when computing the grouped matmul (GMM).

We have tested this algorithm for PanguProMoE-72B on the 300I Duo and 800I A2 platforms. On 300I Duo, setting `num_voted_experts` to 5 achieves both good performance and accuracy, while on 800I A2 we keep it at 8, which falls back to the original Pangu grouped top-k.
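A simplified sketch of the voting idea, under the assumption that experts are ranked by how many tokens the router assigns them; function and tensor names are illustrative, not the PR's actual implementation:

```python
import torch

def rearrange_experts(topk_ids: torch.Tensor, topk_weights: torch.Tensor,
                      num_experts: int, num_voted_experts: int):
    """Keep only the most heavily loaded experts so fewer are used in GMM."""
    # Count how many tokens grouped top-k routed to each expert.
    token_counts = torch.bincount(topk_ids.flatten(), minlength=num_experts)
    # "Vote": keep the num_voted_experts experts receiving the most tokens.
    voted_experts = torch.topk(token_counts, k=num_voted_experts).indices
    keep = torch.zeros(num_experts, dtype=torch.bool, device=topk_ids.device)
    keep[voted_experts] = True
    # Zero out routing weights to the filtered-out experts.
    topk_weights = torch.where(keep[topk_ids], topk_weights,
                               torch.zeros_like(topk_weights))
    return topk_weights, voted_experts
```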

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Support Pangu MoE w8a8c8.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added test.

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
@jianzs (Collaborator) commented Jul 13, 2025

Thanks for your contribution. Could you please share any performance data about this feature?



```python
def init_ascend_model_parallel(
    expert_parallel_size: int = 1,
    expert_tensor_parallel_size: int = 1,
    world_size: Optional[int] = None,
    backend: Optional[str] = None,
    mlp_tensor_parallel_size: Optional[int] = 4,
```
A collaborator commented on this diff:
Can this optimization be used when TP=8?
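For illustration, a hypothetical call using the new parameter shown in the diff above (the values are examples, not recommendations from the PR):

```python
# Hypothetical: keep the dense MLP layers on a 4-way tensor-parallel group.
init_ascend_model_parallel(
    expert_parallel_size=1,
    expert_tensor_parallel_size=1,
    mlp_tensor_parallel_size=4,
)
```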


This pull request has conflicts, please resolve those before we can evaluate the pull request.
