Optimization of TP4 Parallelism in DeepSeek MLP Dense Layers #1738
Open
zhanghw0354 wants to merge 33 commits into vllm-project:main from zhanghw0354:main
Conversation
…project#1426)

### What this PR does / why we need it?
Add guidance on how to implement and register new models. Modified based on PR vllm-project#1126; thanks to @linfeng-yuan for the contribution.

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…sed (vllm-project#1482)

### What this PR does / why we need it?
- Fix vLLM EPLB break vllm-project/vllm@e9fd658 by recovering load_weights back to the [v0.9.1 version](vllm-project/vllm@07b8fae) temporarily.
- Fix transformers>=4.53.0 image processor break. Related: vllm-project#1470
- Mirror torch_npu requirements to pyproject.toml.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Add Qwen2.5-VL eager mode doc.

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
… V1 (vllm-project#1483)

### What this PR does / why we need it?
Support prompt logprobs in V1. This also enables lm_eval to test accuracy on V1.

### Does this PR introduce _any_ user-facing change?
Support prompt logprobs output.

### How was this patch tested?
CI passed with accuracy test. Tested with lm_eval, which uses prompt logprobs as output to measure accuracy:

```bash
VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct,max_model_len=4096,block_size=4 \
  --tasks ceval-valid_computer_network \
  --batch_size 8
```

After this PR, the accuracy test results of `Qwen/Qwen2.5-7B-Instruct` on V1 are:

```bash
|            Tasks            |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-----------------------------|------:|------|-----:|--------|---|-----:|---|-----:|
|ceval-valid_computer_network |      2|none  |     0|acc     |↑  |0.7368|±  |0.1038|
|                             |       |none  |     0|acc_norm|↑  |0.7368|±  |0.1038|
```

Closes: vllm-project#1043

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
### What this PR does / why we need it?
Fix version conflict on transformers:
`pip._vendor.pkg_resources.ContextualVersionConflict: (transformers 4.53.0 (/usr/local/python3.10.17/lib/python3.10/site-packages), Requirement.parse('transformers<4.53.0'), {'vllm-ascend'})`
Fix https://github.com/vllm-project/vllm-ascend/actions/runs/15933263325/job/44947231642

### Does this PR introduce _any_ user-facing change?
Fix broken build.

### How was this patch tested?
CI passed with existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…0I Duo (vllm-project#1478)

### What this PR does / why we need it?
This PR fixes a bug triggered by using broadcast with cpu_group when running DP. The `broadcast310p` patch takes effect for both the cpu_group and the device group, but we only need it for the device group. Hence a wrapper is added to let cpu_group use the native torch broadcast, which solves the bug.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
With this PR, DP on 310P runs normally and generates reasonable answers.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
…oject#1463)

### What this PR does / why we need it?
In this PR, we support the H2P communication optimization when running PanguProMoE with dp_size > 1. H2P uses `reduce_scatter` and `all_gather` in place of `all_reduce` to improve performance.

Original layer:
input_layernorm --> attn --> tp all_reduce --> post_attention_layernorm --> dp all_gather --> moe/mlp --> dp reduce_scatter --> tp all_reduce

Now:
input_layernorm --> tp all_gather --> attn --> tp reduce_scatter --> post_attention_layernorm --> all_rank all_gather --> moe/mlp --> all_rank reduce_scatter

Besides, because `reduce_scatter` requires a num_tokens that is divisible by the group size, we need to pad the sequences based on `max_tokens_across_dp`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR has been tested with both offline and online inference using PanguProMoE-72B.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
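For context, the reduce_scatter/all_gather pattern described in the commit above can be sketched as follows; the helper names, the 2-D [num_tokens, hidden_size] layout, and the padding helper are assumptions for illustration, not the commit's actual code:

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def pad_tokens(x: torch.Tensor, multiple: int) -> torch.Tensor:
    # reduce_scatter needs the token dimension to be divisible by the group
    # size, so pad the sequence up to the next multiple.
    pad = (-x.shape[0]) % multiple
    return F.pad(x, (0, 0, 0, pad)) if pad else x

def scatter_hidden(hidden: torch.Tensor, group: dist.ProcessGroup) -> torch.Tensor:
    # Instead of all_reduce (every rank ends up with the full reduced tensor),
    # each rank keeps only its 1/world_size shard of the reduced tokens.
    world_size = dist.get_world_size(group)
    hidden = pad_tokens(hidden, world_size)
    shard = hidden.new_empty(hidden.shape[0] // world_size, hidden.shape[1])
    dist.reduce_scatter_tensor(shard, hidden, group=group)
    return shard

def gather_hidden(shard: torch.Tensor, group: dist.ProcessGroup) -> torch.Tensor:
    # Re-assemble the full token dimension before the next block that needs it.
    world_size = dist.get_world_size(group)
    out = shard.new_empty(shard.shape[0] * world_size, shard.shape[1])
    dist.all_gather_into_tensor(out, shard, group=group)
    return out
```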
### What this PR does / why we need it?
This PR introduces an expert rearrange algorithm for the PanguProMoE model. Different from the original grouped topk, it filters out the top experts that are allocated more tokens, so we can load fewer experts when calculating the grouped matmul (gmm). We have tested this algorithm with PanguProMoE-72B on the 300I Duo and 800I A2 platforms. On 300I Duo, we find that setting `num_voted_experts` to 5 achieves both good performance and accuracy, while on 800I A2 we keep it at 8 to use the original pangu grouped topk.

### Does this PR introduce _any_ user-facing change?
No.

Signed-off-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
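Purely as a hedged sketch of the idea described in the commit above (tensor shapes and names are assumptions), keeping only the `num_voted_experts` experts that received the most token votes might look like:

```python
import torch

def select_voted_experts(topk_ids: torch.Tensor,
                         num_experts: int,
                         num_voted_experts: int) -> torch.Tensor:
    # topk_ids: [num_tokens, top_k] expert indices chosen by grouped topk.
    # Count how many tokens each expert was allocated, then keep only the
    # num_voted_experts most-loaded experts so the grouped matmul (gmm)
    # has to load fewer expert weights.
    votes = torch.bincount(topk_ids.reshape(-1), minlength=num_experts)
    return torch.topk(votes, k=num_voted_experts).indices
```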
### What this PR does / why we need it?
Support pangu moe w8a8c8.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with newly added tests.

Signed-off-by: zhuyilin <809721801@qq.com>
Signed-off-by: zhanghw0354 <zhanghaiwen_yewu@cmss.chinamobile.com>
Thanks for your contribution. Could you please share any performance data about this feature?
ApsarasX reviewed Jul 14, 2025
def init_ascend_model_parallel(
    expert_parallel_size: int = 1,
    expert_tensor_parallel_size: int = 1,
    world_size: Optional[int] = None,
    backend: Optional[str] = None,
    mlp_tensor_parallel_size: Optional[int] = 4,
Can this optimization be used when TP=8?
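For reference, a minimal sketch of how a dedicated MLP tensor-parallel sub-group of size `mlp_tensor_parallel_size` could be created and stored; the helper and the `_MLP_TP_GROUP` global are assumptions for illustration, not this PR's actual code:

```python
from typing import Optional

import torch.distributed as dist

_MLP_TP_GROUP: Optional[dist.ProcessGroup] = None

def init_mlp_tensor_parallel(world_size: int,
                             mlp_tensor_parallel_size: int = 4,
                             backend: Optional[str] = None) -> None:
    # Split the global ranks into contiguous sub-groups of size
    # mlp_tensor_parallel_size; each rank keeps only the group it belongs to
    # and uses it for the dense-MLP collectives.
    global _MLP_TP_GROUP
    assert world_size % mlp_tensor_parallel_size == 0
    rank = dist.get_rank()
    for start in range(0, world_size, mlp_tensor_parallel_size):
        ranks = list(range(start, start + mlp_tensor_parallel_size))
        # new_group must be called collectively by every rank for every group,
        # even by ranks that are not members of the group being created.
        group = dist.new_group(ranks, backend=backend)
        if rank in ranks:
            _MLP_TP_GROUP = group
```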
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Accelerate DeepSeek model inference through TP4-optimized partitioning of the MLP dense layers.
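As a rough illustration only (module and parameter names are assumptions, not this PR's actual code), partitioning a dense MLP across a 4-rank tensor-parallel sub-group typically shards the gate/up projections column-wise and the down projection row-wise, so each MLP needs a single collective on that sub-group:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class DenseMLPShardTP4(nn.Module):
    """One rank's shard of a dense MLP under a 4-way MLP TP sub-group (sketch)."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 mlp_tp_group: dist.ProcessGroup):
        super().__init__()
        tp = dist.get_world_size(mlp_tp_group)  # expected to be 4
        assert intermediate_size % tp == 0
        shard = intermediate_size // tp
        self.group = mlp_tp_group
        # Column-parallel: each rank holds a slice of the intermediate dimension.
        self.gate_proj = nn.Linear(hidden_size, shard, bias=False)
        self.up_proj = nn.Linear(hidden_size, shard, bias=False)
        # Row-parallel: partial results are summed across the sub-group.
        self.down_proj = nn.Linear(shard, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.silu(self.gate_proj(x)) * self.up_proj(x)
        out = self.down_proj(h)
        # One all_reduce per MLP, restricted to the 4-rank MLP TP group.
        dist.all_reduce(out, group=self.group)
        return out
```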
Does this PR introduce any user-facing change?
No user-facing changes are involved at this stage.
How was this patch tested?
CI passed with newly added and existing tests.