[BugFix]fixed all_reduce_merge_allgather_ep bug #1818
Conversation
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Codecov Report
Attention: Patch coverage is 16.66%.
❌ Your patch check has failed because the patch coverage (16.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1818      +/-   ##
==========================================
- Coverage   53.51%   53.49%   -0.02%
==========================================
  Files          77       77
  Lines        9435     9440       +5
==========================================
+ Hits         5049     5050       +1
- Misses       4386     4390       +4

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Pull Request Overview
This PR refines the all_reduce_merge behavior to correctly handle the case where Prefill uses the AllGatherEP solution and Decode uses the MC2 solution, ensuring merge happens only in Prefill.
- Expanded comments in get_all_reduce_merge_state to detail four merge scenarios.
- Introduced a conditional all-reduce for shared experts in fused_moe.py when Prefill uses AllGatherEP and Decode uses MC2 (see the sketch after this list).
- Added a stub branch in deepseek_v2.py for the new AllGatherEP-then-MC2 path.
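As a reading aid only (not code from the patch), here is a minimal, self-contained sketch of the gating described above. The names tp_size, all_reduce_merge, FusedMoEState and the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP flag are taken from the diff context quoted later in this review; should_all_reduce_shared_experts is a hypothetical helper, and the enum values are simplified stand-ins.

from enum import Enum


class FusedMoEState(Enum):
    # Simplified stand-in for the states referenced in the diff.
    AllGather = 1
    NaiveMulticast = 2
    AllGatherEP = 3
    MC2 = 4


def should_all_reduce_shared_experts(tp_size: int,
                                     allgather_ep_enabled: bool,
                                     all_reduce_merge: bool,
                                     fused_moe_state: FusedMoEState) -> bool:
    # Mirrors the condition quoted below from fused_moe.py: when Prefill uses
    # the AllGatherEP solution but Decode uses MC2, the shared-experts hidden
    # states still need an explicit tensor-parallel all-reduce.
    return (tp_size > 1
            and allgather_ep_enabled  # VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP
            and all_reduce_merge
            and fused_moe_state == FusedMoEState.MC2)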
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
vllm_ascend/utils.py | Clarified four scenarios for all_reduce_merge in doc comments |
vllm_ascend/ops/fused_moe.py | Added conditional merge for shared_hidden_states under EP+MC2 path |
vllm_ascend/models/deepseek_v2.py | Imported envs_ascend and added placeholder for EP+MC2 logic |
Comments suppressed due to low confidence (3)
vllm_ascend/utils.py:461
- [nitpick] The inline comment list has inconsistent punctuation and phrasing across items. Consider ending each item with a period and using parallel sentence structure for better readability.
# 1. If Prefill/decode use AllGather or NaiveMulticast solution at the same time, this logic is normal, and this solution is used for optimization
vllm_ascend/ops/fused_moe.py:1421
- The code references envs_ascend, but there is no import of vllm_ascend.envs as envs_ascend at the top of this file, which will raise a NameError at runtime. Please add the appropriate import.
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
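A minimal sketch of the import this comment is asking for, assuming the module path vllm_ascend.envs given in the comment matches the repository layout:

# Add near the other imports at the top of vllm_ascend/ops/fused_moe.py so the
# guarded branch above can read the env flag without raising NameError.
import vllm_ascend.envs as envs_ascend

# The flag is then available as envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP.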
vllm_ascend/models/deepseek_v2.py:413
- The ... placeholder in this branch is a syntax stub and will cause a runtime error. Implement the intended merge logic or add a clear raise NotImplementedError and a TODO.
...
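One hedged way to address this comment, shown as a self-contained sketch rather than the author's final code; the helper name and error message are hypothetical:

def _merge_shared_experts_allgather_ep_then_mc2(shared_hidden_states):
    # Placeholder for the Prefill=AllGatherEP / Decode=MC2 merge path in
    # deepseek_v2.py: fail loudly instead of leaving a silent `...` stub.
    # TODO: implement the intended merge logic for this combination.
    raise NotImplementedError(
        "AllGatherEP prefill + MC2 decode merge path is not implemented yet")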
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
        FusedMoEState.MC2
]:
[nitpick] For a single-state check, using fused_moe_state == FusedMoEState.MC2 is clearer and more efficient than membership in a one-element list.
Suggested change:
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state == FusedMoEState.MC2:
There are four situations in the previous logic (detailed in the expanded comments in get_all_reduce_merge_state).
After modification, the merge for the Prefill=AllGatherEP / Decode=MC2 combination happens only in Prefill.
How was this patch tested?
Test method for case 1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 3
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
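A possible smoke test (not part of the PR) once any of the servers above is up on port 8006: send one request to the OpenAI-compatible completions endpoint and confirm a response comes back. The prompt and token count here are arbitrary; "auto" matches the --served-model-name flag used in all three cases.

import requests

# Query the server started by the launch commands above.
resp = requests.post(
    "http://localhost:8006/v1/completions",
    json={"model": "auto", "prompt": "Hello, DeepSeek!", "max_tokens": 16},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])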