
[BugFix] fixed all_reduce_merge_allgather_ep bug #1818


Open
wants to merge 2 commits into main

Conversation

@ttanzhiqiang (Contributor) commented on Jul 15, 2025

The previous logic covered four situations:

  1. If Prefill and Decode both use AllGather or NaiveMulticast, this logic works correctly and the merge optimization is applied.
  2. If Prefill and Decode both use All2All or MC2, this logic also works correctly and the merge optimization is not applied.
  3. Prefill uses AllGatherEP (enabled by the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP switch) and Decode uses MC2. (bug)
  4. In the PD-separation scenario, P and D use independent strategies, so there is no impact.

After this modification:

  1. If Prefill and Decode both use AllGather or NaiveMulticast, this logic works correctly and the merge optimization is applied.
  2. If Prefill and Decode both use All2All or MC2, this logic also works correctly and the merge optimization is not applied.
  3. Prefill uses AllGatherEP (enabled by the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP switch) and Decode uses MC2: the merge is applied in Prefill only, since Prefill and Decode use different strategies and cannot be merged together (see the sketch after this list).
  4. In the PD-separation scenario, P and D use independent strategies, so there is no impact.
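
A minimal Python sketch of the decision described above. The strategy names and the all_reduce_merge_enabled helper are hypothetical stand-ins for illustration, not the actual get_all_reduce_merge_state implementation in vllm_ascend/utils.py.

MERGEABLE = {"AllGather", "NaiveMulticast", "AllGatherEP"}

def all_reduce_merge_enabled(prefill_comm: str, decode_comm: str,
                             is_prefill: bool) -> bool:
    # Cases 1 and 2: Prefill and Decode use the same strategy, so one decision
    # covers both phases (AllGather/NaiveMulticast merge, All2All/MC2 do not).
    # Case 4 (PD separation): each instance runs only one phase, so only its
    # own strategy matters and the same branch applies.
    if prefill_comm == decode_comm:
        return prefill_comm in MERGEABLE
    # Case 3: AllGatherEP prefill with MC2 decode -> merge only in the Prefill phase.
    if is_prefill:
        return prefill_comm in MERGEABLE
    return decode_comm in MERGEABLE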

How was this patch tested?

Test method for case 1 (Prefill/Decode both use AllGather or NaiveMulticast)

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown

Test method for case 2 (Prefill/Decode both use All2All or MC2)

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown

Test method for case 3 (Prefill uses AllGatherEP, Decode uses MC2)

export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
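
After any of the three launches above, a hedged way to confirm the server is responding is a plain completion request to the OpenAI-compatible endpoint on port 8006; the prompt text and token limit below are arbitrary.

import requests

resp = requests.post(
    "http://localhost:8006/v1/completions",
    json={
        "model": "auto",          # matches --served-model-name auto
        "prompt": "Hello, world",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])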

Signed-off-by: ttanzhiqiang <389825161@qq.com>

codecov bot commented Jul 15, 2025

Codecov Report

Attention: Patch coverage is 16.66667% with 5 lines in your changes missing coverage. Please review.

Project coverage is 53.49%. Comparing base (f96100f) to head (3e02626).

Files with missing lines | Patch % | Lines
vllm_ascend/models/deepseek_v2.py | 25.00% | 3 Missing ⚠️
vllm_ascend/ops/fused_moe.py | 0.00% | 2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (16.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1818      +/-   ##
==========================================
- Coverage   53.51%   53.49%   -0.02%     
==========================================
  Files          77       77              
  Lines        9435     9440       +5     
==========================================
+ Hits         5049     5050       +1     
- Misses       4386     4390       +4     
Flag | Coverage Δ
unittests | 53.49% <16.66%> (-0.02%) ⬇️



@jianzs requested a review from Copilot on July 16, 2025 06:07
Copilot AI left a comment


Pull Request Overview

This PR refines the all_reduce_merge behavior to correctly handle the case where Prefill uses the AllGatherEP solution and Decode uses the MC2 solution, ensuring merge happens only in Prefill.

  • Expanded comments in get_all_reduce_merge_state to detail four merge scenarios
  • Introduced a conditional all-reduce for shared experts in fused_moe.py when Prefill uses AllGatherEP and Decode uses MC2 (see the sketch after this list)
  • Added a stub branch in deepseek_v2.py for the new AllGatherEP-then-MC2 path
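
Below is a hedged sketch of what such a conditional merge might look like. The condition mirrors the snippet quoted later in this review; the tensor_model_parallel_all_reduce call, the shared_hidden_states variable, and the FusedMoEState import path are assumptions about the surrounding code, not the verbatim patch.

import torch
import vllm_ascend.envs as envs_ascend
from vllm.distributed import tensor_model_parallel_all_reduce
from vllm_ascend.ascend_forward_context import FusedMoEState  # import path assumed

def maybe_all_reduce_shared_experts(shared_hidden_states: torch.Tensor,
                                    tp_size: int,
                                    all_reduce_merge: bool,
                                    fused_moe_state: "FusedMoEState") -> torch.Tensor:
    # Condition mirrors the review snippet: Prefill was merged via AllGatherEP,
    # but Decode runs MC2, so the shared-experts output still needs an explicit
    # all-reduce over the tensor-parallel group.
    if (tp_size > 1
            and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP
            and all_reduce_merge
            and fused_moe_state == FusedMoEState.MC2):
        shared_hidden_states = tensor_model_parallel_all_reduce(shared_hidden_states)
    return shared_hidden_states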

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
vllm_ascend/utils.py | Clarified four scenarios for all_reduce_merge in doc comments
vllm_ascend/ops/fused_moe.py | Added conditional merge for shared_hidden_states under EP+MC2 path
vllm_ascend/models/deepseek_v2.py | Imported envs_ascend and added placeholder for EP+MC2 logic
Comments suppressed due to low confidence (3)

vllm_ascend/utils.py:461

  • [nitpick] The inline comment list has inconsistent punctuation and phrasing across items. Consider ending each item with a period and using parallel sentence structure for better readability.
# 1. If Prefill/decode use AllGather or NaiveMulticast solution at the same time, this logic is normal, and this solution is used for optimization

vllm_ascend/ops/fused_moe.py:1421

  • The code references envs_ascend but there is no import of vllm_ascend.envs as envs_ascend at the top of this file, which will raise a NameError at runtime. Please add the appropriate import.
        if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
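
A minimal sketch of the import the comment asks for, using the alias it names; exact placement among the file's existing imports is left to the author.

# At the top of vllm_ascend/ops/fused_moe.py, so envs_ascend is defined before use:
import vllm_ascend.envs as envs_ascend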

vllm_ascend/models/deepseek_v2.py:413

  • The ... placeholder in this branch is a syntax stub and will cause a runtime error. Implement the intended merge logic or add a clear raise NotImplementedError and a TODO.
                ...
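
One hedged way to make the stub explicit, as the comment suggests; this is illustrative only, and the real merge logic for this branch still needs to be written.

# TODO(ttanzhiqiang): implement the AllGatherEP-prefill + MC2-decode merge path.
raise NotImplementedError(
    "all_reduce_merge for AllGatherEP prefill with MC2 decode is not implemented yet")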

Comment on lines +1421 to +1423
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
FusedMoEState.MC2
]:

Copilot AI Jul 16, 2025


[nitpick] For a single-state check, using fused_moe_state == FusedMoEState.MC2 is clearer and more efficient than membership in a one-element list.

Suggested change:
-        if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
-            FusedMoEState.MC2
-        ]:
+        if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state == FusedMoEState.MC2:

