[BugFix]fixed all_reduce_merge_allgather_ep bug #1818
Conversation
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Codecov Report
Attention: Patch coverage is 16.66%.
❌ Your patch check has failed because the patch coverage (16.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1818      +/-   ##
==========================================
- Coverage   53.51%   53.49%   -0.02%
==========================================
  Files          77       77
  Lines        9435     9440       +5
==========================================
+ Hits         5049     5050       +1
- Misses       4386     4390       +4

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Pull Request Overview
This PR refines the all_reduce_merge behavior to correctly handle the case where Prefill uses the AllGatherEP solution and Decode uses the MC2 solution, ensuring merge happens only in Prefill.
- Expanded comments in get_all_reduce_merge_state to detail four merge scenarios.
- Introduced a conditional all-reduce for shared experts in fused_moe.py when Prefill uses AllGatherEP and Decode uses MC2 (see the sketch after this list).
- Added a stub branch in deepseek_v2.py for the new AllGatherEP-then-MC2 path.
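As a reading aid only (not code from the patch), here is a minimal, self-contained sketch of the gating described above. The names tp_size, all_reduce_merge, FusedMoEState and the VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP flag are taken from the diff context quoted later in this review; should_all_reduce_shared_experts is a hypothetical helper, and the enum values are simplified stand-ins.

from enum import Enum


class FusedMoEState(Enum):
    # Simplified stand-in for the states referenced in the diff.
    AllGather = 1
    NaiveMulticast = 2
    AllGatherEP = 3
    MC2 = 4


def should_all_reduce_shared_experts(tp_size: int,
                                     allgather_ep_enabled: bool,
                                     all_reduce_merge: bool,
                                     fused_moe_state: FusedMoEState) -> bool:
    # Mirrors the condition quoted below from fused_moe.py: when Prefill uses
    # the AllGatherEP solution but Decode uses MC2, the shared-experts hidden
    # states still need an explicit tensor-parallel all-reduce.
    return (tp_size > 1
            and allgather_ep_enabled  # VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP
            and all_reduce_merge
            and fused_moe_state == FusedMoEState.MC2)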
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
vllm_ascend/utils.py | Clarified four scenarios for all_reduce_merge in doc comments |
vllm_ascend/ops/fused_moe.py | Added conditional merge for shared_hidden_states under EP+MC2 path |
vllm_ascend/models/deepseek_v2.py | Imported envs_ascend and added placeholder for EP+MC2 logic |
Comments suppressed due to low confidence (3)
vllm_ascend/utils.py:461
- [nitpick] The inline comment list has inconsistent punctuation and phrasing across items. Consider ending each item with a period and using parallel sentence structure for better readability.
# 1. If Prefill/decode use AllGather or NaiveMulticast solution at the same time, this logic is normal, and this solution is used for optimization
vllm_ascend/ops/fused_moe.py:1421
- The code references envs_ascend, but there is no import of vllm_ascend.envs as envs_ascend at the top of this file, which will raise a NameError at runtime. Please add the appropriate import.
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
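A minimal sketch of the import this comment is asking for, assuming the module path vllm_ascend.envs given in the comment matches the repository layout:

# Add near the other imports at the top of vllm_ascend/ops/fused_moe.py so the
# guarded branch above can read the env flag without raising NameError.
import vllm_ascend.envs as envs_ascend

# The flag is then available as envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP.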
vllm_ascend/models/deepseek_v2.py:413
- The ... placeholder in this branch is a syntax stub and will cause a runtime error. Implement the intended merge logic or add a clear raise NotImplementedError and a TODO.
...
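One hedged way to address this comment, shown as a self-contained sketch rather than the author's final code; the helper name and error message are hypothetical:

def _merge_shared_experts_allgather_ep_then_mc2(shared_hidden_states):
    # Placeholder for the Prefill=AllGatherEP / Decode=MC2 merge path in
    # deepseek_v2.py: fail loudly instead of leaving a silent `...` stub.
    # TODO: implement the intended merge logic for this combination.
    raise NotImplementedError(
        "AllGatherEP prefill + MC2 decode merge path is not implemented yet")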
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state in [
        FusedMoEState.MC2
]:
[nitpick] For a single-state check, using fused_moe_state == FusedMoEState.MC2 is clearer and more efficient than membership in a one-element list.
Suggested change:
if tp_size > 1 and envs_ascend.VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP and self.all_reduce_merge and fused_moe_state == FusedMoEState.MC2:
There are four situations in the previous logic (detailed in the expanded comments in get_all_reduce_merge_state).
After modification, the merge for the Prefill=AllGatherEP / Decode=MC2 combination happens only in Prefill.
How was this patch tested?
Test method for case 1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 2
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
Test method for case 3
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.1
export VLLM_ENABLE_FUSED_EXPERTS_ALLGATHER_EP=1
nohup python -m vllm.entrypoints.openai.api_server --model=/mnt/deepseek/DeepSeek-R1-W8A8-VLLM \
  --served-model-name auto \
  --quantization ascend \
  --trust-remote-code \
  --distributed-executor-backend=mp \
  --port 8006 \
  -tp=4 \
  -dp=4 \
  --max-num-seqs 24 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --no-enable-prefix-caching \
  --enable_expert_parallel \
  --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
  --gpu-memory-utilization 0.96 &> run.log &
disown
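A possible smoke test (not part of the PR) once any of the servers above is up on port 8006: send one request to the OpenAI-compatible completions endpoint and confirm a response comes back. The prompt and token count here are arbitrary; "auto" matches the --served-model-name flag used in all three cases.

import requests

# Query the server started by the launch commands above.
resp = requests.post(
    "http://localhost:8006/v1/completions",
    json={"model": "auto", "prompt": "Hello, DeepSeek!", "max_tokens": 16},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])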