Skip to content

Conversation

MengqingCao
Copy link
Collaborator

@MengqingCao MengqingCao commented Aug 28, 2025

What this PR does / why we need it?

  1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover CI, will be refactorred in the future
  2. disable aclgraph when testing w8a8

How was this patch tested?

CI passed with existing test.

Copy link

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a quickfix for an mc2 operator error that occurs with aclgraph when the expert parallel size is less than 16. The fix involves forcing the allgathercommimpl communication method in this scenario. Additionally, aclgraph is disabled for w8a8 tests. My review focuses on making the main fix more precise to avoid potential performance regressions on hardware not affected by the bug.

moe_comm_method_name = forward_context.moe_comm_method_name
if not self.moe_config.use_ep and moe_comm_method_name != "dummycommimpl":
# TODO: Adjusted logic to differentiate between A2 and A3, we check ep_size here since mc2 only support ep_size >= 16 on A3 now
if self.moe_config.ep_size < 16 and moe_comm_method_name != "dummycommimpl":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The added comment on line 244 states this limitation is specific to A3 hardware. However, this condition applies the restriction unconditionally, which could cause performance regressions on other hardware like A2. To align with the comment and make the fix more targeted, consider making the condition hardware-specific by checking the is_ascend_a3 flag from the mc2commimpl instance.

Suggested change
if self.moe_config.ep_size < 16 and moe_comm_method_name != "dummycommimpl":
if self.mc2commimpl.is_ascend_a3 and self.moe_config.ep_size < 16 and moe_comm_method_name != "dummycommimpl":

Copy link

codecov bot commented Aug 28, 2025

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 72.40%. Comparing base (dfc7eb3) to head (018ba13).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
vllm_ascend/ops/common_fused_moe.py 0.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2609   +/-   ##
=======================================
  Coverage   72.40%   72.40%           
=======================================
  Files         146      146           
  Lines       21598    21598           
=======================================
  Hits        15639    15639           
  Misses       5959     5959           
Flag Coverage Δ
unittests 72.40% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: MengqingCao <cmq0113@163.com>
@Yikun Yikun added accuracy-test enable all accuracy test for PR ready-for-test start test by label for PR labels Aug 29, 2025
@MengqingCao
Copy link
Collaborator Author

all the CI passed, plz help merge this to recover CI cc @Yikun @wangxiyuan

@wangxiyuan wangxiyuan merged commit 91c35d7 into vllm-project:main Aug 29, 2025
59 of 67 checks passed
845473182 pushed a commit to raindaywhu/vllm-ascend that referenced this pull request Sep 1, 2025
…into main_829

* 'main_829' of https://github.com/raindaywhu/vllm-ascend:
  [torchair]remove aicpu op (vllm-project#2640)
  bugfix for torchair graph (vllm-project#2639)
  [CI] fix UT error. (vllm-project#2644)
  [3/N][Feat][Graph] Support `all-to-all` and quantized models with ACL Graph (vllm-project#2614)
  [Bugfix] Fix mc2 operator error in aclgraph + ep<16 scenario (vllm-project#2609)
wenba0 pushed a commit to wenba0/vllm-ascend that referenced this pull request Sep 5, 2025
…oject#2609)

### What this PR does / why we need it?
1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover
CI, will be refactorred in the future
2. disable aclgraph when testing w8a8

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@9508960

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: lijiaojiao <lijiaojiao990304@163.com>
@MengqingCao MengqingCao deleted the compilation2 branch September 21, 2025 01:20
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
…oject#2609)

### What this PR does / why we need it?
1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover
CI, will be refactorred in the future
2. disable aclgraph when testing w8a8

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@9508960

Signed-off-by: MengqingCao <cmq0113@163.com>
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
…oject#2609)

### What this PR does / why we need it?
1. quickfix mc2 operator error in aclgraph + ep<16 scenario to recover
CI, will be refactorred in the future
2. disable aclgraph when testing w8a8

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@9508960

Signed-off-by: MengqingCao <cmq0113@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accuracy-test enable all accuracy test for PR module:ops module:tests ready-for-test start test by label for PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants