
Conversation

zhoux77899 (Contributor) commented Aug 8, 2025

What this PR does / why we need it?

Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into a single fused operation, `GroupedMatmulSwigluQuant`.

  1. Extract the common MoE MLP functions from `w4a8_dynamic.py` and `w8a8_dynamic.py`.
  2. When the inputs are supported, call the fused operation `npu_grouped_matmul_swiglu_quant` instead of the three separate kernels (see the sketch below).

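For readers less familiar with the Ascend kernels involved, the sketch below illustrates the dataflow being fused. It is not the PR's actual implementation: the `torch_npu` call signatures and argument names are assumptions made for illustration.

```python
# Illustrative sketch only: the torch_npu signatures and argument names here
# are assumptions, not the code added by this PR.
import torch
import torch_npu


def moe_mlp_unfused(x_int8, x_scale, w13, w13_scale, group_list):
    """Pre-fusion path: three separate kernel launches per MoE MLP."""
    # 1) GroupedMatmul: per-expert int8 gate/up projection, dequantized to bf16.
    gate_up = torch_npu.npu_grouped_matmul(
        [x_int8], [w13], scale=[w13_scale], per_token_scale=[x_scale],
        group_list=group_list, split_item=2, group_type=0,
        output_dtype=torch.bfloat16)[0]
    # 2) Swiglu: gated SiLU activation over the concatenated gate/up halves.
    act = torch_npu.npu_swiglu(gate_up)
    # 3) DynamicQuant: per-token re-quantization to int8 for the down projection.
    act_int8, act_scale = torch_npu.npu_dynamic_quant(act)
    return act_int8, act_scale


def moe_mlp_fused(x_int8, x_scale, w13, w13_scale, group_list):
    """Fused path: one kernel emits the int8 activation and its scales."""
    # Used only when the dtype/layout combination is supported; otherwise the
    # unfused path above remains the fallback.
    out = torch_npu.npu_grouped_matmul_swiglu_quant(
        x_int8, w13, group_list=group_list,
        weight_scale=w13_scale, x_scale=x_scale)
    act_int8, act_scale = out[0], out[1]
    return act_int8, act_scale
```

The expected benefit of the fusion is skipping the intermediate bf16 activation tensor and the two extra kernel launches per expert group.
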
Does this PR introduce any user-facing change?

How was this patch tested?

Tested on the W8A8-quantized Qwen3-235B-A22B model with `bs=16` (see the launch sketch after the results):

  1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`: TPOP increased 21.54%, Output Token Throughput increased 27.35% (benchmark screenshot)
  2. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`: TPOP increased 17.38%, Output Token Throughput increased 6.86% (benchmark screenshot)
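
The exact benchmark launch command is not included in the PR; a setup along these lines would be comparable, but the model path, quantization flag, and parallel settings below are assumptions based on the description above, not the tested configuration.

```python
# Assumption-laden sketch of an offline setup similar to the tp=8 runs above.
# Argument names mirror standard vLLM engine arguments; the exact MoE TP/EP
# knobs used for these benchmarks are not stated in the PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B",   # W8A8-quantized checkpoint assumed
    quantization="ascend",          # vllm-ascend quantization backend (assumed)
    tensor_parallel_size=8,         # tp=8
    enable_expert_parallel=True,    # approximates the moe_ep=8 case
    max_num_seqs=16,                # bs=16
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```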

github-actions bot commented Aug 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 96.52174% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.03%. Comparing base (24d4dad) to head (902b79a).
⚠️ Report is 20 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/ops/layers/moe_mlp.py | 88.88% | 3 Missing ⚠️ |
| vllm_ascend/quantization/w8a8_dynamic.py | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2275      +/-   ##
==========================================
- Coverage   73.71%   73.03%   -0.69%     
==========================================
  Files         152      151       -1     
  Lines       21967    21533     -434     
==========================================
- Hits        16194    15726     -468     
- Misses       5773     5807      +34     
| Flag | Coverage Δ |
|---|---|
| unittests | 73.03% <96.52%> (-0.69%) ⬇️ |

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zhoux77899 force-pushed the main_gmmswigluquant branch from a4bb618 to 4705afb on August 14, 2025 06:58
zhoux77899 changed the title from "[main] Support GroupedMatmulSwigluQuant in W8A8_DYNAMIC quantized MoE layers" to "[main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers" on Aug 16, 2025

wangxiyuan (Collaborator) commented:

Sorry, after an offline discussion we'll merge #2500 first. You can rebase after that.

zhoux77899 and others added 7 commits August 30, 2025 08:25

wangxiyuan merged commit aff5189 into vllm-project:main on Sep 4, 2025 (25 checks passed)
zhoux77899 deleted the main_gmmswigluquant branch on September 4, 2025 03:46
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Sep 10, 2025
offline893 pushed a commit to offline893/vllm-ascend that referenced this pull request Sep 16, 2025
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025