[main] Fuse GroupedMatmul, Swiglu and DynamicQuant in W8A8_DYNAMIC quantized MoE layers #2275
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Codecov Report ❌
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2275      +/-   ##
==========================================
- Coverage   73.71%   73.03%   -0.69%
==========================================
  Files         152      151       -1
  Lines       21967    21533     -434
==========================================
- Hits        16194    15726     -468
- Misses       5773     5807      +34

☔ View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from a4bb618 to 4705afb
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Sorry, after offline discussion we'll merge #2500 first. You can rebase after that.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
What this PR does / why we need it?
Fuse `GroupedMatmul`, `Swiglu` and `DynamicQuant` into one fused operation, `GroupedMatmulSwigluQuant`:
1. Extract the common functions from `w4a8_dynamic.py` and `w8a8_dynamic.py`.
2. Where supported, use the fused operation `npu_grouped_matmul_swiglu_quant` (see the sketch after this list).
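To make the fusion concrete, below is a minimal, plain-PyTorch sketch of what the unfused `GroupedMatmul` → `Swiglu` → `DynamicQuant` path computes for a batch of routed MoE tokens. It is illustrative only: the tensor layout, the merged gate/up weight `w13`, and the use of floating-point weights are simplifying assumptions for clarity, not the actual int8 NPU kernels in `w8a8_dynamic.py`.

```python
import torch


def grouped_matmul_swiglu_dynamic_quant(x, w13, group_list):
    """Unfused reference for the three steps this PR fuses.

    x:          [num_tokens, hidden], tokens sorted so each expert's tokens
                are contiguous.
    w13:        [num_experts, hidden, 2 * intermediate] merged gate/up weights
                (floating point here for clarity; the real W8A8 path is int8).
    group_list: [num_experts] number of tokens routed to each expert.
    """
    quantized, scales = [], []
    start = 0
    for expert_id, count in enumerate(group_list.tolist()):
        if count == 0:
            continue
        tokens = x[start:start + count]
        start += count
        # GroupedMatmul: one GEMM per expert's group of tokens.
        gate_up = tokens @ w13[expert_id]
        # Swiglu: split the merged projection and apply SiLU gating.
        gate, up = gate_up.chunk(2, dim=-1)
        act = torch.nn.functional.silu(gate) * up
        # DynamicQuant: per-token symmetric int8 quantization.
        scale = act.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        quantized.append(torch.round(act / scale).to(torch.int8))
        scales.append(scale.squeeze(-1))
    return torch.cat(quantized), torch.cat(scales)


# Tiny usage example with made-up shapes:
x = torch.randn(6, 8)              # 6 tokens, hidden size 8
w13 = torch.randn(2, 8, 16)        # 2 experts, intermediate size 8
group_list = torch.tensor([4, 2])  # 4 tokens -> expert 0, 2 tokens -> expert 1
q, s = grouped_matmul_swiglu_dynamic_quant(x, w13, group_list)
```

The fused `npu_grouped_matmul_swiglu_quant` operator performs these three steps in a single kernel, so the intermediate gate/up projection and activation do not need to round-trip through device memory between separate operators; that is presumably where the TPOP and throughput gains reported under "How was this patch tested?" come from.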
Does this PR introduce any user-facing change?

How was this patch tested?
Tested on the W8A8 quantized Qwen3-235B-A22B model with `bs=16`:
1. `tp=8`, `dp=1`, `moe_tp=8`, `moe_ep=1`: TPOP increased 21.54%, Output Token Throughput increased 27.35%.
2. `tp=8`, `dp=1`, `moe_tp=1`, `moe_ep=8`: TPOP increased 17.38%, Output Token Throughput increased 6.86%.

vLLM version: v0.10.1.1
vLLM main: vllm-project/vllm@6997a25