[main] [bugfix] Fix misjudging quantized/unquantized scenarios #2627
Conversation
Code Review

This pull request refactors how quantization is handled by removing the `with_quant` flag from the forward context and instead passing it as an explicit parameter down the call stack. This is a good improvement for code clarity and maintainability, and the changes are consistent across the modified files. However, I found a critical syntax error: a full-width comma is used instead of a standard one, which will break the code.
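The full-width comma issue flagged above can be reproduced in isolation. Python rejects U+FF0C at compile time, so the affected module would fail to import before any runtime path is exercised. This is a minimal standalone demonstration, not the actual vLLM code:

```python
# A full-width comma (U+FF0C) is not valid Python punctuation, so a file
# containing one fails to compile; the error surfaces at import time.

good = 'f(a, b)'        # ASCII comma: compiles fine
bad = 'f(a\uff0c b)'    # full-width comma: SyntaxError

compile(good, "<snippet>", "eval")

try:
    compile(bad, "<snippet>", "eval")
except SyntaxError as err:
    print("rejected:", err.msg)
```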
```diff
                 shared_gate_up=shared_gate_up,
                 shared_dequant_scale=shared_dequant_scale,
-                mc2_mask=kwargs.get("mc2_mask", None))
+                mc2_mask=kwargs.get("mc2_mask", None),
```
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #2627      +/-   ##
==========================================
- Coverage   72.61%   72.60%   -0.01%
==========================================
  Files         147      147
  Lines       21805    21792      -13
==========================================
- Hits        15833    15822      -11
+ Misses       5972     5970       -2
```

☔ View full report in Codecov by Sentry.
The error from the e2e test is not introduced by this PR.
…project#2627)

### What this PR does / why we need it?
In a mixed-precision scenario, quant_config is not None, but MoE needs to perform unquantized computation; however, quantized computation is currently being used. Therefore, we move the with_quant logic into forward, avoiding misjudgment in mixed-precision scenarios.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@98ac0cb

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
### What this PR does / why we need it?
Fix an MTP torchair bug caused by the torchair refactor and the MoE refactor.

Depends on PRs:
- fused moe fix: #2627
- torchair multi DP fix: #2626

### Does this PR introduce _any_ user-facing change?
When DP is enabled, running the MTP online server requires disabling server log stats with `--disable-log-stats`, because the current metrics do not support multi DP.

### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@7c8271c

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
What this PR does / why we need it?
In a mixed-precision scenario, quant_config is not None, but MoE needs to perform unquantized computation; however, quantized computation is currently being used. Therefore, we move the with_quant logic into forward, avoiding misjudgment in mixed-precision scenarios.
Does this PR introduce any user-facing change?
no
How was this patch tested?
e2e & ut
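The fix described above can be sketched as follows. This is a minimal, illustrative sketch, not the actual vLLM Ascend code: `QuantConfig`, `apply_experts`, and `moe_forward` are hypothetical stand-ins. The point is that `quant_config is not None` alone is not enough to decide quantization, so the decision is made inside forward from per-layer information and passed down explicitly rather than read from a shared forward context.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantConfig:
    dtype: str = "w8a8"

def apply_experts(x: list[float], with_quant: bool) -> list[float]:
    # Placeholder for the quantized vs. unquantized expert kernels.
    scale = 0.5 if with_quant else 1.0
    return [v * scale for v in x]

def moe_forward(x: list[float],
                quant_config: Optional[QuantConfig],
                layer_is_quantized: bool) -> list[float]:
    # Derive the flag here, in forward, and pass it explicitly.
    # quant_config being set is not sufficient: in mixed-precision
    # checkpoints some MoE layers still run unquantized.
    with_quant = quant_config is not None and layer_is_quantized
    return apply_experts(x, with_quant=with_quant)
```

With this shape, a mixed-precision model whose MoE layer is unquantized takes the unquantized path even though `quant_config` is set, which is exactly the misjudgment the PR removes.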