[Feat] Unquantized linear nz support #2619
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an optimization for unquantized linear layers on Ascend hardware by converting weights to the FRACTAL_NZ format for CANN 8.3. This is implemented through a new AscendUnquantizedLinearMethod. The changes correctly propagate this new method to various linear layers, including a new AscendQKVParallelLinear for QKV projections, and the tests are updated accordingly. However, I've found a couple of critical typos in type hints that would lead to NameErrors, and an indentation error in a test file that would cause a SyntaxError. These issues need to be addressed before merging.
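For orientation, here is a minimal sketch of what such a method could look like, pieced together from the snippets quoted later in this review. The base class UnquantizedLinearMethod, its import path, and the ACL_FORMAT_FRACTAL_NZ value of 29 are assumptions, not taken from this diff:

```python
import torch
import torch_npu
from vllm.model_executor.layers.linear import UnquantizedLinearMethod

# Assumed ACL format code for FRACTAL_NZ (torch_npu identifies formats by integer).
ACL_FORMAT_FRACTAL_NZ = 29


class AscendUnquantizedLinearMethod(UnquantizedLinearMethod):
    """Unquantized linear method that casts weights to FRACTAL_NZ on CANN 8.3."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        super().process_weights_after_loading(layer)
        # The NZ weight layout only pays off on CANN 8.3, so gate on the version.
        if torch.version.cann.startswith("8.3"):
            layer.weight.data = torch_npu.npu_format_cast(
                layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
```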
vllm_ascend/ops/linear.py
Outdated
self.quant_method: Qptional[
    QuantizeMethodBase] = AscendUnquantizedLinearMethod()
There is a typo in the type hint. Qptional should be Optional. This will cause a NameError at runtime as Qptional is not defined.
Suggested change:
self.quant_method: Optional[
    QuantizeMethodBase] = AscendUnquantizedLinearMethod()
vllm_ascend/ops/linear.py
Outdated
self.quant_method: Qptional[
    QuantizeMethodBase] = AscendUnquantizedLinearMethod()
There is a typo in the type hint. Qptional should be Optional. This will cause a NameError at runtime as Qptional is not defined.
Suggested change:
self.quant_method: Optional[
    QuantizeMethodBase] = AscendUnquantizedLinearMethod()
tests/ut/ops/test_linear.py
Outdated
expect_data = torch_npu.npu_format_cast(
    expect_data, ACL_FORMAT_FRACTAL_NZ)
self.assertTrue(torch.equal(layer.weight.data, expect_data))
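As a quick sanity check outside the test, the torch_npu helper get_npu_format reports a tensor's current ACL format code, so the cast can be verified directly. A minimal sketch, assuming 29 is the FRACTAL_NZ format code and an NPU device is available:

```python
import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed ACL format code for FRACTAL_NZ

weight = torch.randn(256, 128).npu()
weight = torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)

# get_npu_format returns the integer ACL format code of the tensor.
assert torch_npu.get_npu_format(weight) == ACL_FORMAT_FRACTAL_NZ
```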
This pull request has conflicts, please resolve those before we can evaluate the pull request.
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    super().process_weights_after_loading(layer)
    if torch.version.cann.startswith("8.3"):
Is a cudagraph check necessary?
Force-pushed from 2ac497c to 24773fd
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 24773fd to 11e14dd
Force-pushed from 864f9ae to c30eed9
Force-pushed from e5f110f to 66e6579
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Force-pushed from 53fdccd to 0bb2a32
@MengqingCao All clear, except for some known issues which are not introduced by this PR; please merge this ASAP.
### What this PR does / why we need it?
Currently, when execution reaches the Linear layer of the model in vLLM-Ascend, the weight input format is ND in the unquantized case and the skipped-ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for the Linear layer: when VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and the CANN version is 8.3, the weights of the Linear layer are converted to FRACTAL_NZ, in both the unquantized case and the skipped-ascend case.
- vLLM version: main
- vLLM main: vllm-project/vllm@267c80d

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This reverts commit 7b2ecc1.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: main
- vLLM main: vllm-project/vllm@64d90c3

Closes: #2890
Closes: #2887
Closes: #2885

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
What this PR does / why we need it?
Currently, when execution reaches the Linear layer of the model in vLLM-Ascend, the weight input format is ND in the unquantized case and the skipped-ascend case, which is slower than FRACTAL_NZ.
This PR supplements the execution logic for the Linear layer: when VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 and the CANN version is 8.3, the weights of the Linear layer are converted to FRACTAL_NZ, in both the unquantized case and the skipped-ascend case.
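A minimal sketch of the gating this description implies. The helper name maybe_cast_weight_to_nz is hypothetical, reading the flag via os.getenv stands in for vllm_ascend's own envs handling, and the format code 29 is an assumption:

```python
import os

import torch
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # assumed ACL format code for FRACTAL_NZ


def maybe_cast_weight_to_nz(weight: torch.Tensor) -> torch.Tensor:
    """Cast an ND weight to FRACTAL_NZ when the MLP optimization applies."""
    mlp_optimize = os.getenv("VLLM_ASCEND_ENABLE_MLP_OPTIMIZE", "0") == "1"
    on_cann_8_3 = (torch.version.cann is not None
                   and torch.version.cann.startswith("8.3"))
    if mlp_optimize and on_cann_8_3:
        return torch_npu.npu_format_cast(weight, ACL_FORMAT_FRACTAL_NZ)
    return weight
```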
Does this PR introduce any user-facing change?
How was this patch tested?