[main][quantization] Support deepseek w4a8 per-channel quantization #3011
Conversation
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds support for deepseek w4a8 per-channel quantization. The changes include documentation updates and modifications to the quantization logic in both the standard and torchair execution paths. While the changes are mostly in the right direction, I've identified a critical bug in the scale processing logic for per-channel quantization that affects both w4a8_dynamic.py and torchair_w4a8_dynamic.py. This bug will lead to incorrect quantization scales. Additionally, the torchair implementation is missing a performance-critical format conversion that is present in the standard implementation. I've provided suggestions to fix these issues. On a general note, there is significant code duplication between AscendW4A8DynamicFusedMoEMethod and TorchairAscendW4A8DynamicFusedMoEMethod. It would be beneficial to refactor this into a base class with specialized subclasses to improve maintainability and avoid propagating bugs to multiple places.
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(
    np.int64)).npu()
There appears to be a bug in how the scale tensor is processed for per-channel quantization. The use of scale_np.astype(np.int64) performs a value-preserving cast from uint32 to int64, which is likely not the intended behavior for bit-packing. The else branch for per-group quantization uses a more complex but correct packing logic that reinterprets bits. A similar bit-packing logic should be used here to ensure correctness. Using astype will lead to incorrect scale values being passed to the kernel.
Suggested change, replacing:
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(
    np.int64)).npu()
with:
scale_np = scale.cpu().numpy()
packed_np = np.zeros(scale_np.shape[:-1] + (scale_np.shape[-1] * 2,), dtype=np.uint32)
packed_np[..., ::2] = scale_np.view(np.uint32)
scale_uint64_tensor = torch.from_numpy(np.frombuffer(packed_np.tobytes(), dtype=np.int64).copy().reshape(scale_np.shape)).npu()
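Before adopting either variant, it may help to compare their outputs directly. The following is a minimal, hypothetical CPU-only harness (not part of the PR) that builds both packings for a toy scale tensor and checks whether the resulting int64 words agree; on little-endian hosts the two may coincide, so running the comparison is a way to confirm which layout the kernel actually expects rather than assuming it.

# Standalone harness: compare the int64 words produced by the original
# astype() path with the interleave-and-reinterpret packing from the
# suggestion above. CPU-only, no NPU required.
import numpy as np
import torch

scale = torch.rand(4, 8, dtype=torch.float32)  # toy per-channel scales

# Original path: reinterpret float32 bits as uint32, then widen by value.
a = scale.numpy().copy()
a.dtype = np.uint32
via_astype = a.astype(np.int64)

# Suggested path: place the 32 scale bits in every other uint32 slot and
# reinterpret each consecutive pair of uint32 words as one int64.
s = scale.numpy()
packed = np.zeros(s.shape[:-1] + (s.shape[-1] * 2,), dtype=np.uint32)
packed[..., ::2] = s.view(np.uint32)
via_packing = np.frombuffer(packed.tobytes(), dtype=np.int64).reshape(s.shape)

# Prints whether the two strategies yield identical bit patterns on this host.
print(np.array_equal(via_astype, via_packing))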
The same issue and suggested change apply to the corresponding scale-processing code in torchair_w4a8_dynamic.py.
if hasattr(layer, "w13_weight_scale_second"):
    # scale_second is no longer used, release this part of the memory
    del layer.w13_weight_scale_second
    del layer.w2_weight_scale_second
The npu_format_cast to ACL_FORMAT_FRACTAL_NZ is missing for the weights before packing. The non-torchair version (vllm_ascend/quantization/w4a8_dynamic.py) includes this conversion for performance. Its omission here could lead to suboptimal performance or even correctness issues if the subsequent operations expect this format. Please add the format conversion after self.update_bias(...) and before self.pack_to_int32(...), like this:
self.update_bias(layer, w13_bias, w2_bias)
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(
layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w13_weight.data = self.pack_to_int32(layer.w13_weight.data)
layer.w2_weight.data = self.pack_to_int32(layer.w2_weight.data)
You will also need to add the import: from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ.
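As a small follow-up, a hedged sanity check could confirm the cast took effect before packing. This assumes torch_npu.get_npu_format is available in the installed torch_npu build, and it reuses layer, torch_npu, and ACL_FORMAT_FRACTAL_NZ from the surrounding method:

# Hypothetical check; torch_npu.get_npu_format is assumed to exist in the
# installed torch_npu build, ACL_FORMAT_FRACTAL_NZ comes from the import above.
assert torch_npu.get_npu_format(layer.w13_weight.data) == ACL_FORMAT_FRACTAL_NZ
assert torch_npu.get_npu_format(layer.w2_weight.data) == ACL_FORMAT_FRACTAL_NZ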
### 3. When converting deepseek series models with modelslim, what should you pay attention to?
Removed: When using the weight generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
Added: When the mla portion of the weights uses `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
nice update. Thanks.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
What this PR does / why we need it?
1. Support deepseek w4a8 per-channel quantization
2. The eager mode supports converting weights to the NZ format
Does this PR introduce any user-facing change?
no
How was this patch tested?
How to get weights using Modelslim
Installation steps
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh
Generate w4a8 per-channel weights
cd example/DeepSeek
Command reference: follow the "pre-check and w4a8 per-channel quantization" chapter in msmodelslim/example/DeepSeek/README.md.
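Once the w4a8 per-channel checkpoint has been generated, a minimal offline smoke test might look like the sketch below. The model path and parallel size are placeholders, and quantization="ascend" is assumed to be the method vLLM Ascend registers for modelslim-quantized weights; adjust both to your actual deployment.

# Hypothetical smoke test; placeholder path and parallelism, assumed "ascend"
# quantization method registered by vLLM Ascend.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/DeepSeek-w4a8-per-channel",  # placeholder checkpoint path
    quantization="ascend",                       # assumed vLLM Ascend quantization method
    tensor_parallel_size=16,                     # adjust to the target NPU deployment
    trust_remote_code=True,
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)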