Conversation

@kunpengW-code (Contributor) commented on Sep 18, 2025

What this PR does / why we need it?

1. Support DeepSeek w4a8 per-channel quantization.
2. Support converting weights to the NZ format in eager mode.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

How to get weights using Modelslim

Installation steps

git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh

Generate w4a8 per-channel weights

cd example/DeepSeek
Command reference: msmodelslim/example/DeepSeek/README.md; follow the pre-check and w4a8 per-channel quantization chapters.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions bot added the documentation and module:quantization labels on Sep 18, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for deepseek w4a8 per-channel quantization. The changes include documentation updates and modifications to the quantization logic in both the standard and torchair execution paths. While the changes are mostly in the right direction, I've identified a critical bug in the scale processing logic for per-channel quantization that affects both w4a8_dynamic.py and torchair_w4a8_dynamic.py. This bug will lead to incorrect quantization scales. Additionally, the torchair implementation is missing a performance-critical format conversion that is present in the standard implementation. I've provided suggestions to fix these issues. On a general note, there is significant code duplication between AscendW4A8DynamicFusedMoEMethod and TorchairAscendW4A8DynamicFusedMoEMethod. It would be beneficial to refactor this into a base class with specialized subclasses to improve maintainability and avoid propagating bugs to multiple places.
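
As a rough sketch of the refactoring the review suggests, the shared scale-packing logic could live in one base class so a fix lands in a single place, while each execution path overrides only its own weight post-processing. The skeleton below is illustrative only; apart from the two existing class names, the method names are hypothetical placeholders, not the PR's actual API.

```python
# Hypothetical sketch of the suggested refactor; only the two subclass names
# come from the PR, the method names are illustrative placeholders.
class BaseW4A8DynamicFusedMoEMethod:
    def pack_scale(self, scale):
        # Shared per-channel / per-group scale packing would be implemented
        # exactly once here, so both execution paths pick up the same fix.
        raise NotImplementedError

    def process_weights_after_loading(self, layer):
        # Each execution path applies its own weight format handling.
        raise NotImplementedError


class AscendW4A8DynamicFusedMoEMethod(BaseW4A8DynamicFusedMoEMethod):
    def process_weights_after_loading(self, layer):
        # Standard (eager) path: cast weights to FRACTAL_NZ, then pack to int32.
        ...


class TorchairAscendW4A8DynamicFusedMoEMethod(BaseW4A8DynamicFusedMoEMethod):
    def process_weights_after_loading(self, layer):
        # Torchair graph-mode path: same flow, plus graph-specific handling.
        ...
```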

Comment on lines +313 to +316
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(
np.int64)).npu()

critical

There appears to be a bug in how the scale tensor is processed for per-channel quantization. The use of scale_np.astype(np.int64) performs a value-preserving cast from uint32 to int64, which is likely not the intended behavior for bit-packing. The else branch for per-group quantization uses a more complex but correct packing logic that reinterprets bits. A similar bit-packing logic should be used here to ensure correctness. Using astype will lead to incorrect scale values being passed to the kernel.

Suggested change
- scale_np = scale.cpu().numpy()
- scale_np.dtype = np.uint32
- scale_uint64_tensor = torch.from_numpy(scale_np.astype(
-     np.int64)).npu()
+ scale_np = scale.cpu().numpy()
+ packed_np = np.zeros(scale_np.shape[:-1] + (scale_np.shape[-1] * 2,), dtype=np.uint32)
+ packed_np[..., ::2] = scale_np.view(np.uint32)
+ scale_uint64_tensor = torch.from_numpy(np.frombuffer(packed_np.tobytes(), dtype=np.int64).copy().reshape(scale_np.shape)).npu()
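
As background for the distinction this comment draws between reinterpreting bits and converting values, here is a small standalone NumPy illustration; it is not code from the PR, just the general behaviour of view versus astype.

```python
import numpy as np

# Standalone illustration (not PR code): bit reinterpretation vs. value
# conversion for a float32 scale tensor.
scale = np.array([0.5, 1.25], dtype=np.float32)

# view() rereads the same 32 bits as uint32 -- no numeric conversion happens.
as_bits = scale.view(np.uint32)     # [1056964608, 1067450368] == [0x3F000000, 0x3FA00000]

# astype() converts the numeric values instead, discarding the bit pattern.
as_values = scale.astype(np.int64)  # [0, 1]

print(as_bits, as_values)
```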

Comment on lines +365 to +368
scale_np = scale.cpu().numpy()
scale_np.dtype = np.uint32
scale_uint64_tensor = torch.from_numpy(scale_np.astype(
np.int64)).npu()

critical

There appears to be a bug in how the scale tensor is processed for per-channel quantization. The use of scale_np.astype(np.int64) performs a value-preserving cast from uint32 to int64, which is likely not the intended behavior for bit-packing. The else branch for per-group quantization uses a more complex but correct packing logic that reinterprets bits. A similar bit-packing logic should be used here to ensure correctness. Using astype will lead to incorrect scale values being passed to the kernel.

Suggested change
- scale_np = scale.cpu().numpy()
- scale_np.dtype = np.uint32
- scale_uint64_tensor = torch.from_numpy(scale_np.astype(
-     np.int64)).npu()
+ scale_np = scale.cpu().numpy()
+ packed_np = np.zeros(scale_np.shape[:-1] + (scale_np.shape[-1] * 2,), dtype=np.uint32)
+ packed_np[..., ::2] = scale_np.view(np.uint32)
+ scale_uint64_tensor = torch.from_numpy(np.frombuffer(packed_np.tobytes(), dtype=np.int64).copy().reshape(scale_np.shape)).npu()

Comment on lines +438 to +441
if hasattr(layer, "w13_weight_scale_second"):
# scale_second is no longer used, release this part of the memory
del layer.w13_weight_scale_second
del layer.w2_weight_scale_second

high

The npu_format_cast to ACL_FORMAT_FRACTAL_NZ is missing for the weights before packing. The non-torchair version (vllm_ascend/quantization/w4a8_dynamic.py) includes this conversion for performance. Its omission here could lead to suboptimal performance or even correctness issues if the subsequent operations expect this format.

Please add the format conversion after self.update_bias(...) and before self.pack_to_int32(...), like this:

        self.update_bias(layer, w13_bias, w2_bias)

        layer.w13_weight.data = torch_npu.npu_format_cast(
            layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
        layer.w2_weight.data = torch_npu.npu_format_cast(
            layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
        
        layer.w13_weight.data = self.pack_to_int32(layer.w13_weight.data)
        layer.w2_weight.data = self.pack_to_int32(layer.w2_weight.data)

You will also need to add the import: from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ.

### 3. When converting DeepSeek series models with modelslim, what should you pay attention to?
When using weights generated by modelslim with the `--dynamic` parameter, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
When the MLA portion of the weights uses `W8A8_DYNAMIC` quantization, if torchair graph mode is enabled, please modify the configuration file in the CANN package to prevent incorrect inference results.
Collaborator

nice update. Thanks.

@MengqingCao added the ready and ready-for-test labels on Sep 22, 2025
Labels: documentation, module:quantization, module:tests, ready, ready-for-test