
[CPU] Enable DA8W4 on CPU #2128


Merged
29 commits merged into pytorch:main on Jun 25, 2025

Conversation

@Xia-Weiwen (Collaborator) commented Apr 25, 2025

Summary
This PR enables DA8W4 (dynamic 8-bit activation quantization with 4-bit weights) on CPU.

  • It adds a new layout, Int8DynamicActInt4WeightCPULayout, and its implementation
  • It adds two custom ops:
    • da8w4_linear_prepack_cpu for weight packing
    • da8w4_linear_cpu for the A8W4 GEMM
  • It adds C++ kernels for the two new custom ops

The feature supports both symmetric and asymmetric quantization of activations.
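
Once registered, the schemas of the two ops can be inspected from Python (a minimal sketch; it assumes a build where the C++ kernels are available, as described below):

import torch
import torchao  # noqa: F401 -- importing torchao registers the custom ops when the C++ kernels were built

print(torch.ops.torchao.da8w4_linear_prepack_cpu.default._schema)
print(torch.ops.torchao.da8w4_linear_cpu.default._schema)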

The ops and kernels are only available when all of the following hold:

  • torchao is built from source with USE_CPP_KERNELS=1 on Linux on an x86 CPU with AVX512
  • torchao is run on Linux on an x86 CPU with AVX512
  • the PyTorch version is >= 2.7

To get the best performance, one needs a CPU with AMX support.
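
A runtime availability probe might look like this (a sketch; probing via hasattr is an assumption, not an API this PR provides):

import torch
import torchao  # noqa: F401

# torch.ops namespaces raise AttributeError for unregistered ops,
# so hasattr doubles as an availability check.
da8w4_available = hasattr(torch.ops.torchao, "da8w4_linear_cpu")
print("DA8W4 CPU kernels available:", da8w4_available)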

Implementation details

  • The weight-packing kernel is implemented with AVX512 intrinsics if available; otherwise, a reference path is used.
  • The GEMM kernel uses the at::cpublas brgemm utilities from PyTorch core if available.
  • In the GEMM kernel, if M is large (> 4):
    • if brgemm is available, brgemm is used;
    • otherwise, it falls back to the reference implementation.
  • In the GEMM kernel, if M is small (<= 4):
    • if AVX512_VNNI is available, the kernel uses AVX512_VNNI intrinsics;
    • otherwise, it takes the same path as for large M (see the sketch after this list).
  • All utility functions used in the kernel are implemented with AVX512 if available; otherwise, they fall back to the reference implementation.
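
The dispatch logic above, rendered as a Python sketch (illustrative only; the actual kernels are C++, and the function and flag names here are assumptions):

def select_da8w4_gemm_path(m: int, has_brgemm: bool, has_avx512_vnni: bool) -> str:
    # Mirrors the rules described above; not the real C++ dispatch code.
    if m <= 4 and has_avx512_vnni:
        return "avx512_vnni"  # small-M fast path using AVX512_VNNI intrinsics
    if has_brgemm:
        return "brgemm"  # at::cpublas brgemm utilities from PyTorch core
    return "reference"  # portable fallback implementation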

Usage

# Import paths below are typical but may vary across torchao versions.
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType
from torchao.dtypes import Int8DynamicActInt4WeightCPULayout

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=32,  # or 64, 128
        layout=Int8DynamicActInt4WeightCPULayout(),
        act_mapping_type=MappingType.SYMMETRIC,  # or MappingType.ASYMMETRIC
    ),
)

Test plan

pytest test/quantization/test_quant_api.py -k test_8da4w_cpu

pytorch-bot bot commented Apr 25, 2025

🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2128

✅ No failures as of commit e3731f7 with merge base 4ebc9c0.

@facebook-github-bot added the CLA Signed label Apr 25, 2025
@Xia-Weiwen added the cpu, quantize, and topic: new feature labels Apr 25, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review April 29, 2025 02:01
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 April 29, 2025 03:16
@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 7, 2025 01:17
@Xia-Weiwen (Collaborator, Author) commented:

@leslie-fang-intel This PR is updated to use a new layout. Please review again. Thanks.

@Xia-Weiwen changed the title from "[CPU] enable int8_dynamic_activation_int4_weight with Int4CPULayout" to "[CPU] enable int8_dynamic_activation_int4_weight on CPU" May 16, 2025
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review May 16, 2025 05:59
@Xia-Weiwen changed the title from "[CPU] enable int8_dynamic_activation_int4_weight on CPU" to "[CPU] Add a new layout for int8_dynamic_activation_int4_weight on CPU" May 16, 2025
@Xia-Weiwen (Collaborator, Author) commented:

Hi @jerryzh168 Could you please review this PR? Thanks.

(2 similar review-ping comments from @Xia-Weiwen followed.)

@Xia-Weiwen Xia-Weiwen marked this pull request as draft May 21, 2025 02:57
@Xia-Weiwen (Collaborator, Author) commented:

Hi @leslie-fang-intel Please review this PR again. I have also added the kernel code in this PR. It showed reasonable performance in internal benchmarks. Thanks.

@leslie-fang-intel (Collaborator) left a comment:

Please also describe how we choose different implementations based on the CPU Info.

@Xia-Weiwen (Collaborator, Author) commented:

> Please also describe how we choose different implementations based on the CPU Info.

I have added more details in the description. Thanks.

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 June 6, 2025 01:49
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review June 6, 2025 01:49
@Xia-Weiwen (Collaborator, Author) commented:

Hi @jerryzh168 Could you please review this PR? Thanks. It's changed a lot since your last review.

@Xia-Weiwen (Collaborator, Author) commented:

Hi @jerryzh168 Could you please review this PR? Thanks.



@dataclass(frozen=True)
class Int8DynamicActInt4WeightCPULayout(Layout):
Contributor commented:
it looks like you can just reuse Int4CPULayout

Contributor commented:
can you move the layout and impl to a separate file?

@Xia-Weiwen (Collaborator, Author) commented:
Sure. Done.



@register_layout(Int8DynamicActInt4WeightCPULayout)
class DA8W4CPUAQTTensorImpl(Int4CPUAQTTensorImpl):
Contributor commented:
Oh I see. OK, if you need a separate impl then it makes sense to have a separate layout.

@Xia-Weiwen (Collaborator, Author) commented:
Yes. We need a different impl from W16W4 because the ISA (AMX and VNNI) requires different weight memory formats for computation in BF16 vs. INT8. Thanks.
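
For intuition (an illustrative sketch only, not this PR's packing code): VNNI int8 instructions consume 4 int8 values along K per 32-bit lane, while BF16 paths consume pairs of bf16, so the blocked weight layouts differ:

# Toy sizes below are assumptions for illustration only.
N, K, block_n = 64, 128, 16
int8_vnni_block_shape = (N // block_n, K // 4, block_n, 4)  # 4 int8 along K per lane
bf16_amx_block_shape = (N // block_n, K // 2, block_n, 2)  # 2 bf16 along K per lane
print(int8_vnni_block_shape, bf16_amx_block_shape)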

Comment on lines 435 to 441
int_data = (int_data + 8).to(torch.uint8)  # shift int4 values from [-8, 7] to unsigned [0, 15]
if scale.dim() == 1:
    scale.unsqueeze_(-1)  # add a trailing dim so scale broadcasts over groups
scale = scale.to(torch.float)
if zero_point.dim() == 1:
    zero_point.unsqueeze_(-1)  # add a trailing dim so zero_point broadcasts over groups
zero_point = zero_point.to(torch.int8) + 8  # apply the same +8 shift to zero points
Contributor commented:
can you configure dtypes of int_data, scale, zero_point and shapes in the call to to_affine_quantized_intx?

@Xia-Weiwen (Collaborator, Author) commented:
Thanks for the suggestion. I have improved this part.

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 June 15, 2025 11:32
assert "torch.ops.torchao.da8w4_linear_cpu.default" in code[0]
quantize_(
    m2,
    int8_dynamic_activation_int4_weight(
Contributor commented:

nit: can you use the new API: Int8DynamicActivationInt4WeightConfig instead of int8_dynamic_activation_int4_weight?

@Xia-Weiwen (Collaborator, Author) commented:
Thanks. Done.
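
For reference, a sketch of the config-style call (it assumes Int8DynamicActivationInt4WeightConfig accepts the same parameters as the functional API; import paths may vary by torchao version):

from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

quantize_(
    m2,
    Int8DynamicActivationInt4WeightConfig(
        group_size=32,
        layout=Int8DynamicActInt4WeightCPULayout(),
        act_mapping_type=MappingType.SYMMETRIC,
    ),
)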

@@ -728,9 +761,17 @@ def _int8_dynamic_activation_int4_weight_transform(
    quant_min = -8
    quant_max = 7

    if isinstance(layout, Int8DynamicActInt4WeightCPULayout):
@jerryzh168 (Contributor) commented Jun 23, 2025:
can this happen in kernel? we have dtype conversions like this:

w_vals_int8_t.to(input_tensor.dtype),

@Xia-Weiwen (Collaborator, Author) commented:
Thanks for the comment. I have moved this to _linear_int8_act_int4_weight_cpu_impl.

@Xia-Weiwen Xia-Weiwen merged commit 8b57afe into pytorch:main Jun 25, 2025
35 checks passed
Labels
CLA Signed, cpu, quantize, topic: new feature
4 participants