[Quantization] Channel-wise Output Activation Quantization for Attention QKV Modules + KV-cache channel quantization #1233
Blocked on: neuralmagic/compressed-tensors#270
SUMMARY:
- Quantize the output activations of the attention layers channel-wise -> this previously had no support and the wrong dim was selected to quantize (see the sketch below).
- Quantize the kv-cache channel-wise in int8 -> previously only tensor-wise was supported. Next PR.
- The only attention modules we need to worry about are Q/K/V; the O/Up/Down projections are not quantized.
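For intuition, a minimal PyTorch sketch of the difference in scale shapes between tensor-wise and channel-wise quantization of an output activation (shapes are illustrative, not the code in this PR):

```python
import torch

# Hypothetical output activation of a q/k/v Linear module:
# [batch, seq_len, out_features]
out = torch.randn(1, 1930, 1024)

# Tensor-wise (previous kv-cache behaviour): a single scale for the whole tensor.
tensor_scale = out.abs().amax() / 127             # shape torch.Size([])

# Channel-wise (this PR): one scale per output channel, i.e. reduce over
# every dim except the last one.
channel_scale = out.abs().amax(dim=(0, 1)) / 127  # shape torch.Size([1024])
```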
Math:
- x is the input vector -> tokenized + embedded
- the weights for Q/K/V are Linear modules
- output is the result of the forward call of the Q/K/V Linear modules on x

Expected output scales and zp shapes:
- output activations (channel-wise): one scale/zp per output channel (e.g. [1024] in the example below)
- kv-cache (channel-wise): k_proj, v_proj -> [head_dim]
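A small sketch of how the [head_dim] scale shape falls out of a cached key/value tensor, assuming the usual [batch, num_kv_heads, seq_len, head_dim] cache layout (not the actual observer code):

```python
import torch

# Hypothetical kv-cache entry: [batch, num_kv_heads, seq_len, head_dim]
k_cache = torch.randn(1, 8, 1930, 128)

# Channel-wise int8: reduce over every dim except head_dim, so the
# scale has one entry per channel of the head dimension.
k_scale = k_cache.abs().amax(dim=(0, 1, 2)) / 127
print(k_scale.shape)  # torch.Size([128]) -> [head_dim]
```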
The observer returns scales/zero points with the same ndim as the given output activation tensor (i.e. for an activation of torch.Size([1, 1930, 1024]) it outputs torch.Size([1, 1, 1024])). We squeeze that down to torch.Size([1024]), so ndim of 1.
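An illustrative sketch of the squeeze described above (not the exact observer call):

```python
import torch

# Scale as returned by the observer: same ndim as the activation,
# e.g. activation torch.Size([1, 1930, 1024]) -> scale torch.Size([1, 1, 1024])
scale = torch.rand(1, 1, 1024)

# Squeeze the singleton dims so the stored scale is 1-D.
scale = scale.squeeze()
print(scale.shape)  # torch.Size([1024])
```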
TEST PLAN: