Add Qwix quantization + per-tensor KV cache quantization #205
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: Jacob Platin <jacobplatin@google.com>
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Signed-off-by: wwl2755-google <wenlongwang@google.com>
Manual CI tests passed: https://buildkite.com/tpu-commons/tpu-commons-ci/builds/565 Signed-off-by: bzgoogle <beinuoz@google.com>
Signed-off-by: Xiang Xu <xiangxu@google.com>
Signed-off-by: Jacob Platin <jacobplatin@google.com>
Signed-off-by: Xiang Xu <xiangxu@google.com>
Signed-off-by: Jacob Platin <jacobplatin@google.com>
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
Force-pushed from 6de3c9b to be643d1
Force-pushed from f7561a2 to 66ad626
Signed-off-by: Jacob Platin <jacobplatin@google.com>
Signed-off-by: Jacob Platin <jacobplatin@google.com>
num_kv_heads=kv_cache_spec.
num_kv_heads,  # NOTE: we'll multiply by 2 in the function
nit: can you make this into a single line? The current split is confusing, and if the comment is the reason for the split, can you add it on a separate line above?
README.md
Outdated
]
```

You may also create a file that defines your own rules (e.g. `tpu_commons/models/jax/utils/quantization/quantize_all_modules_int8_wa.yaml`), where each entry under `rules` corresponds to a `qwix.QuantizationRule`. To pass this file (which is mutually exclusive with `quantization.dtype`), you can do something similar to:
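For context, such a rules file might look like the following sketch. The field names under each rule are illustrative assumptions, not Qwix's exact schema; consult the `qwix.QuantizationRule` definition for the real fields:

```yaml
# Hypothetical Qwix rules file, e.g. my_rules.yaml.
# Each entry under `rules` maps to a qwix.QuantizationRule;
# the field names below are assumptions for illustration.
rules:
  - module_path: ".*attention.*"   # regex over module paths
    weight_qtype: int8             # weights-only for attention
  - module_path: ".*mlp.*"
    weight_qtype: int8
    act_qtype: int8                # weights + activations for MLP
```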
Why not make a standard dir for quantization configs and ask the user to add their config to that dir? Then they can just pass the file name. We can store the KV cache quantization config in the same file.

```
--additional_config='{"quantization": "int8_default.yaml"}'
```

And for a custom file:

```
--additional_config='{"quantization": "int8_default_int8_kv.yaml"}'
```

Files can be in `tpu_commons/models/jax/utils/quantization/configs/`. We will likely have different configs checked in to this dir for different models/datasets.
Agreed on making a standardized dir, but I think keeping KV cache / model quant separate makes more sense: we really only need the YAML for the Qwix rules, so the code / UX stays much cleaner if we separate the two quants (the KV quant is only a dtype specification). We can iterate more on this in the KV cache quant PR.
KV cache config also has multiple options: dtype, per_tensor/per_token/dimension_to_quantize_on, and different config for key/value. We will only support simple per-tensor int8 initially, but keeping it in the file keeps it flexible. Also, one config for all quantization is easier to read.
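A unified config of the kind described here might look something like the sketch below (all keys are hypothetical, illustrating the suggestion rather than an implemented schema):

```yaml
# Hypothetical unified quantization config (illustrative keys only).
rules:
  - module_path: ".*"
    weight_qtype: int8
kv_cache:
  dtype: int8
  granularity: per_tensor   # later: per_token / dimension_to_quantize_on
  key:
    dtype: int8             # key and value could diverge if needed
  value:
    dtype: int8
```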
mitalisi left a comment:
Can we add details on the perf and accuracy results for these techniques?
README.md
Outdated

By default, we will use the following Qwix rules (with the given `dtype`), which will quantize attention weights-only and MLP with weights and activations:

```
Better to add the path to the file here so that it stays up to date.
k_scale, v_scale = None, None
if k_scale_ref is not None:
    k_scale = k_scale_ref[0]
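The snippet above reads a single per-tensor scale (`k_scale_ref[0]`) for the whole key cache. The idea behind per-tensor int8 KV quantization can be sketched with NumPy (an illustrative sketch under those assumptions, not the PR's actual JAX/Pallas implementation):

```python
import numpy as np

def quantize_per_tensor_int8(x: np.ndarray):
    """Quantize a tensor using a single (per-tensor) int8 scale."""
    # One scale for the whole tensor: map max|x| onto the int8 range.
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-12)  # avoid div-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_per_tensor(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantize back to float using the stored per-tensor scale.
    return q.astype(np.float32) * scale

np.random.seed(0)
k = np.random.randn(4, 8).astype(np.float32)  # stand-in for a K-cache block
qk, k_scale = quantize_per_tensor_int8(k)
k_restored = dequantize_per_tensor(qk, k_scale)
# Per-tensor rounding error is bounded by half a quantization step.
assert np.max(np.abs(k_restored - k)) <= 0.5 * k_scale + 1e-6
```

Per-token or per-dimension variants would store a vector of scales instead of a scalar, which is the flexibility the config discussion above refers to.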
TODO @jrplatin: do we want to keep this astype?
Signed-off-by: Jacob Platin <jacobplatin@google.com>