
Commit 3d19401

committed
Merge branch 'attn_quant' of github.com:vllm-project/llm-compressor into attn_quant
Signed-off-by: George Ohashi <george@neuralmagic.com>
2 parents 78222ba + 405dc40 commit 3d19401

File tree: 1 file changed, +33 −7 lines


README.md

Lines changed: 33 additions & 7 deletions
@@ -23,6 +23,32 @@
* SmoothQuant
* SparseGPT

### When to Use Which Optimization

#### PTQ
PTQ (post-training quantization) is performed to reduce the precision of quantizable weights (e.g., those in linear layers) to a lower bit-width. Supported formats are:

##### [W4A16](./examples/quantization_w4a16/README.md)
- Uses GPTQ to compress weights to 4 bits. Requires a calibration dataset.
- Useful for speedups in low-QPS regimes, with the most weight compression.
- Recommended for any GPU type.
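
A minimal sketch of a W4A16 run using `llmcompressor`'s one-shot API, along the lines of the linked example; the model ID, calibration dataset, and sample counts below are placeholder assumptions, and import paths may vary between releases:

```python
# Sketch: GPTQ-based W4A16 (4-bit weights, 16-bit activations).
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to 4-bit weights with GPTQ, leaving the lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# W4A16 needs calibration data for GPTQ's error-compensating weight updates.
oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-W4A16", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W4A16")
```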

##### [W8A8-INT8](./examples/quantization_w8a8_int8/README.md)
- Uses GPTQ with channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization; activation quantization is carried out at inference time on vLLM.
- Useful for speedups in high-QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
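
A sketch of what the W8A8-INT8 flow could look like, assuming the SmoothQuant + GPTQ recipe used in the linked example; the model ID, dataset, and calibration settings are placeholders:

```python
# Sketch: int8 weights via SmoothQuant + GPTQ; int8 activations are quantized
# dynamically per token by vLLM at inference time.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = [
    # Migrate activation outliers into the weights before quantizing.
    SmoothQuantModifier(smoothing_strength=0.8),
    # Channel-wise 8-bit weight quantization; skip the lm_head.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Weight quantization requires calibration data.
oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8-INT8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8-INT8")
```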

##### [W8A8-FP8](./examples/quantization_w8a8_fp8/README.md)
- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset; activation quantization is carried out at inference time on vLLM.
- Useful for speedups in high-QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
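
A sketch of an FP8 dynamic run, again with placeholder model names; note that no calibration dataset is passed, matching the description above:

```python
# Sketch: FP8 weights (channel-wise) with dynamic per-token FP8 activations.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static FP8 weight scales, activation scales computed at runtime,
# so no calibration dataset is needed.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
```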

#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:

##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
- Uses (1) semi-structured sparsity (SparseGPT), where two of every four contiguous weights in a tensor are set to zero, and (2) channel-wise quantization to compress weights to 8 bits plus dynamic per-token quantization to compress activations to 8 bits.
- Useful for better inference speed than W8A8-FP8, with almost no drop in evaluation accuracy ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may experience accuracy drops when the remaining non-zero weights are insufficient to recapitulate the original distribution.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
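
A sketch of chaining 2:4 pruning with FP8 quantization in one recipe; the modifier arguments shown (sparsity level, mask structure) reflect the linked example but should be treated as assumptions:

```python
# Sketch: SparseGPT 2:4 semi-structured pruning followed by FP8 dynamic quantization.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = [
    # 50% sparsity with a 2:4 mask: two of every four contiguous weights are zeroed.
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4", targets="Linear", ignore=["lm_head"]),
    # FP8 weights (channel-wise) plus dynamic per-token FP8 activations.
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

# Pruning uses calibration data to decide which weights to keep.
oneshot(
    model=model,
    dataset="open_platypus",        # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-2of4-FP8-Dynamic", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-2of4-FP8-Dynamic")
```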

## Installation

@@ -35,16 +61,16 @@ pip install llmcompressor
### End-to-End Examples

Applying quantization with `llmcompressor`:
- * [Activation quantization to `int8`](examples/quantization_w8a8_int8)
- * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
- * [Weight only quantization to `int4`](examples/quantization_w4a16)
- * [Quantizing MoE LLMs](examples/quantizing_moe)
- * [Quantizing Vision-Language Models](examples/multimodal_vision)
- * [Quantizing Audio-Language Models](examples/multimodal_audio)
+ * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
+ * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+ * [Weight only quantization to `int4`](examples/quantization_w4a16/README.md)
+ * [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
+ * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
+ * [Quantizing Audio-Language Models](examples/multimodal_audio/README.md)

### User Guides
Deep dives into advanced usage of `llmcompressor`:
- * [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate)
+ * [Quantizing with large models with the help of `accelerate`](examples/big_models_with_accelerate/README.md)

## Quick Tour
