- Uses channel-wise quantization to compress weights to 8 bits using GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization. Activation quantization is carried out during inference on vLLM. See the sketch after this list.
- Useful for speedups in high-QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
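
A minimal sketch of this flow, modeled on the `int8` example linked under End-to-End Examples below; the model name, dataset, and calibration settings are placeholders, and exact import paths and arguments may differ between `llmcompressor` releases:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Recipe: GPTQ quantizes weights to int8 (channel-wise) and marks
# activations for dynamic per-token int8 quantization at inference time.
recipe = GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"])

# One-shot calibration pass over a small dataset (placeholder names).
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
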
- Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset. Activation quantization is carried out during inference on vLLM. See the sketch after this list.
- Useful for speedups in high-QPS regimes or offline serving on vLLM.
- Recommended for NVIDIA GPUs with compute capability >=8.9 (Hopper and Ada Lovelace).
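
A minimal sketch of the calibration-free FP8 flow, modeled on the `fp8` example linked under End-to-End Examples below; the model name and output directory are placeholders, and exact import paths may differ between `llmcompressor` releases:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# FP8_DYNAMIC: channel-wise FP8 weights plus dynamic per-token FP8
# activations; no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic",
)
```

The saved checkpoint can then be served with vLLM, which performs the dynamic activation quantization at inference time.
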
#### Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
##### [2:4-Sparsity with FP8 Weight, FP8 Input Activation](./examples/sparse_2of4_quantization_fp8/README.md)
- Combines (1) semi-structured 2:4 sparsity (SparseGPT), in which two of every four contiguous weights in a tensor are set to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits. See the sketch after this list.
- Useful for improved inference performance over W8A8-fp8, with almost no drop in evaluation scores ([blog](https://neuralmagic.com/blog/24-sparse-llama-fp8-sota-performance-for-nvidia-hopper-gpus/)). Note: small models may see an accuracy drop when the remaining non-zero weights are insufficient to preserve the original weight distribution.
- Recommended for compute capability >=8.9 (Hopper and Ada Lovelace).
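
A minimal sketch of a combined pruning-plus-quantization recipe, following the 2:4 example linked above; the modifier arguments are assumptions based on that example and may differ between `llmcompressor` releases, and the model and dataset names are placeholders:

```python
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Stage 1: SparseGPT prunes to 2:4 semi-structured sparsity (two of every
# four contiguous weights set to zero).
# Stage 2: the remaining weights are quantized to FP8 (channel-wise) and
# activations are marked for dynamic per-token FP8 quantization.
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4"),
    QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]),
]

# SparseGPT needs calibration data, so a small dataset is passed in
# (placeholder names below).
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-2of4-FP8-Dynamic",
    num_calibration_samples=512,
)
```
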
## Installation
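
`llmcompressor` can be installed from PyPI:

```bash
pip install llmcompressor
```
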
### End-to-End Examples
Applying quantization with `llmcompressor`:
* [Activation quantization to `int8`](examples/quantization_w8a8_int8)
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
* [Weight-only quantization to `int4`](examples/quantization_w4a16)