
Conversation

@22dimensions (Collaborator) commented Jun 13, 2025:

What this PR does / why we need it?

Add user guide for quantization

Does this PR introduce any user-facing change?

No

How was this patch tested?

Preview

@github-actions bot added the `documentation` (Improvements or additions to documentation) label Jun 13, 2025
@22dimensions force-pushed the quant_docs branch 8 times, most recently from 183d62c to 885de28 on June 13, 2025 09:35

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby improving the inference speed.

In 0.9.0rc1 version, only W8A8 quantization is supported. And only Qwen, DeepSeek series models are well tested. We’ll support more quantization algorithm and models in the next release. The following
Collaborator commented:
0.9.0rc1 -> 0.9.0rc2

in the next release -> in the future.

@Yikun (Collaborator) commented Jun 13, 2025:

Since 0.9.0rc2 version, quantization feature is experimentally supported in vLLM Ascend.

Users can enable the quantization feature by specifying `--quantization ascend`. Currently, only Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.


## Install modelslim

To quantize a model, we should install [modelslim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is Ascend compression acceleration tool, an affinity compression tool that aims at acceleration, takes compression as the technology, and is based on Ascend.
Collaborator commented:

To quantize a model, users should install ModelSlim, which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.



Currently, only this specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim has already adapted vLLM Ascend, please don't install other version. And we will make modelslim master version avaliable as soon as possible.
Collaborator commented:
Currently, only the specific tag modelslim-VLLM-8.1.RC1.b020_001 of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master version is available for vLLM Ascend in the future.


## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example, download the model, and execute the command that can be found in modelslim [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).
Collaborator commented:
Take DeepSeek-V2-Lite as an example: you just need to download the model and then execute the convert command. The command is shown below. More info can be found in the modelslim deepseek w8a8 dynamic quantization docs.

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
Collaborator commented:
You can also download the quantized model we uploaded. Please note that these weights should be used for testing only. For example, https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8

see https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
:::
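For readers who take the second route, a minimal download sketch (not part of the PR) is shown below; it assumes the modelscope Python SDK is installed and exposes `snapshot_download`:

```python
# Hedged sketch: fetch the test-only quantized weights referenced in the note above.
# Assumes `pip install modelscope` and that `snapshot_download` is the right entry point.
from modelscope import snapshot_download

model_dir = snapshot_download("vllm-ascend/DeepSeek-V2-Lite-W8A8")
print("Quantized weights downloaded to:", model_dir)
```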

After quantization, there are two files that we need to pay attention to.
Collaborator commented:
Once the convert action is done, there are two important files generated.



1. [confg.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Unlike most other quantized models, it doesn't contain `quantization_config` field. If it contains, vLLM Ascend won't work correctly.
Collaborator commented:
Unlike most other quantized models, it doesn't contain quantization_config field. If it contains, vLLM Ascend won't work correctly.

->

Please make sure that there is no quantization_config field in it.



2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). Actually, the quantization information is moved from `config.json` to a file named `quant_model_description.json`, it records per layers quantization parameters. If vLLM connot find this file, loading quantization config process will fail.
Collaborator commented:
Actually, the quantization information is moved from config.json to a file named quant_model_description.json, it records per layers quantization parameters. If vLLM connot find this file, loading quantization config process will fail.

All the converted weight info is recorded in this file.
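A small sanity-check sketch (not part of the PR) for the two files described above; `{quantized_model_save_path}` is the placeholder used throughout this guide:

```python
# Hedged sketch: verify the quantized output before serving it with vLLM Ascend.
import json
from pathlib import Path

out_dir = Path("{quantized_model_save_path}")  # placeholder, replace with the real path

# 1. config.json must NOT contain a `quantization_config` field.
config = json.loads((out_dir / "config.json").read_text())
assert "quantization_config" not in config, "remove quantization_config from config.json"

# 2. quant_model_description.json must exist; it records the per-layer quantization parameters.
desc_path = out_dir / "quant_model_description.json"
assert desc_path.exists(), "quant_model_description.json is required to load the quantization config"
print("Entries in quant_model_description.json:", len(json.loads(desc_path.read_text())))
```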


## Run the model

vLLM Ascend register a custom quantization method called `ascend`, so we need to specify this quantization method when running the quantized model.
Collaborator commented:
Now, you can run the quantized models with vLLM Ascend. Here are the examples for online and offline inference.

@22dimensions force-pushed the quant_docs branch 2 times, most recently from fdcd7ca to eb08dad on June 13, 2025 09:53
@@ -0,0 +1,83 @@
# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby improving the inference speed.
@Yikun (Collaborator) commented Jun 13, 2025:

Suggested change:
- Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby improving the inference speed.
+ Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving the memory and improving the inference speed.

Especially note for ascend...



git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
Collaborator commented:
Add a validation here, maybe print the version or something.

Author (@22dimensions) replied:
done

```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```
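On the validation the reviewer asks for after the install step, one possible post-install check is sketched below; the importable module name `msmodelslim` is an assumption, and the exact check the author added is not shown in this thread:

```python
# Hedged sketch: confirm the ModelSlim install before running the convert step.
# The module name `msmodelslim` is assumed; adjust it if the package differs.
import importlib

for name in ("msmodelslim", "accelerate"):
    module = importlib.import_module(name)  # raises ImportError if the install failed
    version = getattr(module, "__version__", "unknown")
    print(f"{name} is importable, version: {version}")
```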


]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
Collaborator commented:
add a note before here:

# Enable quantization by specifying `quantization="ascend"`

Collaborator commented:

I mean the code comments before the key line😂
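The offline snippet above is truncated by the diff view. A hedged reconstruction of a complete offline example, including the code comment the reviewers ask for, is sketched below; the prompt is illustrative, and `max_model_len` plus `trust_remote_code` are assumed to mirror the serve command in the next section:

```python
# Hedged sketch of the offline example; "{quantized_model_save_path}" is the doc's placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",  # illustrative prompt, not from the PR
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

# Enable quantization by specifying `quantization="ascend"`
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```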

### Online inference

```bash
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
Collaborator commented:
# Enable quantization by specifying `--quantization ascend`

Author (@22dimensions) replied:
done

Collaborator commented:
same
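Once the server is up, a request can be sent against the OpenAI-compatible endpoint; a minimal sketch, assuming vLLM's default port 8000 and the served model name from the command above:

```python
# Hedged sketch: query the running server; port 8000 is vLLM's default and an assumption here.
import json
import urllib.request

payload = {
    "model": "deepseek-v2-lite-w8a8",  # matches --served-model-name above
    "prompt": "Hello, my name is",
    "max_tokens": 64,
    "temperature": 0.6,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["choices"][0]["text"])
```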

@22dimensions force-pushed the quant_docs branch 4 times, most recently from d7a33d0 to ea496b3 on June 13, 2025 10:31

## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
Author (@22dimensions) commented:
I will submit another pr for v0.7.3-dev branch


## Install modelslim

To quantize a model, uers should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
Collaborator commented:
uers -> users

Author (@22dimensions) replied:
done


Here is part of installation log:

Collaborator commented:
the python installation log seems unused; if we can't run the cmd, we can drop this

Author (@22dimensions) replied:
I have dropped this log.


Signed-off-by: 22dimensions <waitingwind@foxmail.com>
@Yikun (Collaborator) left a comment:
LGTM

@Yikun added the `ready` (read for review) label Jun 20, 2025
@Yikun (Collaborator) commented Jun 20, 2025:

@ApsarasX Would you mind taking a look? Many quantization issues were resolved with your help. Hope this doc will help some.

@ApsarasX (Collaborator) commented:
I think this PR is great—now I can quantize model weights myself too.

@wangxiyuan merged commit 761bd3d into vllm-project:main on Jun 20, 2025
8 checks passed