Add user guide for quantization #1206
Conversation
Force-pushed from 183d62c to 885de28.
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby improving the inference speed.

In 0.9.0rc1 version, only W8A8 quantization is supported. And only Qwen, DeepSeek series models are well tested. We'll support more quantization algorithm and models in the next release. The following
0.9.0rc2
"in the next release" -> "in the future"
Since version 0.9.0rc2, the quantization feature is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`; currently only the Qwen and DeepSeek series models are well tested. We'll support more quantization algorithms and models in the future.
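As a quick illustration (a minimal sketch of my own, not part of the PR; the model path is a placeholder), the same flag maps to the `quantization="ascend"` argument in the offline API:

```python
from vllm import LLM

# Hypothetical path to an already-quantized checkpoint; replace with your own.
llm = LLM(model="/path/to/DeepSeek-V2-Lite-W8A8",
          quantization="ascend",
          trust_remote_code=True)
```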
## Install modelslim

To quantize a model, we should install [modelslim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is Ascend compression acceleration tool, an affinity compression tool that aims at acceleration, takes compression as the technology, and is based on Ascend.
To quantize a model, uers should install ModelSlim, which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
Currently, only this specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim has already adapted vLLM Ascend, please don't install other version. And we will make modelslim master version avaliable as soon as possible.
Currently, only the specific tag modelslim-VLLM-8.1.RC1.b020_001 of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master version is available for vLLM Ascend in the future.
## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example, download the model, and execute the command that can be found in modelslim [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).
Take DeepSeek-V2-Lite as an example: you just need to download the model and then execute the convert command shown below. More info can be found in the modelslim deepseek w8a8 dynamic quantization docs.
```

:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
You can also download the quantized model we uploaded. Please note that these weights should be used for testing only. For example: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
see https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8
:::
After quantization, there are two files that we need to pay attention to.
Once the conversion is done, there are two important files generated.
1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Unlike most other quantized models, it doesn't contain `quantization_config` field. If it contains, vLLM Ascend won't work correctly.
Unlike most other quantized models, it doesn't contain `quantization_config` field. If it contains, vLLM Ascend won't work correctly.
->
Please make sure that there is no `quantization_config` field in it.
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). Actually, the quantization information is moved from `config.json` to a file named `quant_model_description.json`; it records per-layer quantization parameters. If vLLM cannot find this file, loading the quantization config will fail.
Actually, the quantization information is moved from `config.json` to a file named `quant_model_description.json`; it records per-layer quantization parameters. If vLLM cannot find this file, loading the quantization config will fail.
All the converted weight info is recorded in this file.
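For example, the two files can be sanity-checked right after conversion with a small script (a minimal sketch of my own, standard library only; the save path is a placeholder and the exact JSON layout may differ across modelslim versions):

```python
import json
import os

save_path = "{quantized_model_save_path}"  # placeholder, same as in the convert command

# config.json must NOT carry a `quantization_config` field for vLLM Ascend.
with open(os.path.join(save_path, "config.json")) as f:
    config = json.load(f)
assert "quantization_config" not in config, "remove quantization_config from config.json"

# quant_model_description.json records the per-layer quantization parameters.
with open(os.path.join(save_path, "quant_model_description.json")) as f:
    quant_desc = json.load(f)
print(f"quant_model_description.json: {len(quant_desc)} entries")
```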
## Run the model

vLLM Ascend registers a custom quantization method called `ascend`, so we need to specify this quantization method when running the quantized model.
Now you can run the quantized models with vLLM Ascend. Here are examples for online and offline inference.
Force-pushed from fdcd7ca to eb08dad.
@@ -0,0 +1,83 @@
# Quantization Guide
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby improving the inference speed.
->
Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of the weights and activation values in the model, thereby saving memory and improving the inference speed.
Especially note for ascend...
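As a rough back-of-the-envelope illustration of the memory side (my own sketch, not from the PR; the 7B parameter count is just an example):

```python
# Weight-memory estimate for a hypothetical 7B-parameter model.
params = 7e9

fp16_gib = params * 2 / 1024**3  # 2 bytes per weight in FP16/BF16
w8a8_gib = params * 1 / 1024**3  # 1 byte per weight after W8A8 quantization

print(f"FP16 weights: ~{fp16_gib:.1f} GiB, W8A8 weights: ~{w8a8_gib:.1f} GiB")
```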
```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
Add a validation here, maybe print the version or something.
done
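For reference, a minimal validation sketch (my assumption: after `bash install.sh` the package is importable as `msmodelslim`; adjust the module name if your install differs):

```python
# Quick sanity check that modelslim was installed correctly (module name assumed).
import msmodelslim

print("msmodelslim imported from:", msmodelslim.__file__)
print("version:", getattr(msmodelslim, "__version__", "unknown"))
```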
```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```
Add a "Verify the quantized model" section about the key file list, like: https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html#verify-the-quantized-model
done
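A sketch of what such a verification step could look like (my own assumption about the typical output layout; anything beyond `config.json` and `quant_model_description.json` may vary between modelslim versions):

```python
import os

save_path = "{quantized_model_save_path}"  # placeholder from the convert command above

# Files we expect after a successful W8A8 conversion.
for name in ["config.json", "quant_model_description.json"]:
    print(name, "ok" if os.path.exists(os.path.join(save_path, name)) else "MISSING")

# The converted weights themselves are typically stored as safetensors shards.
print("weight files:", [f for f in os.listdir(save_path) if f.endswith(".safetensors")])
```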
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}", |
add a note before here:
# Enable quantization by specifying `quantization="ascend"`
I mean the code comments before the key line😂
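Putting it together, a sketch of the offline-inference snippet with the requested comment (the prompt text and any LLM arguments beyond those visible in the diff are my own placeholders):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",  # placeholder prompt
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

# Enable quantization by specifying `quantization="ascend"`
llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```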
### Online inference

```bash
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code |
# Enable quantization by specifying `--quantization ascend`
done
same
Force-pushed from d7a33d0 to ea496b3.
## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?
Also link the FAQ to https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/faqs.html, update https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/faqs.html#how-to-run-w8a8-deepseek-model, and mention all the cases in #619 (comment).
I will submit another PR for the v0.7.3-dev branch.
## Install modelslim

To quantize a model, uers should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md) which is the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, using compression as its core technology and built upon the Ascend platform.
uers--->users
done
```

Here is part of installation log:
The Python installation log seems unused; if we can't run the cmd, we can drop this.
I have dropped this log.
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
LGTM
@ApsarasX Would you mind taking a look? Many quantization issues were resolved with your help. Hope this doc will help some.
I think this PR is great; now I can quantize model weights myself too.
What this PR does / why we need it?
Add user guide for quantization
Does this PR introduce any user-facing change?
No
How was this patch tested?
Preview