# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.
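
To give a sense of what lowering precision means, here is a minimal, illustrative sketch of symmetric 8-bit (int8) quantization in plain NumPy. It is not the algorithm ModelSlim implements; it only shows the round-trip that int8 quantization performs on a tensor in principle:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Quantize a float tensor to int8 with a single symmetric scale."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print("max abs error:", np.abs(weights - dequantize(q, scale)).max())
```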

Since the 0.9.0rc2 release, quantization has been experimentally supported in vLLM Ascend. You can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested; more quantization algorithms and models will be supported in the future.

## Install modelslim

To quantize a model, install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool built upon the Ascend platform, designed for acceleration with compression as its core technology.

Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install any other version until the modelslim master branch becomes available for vLLM Ascend.

Install modelslim:

```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```

## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: just download the model and then execute the convert command shown below. More info can be found in the modelslim [DeepSeek W8A8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).

```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```

:::{note}
You can also download the quantized model that we uploaded, for example [vllm-ascend/DeepSeek-V2-Lite-W8A8](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8). Please note that these weights should be used for testing only.
:::
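
If you take that route, a minimal download sketch using the ModelScope Python SDK might look like this (the `cache_dir` value is only an illustrative local path):

```python
from modelscope import snapshot_download

# Download the pre-quantized test weights from ModelScope.
model_dir = snapshot_download(
    "vllm-ascend/DeepSeek-V2-Lite-W8A8",
    cache_dir="./models",  # hypothetical local cache directory
)
print("Model downloaded to:", model_dir)
```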

Once the convert action is done, two important files are generated.

1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.

2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weight info is recorded in this file. A quick sanity check of both files is sketched below.
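
Here is a minimal sanity-check sketch in plain Python. It only assumes that `quant_model_description.json` is a JSON object keyed by parameter name; replace the path with your actual `{quantized_model_save_path}`:

```python
import json
from pathlib import Path

model_dir = Path("{quantized_model_save_path}")  # replace with your actual save path

# 1. config.json must NOT contain a `quantization_config` field.
config = json.loads((model_dir / "config.json").read_text())
assert "quantization_config" not in config, "Remove `quantization_config` from config.json"

# 2. quant_model_description.json records the converted weight info.
description = json.loads((model_dir / "quant_model_description.json").read_text())
print(f"{len(description)} entries recorded, first few:")
for name, info in list(description.items())[:5]:
    print(f"  {name}: {info}")
```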

Here is the full list of converted model files:

```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```

## Run the model

Now you can run the quantized model with vLLM Ascend. Here are examples for offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
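
Once the server is up, it exposes the vLLM OpenAI-compatible API. A minimal query sketch, assuming the default host and port (`localhost:8000`):

```python
import requests

# Send a completion request to the OpenAI-compatible endpoint served by vLLM.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-v2-lite-w8a8",  # must match --served-model-name
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0.6,
    },
)
print(response.json()["choices"][0]["text"])
```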

## FAQs

### 1. How to solve the KeyError: 'xxx.layers.0.self_attn.q_proj.weight' problem?

First, make sure you specify the `ascend` quantization method. Second, check whether your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` modelslim version. Finally, if it still doesn't work, please submit an issue; maybe some new models need to be adapted.

### 2. How to solve the error "Could not locate the configuration_deepseek.py"?

Please convert DeepSeek series models with the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim; this version fixes the missing `configuration_deepseek.py` error.
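
If you are unsure which modelslim version your checkout is on, you can verify it before converting. A small sketch, assuming the repository was cloned into `msit` as shown in the installation step above:

```python
import subprocess

# Print the tag/commit the local msit checkout is on; it should report
# modelslim-VLLM-8.1.RC1.b020_001 if the installation step above was followed.
result = subprocess.run(
    ["git", "-C", "msit", "describe", "--tags", "--always"],
    capture_output=True, text=True, check=True,
)
print("modelslim checkout:", result.stdout.strip())
```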