1 change: 1 addition & 0 deletions docs/source/index.md
@@ -48,6 +48,7 @@ user_guide/supported_models
user_guide/env_vars
user_guide/additional_config
user_guide/graph_mode.md
user_guide/quantization.md
user_guide/release_notes
:::

106 changes: 106 additions & 0 deletions docs/source/user_guide/quantization.md
@@ -0,0 +1,106 @@
# Quantization Guide

Model quantization is a technique that reduces the size and computational requirements of a model by lowering the data precision of its weights and activation values, thereby saving memory and improving inference speed.

Since version 0.9.0rc2, quantization is experimentally supported in vLLM Ascend. Users can enable it by specifying `--quantization ascend`. Currently, only the Qwen and DeepSeek series models are well tested. More quantization algorithms and models will be supported in the future.

## Install modelslim

To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/msit/blob/master/msmodelslim/README.md), the Ascend compression and acceleration tool. It is an affinity-based compression tool designed for acceleration, with compression as its core technology, built on the Ascend platform.

Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master version is available for vLLM Ascend in the future.

Install modelslim:
```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
bash install.sh
pip install accelerate
```
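
To confirm the installation succeeded, a quick sanity check might look like this (assuming the package is importable as `msmodelslim`; the exact module name may differ between modelslim releases):

```bash
# Optional: verify that the msmodelslim package can be imported.
python3 -c "import msmodelslim; print('msmodelslim is installed')"
```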

## Quantize model

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: download the model and then run the conversion command shown below. More details can be found in the modelslim [DeepSeek W8A8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).


```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
```

:::{note}
You can also download the quantized model that we have uploaded, for example https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8. Please note that these weights should be used for testing only.
:::

Once the conversion is done, two important files are generated.

1. [config.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/config.json?status=1). Please make sure that there is no `quantization_config` field in it.

2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weight info is recorded in this file (see the sketch below).
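
A quick, optional sanity check over these two files might look like the following sketch (it assumes `quant_model_description.json` is a flat JSON object keyed by weight names, which may vary across modelslim versions):

```python
import json
import os

# Path where the quantized model was saved (placeholder).
model_dir = "{quantized_model_save_path}"

# 1. config.json must not contain a `quantization_config` field.
with open(os.path.join(model_dir, "config.json")) as f:
    config = json.load(f)
assert "quantization_config" not in config, "please remove the quantization_config field"

# 2. quant_model_description.json records the converted weight info.
with open(os.path.join(model_dir, "quant_model_description.json")) as f:
    description = json.load(f)
print(f"{len(description)} entries recorded in quant_model_description.json")
```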

Here is the full list of the converted model files:
```bash
.
├── config.json
├── configuration_deepseek.py
├── configuration.json
├── generation_config.json
├── quant_model_description.json
├── quant_model_weight_w8a8_dynamic-00001-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00002-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00003-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic-00004-of-00004.safetensors
├── quant_model_weight_w8a8_dynamic.safetensors.index.json
├── README.md
├── tokenization_deepseek_fast.py
├── tokenizer_config.json
└── tokenizer.json
```

## Run the model

Now you can run the quantized model with vLLM Ascend. Here are examples for offline and online inference.

### Offline inference

```python
import torch

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)

llm = LLM(model="{quantized_model_save_path}",
          max_model_len=2048,
          trust_remote_code=True,
          # Enable quantization by specifying `quantization="ascend"`
          quantization="ascend")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

### Online inference

```bash
# Enable quantization by specifying `--quantization ascend`
vllm serve {quantized_model_save_path} --served-model-name "deepseek-v2-lite-w8a8" --max-model-len 2048 --quantization ascend --trust-remote-code
```
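
Once the server is up, you can verify the deployment with a request to the OpenAI-compatible completions endpoint (port 8000 by default):

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-v2-lite-w8a8",
        "prompt": "The future of AI is",
        "max_tokens": 64,
        "temperature": 0.6
    }'
```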

## FAQs

### 1. How to solve the `KeyError: 'xxx.layers.0.self_attn.q_proj.weight'` problem?

First, make sure you specify the `ascend` quantization method. Second, check that your model was converted with the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim. Finally, if it still doesn't work, please submit an issue; some new models may need to be adapted.
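
If you want to confirm which weight names your converted checkpoint actually contains, a minimal sketch (assuming the standard sharded safetensors index layout shown in the file listing above) is:

```bash
# List a few weight names recorded in the converted checkpoint to confirm it
# was produced by the expected modelslim conversion.
python3 -c "import json; idx = json.load(open('{quantized_model_save_path}/quant_model_weight_w8a8_dynamic.safetensors.index.json')); print(list(idx['weight_map'])[:10])"
```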

### 2. How to solve the error "Could not locate the `configuration_deepseek.py`"?

Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` version of modelslim; this version fixes the missing `configuration_deepseek.py` error.