
Commit ef90f8f

Merge branch 'main' into main
2 parents 5473fb6 + d2f87ed commit ef90f8f


48 files changed (+4041, -455 lines)

.github/workflows/vllm_ascend_test.yaml

Lines changed: 13 additions & 3 deletions
@@ -114,14 +114,20 @@ jobs:
 # pytest -sv tests/singlecard/test_guided_decoding.py.py
 # test_ascend_config.py should be ran separately because it will regenerate the global config many times.
 pytest -sv tests/singlecard/test_ascend_config.py
+pytest -sv tests/singlecard/test_camem.py
 pytest -sv tests/singlecard/ \
 --ignore=tests/singlecard/test_offline_inference.py \
 --ignore=tests/singlecard/test_scheduler.py \
 --ignore=tests/singlecard/test_guided_decoding.py \
---ignore=tests/singlecard/test_ascend_config.py
+--ignore=tests/singlecard/test_ascend_config.py \
+--ignore=tests/singlecard/test_camem.py
 else
 pytest -sv tests/multicard/test_ilama_lora_tp2.py
-VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py
+# To avoid oom, we need to run the test in a single process.
+VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
+VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
+VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
+VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
 fi

 - name: Run vllm-project/vllm-ascend test on V0 engine

@@ -136,16 +142,20 @@ jobs:
 pytest -sv tests/singlecard/test_camem.py
 # test_ascend_config.py should be ran separately because it will regenerate the global config many times.
 pytest -sv tests/singlecard/test_ascend_config.py
+pytest -sv tests/singlecard/test_prompt_embedding.py
 pytest -sv tests/singlecard/ \
 --ignore=tests/singlecard/test_offline_inference.py \
 --ignore=tests/singlecard/test_scheduler.py \
 --ignore=tests/singlecard/test_guided_decoding.py \
 --ignore=tests/singlecard/test_camem.py \
---ignore=tests/singlecard/test_ascend_config.py
+--ignore=tests/singlecard/test_ascend_config.py \
+--ignore=tests/singlecard/test_prompt_embedding.py
 else
 pytest -sv tests/multicard/test_ilama_lora_tp2.py
 # Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py will raise error.
+# To avoid oom, we need to run the test in a single process.
 VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
 VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
+VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
 VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
 fi

docs/source/index.md

Lines changed: 1 addition & 0 deletions

@@ -47,6 +47,7 @@ user_guide/suppoted_features
 user_guide/supported_models
 user_guide/env_vars
 user_guide/additional_config
+user_guide/graph_mode.md
 user_guide/release_notes
 :::

docs/source/user_guide/additional_config.md

Lines changed: 8 additions & 3 deletions

@@ -28,7 +28,8 @@ The following table lists the additional configuration options available in vLLM
 | ---- | ---- | ------- | ----------- |
 | `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
 | `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
-| `expert_tensor_parallel_size` | str | `1` | Expert tensor parallel size the model to use. |
+| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size for the model to use. |
+| `refresh` | bool | `false` | Whether to refresh the global ascend config content. This value is usually used in the RLHF case. |

 The details of each config option are as follows:

@@ -37,9 +38,11 @@ The details of each config option are as follows:
 | Name | Type | Default | Description |
 | ---- | ---- | ------- | ----------- |
 | `enabled` | bool | `False` | Whether to enable torchair graph mode |
+| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
 | `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
 | `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |
+| `enable_multistream_shared_expert` | bool | `False` | Whether to enable multistream shared expert |

 **ascend_scheduler_config**

@@ -59,12 +62,14 @@ A full example of additional configuration is as follows:
     "enabled": true,
     "use_cached_graph": true,
     "graph_batch_sizes": [1, 2, 4, 8],
-    "graph_batch_sizes_init": true
+    "graph_batch_sizes_init": false,
+    "enable_multistream_shared_expert": false
   },
   "ascend_scheduler_config": {
     "enabled": true,
     "chunked_prefill_enabled": true,
   },
-  "expert_tensor_parallel_size": 1
+  "expert_tensor_parallel_size": 1,
+  "refresh": false
 }
 ```
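
The options documented above can also be supplied programmatically through the `additional_config` argument of `LLM(...)`, as the examples added elsewhere in this commit do. A minimal sketch (the model name and the specific option values are placeholders for illustration):

```python
# Sketch: passing the documented options to an LLM instance via additional_config.
# The model name and option values below are placeholders, not recommendations.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder model for illustration
    additional_config={
        "torchair_graph_config": {"enabled": False},
        "ascend_scheduler_config": {"enabled": True},
        "expert_tensor_parallel_size": 1,
    },
)
```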

docs/source/user_guide/graph_mode.md

Lines changed: 82 additions & 0 deletions

@@ -0,0 +1,82 @@

# Graph Mode Guide

This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.

This guide provides instructions for using Ascend Graph Mode with vLLM Ascend. Please note that graph mode is only available with the V1 Engine, and only the Qwen and DeepSeek series models are well tested in 0.9.0rc1. We'll make it stable and more general in the next release.

## Getting Started

From v0.9.0rc1, when the V1 Engine is used, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

There are two kinds of graph mode supported by vLLM Ascend:
- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.0rc1, only the Qwen series models are well tested.
- **TorchAirGraph**: This is the GE graph mode. In v0.9.0rc1, only the DeepSeek series models are supported.

## Using ACLGraph

ACLGraph is enabled by default. Taking the Qwen series models as an example, simply enabling the V1 Engine is enough.

Offline example:

```python
import os

from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

model = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct
```

## Using TorchAirGraph

If you want to run DeepSeek series models in graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional configuration is required.

Offline example:

```python
import os
from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}})
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}}'
```

You can find more details about the additional config [here](./additional_config.md).

## Fallback to Eager Mode

If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to eager mode.

Offline example:

```python
import os
from vllm import LLM

os.environ["VLLM_USE_V1"] = "1"

model = LLM(model="someother_model_weight", enforce_eager=True)
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
```
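
Whichever `vllm serve` command above is used, the resulting endpoint can be exercised over vLLM's OpenAI-compatible API. A minimal sketch, assuming the server listens on the default `http://localhost:8000` and that the `requests` package is available:

```python
# Sketch: sending a completion request to a model launched with `vllm serve`.
# Assumes the default port 8000; adjust the model name to match the served model.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2-7B-Instruct",
        "prompt": "Hello, how are you?",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```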
Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@

import os
import time

from vllm import LLM, SamplingParams

# enable dual-batch overlap for vllm ascend
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"
os.environ["VLLM_USE_V1"] = "1"

# Sample prompts.
prompts = ["The president of the United States is"] * 41
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)


def main():
    # Create an LLM.
    llm = LLM(model="deepseek-ai/DeepSeek-V3-Lite-base-latest-w8a8-dynamic",
              enforce_eager=True,
              tensor_parallel_size=2,
              max_model_len=4096,
              trust_remote_code=True,
              additional_config={
                  "torchair_graph_config": {
                      "enabled": False
                  },
                  "ascend_scheduler_config": {
                      "enabled": True
                  },
                  "expert_tensor_parallel_size": 1
              })

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    print("-" * 50)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
        print("-" * 50)

    # Add a buffer to wait for profiler in the background process
    # (in case MP is on) to finish writing profiling output.
    time.sleep(10)


if __name__ == "__main__":
    main()
Lines changed: 83 additions & 0 deletions

@@ -0,0 +1,83 @@

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          PreTrainedTokenizer)
from vllm import LLM


def init_tokenizer_and_llm(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    transformers_model = AutoModelForCausalLM.from_pretrained(model_name)
    embedding_layer = transformers_model.get_input_embeddings()
    llm = LLM(model=model_name, enable_prompt_embeds=True)
    return tokenizer, embedding_layer, llm


def get_prompt_embeds(chat: list[dict[str, str]],
                      tokenizer: PreTrainedTokenizer,
                      embedding_layer: torch.nn.Module):
    token_ids = tokenizer.apply_chat_template(chat,
                                              add_generation_prompt=True,
                                              return_tensors='pt')
    prompt_embeds = embedding_layer(token_ids).squeeze(0)
    return prompt_embeds


def single_prompt_inference(llm: LLM, tokenizer: PreTrainedTokenizer,
                            embedding_layer: torch.nn.Module):
    chat = [{
        "role": "user",
        "content": "Please tell me about the capital of France."
    }]
    prompt_embeds = get_prompt_embeds(chat, tokenizer, embedding_layer)

    outputs = llm.generate({
        "prompt_embeds": prompt_embeds,
    })

    print("\n[Single Inference Output]")
    print("-" * 30)
    for o in outputs:
        print(o.outputs[0].text)
    print("-" * 30)


def batch_prompt_inference(llm: LLM, tokenizer: PreTrainedTokenizer,
                           embedding_layer: torch.nn.Module):
    chats = [
        [{
            "role": "user",
            "content": "Please tell me about the capital of France."
        }],
        [{
            "role": "user",
            "content": "When is the day longest during the year?"
        }],
        [{
            "role": "user",
            "content": "Where is bigger, the moon or the sun?"
        }],
    ]

    prompt_embeds_list = [
        get_prompt_embeds(chat, tokenizer, embedding_layer) for chat in chats
    ]

    outputs = llm.generate([{
        "prompt_embeds": embeds
    } for embeds in prompt_embeds_list])

    print("\n[Batch Inference Outputs]")
    print("-" * 30)
    for i, o in enumerate(outputs):
        print(f"Q{i+1}: {chats[i][0]['content']}")
        print(f"A{i+1}: {o.outputs[0].text}\n")
    print("-" * 30)


def main():
    model_name = "meta-llama/Llama-3.2-1B-Instruct"
    tokenizer, embedding_layer, llm = init_tokenizer_and_llm(model_name)
    single_prompt_inference(llm, tokenizer, embedding_layer)
    batch_prompt_inference(llm, tokenizer, embedding_layer)


if __name__ == "__main__":
    main()

pyproject.toml

Lines changed: 3 additions & 0 deletions

@@ -16,5 +16,8 @@ requires = [
     "torch>=2.5.1",
     "torchvision<0.21.0",
     "wheel",
+    "msgpack",
+    "quart",
+    "numba",
 ]
 build-backend = "setuptools.build_meta"

requirements-dev.txt

Lines changed: 0 additions & 2 deletions

@@ -9,6 +9,4 @@ ray
 types-jsonschema
 xgrammar
 zmq
-numba
-quart
 types-psutil

requirements.txt

Lines changed: 3 additions & 0 deletions

@@ -18,3 +18,6 @@ wheel
 # requirements for disaggregated prefill
 msgpack
 quart
+
+# Required for N-gram speculative decoding
+numba

tests/long_term/test_deepseek_v2_lite_tp2_accuracy.py

Lines changed: 1 addition & 2 deletions

@@ -34,8 +34,7 @@
 # 3% relative tolerance for numerical accuracy.
 RTOL = 0.03
 # Baseline accuracy after VLLM optimization.
-# FIXME: fix the accuracy issue
-EXPECTED_VALUE = 0.000758150113722517
+EXPECTED_VALUE = 0.3843821076573162


 def run_test(model_name, queue, more_args=None):
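
For context, `RTOL` and `EXPECTED_VALUE` feed an accuracy assertion inside the test; the assertion itself is not part of this diff, so the following is only a hypothetical sketch of a 3% relative-tolerance check, with `check_accuracy` and `measured_value` as invented names:

```python
# Hypothetical sketch of a relative-tolerance accuracy check; the real
# assertion lives in the test body, which this diff does not show.
RTOL = 0.03
EXPECTED_VALUE = 0.3843821076573162


def check_accuracy(measured_value: float) -> None:
    # Accept results within 3% (relative) of the post-optimization baseline.
    lower, upper = EXPECTED_VALUE * (1 - RTOL), EXPECTED_VALUE * (1 + RTOL)
    assert lower <= measured_value <= upper, (
        f"accuracy {measured_value} outside [{lower:.4f}, {upper:.4f}]")
```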
