
Commit 6824773

add Qwen3-Omni-30B-A3B-Thinking doc
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
1 parent 5f08e07 commit 6824773

File tree

2 files changed: +187 -0 lines changed


docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ multi_npu_qwen3_next
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe
+multi_npu_qwen3_omni_30B_A3B_Thinking
 multi_npu_quantization
 single_node_300i
 DeepSeek-V3.2-Exp.md

docs/source/tutorials/multi_npu_qwen3_omni_30B_A3B_Thinking.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# Multi-NPU (Qwen3-Omni-30B-A3B-Thinking)

## Run vllm-ascend on Multi-NPU with Qwen3-Omni-30B-A3B-Thinking

Run the docker container:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
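
Before going further, you can check that the two NPUs mounted above are visible inside the container. A minimal check, using the `npu-smi` binary mounted from the host:

```bash
# Inside the container: list the visible NPUs and their health/memory status
npu-smi info
```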

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
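
Since `VLLM_USE_MODELSCOPE=True` makes vLLM pull the weights from ModelScope on first use, you can optionally pre-download them into the `/root/.cache` directory mounted from the host so the first launch does not stall on the download. A sketch using the ModelScope CLI (this assumes the `modelscope` package is available in the image):

```bash
# Optional: pre-fetch the weights into the mounted cache
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking
```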

Install the required Python packages:

```bash
# If transformers is already installed, upgrade it to version >= 4.57.0.dev0
# pip install transformers -U
# qwen-omni-utils provides the process_mm_info helper used in the offline example below
pip install qwen_vl_utils qwen-omni-utils --extra-index-url https://download.pytorch.org/whl/cpu/
```
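
To confirm that the installed transformers already satisfies the >= 4.57.0.dev0 requirement, a quick check is:

```bash
python -c "import transformers; print(transformers.__version__)"
```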

### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info


def clean_up():
    """Clean up distributed resources and NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "Analyze this audio, image, and video together."}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Keep use_audio_in_video consistent with mm_processor_kwargs below
    audios, images, videos = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": False}
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
```
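
To try it, save the script to a file inside the container (the filename below is just an example) and run it with Python; with `tensor_parallel_size=2` it will occupy both mounted NPUs:

```bash
python offline_qwen3_omni.py
```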

### Online Inference on Multi-NPU

Run the following command to start the vLLM server on Multi-NPU:

For an Atlas A2 card with 64 GB of NPU memory, a `--tensor-parallel-size` of at least 1 is enough; for cards with 32 GB of memory, set it to at least 2.

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2
```
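
Model loading can take several minutes, so it helps to wait until the server is ready before sending requests; one way is to poll the health endpoint, as in this sketch:

```bash
# Poll until the OpenAI-compatible server reports healthy
until curl -sf http://localhost:8000/health; do
    sleep 5
done
echo "vLLM server is ready"
```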

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                        }
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                        }
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Analyze this audio, image, and video together."
                    }
                ]
            }
        ]
    }'
```
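
Because the server exposes an OpenAI-compatible API, the same request can also be sent from Python with the `openai` client instead of curl. A minimal sketch, assuming `pip install openai` and a placeholder API key (the server does not verify one by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server; any non-empty api_key works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
                {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
                {"type": "text", "text": "Analyze this audio and image together."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```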
