
Commit 6824773

add Qwen3-Omni-30B-A3B-Thinking doc
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
1 parent 5f08e07 commit 6824773

File tree

2 files changed: +187 -0 lines changed


docs/source/tutorials/index.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ multi_npu_qwen3_next
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe
+multi_npu_qwen3_omni_30B_A3B_Thinking
 multi_npu_quantization
 single_node_300i
 DeepSeek-V3.2-Exp.md

docs/source/tutorials/multi_npu_qwen3_omni_30B_A3B_Thinking.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
# Multi-NPU (Qwen3-Omni-30B-A3B-Thinking)

## Run vllm-ascend on Multi-NPU with Qwen3-Omni-30B-A3B-Thinking

Run the docker container:

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
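
Before going further, you can check that the two NPUs mounted above are visible inside the container. A minimal check, using the `npu-smi` binary mounted from the host:

```bash
# Inside the container: list the visible NPUs and their health/memory status
npu-smi info
```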

Set up environment variables:

```bash
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True

# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
```
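
Since `VLLM_USE_MODELSCOPE=True` makes vLLM pull the weights from ModelScope on first use, you can optionally pre-download them into the `/root/.cache` directory mounted from the host so the first launch does not stall on the download. A sketch using the ModelScope CLI (this assumes the `modelscope` package is available in the image):

```bash
# Optional: pre-fetch the weights into the mounted cache
modelscope download --model Qwen/Qwen3-Omni-30B-A3B-Thinking
```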

Install the required Python packages:

```bash
# If transformers is already installed, upgrade it to version >= 4.57.0.dev0
# pip install transformers -U
# qwen-omni-utils provides the process_mm_info helper used in the offline example below
pip install qwen_vl_utils qwen-omni-utils --extra-index-url https://download.pytorch.org/whl/cpu/
```
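
To confirm that the installed transformers already satisfies the >= 4.57.0.dev0 requirement, a quick check is:

```bash
python -c "import transformers; print(transformers.__version__)"
```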

### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info


def clean_up():
    """Clean up distributed resources and NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
                {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "Analyze this audio, image, and video together."}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Keep use_audio_in_video consistent with mm_processor_kwargs below
    audios, images, videos = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": False}
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
```
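
To try it, save the script to a file inside the container (the filename below is just an example) and run it with Python; with `tensor_parallel_size=2` it will occupy both mounted NPUs:

```bash
python offline_qwen3_omni.py
```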

### Online Inference on Multi-NPU

Run the following command to start the vLLM server on Multi-NPU:

For an Atlas A2 card with 64 GB of NPU memory, a `--tensor-parallel-size` of at least 1 is enough; for cards with 32 GB of memory, set it to at least 2.

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2
```
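
Model loading can take several minutes, so it helps to wait until the server is ready before sending requests; one way is to poll the health endpoint, as in this sketch:

```bash
# Poll until the OpenAI-compatible server reports healthy
until curl -sf http://localhost:8000/health; do
    sleep 5
done
echo "vLLM server is ready"
```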

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                        }
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                        }
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Analyze this audio, image, and video together."
                    }
                ]
            }
        ]
    }'
```
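
Because the server exposes an OpenAI-compatible API, the same request can also be sent from Python with the `openai` client instead of curl. A minimal sketch, assuming `pip install openai` and a placeholder API key (the server does not verify one by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server; any non-empty api_key works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}},
                {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}},
                {"type": "text", "text": "Analyze this audio and image together."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```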
