| 1 | +# Multi-NPU (Qwen3-Omni-30B-A3B-Thinking) |
| 2 | + |
| 3 | +## Run vllm-ascend on Multi-NPU with Qwen3-Omni-30B-A3B-Thinking |
| 4 | + |
| 5 | +Run docker container: |
| 6 | + |
| 7 | +```{code-block} bash |
| 8 | + :substitutions: |
| 9 | +# Update the vllm-ascend image |
| 10 | +export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| |
| 11 | +docker run --rm \ |
| 12 | +--name vllm-ascend \ |
| 13 | +--shm-size=1g \ |
| 14 | +--device /dev/davinci0 \ |
| 15 | +--device /dev/davinci1 \ |
| 16 | +--device /dev/davinci_manager \ |
| 17 | +--device /dev/devmm_svm \ |
| 18 | +--device /dev/hisi_hdc \ |
| 19 | +-v /usr/local/dcmi:/usr/local/dcmi \ |
| 20 | +-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ |
| 21 | +-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ |
| 22 | +-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ |
| 23 | +-v /etc/ascend_install.info:/etc/ascend_install.info \ |
| 24 | +-v /root/.cache:/root/.cache \ |
| 25 | +-p 8000:8000 \ |
| 26 | +-it $IMAGE bash |
| 27 | +``` |
| 28 | + |
| 29 | +Set up environment variables: |
| 30 | + |
| 31 | +```bash |
| 32 | +# Load model from ModelScope to speed up download |
| 33 | +export VLLM_USE_MODELSCOPE=True |
| 34 | + |
| 35 | +# Set `max_split_size_mb` to reduce memory fragmentation and avoid out-of-memory errors |
| 36 | +export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 |
| 37 | +``` |
| 38 | + |
| 39 | +Install Python dependencies: |
| 40 | + |
| 41 | +```bash |
| 42 | +# If transformers is already installed, upgrade it to version >= 4.57.0.dev0 |
| 43 | +# pip install transformers -U |
| 44 | +pip install qwen_vl_utils qwen_omni_utils --extra-index-url https://download.pytorch.org/whl/cpu/ |
| 45 | +``` |
| 46 | + |
| 47 | + |
| 48 | +### Offline Inference on Multi-NPU |
| 49 | + |
| 50 | +Run the following script to execute offline inference on multi-NPU: |
| 51 | + |
| 52 | +```python |
| 53 | +import gc |
| 54 | +import torch |
| 55 | +import os |
| 56 | +from vllm import LLM, SamplingParams |
| 57 | +from vllm.distributed.parallel_state import ( |
| 58 | + destroy_distributed_environment, |
| 59 | + destroy_model_parallel |
| 60 | +) |
| 61 | +from modelscope import Qwen3OmniMoeProcessor |
| 62 | +from qwen_omni_utils import process_mm_info |
| 63 | + |
| 64 | + |
| 65 | +def clean_up(): |
| 66 | + """Clean up distributed resources and NPU memory""" |
| 67 | + destroy_model_parallel() |
| 68 | + destroy_distributed_environment() |
| 69 | + gc.collect() # Garbage collection to free up memory |
| 70 | + torch.npu.empty_cache() |
| 71 | + |
| 72 | + |
| 73 | +def main(): |
| 74 | + MODEL_PATH = "/Qwen/Qwen3-Omni-30B-A3B-Thinking"  # Absolute local path to the model directory; adjust to your setup |
| 75 | + llm = LLM( |
| 76 | + model=MODEL_PATH, |
| 77 | + tensor_parallel_size=2, |
| 78 | + distributed_executor_backend="mp", |
| 79 | + limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3}, |
| 80 | + max_model_len=32768, |
| 81 | + ) |
| 82 | + |
| 83 | + sampling_params = SamplingParams( |
| 84 | + temperature=0.6, |
| 85 | + top_p=0.95, |
| 86 | + top_k=20, |
| 87 | + max_tokens=16384, |
| 88 | + ) |
| 89 | + |
| 90 | + processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH) |
| 91 | + messages = [ |
| 92 | + { |
| 93 | + "role": "user", |
| 94 | + "content": [ |
| 95 | + {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"}, |
| 96 | + {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"}, |
| 97 | + {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}, |
| 98 | + {"type": "text", "text": "Analyze this audio, image, and video together."} |
| 99 | + ] |
| 100 | + } |
| 101 | + ] |
| 102 | + |
| 103 | + text = processor.apply_chat_template( |
| 104 | + messages, |
| 105 | + tokenize=False, |
| 106 | + add_generation_prompt=True |
| 107 | + ) |
| 108 | + audios, images, videos = process_mm_info(messages) |
| 109 | + |
| 110 | + inputs = { |
| 111 | + "prompt": text, |
| 112 | + "multi_modal_data": {}, |
| 113 | + "mm_processor_kwargs": {"use_audio_in_video": False} |
| 114 | + } |
| 115 | + if images is not None: |
| 116 | + inputs['multi_modal_data']['image'] = images |
| 117 | + if videos is not None: |
| 118 | + inputs['multi_modal_data']['video'] = videos |
| 119 | + if audios is not None: |
| 120 | + inputs['multi_modal_data']['audio'] = audios |
| 121 | + |
| 122 | + outputs = llm.generate([inputs], sampling_params=sampling_params) |
| 123 | + for output in outputs: |
| 124 | + prompt = output.prompt |
| 125 | + generated_text = output.outputs[0].text |
| 126 | + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") |
| 127 | + |
| 128 | + del llm |
| 129 | + clean_up() |
| 130 | + |
| 131 | + |
| 132 | +if __name__ == "__main__": |
| 133 | + main() |
| 134 | +``` |
| 135 | + |
| 136 | + |
| 137 | +### Online Inference on Multi-NPU |
| 138 | + |
| 139 | +Run the following command to start the vLLM server on multiple NPUs: |
| 140 | + |
| 141 | +On an Atlas A2 with 64 GB of memory per NPU card, `--tensor-parallel-size` should be at least 1; with 32 GB per card, it should be at least 2. |
| 142 | + |
| 143 | +```bash |
| 144 | +vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 |
| 145 | +``` |
| 146 | + |
| 147 | +Once your server is started, you can query the model with input prompts. |
| 148 | + |
| 149 | +```bash |
| 150 | +curl http://localhost:8000/v1/chat/completions \ |
| 151 | +-X POST \ |
| 152 | +-H "Content-Type: application/json" \ |
| 153 | +-d '{ |
| 154 | + "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking", |
| 155 | + "messages": [ |
| 156 | + { |
| 157 | + "role": "user", |
| 158 | + "content": [ |
| 159 | + { |
| 160 | + "type": "image_url", |
| 161 | + "image_url": { |
| 162 | + "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg" |
| 163 | + } |
| 164 | + }, |
| 165 | + { |
| 166 | + "type": "audio_url", |
| 167 | + "audio_url": { |
| 168 | + "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav" |
| 169 | + } |
| 170 | + }, |
| 171 | + { |
| 172 | + "type": "video_url", |
| 173 | + "video_url": { |
| 174 | + "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4" |
| 175 | + } |
| 176 | + |
| 177 | + }, |
| 178 | + { |
| 179 | + "type": "text", |
| 180 | + "text": "Analyze this audio, image, and video together." |
| 181 | + } |
| 182 | + ] |
| 183 | + } |
| 184 | + ] |
| 185 | +}' |
| 186 | +``` |
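
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000`; the helper names `build_payload` and `query` are illustrative and not part of vLLM.

```python
import json
import urllib.request

DEMO_BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"


def build_payload(model, prompt, image=None, audio=None, video=None):
    """Build a chat completion body with optional image, audio, and video URLs."""
    content = []
    # Each media item uses the same "<kind>_url" shape as the curl example.
    for kind, url in (("image", image), ("audio", audio), ("video", video)):
        if url is not None:
            key = f"{kind}_url"
            content.append({"type": key, key: {"url": url}})
    content.append({"type": "text", "text": prompt})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def query(payload, endpoint="http://localhost:8000/v1/chat/completions", timeout=600):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(
    "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "Analyze this audio, image, and video together.",
    image=f"{DEMO_BASE}/cars.jpg",
    audio=f"{DEMO_BASE}/cough.wav",
    video=f"{DEMO_BASE}/draw.mp4",
)
```

With the server running, `query(payload)["choices"][0]["message"]["content"]` holds the assistant's reply text.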
| 187 | + |
| 188 | +If the request succeeds, the server returns a response similar to the following: |
| 189 | +```json |
| 190 | +{"id":"chatcmpl-b74594a81b3a417ebe50a3d05a9aa415","object":"chat.completion","created":1762245267,"model":"/Qwen/Qwen3-Omni-30B-A3B-Thinking","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nGot it, let's try to analyze these different elements together. First, let's list out what we have: the initial image shows four luxury cars (Rolls-Royce, Mercedes GLE, Ferrari Portofino, Porsche 911), then there's an audio of someone coughing, and then a video of someone drawing a guitar on a tablet.\n\nHmm, the task is to analyze them together. Maybe look for connections or themes. Let's break down each part.\n\nFirst, the car images: they're all high-end, luxury vehicles, representing different brands and types (sedan, SUV, convertible, sports car). They emphasize status, design, and performance. Then the audio is a cough—maybe indicating a human element, health, or interruption. The video is about digital art creation, using a tablet and stylus, drawing a guitar.\n\nWait, maybe the connection is about different forms of creativity or expression? Cars are a form of industrial design, art (drawing the guitar is artistic), and the cough might be a human moment in the process. Or maybe it's about different mediums: visual (cars), auditory (cough), and digital art (tablet drawing).\n\nLet's check each component:\n\n1. Car images: Visual, luxury, engineering, design.\n2. Audio: Human sound (cough), which is a non-verbal communication, maybe indicating the person is present, maybe tired or sick, but also part of the environment.\n3. Video: Digital art, creativity, technology (tablet), music (guitar).\n\nPossible themes: The intersection of technology and creativity (cars as tech, tablet as tech), human elements (cough as human, drawing as human activity), or different forms of expression (cars as design, guitar as art).\n\nWait, the user might want to see if there's a narrative or connection. 
Maybe the cough is a moment when someone is creating art, like taking a break or dealing with a minor issue while working. The cars are a separate visual, but maybe the overall theme is about luxury, creativity, and human experience.\n\nAlternatively, maybe it's a test of attention to different media types. Let's think about each:\n\n- Images: Visual art (cars as products of design)\n- Audio: Sound (cough, which is a natural human sound)\n- Video: Dynamic visual (drawing process)\n\nSo the analysis could be about how different media (still images, sound, moving video) convey different aspects of human experience—luxury products, health/physical state, creative process.\n\nLet's structure it:\n\n1. Visual Elements (Car Images): These showcase high-end automotive design, emphasizing brand identity, engineering, and aesthetic appeal. Each car represents a different segment (luxury sedan, premium SUV, exotic convertible, classic sports car), highlighting diversity in luxury transportation.\n\n2. Auditory Element (Cough): This introduces a human, organic element. Coughing is a natural bodily function, indicating presence and physicality, contrasting with the polished, manufactured nature of the cars. It might symbolize the human element behind the creation or consumption of these luxury items.\n\n3. Video (Guitar Drawing): This demonstrates digital creativity and artistic expression. Using a tablet and stylus for illustration shows the intersection of technology and art, with the guitar as a symbol of music and creativity. The act of drawing connects to the design elements seen in the cars (both involve creative design processes).\n\nOverall, the combination explores the relationship between human experience (health, creativity) and the products of human ingenuity (luxury cars, digital art tools). 
The cough bridges the gap between the sterile, high-end visuals of the cars and the tactile, personal act of drawing, emphasizing that even in the realm of luxury and technology, human elements remain central.\n\nWait, but maybe the cough is just a random sound, and the key is to see the three parts as separate but part of a larger context. Alternatively, maybe the user is testing if we can recognize that these are unrelated but need to describe each.\n\nBut the instruction says \"Analyze this audio, image, and video together,\" so find connections.\n\nLet me check again:\n\n- Image: Four luxury cars (Rolls-Royce Ghost, Mercedes GLE, Ferrari Portofino M, Porsche 911)\n- Audio: Coughing (a single person coughing)\n- Video: Drawing a guitar on a tablet\n\nPossible connections:\n\n- All involve human interaction: cars are driven by humans, coughing is a human action, drawing is a human creative act.\n- Technology: Cars use advanced tech, tablet is tech, cough is biological tech (body).\n- Creativity: Car design is creative engineering, drawing is artistic creativity.\n\nSo the analysis could be that these elements collectively represent different facets of human creativity and interaction with technology—whether through automotive design, digital art creation, or the natural human body. The cough adds a layer of organic humanity to the otherwise polished and engineered worlds of luxury cars and digital tools.\n\nYes, that makes sense. Let's put it all together.\n</think>\n\nTo analyze the **image of luxury cars**, **audio of coughing**, and **video of digital guitar drawing** together, we explore their interplay across themes of **human experience**, **technology**, and **creative expression**: \n\n\n### 1. Visual: Luxury Automotive Design (Image) \nThe four cars—Rolls-Royce Ghost, Mercedes GLE SUV, Ferrari Portofino M, and Porsche 911—represent **high-end engineering, brand identity, and aesthetic innovation**. 
Each vehicle embodies a distinct segment of luxury mobility: \n- *Rolls-Royce*: Ultra-luxury sedan (status, craftsmanship). \n- *Mercedes GLE*: Premium SUV (versatility, off-road capability). \n- *Ferrari Portofino*: Exotic convertible (performance, open-top thrill). \n- *Porsche 911*: Iconic sports car (precision, heritage). \n\nThese images highlight **human-driven design**—a blend of artistry, technology, and cultural aspiration. They symbolize the *polished, manufactured world* of consumer goods, where creativity is channeled into engineering and marketing. \n\n\n### 2. Auditory: Human Organic Presence (Audio) \nThe coughing sound introduces a **raw, biological human element**. Unlike the sleek, controlled visuals of the cars, this sound is unfiltered and spontaneous—it speaks to *physicality, vulnerability, and everyday reality*. \n\nThis contrasts with the “perfect” world of luxury products, reminding us that **human experience (health, fatigue, imperfection) underpins even the most refined creations**. The cough bridges the gap between the *manufactured* (cars) and the *organic* (the human body). \n\n\n### 3. Video: Digital Creativity (Drawing a Guitar) \nThe video shows a person using a tablet and stylus to draw a guitar—an act of **artistic creation rooted in technology**. This process mirrors the *design ethos* of the luxury cars: \n- Both involve **iterative creativity**: sketching curves, refining details, and balancing form/function. \n- Both rely on **digital tools**: car designers use CAD software; the artist uses a tablet. \n- The guitar itself symbolizes *music and emotional expression*, adding a layer of **cultural and emotional depth** to the “technology” theme. 
\n\n\n### Unified Analysis: Humanity at the Intersection of Design, Technology, and Expression \nTogether, these elements illustrate how **human creativity and experience shape (and are shaped by) technology and luxury**: \n- **Luxury cars** are products of human ingenuity but exist within a *human context* (e.g., the cough reminds us of the people who drive, design, and consume them). \n- **Digital art** (guitar drawing) reflects how technology democratizes creativity, while the *act of drawing*—like driving a car—requires human skill and emotion. \n- **The cough** acts as a “human anchor,” grounding the polished visuals of luxury and art in the messy, real-world reality of human existence. \n\nIn essence, the combination underscores that **even in realms of high technology and luxury, humanity remains central**: whether through the imperfections of the body, the passion of creation, or the desire to own/express identity through design.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":17625,"total_tokens":19340,"completion_tokens":1715,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null} |
| 191 | + |
| 192 | +``` |
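
In the response above, `reasoning_content` is `null` and the model's reasoning appears inline in `content`, between `<think>` and `</think>`. If only the final answer is needed, the two parts can be separated on the client side. The following is a minimal sketch, assuming the content contains at most one such block; `split_thinking` is a hypothetical helper, not a vLLM API.

```python
def split_thinking(content: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a message that may contain a <think> block."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = content.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole message as the answer.
        return "", content.strip()
    # Drop the opening tag if present; this also handles output that has
    # only the closing tag.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\nGot it, let's analyze.\n</think>\n\nTo analyze the image, audio, and video together..."
reasoning, answer = split_thinking(sample)
print(answer)
```

Applied to the response shown above, `split_thinking(reply["choices"][0]["message"]["content"])` would separate the reasoning trace from the structured analysis that follows it.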