[Bug]: vllm can't run deepseek 70b with Huawei ascend 910b npu card #800

huazq · 2025-05-09T09:07:53Z

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

[root@cetccloud-custom-vllm-predictor-ffb4b55bc-gvmv5 code]# vllm serve /mnt/models --served-model-name "deepseek-70B" --host 0.0.0.0 --port 80 --dtype bfloat16 -tp 4 --gpu-memory-utilization 0.9
INFO 05-09 08:05:50 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-09 08:05:50 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernelcompilation.
INFO 05-09 08:05:50 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 08:05:52 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 08:05:52 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 08:05:52 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 08:05:52 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:05:52 [init.py:44] plugin ascend loaded.
INFO 05-09 08:05:52 [init.py:230] Platform plugin ascend is activated
WARNING 05-09 08:05:53 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 05-09 08:05:56 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 05-09 08:05:56 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 05-09 08:05:56 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-09 08:05:56 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:05:56 [init.py:44] plugin ascend_enhanced_model loaded.
WARNING 05-09 08:05:56 [registry.py:389] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 05-09 08:05:56 [registry.py:389] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 05-09 08:05:56 [registry.py:389] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 05-09 08:05:56 [registry.py:389] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 05-09 08:05:56 [registry.py:389] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 05-09 08:06:00 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-09 08:06:00 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='/mnt/models', config='', host='0.0.0.0', port=80, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/models', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=['deepseek-70B'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0xfffdb8fab010>)
INFO 05-09 08:06:11 [config.py:717] This model supports multiple tasks: {'classify', 'score', 'embed', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 05-09 08:06:11 [arg_utils.py:1669] npu is experimental on VLLM_USE_V1=1. Falling back to V0 Engine.
WARNING 05-09 08:06:11 [arg_utils.py:1536] The model has a long context length (131072). This may causeOOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 05-09 08:06:11 [config.py:1770] Defaulting to use mp for distributed inference
INFO 05-09 08:06:11 [config.py:1804] Disabled the custom all-reduce kernel because it is not supported on current platform.
WARNING 05-09 08:06:11 [platform.py:125] NPU compilation support pending. Will be available in future CANN and torch_npu releases. NPU graph mode is currently experimental and disabled by default. You can just adopt additional_config={'enable_graph_mode': True} to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine.
INFO 05-09 08:06:11 [platform.py:133] Compilation disabled, using eager mode by default
INFO 05-09 08:06:11 [api_server.py:246] Started engine process with PID 207
INFO 05-09 08:06:16 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-09 08:06:16 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernelcompilation.
INFO 05-09 08:06:16 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 08:06:17 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 08:06:17 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 08:06:17 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 08:06:17 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:06:17 [init.py:44] plugin ascend loaded.
INFO 05-09 08:06:17 [init.py:230] Platform plugin ascend is activated
WARNING 05-09 08:06:19 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 05-09 08:06:21 [init.py:30] Available plugins for group vllm.general_plugins:
INFO 05-09 08:06:21 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 05-09 08:06:21 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-09 08:06:21 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:06:21 [init.py:44] plugin ascend_enhanced_model loaded.
WARNING 05-09 08:06:21 [registry.py:389] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 05-09 08:06:21 [registry.py:389] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 05-09 08:06:21 [registry.py:389] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 05-09 08:06:21 [registry.py:389] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 05-09 08:06:21 [registry.py:389] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
INFO 05-09 08:06:21 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=deepseek-70B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 05-09 08:06:22 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
WARNING 05-09 08:06:23 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffdc3814bb0>
INFO 05-09 08:06:27 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-09 08:06:27 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-09 08:06:27 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-09 08:06:27 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernelcompilation.
WARNING 05-09 08:06:27 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernelcompilation.
WARNING 05-09 08:06:27 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via pip install triton to enable kernelcompilation.
INFO 05-09 08:06:27 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 08:06:27 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 08:06:27 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-09 08:06:28 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 08:06:28 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 08:06:28 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 08:06:28 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:06:28 [init.py:44] plugin ascend loaded.
INFO 05-09 08:06:28 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 08:06:28 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 08:06:28 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 08:06:28 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:06:28 [init.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-09 08:06:28 [init.py:32] name=ascend, value=vllm_ascend:register
INFO 05-09 08:06:28 [init.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-09 08:06:28 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-09 08:06:28 [init.py:44] plugin ascend loaded.
INFO 05-09 08:06:28 [init.py:44] plugin ascend loaded.
INFO 05-09 08:06:28 [init.py:230] Platform plugin ascend is activated
INFO 05-09 08:06:28 [init.py:230] Platform plugin ascend is activated
INFO 05-09 08:06:28 [init.py:230] Platform plugin ascend is activated
WARNING 05-09 08:06:30 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 05-09 08:06:30 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 05-09 08:06:30 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [init.py:30] Available plugins for group vllm.general_plugins:
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
(VllmWorkerProcess pid=343) INFO 05-09 08:06:32 [init.py:44] plugin ascend_enhanced_model loaded.
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [init.py:30] Available plugins for group vllm.general_plugins:
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
(VllmWorkerProcess pid=345) INFO 05-09 08:06:32 [init.py:44] plugin ascend_enhanced_model loaded.
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [init.py:30] Available plugins for group vllm.general_plugins:
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [init.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [init.py:34] all available plugins for group vllm.general_plugins will be loaded.
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [init.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
(VllmWorkerProcess pid=344) INFO 05-09 08:06:32 [init.py:44] plugin ascend_enhanced_model loaded.
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:32 [registry.py:389] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:32 [registry.py:389] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(VllmWorkerProcess pid=345) WARNING 05-09 08:06:33 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xffff91292cb0>
(VllmWorkerProcess pid=343) WARNING 05-09 08:06:33 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xffff7ac62cb0>
(VllmWorkerProcess pid=344) WARNING 05-09 08:06:33 [utils.py:2522] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xffff80102cb0>
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method init_device.
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/utils.py", line 2456, in run_method
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] self.worker.init_device() # type: ignore
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 211, in init_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] NPUPlatform.set_device(self.device)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 95, in set_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] torch.npu.set_device(device)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 83, in set_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] torch_npu._C._npu_setDevice(device_id)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] RuntimeError: GetDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:60 NPU function error: aclrtGetCurrentContext(&used_devices[local_device]), error code is 107002
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method init_device.
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [ERROR] 2025-05-09-08:06:34 (PID:345, Device:0, RankID:-1) ERR00100 PTA call acl api failed
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [Error]: The context is empty.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Check whether acl.rt.set_context or acl.rt.set_device is called.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] EE1001: [PID: 345] 2025-05-09-08:06:34.182.045 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/utils.py", line 2456, in run_method
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] TraceBack (most recent call last):
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5925]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] self.worker.init_device() # type: ignore
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 211, in init_device
(VllmWorkerProcess pid=345) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] NPUPlatform.set_device(self.device)
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 95, in set_device
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] torch.npu.set_device(device)
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 83, in set_device
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] torch_npu._C._npu_setDevice(device_id)
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] RuntimeError: GetDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:60 NPU function error: aclrtGetCurrentContext(&used_devices[local_device]), error code is 107002
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [ERROR] 2025-05-09-08:06:34 (PID:343, Device:0, RankID:-1) ERR00100 PTA call acl api failed
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [Error]: The context is empty.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Check whether acl.rt.set_context or acl.rt.set_device is called.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] EE1001: [PID: 343] 2025-05-09-08:06:34.182.218 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] TraceBack (most recent call last):
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5925]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238] The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
(VllmWorkerProcess pid=343) ERROR 05-09 08:06:34 [multiproc_worker_utils.py:238]
ERROR 05-09 08:06:34 [engine.py:448] GetDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:60 NPU function error: aclrtGetCurrentContext(&used_devices[local_device]), error code is 107002
ERROR 05-09 08:06:34 [engine.py:448] [ERROR] 2025-05-09-08:06:34 (PID:207, Device:0, RankID:-1) ERR00100 PTA call acl api failed
ERROR 05-09 08:06:34 [engine.py:448] [Error]: The context is empty.
ERROR 05-09 08:06:34 [engine.py:448] Check whether acl.rt.set_context or acl.rt.set_device is called.
ERROR 05-09 08:06:34 [engine.py:448] EE1001: [PID: 207] 2025-05-09-08:06:34.181.090 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
ERROR 05-09 08:06:34 [engine.py:448] Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
ERROR 05-09 08:06:34 [engine.py:448] TraceBack (most recent call last):
ERROR 05-09 08:06:34 [engine.py:448] [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
ERROR 05-09 08:06:34 [engine.py:448] ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5925]
ERROR 05-09 08:06:34 [engine.py:448] The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
ERROR 05-09 08:06:34 [engine.py:448] Traceback (most recent call last):
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 05-09 08:06:34 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 05-09 08:06:34 [engine.py:448] return cls(
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 82, in init
ERROR 05-09 08:06:34 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 275, in init
ERROR 05-09 08:06:34 [engine.py:448] self.model_executor = executor_class(vllm_config=vllm_config)
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/executor/executor_base.py", line 286, in init
ERROR 05-09 08:06:34 [engine.py:448] super().init(*args, **kwargs)
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/executor/executor_base.py", line 52, in init
ERROR 05-09 08:06:34 [engine.py:448] self._init_executor()
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/executor/mp_distributed_executor.py", line 124, in _init_executor
ERROR 05-09 08:06:34 [engine.py:448] self._run_workers("init_device")
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 05-09 08:06:34 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/utils.py", line 2456, in run_method
ERROR 05-09 08:06:34 [engine.py:448] return func(*args, **kwargs)
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
ERROR 05-09 08:06:34 [engine.py:448] self.worker.init_device() # type: ignore
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 211, in init_device
ERROR 05-09 08:06:34 [engine.py:448] NPUPlatform.set_device(self.device)
ERROR 05-09 08:06:34 [engine.py:448] File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 95, in set_device
ERROR 05-09 08:06:34 [engine.py:448] torch.npu.set_device(device)
ERROR 05-09 08:06:34 [engine.py:448] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 83, in set_device
ERROR 05-09 08:06:34 [engine.py:448] torch_npu._C._npu_setDevice(device_id)
ERROR 05-09 08:06:34 [engine.py:448] RuntimeError: GetDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:60 NPU function error: aclrtGetCurrentContext(&used_devices[local_device]), error code is 107002
ERROR 05-09 08:06:34 [engine.py:448] [ERROR] 2025-05-09-08:06:34 (PID:207, Device:0, RankID:-1) ERR00100 PTA call acl api failed
ERROR 05-09 08:06:34 [engine.py:448] [Error]: The context is empty.
ERROR 05-09 08:06:34 [engine.py:448] Check whether acl.rt.set_context or acl.rt.set_device is called.
ERROR 05-09 08:06:34 [engine.py:448] EE1001: [PID: 207] 2025-05-09-08:06:34.181.090 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
ERROR 05-09 08:06:34 [engine.py:448] Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
ERROR 05-09 08:06:34 [engine.py:448] TraceBack (most recent call last):
ERROR 05-09 08:06:34 [engine.py:448] [Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
ERROR 05-09 08:06:34 [engine.py:448] ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5925]
ERROR 05-09 08:06:34 [engine.py:448] The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
ERROR 05-09 08:06:34 [engine.py:448]
INFO 05-09 08:06:34 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
raise e
File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
File "/vllm-workspace/vllm/vllm/engine/multiprocessing/engine.py", line 82, in init
self.engine = LLMEngine(*args, **kwargs)
File "/vllm-workspace/vllm/vllm/engine/llm_engine.py", line 275, in init
self.model_executor = executor_class(vllm_config=vllm_config)
File "/vllm-workspace/vllm/vllm/executor/executor_base.py", line 286, in init
super().init(*args, **kwargs)
File "/vllm-workspace/vllm/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/vllm-workspace/vllm/vllm/executor/mp_distributed_executor.py", line 124, in _init_executor
self._run_workers("init_device")
File "/vllm-workspace/vllm/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
File "/vllm-workspace/vllm/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
File "/vllm-workspace/vllm/vllm/worker/worker_base.py", line 604, in init_device
self.worker.init_device() # type: ignore
File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 211, in init_device
NPUPlatform.set_device(self.device)
File "/vllm-workspace/vllm-ascend/vllm_ascend/platform.py", line 95, in set_device
torch.npu.set_device(device)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/npu/utils.py", line 83, in set_device
torch_npu._C._npu_setDevice(device_id)
RuntimeError: GetDevice:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:60 NPU function error: aclrtGetCurrentContext(&used_devices[local_device]), error code is 107002
[ERROR] 2025-05-09-08:06:34 (PID:207, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[Error]: The context is empty.
Check whether acl.rt.set_context or acl.rt.set_device is called.
EE1001: [PID: 207] 2025-05-09-08:06:34.181.090 The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]
Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
TraceBack (most recent call last):
[Init][Version]init soc version failed, ret = 507008[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:5925]
The argument is invalid.Reason: rtGetDevMsg execute failed, reason=[context pointer null]

Traceback (most recent call last):
File "/usr/local/python3.10.17/bin/vllm", line 8, in
sys.exit(main())
File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/python3.10.17/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/local/python3.10.17/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

The text was updated successfully, but these errors were encountered:

Yikun · 2025-05-14T01:37:15Z

[Init][Version]init soc version failed, ret = 507008

This might the key message, pls check your env, make sure the devices are OK.

[1] https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu.html

huazq added the bug Something isn't working label May 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: vllm can't run deepseek 70b with Huawei ascend 910b npu card #800

[Bug]: vllm can't run deepseek 70b with Huawei ascend 910b npu card #800

huazq commented May 9, 2025

Yikun commented May 14, 2025

[Bug]: vllm can't run deepseek 70b with Huawei ascend 910b npu card #800

[Bug]: vllm can't run deepseek 70b with Huawei ascend 910b npu card #800

Comments

huazq commented May 9, 2025

Your current environment

🐛 Describe the bug

Yikun commented May 14, 2025