I am trying to run inference with the Qwen2-1.5B model on a Huawei 910B card, and the startup process hangs.
What are some methods to further diagnose the issue? Thanks.
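One narrowing step that seems worth trying is to load the same model offline with graph capture disabled, to check whether the hang happens during graph capture rather than while loading weights. This is only a sketch, reusing the model path and dtype from the server command line, not something from the original run:

```python
# Hypothetical narrowing test (not part of the original report): load the same
# model offline with enforce_eager=True, which skips graph capture, and run a
# tiny generation. If this completes, the hang likely occurs during capture.
from vllm import LLM, SamplingParams

llm = LLM(model="/tmp/qwen2-1.5b", dtype="half", enforce_eager=True)
outputs = llm.generate(["hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```

The full startup log from the hanging run is below.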
INFO 05-12 10:00:10 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-12 10:00:10 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-12 10:00:10 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-12 10:00:10 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-12 10:00:10 __init__.py:44] plugin ascend loaded.
INFO 05-12 10:00:10 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 05-12 10:00:10 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 05-12 10:00:10 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 05-12 10:00:10 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 05-12 10:00:10 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-12 10:00:10 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 05-12 10:00:10 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 05-12 10:00:10 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 05-12 10:00:10 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 05-12 10:00:10 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 05-12 10:00:10 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 05-12 10:00:10 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 05-12 10:00:10 api_server.py:912] vLLM API server version 0.7.3
INFO 05-12 10:00:10 api_server.py:913] args: Namespace(host=None, port=8080, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/tmp/qwen2-1.5b', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['qwen2_1.5b'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, 
enable_prompt_tokens_details=False)
INFO 05-12 10:00:10 api_server.py:209] Started engine process with PID 2757
WARNING 05-12 10:00:10 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 05-12 10:00:19 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-12 10:00:19 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-12 10:00:19 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-12 10:00:19 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-12 10:00:19 __init__.py:44] plugin ascend loaded.
INFO 05-12 10:00:19 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 05-12 10:00:19 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 05-12 10:00:19 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 05-12 10:00:19 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 05-12 10:00:19 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 05-12 10:00:19 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 05-12 10:00:19 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 05-12 10:00:19 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 05-12 10:00:19 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-12 10:00:19 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 05-12 10:00:21 config.py:549] This model supports multiple tasks: {'embed', 'score', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 05-12 10:00:30 config.py:549] This model supports multiple tasks: {'reward', 'score', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 05-12 10:00:31 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/tmp/qwen2-1.5b', speculative_config=None, tokenizer='/tmp/qwen2-1.5b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=qwen2_1.5b, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 05-12 10:00:31 logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 05-12 10:00:31 logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-22a34/VLLM_TRACE_FUNCTION_for_process_2757_thread_281473419130944_at_2025-05-12_10:00:31.815092.log
ERROR 05-12 10:00:31 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 05-12 10:00:32 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd306f1060>
INFO 05-12 10:00:48 model_runner.py:902] Starting to load model /tmp/qwen2-1.5b...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.73s/it]
INFO 05-12 10:00:54 model_runner.py:907] Loading model weights took 2.8866 GB
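Since VLLM_TRACE_FUNCTION is enabled for this run (see the logger warning above), the trace file it names should record the last Python function the engine process entered before it stopped making progress. A rough sketch for inspecting its tail; the path is the one printed in the log above and will differ on other runs:

```python
# Sketch: print the last lines of the VLLM_TRACE_FUNCTION log named in the
# startup output, to see where the engine process was when it stalled.
from collections import deque

trace_path = ("/tmp/root/vllm/vllm-instance-22a34/"
              "VLLM_TRACE_FUNCTION_for_process_2757_thread_281473419130944_at_2025-05-12_10:00:31.815092.log")
with open(trace_path, errors="replace") as f:
    for line in deque(f, maxlen=40):  # keep only the last 40 lines
        print(line.rstrip())
```

If the tail ends inside graph capture or memory profiling code, that would point to where the hang occurs.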
On May 12, 2025, fyuan1316 changed the title from "[Bug]: Qwen2-1.5B Inference Startup Hang on Huawei 910B Card" to "[Bug]: Qwen2-1.5B Inference Startup Hang on Huawei 910B Card under vNPU".
Your current environment
The output of `python collect_env.py`