Your current environment
The output of `python collect_env.py`:
INFO 05-07 08:36:01 [importing.py:17] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 05-07 08:36:01 [importing.py:28] Triton is not installed. Using dummy decorators. Install it via `pip install triton` to enable kernel compilation.
INFO 05-07 08:36:01 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-07 08:36:02 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 05-07 08:36:02 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 05-07 08:36:02 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 05-07 08:36:02 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 05-07 08:36:02 [__init__.py:44] plugin ascend loaded.
INFO 05-07 08:36:02 [__init__.py:230] Platform plugin ascend is activated
WARNING 05-07 08:36:04 [_custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False
OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.10.17 (main, Apr 30 2025, 16:00:31) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-136.12.0.88.4.ctl3.aarch64-aarch64-with-glibc2.35
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 4
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
NUMA node2 CPU(s): 96-143
NUMA node3 CPU(s): 144-191
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.4.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.51.3
[conda] Could not collect
vLLM Version: 0.8.5.post1
vLLM Ascend Version: 0.8.5rc1
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
VLLM_USE_V1=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2.1 Version: 24.1.rc2.1 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B3 | OK | 90.9 32 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3368 / 65536 |
+===========================+===============+====================================================+
| 1 910B3 | OK | 89.2 29 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3369 / 65536 |
+===========================+===============+====================================================+
| 2 910B3 | OK | 90.7 30 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3369 / 65536 |
+===========================+===============+====================================================+
| 3 910B3 | OK | 95.4 30 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3369 / 65536 |
+===========================+===============+====================================================+
| 4 910B3 | OK | 90.8 37 0 / 0 |
| 0 | 0000:01:00.0 | 0 0 / 0 3369 / 65536 |
+===========================+===============+====================================================+
| 5 910B3 | OK | 88.4 34 0 / 0 |
| 0 | 0000:02:00.0 | 0 0 / 0 3369 / 65536 |
+===========================+===============+====================================================+
| 6 910B3 | OK | 95.8 35 0 / 0 |
| 0 | 0000:41:00.0 | 0 0 / 0 3365 / 65536 |
+===========================+===============+====================================================+
| 7 910B3 | OK | 92.0 36 0 / 0 |
| 0 | 0000:42:00.0 | 0 0 / 0 3368 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.1.RC1
innerversion=V100R001C21SPC001B238
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.1.RC1/aarch64-linux
🐛 Describe the bug
When I run Qwen3-235B-A22B on 2 nodes with 16 x 910B3 NPUs, the model weights load fine, but the vLLM server crashes as soon as I send a request; the crash log is below. Full details are attached in bug_details.txt.
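Something like the following request is enough to trigger it (a sketch; the port and served model name are assumptions, not my exact command):

```python
# Reproduction sketch: any completion request triggers the crash once the server is up.
# The base URL/port and model name below are assumptions, not copied from my launch command.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={"model": "Qwen3-235B-A22B", "prompt": "hello", "max_tokens": 16},
    timeout=60,
)
# The server answers 500 Internal Server Error and the engine process dies.
print(resp.status_code, resp.text)
```

The crash log follows: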
ERROR 05-07 08:23:58 [core.py:398] EngineCore encountered a fatal error.
ERROR 05-07 08:23:58 [core.py:398] Traceback (most recent call last):
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 389, in run_engine_core
ERROR 05-07 08:23:58 [core.py:398] engine_core.run_busy_loop()
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 413, in run_busy_loop
ERROR 05-07 08:23:58 [core.py:398] self._process_engine_step()
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 438, in _process_engine_step
ERROR 05-07 08:23:58 [core.py:398] outputs = self.step_fn()
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 203, in step
ERROR 05-07 08:23:58 [core.py:398] output = self.model_executor.execute_model(scheduler_output)
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/v1/executor/ray_distributed_executor.py", line 57, in execute_model
ERROR 05-07 08:23:58 [core.py:398] return refs[0].get()
ERROR 05-07 08:23:58 [core.py:398] File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 150, in get
ERROR 05-07 08:23:58 [core.py:398] return _process_return_vals(return_vals, True)
ERROR 05-07 08:23:58 [core.py:398] File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 27, in _process_return_vals
ERROR 05-07 08:23:58 [core.py:398] raise val.as_instanceof_cause()
ERROR 05-07 08:23:58 [core.py:398] ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.__ray_call__() (pid=18591, ip=172.19.0.28)
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/executor/ray_utils.py", line 130, in execute_model_ray
ERROR 05-07 08:23:58 [core.py:398] self.setup_device_if_necessary()
ERROR 05-07 08:23:58 [core.py:398] File "/vllm-workspace/vllm/vllm/executor/ray_utils.py", line 117, in setup_device_if_necessary
ERROR 05-07 08:23:58 [core.py:398] torch.cuda.set_device(self.worker.device)
ERROR 05-07 08:23:58 [core.py:398] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/cuda/__init__.py", line 476, in set_device
ERROR 05-07 08:23:58 [core.py:398] device = _get_device_index(device)
ERROR 05-07 08:23:58 [core.py:398] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
ERROR 05-07 08:23:58 [core.py:398] raise ValueError(f"Expected a cuda device, but got: {device}")
ERROR 05-07 08:23:58 [core.py:398] ValueError: Expected a cuda device, but got: npu:0
INFO 05-07 08:23:58 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
2025-05-07 08:23:58,217 INFO compiled_dag_node.py:2173 -- Tearing down compiled DAG
ERROR 05-07 08:23:58 [async_llm.py:399] AsyncLLM output_handler failed.
ERROR 05-07 08:23:58 [async_llm.py:399] Traceback (most recent call last):
ERROR 05-07 08:23:58 [async_llm.py:399] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 357, in output_handler
ERROR 05-07 08:23:58 [async_llm.py:399] outputs = await engine_core.get_output_async()
ERROR 05-07 08:23:58 [async_llm.py:399] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 716, in get_output_async
ERROR 05-07 08:23:58 [async_llm.py:399] raise self._format_exception(outputs) from None
ERROR 05-07 08:23:58 [async_llm.py:399] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 05-07 08:23:58 [async_llm.py:324] Request cmpl-5a2affcafa984ca3aaf71d064ee59067-0 failed (engine dead).
INFO: 127.0.0.1:57468 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 0d858f0749acd09331c7018001000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, a3570d045e221611145cd3f901000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, baae8c108247476f262c33bb01000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ed5458ab9f87cc254621707201000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, cf4bf04aacbdeb2f50f5e61701000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, ba1c9a30620c9d316ae3941a01000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, fcf4d090a77e9838b4f08d2901000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 7db16050cdec7392f87ed58701000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 4b6fc5501e61fdf10b1cd55401000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 2e296490115d90d8ab17234601000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 0993a140aa7bd9cebdf5f81101000000)
2025-05-07 08:23:58,226 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, a4d6a013dd726f3ac1047ae101000000)
2025-05-07 08:23:58,227 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 89cad0caf7a239f3293ea30501000000)
2025-05-07 08:23:58,227 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, c19224d49013fc24e0de0ee201000000)
2025-05-07 08:23:58,227 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 2585f6f9494e325736ae915301000000)
2025-05-07 08:23:58,227 INFO compiled_dag_node.py:2178 -- Cancelling compiled worker on actor: Actor(RayWorkerWrapper, 3e4e31644b452faf13558dc801000000)
INFO: Shutting down
2025-05-07 08:23:58,267 INFO compiled_dag_node.py:2200 -- Waiting for worker tasks to exit
2025-05-07 08:23:58,269 INFO compiled_dag_node.py:2203 -- Teardown complete
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.10.17/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 389, in run_engine_core
engine_core.run_busy_loop()
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 413, in run_busy_loop
self._process_engine_step()
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 438, in _process_engine_step
outputs = self.step_fn()
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 203, in step
output = self.model_executor.execute_model(scheduler_output)
File "/vllm-workspace/vllm/vllm/v1/executor/ray_distributed_executor.py", line 57, in execute_model
return refs[0].get()
File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 150, in get
return _process_return_vals(return_vals, True)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/experimental/compiled_dag_ref.py", line 27, in _process_return_vals
raise val.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::RayWorkerWrapper.__ray_call__() (pid=18591, ip=172.19.0.28)
File "/vllm-workspace/vllm/vllm/executor/ray_utils.py", line 130, in execute_model_ray
self.setup_device_if_necessary()
File "/vllm-workspace/vllm/vllm/executor/ray_utils.py", line 117, in setup_device_if_necessary
torch.cuda.set_device(self.worker.device)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/cuda/__init__.py", line 476, in set_device
device = _get_device_index(device)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/cuda/_utils.py", line 34, in _get_device_index
raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: npu:0
(raylet) [2025-05-07 08:23:58,299 C 18100 18100] (raylet) experimental_mutable_object_provider.cc:156: Check failed: object_manager_->WriteAcquire(info.local_object_id, total_data_size, nullptr, total_metadata_size, info.num_readers, object_backing_store) Status not OK: ChannelError: Channel closed.
(raylet) *** StackTrace Information ***
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xd18eb8) [0xaaaab3428eb8] ray::operator<<()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xd1b7e8) [0xaaaab342b7e8] ray::RayLog::~RayLog()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x456bb0) [0xaaaab2b66bb0] ray::core::experimental::MutableObjectProvider::HandlePushMutableObject()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x2421d0) [0xaaaab29521d0] ray::raylet::NodeManager::HandlePushMutableObject()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x2a4c60) [0xaaaab29b4c60] ray::rpc::ServerCallImpl<>::HandleRequestImpl()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x6d875c) [0xaaaab2de875c] EventTracker::RecordExecution()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x6d3e90) [0xaaaab2de3e90] std::_Function_handler<>::_M_invoke()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x6d4320) [0xaaaab2de4320] boost::asio::detail::completion_handler<>::do_complete()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcf5630) [0xaaaab3405630] boost::asio::detail::scheduler::do_run_one()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcf78c4) [0xaaaab34078c4] boost::asio::detail::scheduler::run()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0xcf7ec8) [0xaaaab3407ec8] boost::asio::io_context::run()
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x1ace44) [0xaaaab28bce44] main
(raylet) /lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0xffff9e4d73fc]
(raylet) /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0xffff9e4d74cc] __libc_start_main
(raylet) /usr/local/python3.10.17/lib/python3.10/site-packages/ray/core/src/ray/raylet/raylet(+0x1ffbdc) [0xaaaab290fbdc]
(raylet)
(RayWorkerWrapper pid=4635, ip=172.19.0.29) [rank9]:[W507 08:22:16.546296931 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator()) [repeated 15x across cluster]
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [18336]
*** SIGTERM received at time=1746606238 on cpu 141 ***
PC: @ 0xffffb456ea9c (unknown) select
@ 0xfffde7359698 464 absl::lts_20230802::AbslFailureSignalHandler()
@ 0xffffb4a438ec 606883984 (unknown)
@ 0xffffb489d194 128 time_sleep
@ 0xffffb4749d1c 112 cfunction_vectorcall_O
@ 0xffffb46ae64c 48 _PyEval_EvalFrameDefault
@ 0xffffb47edf34 448 _PyEval_Vector
@ 0xffffb46a9f58 48 _PyEval_EvalFrameDefault
@ 0xffffb47edf34 448 _PyEval_Vector
@ 0xffffb489870c 48 atexit_callfuncs
@ 0xffffb482dc2c 64 Py_FinalizeEx
@ 0xffffb482ea54 80 Py_Exit
@ 0xffffb4833418 32 _PyErr_PrintEx
@ 0xffffb483409c 144 PyRun_SimpleStringFlags
@ 0xffffb485333c 32 Py_RunMain
@ 0xffffb4853d4c 224 Py_BytesMain
@ 0xffffb44b73fc 192 (unknown)
@ 0xffffb44b74cc 272 __libc_start_main
[2025-05-07 08:23:58,367 E 18484 18484] logging.cc:496: *** SIGTERM received at time=1746606238 on cpu 141 ***
[2025-05-07 08:23:58,367 E 18484 18484] logging.cc:496: PC: @ 0xffffb456ea9c (unknown) select
[2025-05-07 08:23:58,372 E 18484 18484] logging.cc:496: @ 0xfffde73596c0 464 absl::lts_20230802::AbslFailureSignalHandler()
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb4a438ec 606883984 (unknown)
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb489d194 128 time_sleep
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb4749d1c 112 cfunction_vectorcall_O
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb46ae64c 48 _PyEval_EvalFrameDefault
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb47edf34 448 _PyEval_Vector
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb46a9f58 48 _PyEval_EvalFrameDefault
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb47edf34 448 _PyEval_Vector
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb489870c 48 atexit_callfuncs
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb482dc2c 64 Py_FinalizeEx
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb482ea54 80 Py_Exit
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb4833418 32 _PyErr_PrintEx
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb483409c 144 PyRun_SimpleStringFlags
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb485333c 32 Py_RunMain
[2025-05-07 08:23:58,375 E 18484 18484] logging.cc:496: @ 0xffffb4853d4c 224 Py_BytesMain
[2025-05-07 08:23:58,377 E 18484 18484] logging.cc:496: @ 0xffffb44b73fc 192 (unknown)
[2025-05-07 08:23:58,377 E 18484 18484] logging.cc:496: @ 0xffffb44b74cc 272 __libc_start_main
Exception ignored in atexit callback: <function shutdown at 0xfffde57845e0>
Traceback (most recent call last):
File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/_private/worker.py", line 1957, in shutdown
time.sleep(0.5)
File "/usr/local/python3.10.17/lib/python3.10/site-packages/ray/_private/worker.py", line 1539, in sigterm_handler
sys.exit(signum)
SystemExit: 15
BestKuan changed the title from "[Bug]: Qwen3-235B cannot be run successfully" to "[Bug]: Qwen3-235B cannot be run successfully with vllm v1 engine on version 0.8.5rc1" on May 7, 2025.
From the error message, it looks like you're using Ray's Compiled Graph (DAG) feature, which currently only supports CUDA. You can try disabling the DAG feature and rerunning your tests.
Meanwhile, we're actively working on enabling DAG support on NPUs in Ray. Thank you for your patience!
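For context, the incompatibility is visible in the traceback: the Ray executor path calls `torch.cuda.set_device()` with an `npu:0` device. Below is a minimal, device-agnostic sketch of what that call would need to look like; the helper name is hypothetical and this is only an illustration of the gap, not the actual fix:

```python
# Device-agnostic sketch of the call that fails in ray_utils.setup_device_if_necessary
# (the traceback shows torch.cuda.set_device() being handed "npu:0").
# set_worker_device() is a hypothetical helper used here for illustration only.
import torch


def set_worker_device(device: torch.device) -> None:
    if device.type == "cuda":
        torch.cuda.set_device(device)
    elif device.type == "npu":
        # torch.npu is provided by the torch-npu adapter listed in the environment above.
        import torch_npu  # noqa: F401
        torch.npu.set_device(device)
    else:
        raise ValueError(f"Unsupported device type: {device.type}")
```

Until NPU support for Compiled Graph lands, keeping that code path off NPU hardware is the practical workaround; for example, the environment above shows VLLM_USE_V1=1, and falling back to VLLM_USE_V1=0 may be worth trying.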