Description
Summary
whisper.cpp crashes with VK_ERROR_DEVICE_LOST on an AMD RX 5500 XT (RDNA1/Navi14) when using the Vulkan backend. The crash occurs during KV cache initialization, before any inference runs.
Environment
- GPU: AMD Radeon RX 5500 XT (gfx1012, NAVI14, 8GB VRAM)
- Driver: RADV (Mesa 24.2.8-1ubuntu1~24.04.1)
- Vulkan: 1.3.289
- OS: Linux Mint 22.1 (Ubuntu 24.04 based), kernel 6.8.0
- whisper.cpp: Latest master (commit f53dc74)
Steps to Reproduce
git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build -j
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav
Expected Behavior
Whisper transcribes the audio using the Vulkan GPU backend.
Actual Behavior
Crash with error:
radv/amdgpu: The CS has been rejected, see dmesg for more information (-22).
terminate called after throwing an instance of 'vk::DeviceLostError'
what(): vk::Queue::submit: ErrorDeviceLost
Stack Trace
#6 ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence)
#7 ggml_vk_buffer_memset(std::shared_ptr<vk_buffer_struct>&, unsigned long, unsigned int, unsigned long)
#8 whisper_kv_cache_init(whisper_kv_cache&, ggml_backend*, ggml_type, long, long, int)
#9 whisper_init_state()
#10 whisper_init_from_file_with_params()
Root Cause Analysis
The crash occurs in ggml_vk_buffer_memset() at ggml-vulkan.cpp:6588:
subctx->s->buffer.fillBuffer(dst->buffer, offset, size, c);
This function uses the transfer queue (dst->device->transfer_queue.cmd_pool) to execute vkCmdFillBuffer. On RDNA1, the transfer queue (SDMA engine) appears to reject this command with EINVAL (-22).
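The Vulkan specification constrains vkCmdFillBuffer's parameters: dstOffset must lie inside the buffer and be a multiple of 4, and size must be either VK_WHOLE_SIZE or a nonzero multiple of 4 that fits in the remaining buffer. Since -22 hints at invalid parameters, a minimal sketch of a debug check that could be called right before the fillBuffer line is shown here. The names fill_params_valid, debug_check_fill and kWholeSize are hypothetical, and the check can only rule parameter errors in or out; it does not establish that they cause this crash.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant mirroring VK_WHOLE_SIZE (~0ULL) so the sketch
// builds without Vulkan headers.
constexpr uint64_t kWholeSize = ~0ULL;

// Hypothetical helper: true if (offset, size) satisfy the vkCmdFillBuffer
// valid-usage rules for a destination buffer of buffer_size bytes.
bool fill_params_valid(uint64_t buffer_size, uint64_t offset, uint64_t size) {
    if (offset >= buffer_size) return false;        // dstOffset must be inside the buffer
    if (offset % 4 != 0) return false;              // dstOffset must be a multiple of 4
    if (size == kWholeSize) return true;            // fill to the end of the buffer
    if (size == 0) return false;                    // explicit size must be nonzero
    if (size > buffer_size - offset) return false;  // must not run past the buffer end
    return size % 4 == 0;                           // explicit size must be a multiple of 4
}

// Hypothetical debug hook: fail loudly in debug builds instead of letting
// the kernel reject the command stream at submit time.
inline void debug_check_fill(uint64_t buffer_size, uint64_t offset, uint64_t size) {
    assert(fill_params_valid(buffer_size, offset, size) && "invalid vkCmdFillBuffer parameters");
}
```

For example, fill_params_valid(1024, 2, 512) is false because the offset is not 4-byte aligned, while fill_params_valid(1024, 512, kWholeSize) is true.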
Key observations:
- The GPU is correctly detected as RDNA1:
ggml_vulkan: 0 = Radeon RX 5500 XT (RADV NAVI14)
- The crash happens before any compute shaders run
- Error code -22 (EINVAL) suggests invalid parameters or an unsupported operation
- CPU mode works fine with --no-gpu
Attempted Workarounds (all failed)
- --no-flash-attn - still crashes
- GGML_VK_DISABLE_ASYNC=1 - still crashes
- GGML_VK_PREFER_HOST_MEMORY=1 - still crashes (the GPU is not UMA, so the GPU path is still used)
- VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation - still crashes
- RADV_PERFTEST=transfer_queue=0 - still crashes
Potential Fix
The issue may be that fillBuffer on the dedicated transfer queue doesn't work correctly on RDNA1. Possible solutions:
- Use the compute queue for fillBuffer on RDNA1:
Modify ggml_vk_buffer_memset() to use compute_queue instead of transfer_queue when device->architecture == vk_device_architecture::AMD_RDNA1.
- Add a CPU fallback for non-UMA discrete GPUs:
The current code only uses a CPU memset when eHostVisible && uma. For discrete GPUs, it could allocate a staging buffer, memset it on the CPU, and copy it to the device.
- Use vkCmdUpdateBuffer instead:
For small buffers, vkCmdUpdateBuffer might work better on the transfer queue. Each call is limited to 65536 bytes, so larger regions would have to be split into multiple commands.
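A minimal sketch of the decision logic behind the first and third options, assuming it can be isolated from the actual Vulkan calls: queue_for_fill routes fills to the compute queue only on RDNA1, and plan_update_chunks splits a memset into pieces that respect vkCmdUpdateBuffer's limits (dataSize at most 65536 bytes and a multiple of 4, per the Vulkan specification). The Arch and QueueKind enums, the Chunk struct and both function names are hypothetical stand-ins for ggml-vulkan's vk_device_architecture, compute_queue and transfer_queue; none of this has been tested against the real code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real code checks device->architecture and
// picks between device->compute_queue and device->transfer_queue.
enum class Arch { OTHER, AMD_RDNA1, AMD_RDNA2 };
enum class QueueKind { Transfer, Compute };

// Option 1: avoid the transfer queue for fills on RDNA1 only.
QueueKind queue_for_fill(Arch arch) {
    return arch == Arch::AMD_RDNA1 ? QueueKind::Compute : QueueKind::Transfer;
}

// Option 3: vkCmdUpdateBuffer accepts at most 65536 bytes per call.
constexpr uint64_t kMaxUpdateBytes = 65536;

struct Chunk {
    uint64_t offset;  // destination offset of this piece
    uint64_t size;    // bytes written by one vkCmdUpdateBuffer call
};

// Splits [offset, offset + size) into vkCmdUpdateBuffer-sized pieces.
// Requires offset and size to be multiples of 4, as the spec does.
std::vector<Chunk> plan_update_chunks(uint64_t offset, uint64_t size) {
    assert(offset % 4 == 0 && size % 4 == 0);
    std::vector<Chunk> chunks;
    while (size > 0) {
        uint64_t n = size < kMaxUpdateBytes ? size : kMaxUpdateBytes;
        chunks.push_back({offset, n});
        offset += n;
        size -= n;
    }
    return chunks;
}
```

In ggml_vk_buffer_memset(), each chunk would become one vkCmdUpdateBuffer call reading from a single 64 KiB host array prefilled with the 32-bit value c. The queue switch is the smaller change; chunked updates keep the transfer queue but issue many small commands for a large KV cache.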
Related Issues
- Similar RADV crashes on RDNA1: https://github.com/felipeagc/amd_repro
- whisper.cpp AMD issues: How to use with a AMD GPU? #2828
- Vulkan broken for AMD in v1.8.0+: Vulkan support broken at v1.8.0 #3455
System Info
$ vulkaninfo --summary | grep -A5 "GPU0"
GPU0:
apiVersion = 1.3.289
driverVersion = 24.2.8
vendorID = 0x1002
deviceID = 0x7340
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = Radeon RX 5500 XT (RADV NAVI14)
Note
- This appears to be specific to the SDMA/transfer queue on RDNA1