Vulkan crash on AMD RDNA1 (RX 5500 XT) during buffer initialization #3611

@that-guy-rob

Summary

whisper.cpp crashes with VK_ERROR_DEVICE_LOST on an AMD RX 5500 XT (RDNA1/Navi14) when using the Vulkan backend. The crash occurs during KV cache initialization, before any inference runs.

Environment

  • GPU: AMD Radeon RX 5500 XT (gfx1012, NAVI14, 8GB VRAM)
  • Driver: RADV (Mesa 24.2.8-1ubuntu1~24.04.1)
  • Vulkan: 1.3.289
  • OS: Linux Mint 22.1 (Ubuntu 24.04 based), kernel 6.8.0
  • whisper.cpp: Latest master (commit f53dc74)

Steps to Reproduce

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build -j
./models/download-ggml-model.sh base.en
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav

Expected Behavior

Whisper transcribes the audio using the Vulkan GPU backend.

Actual Behavior

Crash with error:

radv/amdgpu: The CS has been rejected, see dmesg for more information (-22).
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Queue::submit: ErrorDeviceLost

Stack Trace

#6  ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence)
#7  ggml_vk_buffer_memset(std::shared_ptr<vk_buffer_struct>&, unsigned long, unsigned int, unsigned long)
#8  whisper_kv_cache_init(whisper_kv_cache&, ggml_backend*, ggml_type, long, long, int)
#9  whisper_init_state()
#10 whisper_init_from_file_with_params()

Root Cause Analysis

The crash occurs in ggml_vk_buffer_memset() at ggml-vulkan.cpp:6588:

subctx->s->buffer.fillBuffer(dst->buffer, offset, size, c);

This function uses the transfer queue (dst->device->transfer_queue.cmd_pool) to execute vkCmdFillBuffer. On RDNA1, the transfer queue (SDMA engine) appears to reject this command with EINVAL (-22).
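
Since EINVAL can also point at bad arguments, it may be worth ruling out the offset and size passed to fillBuffer before blaming the queue. The sketch below encodes the vkCmdFillBuffer valid-usage rules for dstOffset and size as I understand them from the Vulkan specification. `fill_args_valid` and `WHOLE_SIZE` are hypothetical names for illustration and are not part of ggml; nothing in this report shows that the arguments are actually invalid.

```cpp
#include <cstdint>

// Hypothetical helper, not part of ggml-vulkan.cpp.
// Mirrors the value of VK_WHOLE_SIZE without needing the Vulkan headers.
constexpr std::uint64_t WHOLE_SIZE = ~0ULL;

// Returns true if (offset, size) satisfy the vkCmdFillBuffer rules
// for a destination buffer of buffer_size bytes.
bool fill_args_valid(std::uint64_t buffer_size, std::uint64_t offset, std::uint64_t size) {
    if (offset >= buffer_size) return false;      // dstOffset must lie inside the buffer
    if (offset % 4 != 0) return false;            // dstOffset must be a multiple of 4
    if (size == WHOLE_SIZE) return true;          // fill from offset to the end of the buffer
    if (size == 0 || size % 4 != 0) return false; // size must be non-zero and a multiple of 4
    return size <= buffer_size - offset;          // the filled range must fit in the buffer
}
```

Calling `fill_args_valid(1024, 2, 64)` returns false because the offset is not 4-byte aligned, while `fill_args_valid(1024, 512, WHOLE_SIZE)` returns true. If the values logged at the crash site pass this check, the arguments are unlikely to be the cause.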

Key observations:

  1. The GPU is correctly detected as RDNA1: ggml_vulkan: 0 = Radeon RX 5500 XT (RADV NAVI14)
  2. The crash happens before any compute shaders run
  3. Error code -22 (EINVAL) suggests invalid parameters or unsupported operation
  4. CPU mode works fine with --no-gpu
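
For reference, the mapping from -22 to EINVAL in observation 3 can be checked with the standard errno facilities. This is a minimal sketch that assumes Linux errno numbering; `describe_kernel_error` is a hypothetical helper name, not something whisper.cpp or RADV provides.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: converts a negative kernel-style return code
// (such as the -22 printed by radv/amdgpu) into the libc error message.
std::string describe_kernel_error(int code) {
    const int err = code < 0 ? -code : code; // kernel codes are negated errno values
    return std::string(std::strerror(err));
}
```

On Linux, `describe_kernel_error(-22)` yields the same text as `std::strerror(EINVAL)`. The message only says that the kernel rejected the submission as invalid; the dmesg output mentioned in the error is needed to see which check failed.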

Attempted Workarounds (all failed)

  • --no-flash-attn - still crashes
  • GGML_VK_DISABLE_ASYNC=1 - still crashes
  • GGML_VK_PREFER_HOST_MEMORY=1 - still crashes (not UMA, so still uses GPU path)
  • VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation - still crashes
  • RADV_PERFTEST=transfer_queue=0 - still crashes

Potential Fix

The issue may be that fillBuffer on the dedicated transfer queue doesn't work correctly on RDNA1. Possible solutions:

  1. Use compute queue for fillBuffer on RDNA1:
    Modify ggml_vk_buffer_memset() to use compute_queue instead of transfer_queue when device->architecture == vk_device_architecture::AMD_RDNA1

  2. Add CPU fallback for non-UMA discrete GPUs:
    The current code only uses CPU memset when the buffer is eHostVisible and the device is UMA. For discrete GPUs, it could allocate a host-visible staging buffer, memset it on the CPU, and copy it to the device buffer.

  3. Use vkCmdUpdateBuffer instead:
    For small buffers, vkCmdUpdateBuffer might work better on the transfer queue.
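
Option 1 can be sketched as a small queue-selection function. The types below are simplified, hypothetical stand-ins for the real ggml-vulkan structures (only the AMD_RDNA1 enumerator name is taken from the code referenced above), so this shows the decision logic under those assumptions and is not a patch.

```cpp
// Hypothetical stand-ins; the real ggml-vulkan types have more fields.
enum class device_architecture { OTHER, AMD_GCN, AMD_RDNA1, AMD_RDNA2 };
enum class queue_kind { TRANSFER, COMPUTE };

struct device_info {
    device_architecture architecture;
    bool has_dedicated_transfer_queue; // assumption: false means one shared queue
};

// Picks the queue on which a buffer fill should be recorded.
// On RDNA1 the fill is routed to the compute queue to avoid the
// transfer (SDMA) queue that appears to reject it.
queue_kind pick_memset_queue(const device_info& dev) {
    if (!dev.has_dedicated_transfer_queue) {
        return queue_kind::COMPUTE;
    }
    if (dev.architecture == device_architecture::AMD_RDNA1) {
        return queue_kind::COMPUTE;
    }
    return queue_kind::TRANSFER;
}
```

With this shape, ggml_vk_buffer_memset() would ask for the queue once and create its command context from that queue's pool. Keying the decision on architecture keeps other devices on the transfer queue, at the cost of a hard-coded exception that would need revisiting if the driver behavior changes.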

System Info

$ vulkaninfo --summary | grep -A5 "GPU0"
GPU0:
    apiVersion         = 1.3.289
    driverVersion      = 24.2.8
    vendorID           = 0x1002
    deviceID           = 0x7340
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = Radeon RX 5500 XT (RADV NAVI14)

Note

  • This appears to be specific to the SDMA/transfer queue on RDNA1
