@SpyrosMouselinos
Contributor

GPU Disconnect Testing Feature

Adds simulated GPU hardware failures and disconnects so you can test whether applications recover from them.


Available Methods

🐧 Linux Native Methods

  1. Slot Power Toggle

    • Platform: Linux only (requires PCI hotplug support)
    • Description: Cuts and restores power to the PCIe slot
    • Closest to: Physical GPU removal/insertion
    • Risk Level: Low - Hardware controlled
  2. Hot Reset

    • Platform: Linux only (requires PCIe bridge reset capability)
    • Description: Resets the PCIe link using upstream bridge controls
    • Use Case: Tests PCIe link recovery
    • Risk Level: Low - Hardware controlled
  3. Logical Remove/Rescan

    • Platform: Linux only
    • Description: Software removal from PCI bus and rescan
    • Use Case: Tests driver unbind/rebind scenarios
    • Risk Level: Low - No hardware reset
  4. NVIDIA GPU Reset

    • Platform: Linux only (requires nvidia-smi)
    • Description: Uses NVIDIA driver reset functionality via nvidia-smi --gpu-reset
    • Use Case: Tests driver-level recovery
    • Risk Level: Low - Driver managed
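
As an illustration of the Logical Remove/Rescan method, here is a minimal sketch built on the standard Linux sysfs PCI interface (`/sys/bus/pci/devices/<address>/remove` and `/sys/bus/pci/rescan`). It is not taken from this PR's implementation: the function names and the `sysfs_pci` parameter are hypothetical, and the example address `0000:01:00.0` is a placeholder. Writing to these files requires root on a native Linux host.

```python
import time
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")  # standard sysfs location on Linux


def logical_remove(bdf: str, sysfs_pci: Path = SYSFS_PCI) -> None:
    """Detach the device at PCI address `bdf` (e.g. '0000:01:00.0') from the bus."""
    (sysfs_pci / "devices" / bdf / "remove").write_text("1")


def rescan_bus(sysfs_pci: Path = SYSFS_PCI) -> None:
    """Ask the kernel to re-enumerate the PCI bus so removed devices reappear."""
    (sysfs_pci / "rescan").write_text("1")


def simulate_disconnect(bdf: str, down_time: float,
                        sysfs_pci: Path = SYSFS_PCI) -> None:
    """Remove the device, wait `down_time` seconds, then rescan (hypothetical helper)."""
    logical_remove(bdf, sysfs_pci)
    try:
        time.sleep(down_time)
    finally:
        # Always rescan, even if the wait is interrupted, so the GPU comes back.
        rescan_bus(sysfs_pci)
```

The `sysfs_pci` parameter exists only so the sketch can be pointed at a scratch directory for a dry run instead of the real `/sys/bus/pci`.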

⚠️ Experimental Method (All Platforms)

  1. Memory Flood
    • Platform: WSL2, Docker, Linux
    • Description: Fills roughly 95% of GPU memory to provoke an out-of-memory condition or driver reset
    • Use Case: Aggressive fault testing - may cause real GPU instability
    • Risk Level: ⚠️ HIGH - May cause system instability, driver crashes, or require reboot
    • WSL2/Docker: This is the only method available (PCI methods require native Linux)
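
The description does not show how the flood is sized, so the following is only a sketch of one way to plan it: given total and currently used GPU memory, compute chunk sizes that bring usage up to a target percentage. The function name `flood_plan`, the 256 MiB chunk size, and the chunked approach itself are assumptions for illustration; the actual allocation (for example through a CUDA library) is left out.

```python
def flood_plan(total_bytes: int, used_bytes: int, target_percent: int = 95,
               chunk_bytes: int = 256 * 1024**2) -> list[int]:
    """Return allocation sizes that raise usage to `target_percent` of total memory.

    Hypothetical helper: integer arithmetic keeps the target exact, and an
    empty list means usage is already at or above the target.
    """
    remaining = total_bytes * target_percent // 100 - used_bytes
    chunks = []
    while remaining > 0:
        size = min(chunk_bytes, remaining)
        chunks.append(size)
        remaining -= size
    return chunks
```

Allocating in chunks rather than in one request lets a caller stop early if an allocation fails before the target is reached.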

Usage

Web UI

  1. Click the Disconnect button on any GPU card
  2. Select disconnect method from dropdown
  3. Choose disconnect duration (5s - 5min)
  4. Click "Disconnect GPU"

API

# Disconnect GPU 0 for 10 seconds using auto-select
curl -X POST http://localhost:1312/api/gpu/0/disconnect \
  -H "Content-Type: application/json" \
  -d '{"method": "auto", "down_time": 10}'

# Use specific method
curl -X POST http://localhost:1312/api/gpu/0/disconnect \
  -H "Content-Type: application/json" \
  -d '{"method": "memory_flood", "down_time": 5}'
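
The same requests can be issued from Python. This is a minimal sketch using only the standard library; the URL, port, and JSON fields come from the curl examples above, while the function names and the 60-second timeout are assumptions. The response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1312"  # host and port from the curl examples


def build_disconnect_request(gpu_index: int, method: str = "auto",
                             down_time: int = 10,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    payload = json.dumps({"method": method, "down_time": down_time}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/gpu/{gpu_index}/disconnect",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def disconnect_gpu(gpu_index: int, method: str = "auto", down_time: int = 10,
                   timeout: float = 60.0) -> str:
    """Send the disconnect request and return the raw response body."""
    req = build_disconnect_request(gpu_index, method, down_time)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `disconnect_gpu(0, method="memory_flood", down_time=5)` mirrors the second curl command.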

Platform Support

| Method | WSL2/Docker | Linux Native |
| --- | --- | --- |
| Slot Power Toggle | ❌ | ✅ (hardware dependent) |
| Hot Reset | ❌ | ✅ (hardware dependent) |
| Logical Remove/Rescan | ❌ | ✅ |
| NVIDIA GPU Reset | ❌ | ✅ (nvidia-smi required) |
| Memory Flood | ⚠️ | ⚠️ |
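
The platform matrix above can be expressed as a small lookup. This sketch is an assumption about how a caller might decide which methods to offer, not the PR's auto-select logic. It uses one common heuristic for detecting WSL (the kernel release string contains "microsoft"); Docker detection is left out, so callers inside a container would pass `native_linux=False` themselves. The method names are the display names from this description, not API identifiers.

```python
import platform
from typing import Optional

LINUX_NATIVE_METHODS = [
    "Slot Power Toggle",
    "Hot Reset",
    "Logical Remove/Rescan",
    "NVIDIA GPU Reset",
]
EXPERIMENTAL_METHODS = ["Memory Flood"]


def is_wsl(release: Optional[str] = None) -> bool:
    """Heuristic: WSL kernels report 'microsoft' in their release string."""
    if release is None:
        release = platform.uname().release
    return "microsoft" in release.lower()


def available_methods(native_linux: bool) -> list[str]:
    """Return the methods the matrix lists for the given environment."""
    if native_linux:
        return LINUX_NATIVE_METHODS + EXPERIMENTAL_METHODS
    return list(EXPERIMENTAL_METHODS)
```

On a native host, hardware-dependent methods may still be unavailable, so a real selector would also need to probe for hotplug and bridge-reset support.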
