Taming AMD's Linux AI Stack: From Kernel Panics to 87% CPU Idle

How I fixed a fatal GPU deadlock in Frigate NVR by ripping out ROCm and routing AI inference through Linux gaming drivers.

The Short Version (For Everyone)

I recently set up a smart security camera system called Frigate on my home server. Frigate analyzes camera feeds in real time to detect people, cars, and animals. To do this without melting the server's main processor (CPU), it uses the graphics card (GPU).

My server has a brand new AMD Ryzen processor with built-in graphics (Radeon 760M). On paper, it's a beast. In reality? The server was completely crashing and freezing every 6 to 10 minutes. I had to pull the power plug to fix it.

Why was it crashing?

Imagine a busy intersection with traffic lights controlled by a highly complex computer system built by AMD (called ROCm). The system was trying to route two massive fleets of trucks at exactly the same time: one fleet carrying video data, the other carrying AI math calculations. The traffic controller completely panicked, caused a massive pileup, and then the tow trucks (the system reset protocol) broke down on the way to the scene.

How did I fix it?

Instead of relying on AMD's dedicated AI traffic controller, I fired it. I found another open-source tool (called ncnn) that routes the AI math through Vulkan. Vulkan is the same underlying technology that makes massive 3D video games run smoothly on Linux (like on the Steam Deck).

Because the gaming drivers are heavily tested by millions of players, they are rock solid. They handled the video and the AI math perfectly. My server went from crashing every 6 minutes (or, when I moved the AI math to the CPU as a stopgap, running at 100% CPU) to running flawlessly with the CPU sitting at 87% idle, sipping power.


The Deep Dive (For the Engineers)

1. The Nightmare: D State and TTM Deadlocks

The hardware: An AMD Ryzen 5 8600G (Phoenix1 architecture, RDNA3, gfx1103 APU). The software: Dockerized Frigate 0.17 utilizing ONNX Runtime.

Frigate utilizes the GPU for two distinct pipelines: VAAPI for hardware video decoding (4 camera streams), and ROCm/MIGraphX for ONNX Runtime object detection (YOLOv9).

Shortly after startup, the Frigate container would become completely unresponsive. Docker could not kill it (SIGKILL was ignored). A quick dive into the host system revealed the horror:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root    337200  0.0  0.0      0     0 ?        D<   02:01   0:00 [kworker/u49:9+ttm]
root    337215  0.0  0.0      0     0 ?        D<   02:01   0:00 [kworker/u49:10+ttm]
... (16 workers stuck in D state)
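
To reproduce that check without scanning ps output by eye, here is a minimal Python sketch that walks /proc and reports every task in uninterruptible sleep. It relies only on the documented /proc/<pid>/stat layout. The function names parse_stat and find_d_state are hypothetical, chosen for illustration, and are not part of any existing tool.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) parsed from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = stat_line.index("(")
    rpar = stat_line.rindex(")")
    pid = int(stat_line[:lpar].strip())
    comm = stat_line[lpar + 1:rpar]
    state = stat_line[rpar + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every task currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

On a healthy system this usually prints nothing, or only short-lived entries. A list of ttm workers that never goes away is the pattern shown above.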

Processes stuck in D (uninterruptible sleep) are blocked inside the kernel, usually on I/O or a kernel-level lock, which is also why SIGKILL had no effect. Looking at dmesg confirmed it:

[37456.170181] amdgpu 0000:0f:00.0: GPU reset begin!. Source: 5

The Root Cause: Because this is an APU, system RAM is shared as VRAM. The VAAPI processes (media engine) and the MIGraphX processes (compute engine) were stepping on each other inside the TTM (Translation Table Maps) memory manager. This deadlock triggered a GPU reset (Source 5 = compute engine hang). However, on gfx1103, the amdgpu kernel reset sequence is known-buggy and never completed, resulting in a zombie GPU.

2. The Failed Attempts & ROCm Pitfalls

Before abandoning ROCm entirely, I tried standard mitigation strategies to stop the engines from fighting:

  • Attempt 1 (Software Decode + ROCm): I disabled VAAPI to dedicate the GPU strictly to MIGraphX inference. ROCm delivered a blazing fast ~14ms inference time, but the GPU still hung after about 6 minutes, even with nothing else competing for it. ROCm compute on consumer RDNA3 APUs on Linux proved fundamentally unstable in my testing, apparently suffering from missing engine isolation.
  • Attempt 2 (VAAPI Decode + CPU Inference): I pushed YOLO detection to the CPU via standard ONNX Runtime and left VAAPI running on the GPU. Stability was achieved, but inference time ballooned to ~77ms, and the load average hovered around a staggering 45 (94% to 100% CPU utilization, completely pegged). Unacceptable for a home lab meant to run other services concurrently.

3. The Breakthrough: Vulkan and RADV

While AMD's compute stack (ROCm) is deeply flawed for consumer APUs, their Linux gaming stack is incredible. Mesa's RADV Vulkan driver is bulletproof. If I could route AI inference through Vulkan instead of ROCm, I could bypass the compute path that kept hanging entirely.

Microsoft's ONNX Runtime does have a Vulkan Execution Provider, but it is not shipped in their pre-built Linux wheels. Building it from source inside the Frigate container was an option, but a heavy one.

Instead, I pivoted to ncnn, Tencent's high-performance neural network inference framework optimized for mobile platforms. ncnn has native, highly-optimized Vulkan support. Because Python bindings for ncnn exist, I could write a custom detector plugin for Frigate.
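
As a rough illustration of what "inference through Vulkan" looks like from Python, here is a minimal sketch using the ncnn bindings. It is not the detector plugin itself. The blob names "in0" and "out0" are assumptions (they depend on how the model was converted), and the helper names preprocess, load_vulkan_net, and infer are hypothetical.

```python
import numpy as np


def preprocess(frame_hwc_u8):
    """Convert an HxWx3 uint8 frame to a contiguous 3xHxW float32 array in [0, 1]."""
    chw = np.transpose(frame_hwc_u8, (2, 0, 1)).astype(np.float32) / 255.0
    return np.ascontiguousarray(chw)


def load_vulkan_net(param_path, bin_path):
    """Load an ncnn model once, with GPU compute through Vulkan enabled."""
    import ncnn  # imported lazily so preprocess() works without ncnn installed

    net = ncnn.Net()
    net.opt.use_vulkan_compute = True  # set before loading the model files
    net.load_param(param_path)
    net.load_model(bin_path)
    return net


def infer(net, chw, in_blob="in0", out_blobs=("out0",)):
    """Run one forward pass and return each requested output blob as a NumPy array."""
    import ncnn

    ex = net.create_extractor()
    ex.input(in_blob, ncnn.Mat(chw))
    outputs = []
    for name in out_blobs:
        ret, mat = ex.extract(name)
        if ret != 0:
            raise RuntimeError(f"ncnn could not extract blob {name!r}")
        outputs.append(np.array(mat))
    return outputs
```

Loading is kept separate from inference so that a detector can build the net once at startup and only create a fresh extractor per frame.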

4. Building the Custom Detector

I downloaded a pre-converted YOLOv5s .ncnn model and its parameter file. However, replacing ONNX with ncnn meant I lost Frigate's built-in ONNX post-processing. The ncnn model outputs raw logits; it doesn't apply sigmoid activations or grid decoding internally like the ONNX graphs do.

I wrote a custom Python class to hijack Frigate's ONNXDetector type, manually reshape the raw tensors, apply the sigmoid functions, map the anchor boxes, and run Non-Maximum Suppression (NMS). Here is a snippet of the crucial grid decoding logic that bridged the gap:

import numpy as np

# Method of the custom detector class.
# self._anchors maps each stride to its (3, 2) array of anchor sizes in pixels.
# Reshape YOLOv5 outputs: (255, H, W) -> (3, 85, H, W) -> (H*W*3, 85)
def decode_output(self, ncnn_mat, stride):
    arr = np.array(ncnn_mat)  # (255, grid_h, grid_w)
    na, nc = 3, 80  # anchors per cell, number of classes
    no = 5 + nc  # 85 (4 box + 1 obj + 80 class)
    grid_h, grid_w = arr.shape[1], arr.shape[2]

    # Reshape and permute so each row is one (cell, anchor) prediction
    arr = arr.reshape(na, no, grid_h, grid_w)
    arr = np.transpose(arr, (2, 3, 0, 1))  # (grid_h, grid_w, 3, 85)
    arr = arr.reshape(-1, no)  # (grid_h*grid_w*3, 85)

    # Apply sigmoid (ncnn outputs raw logits)
    arr = 1.0 / (1.0 + np.exp(-arr))

    # Generate the cell-offset grid in the same (row, column, anchor) order as arr
    grid_y, grid_x = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing='ij')
    grid = np.stack([grid_x, grid_y], axis=-1)  # (grid_h, grid_w, 2)
    grid = np.expand_dims(grid, axis=2)  # (grid_h, grid_w, 1, 2)
    grid = np.tile(grid, (1, 1, na, 1)).reshape(-1, 2)

    # Extract bounding boxes
    xy = arr[:, :2]
    wh = arr[:, 2:4]

    # YOLOv5 decode formula: box centers and sizes in input-image pixels
    xy = (xy * 2.0 - 0.5 + grid) * stride
    anchors_tiled = np.tile(self._anchors[stride], (grid_h * grid_w, 1))
    wh = (wh * 2.0) ** 2 * anchors_tiled

    # ... Confidence thresholding and NMS follows ...
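
The NMS step mentioned above can be sketched in plain NumPy. This is a generic greedy NMS under stated assumptions, not the exact code from the detector: the 0.45 IoU threshold is an illustrative default, and xywh_to_xyxy and nms are hypothetical helper names.

```python
import numpy as np


def xywh_to_xyxy(xy, wh):
    """Convert center/size boxes (as produced by the decode step) to corner boxes."""
    half = wh / 2.0
    return np.concatenate([xy - half, xy + half], axis=1)


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS. boxes is (N, 4) as x1, y1, x2, y2; returns the kept indices."""
    boxes = np.asarray(boxes, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter + 1e-9)
        # Drop everything that overlaps the best box too much
        order = rest[iou <= iou_threshold]
    return keep
```

In practice the candidates would first be filtered by confidence (objectness times class score), and NMS would typically be run per class so that overlapping objects of different classes do not suppress each other.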

5. Final Results: ROCm vs CPU vs Vulkan

With the custom image deployed, the change was immediate. By running inference through Mesa's RADV gaming driver, I kept most of the benefit of hardware acceleration (~28ms versus ~14ms under ROCm) while dodging the GPU hangs entirely.

| Metric | ROCm + VAAPI (The Original Goal) | CPU + VAAPI (The Fallback) | ncnn Vulkan + VAAPI (The Fix) |
| --- | --- | --- | --- |
| Inference Speed | ~14ms 🚀 | ~77ms 🐌 | ~28ms ⚡ |
| CPU Idle | ~74% | 0% (Completely Pegged) | 87% |
| GPU Status | Deadlock / Hang | Video decode only | Decode + Inference (~39% Util) |
| System Stability | ❌ Kernel Panic (6 min) | ✅ Stable (But system unusable) | ✅ 100% Stable (3+ hrs) |
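
To put the inference times in perspective, a quick back-of-the-envelope conversion from latency to throughput helps. This sketch assumes a single detector handling one frame at a time, with the budget split evenly across the 4 camera streams. That is a simplification for illustration, not a description of how Frigate actually schedules detections.

```python
def detections_per_second(latency_ms):
    """Upper bound on inferences per second for a detector that runs serially."""
    return 1000.0 / latency_ms


def per_camera_budget(latency_ms, cameras=4):
    """Even split of that upper bound across the camera streams."""
    return detections_per_second(latency_ms) / cameras


for name, ms in [("ROCm", 14), ("CPU", 77), ("ncnn Vulkan", 28)]:
    total = detections_per_second(ms)
    share = per_camera_budget(ms)
    print(f"{name}: ~{total:.0f} inferences/s total, ~{share:.1f} per camera")
```

Under these assumptions the Vulkan path has roughly half the headroom of ROCm but nearly three times that of the CPU fallback, and it keeps the CPU free for everything else.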

The Legacy Hardware Implication: This workaround isn't just for bleeding-edge RDNA3. Older APUs like the Ryzen 5 3500U (Vega 8) were never officially supported by ROCm. However, because they fully support Vulkan 1.2, the same ncnn Vulkan pipeline can bring hardware-accelerated AI to them as well.

The PR for this fallback mechanism is open on GitHub. If you're running Frigate on AMD hardware and pulling your hair out over amdgpu kernel crashes, bypass ROCm entirely. Vulkan is the way.

© 2026 Engineering Log. Building resilient systems on Linux.