The AMD Vulkan Chronicles

This is a classic “Developer in the Trenches” story. It’s got a ghost in the machine, a GPU that won’t cooperate, and a happy ending with a dark-horse model.

Below is a summary of our troubleshooting session and a blog post draft you can share with the LLM community.

Executive Summary

The Goal: Run an LLM with Tool-Calling capabilities on a Windows machine with an AMD Radeon RX 5500 XT (8GB VRAM) using Ollama and the Vulkan backend.
The Conflict:
- Gemma 3 1B suffered from “Split-Brain” syndrome, offloading 576MB to the CPU despite having 7.3GB of free VRAM.
- Llama 3.2 1B ignored user-defined context limits and tried to allocate a massive 128k context window (8.3GB), causing an ErrorOutOfDeviceMemory crash.
- Environmental Ghost: A hidden system-level variable (OLLAMA_CONTEXT_LENGTH: 131072) was overriding session settings.
The Solution: Switched to Qwen 2.5 Coder 1.5B. It proved to be the most stable of the three models on AMD/Vulkan, respected memory limits, and handled tool-calling and Greek characters natively.

Blog Post: The AMD Vulkan Chronicles

Title: Fighting the 128k Ghost: My Journey to Stable Local LLMs on AMD

If you’ve ever tried to run local LLMs on Windows with an AMD card, you know the “Vulkan Dance.” Last night, I went ten rounds with Ollama trying to get a simple 1B model to run fully on my Radeon RX 5500 XT. Here is the post-mortem of that battle.

1. The Ghost in the Registry

I started by trying to run Gemma 3 1B. My GPU has 8GB of VRAM, and the model is tiny. It should have been a breeze. Instead, I saw the dreaded “Split-Brain”: Ollama was shoving 500MB+ onto my CPU.

The Culprit: My logs revealed a hidden system variable: OLLAMA_CONTEXT_LENGTH: 131072. No matter what I typed in PowerShell, the system was forcing a 128k context window, which ate my VRAM for breakfast.

Lesson: Always check your ollama serve logs for the env="map[...]" section. If there’s a variable there you didn’t set, find it in your Windows Environment Variables and kill it.
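To track down where a stray value like this lives, something along these lines can help. It is a minimal sketch that assumes the variable is stored in one of the three standard Windows scopes. The removal line is commented out on purpose, so nothing is deleted until you have seen the output.

```powershell
# List the variable in each scope Windows can hold it in.
foreach ($scope in 'Process', 'User', 'Machine') {
    $value = [Environment]::GetEnvironmentVariable('OLLAMA_CONTEXT_LENGTH', $scope)
    if ($null -eq $value) { $value = '(not set)' }
    '{0,-8} {1}' -f $scope, $value
}

# Uncomment to delete a stale user-level value.
# [NullString]::Value passes a real null, which removes the variable.
# The 'Machine' scope works the same way but needs an elevated prompt.
# [Environment]::SetEnvironmentVariable('OLLAMA_CONTEXT_LENGTH', [NullString]::Value, 'User')
```

After changing a User or Machine value, quit Ollama and open a fresh PowerShell window. Processes that are already running keep the environment they started with.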

2. The Llama 3.2 “Memory Leak”

Next, I tried Llama 3.2 1B. This resulted in a total panic: ggml_vulkan: Device memory allocation failed. Even though I set my context to 32k, the Llama runner ignored me and tried to reserve 8.3GB of VRAM for a 128k window. On an 8GB card, that’s an instant crash.
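A rough back-of-the-envelope calculation shows why the context length matters so much. This is a sketch under stated assumptions: the layer count, KV-head count, and head size below are what I believe Llama 3.2 1B uses, and the cache is assumed to be f16. Check the model card before relying on them. It only estimates the KV cache, so it does not explain the whole 8.3GB figure, which also covers weights and compute buffers.

```powershell
# Assumed Llama 3.2 1B attention shape (not taken from my logs).
$layers      = 16
$kvHeads     = 8
$headDim     = 64
$bytesPerVal = 2      # f16 cache entries

# Keys and values are both cached, hence the leading factor of 2.
[long]$bytesPerToken = 2 * $layers * $kvHeads * $headDim * $bytesPerVal

foreach ($ctx in 16384, 32768, 131072) {
    $gib = $bytesPerToken * $ctx / 1GB
    '{0,7} tokens -> {1:N2} GiB of KV cache' -f $ctx, $gib
}
```

Under these assumptions the cache grows linearly, from about 0.5 GiB at 16k tokens to about 4 GiB at 128k. That is why an unexpected 128k setting hurts so much on an 8GB card.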

3. The “Split-Brain” Mystery

Even after freeing up 7.3GB of VRAM, some models (like Gemma 3 QAT) still insisted on putting a fraction of the weights on the CPU. This is often due to VRAM fragmentation. Windows and AMD drivers are conservative; if they can’t find one contiguous block of memory big enough for an allocation, part of the model falls back to the CPU, tanking your tokens-per-second.
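A quick way to see whether a loaded model is split is to ask Ollama directly while the model is still resident. A minimal check:

```powershell
# Lists loaded models; the PROCESSOR column reports the placement,
# for example "100% GPU" or a CPU/GPU percentage split.
ollama ps
```

Anything other than a full GPU placement on a model this small is a sign that something is eating VRAM or forcing a larger allocation than expected.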

4. The Winner: Qwen 2.5 Coder 1.5B

After failing with the “big names,” I pivoted to Qwen 2.5 Coder 1.5B.

Stability: It loaded 100% onto the GPU instantly.
Speed: I went from a 3.7-second “handshake” lag to near-instant responses.
Utility: Since I’m building a tool to manage my Greek supermarket lists (including baby supplies for my 3-person family), the “Coder” variant’s strictness with JSON and tool-calling was exactly what I needed.
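For a feel of what tool-calling looks like against the local server, here is a minimal sketch. It assumes Ollama is listening on its default port. The add_item tool and its fields are hypothetical, made up for this example and not taken from my actual project.

```powershell
# Hypothetical tool definition, sent to Ollama's /api/chat endpoint.
$body = @{
    model    = 'qwen2.5-coder:1.5b'
    stream   = $false
    messages = @(
        # Greek for: "Add 2 packs of diapers to the list."
        @{ role = 'user'; content = 'Πρόσθεσε 2 πακέτα πάνες στη λίστα.' }
    )
    tools    = @(
        @{
            'type'     = 'function'
            'function' = @{
                name        = 'add_item'
                description = 'Add an item to the shopping list'
                parameters  = @{
                    'type'     = 'object'
                    properties = @{
                        name     = @{ 'type' = 'string';  description = 'Item name' }
                        quantity = @{ 'type' = 'integer'; description = 'How many to buy' }
                    }
                    required   = @('name', 'quantity')
                }
            }
        }
    )
} | ConvertTo-Json -Depth 10

# Send the body as UTF-8 bytes so the Greek text survives the trip.
$response = Invoke-RestMethod -Uri 'http://localhost:11434/api/chat' -Method Post `
    -ContentType 'application/json; charset=utf-8' `
    -Body ([Text.Encoding]::UTF8.GetBytes($body))

# If the model decided to call the tool, the call shows up here.
$response.message.tool_calls | ConvertTo-Json -Depth 10
```

If tool_calls comes back empty, the model answered in plain text instead. That reply is in $response.message.content.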

My Final Setup Script:

If you’re on an AMD card, don’t just run ollama run. Use a clean PowerShell session:

PowerShell

$env:OLLAMA_CONTEXT_LENGTH="16384"  # Start safe; same variable name the server log reported
$env:OLLAMA_VULKAN="1"
ollama serve
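Whatever the server default ends up being, a single request can also ask for its own window through the num_ctx option. A minimal sketch, assuming the server from the script above is running in another window and the model has already been pulled:

```powershell
$body = @{
    model   = 'qwen2.5-coder:1.5b'
    prompt  = 'Reply with the single word: ready'
    stream  = $false
    options = @{ num_ctx = 16384 }   # per-request context window
} | ConvertTo-Json -Depth 5

$reply = Invoke-RestMethod -Uri 'http://localhost:11434/api/generate' -Method Post `
    -ContentType 'application/json' -Body $body
$reply.response
```

This is handy for testing a larger window on one call without touching the environment again.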

Final Takeaway: Don’t get married to a specific model. If the drivers hate the architecture, pivot to Qwen. It’s the hidden gem of the small-model world for AMD users.
