This is a classic “Developer in the Trenches” story. It’s got a ghost in the machine, a GPU that won’t cooperate, and a happy ending with a dark-horse model.
Below is a summary of our troubleshooting session and a blog post draft you can share with the LLM community.
Executive Summary
- The Goal: Run an LLM with Tool-Calling capabilities on a Windows machine with an AMD Radeon RX 5500 XT (8GB VRAM) using Ollama and the Vulkan backend.
- The Conflict: * Gemma 3 1B suffered from “Split-Brain” syndrome, offloading 576MB to the CPU despite having 7.3GB of free VRAM.
- Llama 3.2 1B ignored user-defined context limits and tried to allocate a massive 128k context window (8.3GB), causing an
ErrorOutOfDeviceMemorycrash. - Environmental Ghost: A hidden system-level variable (
OLLAMA_CONTEXT_LENGTH: 131072) was overriding session settings.
- Llama 3.2 1B ignored user-defined context limits and tried to allocate a massive 128k context window (8.3GB), causing an
- The Solution: Switched to Qwen 2.5 Coder 1.5B. It proved to be the most stable for AMD/Vulkan, respected memory limits, and handles tool-calling/Greek characters natively.
Blog Post: The AMD Vulkan Chronicles
Title: Fighting the 128k Ghost: My Journey to Stable Local LLMs on AMD
If you’ve ever tried to run local LLMs on Windows with an AMD card, you know the “Vulkan Dance.” Last night, I went ten rounds with Ollama trying to get a simple 1B model to run fully on my Radeon RX 5500 XT. Here is the post-mortem of that battle.
1. The Ghost in the Registry
I started by trying to run Gemma 3 1B. My GPU has 8GB of VRAM, and the model is tiny. It should have been a breeze. Instead, I saw the dreaded “Split-Brain”: Ollama was shoving 500MB+ onto my CPU.
The Culprit: My logs revealed a hidden system variable: OLLAMA_CONTEXT_LENGTH: 131072. No matter what I typed in PowerShell, the system was forcing a 128k context window, which ate my VRAM for breakfast.
Lesson: Always check your ollama serve logs for the env="map[...]" section. If there’s a variable there you didn’t set, find it in your Windows Environment Variables and kill it.
2. The Llama 3.2 “Memory Leak”
Next, I tried Llama 3.2 1B. This resulted in a total panic: ggml_vulkan: Device memory allocation failed. Even though I set my context to 32k, the Llama runner ignored me and tried to reserve 8.3GB of VRAM for a 128k window. On an 8GB card, that’s an instant crash.
3. The “Split-Brain” Mystery
Even after freeing up 7.3GB of VRAM, some models (like Gemma 3 QAT) still insisted on putting a fraction of the weights on the CPU. This is often due to VRAM fragmentation. Windows and AMD drivers are conservative; if they can’t find one perfectly continuous block of memory, they bail to the CPU, tanking your tokens-per-second.
4. The Winner: Qwen 2.5 Coder 1.5B
After failing with the “big names,” I pivoted to Qwen 2.5 Coder 1.5B.
- Stability: It loaded 100% onto the GPU instantly.
- Speed: I went from a 3.7-second “handshake” lag to near-instant responses.
- Utility: Since I’m building a tool to manage my Greek supermarket lists (including baby supplies for my 3-person family), the “Coder” variant’s strictness with JSON and tool-calling was exactly what I needed.
My Final Setup Script:
If you’re on an AMD card, don’t just run ollama run. Use a clean PowerShell session:
PowerShell
$env:OLLAMA_NUM_CTX="16384" # Start safe
$env:OLLAMA_VULKAN="1"
ollama serve
Final Takeaway: Don’t get married to a specific model. If the drivers hate the architecture, pivot to Qwen. It’s the hidden gem of the small-model world for AMD users.

