Remote OpenClaw Blog
GPU Optimization Guide for Ollama Models in OpenClaw
7 min read
If you are running Ollama models locally for OpenClaw, your GPU is the bottleneck that determines everything: which models you can run, how much context you can hold, how fast your agent responds, and whether your system stays stable under load. Most operators set up Ollama, pull a model, and never think about GPU optimization — and then wonder why their agent feels slow or starts dropping context mid-session.
This guide covers the practical GPU optimization decisions that matter for OpenClaw specifically. If you have not picked your model yet, start with the best Ollama models for OpenClaw guide first.
OpenClaw is not a thin chatbot. It is an agent runtime that maintains tool state, memory context, system instructions, and multi-turn conversation history simultaneously. All of that content lives in the model's context window, and the context window lives in VRAM.
A regular chat application might use 2-4K tokens per interaction. An OpenClaw agent session routinely carries 20-60K tokens of active context. That is a 10-15x difference in VRAM pressure compared to casual model usage.
This means GPU optimization for OpenClaw is fundamentally about VRAM management. Raw compute speed matters for token generation, but VRAM capacity determines whether your agent can function at all with the context it needs.
Here are the practical VRAM requirements for the most common OpenClaw-suitable Ollama models. These numbers reflect Q4_K_M quantization, which is the most common default.
| Model | Params | Q4_K_M Size | VRAM at 4K ctx | VRAM at 32K ctx | VRAM at 64K ctx |
|---|---|---|---|---|---|
| qwen3.5:9b | 9B | ~6.6GB | ~8GB | ~12GB | ~16GB |
| glm-4.7-flash | 30B (3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3-coder:30b | 30B (3.3B active) | ~18GB | ~20GB | ~24GB | ~28GB |
| qwen3.5:27b | 27B | ~17GB | ~19GB | ~23GB | ~27GB |
The key insight from this table: the model weights are only part of the VRAM story. The context window adds substantial VRAM overhead, and that overhead scales linearly with context length. A model that fits comfortably at 4K context might not fit at 64K context on the same GPU.
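That linear scaling is easy to sanity-check with shell arithmetic. The sketch below uses illustrative transformer dimensions (the layer count, KV-head count, and head size are placeholders, not the real config of any model in the table):

```shell
# Back-of-envelope KV-cache size for fp16 K/V:
#   ctx_tokens x layers x kv_heads x head_dim x 2 (K and V) x 2 bytes
ctx=64000       # context length in tokens
layers=40       # transformer layers (assumed)
kv_heads=8      # KV heads under grouped-query attention (assumed)
head_dim=128    # dimension per head (assumed)
bytes=$(( ctx * layers * kv_heads * head_dim * 2 * 2 ))
echo "KV cache: $(( bytes / 1024 / 1024 )) MiB at ${ctx} tokens"
# prints: KV cache: 10000 MiB at 64000 tokens
```

Halving the context halves that figure, which is exactly the linear relationship the table shows between the 4K, 32K, and 64K columns.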
For VPS and cloud GPU options, see the best VPS for OpenClaw guide.
Ollama picks a context-length default automatically based on your available VRAM. For OpenClaw, these defaults are almost always wrong: Ollama's own documentation recommends at least 64K context for agent workloads, but on a 16GB GPU Ollama will default to 4K context, which is far too low for OpenClaw to function properly.
Override the default explicitly:
```shell
# Set context length for the Ollama server
OLLAMA_CONTEXT_LENGTH=64000 ollama serve

# Verify the active context allocation
ollama ps
```
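If you would rather pin the context length per model than per server, Ollama Modelfiles support a num_ctx parameter. A minimal sketch, using the qwen3.5:9b tag from the table and a derived model name of our own choosing:

```shell
# Bake a 64K context into a derived model (the qwen3.5-64k name is ours)
cat > Modelfile <<'EOF'
FROM qwen3.5:9b
PARAMETER num_ctx 64000
EOF
ollama create qwen3.5-64k -f Modelfile

# The derived model now loads with 64K context by default
ollama run qwen3.5-64k
```

The Modelfile approach survives server restarts and lets different models carry different context budgets on the same host.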
The tradeoff is straightforward: higher context uses more VRAM, leaving less room for the model itself. If you cannot fit both your model weights and 64K context in VRAM, you have three options: drop to a heavier quantization, reduce the context window, or switch to a smaller model.
Quantization reduces model precision to fit larger models into less VRAM. Ollama handles quantization automatically when you pull a model, but understanding the tradeoffs helps you make better choices.
| Quantization | Bits per weight | VRAM savings vs FP16 | Quality impact | Best for |
|---|---|---|---|---|
| FP16 | 16 | Baseline | None | Maximum quality, plenty of VRAM |
| Q8_0 | 8 | ~50% | Minimal | Quality-sensitive tasks with large VRAM |
| Q5_K_M | ~5.5 | ~65% | Small | Good balance for 24GB GPUs |
| Q4_K_M | ~4.5 | ~72% | Moderate | Best default for most operators |
| Q3_K_M | ~3.5 | ~78% | Noticeable | Squeezing large models onto small GPUs |
| Q2_K | ~2.5 | ~84% | Significant | Last resort only |
For OpenClaw specifically, Q4_K_M is the sweet spot for most operators. Agent tasks like tool calling, code generation, and instruction following are less sensitive to quantization than creative writing or nuanced reasoning. You lose very little practical performance going from Q8 to Q4 for typical OpenClaw workflows.
Below Q4, quality degradation becomes noticeable. Q3 can work for simple tasks but starts failing on multi-step reasoning. Q2 is not recommended for OpenClaw under any circumstances.
```shell
# Pull a specific quantization level
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q8_0

# Check which quantization you are running
ollama show qwen3.5:9b --modelfile
```
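The size column in the quantization table follows directly from bits per weight. The sketch below computes a rough lower bound only, using the 9B model as the example; real GGUF files run larger because embeddings and some tensors stay at higher precision:

```shell
# Lower-bound file size = params x bits-per-weight / 8 bytes
params=9000000000    # 9B parameters
awk -v p="$params" 'BEGIN {
  printf "FP16:   %.1f GB\n", p * 16.0 / 8 / 1e9
  printf "Q8_0:   %.1f GB\n", p * 8.0  / 8 / 1e9
  printf "Q4_K_M: %.1f GB\n", p * 4.5  / 8 / 1e9   # ~4.5 bits/weight
}'
```

The Q4_K_M result (~5.1 GB) lands under the ~6.6 GB in the model table above; the gap is the mixed-precision tensors and metadata that the simple formula ignores.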
If you run a single OpenClaw instance, batch size rarely matters — you are generating one response at a time. But if you run multiple agents or have OpenClaw handling concurrent tasks, batch settings affect throughput significantly.
Ollama's default batch size works for single-user scenarios; for concurrent usage, the server's parallelism settings are worth tuning.
For most single-operator OpenClaw deployments, the default batch settings are fine. Optimize here only if you notice throughput problems with concurrent workloads.
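If you do run concurrent agents, Ollama's server-level parallelism knobs are the place to start. A sketch with illustrative values:

```shell
# Serve up to 4 concurrent requests per loaded model; queue the rest.
# Each parallel slot needs its own KV-cache allocation, so VRAM use for
# context grows with OLLAMA_NUM_PARALLEL.
OLLAMA_NUM_PARALLEL=4 \
OLLAMA_MAX_QUEUE=128 \
OLLAMA_CONTEXT_LENGTH=64000 \
ollama serve
```

Raise OLLAMA_NUM_PARALLEL only as far as your VRAM budget allows after accounting for the per-slot context allocation.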
RTX 3060 12GB or RTX 2080 Ti 11GB. These GPUs handle 7-9B models at Q4 quantization with moderate context. You will not hit the 64K context recommendation with larger models, but they work for lighter OpenClaw usage paired with an OpenRouter fallback.
RTX 3090 24GB or RTX 4070 Ti Super 16GB. The RTX 3090 is the best value GPU for local inference right now. 24GB VRAM fits most OpenClaw-suitable models at Q4 with 32-64K context. This is the sweet spot for serious local operators.
RTX 4090 24GB or dual RTX 3090. The 4090 offers the best single-GPU performance with 24GB VRAM and much faster inference than the 3090. Dual 3090s give you 48GB total VRAM for larger models or higher context windows, but multi-GPU inference adds complexity.
M2 Pro/Max or M3 Pro/Max. Apple Silicon shares memory between CPU and GPU, giving you effectively 32-96GB of "VRAM" depending on your configuration. Ollama has native Metal support. The M3 Max with 96GB unified memory can run very large models at full context. For the OpenClaw setup guide, Apple Silicon is one of the most practical local options.
The most common GPU-related OpenClaw problem is invisible: the model runs but delivers poor results because the context window was silently truncated to fit in VRAM. Always verify your actual allocation.
```shell
# Check NVIDIA GPU memory usage in real time
nvidia-smi -l 1

# Check what Ollama has loaded and its context allocation
ollama ps

# Check if the model is using GPU or fell back to CPU
ollama ps | grep -i "gpu\|cpu"
```
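The same information is available over Ollama's HTTP API, which is handy for scripted health checks (default port 11434; recent versions report model size and the GPU/CPU memory split):

```shell
# List loaded models and how much of each sits in GPU vs CPU memory
curl -s http://localhost:11434/api/ps
```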
For the recommended 64K context window, you need at least 16GB VRAM for smaller models like qwen3.5:9b and 24GB or more for mid-size models like glm-4.7-flash or qwen3-coder:30b. The exact requirement depends on the model size and quantization level. Running at lower context windows reduces VRAM needs but also reduces OpenClaw performance.
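As a quick budget check, add the quantized weight size, the KV cache at your target context, and a little headroom for runtime buffers. The numbers below are rounded from the table above; the KV and headroom figures are assumptions:

```shell
weights_gb=7     # qwen3.5:9b at Q4_K_M (~6.6 GB, rounded up)
kv_gb=8          # approximate KV cache + overhead at 64K context (assumed)
headroom_gb=1    # CUDA/runtime buffers (assumed)
vram_gb=16
need=$(( weights_gb + kv_gb + headroom_gb ))
if [ "$need" -le "$vram_gb" ]; then
  echo "fits: ${need} GB needed, ${vram_gb} GB available"
else
  echo "too big: ${need} GB needed, ${vram_gb} GB available"
fi
```

A 16GB card is right at the limit for this configuration, which is why the hardware tiers above treat 24GB as the comfortable sweet spot.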
Q4_K_M is the best starting point for most operators because it cuts VRAM usage roughly in half compared to full precision while keeping quality loss minimal for agent tasks. Q8 is noticeably better for complex reasoning but requires significantly more VRAM. Only use Q8 if your GPU has headroom after accounting for context window memory.
An RTX 3060 can run OpenClaw, but with limitations. Its 12GB of VRAM is enough for Q4-quantized 7-9B models at moderate context lengths. You will not be able to run 30B models or reach the full 64K context recommendation. For budget hardware, pair a smaller local model with an OpenRouter fallback for heavier tasks.
Ollama automatically detects and uses NVIDIA GPUs with CUDA support and Apple Silicon GPUs with Metal support. You do not need to configure GPU offloading manually in most cases. Use nvidia-smi or ollama ps to verify that the model is loaded on your GPU rather than running on CPU.