Remote OpenClaw Blog
Quantization Strategies: Running Large Models on Budget Hardware
8 min read
Quantization is the single most impactful technique for running capable language models on hardware that should not be able to handle them. If you have a $300 GPU and want to run a 30B parameter model for OpenClaw, quantization is what makes that possible — not by magic, but by trading precision you do not need for VRAM you desperately do.
This guide explains quantization in practical terms for OpenClaw operators, not researchers. You will learn which quantization levels work, which ones break agent reliability, and how to pick the right level for your specific hardware. For model selection, start with the best Ollama models for OpenClaw guide.
A language model is essentially billions of numbers (weights) that together encode everything the model knows. In their original form, these weights are stored as 16-bit floating-point numbers (FP16). Each weight takes 2 bytes of memory.
A 30B parameter model at FP16 needs roughly 60GB of memory just for the weights — before adding context window overhead, operating system memory, or anything else. That exceeds even high-end consumer GPUs.
Quantization reduces the precision of each weight. Instead of 16 bits per weight, you store them in 8 bits, 4 bits, or even 2 bits. This directly reduces memory requirements: 8-bit storage halves the weight footprint, 4-bit cuts it to a quarter, and 2-bit to an eighth.
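The arithmetic is simple enough to check directly. A small sketch in decimal GB (real GGUF levels such as Q4_K_M average slightly more than their nominal bit width because of per-block scale metadata, which is why the table below shows ~18GB rather than 15GB for Q4):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8

# A 30B-parameter model at various precisions
print(weight_memory_gb(30, 16))  # 60.0 GB at FP16
print(weight_memory_gb(30, 8))   # 30.0 GB at 8-bit
print(weight_memory_gb(30, 4))   # 15.0 GB at 4-bit
```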
The reason this works is that most model weights are clustered around similar values. Reducing precision loses some information in the tails of the distribution, but the core patterns that drive model behavior are preserved.
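To see why clustered weights survive low-bit storage, here is a toy symmetric round-to-nearest quantizer (a simplified sketch, not the K-quant scheme GGUF actually uses). Reconstruction error grows as bits shrink, but at 8 and 4 bits it stays small relative to the spread of the weights:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric round-to-nearest quantization to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(w).max()) / qmax
    return np.round(w / scale).astype(np.int32), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # weights cluster near zero

for bits in (8, 4, 2):
    q, scale = quantize(w, bits)
    err = float(np.abs(w - dequantize(q, scale)).mean())
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Running this shows the pattern the quality table below reflects: the jump from 8 to 4 bits costs little, while 2 bits collapses most weights onto a handful of values.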
If you use Ollama for OpenClaw — which most operators do — you are working with GGUF files. GGUF (GPT-Generated Unified Format) is the quantization format used by llama.cpp, the inference engine that powers Ollama.
GGUF matters because it determines what quantization levels are available and how they behave. Key things to know:

- When you run `ollama pull`, you get a GGUF file.
- When Ollama lists model sizes, those are GGUF sizes.
- You do not need to convert formats or manage files manually.

Other quantization formats exist — GPTQ, AWQ, AQLM — but they are primarily for vLLM and other GPU-only inference servers. If you are running OpenClaw through Ollama, GGUF is the only format you need to care about. For raw hosting options, check the self-hosted LLM guide.
GGUF offers a range of quantization levels. Here is what each level means in practice for a 30B model:
| Level | Approx. size (30B) | VRAM at 64K ctx | Quality vs FP16 | OpenClaw viability |
|---|---|---|---|---|
| Q8_0 | ~32GB | ~42GB | ~99% | Excellent — if you have the VRAM |
| Q6_K | ~25GB | ~35GB | ~98% | Excellent |
| Q5_K_M | ~22GB | ~32GB | ~96% | Very good |
| Q4_K_M | ~18GB | ~28GB | ~93% | Good — best default for most operators |
| Q4_K_S | ~17GB | ~27GB | ~91% | Good — slightly less reliable on edge cases |
| Q3_K_M | ~15GB | ~25GB | ~85% | Marginal — noticeable degradation on multi-step tasks |
| Q3_K_S | ~13GB | ~23GB | ~80% | Poor — tool calling starts failing |
| Q2_K | ~11GB | ~21GB | ~65% | Not recommended for OpenClaw |
The quality percentages are approximate and vary by model architecture. The important pattern: quality degrades slowly from Q8 to Q4, then drops off sharply below Q4. For OpenClaw, Q4_K_M is the inflection point where you get the best VRAM savings without meaningful quality loss for agent tasks.
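Given the table, picking a level for a 30B model is a scan from best to worst quality until one fits your VRAM budget. A sketch using the approximate 64K-context figures above:

```python
# (level, approx. VRAM in GB for a 30B model at 64K context), best quality first
LEVELS_30B_64K = [
    ("Q8_0", 42), ("Q6_K", 35), ("Q5_K_M", 32),
    ("Q4_K_M", 28), ("Q4_K_S", 27), ("Q3_K_M", 25),
]

def best_fit(vram_gb: float):
    """Return the highest-quality level that fits, or None if nothing does."""
    for level, needed in LEVELS_30B_64K:
        if needed <= vram_gb:
            return level
    return None

print(best_fit(48))  # Q8_0
print(best_fit(30))  # Q4_K_M
print(best_fit(24))  # None: even Q3_K_M needs ~25GB at the full 64K context
```

The last case is why the 24GB tier below pairs 30B models with a smaller 32-64K context window rather than always running the full 64K.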
OpenClaw tasks have a specific quality profile that interacts with quantization differently than general chatbot usage:
OpenClaw relies on the model producing correctly formatted tool calls — JSON objects with specific keys and values. This is essentially structured output generation. At Q4 and above, modern models handle this reliably. Below Q4, you start seeing malformed JSON, missing parameters, and phantom tool calls that reference tools that do not exist. This is the single biggest quality concern for quantized OpenClaw models.
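These failure modes are easy to check for mechanically. A hypothetical validator sketch — the tool registry and JSON shape here are invented for illustration, not OpenClaw's actual wire format:

```python
import json

# Hypothetical tool registry: tool name -> required parameter names
TOOLS = {"read_file": {"path"}, "run_command": {"command"}}

def check_tool_call(raw: str) -> str:
    """Classify a model-emitted tool call into the failure modes described above."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed JSON"
    name = call.get("tool")
    if name not in TOOLS:
        return "phantom tool"
    missing = TOOLS[name] - set(call.get("args", {}))
    if missing:
        return f"missing parameters: {sorted(missing)}"
    return "ok"

print(check_tool_call('{"tool": "read_file", "args": {"path": "notes.txt"}}'))  # ok
print(check_tool_call('{"tool": "browse_web", "args": {}}'))                    # phantom tool
print(check_tool_call('{"tool": "read_file", "args": {'))                       # malformed JSON
```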
Code generation is moderately sensitive to quantization because syntax errors are binary failures — the code works or it does not. At Q4_K_M, most models maintain good code quality for common languages and patterns. You may see slightly more syntax errors in less common languages or complex algorithmic code, but for typical OpenClaw code tasks (file manipulation, API calls, scripting), Q4 is reliable.
OpenClaw feeds the model system instructions, persona definitions, skill descriptions, and task context. The model needs to follow these instructions accurately across long sessions. Quantization has minimal impact here at Q4 and above. The model's ability to maintain instruction adherence is more dependent on context length than weight precision.
Complex multi-step tasks — where the agent needs to plan, execute, evaluate, and adjust — are the most sensitive to quantization. At Q4, you may see occasional planning errors that a Q8 or FP16 model would avoid. For most operational tasks, this is acceptable. For complex reasoning-heavy workflows, consider a higher-precision quantization level or routing those specific tasks to a cloud model through OpenRouter.
At 8GB, your options are limited to small models at aggressive quantization. The practical choice is qwen3.5:3b at Q4_K_M, which leaves room for a 16-32K context window. This is enough for basic OpenClaw tasks but will struggle with complex multi-step workflows. Best paired with an OpenRouter fallback for heavier work. For the full cost breakdown, see the cheapest way to run OpenClaw.
12GB opens up 7-9B models at Q4_K_M with room for 32-48K context. qwen3.5:9b at Q4 is the best fit here. You can push to 64K context if you accept some partial CPU offloading (slower but functional). This tier delivers genuinely useful OpenClaw performance for routine tasks.
16GB is the entry point for comfortable OpenClaw usage. You can run qwen3.5:9b at Q4 with a full 64K context window, or stretch to 14B models at Q4 with moderate context. This tier handles most daily OpenClaw workflows without compromises.
24GB is the sweet spot. You can run glm-4.7-flash or qwen3-coder:30b at Q4_K_M with 32-64K context. This is the hardware tier where the recommended OpenClaw models run comfortably without sacrifices.
The decision process is straightforward: check what quantization levels exist for your model, pull the one that fits your VRAM budget, and compare sizes before committing.
```shell
# Check available quantization levels for a model
ollama show qwen3.5:9b --modelfile

# Pull a specific quantization
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q5_K_M

# Compare sizes
ollama list
```
Some GGUF quantizations use an "importance matrix" (imatrix) to determine which weights are most critical for quality. imatrix-quantized models preserve precision on the weights that matter most, producing better quality at the same bit level. When choosing between two Q4 variants, prefer the imatrix version if available.
The "K" in Q4_K_M means the quantization uses different precision levels for different layers. The first and last layers of the model — which handle input processing and output generation — typically get higher precision (Q6 or Q8), while middle layers use the target quantization level (Q4). This is why Q4_K_M outperforms Q4_0 despite having similar average bit widths.
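A back-of-the-envelope sketch of why the blend lands above the nominal 4 bits per weight. The specific bit counts and layer split here are illustrative, not the exact llama.cpp allocation, which varies by quant type and tensor:

```python
def average_bits(n_layers: int, edge_bits: float = 6.56, mid_bits: float = 4.5,
                 edge_layers: int = 2) -> float:
    """Weighted average bits/weight when edge layers keep higher precision.

    edge_bits/mid_bits are illustrative effective sizes (including scale
    metadata), not llama.cpp's exact per-tensor allocation.
    """
    mid_layers = n_layers - edge_layers
    return (edge_layers * edge_bits + mid_layers * mid_bits) / n_layers

print(round(average_bits(48), 2))  # ≈ 4.59, above the nominal 4 bits
```

This is also why a "4-bit" GGUF file is bigger than a naive params-times-4-bits estimate suggests.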
Some operators maintain two copies of the same model at different quantization levels — a Q4 version for routine tasks and a Q6 or Q8 version for important work. You can switch between them in your OpenClaw configuration based on the task type. This is more complex to manage but gives you the best of both worlds.
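A minimal sketch of that switch. The model tags and task tiers are illustrative; adjust them to whatever your OpenClaw configuration actually names:

```python
# Hypothetical two-tier table: same model, two quantization levels
MODEL_BY_TIER = {
    "routine": "qwen3.5:9b-q4_K_M",  # smaller, leaves room for long context
    "critical": "qwen3.5:9b-q6_K",   # higher precision for important work
}

def model_for(task_tier: str) -> str:
    """Fall back to the routine model for unknown task tiers."""
    return MODEL_BY_TIER.get(task_tier, MODEL_BY_TIER["routine"])

print(model_for("critical"))  # qwen3.5:9b-q6_K
print(model_for("anything"))  # qwen3.5:9b-q4_K_M
```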
Q4_K_M is the best default quantization for most OpenClaw operators. It reduces VRAM usage by roughly 72% compared to full precision while keeping quality degradation minimal for agent tasks like tool calling, code generation, and instruction following. Only move to Q5 or Q8 if you have VRAM headroom after fitting your target context window.
GGUF is the format used by llama.cpp and Ollama. It supports CPU and GPU inference, partial GPU offloading, and a wide range of quantization levels. GPTQ is an older GPU-only format used primarily with vLLM and text-generation-inference. For Ollama-based OpenClaw setups, GGUF is the only relevant format. GPTQ matters only if you are running raw inference servers.
A 70B model at Q4 quantization requires approximately 40GB of VRAM for model weights alone, before adding context window overhead. A single 24GB GPU cannot fit it entirely. You can use partial CPU offloading, but inference will be significantly slower. For 24GB GPUs, stick to models in the 7-30B range for usable OpenClaw performance.
At Q4 and above, tool calling accuracy is largely unaffected. The structured output that OpenClaw needs for tool calls — JSON formatting, function names, parameter extraction — is robust to moderate quantization. Below Q4, you may see increased formatting errors and missed tool calls, which directly impacts agent reliability.