
Remote OpenClaw Blog

Quantization Strategies: Running Large Models on Budget Hardware


Quantization is the single most impactful technique for running capable language models on hardware that should not be able to handle them. If you have a $300 GPU and want to run a 30B parameter model for OpenClaw, quantization is what makes that possible — not by magic, but by trading precision you do not need for VRAM you desperately do.

This guide explains quantization in practical terms for OpenClaw operators, not researchers. You will learn which quantization levels work, which ones break agent reliability, and how to pick the right level for your specific hardware. For model selection, start with the best Ollama models for OpenClaw guide.


What Is Quantization and Why Does It Matter?

A language model is essentially billions of numbers (weights) that together encode everything the model knows. In their original form, these weights are stored as 16-bit floating-point numbers (FP16). Each weight takes 2 bytes of memory.

A 30B parameter model at FP16 needs roughly 60GB of memory just for the weights — before adding context window overhead, operating system memory, or anything else. That exceeds even high-end consumer GPUs.

Quantization reduces the precision of each weight. Instead of 16 bits per weight, you store them in 8 bits, 4 bits, or even 2 bits. This directly reduces memory requirements: at 8 bits, a 30B model's weights shrink to roughly 30GB; at 4 bits, to roughly 15GB; at 2 bits, to under 8GB.

The reason this works is that most model weights are clustered around similar values. Reducing precision loses some information in the tails of the distribution, but the core patterns that drive model behavior are preserved.
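That arithmetic is simple enough to sketch in a few lines (a rough estimate only: real GGUF files add metadata, and the K-quant levels discussed below use fractional effective bit widths):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, ignoring context/KV cache."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 30B parameters at FP16 (16 bits) vs Q4 (~4.5 effective bits for GGUF K-quants)
print(round(weight_memory_gb(30, 16), 1))   # 60.0
print(round(weight_memory_gb(30, 4.5), 1))  # 16.9
```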


Understanding the GGUF Format

If you use Ollama for OpenClaw — which most operators do — you are working with GGUF files. GGUF (GPT-Generated Unified Format) is the quantization format used by llama.cpp, the inference engine that powers Ollama.

GGUF matters because it determines what quantization levels are available and how they behave. Key things to know:

  - A GGUF file is self-contained: weights, tokenizer, and metadata ship in a single file.
  - The quantization level is encoded in the model tag (e.g., q4_K_M), so you choose it when you pull the model.
  - GGUF supports partial GPU offloading: layers that do not fit in VRAM run on the CPU, slower but functional.

Other quantization formats exist — GPTQ, AWQ, AQLM — but they are primarily for vLLM and other GPU-only inference servers. If you are running OpenClaw through Ollama, GGUF is the only format you need to care about. For raw hosting options, check the self-hosted LLM guide.


Quantization Levels Compared: Q2 Through Q8

GGUF offers a range of quantization levels. Here is what each level means in practice for a 30B model:

| Level | Approx. size (30B) | VRAM at 64K ctx | Quality vs FP16 | OpenClaw viability |
|---|---|---|---|---|
| Q8_0 | ~32GB | ~42GB | ~99% | Excellent — if you have the VRAM |
| Q6_K | ~25GB | ~35GB | ~98% | Excellent |
| Q5_K_M | ~22GB | ~32GB | ~96% | Very good |
| Q4_K_M | ~18GB | ~28GB | ~93% | Good — best default for most operators |
| Q4_K_S | ~17GB | ~27GB | ~91% | Good — slightly less reliable on edge cases |
| Q3_K_M | ~15GB | ~25GB | ~85% | Marginal — noticeable degradation on multi-step tasks |
| Q3_K_S | ~13GB | ~23GB | ~80% | Poor — tool calling starts failing |
| Q2_K | ~11GB | ~21GB | ~65% | Not recommended for OpenClaw |
The quality percentages are approximate and vary by model architecture. The important pattern: quality degrades slowly from Q8 to Q4, then drops off sharply below Q4. For OpenClaw, Q4_K_M is the inflection point where you get the best VRAM savings without meaningful quality loss for agent tasks.

How Quantization Affects OpenClaw Specifically

OpenClaw tasks have a specific quality profile that interacts with quantization differently than general chatbot usage:

Tool calling and structured output

OpenClaw relies on the model producing correctly formatted tool calls — JSON objects with specific keys and values. This is essentially structured output generation. At Q4 and above, modern models handle this reliably. Below Q4, you start seeing malformed JSON, missing parameters, and phantom tool calls that reference tools that do not exist. This is the single biggest quality concern for quantized OpenClaw models.

Code generation

Code generation is moderately sensitive to quantization because syntax errors are binary failures — the code works or it does not. At Q4_K_M, most models maintain good code quality for common languages and patterns. You may see slightly more syntax errors in less common languages or complex algorithmic code, but for typical OpenClaw code tasks (file manipulation, API calls, scripting), Q4 is reliable.

Instruction following

OpenClaw feeds the model system instructions, persona definitions, skill descriptions, and task context. The model needs to follow these instructions accurately across long sessions. Quantization has minimal impact here at Q4 and above. The model's ability to maintain instruction adherence is more dependent on context length than weight precision.

Multi-step reasoning

Complex multi-step tasks — where the agent needs to plan, execute, evaluate, and adjust — are the most sensitive to quantization. At Q4, you may see occasional planning errors that a Q8 or FP16 model would avoid. For most operational tasks, this is acceptable. For complex reasoning-heavy workflows, consider a higher-precision quantization level (Q5 or above) or routing those specific tasks to a cloud model through OpenRouter.


Practical Setups for Budget Hardware

8GB VRAM (RTX 3050, RTX 4060, etc.)

At 8GB, your options are limited to small models at aggressive quantization. The practical choice is qwen3.5:3b at Q4_K_M, which leaves room for a 16-32K context window. This is enough for basic OpenClaw tasks but will struggle with complex multi-step workflows. Best paired with an OpenRouter fallback for heavier work. For the full cost breakdown, see the cheapest way to run OpenClaw.

12GB VRAM (RTX 3060, RTX 4070)

12GB opens up 7-9B models at Q4_K_M with room for 32-48K context. qwen3.5:9b at Q4 is the best fit here. You can push to 64K context if you accept some partial CPU offloading (slower but functional). This tier delivers genuinely useful OpenClaw performance for routine tasks.

16GB VRAM (RTX 4070 Ti, RTX 5060 Ti)

16GB is the entry point for comfortable OpenClaw usage. You can run qwen3.5:9b at Q4 with a full 64K context window, or stretch to 14B models at Q4 with moderate context. This tier handles most daily OpenClaw workflows without compromises.

24GB VRAM (RTX 3090, RTX 4090)

24GB is the sweet spot. You can run glm-4.7-flash or qwen3-coder:30b at Q4_K_M with 32-64K context. This is the hardware tier where the recommended OpenClaw models run comfortably without sacrifices.
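A rough fit check covering any of these tiers might look like this (the per-8K KV-cache constant and headroom are assumed averages; real KV-cache size depends on the model architecture and attention implementation):

```python
def fits(vram_gb: float, model_gb: float, ctx_tokens: int,
         kv_gb_per_8k: float = 1.0, headroom_gb: float = 1.5) -> bool:
    """Rough fit check: quantized weights + KV cache + headroom vs available VRAM."""
    kv = ctx_tokens / 8192 * kv_gb_per_8k
    return model_gb + kv + headroom_gb <= vram_gb

print(fits(12, 5.5, 32_768))  # 9B at Q4 (~5.5GB) with 32K context on a 12GB card
print(fits(12, 18, 65_536))   # 30B at Q4 with 64K context does not fit 12GB
```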


How to Choose the Right Quantization Level

The decision process is straightforward:

  1. Start with the model you want. Pick the best model for your OpenClaw use case from the model guide.
  2. Calculate VRAM at Q4_K_M. Check the model's Q4 size and add context window overhead for 64K tokens.
  3. Compare to your GPU. If it fits with 1-2GB of headroom, use Q4_K_M. If it does not fit, either drop to a smaller model or try Q3_K_M (but test agent reliability carefully).
  4. If you have extra VRAM, upgrade quantization. Move from Q4 to Q5 or Q6 rather than to a larger model: taking a 9B model from Q4 to Q6 usually gains more quality than swapping it for a 14B model at Q3, because quality falls off sharply below Q4.
# Check available quantization levels for a model
ollama show qwen3.5:9b --modelfile

# Pull a specific quantization
ollama pull qwen3.5:9b-q4_K_M
ollama pull qwen3.5:9b-q5_K_M

# Compare sizes
ollama list

Advanced Quantization Techniques

Importance matrix quantization

Some GGUF quantizations use an "importance matrix" (imatrix) to determine which weights are most critical for quality. imatrix-quantized models preserve precision on the weights that matter most, producing better quality at the same bit level. When choosing between two Q4 variants, prefer the imatrix version if available.

Mixed precision across layers

The "K" in Q4_K_M denotes the k-quant scheme, which mixes precision levels across the model rather than quantizing every weight uniformly, and the "_M" suffix indicates a medium mix: the most quality-sensitive tensors keep higher precision (Q6 or Q8), while the bulk of the weights use the target level (Q4). This is why Q4_K_M outperforms Q4_0 despite having a similar average bit width.
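To see why the mix barely moves the average bit width, here is an illustrative calculation (the layer counts and bit widths below are made up for the example, not taken from a real GGUF file):

```python
# Hypothetical mix for a Q4-class model: a few sensitive tensors at higher
# precision, the bulk of the blocks at the target ~4.5 effective bits.
layers = [
    ("input_embedding", 1, 6.5),   # (name, count, effective bits)
    ("middle_blocks", 46, 4.5),
    ("output_head", 1, 6.5),
]

total = sum(count for _, count, _ in layers)
avg_bits = sum(count * bits for _, count, bits in layers) / total
print(round(avg_bits, 2))  # 4.58 — barely above the nominal 4.5
```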

Dynamic quantization for different tasks

Some operators maintain two copies of the same model at different quantization levels — a Q4 version for routine tasks and a Q6 or Q8 version for important work. You can switch between them in your OpenClaw configuration based on the task type. This is more complex to manage but gives you the best of both worlds.
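A minimal sketch of such routing (the task-type keys and the q6_K tag are assumptions for illustration; substitute whatever tags you actually pulled):

```python
# Hypothetical routing table: a Q4 copy for routine work, Q6 for high-stakes tasks.
MODELS = {
    "routine": "qwen3.5:9b-q4_K_M",
    "important": "qwen3.5:9b-q6_K",
}

def pick_model(task_type: str) -> str:
    """Unknown task types fall back to the cheaper Q4 copy."""
    return MODELS.get(task_type, MODELS["routine"])

print(pick_model("important"))  # qwen3.5:9b-q6_K
print(pick_model("cleanup"))    # qwen3.5:9b-q4_K_M
```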


Frequently Asked Questions

What is the best quantization level for OpenClaw?

Q4_K_M is the best default quantization for most OpenClaw operators. It reduces VRAM usage by roughly 72% compared to full precision while keeping quality degradation minimal for agent tasks like tool calling, code generation, and instruction following. Only move to Q5 or Q8 if you have VRAM headroom after fitting your target context window.

What is the difference between GGUF and GPTQ quantization?

GGUF is the format used by llama.cpp and Ollama. It supports CPU and GPU inference, partial GPU offloading, and a wide range of quantization levels. GPTQ is an older GPU-only format used primarily with vLLM and text-generation-inference. For Ollama-based OpenClaw setups, GGUF is the only relevant format. GPTQ matters only if you are running raw inference servers.

Can I run a 70B model on a 24GB GPU with quantization?

A 70B model at Q4 quantization requires approximately 40GB of VRAM for model weights alone, before adding context window overhead. A single 24GB GPU cannot fit it entirely. You can use partial CPU offloading, but inference will be significantly slower. For 24GB GPUs, stick to models in the 7-30B range for usable OpenClaw performance.
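The claim follows from the same bytes-per-weight arithmetic used earlier (treating Q4 K-quants as roughly 4.5 effective bits per weight, an approximation):

```python
# 70B parameters at ~4.5 effective bits, weights only (no KV cache or overhead).
weights_gb = 70e9 * 4.5 / 8 / 1e9
print(round(weights_gb, 1))  # 39.4 — already well past a 24GB card
```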

Does quantization affect OpenClaw tool calling accuracy?

At Q4 and above, tool calling accuracy is largely unaffected. The structured output that OpenClaw needs for tool calls — JSON formatting, function names, parameter extraction — is robust to moderate quantization. Below Q4, you may see increased formatting errors and missed tool calls, which directly impacts agent reliability.