Remote OpenClaw Blog
How to Run Gemma 4 with OpenClaw on Ollama (Free Local AI)
7 min read
Gemma 4 is Google DeepMind's latest open-source language model family, released on April 2, 2026. It represents a significant step forward from Gemma 3, with improved instruction following, better multilingual support, and a new Mixture of Experts (MoE) architecture in the 26B variant that delivers large-model quality at small-model speed.
According to Google's official Gemma documentation, Gemma 4 models are released under a permissive license that allows commercial use, modification, and redistribution — making them ideal for self-hosted OpenClaw deployments where you want to eliminate API costs entirely.
For OpenClaw operators, Gemma 4 running on Ollama provides a completely free local AI agent with no recurring API charges, no rate limits, and no data leaving your machine.
Google released four Gemma 4 variants, each targeting different hardware and use-case profiles.
| Variant | Parameters | Architecture | Min RAM | Best For |
|---|---|---|---|---|
| gemma4:2b | 2 billion | Dense | 4GB | Edge devices, Raspberry Pi, mobile |
| gemma4:4b | 4 billion | Dense | 8GB | Lightweight desktops, basic tasks |
| gemma4:26b | 26 billion (MoE) | Mixture of Experts | 16GB | Best quality/speed ratio for OpenClaw |
| gemma4:31b | 31 billion | Dense | 32GB | Maximum quality, requires strong hardware |
The Mixture of Experts architecture in the 26B variant activates only a subset of its parameters for each token generated. This means it processes at approximately the same speed as the 4B model while having access to 6.5x more parameters for quality. It is the clear winner for most OpenClaw deployments.
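The routing idea can be sketched as a toy top-k gate. The numbers below (8 experts of 3B parameters each plus 2B of shared layers) are illustrative assumptions, not Gemma 4's actual architecture, but they show why only a fraction of the total parameters run per token:

```python
import random

def topk_gate(scores, k=2):
    """Pick the k highest-scoring experts; only those run for this token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy model: 8 experts of ~3B parameters each plus ~2B of shared layers.
# Per token, only k=2 experts activate, so the active parameter count
# is a small fraction of the total (illustrative numbers only).
n_experts, params_per_expert, shared = 8, 3.0, 2.0  # billions
scores = [random.random() for _ in range(n_experts)]
active = topk_gate(scores, k=2)
active_params = shared + len(active) * params_per_expert
total_params = shared + n_experts * params_per_expert
print(f"active {active_params:.0f}B of {total_params:.0f}B total")
```

With these toy numbers, each token touches about 8B of 26B parameters, which is why throughput lands near the dense 4B model.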
The gemma4:26b model hits the optimal balance for OpenClaw agents. Here is why it outperforms the other variants for this specific use case.
MoE architecture means only a fraction of the 26B parameters activate per inference step. In practice, throughput is comparable to the 4B dense model — approximately 30-50 tokens per second on a modern GPU, or 10-20 tokens per second on CPU only. This is fast enough for real-time messaging responses via Telegram, WhatsApp, or Slack.
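As a back-of-envelope check on what those throughput figures mean for chat latency, here is the time to stream a typical ~300-token reply at each rate (the reply length is an assumption for illustration):

```python
def response_seconds(reply_tokens, tokens_per_sec):
    """Rough time to stream a full reply at a given generation rate."""
    return reply_tokens / tokens_per_sec

# A ~300-token chat reply at the throughput figures above:
for rate in (50, 30, 20, 10):  # tokens/sec: GPU high/low, CPU high/low
    print(f"{rate:>2} tok/s -> {response_seconds(300, rate):.0f} s")
```

Even the CPU-only worst case stays under a minute per reply, which is workable for asynchronous messaging channels.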
The 26B MoE variant scores significantly higher than the 4B dense model on instruction-following benchmarks. For OpenClaw tasks like email drafting, CRM data extraction, and workflow decision-making, the quality difference is immediately noticeable — fewer hallucinations, better formatting, and more accurate tool use.
All Gemma 4 variants support a 128K token context window (131,072 tokens), which is sufficient for most OpenClaw conversations. While this is smaller than Claude's 200K or 1M token windows, it covers the vast majority of agent interactions including multi-turn conversations with full memory context.
Before starting the setup, verify your hardware meets the requirements for the recommended gemma4:26b variant: 16GB of RAM minimum (24GB recommended), and ideally a GPU with 16GB+ of VRAM.
Apple Silicon note: M1 Pro, M2, M3, and M4 chips with 16GB+ unified memory run the 26B MoE variant effectively. Apple's unified memory architecture means the GPU and CPU share the same memory pool, so a 16GB M2 MacBook Pro can run this model without a discrete GPU.
Ollama is a free, open-source tool that runs language models locally with a simple API that OpenClaw can connect to directly.
On macOS, download and install from ollama.com/download, or use Homebrew:
brew install ollama
On Linux, run the official install script:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.
After installation, verify Ollama is running:
ollama --version
You should see version 0.6.x or later. If Ollama is not running as a service, start it with:
ollama serve
Download the recommended 26B MoE variant. This is a one-time download of approximately 15GB.
ollama pull gemma4:26b
The download will take 5-30 minutes depending on your internet connection. Once complete, verify the model is available:
ollama list
You should see gemma4:26b in the list. Test it with a quick prompt to confirm it works:
ollama run gemma4:26b "Hello, what model are you?"
If you see a coherent response, the model is ready for OpenClaw integration.
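If you want to script this verification step, parsing the `ollama list` output is straightforward. The column layout assumed below (a header row, then the model name in the first column) matches current Ollama releases but may change:

```python
def has_model(ollama_list_output: str, name: str) -> bool:
    """Check whether a model tag appears in `ollama list` output.

    Assumes the first whitespace-separated column of each row is the
    model name, with a single header row to skip.
    """
    for line in ollama_list_output.splitlines()[1:]:  # skip header row
        cols = line.split()
        if cols and cols[0] == name:
            return True
    return False

sample = """NAME          ID            SIZE    MODIFIED
gemma4:26b    abc123def456  15 GB   2 hours ago"""
print(has_model(sample, "gemma4:26b"))  # True
```

In a deploy script you would feed it the captured output of `subprocess.run(["ollama", "list"], ...)` and abort if the model is missing.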
OpenClaw needs to know where to find the Ollama API and which model to use. Add the following to your OpenClaw provider configuration.
{
"provider": "ollama",
"baseUrl": "http://localhost:11434",
"model": "gemma4:26b",
"contextWindow": 131072,
"reasoning": false,
"temperature": 0.7,
"maxTokens": 4096
}
- contextWindow: 131072 — matches Gemma 4's native 128K token context. Setting this correctly prevents OpenClaw from truncating context prematurely or sending oversized requests.
- reasoning: false — Gemma 4 does not have a native chain-of-thought reasoning mode like Claude or o1. Setting this to false prevents OpenClaw from sending reasoning-specific prompts that Gemma 4 would not handle correctly.
- temperature: 0.7 — a good default for agent tasks. Lower (0.3-0.5) for factual lookups, higher (0.8-1.0) for creative content.
- maxTokens: 4096 — maximum output length per response. Increase to 8192 for long-form content generation.

After updating the provider configuration, restart OpenClaw and send a test message through your preferred channel (Telegram, WhatsApp, or Slack).
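To see how these settings map onto an actual Ollama request, here is a sketch that builds a payload for Ollama's /api/chat endpoint. The option names (temperature, num_ctx, num_predict) are Ollama's; how OpenClaw performs this mapping internally is an assumption here:

```python
import json

# The OpenClaw provider config from above, as a plain dict.
config = {
    "baseUrl": "http://localhost:11434",
    "model": "gemma4:26b",
    "contextWindow": 131072,
    "temperature": 0.7,
    "maxTokens": 4096,
}

def chat_payload(config, messages):
    """Translate the provider config into an Ollama /api/chat request body."""
    return {
        "model": config["model"],
        "messages": messages,
        "stream": False,
        "options": {
            "temperature": config["temperature"],
            "num_ctx": config["contextWindow"],    # context window size
            "num_predict": config["maxTokens"],    # max output tokens
        },
    }

payload = chat_payload(config, [{"role": "user", "content": "Hello"}])
print(json.dumps(payload, indent=2))
# POST this body to f"{config['baseUrl']}/api/chat"
```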
If the test message fails or responses look wrong, consult the Best Ollama Models for OpenClaw guide for troubleshooting steps specific to local model configurations.
Running a 26B-parameter model locally leaves far less performance headroom than a cloud API, so configuration matters. These optimizations can significantly improve response speed and stability.
If you have a compatible GPU, ensure Ollama is using it. Check with:
ollama ps
The output should show your model loaded with GPU layers. If it shows CPU only, verify your GPU drivers are installed and Ollama can detect the GPU.
While Gemma 4 supports 128K tokens, using the full context window consumes significant memory. If you are running low on RAM, reduce contextWindow to 32768 or 65536 in your OpenClaw config. Most agent conversations fit well within 32K tokens.
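The memory cost scales roughly linearly with context length because of the KV cache. The per-token figure below (~160 KiB) is a made-up illustration, not a measured value for Gemma 4; the real cost depends on layer count, head dimensions, and quantization:

```python
def kv_cache_gb(tokens, kib_per_token=160):
    """Rough KV-cache footprint at a given context size.

    kib_per_token is an assumed illustrative constant, not a
    measured figure for Gemma 4.
    """
    return tokens * kib_per_token / (1024 ** 2)

for ctx in (32768, 65536, 131072):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

Whatever the true constant, halving contextWindow roughly halves this overhead, which is why dropping from 128K to 32K frees so much RAM.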
Unlike cloud APIs, local models process one request at a time by default. If your OpenClaw agent handles multiple channels simultaneously, configure Ollama's OLLAMA_NUM_PARALLEL environment variable:
export OLLAMA_NUM_PARALLEL=2
Setting this to 2-3 allows Ollama to process multiple requests in parallel, though each additional parallel request increases memory usage.
The financial case for running Gemma 4 locally is straightforward.
| Cost Category | Local (Gemma 4 + Ollama) | Cloud API (Claude Sonnet) |
|---|---|---|
| Software | $0 | $0 |
| Model access | $0 (open source) | $3 input / $15 output per million tokens |
| Monthly API cost | $0 | $15-40 typical |
| Hosting | $0 (your hardware) | $5-10/mo VPS |
| Annual total | $0 | $240-600 |
Over 12 months, a local Gemma 4 deployment saves $240-600 compared to cloud APIs. The trade-off is lower quality on complex reasoning tasks and the upfront hardware requirement. For operators who already have a capable machine, the savings are immediate.
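The table's annual totals are simple arithmetic: twelve months of API spend plus a small VPS, versus $0 in software terms for the local setup:

```python
def annual_cloud_cost(monthly_api, monthly_vps):
    """Yearly cloud total: API spend plus VPS hosting, per the table above."""
    return 12 * (monthly_api + monthly_vps)

low = annual_cloud_cost(monthly_api=15, monthly_vps=5)
high = annual_cloud_cost(monthly_api=40, monthly_vps=10)
print(f"annual cloud cost: ${low}-${high}, local: $0")
```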
For a deeper comparison of all local models compatible with OpenClaw, see Best Ollama Models for OpenClaw. For more on Gemma 4 capabilities beyond OpenClaw, see Gemma 4: Google's Open Model.
The recommended variant is gemma4:26b, the 26B Mixture of Experts model. It runs at approximately the same speed as the 4B parameter model because only a subset of experts activate per token, but delivers substantially higher quality output. It requires 16GB of RAM minimum (24GB recommended) and fits comfortably on a modern GPU with 16GB+ VRAM.
The total cost is $0 for software. Gemma 4 is open source under Google's permissive license, Ollama is free, and OpenClaw is free. The only cost is your hardware and electricity. Compared to cloud API costs of $15-40 per month for a typical OpenClaw deployment, running locally eliminates the largest ongoing expense entirely.
Not for complex reasoning or long-form generation. The gemma4:26b variant performs well for structured tasks like email triage, CRM updates, scheduling, and template-based responses. For tasks requiring deep reasoning, nuanced writing, or large context windows beyond 128K tokens, cloud models like Claude Sonnet or GPT-4o still produce better results. Many operators use Gemma 4 locally for routine tasks and fall back to cloud APIs for complex work.