Remote OpenClaw Blog

How to Run Gemma 4 with OpenClaw on Ollama (Free Local AI)

7 min read

What Is Gemma 4?

Gemma 4 is Google DeepMind's latest open-source language model family, released on April 2, 2026. It represents a significant step forward from Gemma 3, with improved instruction following, better multilingual support, and a new Mixture of Experts (MoE) architecture in the 26B variant that delivers large-model quality at small-model speed.

According to Google's official Gemma documentation, Gemma 4 models are released under a permissive license that allows commercial use, modification, and redistribution — making them ideal for self-hosted OpenClaw deployments where you want to eliminate API costs entirely.

For OpenClaw operators, Gemma 4 running on Ollama provides a completely free local AI agent with no recurring API charges, no rate limits, and no data leaving your machine.


Gemma 4 Variants Explained

Google released four Gemma 4 variants, each targeting different hardware and use-case profiles.

| Variant | Parameters | Architecture | Min RAM | Best For |
|---|---|---|---|---|
| gemma4:2b | 2 billion | Dense | 4GB | Edge devices, Raspberry Pi, mobile |
| gemma4:4b | 4 billion | Dense | 8GB | Lightweight desktops, basic tasks |
| gemma4:26b | 26 billion | Mixture of Experts | 16GB | Best quality/speed ratio for OpenClaw |
| gemma4:31b | 31 billion | Dense | 32GB | Maximum quality, requires strong hardware |

The Mixture of Experts architecture in the 26B variant activates only a subset of its parameters for each generated token. As a result it runs at approximately the same speed as the 4B dense model while drawing on 6.5x more total parameters. It is the clear winner for most OpenClaw deployments.


Why the 26B MoE Variant

The gemma4:26b model hits the optimal balance for OpenClaw agents. Here is why it outperforms the other variants for this specific use case.

Speed

MoE architecture means only a fraction of the 26B parameters activate per inference step. In practice, throughput is comparable to the 4B dense model — approximately 30-50 tokens per second on a modern GPU, or 10-20 tokens per second on CPU only. This is fast enough for real-time messaging responses via Telegram, WhatsApp, or Slack.

Quality

The 26B MoE variant scores significantly higher than the 4B dense model on instruction-following benchmarks. For OpenClaw tasks like email drafting, CRM data extraction, and workflow decision-making, the quality difference is immediately noticeable — fewer hallucinations, better formatting, and more accurate tool use.

Context Window

All Gemma 4 variants support a 128K token context window (131,072 tokens), which is sufficient for most OpenClaw conversations. While this is smaller than Claude's 200K or 1M token windows, it covers the vast majority of agent interactions including multi-turn conversations with full memory context.


Hardware Requirements

Before starting the setup, verify your hardware meets these requirements for the recommended gemma4:26b variant.

Minimum Specifications

  • 16GB RAM (system memory, or unified memory on Apple Silicon)
  • Roughly 20GB of free disk space for the ~15GB model download
  • Modern multi-core CPU (expect 10-20 tokens per second without a GPU)

Recommended Specifications

  • 24GB RAM
  • Dedicated GPU with 16GB+ VRAM, or an Apple Silicon Mac with 16GB+ unified memory

Apple Silicon note: M1 Pro, M2, M3, and M4 chips with 16GB+ unified memory run the 26B MoE variant effectively. Apple's unified memory architecture means the GPU and CPU share the same memory pool, so a 16GB M2 MacBook Pro can run this model without a discrete GPU.


Step 1: Install Ollama

Ollama is a free, open-source tool that runs language models locally with a simple API that OpenClaw can connect to directly.

macOS

Download and install from ollama.com/download, or use Homebrew:

brew install ollama

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.

After installation, verify Ollama is running:

ollama --version

You should see version 0.6.x or later. If Ollama is not running as a service, start it with:

ollama serve
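If you prefer to verify the server from a script rather than the CLI, Ollama also exposes a small HTTP API on port 11434. The following sketch uses only the Python standard library and Ollama's /api/version endpoint; it returns the version string, or None if the server is not reachable:

```python
import json
import urllib.request
from urllib.error import URLError

def get_ollama_version(base_url="http://localhost:11434", timeout=2):
    """Return the Ollama server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (URLError, OSError, ValueError):
        return None

if __name__ == "__main__":
    version = get_ollama_version()
    print(f"Ollama {version}" if version else "Ollama is not reachable on port 11434")
```

This is handy in health-check scripts, since a None result distinguishes "server down" from "model missing".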

Step 2: Pull the Gemma 4 Model

Download the recommended 26B MoE variant. This is a one-time download of approximately 15GB.

ollama pull gemma4:26b

The download will take 5-30 minutes depending on your internet connection. Once complete, verify the model is available:

ollama list

You should see gemma4:26b in the list. Test it with a quick prompt to confirm it works:

ollama run gemma4:26b "Hello, what model are you?"

If you see a coherent response, the model is ready for OpenClaw integration.
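The same smoke test can be run over HTTP, which is closer to how OpenClaw talks to the model. This Python sketch posts a single non-streaming request to Ollama's /api/generate endpoint; treat it as a minimal check, not production client code:

```python
import json
import urllib.request
from urllib.error import URLError

def generate(prompt, model="gemma4:26b", base_url="http://localhost:11434", timeout=120):
    """Send one non-streaming generate request; return the response text or None."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp).get("response")
    except (URLError, OSError, ValueError):
        return None

if __name__ == "__main__":
    print(generate("Hello, what model are you?")
          or "No response -- is the model pulled and Ollama running?")
```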


Step 3: Configure OpenClaw Provider

OpenClaw needs to know where to find the Ollama API and which model to use. Add the following to your OpenClaw provider configuration.


Provider Configuration

{
  "provider": "ollama",
  "baseUrl": "http://localhost:11434",
  "model": "gemma4:26b",
  "contextWindow": 131072,
  "reasoning": false,
  "temperature": 0.7,
  "maxTokens": 4096
}

Configuration Notes

  • contextWindow: 131072 — matches Gemma 4's native 128K token context. Setting this correctly prevents OpenClaw from truncating context prematurely or sending oversized requests.
  • reasoning: false — Gemma 4 does not have a native chain-of-thought reasoning mode like Claude or o1. Setting this to false prevents OpenClaw from sending reasoning-specific prompts that Gemma 4 would not handle correctly.
  • temperature: 0.7 — a good default for agent tasks. Lower (0.3-0.5) for factual lookups, higher (0.8-1.0) for creative content.
  • maxTokens: 4096 — maximum output length per response. Increase to 8192 for long-form content generation.
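To see how these settings reach the model, here is a sketch of the translation from the provider config above into an Ollama /api/chat request. The OpenClaw field names come from the config block; num_ctx and num_predict are Ollama's option names for the context window and the output cap, while baseUrl and reasoning are consumed by OpenClaw itself rather than forwarded:

```python
def to_ollama_chat_payload(provider_cfg, messages):
    """Translate an OpenClaw provider config into an Ollama /api/chat payload."""
    return {
        "model": provider_cfg["model"],
        "messages": messages,
        "stream": False,
        "options": {
            "num_ctx": provider_cfg["contextWindow"],   # Ollama's context-window option
            "temperature": provider_cfg["temperature"],
            "num_predict": provider_cfg["maxTokens"],   # Ollama's max-output option
        },
    }

cfg = {
    "provider": "ollama",
    "baseUrl": "http://localhost:11434",
    "model": "gemma4:26b",
    "contextWindow": 131072,
    "reasoning": False,
    "temperature": 0.7,
    "maxTokens": 4096,
}
payload = to_ollama_chat_payload(cfg, [{"role": "user", "content": "Hello"}])
# POST this as JSON to {baseUrl}/api/chat
```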

Step 4: Test the Connection

After updating the provider configuration, restart OpenClaw and send a test message through your preferred channel (Telegram, WhatsApp, or Slack).

What to Check

  1. Response time: First response should arrive within 5-15 seconds on recommended hardware. If it takes longer than 30 seconds, check GPU offloading (see optimization section below).
  2. Response quality: Ask the agent to summarize a recent email or draft a short message. The output should be coherent and well-formatted.
  3. Memory persistence: Tell the agent a specific fact, then in a new message, ask it to recall that fact. Confirm OpenClaw's memory system is working with the local model.
  4. Tool use: Ask the agent to perform a skill that requires tool calls (calendar check, file read). Verify Gemma 4 generates correct tool-call syntax.

If any of these checks fail, consult the Best Ollama Models for OpenClaw guide for troubleshooting steps specific to local model configurations.


Performance Optimization Tips

Running a 26B parameter model locally leaves less room for error than cloud APIs. These optimizations can significantly improve response speed and stability.

GPU Offloading

If you have a compatible GPU, ensure Ollama is using it. Check with:

ollama ps

The output should show your model loaded with GPU layers. If it shows CPU only, verify your GPU drivers are installed and Ollama can detect the GPU.
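If you want to script this check, you can pull the PROCESSOR column out of the ollama ps output. The column layout below matches recent Ollama versions but may shift between releases, so treat this parser as a sketch:

```python
def gpu_offload_status(ps_output):
    """Map each loaded model to its PROCESSOR field from `ollama ps` output."""
    statuses = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if not parts:
            continue
        # the PROCESSOR field is the percentage token plus the unit after it,
        # e.g. "100% GPU", "100% CPU", or a split like "12%/88% CPU/GPU"
        for i, tok in enumerate(parts):
            if "%" in tok and i + 1 < len(parts):
                statuses[parts[0]] = f"{tok} {parts[i + 1]}"
                break
    return statuses

# sample output (the ID column here is made up for illustration)
sample = """NAME          ID              SIZE     PROCESSOR    UNTIL
gemma4:26b    a1b2c3d4e5f6    17 GB    100% GPU     4 minutes from now"""
print(gpu_offload_status(sample))  # {'gemma4:26b': '100% GPU'}
```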

Context Window Tuning

While Gemma 4 supports 128K tokens, using the full context window consumes significant memory. If you are running low on RAM, reduce contextWindow to 32768 or 65536 in your OpenClaw config. Most agent conversations fit well within 32K tokens.
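Note that contextWindow only caps what OpenClaw sends; to have Ollama itself allocate a smaller context, you can bake num_ctx into a derived model with a Modelfile (the gemma4-32k name here is just an example):

```
# Modelfile
FROM gemma4:26b
PARAMETER num_ctx 32768
```

Build it with ollama create gemma4-32k -f Modelfile, then point the model field in your OpenClaw config at gemma4-32k.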

Concurrent Request Limits

Unlike cloud APIs, local models process one request at a time by default. If your OpenClaw agent handles multiple channels simultaneously, configure Ollama's OLLAMA_NUM_PARALLEL environment variable:

export OLLAMA_NUM_PARALLEL=2

Setting this to 2-3 allows Ollama to process multiple requests in parallel, though each additional parallel request increases memory usage.
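On Linux installs where Ollama runs as a systemd service, an exported variable in your shell will not reach the daemon; set it in a service override instead (standard systemd workflow, shown here as a sketch):

```
sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=2"
sudo systemctl restart ollama
```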


Cost Comparison: Local vs Cloud

The financial case for running Gemma 4 locally is straightforward.

| Cost Category | Local (Gemma 4 + Ollama) | Cloud API (Claude Sonnet) |
|---|---|---|
| Software | $0 | $0 |
| Model access | $0 (open source) | $3/$15 per million tokens |
| Monthly API cost | $0 | $15-40 typical |
| Hosting | $0 (your hardware) | $5-10/mo VPS |
| Annual total | $0 | $240-600 |

Over 12 months, a local Gemma 4 deployment saves $240-600 compared to cloud APIs. The trade-off is lower quality on complex reasoning tasks and the upfront hardware requirement. For operators who already have a capable machine, the savings are immediate.
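As a sanity check on the figures above, you can plug the per-token prices into a quick calculation. The monthly token volumes below are assumptions for illustration, not measurements:

```python
# Illustrative sanity check of the cost table. The monthly token volumes are
# assumptions for a moderately busy single-user agent, not measurements.
INPUT_PRICE_PER_M = 3.00    # USD per million input tokens (Claude Sonnet)
OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens

input_tokens_per_month = 4_000_000   # assumed: prompts, memory context, tool results
output_tokens_per_month = 800_000    # assumed: replies and drafts

monthly_api = (input_tokens_per_month / 1e6) * INPUT_PRICE_PER_M \
            + (output_tokens_per_month / 1e6) * OUTPUT_PRICE_PER_M
monthly_vps = 7.50  # midpoint of the $5-10/mo hosting row

print(f"Cloud: ${monthly_api:.2f}/mo API + ${monthly_vps:.2f}/mo hosting")
print(f"Cloud annual total: ${12 * (monthly_api + monthly_vps):.0f}")
print("Local annual total: $0 in software and API fees")
```

With these assumed volumes the cloud side works out to $24/month in API fees and $378/year all-in, which lands inside the $15-40 monthly and $240-600 annual ranges in the table.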

For a deeper comparison of all local models compatible with OpenClaw, see Best Ollama Models for OpenClaw. For more on Gemma 4 capabilities beyond OpenClaw, see Gemma 4: Google's Open Model.


Frequently Asked Questions

Which Gemma 4 variant should I use with OpenClaw?

The recommended variant is gemma4:26b, the 26B Mixture of Experts model. It runs at approximately the same speed as the 4B parameter model because only a subset of experts activate per token, but delivers substantially higher quality output. It requires 16GB of RAM minimum (24GB recommended) and fits comfortably on a modern GPU with 16GB+ VRAM.

How much does it cost to run Gemma 4 locally with OpenClaw?

The total cost is $0 for software. Gemma 4 is open source under Google's permissive license, Ollama is free, and OpenClaw is free. The only cost is your hardware and electricity. Compared to cloud API costs of $15-40 per month for a typical OpenClaw deployment, running locally eliminates the largest ongoing expense entirely.

Can Gemma 4 on Ollama match Claude or GPT-4 quality for OpenClaw?

Not for complex reasoning or long-form generation. The gemma4:26b variant performs well for structured tasks like email triage, CRM updates, scheduling, and template-based responses. For tasks requiring deep reasoning, nuanced writing, or large context windows beyond 128K tokens, cloud models like Claude Sonnet or GPT-4o still produce better results. Many operators use Gemma 4 locally for routine tasks and fall back to cloud APIs for complex work.