Remote OpenClaw Blog
Self-Hosted LLMs for OpenClaw: Run AI Without Cloud APIs [2026]
Why Self-Host Your LLM?
There are three compelling reasons to run your own AI model instead of using cloud APIs:
1. Privacy. When you use a cloud API, your prompts and data are sent to a third party. With a local model, everything stays on your machine. No data leaves your network. This matters for sensitive business data, client information, personal communications, and any use case with regulatory requirements around data handling.
2. Cost. Cloud API costs scale with usage. A busy agent can easily spend $50-200/month on API calls. A local model has zero per-request cost after the initial hardware investment. If you already have a capable machine, the marginal cost is just electricity.
3. Availability. Cloud APIs can go down. Rate limits can throttle your agent at the worst possible time. A local model is always available, always responsive, and never rate-limited. Your agent works even if your internet goes down (for local-only tasks).
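The cost argument can be made concrete with a quick break-even estimate. The dollar figures below are hypothetical; substitute your own API spend and hardware price:

```shell
# Back-of-envelope break-even: a hypothetical $100/month API spend vs. a
# one-time $1200 hardware upgrade (both figures are illustrative).
api_monthly=100
hardware_cost=1200

# Ceiling division: months until the one-time outlay beats cumulative API fees.
breakeven_months=$(( (hardware_cost + api_monthly - 1) / api_monthly ))
echo "Break-even after ${breakeven_months} months"   # prints "Break-even after 12 months"
```

At a heavier $200/month spend, the same hardware pays for itself in half the time, which is why busy agents are the strongest case for going local.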
The trade-off is quality. As of March 2026, local models are good — genuinely useful for real agent tasks — but they do not match the capabilities of frontier cloud models like Claude Sonnet 4, GPT-5.4, or Gemini 2.5 Pro. For most agent tasks (scheduling, email drafts, data processing, simple reasoning), the quality gap is acceptable. For complex reasoning, nuanced writing, or advanced code generation, cloud models are still noticeably better.
Ollama Setup Guide
Ollama is the standard way to run local models with OpenClaw. It handles model downloading, quantization, and memory management, and serves an API compatible with OpenClaw's model provider interface.
Install Ollama:
```shell
# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Mac (alternative via Homebrew)
brew install ollama
```
For Windows, download the installer from ollama.com.
Pull a model:
```shell
# Recommended starting model
ollama pull qwen3.5:14b

# Lighter alternative for limited hardware
ollama pull qwen3.5:7b

# Maximum quality if you have 64GB+ RAM
ollama pull llama3.3:70b
```
Verify it works:
```shell
ollama run qwen3.5:14b "Hello, what can you help me with?"
```
If you get a response, Ollama is working. It runs a server on port 11434 by default.
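Beyond the CLI check, you can exercise the HTTP API directly, which is the same interface OpenClaw will call. Both endpoints below are standard Ollama API routes:

```shell
# List installed models; the model you pulled should appear in the output.
curl -s http://localhost:11434/api/tags

# One-off, non-streaming generation through the HTTP API.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3.5:14b", "prompt": "Say hello", "stream": false}'
```

If the first command fails to connect, the server is not running; if the second returns a model-not-found error, re-check the model name against the tags output.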
Configure OpenClaw:
If OpenClaw runs directly on the same machine as Ollama:
```ini
# .env
MODEL_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen3.5:14b
```
If OpenClaw runs in Docker:
```ini
# .env
MODEL_PROVIDER=ollama
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=qwen3.5:14b
```
The host.docker.internal hostname lets the Docker container reach services running on the host machine. This is the most common setup issue — if you see connection errors, check that you are using the correct URL for your Docker setup.
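One caveat: Docker Desktop (Mac/Windows) defines host.docker.internal automatically, but on Linux it is not resolvable by default. A common fix on Linux (Docker 20.10+) is to map it to the host gateway at container start; the image name below is a placeholder:

```shell
# Map host.docker.internal to the host's gateway IP (Linux, Docker 20.10+).
# "openclaw-image" is a placeholder; use your actual image name.
docker run --add-host=host.docker.internal:host-gateway \
  --env-file .env openclaw-image

# docker-compose equivalent: add to the service definition:
#   extra_hosts:
#     - "host.docker.internal:host-gateway"
```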
Best Local Models for OpenClaw
Not all local models are equally suited for agent tasks. Agent work requires strong instruction following, tool use comprehension, structured output generation, and multi-turn conversation ability. Here are the best options as of March 2026:
Qwen3.5 14B — Best All-Around
Qwen3.5 14B from Alibaba is the current sweet spot for local agent use. It has excellent instruction following, strong reasoning, good code generation, and fits comfortably in 16GB RAM with quantization. It handles OpenClaw's scheduling, email drafting, data processing, and conversational tasks with consistently good quality.
Use when: You want one model that handles everything well without requiring premium hardware.
Llama 3.3 70B — Best Quality
Meta's Llama 3.3 70B is the most capable open model available. It approaches cloud model quality on many tasks. The catch is hardware — you need 64GB+ RAM or a powerful GPU to run it at acceptable speeds.
Use when: You have the hardware and want the best possible local model quality.
GLM-4 9B — Best Multilingual
GLM-4 from Zhipu AI excels at multilingual tasks. If your agent needs to handle Chinese, Japanese, Korean, or European languages alongside English, GLM-4 provides the best multilingual performance in its size class.
Use when: Your agent handles non-English languages frequently.
Llama 3.3 8B — Best for Limited Hardware
If you have only 8GB RAM, Llama 3.3 8B is your best option. It handles basic agent tasks — simple conversation, scheduling, email drafts — at acceptable quality. It will struggle with complex reasoning and long context windows.
Use when: You have limited hardware and need the best quality possible under those constraints.
Mistral 7B — Fastest 7B Option
Mistral 7B generates tokens faster than competing 7B models on the same hardware. If speed matters more than reasoning depth, Mistral is the choice.
Use when: Response speed is your top priority and tasks are straightforward.
Hardware Requirements
| Model | Parameters | Min RAM | Recommended RAM | GPU | Tokens/sec (CPU) | Tokens/sec (GPU) |
|---|---|---|---|---|---|---|
| Qwen3.5 7B | 7B | 8GB | 16GB | Optional | 8-15 | 30-60 |
| Llama 3.3 8B | 8B | 8GB | 16GB | Optional | 8-12 | 25-50 |
| Mistral 7B | 7B | 8GB | 16GB | Optional | 10-18 | 35-70 |
| GLM-4 9B | 9B | 12GB | 16GB | Optional | 6-12 | 25-45 |
| Qwen3.5 14B | 14B | 16GB | 32GB | Recommended | 4-8 | 20-40 |
| Llama 3.3 70B | 70B | 64GB | 128GB | Almost required | 1-3 | 10-25 |
Tokens per second is the speed at which the model generates output. For conversational agent use, 8+ tokens/second feels responsive. Below 5 tokens/second, responses feel sluggish. Below 3, the experience is frustrating.
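The throughput figures in the table will vary with your hardware. Ollama can report your actual numbers: run with --verbose and look for the "eval rate" line in the timing stats, which is tokens per second:

```shell
# --verbose prints timing stats after the response, including "eval rate"
# (generation speed in tokens/second) and "prompt eval rate".
ollama run qwen3.5:14b --verbose "Summarize the benefits of local models in one sentence."
```

Compare the reported eval rate against the 8 tokens/second responsiveness threshold above to decide whether your hardware is adequate for the model.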
GPU recommendations: NVIDIA RTX 3060 (12GB VRAM) for 7B models. RTX 3090 or 4090 (24GB VRAM) for 14B models. Multiple GPUs or cloud GPU instances for 70B models. Apple Silicon Macs with 16GB+ unified memory handle up to 14B models well using Metal acceleration.
Performance Benchmarks
These benchmarks compare local models against cloud models on typical OpenClaw agent tasks. Scores are out of 10, judged on accuracy, coherence, and usefulness for each task type:
| Task | Qwen3.5 14B | Llama 3.3 70B | Claude Sonnet 4 | GPT-5.4 |
|---|---|---|---|---|
| Simple conversation | 8 | 9 | 10 | 9 |
| Email drafting | 7 | 8 | 9 | 9 |
| Scheduling logic | 7 | 8 | 9 | 9 |
| Data summarization | 7 | 8 | 9 | 9 |
| Complex reasoning | 5 | 7 | 9 | 9 |
| Code generation | 6 | 8 | 9 | 9 |
| Tool use / function calling | 6 | 7 | 9 | 8 |
The takeaway: local 14B models are genuinely useful for most agent tasks (scores of 6-8). They fall behind cloud models primarily on complex reasoning and sophisticated tool use. For a personal assistant that handles scheduling, email, and information retrieval, a local 14B model performs well.
When to Use Local vs Cloud
Use local when:
- Privacy is a hard requirement (legal, regulatory, or personal preference)
- Cost sensitivity — you want $0 ongoing API costs
- Your agent handles simple to moderate tasks (email, scheduling, summarization)
- You have adequate hardware (16GB+ RAM)
- Offline capability is needed
- You want full control over the model and its behavior
Use cloud when:
- You need the best possible reasoning quality
- Tasks involve complex multi-step logic or nuanced writing
- You do not have hardware capable of running good local models
- Response speed matters and you do not have a GPU
- You are running a business where quality directly affects revenue
Hybrid Setup: Best of Both Worlds
The most effective setup for many operators is a hybrid approach: use a local model for routine tasks and route complex tasks to a cloud model. OpenClaw's multi-model routing makes this straightforward:
```json
{
  "modelRouting": {
    "default": {
      "provider": "ollama",
      "model": "qwen3.5:14b"
    },
    "complex": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514"
    }
  }
}
```
With this configuration, most requests go to your local Qwen3.5 model (free, private, fast). Only when you explicitly need complex reasoning does the request go to Claude (paid, cloud, highest quality).
You can also configure automatic routing based on task complexity, message length, or specific keywords. For example, route anything containing "analyze" or "reason through" to the cloud model, and everything else to local.
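A keyword rule like that can be sketched as a small function. This is an illustration of the routing logic only, not OpenClaw's actual configuration syntax; the function name and keywords are made up for the example:

```shell
# Illustrative keyword router: messages containing heavyweight keywords go
# to the cloud tier ("complex"), everything else stays local ("default").
route_tier() {
  case "$1" in
    *analyze*|*"reason through"*) echo "complex" ;;  # cloud model
    *) echo "default" ;;                             # local model
  esac
}

route_tier "Please analyze this quarter's churn data"  # prints "complex"
route_tier "Schedule lunch with Dana on Friday"        # prints "default"
```

In practice you would tune the keyword list to your own workload, since an over-broad match list erodes the cost and privacy benefits of defaulting to local.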
This hybrid approach gives you: zero-cost operation for 80%+ of requests, maximum quality when you need it, full privacy for routine data, and resilience if either provider goes down.
