Remote OpenClaw Blog
Self-Hosted LLMs for OpenClaw: Run AI Without Cloud APIs [2026]
Why Self-Host Your LLM?
There are three compelling reasons to run your own AI model instead of using cloud APIs:
1. Privacy. When you use a cloud API, your prompts and data are sent to a third party. With a local model, everything stays on your machine. No data leaves your network. This matters for sensitive business data, client information, personal communications, and any use case with regulatory requirements around data handling.
2. Cost. Cloud API costs scale with usage. A busy agent can easily spend $50-200/month on API calls. A local model has zero per-request cost after the initial hardware investment. If you already have a capable machine, the marginal cost is just electricity.
3. Availability. Cloud APIs can go down. Rate limits can throttle your agent at the worst possible time. A local model is always available, always responsive, and never rate-limited. Your agent works even if your internet goes down (for local-only tasks).
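The cost argument can be made concrete with a quick break-even estimate. The dollar figures below are hypothetical; substitute your own API spend and hardware price:

```shell
# Back-of-envelope break-even: a hypothetical $100/month API spend vs. a
# one-time $1200 hardware upgrade (both figures are illustrative).
api_monthly=100
hardware_cost=1200

# Ceiling division: months until the one-time outlay beats cumulative API fees.
breakeven_months=$(( (hardware_cost + api_monthly - 1) / api_monthly ))
echo "Break-even after ${breakeven_months} months"   # prints "Break-even after 12 months"
```

At a heavier $200/month spend, the same hardware pays for itself in half the time, which is why busy agents are the strongest case for going local.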
The trade-off is quality. As of March 2026, local models are good — genuinely useful for real agent tasks — but they do not match the capabilities of frontier cloud models like Claude Sonnet 4, GPT-5.4, or Gemini 2.5 Pro. For most agent tasks (scheduling, email drafts, data processing, simple reasoning), the quality gap is acceptable. For complex reasoning, nuanced writing, or advanced code generation, cloud models are still noticeably better.
Ollama Setup Guide
Ollama is the standard way to run local models with OpenClaw. It handles model downloading, quantization, and memory management, and serves an API compatible with OpenClaw's model provider interface.
Install Ollama:
```shell
# Linux/Mac
curl -fsSL https://ollama.com/install.sh | sh

# Mac (alternative via Homebrew)
brew install ollama
```
For Windows, download the installer from ollama.com.
Pull a model:
```shell
# Recommended starting model
ollama pull qwen3.5:14b

# Lighter alternative for limited hardware
ollama pull qwen3.5:7b

# Maximum quality if you have 64GB+ RAM
ollama pull llama3.3:70b
```
Verify it works:
```shell
ollama run qwen3.5:14b "Hello, what can you help me with?"
```
If you get a response, Ollama is working. It runs a server on port 11434 by default.
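Beyond the CLI check, you can exercise the HTTP API directly, which is the same interface OpenClaw will call. Both endpoints below are standard Ollama API routes:

```shell
# List installed models; the model you pulled should appear in the output.
curl -s http://localhost:11434/api/tags

# One-off, non-streaming generation through the HTTP API.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3.5:14b", "prompt": "Say hello", "stream": false}'
```

If the first command fails to connect, the server is not running; if the second returns a model-not-found error, re-check the model name against the tags output.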
Configure OpenClaw:
If OpenClaw runs directly on the same machine as Ollama:
```ini
# .env
MODEL_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen3.5:14b
```
If OpenClaw runs in Docker:
```ini
# .env
MODEL_PROVIDER=ollama
OLLAMA_BASE_URL=http://host.docker.internal:11434
OLLAMA_MODEL=qwen3.5:14b
```
The host.docker.internal hostname lets the Docker container reach services running on the host machine. This is the most common setup issue — if you see connection errors, check that you are using the correct URL for your Docker setup.
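One caveat: Docker Desktop (Mac/Windows) defines host.docker.internal automatically, but on Linux it is not resolvable by default. A common fix on Linux (Docker 20.10+) is to map it to the host gateway at container start; the image name below is a placeholder:

```shell
# Map host.docker.internal to the host's gateway IP (Linux, Docker 20.10+).
# "openclaw-image" is a placeholder; use your actual image name.
docker run --add-host=host.docker.internal:host-gateway \
  --env-file .env openclaw-image

# docker-compose equivalent: add to the service definition:
#   extra_hosts:
#     - "host.docker.internal:host-gateway"
```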
Best Local Models for OpenClaw
Not all local models are equally suited for agent tasks. Agent work requires strong instruction following, tool use comprehension, structured output generation, and multi-turn conversation ability. Here are the best options as of March 2026:
Qwen3.5 14B — Best All-Around
Qwen3.5 14B from Alibaba is the current sweet spot for local agent use. It has excellent instruction following, strong reasoning, good code generation, and fits comfortably in 16GB RAM with quantization. It handles OpenClaw's scheduling, email drafting, data processing, and conversational tasks with consistently good quality.
Use when: You want one model that handles everything well without requiring premium hardware.
Llama 3.3 70B — Best Quality
Meta's Llama 3.3 70B is the most capable open model available. It approaches cloud model quality on many tasks. The catch is hardware — you need 64GB+ RAM or a powerful GPU to run it at acceptable speeds.
Use when: You have the hardware and want the best possible local model quality.
GLM-4 9B — Best Multilingual
GLM-4 from Zhipu AI excels at multilingual tasks. If your agent needs to handle Chinese, Japanese, Korean, or European languages alongside English, GLM-4 provides the best multilingual performance in its size class.
Use when: Your agent handles non-English languages frequently.
Llama 3.3 8B — Best for Limited Hardware
If you have only 8GB RAM, Llama 3.3 8B is your best option. It handles basic agent tasks — simple conversation, scheduling, email drafts — at acceptable quality. It will struggle with complex reasoning and long context windows.
Use when: You have limited hardware and need the best quality possible under those constraints.
Mistral 7B — Fastest 7B Option
Mistral 7B generates tokens faster than competing 7B models on the same hardware. If speed matters more than reasoning depth, Mistral is the choice.
Use when: Response speed is your top priority and tasks are straightforward.
Hardware Requirements
| Model | Parameters | Min RAM | Recommended RAM | GPU | Tokens/sec (CPU) | Tokens/sec (GPU) |
|---|---|---|---|---|---|---|
| Qwen3.5 7B | 7B | 8GB | 16GB | Optional | 8-15 | 30-60 |
| Llama 3.3 8B | 8B | 8GB | 16GB | Optional | 8-12 | 25-50 |
| Mistral 7B | 7B | 8GB | 16GB | Optional | 10-18 | 35-70 |
| GLM-4 9B | 9B | 12GB | 16GB | Optional | 6-12 | 25-45 |
| Qwen3.5 14B | 14B | 16GB | 32GB | Recommended | 4-8 | 20-40 |
| Llama 3.3 70B | 70B | 64GB | 128GB | Almost required | 1-3 | 10-25 |
Tokens per second is the speed at which the model generates output. For conversational agent use, 8+ tokens/second feels responsive. Below 5 tokens/second, responses feel sluggish. Below 3, the experience is frustrating.
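The throughput figures in the table will vary with your hardware. Ollama can report your actual numbers: run with --verbose and look for the "eval rate" line in the timing stats, which is tokens per second:

```shell
# --verbose prints timing stats after the response, including "eval rate"
# (generation speed in tokens/second) and "prompt eval rate".
ollama run qwen3.5:14b --verbose "Summarize the benefits of local models in one sentence."
```

Compare the reported eval rate against the 8 tokens/second responsiveness threshold above to decide whether your hardware is adequate for the model.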
GPU recommendations: NVIDIA RTX 3060 (12GB VRAM) for 7B models. RTX 3090 or 4090 (24GB VRAM) for 14B models. Multiple GPUs or cloud GPU instances for 70B models. Apple Silicon Macs with 16GB+ unified memory handle up to 14B models well using Metal acceleration.
Performance Benchmarks
These benchmarks compare local models against cloud models on typical OpenClaw agent tasks. Scores are out of 10, judged on accuracy, coherence, and usefulness for each task type:
| Task | Qwen3.5 14B | Llama 3.3 70B | Claude Sonnet 4 | GPT-5.4 |
|---|---|---|---|---|
| Simple conversation | 8 | 9 | 10 | 9 |
| Email drafting | 7 | 8 | 9 | 9 |
| Scheduling logic | 7 | 8 | 9 | 9 |
| Data summarization | 7 | 8 | 9 | 9 |
| Complex reasoning | 5 | 7 | 9 | 9 |
| Code generation | 6 | 8 | 9 | 9 |
| Tool use / function calling | 6 | 7 | 9 | 8 |
The takeaway: local 14B models are genuinely useful for most agent tasks (scores of 6-8). They fall behind cloud models primarily on complex reasoning and sophisticated tool use. For a personal assistant that handles scheduling, email, and information retrieval, a local 14B model performs well.
When to Use Local vs Cloud
Use local when:
- Privacy is a hard requirement (legal, regulatory, or personal preference)
- Cost sensitivity — you want $0 ongoing API costs
- Your agent handles simple to moderate tasks (email, scheduling, summarization)
- You have adequate hardware (16GB+ RAM)
- Offline capability is needed
- You want full control over the model and its behavior
Use cloud when:
- You need the best possible reasoning quality
- Tasks involve complex multi-step logic or nuanced writing
- You do not have hardware capable of running good local models
- Response speed matters and you do not have a GPU
- You are running a business where quality directly affects revenue
Hybrid Setup: Best of Both Worlds
The most effective setup for many operators is a hybrid approach: use a local model for routine tasks and route complex tasks to a cloud model. OpenClaw's multi-model routing makes this straightforward:
```json
{
  "modelRouting": {
    "default": {
      "provider": "ollama",
      "model": "qwen3.5:14b"
    },
    "complex": {
      "provider": "anthropic",
      "model": "claude-sonnet-4-20250514"
    }
  }
}
```
With this configuration, most requests go to your local Qwen3.5 model (free, private, fast). Only when you explicitly need complex reasoning does the request go to Claude (paid, cloud, highest quality).
You can also configure automatic routing based on task complexity, message length, or specific keywords. For example, route anything containing "analyze" or "reason through" to the cloud model, and everything else to local.
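A keyword rule like that can be sketched as a small function. This is an illustration of the routing logic only, not OpenClaw's actual configuration syntax; the function name and keywords are made up for the example:

```shell
# Illustrative keyword router: messages containing heavyweight keywords go
# to the cloud tier ("complex"), everything else stays local ("default").
route_tier() {
  case "$1" in
    *analyze*|*"reason through"*) echo "complex" ;;  # cloud model
    *) echo "default" ;;                             # local model
  esac
}

route_tier "Please analyze this quarter's churn data"  # prints "complex"
route_tier "Schedule lunch with Dana on Friday"        # prints "default"
```

In practice you would tune the keyword list to your own workload, since an over-broad match list erodes the cost and privacy benefits of defaulting to local.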
This hybrid approach gives you: zero-cost operation for 80%+ of requests, maximum quality when you need it, full privacy for routine data, and resilience if either provider goes down.
