Remote OpenClaw Blog
Qwen3 8B on OpenClaw: Best Small Model for Local Deployment
7 min read ·
Qwen3 8B is the 8 billion parameter model from Alibaba Cloud's Qwen (Tongyi Qianwen) family. Unlike the massive MoE models covered in our other guides, Qwen3 8B is a dense model — all 8 billion parameters are active on every forward pass. This makes it smaller, faster, and dramatically easier to run on consumer hardware.
What makes Qwen3 8B stand out among small models is a combination of three features: dual thinking modes that let you trade speed for accuracy per-request, support for 119 languages (the widest multilingual coverage of any model in its class), and hardware requirements that fit within a 16GB RAM laptop. For OpenClaw operators who want a local model that runs without any API dependency, Qwen3 8B is the strongest option available.
The model is released under the Apache 2.0 license by Alibaba Cloud, which means fully free commercial use with no restrictions. Download it, run it, fine-tune it, build products on it — no licensing fees, no usage caps, no API keys required.
Before diving into setup, here are the key specifications that make local deployment practical for OpenClaw operators:
| Specification | Value |
|---|---|
| Parameters | 8 billion (dense) |
| Architecture | Dense transformer |
| Developer | Alibaba Cloud (Qwen Team) |
| License | Apache 2.0 |
| Languages | 119 |
| Thinking Modes | Dual (thinking + non-thinking) |
| Context Window | 32K tokens |
| RAM Required | 16GB (q4 quantization) |
| Disk Space | ~5GB (q4 quantization) |
The 32K context window is adequate for most OpenClaw agent tasks — conversation histories, individual file analysis, email processing, and document summarization. For tasks requiring longer context (full codebase analysis, long document processing), you will need a cloud-hosted model with a larger context window.
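To sanity-check whether an input fits the window before sending it, a rough back-of-the-envelope sketch helps. The 4-characters-per-token ratio below is a common heuristic for English text, not the actual Qwen tokenizer, so treat the result as an estimate only:

```python
# Rough check of whether an input fits Qwen3 8B's 32K-token window.
# The 4-characters-per-token ratio is a heuristic assumption for English
# text, not the real Qwen tokenizer.

CONTEXT_WINDOW = 32_000
CHARS_PER_TOKEN = 4  # heuristic assumption

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    """True if the prompt likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= CONTEXT_WINDOW - reserve_for_output

doc = "word " * 20_000  # ~100,000 characters of input
print(estimate_tokens(doc), fits_in_context(doc))  # 25000 True
```

Anything that fails this check (a full codebase, a long report) is a signal to chunk the input or fall back to a larger-context model.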
Qwen3 8B's dual thinking system is its most distinctive feature. You can switch between two modes on a per-request basis:
In thinking mode, Qwen3 8B works through problems step by step, showing its reasoning chain before arriving at an answer. This is similar to chain-of-thought prompting but built into the model architecture. Thinking mode produces higher accuracy on complex tasks — math problems, code debugging, logical reasoning — at the cost of generating more tokens (and therefore running slower).
```yaml
# Enable thinking mode via system prompt
system_prompt: "You are a helpful assistant. Think step by step before answering."
```

```bash
# Or via Ollama parameters
ollama run qwen3:8b --system "Think step by step."
```
In non-thinking mode, Qwen3 8B generates direct responses without intermediate reasoning. This is faster and uses fewer tokens, making it ideal for simple tasks — classification, formatting, factual lookups, template filling.
```yaml
# Non-thinking mode (default)
system_prompt: "You are a helpful assistant. Respond directly and concisely."
```
For OpenClaw operators, the practical approach is to configure your agent to use thinking mode for complex tasks (coding, analysis, problem-solving) and non-thinking mode for routine tasks (formatting, classification, data extraction). This optimizes both speed and accuracy without switching models.
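One way to implement that routing is a small dispatcher that picks the system prompt per request. The task labels and prompt wording below are illustrative assumptions, not an official OpenClaw API; the payload shape matches Ollama's `/api/chat` endpoint:

```python
# Sketch: choose thinking vs. non-thinking system prompts per request.
# Task categories and prompt text are illustrative, not an OpenClaw API.

THINKING_PROMPT = "You are a helpful assistant. Think step by step before answering."
DIRECT_PROMPT = "You are a helpful assistant. Respond directly and concisely."

COMPLEX_TASKS = {"coding", "analysis", "problem-solving"}

def system_prompt_for(task_type: str) -> str:
    """Thinking mode for complex tasks, direct mode for everything else."""
    return THINKING_PROMPT if task_type in COMPLEX_TASKS else DIRECT_PROMPT

def build_chat_request(task_type: str, user_prompt: str) -> dict:
    """Assemble a non-streaming payload for Ollama's /api/chat endpoint."""
    return {
        "model": "qwen3:8b",
        "messages": [
            {"role": "system", "content": system_prompt_for(task_type)},
            {"role": "user", "content": user_prompt},
        ],
        "stream": False,
    }

req = build_chat_request("formatting", "Convert this list to JSON")
print(req["messages"][0]["content"])  # direct prompt, no reasoning overhead
```

The default branch falls back to the faster non-thinking prompt, so unclassified tasks never pay the token cost of reasoning chains.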
Qwen3 8B supports 119 languages — the broadest multilingual coverage of any small model. This includes all major world languages, most regional languages, and many minority languages that other models do not cover.
For OpenClaw operators, this means a single local model can handle user messages, documents, and email in whatever language they arrive in. Hardware requirements are modest:
| Hardware | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB |
| Disk | 10GB free | 20GB free |
| CPU | Any modern (2020+) | Apple Silicon / modern x86 |
| GPU | Not required | Any GPU with 8GB+ VRAM |
Qwen3 8B runs on CPU-only hardware. A GPU accelerates inference significantly but is not required. On a MacBook Air M2 with 16GB RAM, expect 15-30 tokens per second. On a machine with a dedicated GPU (RTX 3060 or better), expect 40-80 tokens per second.
```bash
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

# Pull the model (~5GB download)
ollama pull qwen3:8b

# Verify it downloaded
ollama list

# Interactive chat
ollama run qwen3:8b

# Test with a prompt
ollama run qwen3:8b "Write a Python function to calculate fibonacci numbers"

# Ollama automatically starts a local server at port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Hello, are you running?",
  "stream": false
}'
```
The entire process takes under 5 minutes on a decent internet connection. The model download is approximately 5GB for the q4 quantization.
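The same API call can be made from Python using only the standard library. A minimal sketch, assuming Ollama is serving on its default port 11434:

```python
# Minimal stdlib client for Ollama's local /api/generate endpoint.
# Assumes Ollama is running on the default port 11434.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "qwen3:8b") -> dict:
    """Body for a non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "qwen3:8b") -> str:
    """POST the request and return the model's response text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running):
# print(generate("Hello, are you running?"))
```

Because the server speaks plain HTTP with JSON bodies, any language with an HTTP client can drive the model the same way.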
```yaml
# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: ollama
  model: qwen3:8b
  base_url: http://localhost:11434
  temperature: 0.7
  max_tokens: 4096
  # Optional: Configure dual thinking
  system_prompt: |
    You are a helpful assistant powering an OpenClaw agent.
    For complex tasks, think step by step before answering.
    For simple tasks, respond directly and concisely.
```

```bash
# Make sure Ollama is running first
ollama serve &

# Start OpenClaw
openclaw start
```
OpenClaw connects to Ollama's local API server. No API keys, no authentication, no network egress. Everything stays on your machine.
| Hardware | Tokens/Second | Time for 500-word Response |
|---|---|---|
| MacBook Air M2 (16GB) | 15-25 tok/s | ~25 seconds |
| MacBook Pro M3 (32GB) | 25-40 tok/s | ~15 seconds |
| Desktop + RTX 3060 | 40-60 tok/s | ~10 seconds |
| Desktop + RTX 4090 | 60-90 tok/s | ~7 seconds |
These speeds are for the q4 quantization. Higher quantizations (q8, fp16) improve quality slightly but require more RAM and run slower. For most OpenClaw agent tasks, q4 provides the best balance of quality and speed.
Compared to cloud API models that typically respond at 60-150 tokens per second, local inference on consumer hardware is slower. The trade-off is zero cost, complete privacy, and no rate limits. For development, testing, and moderate production volumes, the speed is more than adequate.
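You can measure your own hardware against the table above using the metadata Ollama attaches to each non-streaming response. The sketch below assumes Ollama's documented `eval_count` (tokens generated) and `eval_duration` (nanoseconds) response fields:

```python
# Compute tokens/second from Ollama's /api/generate response metadata.
# eval_count = tokens generated; eval_duration = generation time in
# nanoseconds (per Ollama's API documentation).

def tokens_per_second(response: dict) -> float:
    """Generation speed: tokens emitted divided by generation seconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Example with the shape Ollama returns: 500 tokens in 25 seconds
sample = {"eval_count": 500, "eval_duration": 25_000_000_000}
print(tokens_per_second(sample))  # 20.0 tok/s, MacBook Air territory
```

Run a few representative prompts and average the result; single measurements vary with prompt length and background load.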
| Metric | Qwen3 8B | Llama 3.1 8B | Gemma 3 9B |
|---|---|---|---|
| Parameters | 8B | 8B | 9B |
| Languages | 119 | ~8 | ~30 |
| Dual Thinking | Yes | No | No |
| RAM Required | 16GB | 16GB | 16GB |
| License | Apache 2.0 | Llama License | Gemma License |
| Context Window | 32K | 128K | 32K |
| Best For | Multilingual, math | English, ecosystem | Multimodal tasks |
Qwen3 8B wins on multilingual support and dual thinking. Llama 3.1 8B wins on context window length and English-language ecosystem. Gemma 3 9B has basic vision capabilities that neither Qwen3 nor Llama offer. For most OpenClaw operators, the choice comes down to whether you need multilingual support (Qwen3), maximum English context (Llama), or vision (Gemma).
If your laptop has 16GB of RAM, yes. Qwen3 8B is a dense 8 billion parameter model that runs comfortably in q4 quantization on 16GB RAM machines — including M1/M2/M3/M4 MacBooks, recent Windows laptops with 16GB+ RAM, and Linux workstations. Install Ollama, run `ollama pull qwen3:8b`, and you are up and running in under five minutes. Performance is typically 15-30 tokens per second on a MacBook Air M2.
Dual thinking mode means Qwen3 8B can operate in two distinct ways: "thinking" where it works through problems step by step with visible reasoning chains (similar to chain-of-thought prompting), and "non-thinking" where it generates direct responses without intermediate reasoning. You can switch between modes via the system prompt. Thinking mode produces higher accuracy on complex tasks but uses more tokens. Non-thinking mode is faster and cheaper for simple tasks.
Both are excellent 8B-class models for local deployment. Qwen3 8B has the edge in multilingual support (119 languages vs Llama's ~8), dual thinking modes, and mathematical reasoning. Llama 3.1 8B has a stronger English-language ecosystem with more fine-tuned variants and community tooling. For OpenClaw operators who need multilingual support or better math performance, choose Qwen3. For English-only workflows with maximum community support, choose Llama.
Yes, completely. Qwen3 8B is released under the Apache 2.0 license by Alibaba Cloud. You can download the weights, run them locally via Ollama, fine-tune them for your domain, and use them commercially — all for free. The only cost is your hardware. Running locally on your existing laptop or desktop means zero ongoing API fees.