Remote OpenClaw Blog

Qwen3 8B on OpenClaw: Best Small Model for Local Deployment

7 min read

What Is Qwen3 8B?

Qwen3 8B is the 8 billion parameter model from Alibaba Cloud's Qwen (Tongyi Qianwen) family. Unlike the massive MoE models covered in our other guides, Qwen3 8B is a dense model — all 8 billion parameters are active on every forward pass. This makes it smaller, faster, and dramatically easier to run on consumer hardware.

What makes Qwen3 8B stand out among small models is a combination of three features: dual thinking modes that let you trade speed for accuracy per-request, support for 119 languages (the widest multilingual coverage of any model in its class), and hardware requirements that fit within a 16GB RAM laptop. For OpenClaw operators who want a local model that runs without any API dependency, Qwen3 8B is the strongest option available.

The model is released under the Apache 2.0 license by Alibaba Cloud, which permits free commercial use. Download it, run it, fine-tune it, build products on it — no licensing fees, no usage caps, no API keys required.


Why Run a Model Locally?

Before diving into setup, here is why local deployment matters for OpenClaw operators: zero ongoing API cost, complete privacy (no network egress — everything stays on your machine), no rate limits, and no dependency on a third-party API staying up.


Specifications

| Specification | Value |
| --- | --- |
| Parameters | 8 billion (dense) |
| Architecture | Dense transformer |
| Developer | Alibaba Cloud (Qwen Team) |
| License | Apache 2.0 |
| Languages | 119 |
| Thinking Modes | Dual (thinking + non-thinking) |
| Context Window | 32K tokens |
| RAM Required | 16GB (q4 quantization) |
| Disk Space | ~5GB (q4 quantization) |

The 32K context window is adequate for most OpenClaw agent tasks — conversation histories, individual file analysis, email processing, and document summarization. For tasks requiring longer context (full codebase analysis, long document processing), you will need a cloud-hosted model with a larger context window.
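A rough rule of thumb for sizing inputs against that window is about four characters per token for English text. This is a heuristic, not the model's actual tokenizer, but it is close enough to decide whether a document belongs on the local model or a long-context cloud model:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int = 32_000, reserve: int = 4_096) -> bool:
    """Reserve headroom for the system prompt and the model's own output."""
    return approx_tokens(text) <= context_window - reserve

# A ~100KB file is roughly 25K tokens: near the limit once headroom is reserved
# fits_context("x" * 100_000)
```

For non-English or code-heavy input the characters-per-token ratio varies, so treat borderline results with suspicion.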


Dual Thinking Modes

Qwen3 8B's dual thinking system is its most distinctive feature. You can switch between two modes on a per-request basis:

Thinking Mode

In thinking mode, Qwen3 8B works through problems step by step, showing its reasoning chain before arriving at an answer. This is similar to chain-of-thought prompting but built into the model architecture. Thinking mode produces higher accuracy on complex tasks — math problems, code debugging, logical reasoning — at the cost of generating more tokens (and therefore running slower).

# Enable thinking mode via system prompt
system_prompt: "You are a helpful assistant. Think step by step before answering."

# Or per request, with Qwen3's /think soft switch appended to the prompt
ollama run qwen3:8b "A train covers 120 km in 90 minutes. What is its average speed in km/h? /think"

Non-Thinking Mode

In non-thinking mode, Qwen3 8B generates direct responses without intermediate reasoning. This is faster and uses fewer tokens, making it ideal for simple tasks — classification, formatting, factual lookups, template filling.

# Non-thinking mode via system prompt
system_prompt: "You are a helpful assistant. Respond directly and concisely."

# Or per request, with the /no_think soft switch
ollama run qwen3:8b "Classify this sentiment as positive or negative: 'Great product.' /no_think"

For OpenClaw operators, the practical approach is to configure your agent to use thinking mode for complex tasks (coding, analysis, problem-solving) and non-thinking mode for routine tasks (formatting, classification, data extraction). This optimizes both speed and accuracy without switching models.


119-Language Support

Qwen3 8B supports 119 languages — the broadest multilingual coverage of any small model. This includes all major world languages, most regional languages, and many minority languages that other models do not cover.

For OpenClaw operators, this means agents can process email, chat, and documents in a user's native language without routing through a separate translation model.


Hardware Requirements

| Hardware | Minimum | Recommended |
| --- | --- | --- |
| RAM | 16GB | 32GB |
| Disk | 10GB free | 20GB free |
| CPU | Any modern (2020+) | Apple Silicon / modern x86 |
| GPU | Not required | Any GPU with 8GB+ VRAM |

Qwen3 8B runs on CPU-only hardware. A GPU accelerates inference significantly but is not required. On a MacBook Air M2 with 16GB RAM, expect 15-30 tokens per second. On a machine with a dedicated GPU (RTX 3060 or better), expect 40-80 tokens per second.


Step-by-Step Setup with Ollama

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Verify installation
ollama --version

Step 2: Pull Qwen3 8B

# Pull the model (~5GB download)
ollama pull qwen3:8b

# Verify it downloaded
ollama list

Step 3: Test the Model

# Interactive chat
ollama run qwen3:8b

# Test with a prompt
ollama run qwen3:8b "Write a Python function to calculate fibonacci numbers"

Step 4: Verify the Server

# Ollama automatically starts a local server at port 11434
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Hello, are you running?",
  "stream": false
}'

The entire process takes under 5 minutes on a decent internet connection. The model download is approximately 5GB for the q4 quantization.
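To script that verification instead of curling by hand, Ollama's /api/tags endpoint lists the installed models. A small stdlib-only check (the `has_model`/`model_installed` helpers are illustrative):

```python
import json
import urllib.request

def has_model(tags_json: str, name: str) -> bool:
    """Parse an /api/tags response body and look for the model name."""
    models = json.loads(tags_json).get("models", [])
    return any(m.get("name", "").startswith(name) for m in models)

def model_installed(name: str, host: str = "http://localhost:11434") -> bool:
    """Ask the local Ollama server whether `name` is pulled."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return has_model(resp.read().decode(), name)

# if not model_installed("qwen3:8b"):
#     raise SystemExit("Run `ollama pull qwen3:8b` first.")
```

Splitting the JSON parsing from the network call keeps the check testable without a running server.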


OpenClaw Configuration

# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: ollama
  model: qwen3:8b
  base_url: http://localhost:11434
  temperature: 0.7
  max_tokens: 4096

# Optional: Configure dual thinking
system_prompt: |
  You are a helpful assistant powering an OpenClaw agent.
  For complex tasks, think step by step before answering.
  For simple tasks, respond directly and concisely.

Start OpenClaw

# Make sure Ollama is running first
ollama serve &

# Start OpenClaw
openclaw start

OpenClaw connects to Ollama's local API server. No API keys, no authentication, no network egress. Everything stays on your machine.


Performance Expectations

| Hardware | Tokens/Second | Time for 500-word Response |
| --- | --- | --- |
| MacBook Air M2 (16GB) | 15-25 tok/s | ~25 seconds |
| MacBook Pro M3 (32GB) | 25-40 tok/s | ~15 seconds |
| Desktop + RTX 3060 | 40-60 tok/s | ~10 seconds |
| Desktop + RTX 4090 | 60-90 tok/s | ~7 seconds |

These speeds are for the q4 quantization. Higher-precision variants (q8, or the unquantized fp16 weights) improve quality slightly but require more RAM and run slower. For most OpenClaw agent tasks, q4 provides the best balance of quality and speed.

Compared to cloud API models that typically respond at 60-150 tokens per second, local inference on consumer hardware is slower. The trade-off is zero cost, complete privacy, and no rate limits. For development, testing, and moderate production volumes, the speed is more than adequate.
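The response-time column above follows directly from the token rate: a 500-word answer is roughly 650 tokens (about 1.3 tokens per English word, a heuristic), and time is tokens divided by rate. Running `ollama run qwen3:8b --verbose` prints your machine's measured eval rate, which you can plug in:

```python
def estimated_seconds(words: int, tokens_per_second: float,
                      tokens_per_word: float = 1.3) -> float:
    """Generation-time estimate: tokens to produce divided by measured rate."""
    return words * tokens_per_word / tokens_per_second

# 500 words at 25 tok/s (MacBook Air M2, upper end) -> ~26 seconds
# 500 words at 90 tok/s (RTX 4090, upper end)       -> ~7 seconds
```

This ignores prompt-processing time, which is usually small next to generation for short prompts but matters for long context.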


Qwen3 8B vs Other Small Models

| Metric | Qwen3 8B | Llama 3.1 8B | Gemma 3 9B |
| --- | --- | --- | --- |
| Parameters | 8B | 8B | 9B |
| Languages | 119 | ~8 | ~30 |
| Dual Thinking | Yes | No | No |
| RAM Required | 16GB | 16GB | 16GB |
| License | Apache 2.0 | Llama License | Gemma License |
| Context Window | 32K | 128K | 32K |
| Best For | Multilingual, math | English, ecosystem | Multimodal tasks |

Qwen3 8B wins on multilingual support and dual thinking. Llama 3.1 8B wins on context window length and English-language ecosystem. Gemma 3 9B has basic vision capabilities that neither Qwen3 nor Llama offer. For most OpenClaw operators, the choice comes down to whether you need multilingual support (Qwen3), maximum English context (Llama), or vision (Gemma).


Frequently Asked Questions

Can my laptop run Qwen3 8B?

If your laptop has 16GB of RAM, yes. Qwen3 8B is a dense 8 billion parameter model that runs comfortably in q4 quantization on 16GB RAM machines — including M1/M2/M3/M4 MacBooks, recent Windows laptops with 16GB+ RAM, and Linux workstations. Install Ollama, run ollama pull qwen3:8b, and you are up and running in under five minutes. Performance is typically 15-30 tokens per second on a MacBook Air M2.

What is dual thinking mode?

Dual thinking mode means Qwen3 8B can operate in two distinct ways: "thinking" where it works through problems step by step with visible reasoning chains (similar to chain-of-thought prompting), and "non-thinking" where it generates direct responses without intermediate reasoning. You can switch between modes via the system prompt. Thinking mode produces higher accuracy on complex tasks but uses more tokens. Non-thinking mode is faster and cheaper for simple tasks.

How does Qwen3 8B compare to Llama 3.1 8B for OpenClaw?

Both are excellent 8B-class models for local deployment. Qwen3 8B has the edge in multilingual support (119 languages vs Llama's ~8), dual thinking modes, and mathematical reasoning. Llama 3.1 8B has a stronger English-language ecosystem with more fine-tuned variants and community tooling. For OpenClaw operators who need multilingual support or better math performance, choose Qwen3. For English-only workflows with maximum community support, choose Llama.

Is Qwen3 8B free to use?

Yes, completely. Qwen3 8B is released under the Apache 2.0 license by Alibaba Cloud. You can download the weights, run them locally via Ollama, fine-tune them for your domain, and use them commercially — all for free. The only cost is your hardware. Running locally on your existing laptop or desktop means zero ongoing API fees.
