Remote OpenClaw Blog

Claude Opus 4.6 on OpenClaw: Setup, Benchmarks, and Best Practices


What Is Claude Opus 4.6?

Claude Opus 4.6 is Anthropic's most capable language model, released in February 2026. It represents the top of the Claude model family — above Sonnet and Haiku — and is designed for tasks that require the deepest reasoning, the longest context, and the most reliable autonomous execution.

For OpenClaw operators, Opus 4.6 is the model you choose when task success rate matters more than per-token cost. Its 80.8% score on SWE-bench Verified means it can autonomously resolve 4 out of 5 real-world software engineering tasks — reading issue descriptions, navigating codebases, generating fixes, and producing valid patches. No other model consistently matches this level of autonomous coding capability.

Three features define Opus 4.6's value proposition for agent workflows: the 1M token context window (no more truncating large codebases or documents), the 128K output token limit (complete responses even for large code changes), and adaptive thinking (automatic reasoning depth scaling that optimizes cost without sacrificing quality on hard problems).


Architecture and Specifications

| Specification     | Value                           |
|-------------------|---------------------------------|
| Model Name        | Claude Opus 4.6 (1M context)    |
| Developer         | Anthropic                       |
| Release Date      | February 2026                   |
| Context Window    | 1,000,000 tokens                |
| Max Output        | 128,000 tokens per response     |
| Modalities        | Text + Vision                   |
| Adaptive Thinking | Yes (automatic reasoning depth) |
| Prompt Caching    | 90% savings on cached input     |
| License           | Proprietary (API access)        |

The 128K output limit is a significant differentiator that is easy to overlook. Most models cap output at 4K-16K tokens, which means long code changes, detailed reports, or comprehensive analyses get truncated. Opus 4.6 can generate complete responses for tasks that would require multiple calls with other models — reducing latency, complexity, and error rates in agent pipelines.
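As a back-of-the-envelope illustration of that point, the sketch below estimates how many sequential calls a long generation needs under different output caps. The helper function is hypothetical (not an OpenClaw API); the token limits come from the spec table above.

```python
import math

def calls_needed(output_tokens: int, per_call_limit: int) -> int:
    """Number of sequential model calls needed to emit `output_tokens`,
    assuming each call can produce at most `per_call_limit` tokens."""
    return math.ceil(output_tokens / per_call_limit)

# A 60K-token refactor fits in one Opus 4.6 call (128K limit),
# but needs several round trips on a model capped at 8K output.
print(calls_needed(60_000, 128_000))  # 1
print(calls_needed(60_000, 8_000))    # 8
```

Each extra round trip is another place for an agent pipeline to lose state or introduce stitching errors, which is why the single-call case matters.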


Benchmarks and Performance

| Benchmark          | Claude Opus 4.6 | Context                                         |
|--------------------|-----------------|-------------------------------------------------|
| SWE-bench Verified | 80.8%           | Highest of any model for autonomous coding      |
| AIME 2024          | 91.5%           | Near-perfect mathematical reasoning             |
| MMLU               | 92.3%           | Broadest knowledge base of any model            |
| HumanEval          | 94.1%           | Excellent code generation from natural language |
| GPQA Diamond       | 78.2%           | Graduate-level scientific reasoning             |

The SWE-bench Verified score of 80.8% deserves attention. This benchmark measures end-to-end software engineering: given a real GitHub issue, can the model read the issue, find the relevant code across the repository, generate a correct fix, and produce a patch that passes tests? An 80.8% success rate means Opus 4.6 handles the vast majority of real-world coding tasks without human intervention. For OpenClaw operators running coding agents, this translates directly to higher task completion rates and less manual intervention.

The GPQA Diamond score of 78.2% measures graduate-level scientific reasoning across physics, chemistry, and biology. This is relevant for operators building research agents, scientific analysis workflows, or any task requiring deep domain expertise beyond just coding and math.


Adaptive Thinking Explained

Adaptive thinking is one of Opus 4.6's most impactful features for cost optimization. Here is how it works:

When Opus 4.6 receives a request, it assesses the complexity of the task before generating a response. For simple tasks — formatting text, answering factual questions, classification — it uses minimal internal reasoning and produces a quick, efficient response. For complex tasks — multi-file code refactoring, mathematical proofs, strategic analysis — it engages extended reasoning chains that may use thousands of thinking tokens to work through the problem step by step.

This matters for cost because thinking tokens count toward your output token usage. Without adaptive thinking, a model uses the same reasoning depth for every request, meaning you pay for deep thinking on tasks that do not need it. With adaptive thinking, simple tasks cost less automatically.

Configuring Thinking Budget in OpenClaw

# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: anthropic
  model: claude-opus-4-6-1m
  api_key: your-anthropic-api-key
  temperature: 0.7
  max_tokens: 32768
  # Adaptive thinking configuration
  thinking:
    enabled: true
    budget_tokens: 16384    # Max thinking tokens per request
    # Higher budget = deeper reasoning but higher cost
    # Lower budget = faster responses, lower cost

A budget of 8,192-16,384 thinking tokens works well for most OpenClaw agent tasks. For particularly complex coding or research tasks, increase to 32,768 or higher. For simple classification and routing, set to 2,048 or disable entirely.
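The budget tiers above can be sketched as a small lookup, along with the worst-case dollar cost if the full budget is consumed (thinking tokens are billed at the $25/M output rate described later in this guide). The helper and tier names are illustrative, not part of OpenClaw.

```python
# Hypothetical helper mirroring the budget guidance above.
THINKING_BUDGETS = {
    "simple": 2_048,      # classification, routing
    "standard": 16_384,   # most agent tasks
    "complex": 32_768,    # hard coding / research tasks
}

THINKING_PRICE_PER_TOKEN = 25.00 / 1_000_000  # thinking billed as output

def max_thinking_cost(task_class: str) -> float:
    """Worst-case thinking cost in dollars if the full budget is used."""
    return THINKING_BUDGETS[task_class] * THINKING_PRICE_PER_TOKEN

print(round(max_thinking_cost("simple"), 4))   # 0.0512
print(round(max_thinking_cost("complex"), 4))  # 0.8192
```

At roughly $0.05 vs $0.82 per request in the worst case, the gap between tiers is large enough that classifying tasks before dispatch pays for itself quickly.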


Prompt Caching: 90% Cost Savings

Prompt caching is the single most impactful cost optimization for OpenClaw operators using Opus 4.6. Here is how it works:

Every time your agent sends a request to Claude, it includes a system prompt, context documents, and the user's message. Without caching, Claude processes the entire prompt from scratch on every request — including the system prompt and context that have not changed. With caching, you mark stable portions of your prompt (system instructions, reference documents, code files) as cacheable. Claude processes these once and stores the result, charging 90% less for cached tokens on subsequent requests.

For a typical OpenClaw agent that sends a 10,000-token system prompt with every request, that prompt costs $0.05 per request uncached (at $5.00 per million input tokens) but only $0.005 cached (at $0.50 per million). Over 10,000 requests per month, that is $500 vs $50 for the system prompt alone. For high-volume agent workflows, caching turns Opus 4.6 from an expensive premium model into a cost-competitive option.

Enabling Caching in OpenClaw

# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: anthropic
  model: claude-opus-4-6-1m
  api_key: your-anthropic-api-key
  cache:
    enabled: true
    # Specify which parts of the prompt to cache
    cache_system_prompt: true
    cache_context_documents: true
    ttl: 300    # Cache time-to-live in seconds

Pricing Breakdown

| Token Type      | Standard Price (per 1M) | With Caching (per 1M) |
|-----------------|-------------------------|-----------------------|
| Input Tokens    | $5.00                   | $0.50 (cached)        |
| Output Tokens   | $25.00                  | $25.00 (no caching)   |
| Thinking Tokens | $25.00                  | $25.00 (no caching)   |

The output pricing of $25 per million tokens is the highest of any major model. This is where adaptive thinking's cost optimization matters most — by using fewer thinking tokens on simple tasks, Opus 4.6 keeps effective output costs manageable. For operators running high-volume workflows, combining caching (90% input savings) with controlled thinking budgets can bring Opus 4.6's effective cost below GPT-5.4 for many workloads.
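To make the "below GPT-5.4" claim concrete, here is an illustrative per-request cost under one assumed workload (10K cached input, 1K fresh input, 2K output, 4K thinking tokens per request; all workload numbers are hypothetical, prices are from the tables in this guide, and the comparison table lists no cache discount for GPT-5.4):

```python
def request_cost(cached_in: int, fresh_in: int, out: int, thinking: int,
                 in_price: float, cached_price: float,
                 out_price: float) -> float:
    """Per-request cost in dollars; thinking tokens bill at the output rate."""
    return (cached_in * cached_price + fresh_in * in_price
            + (out + thinking) * out_price) / 1_000_000

# Opus 4.6: $5/M input, $0.50/M cached, $25/M output+thinking.
opus = request_cost(10_000, 1_000, 2_000, 4_000, 5.00, 0.50, 25.00)
# GPT-5.4 per the comparison table: $5/M input, $20/M output, no cache line.
gpt = request_cost(0, 11_000, 6_000, 0, 5.00, 0.00, 20.00)
print(round(opus, 3), round(gpt, 3))  # 0.16 0.175
```

Under these assumptions the cached Opus request comes out slightly cheaper; with a larger cacheable prefix the gap widens, and with heavy thinking usage it narrows or reverses.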


Setup Method 1: Anthropic API (Direct)

The Anthropic API provides direct access to Claude Opus 4.6 with the lowest latency and full feature support including caching and adaptive thinking.

Step 1: Get an Anthropic API Key

Sign up at console.anthropic.com and generate an API key. Add credits based on your expected usage. The pay-as-you-go model means no minimum commitment.

Step 2: Configure OpenClaw

# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: anthropic
  model: claude-opus-4-6-1m
  api_key: your-anthropic-api-key
  temperature: 0.7
  max_tokens: 32768
  thinking:
    enabled: true
    budget_tokens: 16384
  cache:
    enabled: true
    cache_system_prompt: true

Step 3: Start OpenClaw

openclaw start

The Anthropic API is highly reliable with strong uptime. For production deployments, it is the recommended integration path for Claude models.


Setup Method 2: OpenRouter

OpenRouter provides access to Claude Opus 4.6 alongside other models through a single API key. This is the best option if you want to switch between Claude, GPT, and open models without reconfiguring.

Step 1: Get an OpenRouter API Key

Sign up at openrouter.ai and generate an API key.

Step 2: Configure OpenClaw

# In your OpenClaw config (e.g., ~/.openclaw/config.yaml)
llm:
  provider: openrouter
  model: anthropic/claude-opus-4-6-1m
  api_key: your-openrouter-api-key
  temperature: 0.7
  max_tokens: 32768

Step 3: Start OpenClaw

openclaw start

Note that OpenRouter may not support all Opus 4.6 features (like advanced caching configurations). For full feature access, use the direct Anthropic API.


Best Practices for OpenClaw

  • Always enable caching. If your agent sends a system prompt with every request (which it should), caching alone saves 90% on those input tokens. There is no reason not to enable it.
  • Set appropriate thinking budgets. Start with 16,384 thinking tokens and adjust based on task complexity. Monitor your thinking token usage in the Anthropic dashboard to find the right balance.
  • Use Opus for hard tasks, not all tasks. Route simple classification and formatting tasks to cheaper models like DeepSeek V3.2 or GPT-5.4-nano. Reserve Opus for tasks where its 80.8% SWE-bench capability actually matters — complex code changes, multi-step reasoning, long-document analysis.
  • Leverage the 128K output. Unlike models with 4K-8K output limits, Opus can generate complete responses for large code changes in a single call. Design your agent to take advantage of this by requesting comprehensive outputs rather than splitting tasks unnecessarily.
  • Use the 1M context strategically. Load your entire codebase, documentation, and conversation history into context. Opus performs better with more context — do not artificially truncate to save costs. The caching discount makes large contexts affordable.
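The routing practice above can be sketched as a simple dispatch function. The model identifiers and task categories are illustrative (OpenClaw does not ship this helper); the capability split follows the comparison table later in this guide.

```python
# Hypothetical model router following the best practices above.
def pick_model(task_type: str) -> str:
    routine = {"classification", "formatting", "routing"}
    computer_use = {"browser", "desktop"}
    if task_type in routine:
        return "deepseek-v3.2"       # cheap model for routine work
    if task_type in computer_use:
        return "gpt-5.4"             # stronger computer-use performance
    return "claude-opus-4-6-1m"      # hardest coding/reasoning tasks

print(pick_model("classification"))  # deepseek-v3.2
print(pick_model("refactor"))        # claude-opus-4-6-1m
```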

Opus 4.6 vs GPT-5.4 vs DeepSeek V3.2

| Metric             | Claude Opus 4.6        | GPT-5.4           | DeepSeek V3.2 |
|--------------------|------------------------|-------------------|---------------|
| SWE-bench Verified | 80.8%                  | 79.5%             | 67.8%         |
| Context Window     | 1M                     | 1M                | 128K          |
| Max Output         | 128K                   | 32K               | 8K            |
| Input Cost         | $5.00/M ($0.50 cached) | $5.00/M           | $0.028/M      |
| Output Cost        | $25.00/M               | $20.00/M          | $0.10/M       |
| Vision             | Yes                    | Yes               | No            |
| Computer Use       | Yes                    | Yes (75% OSWorld) | No            |

The bottom line: Opus 4.6 delivers the highest success rate on complex tasks but at premium pricing. GPT-5.4 is close in capability with better computer-use performance. DeepSeek V3.2 is roughly 180x cheaper on input ($0.028/M vs $5.00/M) and handles 67.8% of coding tasks, making it the best value for high-volume, less-complex workflows. Smart operators use all three: DeepSeek for routine work, GPT-5.4 for computer use, Opus for the hardest problems.


Frequently Asked Questions

What makes Claude Opus 4.6 different from Claude Sonnet?

Claude Opus 4.6 is Anthropic's most capable model, scoring 80.8% on SWE-bench Verified compared to Sonnet 4's ~79%. The key differences are: 1M context window (vs 200K for Sonnet), 128K output tokens per response (vs 8K for Sonnet), adaptive thinking that scales reasoning depth to task complexity, and significantly better performance on long-context tasks. Opus costs $5/$25 per million tokens vs Sonnet's $3/$15 — a premium that is justified for complex agent workflows.

How does prompt caching work with Claude Opus 4.6?

Prompt caching stores frequently used context — system prompts, large documents, code files — so they do not need to be re-processed on every request. With Opus 4.6, cached tokens cost 90% less than uncached tokens. For OpenClaw agents that send the same system prompt and context with every request, this can reduce effective input costs from $5.00 to $0.50 per million tokens. Enable caching by adding cache_control breakpoints in your prompts.
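For reference, a cache_control breakpoint looks like this in a Messages API request body. The shape below follows the Anthropic API's block format; the model id is the one used throughout this guide, and the prompt text is a placeholder.

```python
# Request body with a cache_control breakpoint on the system prompt.
# Everything up to and including the marked block is cached; the user
# message below it is processed fresh on every request.
request_body = {
    "model": "claude-opus-4-6-1m",
    "max_tokens": 32768,
    "system": [
        {
            "type": "text",
            "text": "You are an OpenClaw coding agent.",   # stable prefix
            "cache_control": {"type": "ephemeral"},         # cache up to here
        }
    ],
    "messages": [
        {"role": "user", "content": "Fix the failing test in utils.py"}
    ],
}

print(request_body["system"][0]["cache_control"]["type"])  # ephemeral
```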

What is adaptive thinking in Claude Opus 4.6?

Adaptive thinking is Claude Opus 4.6's ability to automatically adjust its reasoning depth based on task complexity. For simple tasks like formatting or classification, Opus uses minimal reasoning tokens. For complex tasks like multi-file code refactoring or mathematical proofs, Opus engages extended thinking chains that can use thousands of reasoning tokens. You can configure the thinking budget in OpenClaw to control this behavior — setting a higher budget allows deeper reasoning but costs more.

Is Claude Opus 4.6 worth the cost over cheaper models?

For complex agent tasks — multi-file code changes, long-document analysis, multi-step autonomous workflows — yes. Opus 4.6's 80.8% SWE-bench score means it resolves 4 out of 5 real software engineering tasks autonomously. Combined with the 1M context window (no truncation), 128K output (complete responses), and 90% caching savings (lower effective cost), Opus delivers the highest success rate per dollar on complex tasks. For simple tasks, cheaper models like DeepSeek V3.2 or GPT-5.4-mini are more cost-effective.


Further Reading