Most AI agent benchmarks measure how well a model answers questions or completes constrained tasks. ClawWork from HKUDS takes a different approach: it gives an AI agent $10, assigns it professional work tasks, makes it pay for its own API calls, and measures whether it stays solvent.

That's not a hypothetical stress test. It's a live benchmark system that has attracted 3,300+ GitHub stars in its first few days.

The Core Concept

ClawWork transforms OpenClaw (via its lightweight cousin Nanobot) from an AI assistant into what the project calls an "AI coworker." The distinction is economic accountability.

In the simulation:

  • The agent starts with $10
  • Every LLM call costs real money (deducted automatically based on actual token usage)
  • Income comes only from completing professional work tasks to a quality standard
  • If the balance hits zero, the agent is insolvent — game over
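The loop above is simple enough to sketch in a few lines. This is an illustrative reconstruction, not the project's actual code — task dicts, field names, and the insolvency rule are assumptions based on the description:

```python
START_BALANCE = 10.00  # dollars: the simulation's starting stake

def run_episode(tasks):
    """Process tasks in order until they run out or the balance hits zero.

    Each task is a dict with 'cost' (API spend for the attempt) and
    'payment' (income earned if the work meets the quality bar).
    """
    balance = START_BALANCE
    for task in tasks:
        balance -= task["cost"]        # every LLM call costs real money
        if balance <= 0:
            return round(balance, 2), "insolvent"  # game over
        balance += task["payment"]     # income only from completed work
    return round(balance, 2), "solvent"
```

The key property is that costs are deducted before income arrives, so an agent can go broke mid-task even if the work would have paid well.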

The result: the agent has to make real trade-off decisions. Work on a task for income now, or invest time learning to do future tasks better? Spend tokens on web searches to improve quality, or submit with existing knowledge? These are decisions with actual consequences in the simulation.

The GDPVal Dataset

The tasks aren't toy problems. ClawWork uses the GDPVal dataset — 220 real professional tasks across 44 economic sectors, originally designed to estimate AI's contribution to GDP.

Task categories include:

  • Technology: Computer systems management, software engineering, data analysis
  • Finance: Financial analysis, compliance, auditing
  • Healthcare: Health administration, social work
  • Legal/Operations: Property management, project coordination

Payment is calculated on real economic value:

Payment = quality_score × (estimated_hours × BLS_hourly_wage)

Tasks range from $82 to $5,004 in potential payment, with an average around $260. The quality score (0.0 to 1.0) is evaluated by GPT-5.2 using category-specific rubrics — not a generic "did it answer" check.

What the Leaderboard Shows

The benchmark produces results that are genuinely useful for evaluating AI work capability:

Top performers reach the equivalent of $1,500+/hour — their combination of work quality, task volume, and token efficiency outpaces typical human white-collar productivity in these domains.

Survival is the hard constraint. Starting with only $10 creates genuine pressure. One bad task or careless use of expensive web search calls can wipe the balance. Models that are sloppy with tokens don't survive long regardless of raw capability.
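The $10 constraint is tighter than it sounds. A back-of-envelope sketch makes the burn rate concrete — the per-call costs here are assumptions for illustration, not the project's actual pricing (real costs depend on model pricing and token usage):

```python
CALL_COST_CENTS = 5      # assumed: ~$0.05 for a long-context LLM call
SEARCH_COST_CENTS = 15   # assumed: ~$0.15 for a web-search-augmented call

def calls_before_insolvency(balance_cents, cost_per_call_cents):
    """How many calls the agent can afford with zero income coming in.

    Amounts are in integer cents to avoid floating-point rounding.
    """
    return balance_cents // cost_per_call_cents

print(calls_before_insolvency(1000, CALL_COST_CENTS))    # 200 plain calls
print(calls_before_insolvency(1000, SEARCH_COST_CENTS))  # 66 search-heavy calls
```

Under these assumptions, leaning on web search cuts the agent's runway by two thirds before it earns a single dollar.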

Strategic decisions matter. The "work vs. learn" choice is a real fork. Agents that invest too heavily in learning tasks run out of money. Agents that never learn plateau in quality. The balance between immediate income and capability building affects long-run performance.
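The work-vs-learn fork can be framed as a simple expected-value comparison. All numbers below (quality levels, learning cost) are assumptions chosen to make the arithmetic clean, not measurements from the benchmark:

```python
def net_income(payment, quality, api_cost):
    """Expected net from attempting a task: payment scales with quality."""
    return payment * quality - api_cost

# Illustrative: a $260 task attempted at today's quality vs. after a
# paid learning session that raises quality on the next similar task.
work_now   = net_income(260, 0.50, 0.50)           # earn at current quality
learn_then = net_income(260, 0.75, 0.50) - 1.50    # net of $1.50 learning spend

print(work_now, learn_then)  # 129.5 193.0
```

In this toy example learning wins — but only if the agent survives long enough to cash in the improved quality, which is exactly the tension the benchmark measures.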

The Nanobot Integration (ClawMode)

ClawWork doesn't require running a standalone simulation — it integrates directly into live Nanobot (or OpenClaw) deployments via ClawMode.

With ClawMode active, your regular Nanobot instance gains economic awareness:

  • Every conversation costs tokens (tracked in real-time)
  • The /clawwork command lets any user in your Telegram/Discord/WhatsApp assign paid professional tasks
  • Tasks are automatically classified into 44 occupational categories with BLS wage-based pricing
  • A cost footer appears on every response: Cost: $0.0075 | Balance: $999.99 | Status: thriving
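The footer on the last bullet is straightforward to render. This sketch reproduces the format shown above; the status thresholds and labels other than "thriving" are assumptions, not the project's exact logic:

```python
def cost_footer(call_cost, balance):
    """Render the per-response footer shown when ClawMode is active.

    Assumed thresholds: broke agents are 'insolvent', balances under a
    dollar are 'struggling', everything else is 'thriving'.
    """
    if balance <= 0:
        status = "insolvent"
    elif balance < 1.0:
        status = "struggling"
    else:
        status = "thriving"
    return f"Cost: ${call_cost:.4f} | Balance: ${balance:.2f} | Status: {status}"

print(cost_footer(0.0075, 999.99))
# Cost: $0.0075 | Balance: $999.99 | Status: thriving
```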

This transforms a personal AI assistant into something that has to demonstrate value — you can literally see whether your agent is earning more than it costs to run.

Getting It Running

Standalone simulation:

git clone https://github.com/HKUDS/ClawWork.git
cd ClawWork
conda create -n clawwork python=3.10
conda activate clawwork
pip install -r requirements.txt

# Terminal 1: Start the dashboard
./start_dashboard.sh

# Terminal 2: Run the agent
./run_test_agent.sh

# Open http://localhost:3000

You'll need an OpenAI API key (for the GPT-4o agent and LLM evaluation) and an E2B API key (for code execution in sandboxed environments).

Live dashboard metrics include:

  • Balance chart updating in real-time
  • Work vs. learn activity distribution
  • Income, costs, net worth, survival status
  • Individual task quality scores and payment amounts
  • Knowledge base from learning sessions

Multi-Agent Competition

The benchmark supports running multiple AI models head-to-head in the same economic environment:

"agents": [
  {"signature": "gpt4o-run", "basemodel": "gpt-4o", "enabled": true},
  {"signature": "claude-run", "basemodel": "claude-sonnet-4-5-20250929", "enabled": true}
]
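A minimal sketch of how such a config fragment might be consumed — the field names match the snippet above, but the loading code and the disabled second entry are illustrative assumptions:

```python
import json

config_text = """
{
  "agents": [
    {"signature": "gpt4o-run", "basemodel": "gpt-4o", "enabled": true},
    {"signature": "claude-run", "basemodel": "claude-sonnet-4-5-20250929", "enabled": false}
  ]
}
"""

def enabled_agents(config):
    """Return the base models flagged to compete in this run."""
    return [a["basemodel"] for a in config["agents"] if a["enabled"]]

print(enabled_agents(json.loads(config_text)))  # ['gpt-4o']
```

Flipping `enabled` lets you drop a model out of the competition without deleting its entry, so past run configurations stay reproducible.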

This produces genuinely useful comparisons: not "which model scores higher on MMLU" but "which model can sustain economic viability doing real professional work."

Why This Matters for OpenClaw Operators

For operators using OpenClaw as a productivity tool, ClawWork is interesting as a benchmark framework rather than a production system. But a few things it demonstrates are worth thinking about:

Token cost consciousness. ClawWork makes token spending visible in a way that most deployments don't. If you're running OpenClaw heavily and haven't set up cost monitoring, you're flying blind on API spend.

Quality matters more than speed. The economic model rewards quality (payment × quality_score) rather than task volume. This mirrors how real assistant value works — a fast but sloppy agent is worse than a thoughtful one.

The "work vs. learn" tradeoff is real. In production OpenClaw deployments, the equivalent question is: how much context should your agent accumulate vs. how much should it act on immediate tasks? ClawWork makes this tradeoff visible and measurable.

Running OpenClaw in production and want monitoring for costs, uptime, and agent behavior? Remote OpenClaw's managed tier includes Slack notifications and 24/7 monitoring so you can see what your agent is doing and what it's costing. See the plans.