Remote OpenClaw Blog

Multi-Agent Failure Handling: Error Recovery and Rollback in OpenClaw

Running a single OpenClaw agent is forgiving. If it hits an error, it stops, you check the logs, and you restart it. Running two, three, or ten agents together is a different story. When one agent fails in a multi-agent deployment, the consequences ripple outward: tasks pile up, downstream agents stall, and partial work creates inconsistencies that are harder to fix than the original error.

This guide covers the error handling patterns that keep multi-agent OpenClaw deployments running reliably. These are not theoretical concepts; they are configurations and strategies used by operators running production multi-agent setups right now.

Why Multi-Agent Errors Are Different

In a single-agent setup, errors are linear. Something breaks, the agent stops, you fix it. The blast radius is limited to that one agent's tasks.

Multi-agent deployments introduce dependency chains. Agent A processes incoming emails and hands structured summaries to Agent B, which creates calendar events and task items. Agent C monitors those task items and sends Telegram reminders. If Agent A crashes, Agents B and C have nothing to work with. If Agent B fails silently and produces malformed output, Agent C sends garbage reminders.

The core challenge is not that errors happen more frequently with multiple agents. It is that errors propagate in ways that are harder to predict and harder to diagnose after the fact. A failure at 2am in your email-processing agent might not surface until 8am when you notice your calendar is empty and your Telegram is silent.

This is why multi-agent error handling is not optional. It is the difference between a deployment that runs reliably and one that creates more problems than it solves.

Common Failure Modes

Before configuring error handling, you need to understand what actually breaks. Multi-agent OpenClaw deployments encounter five primary failure modes:

API rate limiting. When multiple agents share the same Claude or OpenAI API key, they compete for rate limits. Agent A might burn through the per-minute token allocation, causing Agents B and C to receive 429 errors for the next 60 seconds. This is the most common failure mode in multi-agent deployments and the easiest to prevent.

External service outages. Your agents depend on external services: Gmail, Google Calendar, Telegram, Notion, Slack. When one of these services goes down or responds slowly, agents waiting for responses can hang indefinitely without proper timeouts configured.

Inter-agent communication failures. When agents pass data to each other through shared state, message queues, or file-based handoffs, any interruption in that communication channel stalls the entire pipeline. Network issues, file permission problems, or queue backlogs all create this failure mode.

Resource exhaustion. Multiple agents running on the same VPS compete for CPU, RAM, and disk I/O. Under heavy load, one agent can starve others of resources, causing timeouts and crashes that look like application errors but are actually infrastructure problems.

Configuration drift. Over time, agents get updated independently. One agent might be running a newer version of a skill that outputs data in a different format, breaking the downstream agent that expects the old format. This is a slow-burn failure that is hard to catch without version pinning and integration tests.

Retry Policies and Backoff Strategies

The simplest error recovery mechanism is retrying the failed operation. But naive retries (immediately re-executing the same request) make most problems worse. If you are getting rate-limited, hammering the API faster is not going to help.

OpenClaw supports configurable retry policies in your agent configuration:

retry:
  max_attempts: 3
  backoff: exponential
  initial_delay: 2000
  max_delay: 30000
  retryable_errors:
    - rate_limit
    - timeout
    - connection_reset

This configuration tells the agent to retry up to 3 times, starting with a 2-second delay and doubling each time (2s, 4s, 8s), capping at 30 seconds. Only specific error types trigger retries; permanent errors like authentication failures or invalid input are not retried.

Exponential backoff with jitter is the recommended strategy for multi-agent setups. Adding randomness to the delay prevents multiple agents from retrying at exactly the same time and creating a thundering herd:

retry:
  backoff: exponential_jitter
  jitter_range: 500

This adds up to 500ms of random jitter to each retry delay, spreading out the retry attempts across your agent fleet.
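
The delay schedule these settings produce is easy to reason about. Here is a minimal Python sketch of the computation; the defaults mirror the config fields above (`initial_delay`, `max_delay`, `jitter_range`, all in milliseconds), but the helper itself is illustrative, not OpenClaw code:

```python
import random

def backoff_delay(attempt, initial_delay=2000, max_delay=30000, jitter_range=500):
    """Exponential backoff: initial_delay * 2^(attempt-1), capped at max_delay,
    plus a random jitter so agents don't all retry at the same instant."""
    base = min(initial_delay * (2 ** (attempt - 1)), max_delay)
    return base + random.uniform(0, jitter_range)

# Attempts 1..3 yield base delays of 2000, 4000, 8000 ms, each with up to 500 ms of jitter.
delays = [backoff_delay(n) for n in range(1, 4)]
```

The jitter matters most when several agents share one API key: without it, agents that failed together retry together and hit the rate limit together.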

For API rate limiting specifically, respect the Retry-After header that most APIs return with 429 responses. OpenClaw parses this header automatically when respect_retry_after: true is set in the retry configuration.

Circuit Breakers for External Services

Retries handle transient failures. Circuit breakers handle sustained failures. The pattern is borrowed from electrical engineering: when a service fails repeatedly, the circuit "opens" and the agent stops calling that service entirely for a cooldown period.

Without circuit breakers, an agent trying to reach a down service will burn through all its retry attempts for every single task, wasting API tokens and CPU cycles on calls that will never succeed. With a circuit breaker, the agent recognizes the pattern after a few failures and skips the broken service until it recovers.

Configure circuit breakers per integration:

circuit_breaker:
  gmail:
    failure_threshold: 5
    cooldown_seconds: 60
    half_open_requests: 2
  telegram:
    failure_threshold: 3
    cooldown_seconds: 30
    half_open_requests: 1

When the failure threshold is exceeded, the circuit opens. After the cooldown period, it enters a "half-open" state where it allows a small number of test requests through. If those succeed, the circuit closes and normal operation resumes. If they fail, the circuit opens again.

Circuit breakers are especially important in multi-agent setups because a single failing integration can create backpressure across the entire agent fleet. If three agents all depend on Gmail and Gmail goes down, those three agents can collectively exhaust your API rate limits just on retry attempts, starving your other agents of API access for completely unrelated tasks.
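
The closed/open/half-open state machine described above can be sketched in a few dozen lines of Python. This is an illustrative model of the transitions, using the same setting names as the config, not OpenClaw's internal implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker mirroring failure_threshold / cooldown_seconds /
    half_open_requests from the config above (illustrative sketch)."""

    def __init__(self, failure_threshold=5, cooldown_seconds=60, half_open_requests=2):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.half_open_requests = half_open_requests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.trial_successes = 0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "half_open"  # cooldown elapsed: let test requests through
                self.trial_successes = 0
                return True
            return False  # circuit open: skip the broken service
        return True  # closed or half_open

    def record_success(self):
        if self.state == "half_open":
            self.trial_successes += 1
            if self.trial_successes >= self.half_open_requests:
                self.state = "closed"  # test requests succeeded: service recovered
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"  # trip (or re-trip) the breaker
            self.opened_at = time.monotonic()
            self.failures = 0
```

Note that a single failure in the half-open state re-opens the circuit immediately; the breaker only closes again after the configured number of consecutive test successes.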

Rollback Patterns

Not every failed operation can simply be retried. Some operations are partially completed before failing: an agent might create a calendar event but crash before sending the confirmation message. Rollback patterns handle these partial-completion scenarios.

Checkpoint-based rollback. Before each critical operation, the agent saves a state checkpoint. If the operation fails, the agent (or an operator) can restore the previous checkpoint and retry from a known-good state.

checkpoints:
  enabled: true
  storage: file
  path: ./data/checkpoints/
  retention_hours: 48

Checkpoints are lightweight snapshots of the agent's working state: current task queue, in-progress items, and relevant context. They do not snapshot the entire system; they capture enough to resume or roll back a specific workflow.
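
With file storage, a checkpoint store reduces to "write a JSON snapshot before the critical operation, read the latest one back on failure". A minimal sketch under that assumption (the function names are hypothetical, not OpenClaw's API):

```python
import json
import time
from pathlib import Path

CHECKPOINT_DIR = Path("./data/checkpoints")  # matches the `path` setting above

def save_checkpoint(workflow_id, state):
    """Snapshot the working state (task queue, in-progress items, context)
    before a critical operation."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{workflow_id}-{int(time.time())}.json"
    path.write_text(json.dumps(state))
    return path

def restore_checkpoint(workflow_id):
    """Load the most recent checkpoint for a workflow, or None if none exists."""
    candidates = sorted(CHECKPOINT_DIR.glob(f"{workflow_id}-*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())
```

A retention job (the `retention_hours: 48` setting) would then simply delete snapshot files older than the cutoff.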

Saga pattern for multi-step workflows. When a workflow spans multiple agents, each step needs a corresponding compensating action that undoes its work. If step 3 of a 5-step workflow fails, the system executes compensating actions for steps 2 and 1 in reverse order.

Example: an agent workflow that processes an invoice involves (1) extracting data from the PDF, (2) creating an entry in the accounting system, (3) sending a Slack notification, and (4) updating the project tracker. If step 3 fails, the compensating action for step 2 deletes the accounting entry, and step 1's extraction data is discarded.

saga:
  invoice_processing:
    steps:
      - action: extract_invoice_data
        compensate: discard_extracted_data
      - action: create_accounting_entry
        compensate: delete_accounting_entry
      - action: send_slack_notification
        compensate: send_failure_notice
      - action: update_project_tracker
        compensate: revert_tracker_update

The saga pattern requires more upfront configuration but prevents the most dangerous outcome in multi-agent systems: partially completed workflows where some agents think the work is done and others think it never started.
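
The execute-forward, compensate-in-reverse logic can be sketched as a small executor. The structure follows the invoice example above; the executor itself is an illustration of the pattern, not OpenClaw's saga engine:

```python
class SagaStep:
    """One step of a saga: a forward action plus its compensating action."""
    def __init__(self, name, action, compensate):
        self.name = name
        self.action = action
        self.compensate = compensate

def run_saga(steps):
    """Execute steps in order; on failure, run the compensating actions
    for every completed step in reverse order."""
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()  # undo in reverse: last completed step first
            return False  # workflow rolled back
    return True  # all steps committed
```

The reverse ordering is the important detail: compensating step 2 may depend on step 3's work having already been undone, so the unwind must mirror the forward order exactly.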

Idempotency keys. For operations that interact with external services, attach a unique idempotency key to each request. If a retry sends the same request twice, the external service recognizes the duplicate and returns the original result instead of creating a duplicate entry. Most modern APIs (Stripe, Google Calendar, Notion) support idempotency keys.
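
The effect of an idempotency key can be demonstrated with a toy in-memory "service" that caches results by key. The function and cache here are hypothetical, purely to illustrate the dedup behavior real APIs implement server-side:

```python
import uuid

_seen = {}  # simulates the external service's idempotency cache

def create_event(payload, idempotency_key):
    """If this key was seen before, return the original result
    instead of creating a duplicate resource."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    result = {"id": str(uuid.uuid4()), **payload}  # "create" the resource
    _seen[idempotency_key] = result
    return result

key = str(uuid.uuid4())  # generate one key per logical operation, reuse it on retries
first = create_event({"title": "standup"}, key)
retry = create_event({"title": "standup"}, key)  # retry after a timeout: no duplicate
```

The key must be generated once per logical operation and reused across retries; generating a fresh key on each retry defeats the purpose entirely.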

Health Checks and Monitoring

Error recovery is reactive. Monitoring is proactive. A well-monitored multi-agent deployment catches problems before they cascade.

Every OpenClaw agent exposes a health endpoint at /api/health that returns the agent's current status, uptime, last successful task, and integration connection states. For multi-agent monitoring, configure a dedicated watchdog agent that polls the health endpoints of all other agents:

role: watchdog
monitor:
  agents:
    - name: email-agent
      health_url: http://localhost:3001/api/health
      check_interval: 30
    - name: calendar-agent
      health_url: http://localhost:3002/api/health
      check_interval: 30
    - name: task-agent
      health_url: http://localhost:3003/api/health
      check_interval: 30
  alerts:
    telegram:
      enabled: true
      chat_id: "your-chat-id"
    email:
      enabled: true
      to: "admin@yourdomain.com"

The watchdog agent sends alerts via Telegram or email when any agent fails a health check. It also tracks response times and can alert on degraded performance before an outright failure occurs.
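
The polling loop such a watchdog runs is straightforward. A minimal sketch with an injectable fetch function so the logic is testable; the `status: "ok"` field is an assumption about the health endpoint's JSON shape, not a documented contract:

```python
import json
import urllib.request

def http_fetch(url, timeout=5):
    """Fetch a health endpoint; returns (status code, parsed JSON body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read())

def check_agents(agents, fetch=http_fetch):
    """Poll each agent's health URL; return the ones that need an alert."""
    unhealthy = []
    for name, url in agents:
        try:
            status_code, body = fetch(url)
            if status_code != 200 or body.get("status") != "ok":
                unhealthy.append((name, body.get("status", "unknown")))
        except Exception as exc:  # connection refused, timeout, bad JSON
            unhealthy.append((name, str(exc)))
    return unhealthy
```

Whatever `check_agents` returns would then be fanned out to the configured Telegram and email alert channels.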

For structured logging, configure all agents to write to a central log location with consistent formatting. This makes it possible to trace a single task across multiple agents by correlation ID:

logging:
  format: json
  output: ./logs/agent-name.log
  level: info
  include_correlation_id: true

Detailed guidance on log aggregation and debugging is available in the OpenClaw logs and debugging guide.

Building Resilient Multi-Agent Workflows

Individual error handling patterns work best when combined into a coherent resilience strategy. Here is the approach that experienced multi-agent operators follow:

Layer 1: Timeouts on everything. Every external call, every inter-agent message, every file operation gets a timeout. No operation should be able to hang indefinitely. Default timeouts of 30 seconds for API calls and 10 seconds for local operations catch most hangs.

Layer 2: Retries with backoff for transient errors. Rate limits, connection resets, and temporary service unavailability are retried with exponential backoff and jitter. Permanent errors are not retried.

Layer 3: Circuit breakers for sustained outages. When retries are consistently failing, the circuit breaker opens and the agent stops wasting resources on a dead service.

Layer 4: Fallback routing. When a primary integration is circuit-broken, route tasks to a fallback. If Gmail is down, queue emails for later delivery. If the calendar API is unreachable, write events to a local file for manual reconciliation.

Layer 5: Dead letter queue. Tasks that fail all retries, circuit breakers, and fallbacks go to a dead letter queue for manual review. Nothing is silently dropped.

dead_letter:
  enabled: true
  storage: ./data/dead-letter/
  notify: true
  max_age_hours: 168

This five-layer approach handles everything from momentary hiccups to extended outages. The key principle is graceful degradation: the system should do as much work as possible even when parts of it are broken, and clearly surface what it could not complete.
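
The degradation order of layers 4 and 5 can be sketched as a tiny dispatcher: try the primary integration, fall back, and only then dead-letter. The handler names are hypothetical; the point is the ordering and the guarantee that nothing is dropped:

```python
def dispatch(task, primary, fallback, dead_letter):
    """Try the primary integration, then the fallback, then the dead letter queue."""
    for handler in (primary, fallback):
        try:
            return handler(task)
        except Exception:
            continue  # degrade gracefully to the next layer
    dead_letter(task)  # last resort: surfaced for manual review
    return None  # never silently dropped
```

In a real deployment each handler would already be wrapped in the timeout, retry, and circuit-breaker layers above, so `dispatch` only sees failures that have exhausted those mechanisms.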

For a deeper look at how agents coordinate and communicate in these resilience layers, see the multi-agent setup guide and the multi-agent configuration reference.


Frequently Asked Questions

What happens when one agent fails in a multi-agent OpenClaw setup?

By default, a single agent failure does not crash other agents in the deployment. However, without proper error handling configured, a failed agent can leave tasks in an incomplete state, cause downstream agents to stall waiting for input, or create data inconsistencies. Configuring health checks, retry policies, and fallback routing prevents these cascading issues.

How do I implement rollback in OpenClaw multi-agent workflows?

OpenClaw supports rollback through checkpoint-based state snapshots. Before each critical operation, the agent saves a state checkpoint. If the operation fails, you can restore the previous checkpoint either automatically via retry policy or manually through the admin interface. For multi-step workflows, use saga patterns where each step has a defined compensating action.

What is a circuit breaker in the context of OpenClaw agents?

A circuit breaker is a pattern that stops an agent from repeatedly calling a failing external service. When consecutive failures exceed a threshold (default: 5), the circuit opens and the agent skips that integration for a cooldown period (default: 60 seconds) before testing again. This prevents cascading failures and protects your API rate limits.

How do I monitor agent failures across a multi-agent OpenClaw deployment?

Use the built-in health endpoint at /api/health for each agent instance, combined with structured logging to a central collector. Most operators use a dedicated monitoring agent that watches the health endpoints of all other agents and sends alerts via Telegram or email when failures are detected. The logs and debugging guide covers the full monitoring setup.