Remote OpenClaw Blog
How to Write Tests for Your OpenClaw Skills
8 min read
Publishing an untested OpenClaw skill is like shipping a library without a test suite. It might work on your machine, but you have no way to know if it works for everyone else, or if it will keep working after the next OpenClaw update. Testing skills is different from testing traditional software, but the principles are the same: verify behavior, catch regressions, and build confidence before you ship.
This guide covers the full testing stack for OpenClaw skills: unit tests for instruction parsing, integration tests for agent behavior, prompt-level tests for output quality, and CI setup to run it all automatically.
The OpenClaw Testing Framework
OpenClaw ships with a built-in testing framework designed specifically for skills. Install it alongside the CLI:
npm install -g @openclaw/skill-test
Or add it as a dev dependency in your skill project:
npm install --save-dev @openclaw/skill-test
The testing framework provides three types of tests:
- Unit tests — Validate skill metadata, structure, and instruction parsing
- Integration tests — Run prompts through the agent with the skill loaded and check outputs
- Prompt tests — Assert specific patterns in agent responses using snapshot and pattern matching
Project Structure for a Testable Skill
A well-structured skill project looks like this:
my-skill/
├── skill.toml          # Skill metadata and configuration
├── instructions.md     # The actual skill instructions
├── examples/           # Code examples included in the skill
│   ├── good-pattern.ts
│   └── bad-pattern.ts
├── tests/
│   ├── unit/
│   │   ├── metadata.test.ts
│   │   └── structure.test.ts
│   ├── integration/
│   │   ├── basic-prompt.test.ts
│   │   └── edge-cases.test.ts
│   └── prompts/
│       ├── prompts.test.ts
│       └── snapshots/
│           └── create-component.snap.md
├── package.json
└── README.md
Unit Tests: Validating Skill Structure
Unit tests verify that your skill is well-formed without needing to run an agent. They are fast, free (no API calls), and catch structural issues early.
Testing Metadata
// tests/unit/metadata.test.ts
import { describe, it, expect } from "vitest";
import { loadSkillManifest } from "@openclaw/skill-test";

describe("skill metadata", () => {
  const manifest = loadSkillManifest("./skill.toml");

  it("has a valid name", () => {
    expect(manifest.name).toMatch(/^[a-z][a-z0-9-]+$/);
    expect(manifest.name.length).toBeLessThanOrEqual(50);
  });

  it("has a description under 200 characters", () => {
    expect(manifest.description.length).toBeLessThanOrEqual(200);
    expect(manifest.description.length).toBeGreaterThan(0);
  });

  it("specifies compatible openclaw versions", () => {
    expect(manifest.compatibility).toBeDefined();
    expect(manifest.compatibility.openclaw).toMatch(/^>=\d+\.\d+\.\d+/);
  });

  it("declares its dependencies", () => {
    expect(manifest.dependencies).toBeDefined();
    // No circular dependencies
    expect(manifest.dependencies.required).not.toContain(manifest.name);
  });

  it("has at least three keywords for discoverability", () => {
    expect(manifest.keywords.length).toBeGreaterThanOrEqual(3);
  });
});
Testing Instruction Structure
// tests/unit/structure.test.ts
import { describe, it, expect } from "vitest";
import { loadInstructions, analyzeInstructions } from "@openclaw/skill-test";

describe("instruction structure", () => {
  const instructions = loadInstructions("./instructions.md");
  const analysis = analyzeInstructions(instructions);

  it("stays within token limits", () => {
    expect(analysis.tokenCount).toBeLessThan(12000);
  });

  it("has clear section headers", () => {
    expect(analysis.sections.length).toBeGreaterThan(0);
    for (const section of analysis.sections) {
      expect(section.heading).toBeTruthy();
    }
  });

  it("includes at least one code example", () => {
    expect(analysis.codeBlocks.length).toBeGreaterThan(0);
  });

  it("does not contain placeholder text", () => {
    expect(instructions).not.toMatch(/TODO/i);
    expect(instructions).not.toMatch(/FIXME/i);
    expect(instructions).not.toMatch(/\[insert/i);
  });

  it("uses specific instructions, not vague ones", () => {
    // Flag overly vague instructions
    const vaguePatterns = [
      /write clean code/i,
      /follow best practices/i,
      /use good naming/i,
    ];
    for (const pattern of vaguePatterns) {
      expect(instructions).not.toMatch(pattern);
    }
  });
});
These tests run in milliseconds and catch the most common skill quality issues: missing metadata, oversized instructions, placeholder text, and vague instructions that agents tend to ignore.
Integration Tests: Verifying Agent Behavior
Integration tests send prompts to an actual agent with your skill loaded and verify the output. These require an API key and incur token costs, so use them selectively.
Basic Integration Test
// tests/integration/basic-prompt.test.ts
import { describe, it, expect } from "vitest";
import { createTestAgent } from "@openclaw/skill-test";

describe("basic skill behavior", () => {
  const agent = createTestAgent({
    skills: ["./"], // Load the skill from the current directory
    model: "default",
    timeout: 30000,
  });

  it("follows the primary instruction", async () => {
    const response = await agent.prompt(
      "Create a React component that displays a user profile"
    );
    // The skill instructs the agent to always use TypeScript
    expect(response.code).toContain(": React.FC");
    // The skill instructs the agent to include prop types
    expect(response.code).toMatch(/interface\s+\w+Props/);
  });

  it("includes error handling as instructed", async () => {
    const response = await agent.prompt(
      "Create a function that fetches user data from an API"
    );
    expect(response.code).toContain("try");
    expect(response.code).toContain("catch");
    // The skill specifies custom error types
    expect(response.code).toMatch(/throw new \w+Error/);
  });

  it("does not generate patterns the skill prohibits", async () => {
    const response = await agent.prompt(
      "Create a component with state management"
    );
    // The skill prohibits class components
    expect(response.code).not.toContain("class ");
    expect(response.code).not.toContain("extends React.Component");
  });
});
Testing Edge Cases
// tests/integration/edge-cases.test.ts
import { describe, it, expect } from "vitest";
import { createTestAgent } from "@openclaw/skill-test";

describe("edge cases", () => {
  const agent = createTestAgent({
    skills: ["./"],
    model: "default",
    timeout: 30000,
  });

  it("handles ambiguous prompts according to skill defaults", async () => {
    const response = await agent.prompt("Create a button");
    // Skill should default to accessible, styled button
    expect(response.code).toContain("aria-");
    expect(response.code).toContain("className");
  });

  it("maintains behavior when combined with common skills", async () => {
    const agentWithChain = createTestAgent({
      skills: ["./", "typescript-strict-mode"],
      model: "default",
      timeout: 30000,
    });
    const response = await agentWithChain.prompt(
      "Create a data fetching hook"
    );
    // Both skills should be reflected
    expect(response.code).toContain("as const"); // strict mode
    expect(response.code).toMatch(/interface\s+/); // our skill
  });
});
Controlling Cost
Integration tests call the API, which means token costs. Keep them manageable:
const agent = createTestAgent({
  skills: ["./"],
  model: "default",
  maxTokens: 500, // Cap response length
  temperature: 0, // Deterministic output for consistent tests
  timeout: 15000,
});
Setting temperature: 0 makes responses more deterministic, which reduces flaky tests. Capping maxTokens keeps responses short and costs low. You do not need a full implementation to verify that the skill's instructions were followed — a partial response usually reveals enough.
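The point about partial responses follows from how the assertions work: they match patterns, not complete programs, so they hold even on output that was cut off by the token cap. A hypothetical helper (not part of @openclaw/skill-test) makes the idea concrete:

```typescript
// Hypothetical sketch: even a response truncated mid-function still
// reveals whether the skill's conventions were applied. The marker
// patterns here mirror the assertions used in the integration tests.
export function followsSkillConventions(code: string): boolean {
  // Typed props indicate the TypeScript instruction was followed
  const hasTypedProps = /interface\s+\w+Props/.test(code);
  // A try block indicates the error-handling instruction was followed
  const hasErrorHandling = /\btry\b/.test(code);
  return hasTypedProps || hasErrorHandling;
}
```

A truncated snippet like `interface UserProps { name: str` already passes the check, so a 500-token cap rarely costs you signal.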
Prompt Tests: Snapshot and Pattern Matching
Prompt tests sit between unit and integration tests. They run prompts through the agent and compare responses against saved snapshots or pattern rules.
Snapshot Testing
// tests/prompts/prompts.test.ts
import { describe, it } from "vitest";
import { createTestAgent, snapshotTest } from "@openclaw/skill-test";

describe("prompt snapshots", () => {
  const agent = createTestAgent({
    skills: ["./"],
    temperature: 0,
  });

  it("matches snapshot for component creation", async () => {
    await snapshotTest(agent, {
      prompt: "Create a card component with title, description, and image",
      snapshotFile: "./tests/prompts/snapshots/create-component.snap.md",
      matchMode: "structural", // Compare structure, not exact text
    });
  });
});
The structural match mode compares the shape of the response — does it have the same imports, the same component structure, similar prop types — without requiring character-for-character matches. This makes snapshots resilient to minor wording changes between agent versions.
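As a mental model (not the framework's actual implementation), structural matching can be thought of as reducing each response to its "shape" and comparing shapes instead of raw text. A minimal sketch:

```typescript
// Hypothetical sketch of structural comparison. We reduce a code
// sample to a sorted list of its imports and declared interface
// names, then compare the lists. Wording and comments are ignored.
function shapeOf(code: string): string[] {
  const imports = [...code.matchAll(/^import .* from "([^"]+)";?$/gm)].map(
    (m) => `import:${m[1]}`
  );
  const interfaces = [...code.matchAll(/interface\s+(\w+)/g)].map(
    (m) => `interface:${m[1]}`
  );
  return [...imports, ...interfaces].sort();
}

// Two responses with different wording but the same structure match.
export function structurallyEqual(a: string, b: string): boolean {
  const [sa, sb] = [shapeOf(a), shapeOf(b)];
  return sa.length === sb.length && sa.every((v, i) => v === sb[i]);
}
```

Two renderings of the same card component, one with comments and one without, compare equal under this scheme even though a string diff would flag dozens of changes.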
Pattern Rule Testing
For more flexible assertions, use pattern rules:
import { describe, it } from "vitest";
import { createTestAgent, patternTest } from "@openclaw/skill-test";

describe("pattern rules", () => {
  const agent = createTestAgent({ skills: ["./"], temperature: 0 });

  it("always includes accessibility attributes", async () => {
    await patternTest(agent, {
      prompts: [
        "Create a button component",
        "Create a form with email input",
        "Create a modal dialog",
        "Create a navigation menu",
      ],
      rules: [
        { pattern: /aria-/, message: "Missing ARIA attributes" },
        { pattern: /role=/, message: "Missing role attribute", optional: true },
      ],
      passThreshold: 0.75, // At least 75% of prompts must pass all rules
    });
  });
});
Pattern tests run multiple prompts and check each response against a set of rules. The passThreshold accounts for the inherent variability in agent responses — requiring 100% pass rates leads to flaky tests.
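The thresholding logic amounts to something like the following sketch (hypothetical; the real evaluation lives inside patternTest). A prompt passes when every non-optional rule matches its response, and the suite passes when the fraction of passing prompts meets the threshold:

```typescript
// Hypothetical sketch of pass-threshold evaluation. Optional rules
// are informational and never fail a response.
interface Rule {
  pattern: RegExp;
  message: string;
  optional?: boolean;
}

export function passRate(responses: string[], rules: Rule[]): number {
  const passed = responses.filter((response) =>
    rules.every((rule) => rule.optional || rule.pattern.test(response))
  );
  return passed.length / responses.length;
}
```

With the rules above, three accessible responses out of four yields a rate of 0.75, which just clears the example's threshold; a fifth failing response would drop it to 0.6 and fail the test.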
CI Pipeline Setup
Automate your tests in CI to catch regressions on every commit. Here is a GitHub Actions workflow:
# .github/workflows/skill-tests.yml
name: Skill Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests # Only run if unit tests pass
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration
        env:
          OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}
          OPENCLAW_TEST_MODE: "true" # Enables cost tracking

  prompt-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:prompts
        env:
          OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}
Add corresponding scripts to package.json:
{
  "scripts": {
    "test": "vitest run",
    "test:unit": "vitest run tests/unit/",
    "test:integration": "vitest run tests/integration/",
    "test:prompts": "vitest run tests/prompts/"
  }
}
Cost Management in CI
Integration and prompt tests cost money. Control spending with these strategies:
- Run unit tests on every commit, integration tests only on PRs to main. Unit tests are free and fast. Gate expensive tests behind branch rules.
- Use the test mode flag. OPENCLAW_TEST_MODE=true enables cost tracking and sets conservative token limits.
- Cache agent responses. The test framework supports response caching:

  const agent = createTestAgent({
    skills: ["./"],
    cache: true, // Cache responses by prompt hash
    cacheDir: "./tests/.cache",
  });

  Cached responses are used for identical prompts, eliminating repeat API calls. Clear the cache when you update your skill to force fresh responses.
- Set a monthly budget. In your OpenClaw dashboard, set a spending alert for your test API key. If your CI runs too many integration tests, you will know before the bill arrives.
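Conceptually, caching "by prompt hash" means the cache key is a digest of the prompt, and presumably of anything else that changes the answer, such as the model setting. A minimal sketch of that derivation (hypothetical; not the framework's code):

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key derivation. A real implementation would
// likely also hash the skill contents so that editing instructions.md
// invalidates old entries automatically instead of requiring a
// manual cache clear.
export function cacheKey(prompt: string, model = "default"): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}
```

Identical prompts map to identical keys, so a repeated CI run hits the cache; any change to the prompt or model produces a new key and a fresh API call.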
Testing Best Practices
Test behaviors, not exact output. Agent responses vary between runs. Assert that the response contains specific patterns (TypeScript interfaces, error handling, accessibility attributes) rather than expecting exact code.
Keep integration tests focused. Each test should verify one skill behavior. A test that checks for TypeScript types, error handling, accessibility, and performance optimization all at once is impossible to debug when it fails.
Update snapshots intentionally. When you change your skill's instructions, some snapshots will break. Review the diffs to confirm the new behavior is what you intended, then update: npx openclaw-test update-snapshots.
Test skill interactions. If your skill is commonly used alongside specific other skills, write integration tests that load both. Catching interaction bugs before users report them builds trust in your skill.
Version your test expectations. When OpenClaw releases a new agent version, some tests may need updating. Tag your test configurations with the OpenClaw version they target so you can update systematically.
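One lightweight way to tag configurations, sketched here as a convention rather than a framework feature (the version number is a made-up example), is to pin the targeted version in the test config and surface a mismatch instead of silently running stale assertions:

```typescript
// Hypothetical convention: record the OpenClaw version these test
// expectations were written against.
export const TEST_CONFIG = {
  targetOpenClawVersion: "2.3.0", // example value, not a real release
} as const;

export function versionMatches(runtimeVersion: string): boolean {
  // Compare major.minor only; patch releases should not break tests.
  const [tMaj, tMin] = TEST_CONFIG.targetOpenClawVersion.split(".");
  const [rMaj, rMin] = runtimeVersion.split(".");
  return tMaj === rMaj && tMin === rMin;
}
```

A test run can then warn (or skip the expensive suites) when the runtime version has drifted past what the snapshots and pattern rules were written for.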
A tested skill is a trustworthy skill. Users on the Bazaar can see test coverage badges on skill listings, and skills with tests consistently rank higher in search results. The investment in testing pays back in downloads, ratings, and fewer bug reports from the community.