Remote OpenClaw Blog
How to Write Tests for Your OpenClaw Skills
8 min read
Publishing an untested OpenClaw skill is like shipping a library without a test suite. It might work on your machine, but you have no way to know if it works for everyone else, or if it will keep working after the next OpenClaw update. Testing skills is different from testing traditional software, but the principles are the same: verify behavior, catch regressions, and build confidence before you ship.
This guide covers the full testing stack for OpenClaw skills: unit tests for instruction parsing, integration tests for agent behavior, prompt-level tests for output quality, and CI setup to run it all automatically.
The OpenClaw Testing Framework
OpenClaw ships with a built-in testing framework designed specifically for skills. Install it alongside the CLI:
npm install -g @openclaw/skill-test
Or add it as a dev dependency in your skill project:
npm install --save-dev @openclaw/skill-test
The testing framework provides three types of tests:
- Unit tests — Validate skill metadata, structure, and instruction parsing
- Integration tests — Run prompts through the agent with the skill loaded and check outputs
- Prompt tests — Assert specific patterns in agent responses using snapshot and pattern matching
Project Structure for a Testable Skill
A well-structured skill project looks like this:
my-skill/
├── skill.toml          # Skill metadata and configuration
├── instructions.md     # The actual skill instructions
├── examples/           # Code examples included in the skill
│   ├── good-pattern.ts
│   └── bad-pattern.ts
├── tests/
│   ├── unit/
│   │   ├── metadata.test.ts
│   │   └── structure.test.ts
│   ├── integration/
│   │   ├── basic-prompt.test.ts
│   │   └── edge-cases.test.ts
│   └── prompts/
│       ├── prompts.test.ts
│       └── snapshots/
│           └── create-component.snap.md
├── package.json
└── README.md
Unit Tests: Validating Skill Structure
Unit tests verify that your skill is well-formed without needing to run an agent. They are fast, free (no API calls), and catch structural issues early.
Testing Metadata
// tests/unit/metadata.test.ts
import { describe, it, expect } from "vitest";
import { loadSkillManifest } from "@openclaw/skill-test";

describe("skill metadata", () => {
  const manifest = loadSkillManifest("./skill.toml");

  it("has a valid name", () => {
    expect(manifest.name).toMatch(/^[a-z][a-z0-9-]+$/);
    expect(manifest.name.length).toBeLessThanOrEqual(50);
  });

  it("has a description under 200 characters", () => {
    expect(manifest.description.length).toBeLessThanOrEqual(200);
    expect(manifest.description.length).toBeGreaterThan(0);
  });

  it("specifies compatible openclaw versions", () => {
    expect(manifest.compatibility).toBeDefined();
    expect(manifest.compatibility.openclaw).toMatch(/^>=\d+\.\d+\.\d+/);
  });

  it("declares its dependencies", () => {
    expect(manifest.dependencies).toBeDefined();
    // No circular dependencies
    expect(manifest.dependencies.required).not.toContain(manifest.name);
  });

  it("has at least three keywords for discoverability", () => {
    expect(manifest.keywords.length).toBeGreaterThanOrEqual(3);
  });
});
Testing Instruction Structure
// tests/unit/structure.test.ts
import { describe, it, expect } from "vitest";
import { loadInstructions, analyzeInstructions } from "@openclaw/skill-test";

describe("instruction structure", () => {
  const instructions = loadInstructions("./instructions.md");
  const analysis = analyzeInstructions(instructions);

  it("stays within token limits", () => {
    expect(analysis.tokenCount).toBeLessThan(12000);
  });

  it("has clear section headers", () => {
    expect(analysis.sections.length).toBeGreaterThan(0);
    for (const section of analysis.sections) {
      expect(section.heading).toBeTruthy();
    }
  });

  it("includes at least one code example", () => {
    expect(analysis.codeBlocks.length).toBeGreaterThan(0);
  });

  it("does not contain placeholder text", () => {
    expect(instructions).not.toMatch(/TODO/i);
    expect(instructions).not.toMatch(/FIXME/i);
    expect(instructions).not.toMatch(/\[insert/i);
  });

  it("uses specific instructions, not vague ones", () => {
    // Flag overly vague instructions
    const vaguePatterns = [
      /write clean code/i,
      /follow best practices/i,
      /use good naming/i,
    ];
    for (const pattern of vaguePatterns) {
      expect(instructions).not.toMatch(pattern);
    }
  });
});
These tests run in milliseconds and catch the most common skill quality issues: missing metadata, oversized instructions, placeholder text, and vague instructions that agents tend to ignore.
Integration Tests: Verifying Agent Behavior
Integration tests send prompts to an actual agent with your skill loaded and verify the output. These require an API key and incur token costs, so use them selectively.
Basic Integration Test
// tests/integration/basic-prompt.test.ts
import { describe, it, expect } from "vitest";
import { createTestAgent } from "@openclaw/skill-test";

describe("basic skill behavior", () => {
  const agent = createTestAgent({
    skills: ["./"], // Load the skill from the current directory
    model: "default",
    timeout: 30000,
  });

  it("follows the primary instruction", async () => {
    const response = await agent.prompt(
      "Create a React component that displays a user profile"
    );
    // The skill instructs the agent to always use TypeScript
    expect(response.code).toContain(": React.FC");
    // The skill instructs the agent to include prop types
    expect(response.code).toMatch(/interface\s+\w+Props/);
  });

  it("includes error handling as instructed", async () => {
    const response = await agent.prompt(
      "Create a function that fetches user data from an API"
    );
    expect(response.code).toContain("try");
    expect(response.code).toContain("catch");
    // The skill specifies custom error types
    expect(response.code).toMatch(/throw new \w+Error/);
  });

  it("does not generate patterns the skill prohibits", async () => {
    const response = await agent.prompt(
      "Create a component with state management"
    );
    // The skill prohibits class components
    expect(response.code).not.toContain("class ");
    expect(response.code).not.toContain("extends React.Component");
  });
});
Testing Edge Cases
// tests/integration/edge-cases.test.ts
import { describe, it, expect } from "vitest";
import { createTestAgent } from "@openclaw/skill-test";

describe("edge cases", () => {
  const agent = createTestAgent({
    skills: ["./"],
    model: "default",
    timeout: 30000,
  });

  it("handles ambiguous prompts according to skill defaults", async () => {
    const response = await agent.prompt("Create a button");
    // Skill should default to accessible, styled button
    expect(response.code).toContain("aria-");
    expect(response.code).toContain("className");
  });

  it("maintains behavior when combined with common skills", async () => {
    const agentWithChain = createTestAgent({
      skills: ["./", "typescript-strict-mode"],
      model: "default",
      timeout: 30000,
    });
    const response = await agentWithChain.prompt(
      "Create a data fetching hook"
    );
    // Both skills should be reflected
    expect(response.code).toContain("as const"); // strict mode
    expect(response.code).toMatch(/interface\s+/); // our skill
  });
});
Controlling Cost
Integration tests call the API, which means token costs. Keep them manageable:
const agent = createTestAgent({
  skills: ["./"],
  model: "default",
  maxTokens: 500, // Cap response length
  temperature: 0, // Deterministic output for consistent tests
  timeout: 15000,
});
Setting temperature: 0 makes responses more deterministic, which reduces flaky tests. Capping maxTokens keeps responses short and costs low. You do not need a full implementation to verify that the skill's instructions were followed — a partial response usually reveals enough.
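The point about partial responses follows from how the assertions work: they match patterns, not complete programs, so they hold even on output that was cut off by the token cap. A hypothetical helper (not part of @openclaw/skill-test) makes the idea concrete:

```typescript
// Hypothetical sketch: even a response truncated mid-function still
// reveals whether the skill's conventions were applied. The marker
// patterns here mirror the assertions used in the integration tests.
export function followsSkillConventions(code: string): boolean {
  // Typed props indicate the TypeScript instruction was followed
  const hasTypedProps = /interface\s+\w+Props/.test(code);
  // A try block indicates the error-handling instruction was followed
  const hasErrorHandling = /\btry\b/.test(code);
  return hasTypedProps || hasErrorHandling;
}
```

A truncated snippet like `interface UserProps { name: str` already passes the check, so a 500-token cap rarely costs you signal.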
Prompt Tests: Snapshot and Pattern Matching
Prompt tests sit between unit and integration tests. They run prompts through the agent and compare responses against saved snapshots or pattern rules.
Snapshot Testing
// tests/prompts/prompts.test.ts
import { describe, it } from "vitest";
import { createTestAgent, snapshotTest } from "@openclaw/skill-test";

describe("prompt snapshots", () => {
  const agent = createTestAgent({
    skills: ["./"],
    temperature: 0,
  });

  it("matches snapshot for component creation", async () => {
    await snapshotTest(agent, {
      prompt: "Create a card component with title, description, and image",
      snapshotFile: "./tests/prompts/snapshots/create-component.snap.md",
      matchMode: "structural", // Compare structure, not exact text
    });
  });
});
The structural match mode compares the shape of the response — does it have the same imports, the same component structure, similar prop types — without requiring character-for-character matches. This makes snapshots resilient to minor wording changes between agent versions.
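As a mental model (not the framework's actual implementation), structural matching can be thought of as reducing each response to its "shape" and comparing shapes instead of raw text. A minimal sketch:

```typescript
// Hypothetical sketch of structural comparison. We reduce a code
// sample to a sorted list of its imports and declared interface
// names, then compare the lists. Wording and comments are ignored.
function shapeOf(code: string): string[] {
  const imports = [...code.matchAll(/^import .* from "([^"]+)";?$/gm)].map(
    (m) => `import:${m[1]}`
  );
  const interfaces = [...code.matchAll(/interface\s+(\w+)/g)].map(
    (m) => `interface:${m[1]}`
  );
  return [...imports, ...interfaces].sort();
}

// Two responses with different wording but the same structure match.
export function structurallyEqual(a: string, b: string): boolean {
  const [sa, sb] = [shapeOf(a), shapeOf(b)];
  return sa.length === sb.length && sa.every((v, i) => v === sb[i]);
}
```

Two renderings of the same card component, one with comments and one without, compare equal under this scheme even though a string diff would flag dozens of changes.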
Pattern Rule Testing
For more flexible assertions, use pattern rules:
import { describe, it } from "vitest";
import { createTestAgent, patternTest } from "@openclaw/skill-test";

describe("pattern rules", () => {
  const agent = createTestAgent({ skills: ["./"], temperature: 0 });

  it("always includes accessibility attributes", async () => {
    await patternTest(agent, {
      prompts: [
        "Create a button component",
        "Create a form with email input",
        "Create a modal dialog",
        "Create a navigation menu",
      ],
      rules: [
        { pattern: /aria-/, message: "Missing ARIA attributes" },
        { pattern: /role=/, message: "Missing role attribute", optional: true },
      ],
      passThreshold: 0.75, // At least 75% of prompts must pass all rules
    });
  });
});
Pattern tests run multiple prompts and check each response against a set of rules. The passThreshold accounts for the inherent variability in agent responses — requiring 100% pass rates leads to flaky tests.
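The thresholding logic amounts to something like the following sketch (hypothetical; the real evaluation lives inside patternTest). A prompt passes when every non-optional rule matches its response, and the suite passes when the fraction of passing prompts meets the threshold:

```typescript
// Hypothetical sketch of pass-threshold evaluation. Optional rules
// are informational and never fail a response.
interface Rule {
  pattern: RegExp;
  message: string;
  optional?: boolean;
}

export function passRate(responses: string[], rules: Rule[]): number {
  const passed = responses.filter((response) =>
    rules.every((rule) => rule.optional || rule.pattern.test(response))
  );
  return passed.length / responses.length;
}
```

With the rules above, three accessible responses out of four yields a rate of 0.75, which just clears the example's threshold; a fifth failing response would drop it to 0.6 and fail the test.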
CI Pipeline Setup
Automate your tests in CI to catch regressions on every commit. Here is a GitHub Actions workflow:
# .github/workflows/skill-tests.yml
name: Skill Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:unit

  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests # Only run if unit tests pass
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:integration
        env:
          OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}
          OPENCLAW_TEST_MODE: "true" # Enables cost tracking

  prompt-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - run: npm run test:prompts
        env:
          OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_API_KEY }}
Add corresponding scripts to package.json:
{
  "scripts": {
    "test": "vitest run",
    "test:unit": "vitest run tests/unit/",
    "test:integration": "vitest run tests/integration/",
    "test:prompts": "vitest run tests/prompts/"
  }
}
Cost Management in CI
Integration and prompt tests cost money. Control spending with these strategies:
- Run unit tests on every commit, integration tests only on PRs to main. Unit tests are free and fast. Gate expensive tests behind branch rules.
- Use the test mode flag. OPENCLAW_TEST_MODE=true enables cost tracking and sets conservative token limits.
- Cache agent responses. The test framework supports response caching:

  const agent = createTestAgent({
    skills: ["./"],
    cache: true, // Cache responses by prompt hash
    cacheDir: "./tests/.cache",
  });

  Cached responses are used for identical prompts, eliminating repeat API calls. Clear the cache when you update your skill to force fresh responses.
- Set a monthly budget. In your OpenClaw dashboard, set a spending alert for your test API key. If your CI runs too many integration tests, you will know before the bill arrives.
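Conceptually, caching "by prompt hash" means the cache key is a digest of the prompt, and presumably of anything else that changes the answer, such as the model setting. A minimal sketch of that derivation (hypothetical; not the framework's code):

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache-key derivation. A real implementation would
// likely also hash the skill contents so that editing instructions.md
// invalidates old entries automatically instead of requiring a
// manual cache clear.
export function cacheKey(prompt: string, model = "default"): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}
```

Identical prompts map to identical keys, so a repeated CI run hits the cache; any change to the prompt or model produces a new key and a fresh API call.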
Testing Best Practices
Test behaviors, not exact output. Agent responses vary between runs. Assert that the response contains specific patterns (TypeScript interfaces, error handling, accessibility attributes) rather than expecting exact code.
Keep integration tests focused. Each test should verify one skill behavior. A test that checks for TypeScript types, error handling, accessibility, and performance optimization all at once is impossible to debug when it fails.
Update snapshots intentionally. When you change your skill's instructions, some snapshots will break. Review the diffs to confirm the new behavior is what you intended, then update: npx openclaw-test update-snapshots.
Test skill interactions. If your skill is commonly used alongside specific other skills, write integration tests that load both. Catching interaction bugs before users report them builds trust in your skill.
Version your test expectations. When OpenClaw releases a new agent version, some tests may need updating. Tag your test configurations with the OpenClaw version they target so you can update systematically.
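One lightweight way to tag configurations, sketched here as a convention rather than a framework feature (the version number is a made-up example), is to pin the targeted version in the test config and surface a mismatch instead of silently running stale assertions:

```typescript
// Hypothetical convention: record the OpenClaw version these test
// expectations were written against.
export const TEST_CONFIG = {
  targetOpenClawVersion: "2.3.0", // example value, not a real release
} as const;

export function versionMatches(runtimeVersion: string): boolean {
  // Compare major.minor only; patch releases should not break tests.
  const [tMaj, tMin] = TEST_CONFIG.targetOpenClawVersion.split(".");
  const [rMaj, rMin] = runtimeVersion.split(".");
  return tMaj === rMaj && tMin === rMin;
}
```

A test run can then warn (or skip the expensive suites) when the runtime version has drifted past what the snapshots and pattern rules were written for.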
A tested skill is a trustworthy skill. Users on the Bazaar can see test coverage badges on skill listings, and skills with tests consistently rank higher in search results. The investment in testing pays back in downloads, ratings, and fewer bug reports from the community.