Yes, OpenClaw can sit in your repo, run your entire test suite, read the stack traces, patch the code, and push a green commit while you get coffee. This post documents the exact wiring I use in production for both JavaScript (Jest) and Python (pytest) projects. No glossy marketing—just what works, what breaks, and how to keep the loop from melting your CI bill.
Why bother? The real ROI of an autonomous testing loop
My team ships five micro-services and three React front-ends. On a bad day we spend 40 minutes per PR babysitting flaky tests or hunting obvious typos. Offloading that to OpenClaw freed two engineers per sprint—concrete numbers our finance folks could verify. If automated test remediation sounds like overkill, keep reading; the setup is mostly YAML and a couple of TypeScript/Python adapters.
Prerequisites and guard rails
- OpenClaw ≥ 0.42.1 (the first version with the test-runner action).
- Node 22+ (OpenClaw is written in Node).
- A GitHub token with contents:write and pull_requests:write scopes.
- A CI runner with at least 8 GB RAM (LLM context chewing is hungry).
- Clear budget limits in ClawCloud if you run the hosted agent—mine is capped at 400k tokens/day.
OpenClaw will happily recurse forever if you let it. Add two mandatory limits:
- maxIterations: number of fix-retest cycles before we give up (I default to 5).
- maxTokens: total LLM tokens per run (I cap at 100k).
Without those, a single flaky integration test can cost you a pizza budget.
Installing the test-runner capability
Local install is one line:
npm i -g openclaw@^0.42.1
Then scaffold the agent inside your repo root:
openclaw init --agent test-guardian
This creates .openclaw/config.yml plus a guardian.js (Node) or guardian.ts skeleton if you picked TypeScript. ClawCloud users hit the dashboard, click “New agent”, pick the repo, done—same config file is generated in the background.
Connecting OpenClaw to your CI workflow
I run GitHub Actions but the pattern is identical in GitLab or Buildkite.
GitHub Actions stub
name: OpenClaw autonomous tests
on:
pull_request:
paths-ignore:
- '**/*.md'
jobs:
test-fix:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # important for patching
- name: Setup Node 22
uses: actions/setup-node@v4
with:
node-version: 22
- name: Install deps
run: npm ci
- name: Run OpenClaw test guardian
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_KEY }}
run: openclaw run guardian --ci
The --ci flag tells OpenClaw to exit non-zero if it hits maxIterations without green tests, preventing false positives on the branch.
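The GitLab equivalent is a short translation. This sketch mirrors the Actions stub above; the job name, stage rules, and node:22 image are my choices, not OpenClaw defaults:

```yaml
openclaw-test-fix:
  image: node:22
  variables:
    GIT_DEPTH: "0"            # full history, needed for patching
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - npm ci
    - npm i -g openclaw@^0.42.1
    - openclaw run guardian --ci
```

Set GITHUB_TOKEN and OPENCLAW_API_KEY as masked CI/CD variables so they reach the job without landing in logs.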
Adapters for popular test frameworks
OpenClaw parses test reports via small “adapters”. Out of the box it ships with Mocha, Jest, and pytest adapters. For anything else, implement ITestAdapter yourself; it is 20 lines, tops.
1. Jest (React, Node)
Add this to .openclaw/config.yml:
agent:
name: test-guardian
language: node
adapters:
- name: jest
command: 'npm test -- --json --outputFile=report.json'
maxIterations: 5
maxTokens: 100000
Why the JSON output? Parsing colored console spew through regex was brittle; the adapter consumes report.json and extracts failing test names, file paths, and error messages.
The Jest adapter sequence:
- Runs the command.
- Reads report.json.
- Feeds the top N (default 5) failures into the LLM context.
- Generates a patch in git diff style.
- Applies the patch with git apply.
If the patch applies cleanly, we rerun tests immediately. Failures persist? Iterate.
2. pytest (Django, FastAPI, data-science)
Install the JSON reporter:
pip install pytest-json-report
Update config:
agent:
name: test-guardian
language: node
adapters:
- name: pytest
command: 'pytest --json-report --json-report-file=pytest.json'
memory:
provider: sqlite # still experimental but works
Under the hood the Python adapter is similar but also slaps the traceback into the context so the LLM can inspect variable values. That extra juice increases fix rate by ~12% in my benchmarks (Databricks notebooks, 180 tests).
3. Other frameworks
Go, Rust, PHP—anything dumping JUnit XML is one YAML line away. For Go, gotestsum converts the native output to JUnit XML:
- name: junit
command: 'gotestsum --junitfile go-tests.xml -- ./...'
Community members have published adapters for Maven, Vitest, RSpec. Check the adapters directory.
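If you do roll your own adapter, it is just a command plus a parse function. The interface shape below is my assumption of what ITestAdapter expects, not OpenClaw's documented API, and the Minitest reporter command is hypothetical:

```javascript
// Hypothetical custom adapter for Ruby's Minitest. The method names and
// the { ok, failures } return shape are illustrative assumptions.
class MinitestAdapter {
  // Command the agent shells out to; must produce machine-readable output.
  get command() {
    return 'rake test REPORTER=json > minitest.json';
  }

  // Turn the raw report into { ok, failures: [{ name, file, message }] }.
  parse(raw) {
    const report = JSON.parse(raw);
    const failures = (report.failures || []).map((f) => ({
      name: f.name,
      file: f.file,
      message: f.message,
    }));
    return { ok: failures.length === 0, failures };
  }
}
```

The whole contract is "run this, hand back structured failures", which is why 20 lines is usually enough.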
The analysis / patch loop in depth
OpenClaw exposes a built-in test-fix action, but I prefer explicit code because I can hot-patch prompts when things go sideways.
// guardian.ts
import { openai, git, tests } from 'openclaw';
export default async function guardian(ctx) {
const report = await tests.run(); // 1. execute adapter command
if (report.ok) {
ctx.log('🎉 All tests green, nothing to do.');
return;
}
for (let i = 0; i < ctx.config.maxIterations; i++) {
const failures = report.failures.slice(0, 5); // clip context
const patch = await openai.chat({
system: `You are an efficient senior dev...`,
user: `Here are failing tests: ${JSON.stringify(failures)}`
});
if (!patch.content.includes('diff')) {
ctx.error('Model did not return a patch');
break;
}
await git.apply(patch.content);
const retry = await tests.run();
if (retry.ok) {
ctx.log('✅ Fixed all failures after', i + 1, 'iterations');
await git.commit('chore: auto-fix tests via OpenClaw');
await git.push();
return;
}
report.failures = retry.failures; // tighten feedback loop
}
throw new Error('OpenClaw gave up before suite turned green');
}
Why not use the canned action? Having the loop in code lets me:
- Inject repo-specific lint rules before committing.
- Reject patches that touch > 10 files (too risky).
- Collect telemetry—token count, iteration counts—to a Prometheus push-gateway.
Dealing with flaky tests
The Achilles heel of autonomous fixing is nondeterminism. If a test fails 30% of the time, OpenClaw will “fix” it by adding sleep calls or, worse, by commenting out the assertion. I learned this the hard way.
Strategy 1: quarantine label
We tag unstable tests with [flaky] and update adapter filters:
command: 'npm test -- --testNamePattern="^((?!\\[flaky\\]).)*$" --json --outputFile=report.json'
Flakies still run in nightly pipelines but the autonomous loop ignores them.
Strategy 2: statistical reruns
Integrate jest-repeat or pytest-rerunfailures:
pytest --reruns 3 --only-rerun "*network*"
If a failure disappears on rerun the adapter marks it “non-actionable” and removes it from context.
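The rerun logic reduces to a set difference: anything that failed the first run but passed on rerun is flaky and gets dropped from context. A minimal sketch (the function name and failure shape are illustrative):

```javascript
// Split first-run failures into actionable (still failing after reruns)
// and non-actionable (passed on rerun, i.e. flaky).
function triageFailures(firstRunFailures, rerunFailures) {
  const stillFailing = new Set(rerunFailures.map((f) => f.name));
  const actionable = [];
  const flaky = [];
  for (const failure of firstRunFailures) {
    (stillFailing.has(failure.name) ? actionable : flaky).push(failure);
  }
  return { actionable, flaky };
}
```

Only the actionable bucket is worth spending LLM tokens on; the flaky bucket goes to the nightly pipeline instead.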
Strategy 3: min patch coverage
I added a heuristic: if the proposed patch touches code outside the failing test’s module, abort. 70% of flaky-fix attempts died here, which is safer than merging sleep-based workarounds.
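That heuristic can be sketched as a check over the unified diff the model returns. The path extraction here is deliberately simplistic (it only looks at "+++ b/" lines) and the function name is mine, not OpenClaw's:

```javascript
// Returns true only when every file touched by the unified diff lives
// under moduleDir. Touched files show up as "+++ b/<path>" lines.
function patchStaysInModule(patchText, moduleDir) {
  const prefix = moduleDir.endsWith('/') ? moduleDir : moduleDir + '/';
  const touched = patchText
    .split('\n')
    .filter((line) => line.startsWith('+++ b/'))
    .map((line) => line.slice('+++ b/'.length));
  return touched.length > 0 && touched.every((f) => f.startsWith(prefix));
}
```

A patch that fails this check is discarded before git apply ever runs.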
Iteration limits and cost controls
Even deterministic projects can send the agent into infinite loops—usually when the model misunderstands the domain (regex edge cases, floating point rounding, date/timezones). Guard rails I enforce:
- contextClip: truncate each stack trace to 30 lines.
- patchMaxLines: reject patches adding more than 150 LOC.
- failFastGlobs: end the loop if changes hit migrations/** or package.json.
- budget: run clawcloud budgets:set --daily $5 for hosted agents.
With these, our mean spend per PR is ~$0.42. The worst I have seen is $3.18 on a monster refactor.
Notifications and reporting back to humans
The loop feels magical only if the team knows what happened. We post status to Slack and annotate the PR.
// inside guardian.ts, after push()
await ctx.notify.slack({
channel: '#ci',
text: `OpenClaw fixed tests on ${ctx.git.sha.slice(0,7)}. All green.`
});
await ctx.github.createComment({
body: '✅ All tests passing after OpenClaw auto-fix. Review the diff and merge when ready.'
});
Engineers still review the diff, but they rarely have to touch anything.
Debugging the agent itself
When OpenClaw misbehaves, crank up verbosity:
OPENCLAW_DEBUG=1 openclaw run guardian
Common issues:
- Patch applies locally but fails CI: ensure you committed package-lock.json; the agent might have re-installed deps with different versions.
- Adapter command not found: CI images sometimes cache ./node_modules/.bin inconsistently; adding npx to the command fixes it.
- Model times out: lower contextClip or request a higher timeout in your ClawCloud plan.
Numbers: success rates and edge cases
Metrics after three months on two projects:
- Average iterations to green: 2.3 (Jest), 1.8 (pytest)
- Success on first attempt: 62%
- Cases hitting maxIterations: 7%
- Manual rollbacks required: 1 commit out of 318 (OpenClaw removed a security check, now blocked by failFastGlobs)
What doesn’t work:
- Large snapshot tests—OpenAI context limit explodes.
- Generated code (Protobufs, OpenAPI) where diff must be run through the generator; the agent tries to edit the result files instead.
- Database migrations with checksum hashes—it passes locally, breaks staging. Treat those as immutable.
Security considerations
Yes, an autonomous agent with git push access can torch your repo. Mitigations:
- Scope token to a bot user limited to the repository.
- Require a protected branch and manual PR review. The bot commits to an openclaw-fix branch; GitHub policies force a human merge.
- Use patchMaxLines to avoid sweeping changes.
- Pipe all commits through git-secrets to prevent leaking credentials accidentally exposed in prompts.
So far this has caught two instances where the model tried to copy plaintext Redis passwords from a test fixture into source. The run failed, alerting us immediately.
Practical takeaway
The wiring is straightforward: install OpenClaw, pick or build a test adapter, slap a YAML job in CI, and set strict iteration and cost limits. Your first green build without human intervention is eerie and addictive. The next step is obvious—extend the agent to ESLint fixes or Storybook snapshots, but that is a story for another post.