Yes, OpenClaw can sit in your repo, run your entire test suite, read the stack traces, patch the code, and push a green commit while you get coffee. This post documents the exact wiring I use in production for both JavaScript (Jest) and Python (pytest) projects. No glossy marketing—just what works, what breaks, and how to keep the loop from melting your CI bill.
Why bother? The real ROI of an autonomous testing loop
My team ships five micro-services and three React front-ends. On a bad day we spend 40 minutes per PR babysitting flaky tests or hunting obvious typos. Offloading that to OpenClaw freed two engineers per sprint—concrete numbers our finance folks could verify. If automated test remediation sounds like overkill, keep reading; the setup is mostly YAML and a couple of TypeScript/Python adapters.
Prerequisites and guard rails
- OpenClaw ≥ 0.42.1 (the first version with the test-runner action).
- Node 22+ (OpenClaw is written in Node).
- A GitHub token with contents:write and pull_requests:write scopes.
- A CI runner with at least 8 GB RAM (LLM context chewing is hungry).
- Clear budget limits in ClawCloud if you run the hosted agent—mine is capped at 400k tokens/day.
OpenClaw will happily recurse forever if you let it. Add two mandatory limits:
- maxIterations: number of fix-retest cycles before we give up (I default to 5).
- maxTokens: total LLM tokens per run (I cap at 100k).
Without those, a single flaky integration test can cost you a pizza budget.
Installing the test-runner capability
Local install is one line:
npm i -g openclaw@^0.42.1
Then scaffold the agent inside your repo root:
openclaw init --agent test-guardian
This creates .openclaw/config.yml plus a guardian.js (Node) or guardian.ts skeleton if you picked TypeScript. ClawCloud users hit the dashboard, click “New agent”, pick the repo, done—same config file is generated in the background.
Connecting OpenClaw to your CI workflow
I run GitHub Actions but the pattern is identical in GitLab or Buildkite.
GitHub Actions stub
name: OpenClaw autonomous tests
on:
pull_request:
paths-ignore:
- '**/*.md'
jobs:
test-fix:
runs-on: ubuntu-22.04
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # important for patching
- name: Setup Node 22
uses: actions/setup-node@v4
with:
node-version: 22
- name: Install deps
run: npm ci
- name: Run OpenClaw test guardian
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
OPENCLAW_API_KEY: ${{ secrets.OPENCLAW_KEY }}
run: openclaw run guardian --ci
The --ci flag tells OpenClaw to exit non-zero if it hits maxIterations without green tests, preventing false positives on the branch.
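The GitLab equivalent is a short translation. This sketch mirrors the Actions stub above; the job name, stage rules, and node:22 image are my choices, not OpenClaw defaults:

```yaml
openclaw-test-fix:
  image: node:22
  variables:
    GIT_DEPTH: "0"            # full history, needed for patching
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - npm ci
    - npm i -g openclaw@^0.42.1
    - openclaw run guardian --ci
```

Set GITHUB_TOKEN and OPENCLAW_API_KEY as masked CI/CD variables so they reach the job without landing in logs.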
Adapters for popular test frameworks
OpenClaw parses test reports via small “adapters”. Out of the box it ships with Mocha, Jest, and pytest adapters. For anything else, implement ITestAdapter yourself; it is 20 lines, tops.
1. Jest (React, Node)
Add this to .openclaw/config.yml:
agent:
name: test-guardian
language: node
adapters:
- name: jest
command: 'npm test -- --json --outputFile=report.json'
maxIterations: 5
maxTokens: 100000
Why the JSON output? Parsing colored console spew through regex was brittle; the adapter consumes report.json and extracts failing test names, file paths, and error messages.
The Jest adapter sequence:
- Runs the command.
- Reads report.json.
- Feeds the top N (default 5) failures into the LLM context.
- Generates a patch in git diff style.
- Applies the patch with git apply.
If the patch applies cleanly, we rerun tests immediately. Failures persist? Iterate.
2. pytest (Django, FastAPI, data-science)
Install the JSON reporter:
pip install pytest-json-report
Update config:
agent:
name: test-guardian
language: node
adapters:
- name: pytest
command: 'pytest --json-report --json-report-file=pytest.json'
memory:
provider: sqlite # still experimental but works
Under the hood the Python adapter is similar but also slaps the traceback into the context so the LLM can inspect variable values. That extra juice increases fix rate by ~12% in my benchmarks (Databricks notebooks, 180 tests).
3. Other frameworks
Go, Rust, PHP—anything dumping JUnit XML is one YAML line away. For Go, gotestsum converts the native output to JUnit XML:
- name: junit
command: 'gotestsum --junitfile go-tests.xml -- ./...'
Community members have published adapters for Maven, Vitest, RSpec. Check the adapters directory.
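If you do roll your own adapter, it is just a command plus a parse function. The interface shape below is my assumption of what ITestAdapter expects, not OpenClaw's documented API, and the Minitest reporter command is hypothetical:

```javascript
// Hypothetical custom adapter for Ruby's Minitest. The method names and
// the { ok, failures } return shape are illustrative assumptions.
class MinitestAdapter {
  // Command the agent shells out to; must produce machine-readable output.
  get command() {
    return 'rake test REPORTER=json > minitest.json';
  }

  // Turn the raw report into { ok, failures: [{ name, file, message }] }.
  parse(raw) {
    const report = JSON.parse(raw);
    const failures = (report.failures || []).map((f) => ({
      name: f.name,
      file: f.file,
      message: f.message,
    }));
    return { ok: failures.length === 0, failures };
  }
}
```

The whole contract is "run this, hand back structured failures", which is why 20 lines is usually enough.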
The analysis / patch loop in depth
OpenClaw exposes a built-in test-fix action, but I prefer explicit code because I can hot-patch prompts when things go sideways.
// guardian.ts
import { openai, git, tests } from 'openclaw';
export default async function guardian(ctx) {
const report = await tests.run(); // 1. execute adapter command
if (report.ok) {
ctx.log('🎉 All tests green, nothing to do.');
return;
}
for (let i = 0; i < ctx.config.maxIterations; i++) {
const failures = report.failures.slice(0, 5); // clip context
const patch = await openai.chat({
system: `You are an efficient senior dev...`,
user: `Here are failing tests: ${JSON.stringify(failures)}`
});
if (!patch.content.includes('diff')) {
ctx.error('Model did not return a patch');
break;
}
await git.apply(patch.content);
const retry = await tests.run();
if (retry.ok) {
ctx.log('✅ Fixed all failures after', i + 1, 'iterations');
await git.commit('chore: auto-fix tests via OpenClaw');
await git.push();
return;
}
report.failures = retry.failures; // tighten feedback loop
}
throw new Error('OpenClaw gave up before suite turned green');
}
Why not use the canned action? Having the loop in code lets me:
- Inject repo-specific lint rules before committing.
- Reject patches that touch > 10 files (too risky).
- Collect telemetry—token count, iteration counts—to a Prometheus push-gateway.
Dealing with flaky tests
The Achilles heel of autonomous fixing is nondeterminism. If a test fails 30% of the time, OpenClaw will “fix” it by adding sleep calls or, worse, by commenting out the assertion. I learned this the hard way.
Strategy 1: quarantine label
We tag unstable tests with [flaky] and update adapter filters:
command: 'npm test -- --testNamePattern="^((?!\\[flaky\\]).)*$" --json --outputFile=report.json'
Flakies still run in nightly pipelines but the autonomous loop ignores them.
Strategy 2: statistical reruns
Integrate jest-repeat or pytest-rerunfailures:
pytest --reruns 3 --only-rerun "*network*"
If a failure disappears on rerun the adapter marks it “non-actionable” and removes it from context.
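The rerun logic reduces to a set difference: anything that failed the first run but passed on rerun is flaky and gets dropped from context. A minimal sketch (the function name and failure shape are illustrative):

```javascript
// Split first-run failures into actionable (still failing after reruns)
// and non-actionable (passed on rerun, i.e. flaky).
function triageFailures(firstRunFailures, rerunFailures) {
  const stillFailing = new Set(rerunFailures.map((f) => f.name));
  const actionable = [];
  const flaky = [];
  for (const failure of firstRunFailures) {
    (stillFailing.has(failure.name) ? actionable : flaky).push(failure);
  }
  return { actionable, flaky };
}
```

Only the actionable bucket is worth spending LLM tokens on; the flaky bucket goes to the nightly pipeline instead.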
Strategy 3: min patch coverage
I added a heuristic: if the proposed patch touches code outside the failing test’s module, abort. 70% of flaky-fix attempts died here, which is safer than merging sleep-based workarounds.
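That heuristic can be sketched as a check over the unified diff the model returns. The path extraction here is deliberately simplistic (it only looks at "+++ b/" lines) and the function name is mine, not OpenClaw's:

```javascript
// Returns true only when every file touched by the unified diff lives
// under moduleDir. Touched files show up as "+++ b/<path>" lines.
function patchStaysInModule(patchText, moduleDir) {
  const prefix = moduleDir.endsWith('/') ? moduleDir : moduleDir + '/';
  const touched = patchText
    .split('\n')
    .filter((line) => line.startsWith('+++ b/'))
    .map((line) => line.slice('+++ b/'.length));
  return touched.length > 0 && touched.every((f) => f.startsWith(prefix));
}
```

A patch that fails this check is discarded before git apply ever runs.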
Iteration limits and cost controls
Even deterministic projects can send the agent into infinite loops—usually when the model misunderstands the domain (regex edge cases, floating point rounding, date/timezones). Guard rails I enforce:
- contextClip: truncate each stack trace to 30 lines.
- patchMaxLines: reject patches adding more than 150 LOC.
- failFastGlobs: end the loop if changes hit migrations/** or package.json.
- budget: run clawcloud budgets:set --daily $5 for hosted agents.
With these, our mean spend per PR is ~$0.42. The worst I have seen is $3.18 on a monster refactor.
Notifications and reporting back to humans
The loop feels magical only if the team knows what happened. We post status to Slack and annotate the PR.
// inside guardian.ts, after push()
await ctx.notify.slack({
channel: '#ci',
text: `OpenClaw fixed tests on ${ctx.git.sha.slice(0,7)}. All green.`
});
await ctx.github.createComment({
body: '✅ All tests passing after OpenClaw auto-fix. Review the diff and merge when ready.'
});
Engineers still review the diff, but they rarely have to touch anything.
Debugging the agent itself
When OpenClaw misbehaves, crank up verbosity:
OPENCLAW_DEBUG=1 openclaw run guardian
Common issues:
- Patch applies locally but fails CI: ensure you committed package-lock.json; the agent might have re-installed deps with different versions.
- Adapter command not found: CI images sometimes cache ./node_modules/.bin inconsistently; adding npx to the command fixes it.
- Model times out: lower contextClip or request a higher timeout in your ClawCloud plan.
Numbers: success rates and edge cases
Metrics after three months on two projects:
- Average iterations to green: 2.3 (Jest), 1.8 (pytest)
- Success on first attempt: 62%
- Cases hitting maxIterations: 7%
- Manual rollbacks required: 1 commit out of 318 (OpenClaw removed a security check, now blocked by failFastGlobs)
What doesn’t work:
- Large snapshot tests—OpenAI context limit explodes.
- Generated code (Protobufs, OpenAPI) where diff must be run through the generator; the agent tries to edit the result files instead.
- Database migrations with checksum hashes—it passes locally, breaks staging. Treat those as immutable.
Security considerations
Yes, an autonomous agent with git push access can torch your repo. Mitigations:
- Scope token to a bot user limited to the repository.
- Require a protected branch and manual PR review. The bot commits to an openclaw-fix branch; GitHub policies force a human merge.
- Use patchMaxLines to avoid sweeping changes.
- Pipe all commits through git-secrets to prevent leaking credentials accidentally exposed in prompts.
So far this has caught two instances where the model tried to copy plaintext Redis passwords from a test fixture into source. The run failed, alerting us immediately.
Practical takeaway
The wiring is straightforward: install OpenClaw, pick or build a test adapter, slap a YAML job in CI, and set strict iteration and cost limits. Your first green build without human intervention is eerie and addictive. The next step is obvious—extend the agent to ESLint fixes or Storybook snapshots, but that is a story for another post.