If you pick the wrong model for an OpenClaw agent, you burn money, watch latency spike, or ship a bot that fumbles tools. A systematic benchmark is the only way to know which model actually works for your workload. The process is not complicated, but it needs discipline: fixed prompts, reproducible runs, and metrics that map to business value. Here is the methodology that’s kept our production agents honest since OpenClaw 0.24.1.

Why ad-hoc testing fails (and how a real benchmark fixes it)

Most teams open the playground, try three prompts, and call it a day. That approach misses:

  • Sample bias – you remember the one impressive answer, not the 10 mediocre ones.
  • Hidden cost – token usage is invisible until the invoice lands.
  • Tool reliability – a nice natural-language answer can still produce a broken shell command.
  • Regression risk – models and OpenClaw versions change weekly; without fixtures you can’t detect drift.

A benchmark harness solves these by running a fixed task suite against multiple models, logging every interaction, and spitting out a scorecard you can defend in a post-mortem.

Setting up a reproducible OpenClaw benchmarking harness

Project skeleton

Start with a clean repo; you don’t want benchmark hacks polluting your production agent.

```shell
mkdir claw-bench && cd claw-bench
npm init -y
npm install openclaw@0.24.1 dotenv csv-writer chalk
```

Create two folders:

  • scenarios/ – YAML fixtures that describe tasks
  • runs/ – auto-generated JSON logs and CSV summaries

Version pinning matters

Benchmark results die when a dependency updates under you. Pin everything via package-lock.json. For external APIs, lock the model name (gpt-4o-2024-05-13, claude-3-sonnet-20240229) and add a note if an endpoint silently upgrades.

Minimal runner script

bench.js is 90 lines; the core is:

```javascript
import { Gateway } from "openclaw";
import fs from "fs/promises";
import { performance } from "node:perf_hooks";

async function runScenario(scenario, model) {
  const gw = new Gateway({
    model,
    apiKey: process.env[model.toUpperCase().replace(/-/g, "_") + "_KEY"],
    log: false,
  });
  const start = performance.now();
  const result = await gw.ask(scenario.prompt, scenario.context);
  const end = performance.now();
  return {
    id: scenario.id,
    model,
    success: scenario.check(result),
    latencyMs: end - start,
    tokensIn: gw.stats.tokens.prompt,
    tokensOut: gw.stats.tokens.completion,
    toolCalls: gw.stats.tools.success + gw.stats.tools.error,
    toolErrors: gw.stats.tools.error,
    costUsd: gw.stats.cost,
    output: result.text,
  };
}
```

We call OpenClaw’s Gateway directly; benchmarking the daemon over HTTP picks up network noise, which hides the true model latency.

Defining tasks and success criteria

Throwing random prompts at the wall is useless. Each scenario needs:

  1. Deterministic prompt – string literals, no date stamps.
  2. Ground truth – a function that returns true/false.
  3. Tool manifest – list of tools the agent is allowed to use.

Example YAML:

```yaml
---
id: "email-reply"
prompt: |
  You are ReplyBot. Draft a polite response to the email below.
  EMAIL: "Can we move the meeting to 3pm?"
context: {}
check: |
  return /sure|sounds good|works for me/i.test(output);
requiredTools: []
importance: 5
```

Yes, the check field is a JS function string. The runner turns it into new Function('output', check). That hack lets you encode rich validators, including JSON schema assertions for tool calls.
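A minimal sketch of that compilation step, using the regex from the fixture above:

```javascript
// Compile a scenario's `check` string into a validator. The string is
// the raw function body from the YAML fixture; it receives the model
// output as `output` and must return a boolean.
function compileCheck(check) {
  return new Function("output", check);
}

// The email-reply fixture's check, compiled:
const emailCheck = compileCheck(
  'return /sure|sounds good|works for me/i.test(output);'
);
```

Anything expressible as a JS expression over `output` works, which is how the JSON-schema assertions for tool calls sneak in.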

Task weighting

End-users don’t give equal weight to every failure. The importance field (1-5) lets the scorecard amplify misses on critical paths. Fatal steps like sending money get a 5.

Capturing metrics: latency, token usage, cost, tool success

OpenClaw’s runtime exposes the numbers you need via gw.stats. I record seven metrics:

  • success – Boolean from the scenario checker.
  • latencyMs – wall clock, start to final token.
  • tokensIn / tokensOut – prompt vs completion.
  • costUsd – calculated with model-specific pricing.
  • toolCalls – how many times the agent invoked any tool.
  • toolErrors – OpenClaw marks a tool call invalid on a schema mismatch or an uncaught exception.

Store the raw log; you’ll need it when a model upgrade changes tokenisation.

Automating runs across vendors and local models

Add a comma-separated list of candidate models to .env:

```shell
MODELS="gpt-4o-2024-05-13,claude-3-sonnet-20240229,openrouter/mistral-8x22b,qwen:14b,local/llama-3-70b-instruct"
```

Tip: Standardise on the OpenAI chat completion schema. Most vendors (including OpenRouter, Perplexity) now provide an API-compatible shim, so the runner can swap host URLs via an environment map.

To include local models:

```shell
ollama pull llama3:70b-instruct
OLLAMA_BASE_URL=http://localhost:11434
```

Then add ollama/llama3:70b-instruct to the model list; the runner switches backends based on the prefix.
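The prefix switch can be as small as a lookup on the model name. A sketch — the host URLs below are illustrative defaults, not values OpenClaw ships with:

```javascript
// Resolve a base URL from the model-name prefix. Hosts here are
// placeholders; the real values live in your .env host map.
function resolveHost(model) {
  if (model.startsWith("ollama/")) {
    return process.env.OLLAMA_BASE_URL || "http://localhost:11434";
  }
  if (model.startsWith("openrouter/")) {
    return "https://openrouter.ai/api/v1";
  }
  return "https://api.openai.com/v1"; // default OpenAI-compatible endpoint
}
```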

Bash wrapper

The Node script parses process.argv. A minimal wrapper:

```shell
#!/usr/bin/env bash
for model in $(echo "$MODELS" | tr ',' ' '); do
  echo "=== $model ==="
  node bench.js --model "$model" --out "runs/$(date +%s)-$model.json"
done
node summarise.js runs/*.json > "runs/summary-$(date +%F).csv"
```

Put it on a nightly cron so you spot regressions before your users do.

Scoring and comparing: the evaluation scorecard template

Raw CSV is fine for the machine, not for the weekly eng review. I render a Markdown table with weighted scores.

Formula

Total Score = Σ(importance * success) − Penalties

Penalties:

  • latencyMs > 3000 → −1 point / 500 ms over
  • costUsd > scenario.budget → −2 points
  • toolErrors > 0 → −3 points per error
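Wired together, the rules above look like this (a sketch; `budget` is the per-scenario field mentioned earlier, and the result shape matches runScenario):

```javascript
// Score one result row: importance-weighted success minus penalties.
function scoreResult(r, budget) {
  let score = (r.importance || 1) * (r.success ? 1 : 0);
  if (r.latencyMs > 3000) {
    // −1 point per started 500 ms over the 3 s budget
    score -= Math.ceil((r.latencyMs - 3000) / 500);
  }
  if (r.costUsd > budget) score -= 2;
  score -= 3 * (r.toolErrors || 0);
  return score;
}
```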

Export code:

```javascript
import createCsvWriter from "csv-writer";

function summarise(results) {
  const byModel = {};
  results.forEach(r => {
    byModel[r.model] ??= { score: 0, tasks: 0, cost: 0, latency: 0 };
    const w = r.importance || 1;
    byModel[r.model].score += w * (r.success ? 1 : 0);
    byModel[r.model].latency += r.latencyMs;
    byModel[r.model].cost += r.costUsd;
    byModel[r.model].tasks += 1;
  });
  return Object.entries(byModel).map(([model, stats]) => ({
    model,
    avgScore: (stats.score / stats.tasks).toFixed(2),
    avgLatency: (stats.latency / stats.tasks).toFixed(0),
    totalCost: stats.cost.toFixed(4),
  }));
}
```

Template scorecard (CSV):

```
model,avgScore,avgLatencyMs,totalCostUsd
"gpt-4o-2024-05-13",4.8,612,0.0412
"claude-3-sonnet-20240229",4.5,734,0.0338
"mistral-8x22b",4.1,933,0.0194
"llama3-70b-local",3.6,1185,0.0000
```

Paste into the company wiki; the numbers speak for themselves.

Example results from last Friday

We ran 55 scenarios representing our Slack support bot, data extraction tasks, and shell automation.

  • GPT-4o solved 53/55, median latency 580 ms, cost $0.044.
  • Claude Sonnet failed two regex-heavy extraction steps; latency variance higher (p95 1.4 s).
  • Mistral-8x22b via OpenRouter matched quality on chit-chat but hallucinated a GitHub label name 3/10 times.
  • Local Llama-3 70B was free, but shell tool failure rate 18 % (OpenClaw’s function call schema is strict and local models mis-fill args).

The takeaway for us: GPT-4o remains the default for high-stakes routes; Mistral handles low-priority summarisation; Llama-3 runs offline for weekend outages. Numbers back the choice, not gut feel.

Operational tips to keep the benchmark cheap and honest

Seed your random

Set process.env.OPENAI_RANDOM_SEED=42. Some APIs let you pin the random seed; if they don’t, run each scenario 5× and average.
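When repetition is the only option, collapse the repeats into one averaged row before scoring. A sketch, assuming the result shape from runScenario:

```javascript
// Average N repeated runs of the same scenario into a single row so
// one lucky (or unlucky) completion can't skew the scorecard.
function averageRuns(runs) {
  const mean = key => runs.reduce((sum, r) => sum + r[key], 0) / runs.length;
  return {
    successRate: runs.filter(r => r.success).length / runs.length,
    latencyMs: mean("latencyMs"),
    costUsd: mean("costUsd"),
  };
}
```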

Throttle parallelism

Concurrency hides true latency and can trigger vendor rate limits. I cap at --max 3 concurrent requests.
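The cap is a plain promise pool — no library needed. A sketch of the limiter behind that flag (argument parsing elided):

```javascript
// Run async task factories with at most `limit` in flight, preserving
// result order. Keeps measured latency honest and vendors happy.
async function pool(tasks, limit = 3) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: limit }, () => worker()));
  return results;
}
```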

Snapshot prices

Vendors love silent price cuts. I store a pricing.json with the numbers used for the cost calculation so historical comparisons stay valid.
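Cost then comes from the pinned table, never from a live endpoint. A sketch — the rates below are placeholders, not real vendor prices:

```javascript
// Pinned per-model pricing (USD per 1M tokens). Placeholder numbers;
// the real table lives in pricing.json under version control.
const pricing = {
  "example-model": { inPerMTok: 5.0, outPerMTok: 15.0 },
};

function costUsd(model, tokensIn, tokensOut) {
  const p = pricing[model];
  if (!p) throw new Error(`no pinned price for ${model}`);
  return (tokensIn * p.inPerMTok + tokensOut * p.outPerMTok) / 1e6;
}
```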

Validate tool outputs offline

Don’t execute the shell or HTTP request in benchmark mode; use a mock that checks schema only. You want to measure model fidelity, not network flakiness.
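A mock that only checks argument names and types is enough — the schema convention below is our own, not an OpenClaw API:

```javascript
// Benchmark-mode tool stub: validate the agent's arguments against a
// {name: typeof-string} schema without executing anything.
function mockTool(schema) {
  return args => {
    for (const [name, type] of Object.entries(schema)) {
      if (typeof args[name] !== type) {
        return { ok: false, error: `arg ${name}: expected ${type}` };
      }
    }
    return { ok: true };
  };
}

// Stand-in for the real shell tool during benchmark runs
const shellTool = mockTool({ command: "string", timeoutMs: "number" });
```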

Label artefacts

Checksum the scenario YAML and write it into the JSON log. When the PM asks “did anything change?”, you have proof.
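Node’s crypto module does the fingerprinting in a few lines — a sketch:

```javascript
import { createHash } from "node:crypto";

// Fingerprint the scenario fixtures so each JSON log records exactly
// which YAML produced it.
function checksum(yamlText) {
  return createHash("sha256").update(yamlText).digest("hex");
}
```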

Next step: plug the harness into CI

Put node bench.js --ci in GitHub Actions. Fail the build if the primary model’s regression exceeds a threshold (-0.2 average score or +20 % token usage). That’s how you catch a vendor silently shipping a lower-quality model alias.
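The gate itself is a one-function diff against a stored baseline. A sketch — the field names are assumptions layered on the summary row, and totalTokens is a column you’d add to summarise():

```javascript
// Fail CI when the primary model regresses past the thresholds:
// avgScore drops by more than 0.2, or token usage grows more than 20%.
function regressed(baseline, current) {
  const scoreDrop = baseline.avgScore - current.avgScore;
  const tokenGrowth =
    (current.totalTokens - baseline.totalTokens) / baseline.totalTokens;
  return scoreDrop > 0.2 || tokenGrowth > 0.2;
}
```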

Benchmarks aren’t glamorous, but they save real money and user trust. Steal this harness, adapt the YAML for your own agent, and share your numbers in the #benchmarks channel on the OpenClaw Discord. The community needs more public data.