The short version: GPT-4o is cheaper and snappier for high-volume OpenClaw agents; Claude 3 feels more deliberate and polite but breaks tools more often. Below is the long version with numbers, logs, and the scripts I used. Skip ahead if you just want prices, latency, or sample outputs.
Why benchmark GPT-4o and Claude 3 inside OpenClaw at all?
OpenClaw shipped in 2022 with Anthropic’s Claude as the default because Peter Steinberger liked its longer context window and the friendlier safety layer. Fast-forward to 2024: Peter joined OpenAI, GPT-4o dropped its price by 50 %, and many community bots started timing out on Claude during peak US hours. Meanwhile ClawCloud’s pay-as-you-go billing made the model line-item impossible to ignore. So I re-ran my daily workload against both models.
Test setup and methodology
Everything runs on the latest gateway (v0.42.1) and daemon (v0.42.0) with Node 22.3. I spun up two nearly identical agents:
- gopher-4o → uses OpenAI `gpt-4o-mini` for standard calls and `gpt-4o` for coding/tool reasoning.
- gopher-claude → uses Anthropic `claude-3-opus-20240229` for everything.
Each agent was asked to execute 500 daily tasks collected from our internal Slack:
- Write or reply to an email (Gmail tool, 120 samples).
- Generate or refactor code snippets (shell + editor, 100).
- Summarise a meeting transcript and push to Notion (80).
- Create event in Google Calendar based on a chat (60).
- Browser search & scrape a competitor’s pricing (50).
- Write a short social post (Telegram + X, 90).
All calls used temperature 0.2 and max_tokens=2048. Logs were streamed into ClickHouse for later crunching.
Script fragment:

```js
import { Agent } from "@openclaw/sdk";

export async function run(model) {
  const agent = new Agent({
    model,
    memory: "redis://claw-mem:6379",
    tools: [
      "composio/gmail.send",
      "composio/notion.page.create",
      "browser.open",
      "shell.exec"
    ]
  });
  return agent.chat("Draft a polite follow-up for yesterday’s meeting");
}
```
Tool-use reliability
Most OpenClaw agents live or die on whether the LLM calls the right tool with the right parameters. I counted a success if `tool.call` resolved without a validation error and the human reviewer clicked 👍.
- GPT-4o: 92 % success across all tasks.
- Claude 3: 83 % success. The gap widened on complex multi-step tasks (browser → scrape → summarise).
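The success criterion above is just a predicate over log records. A minimal sketch; the record shape (`validationError`, `reviewerApproved`) is my own assumption for illustration, not an SDK type:

```js
// Hypothetical log-record shape: { validationError: boolean, reviewerApproved: boolean }
function isSuccess(record) {
  return !record.validationError && record.reviewerApproved === true;
}

// Percentage of records that pass both checks.
function successRate(records) {
  if (records.length === 0) return 0;
  const passed = records.filter(isSuccess).length;
  return (100 * passed) / records.length;
}
```

The same predicate runs fine as a ClickHouse aggregation once the logs land there; I keep the JS version around for spot checks.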
Two specific pain points for Claude:
- Sometimes adds superfluous JSON keys the Composio schema does not expect.
- Refuses shell commands that include `rm`, even with a sandboxed FS.
OpenAI’s new JSON mode is real: when I forced `gpt-4o` to respond with a strict schema, the error rate dropped to 4 %.
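For reference, this is roughly how I constrain the output. A sketch of a Chat Completions request body: `response_format: { type: "json_object" }` guarantees syntactically valid JSON, but the exact keys still have to be spelled out in the prompt, so I embed the schema there. The schema keys here are illustrative, not from the real agent:

```js
// Illustrative schema: keys the downstream tool expects.
const schemaHint = JSON.stringify({ to: "string", subject: "string", body: "string" });

// Request body for OpenAI's Chat Completions API with JSON mode enabled.
const requestBody = {
  model: "gpt-4o",
  temperature: 0.2,
  response_format: { type: "json_object" },
  messages: [
    { role: "system", content: `Reply ONLY with a JSON object matching: ${schemaHint}` },
    { role: "user", content: "Draft a polite follow-up for yesterday's meeting" },
  ],
};
```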
Cost per task (models & tokens)
I paid list prices (May 2024): GPT-4o at $5.00/1M input tokens, $15/1M output; Claude 3 Opus at $15/$75. Average token counts below.
| Task | Avg prompt | Avg completion | Total tokens |
|---|---|---|---|
| Email draft | 700 | 380 | 1,080 |
| Code refactor | 1,200 | 890 | 2,090 |
| Transcript → summary | 3,800 | 1,100 | 4,900 |
Doing the math:
- Email draft (GPT-4o): (0.000005 × 700) + (0.000015 × 380) ≈ $0.0092. (Claude): (0.000015 × 700) + (0.000075 × 380) ≈ $0.0390.
- Transcript summary: ≈ $0.0355 (4o) vs ≈ $0.1395 (Claude).
Across all 500 tasks GPT-4o cost me $10.84, Claude 3 $33.11. Multiply by 30 days and the difference is a junior engineer’s salary in Berlin.
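The per-task arithmetic is simple enough to wrap in a helper, using the May-2024 list prices quoted above (USD per 1M tokens):

```js
// May-2024 list prices, USD per 1M tokens.
const PRICES = {
  "gpt-4o": { input: 5, output: 15 },
  "claude-3-opus-20240229": { input: 15, output: 75 },
};

// Cost of a single task in USD given its token counts.
function taskCostUSD(model, promptTokens, completionTokens) {
  const p = PRICES[model];
  return (promptTokens * p.input + completionTokens * p.output) / 1e6;
}

taskCostUSD("gpt-4o", 700, 380);                  // → 0.0092
taskCostUSD("claude-3-opus-20240229", 700, 380);  // → 0.039
```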
Personality and conversational quality
This is squishy, so I blind-scored tone, empathy, and hallucination rate with two coworkers. Tone and empathy use a 1–5 scale (higher is better); the hallucination and refusal figures below are percentages (lower is better).
- Empathy: Claude 4.7, GPT-4o 4.3
- Hallucinations in factual answers: Claude 0.7 %, GPT-4o 1.2 %
- Over-correction / refusal when safe: Claude 6 %, GPT-4o 2 %
Claude still feels like the friend who listens before replying. GPT-4o answers faster but sometimes steamrollers nuance. For front-of-house customer support I’d lean Claude; for quick Slack helpers I prefer 4o.
Coding ability (Node & Python)
I fed each model ten GitHub issues from our OSS repo asking for either a new feature or a tight bug fix. Agents pushed PRs through the GitHub tool and the CI pipeline fired.
- Merged without human edits: GPT-4o 6/10, Claude 3 4/10.
- Compilation/runtime failures: GPT-4o 1, Claude 3 3.
Interesting detail: GPT-4o sometimes writes terser commits (`fix: null check`) while Claude adds a 5-line commit body. Personal taste, but it matters for logs.
Example diff GPT-4o produced:

```diff
--- a/src/cache.ts
+++ b/src/cache.ts
@@
-export function get(key: string) {
-  return memory[key];
-}
+export function get(key: string): string | undefined {
+  return Object.hasOwn(memory, key) ? memory[key] : undefined;
+}
```
Email drafting quality
We asked the agents to write a polite but firm discount-denial email. Human PMs rated “Would you send this as is?”
- Claude 3: 41/50 yes votes
- GPT-4o: 38/50 yes votes
Common GPT-4o gripe: it drifts into American punctuation and spelling even after being told "British English". Trivial, but real. Claude respected the locale more consistently.
Latency and concurrency inside the gateway
Average end-to-end latency (agent.receive → final tool success):
- GPT-4o: 5.4 s (p95 8.1 s)
- Claude 3: 7.9 s (p95 13.4 s)
Not drastic, but when you chain 4–5 calls the difference compounds. Concurrency was 20 parallel jobs; Anthropic throttled a few bursts with 429s, while OpenAI silently queued but delivered.
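Those 429 bursts are easy to absorb with an exponential-backoff wrapper. A sketch, assuming the underlying call throws an error object carrying a numeric `status` field (exactly how the official SDKs surface HTTP errors varies, so treat that as an assumption):

```js
// Retry `call` with exponential backoff on HTTP 429, rethrowing anything else.
async function withBackoff(call, { retries = 4, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Only retry rate-limit errors, and only up to `retries` times.
      if (err.status !== 429 || attempt >= retries) throw err;
      const delay = baseMs * 2 ** attempt; // 500ms, 1s, 2s, 4s…
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

I wrap only the Anthropic adapter with this; the OpenAI side never needed it during the test window.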
Steinberger now at OpenAI: does it matter?
The elephant in the room: Peter built OpenClaw on top of Claude, but his badge now says OpenAI. He’s publicly said model neutrality still matters, and the community governance file was merged last month to keep hooks generic. In practice the @openclaw/sdk defaults changed: "model": "gpt-4o-mini" ships in 0.43.0. If you deploy to ClawCloud and never touch config you’ll be on OpenAI starting this week.
For folks worried about lock-in, both model adapters live under src/providers. Swapping is a single line:
```js
const model = process.env.LLM === "claude" ? "claude-3-opus-20240229" : "gpt-4o";
```
In short: the politics changed the default, not the API surface. If Anthropic drops prices again, switching back is easy.
Recommendation matrix
| Use case | Pick | Why |
|---|---|---|
| High-volume internal automations | GPT-4o | 3× cheaper, better tool JSON |
| Customer-facing email or chat | Claude 3 | More empathic tone, fewer hallucinations |
| Code-writing agents | GPT-4o | Higher pass rate, faster |
| Long-document reasoning (>100k tokens) | Claude 3* | 200k context window vs GPT-4o’s 128k |
*as of May 2024. If OpenAI opens 200k window this flips.
What I’m running in production
My ClawCloud workspace now defaults to GPT-4o for all automation flows. Specific channels where brand voice matters (@support Slack and help@ inbox) still point to Claude. The switch saved roughly $700 last month without user complaints.
If you only read one thing: try GPT-4o first, but keep the provider flag in a .env. Models move fast, bills stick around.