The short version: GPT-4o is cheaper and snappier for high-volume OpenClaw agents; Claude 3 feels more deliberate and polite but breaks tool calls more often. Below is the long version with numbers, logs, and the scripts I used. Skip ahead if you just want prices, latency, or sample outputs.

Why benchmark GPT-4o and Claude 3 inside OpenClaw at all?

OpenClaw shipped in 2022 with Anthropic’s Claude as the default because Peter Steinberger liked its longer context window and the friendlier safety layer. Fast-forward to 2024: Peter joined OpenAI, GPT-4o dropped its price by 50 %, and many community bots started timing out on Claude during peak US hours. Meanwhile ClawCloud’s pay-as-you-go billing made the model line-item impossible to ignore. So I re-ran my daily workload against both models.

Test setup and methodology

Everything runs on the latest gateway (v0.42.1) and daemon (v0.42.0) with Node 22.3. I spun up two nearly identical agents:

  • gopher-4o → uses OpenAI gpt-4o-mini for standard calls and gpt-4o for coding/tool reasoning.
  • gopher-claude → uses Anthropic claude-3-opus-20240229 for everything.

Each agent was asked to execute 500 daily tasks collected from our internal Slack:

  1. Write or reply to an email (Gmail tool, 120 samples).
  2. Generate or refactor code snippets (shell + editor, 100).
  3. Summarise a meeting transcript and push to Notion (80).
  4. Create event in Google Calendar based on a chat (60).
  5. Browser search & scrape a competitor’s pricing (50).
  6. Write a short social post (Telegram + X, 90).

All calls used temperature 0.2 and max_tokens=2048. Logs were streamed into ClickHouse for later crunching.

Script fragment

import { Agent } from "@openclaw/sdk";

export async function run(model) {
  const agent = new Agent({
    model,
    memory: "redis://claw-mem:6379",
    tools: [
      "composio/gmail.send",
      "composio/notion.page.create",
      "browser.open",
      "shell.exec",
    ],
  });
  return agent.chat("Draft a polite follow-up for yesterday’s meeting");
}

Tool-use reliability

Most OpenClaw agents live or die on whether the LLM calls the right tool with the right parameters. I counted a success if tool.call resolved without a validation error and the human reviewer clicked 👍.

  • GPT-4o: 92 % success across all tasks.
  • Claude 3: 83 % success. The gap widened on complex multi-step tasks (browser → scrape → summarise).

Two specific pain points for Claude:

  • Sometimes adds superfluous JSON keys the Composio schema does not expect.
  • Refuses shell commands that include rm, even with a sandbox FS.

OpenAI’s JSON mode makes a real difference: when I forced gpt-4o to respond with a strict schema, the validation error rate dropped to 4 %.
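The superfluous-key failure mode is also cheap to guard against on the gateway side, regardless of model. A minimal sketch of the check I mean — the `ToolSchema` shape here is illustrative, not Composio's actual schema format:

```typescript
// Reject tool calls whose arguments carry keys the schema doesn't declare,
// or that are missing a required key. Illustrative schema shape only.
type ToolSchema = { required: string[]; optional?: string[] };

function validateArgs(
  schema: ToolSchema,
  args: Record<string, unknown>,
): string[] {
  const allowed = new Set([...schema.required, ...(schema.optional ?? [])]);
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!(key in args)) errors.push(`missing required key: ${key}`);
  }
  for (const key of Object.keys(args)) {
    if (!allowed.has(key)) errors.push(`unexpected key: ${key}`);
  }
  return errors; // empty array ⇒ the call passes validation
}
```

Running this before dispatching to the tool turns a hard Composio validation error into a retryable failure you can feed back to the model.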

Cost per task (models & tokens)

I paid list prices (May 2024): GPT-4o at $5.00/1M input tokens, $15/1M output; Claude 3 Opus at $15/$75. Average token counts below.

  Task                 | Avg prompt | Avg completion | Total tokens
  Email draft          | 700        | 380            | 1,080
  Code refactor        | 1,200      | 890            | 2,090
  Transcript → summary | 3,800      | 1,100          | 4,900

Doing the math:

  • Email draft cost (GPT-4o): (0.000005*700) + (0.000015*380) ≈ $0.0092
    (Claude): (0.000015*700) + (0.000075*380) ≈ $0.0390
  • Transcript summary: $0.036 (4o) vs $0.140 (Claude).
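The same math generalises to any row of the token table. A quick sketch of the calculation, with the May 2024 list prices hard-coded (the map keys are just labels for this snippet, not official model identifiers):

```typescript
// List prices (May 2024) in dollars per token, from the per-million rates above.
const PRICES = {
  "gpt-4o": { input: 5 / 1e6, output: 15 / 1e6 },
  "claude-3-opus": { input: 15 / 1e6, output: 75 / 1e6 },
} as const;

// Cost of one task given its prompt and completion token counts.
function taskCost(
  model: keyof typeof PRICES,
  promptTokens: number,
  completionTokens: number,
): number {
  const p = PRICES[model];
  return promptTokens * p.input + completionTokens * p.output;
}

// Email draft (700 prompt / 380 completion tokens):
taskCost("gpt-4o", 700, 380);        // ≈ $0.0092
taskCost("claude-3-opus", 700, 380); // ≈ $0.0390
```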

Across all 500 tasks GPT-4o cost me $10.84, Claude 3 $33.11. Multiply by 30 days and the difference is a junior engineer’s salary in Berlin.

Personality and conversational quality

This is squishy, so two coworkers and I blind-scored the replies without knowing which model produced them. Empathy is on a 1-5 scale (higher is better); the other two are rates (lower is better).

  • Empathy: Claude 4.7, GPT-4o 4.3
  • Hallucination rate on factual answers: Claude 0.7 %, GPT-4o 1.2 %
  • Over-correction/refusal on safe requests: Claude 6 %, GPT-4o 2 %

Claude still feels like the friend who listens before replying. GPT-4o answers faster but sometimes steamrollers nuance. For front-of-house customer support I’d lean Claude; for quick Slack helpers I prefer 4o.

Coding ability (Node & Python)

I fed each model ten GitHub issues from our OSS repo asking for either a new feature or a tight bug fix. Agents pushed PRs through the GitHub tool and the CI pipeline fired.

  • Merged without human edits: GPT-4o 6/10, Claude 3 4/10.
  • Compilation/runtime failures: GPT-4o 1, Claude 3 3.

Interesting detail: GPT-4o sometimes writes terser commits (fix: null check) while Claude adds a 5-line commit body. Personal taste, but it matters for logs.

Example diff GPT-4o produced

--- a/src/cache.ts
+++ b/src/cache.ts
@@
-export function get(key: string) {
-  return memory[key];
-}
+export function get(key: string): string | undefined {
+  return Object.hasOwn(memory, key) ? memory[key] : undefined;
+}

Email drafting quality

We asked the agents to write a polite but firm discount-denial email. Human PMs rated “Would you send this as is?”

  • Claude 3: 41/50 yes votes
  • GPT-4o: 38/50 yes votes

Common GPT-4o gripe: it drifts back into American punctuation and spelling even after being told to use British English. Trivial, but real. Claude respected the locale more consistently.

Latency and concurrency inside the gateway

Average end-to-end latency (agent.receive → final tool success):

  • GPT-4o: 5.4 s (p95 8.1 s)
  • Claude 3: 7.9 s (p95 13.4 s)

Not drastic, but chain four or five calls and the gap compounds. Concurrency was 20 parallel jobs; Anthropic throttled a few bursts with 429s, while OpenAI queued silently but delivered.
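Those 429 bursts are easy to absorb with a retry wrapper around the provider call. A minimal sketch, not OpenClaw's actual retry logic — `withRetry` is a hypothetical helper you'd wrap around whatever function fires the request:

```typescript
// Retry a provider call on HTTP 429 with exponential backoff.
// Assumes the thrown error carries a numeric `status` field, as most SDKs do.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let delayMs = 250;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Rethrow anything that isn't a rate limit, or once we're out of attempts.
      if (err?.status !== 429 || attempt === maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // exponential backoff: 0.25 s, 0.5 s, 1 s, …
    }
  }
}
```

At 20-way concurrency this smoothed out the Anthropic bursts for me; a production version would also honour the Retry-After header when present.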

Steinberger now at OpenAI: does it matter?

The elephant in the room: Peter built OpenClaw on top of Claude, but his badge now says OpenAI. He’s publicly said model neutrality still matters, and the community governance file was merged last month to keep hooks generic. In practice the @openclaw/sdk defaults changed: "model": "gpt-4o-mini" ships in 0.43.0. If you deploy to ClawCloud and never touch config you’ll be on OpenAI starting this week.

For folks worried about lock-in, both model adapters live under src/providers. Swapping is a single line:

const model = process.env.LLM === "claude" ? "claude-3-opus-20240229" : "gpt-4o";

In short: the politics changed the default, not the API surface. If Anthropic drops prices again, switching back is easy.

Recommendation matrix

  Use case                               | Pick      | Why
  High-volume internal automations       | GPT-4o    | 3× cheaper, better tool JSON
  Customer-facing email or chat          | Claude 3  | More empathic tone, fewer hallucinations
  Code-writing agents                    | GPT-4o    | Higher pass rate, faster
  Long-document reasoning (>100k tokens) | Claude 3* | Context window still bigger; no comparable GPT-4o window public yet

*as of May 2024. If OpenAI opens up a 200k window, this flips.

What I’m running in production

My ClawCloud workspace now defaults to GPT-4o for all automation flows. Specific channels where brand voice matters (@support Slack and help@ inbox) still point to Claude. The switch saved roughly $700 last month without user complaints.

If you only read one thing: try GPT-4o first, but keep the provider flag in a .env. Models move fast, bills stick around.