If you run an OpenClaw gateway with more than one active agent, you have probably noticed the same pattern I did: some requests absolutely do not need GPT-4o, while others fall apart on gpt-3.5-turbo. This article shows you how to route different OpenClaw tasks to different AI models so you can control latency, cost, and privacy without juggling multiple deployments.

Why model tiering matters in practice

Model tiering is the boring name for a very real pain point. In the last 30 days my ClawCloud bill looked like this:

  • 60 %: customer-facing FAQ bot answering three-sentence questions
  • 30 %: internal research agent summarizing PDFs for the ops team
  • 10 %: code review assistant running on pull request webhooks

The FAQ bot could happily run on gpt-3.5-turbo-1106. The code review assistant occasionally needs chain-of-thought reasoning that only claude-3-opus nails. The PDF summarizer handles sensitive data and should stay on my own GPU with llama-3-instruct-70b.Q4_K_M.gguf. One size does not fit all.

How OpenClaw picks models out of the box

By default the gateway delegates every generation call to whatever you set in OPENCLAW_DEFAULT_MODEL. Most users leave it at gpt-3.5-turbo because it works everywhere. That single flag keeps the mental model simple but forces a trade-off: overpay on simple tasks or underperform on hard ones.

Under the hood, every agent invocation ends up in gateway/src/router.ts. If no model is provided, the router drops in the default. The good news: the router is plain TypeScript, and you are free to tamper with it.
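The fallback itself amounts to one line. Here is a minimal sketch of the behaviour; the field names (task.model, OPENCLAW_DEFAULT_MODEL) mirror what this article describes, not verified gateway internals:

```javascript
// Hypothetical sketch of the default-model fallback in gateway/src/router.ts.
// An explicit per-request model wins; otherwise the global default applies.
function resolveModel(task, env = process.env) {
  return task.model ?? env.OPENCLAW_DEFAULT_MODEL ?? "gpt-3.5-turbo";
}

console.log(resolveModel({ model: "claude-3-opus-20240229" }, {})); // explicit model wins
console.log(resolveModel({}, { OPENCLAW_DEFAULT_MODEL: "gpt-4o" })); // env default
console.log(resolveModel({}, {})); // hard-coded last resort
```

Everything below is about replacing that one `??` chain with smarter logic.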

Strategy 1 – Per-agent model overrides

The quickest fix is to pin a different model for each agent. The agent manifest already supports it:

{
  "name": "faq-bot",
  "description": "Answers repetitive customer questions",
  "llm": {
    "provider": "openai",
    "model": "gpt-3.5-turbo-0125",
    "temperature": 0.2
  }
}

When the request hits the daemon, the gateway passes the agent-level LLM config straight through. No extra code, no risk.

Drawbacks

  • You need to remember to update every manifest when you want to test a new model.
  • There is no global fallback: if OpenAI goes down the agent simply errors.
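The second drawback is easy to patch at the call site. Here is a hedged sketch of a generic fallback wrapper; the client functions in the usage comment are the same stand-ins used throughout this article:

```javascript
// Hypothetical fallback wrapper: try the pinned model first, then a backup.
// primary and backup are any async functions returning a completion.
async function withFallback(primary, backup) {
  try {
    return await primary();
  } catch (err) {
    // In production you would log err and check that it is retryable first.
    return await backup();
  }
}

// Usage sketch: pinned OpenAI model, local model as the escape hatch.
// const res = await withFallback(
//   () => callOpenAI({ model: "gpt-3.5-turbo-0125", prompt }),
//   () => callLocalLLM({ model: "llama-3-instruct-70b.Q4_K_M.gguf", prompt })
// );
```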

Strategy 2 – Workspace-level routing with fallback

The community trend is moving toward a centralised router that inspects the incoming Task object and decides which backend to call. The ClawCloud team hinted that a first-class plugin is coming, but until then a router.js file in your workspace does the job.

// ~/.openclaw/router.js
import { callOpenAI, callAnthropic, callLocalLLM } from "@openclaw/llm-clients";

export async function route(task) {
  const text = task.prompt;

  // 1. Super short => cheap & fast
  if (text.length < 300) {
    return callOpenAI({
      model: "gpt-3.5-turbo-0125",
      prompt: text,
      temperature: 0.2,
    });
  }

  // 2. Private files => local GPU
  if (task.metadata?.privacy === "internal") {
    return callLocalLLM({
      model: "llama-3-instruct-70b.Q4_K_M.gguf",
      prompt: text,
      temperature: 0.1,
    });
  }

  // 3. Everything else => premium reasoning
  return callAnthropic({
    model: "claude-3-opus-20240229",
    prompt: text,
    temperature: 0.3,
  });
}

Place that file next to gateway.config.json and restart the daemon:

$ npx openclaw daemon --workspace ~/.openclaw

Every agent now flows through the custom router.

Environment variables to keep secrets out of git

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export LOCAL_LLM_ENDPOINT=http://127.0.0.1:11434

The router reads process.env automatically via dotenv, which has shipped with the gateway since v0.38.2.

Detecting task complexity at runtime

Length-based heuristics are fine until someone asks a five-word question that actually needs a deep chain-of-thought (“Why did my fine-tuning collapse?”). I’ve had better luck classifying the request first:

// classify.js – tiny helper
import { callOpenAI } from "@openclaw/llm-clients";

export async function classify(prompt) {
  const system =
    "You are a routing classifier. Reply with one word: 'simple', 'private', or 'complex'.";
  const res = await callOpenAI({
    model: "gpt-3.5-turbo-0125",
    messages: [
      { role: "system", content: system },
      { role: "user", content: prompt },
    ],
  });
  return res.choices[0].message.content.trim();
}

You can wrap the classifier inside the router. It adds ~300 ms but saves money by keeping most traffic on the cheaper model.
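Wiring the label back into a routing decision is just a switch. A minimal sketch, kept as a pure function so it can be tested without touching any API; the backend names mirror the router above:

```javascript
// Map the classifier's one-word label to a backend choice.
// Pure function: no network calls, trivially unit-testable.
function pickBackend(label) {
  switch (label) {
    case "simple":
      return { provider: "openai", model: "gpt-3.5-turbo-0125" };
    case "private":
      return { provider: "local", model: "llama-3-instruct-70b.Q4_K_M.gguf" };
    default:
      // "complex" and any unexpected label both land here.
      return { provider: "anthropic", model: "claude-3-opus-20240229" };
  }
}
```

Defaulting unknown labels to the premium model is the safe direction: an occasional misroute costs a little money, not answer quality.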

Real cost numbers (May 2024 pricing)

  • gpt-3.5-turbo-0125: $0.50 / 1M input tokens, $1.50 / 1M output
  • claude-3-opus-20240229: $15 / 1M input, $75 / 1M output
  • Local Llama-3-70B Q4: $0 after GPU amortisation, ~6 ms/token on RTX 4090

For my workload (75 % short, 20 % private, 5 % heavy reasoning) the blended cost went from $430 → $92 per month once I rolled out the router. Latency improved on short replies because GPT-3.5 is still faster than Opus.
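You can sanity-check a blended rate before rolling anything out. In the sketch below, only the per-model prices come from the list above; the monthly token volumes are invented placeholders, not my real traffic:

```javascript
// Blended monthly cost estimate. Prices are USD per 1M tokens (May 2024);
// the token volumes are made-up placeholders for illustration.
const tiers = [
  { name: "gpt-3.5-turbo-0125", inPrice: 0.5, outPrice: 1.5, inTok: 120e6, outTok: 30e6 },
  { name: "local llama-3-70b",  inPrice: 0,   outPrice: 0,   inTok: 40e6,  outTok: 10e6 },
  { name: "claude-3-opus",      inPrice: 15,  outPrice: 75,  inTok: 5e6,   outTok: 1e6 },
];

const total = tiers.reduce(
  (sum, t) => sum + (t.inTok / 1e6) * t.inPrice + (t.outTok / 1e6) * t.outPrice,
  0
);
console.log(total.toFixed(2)); // "255.00"
```

Note how the tiny Opus slice (6M tokens out of 206M) still accounts for more than half of the total; that asymmetry is the whole argument for tiering.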

Putting it all together: production-grade pipeline

If you need observability and back-pressure, wire the router into the new middleware hook added in gateway@0.40.0:

// ~/.openclaw/middleware/model-tiering.ts
import { classify } from "../classify.js";
import { callOpenAI, callAnthropic, callLocalLLM } from "@openclaw/llm-clients";

export default async function modelTiering(ctx, next) {
  const { task } = ctx;
  const decision = await classify(task.prompt);

  switch (decision) {
    case "simple":
      ctx.response = await callOpenAI({ model: "gpt-3.5-turbo-0125", prompt: task.prompt });
      break;
    case "private":
      ctx.response = await callLocalLLM({ model: "llama-3-instruct-70b.Q4_K_M.gguf", prompt: task.prompt });
      break;
    default:
      ctx.response = await callAnthropic({ model: "claude-3-opus-20240229", prompt: task.prompt });
  }

  await next();
}

Enable it in gateway.config.json:

{ "middleware": ["./middleware/model-tiering.ts"] }

The ctx object is passed along, so subsequent middleware can still log or post-process the response.
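The hook contract is the usual (ctx, next) onion. A toy runner, written by me as an illustration rather than taken from the gateway source, shows why downstream middleware still sees ctx.response:

```javascript
// Minimal (ctx, next) middleware runner, Koa-style. The gateway's real
// runner may differ; this only demonstrates the ctx-passing behaviour.
async function run(middlewares, ctx) {
  let i = 0;
  const next = async () => {
    const mw = middlewares[i++];
    if (mw) await mw(ctx, next);
  };
  await next();
  return ctx;
}

// First middleware sets the response; the second reads it after next().
const setResponse = async (ctx, next) => { ctx.response = "hi"; await next(); };
const logAfter = async (ctx, next) => { ctx.logged = ctx.response; await next(); };

run([setResponse, logAfter], {}).then((ctx) => console.log(ctx.logged)); // "hi"
```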

Observability: catching silent cost creep

Model tiering can fail quietly: if your classifier drifts, everything suddenly routes to Opus. I recommend three basic metrics:

  • router.decisions – counter by label (simple, private, complex)
  • llm.token_usage – histogram by provider
  • router.fallback_rate – percentage of tasks that hit your catch-all model

Point Prometheus at localhost:9464/metrics; the gateway exports these since v0.41.1.
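If you are on a gateway older than v0.41.1, a throwaway in-memory tally gives you the router.decisions signal today. This is my own sketch, not the gateway exporter; swap it for a real Prometheus client once you upgrade:

```javascript
// Naive in-memory decision counter; replace with a proper Prometheus
// client in production so the numbers survive restarts and scrapes.
const decisions = new Map();

function recordDecision(label) {
  decisions.set(label, (decisions.get(label) ?? 0) + 1);
}

recordDecision("simple");
recordDecision("simple");
recordDecision("complex");
console.log(Object.fromEntries(decisions)); // { simple: 2, complex: 1 }
```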

Community patterns & edge cases

From GitHub issues #3481 and #3572 plus the Discord #models channel:

  • Use gpt-4o-mini as an intermediate tier if you need vision and voice input but still want to keep costs low.
  • Batch up similar tasks into a single call (max_tokens < 250) before sending them to a premium model; context stuffing is cheaper than multiple calls.
  • Local Llama 3 struggles with JSON-only outputs unless you fine-tune the system prompt: "You are a dogmatic JSON generator...".
  • Anthropic returns 429 errors when you exceed roughly 20 requests/second. The router should implement exponential back-off or send overflow traffic to GPT-4.
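The back-off from that last bullet is straightforward to implement generically. A hedged sketch with full jitter; the retry count, delays, and the retryable-status check are all assumptions you should adapt to your client library:

```javascript
// Retry with exponential back-off and full jitter. The check assumes your
// LLM client surfaces the HTTP status on err.status, which is an assumption;
// 429 is the conventional rate-limit status.
async function withBackoff(fn, { retries = 4, baseMs = 250 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up after `retries` attempts, and never retry non-rate-limit errors.
      if (attempt >= retries || err.status !== 429) throw err;
      const delay = Math.random() * baseMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

Full jitter (random delay between 0 and the exponential cap) keeps a burst of concurrent agents from retrying in lockstep.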

Next step: automate the classifier itself

Hard-coding the routing rules works, but the real end-game is a feedback loop. Push usage stats to BigQuery, train a logistic regression that predicts the cheapest model that still meets your quality SLA, and re-deploy the router nightly. Several community members (see discussion #3610) are already doing this. If you have that in place, flipping a switch in ClawCloud to open a second workspace for A/B testing is trivial. Your wallet will thank you.