If you landed here after searching “OpenClaw model failover setup automatic switching between providers,” you probably hit the same wall I did: one flaky model endpoint can take the whole agent down. The good news is that OpenClaw ships with a routing layer that can juggle multiple providers (OpenAI, Anthropic, Groq, Ollama, whatever) and transparently roll over when the current one throws 429s or just disappears. The bad news is that the best practices are scattered across GitHub issues and Slack threads. Here’s everything in one place.
How model routing and failover work inside OpenClaw
OpenClaw 2.3.4 introduced modelPools. Think of a pool as an ordered list of adapters. The gateway looks at the first adapter (your primary). If it replies within timeoutMs and doesn’t raise a hard error, you’re done. If not, the daemon bumps the failure counter and immediately calls the next adapter. The whole thing happens synchronously in the same request, so the user only sees one response.
Key points the docs gloss over:
- The decision happens per request, not per session. A single bad prompt won’t permanently blacklist a provider.
- The circuit opens after failuresBeforeSkip consecutive faults. This prevents instant flip-flopping on transient hiccups.
- A success resets the failure counter only for that adapter.
- Health is stored in memory. Restarting the gateway clears it unless you enable Redis persistence (new in 2.4.0-beta).
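The health bookkeeping those bullets describe is easy to picture as a per-adapter counter. Here's a minimal sketch of the documented semantics — my own illustration, not OpenClaw's internals: consecutive faults open the circuit at failuresBeforeSkip, and a success resets only that adapter's counter.

```javascript
// Sketch of per-adapter health tracking as described above.
// Not OpenClaw source — just the documented semantics.
class AdapterHealth {
  constructor(failuresBeforeSkip = 3) {
    this.failuresBeforeSkip = failuresBeforeSkip
    this.failures = new Map() // adapter name -> consecutive failure count
  }
  markFailure(name) {
    this.failures.set(name, (this.failures.get(name) || 0) + 1)
  }
  markSuccess(name) {
    this.failures.set(name, 0) // a success resets only this adapter
  }
  isOpen(name) {
    // "open" circuit = skip this adapter for now
    return (this.failures.get(name) || 0) >= this.failuresBeforeSkip
  }
}
```

Because this state lives in memory, a gateway restart wipes the counters — which is exactly why the optional Redis persistence in 2.4.0-beta exists.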
Picking primary, secondary and tertiary models
An obvious but often skipped step is deciding what you actually want to fail over to. A random list won’t cut it; different providers have different token windows, function-call formats and quality quirks. My own stack:
- Primary — OpenAI gpt-4o (modelId gpt-4o-2024-05-13) for quality.
- Secondary — Anthropic Claude 3 Sonnet (claude-3-sonnet-20240229) when OpenAI is throttled.
- Tertiary — Groq Llama-3-70B-Instruct for low latency during spikes.
- Ultimate fallback — Local llama-3-8B-Instruct served by Ollama. Keeps the demo booth alive even when the hotel Wi-Fi blocks everything.
Yes, mixing models can break function-call schemas. The fix is to use OpenClaw formatters (also 2.3.x) to canonicalize tool-call payloads across providers.
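I don't know the formatter API surface well enough to quote it, so treat this as an illustration of what "canonicalize" means in practice. OpenAI returns function arguments as a JSON-encoded string, Anthropic's tool_use blocks carry them as a plain object, and a formatter flattens both into one shape your handlers can rely on. The function name here is hypothetical:

```javascript
// Hypothetical canonicalizer: normalize provider-specific tool calls
// into one { name, args } shape. Input field names match each provider's API.
function canonicalizeToolCall(provider, raw) {
  if (provider === 'openai') {
    // OpenAI: arguments arrive as a JSON-encoded string
    return { name: raw.function.name, args: JSON.parse(raw.function.arguments) }
  }
  if (provider === 'anthropic') {
    // Anthropic: tool_use blocks carry input as a plain object
    return { name: raw.name, args: raw.input }
  }
  throw new Error(`no formatter for ${provider}`)
}
```

Whatever fires downstream of the pool then only ever sees the canonical shape, regardless of which adapter answered.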
Configuring automatic failover in claw.config.js
Everything lives in one file at the project root. Below is the trimmed version that’s currently powering the public Discord bot:
/* claw.config.js */
module.exports = {
models: {
defaultPool: {
timeoutMs: 12000, // 12 seconds per hop
failuresBeforeSkip: 3, // three strikes rule
adapters: [
{
provider: 'openai',
model: 'gpt-4o-2024-05-13',
apiKey: process.env.OPENAI_KEY,
maxTokens: 4096,
cost: 0.005 // per 1K input tokens, used for telemetry only
},
{
provider: 'anthropic',
model: 'claude-3-sonnet-20240229',
apiKey: process.env.ANTHROPIC_KEY,
maxTokens: 4096,
cost: 0.0038
},
{
provider: 'groq',
model: 'llama3-70b-8192',
apiKey: process.env.GROQ_KEY,
maxTokens: 8192,
cost: 0.0011
},
{
provider: 'ollama', // local dockerised model
model: 'llama3:8b-instruct-q4_0',
endpoint: 'http://127.0.0.1:11434',
maxTokens: 2048,
cost: 0 // free but slow; this is the last-resort fallback
}
]
}
}
}
Restart the daemon:
$ npx openclaw daemon restart
and you’re live. No other flags needed.
Understanding failover triggers: timeouts, 429s and model errors
Out of the box, OpenClaw treats these as hard failures:
- HTTP ≥ 500 from provider
- HTTP 429 (rate limit) if skipOnRateLimit is true (the default)
- Network timeout beyond timeoutMs
- JSON schema violations — yes, malformed function-call JSON counts
You can soften or harden the rules:
{
skipOnRateLimit: true, // false if you’d rather queue
failurePatterns: [
/model_overloaded/i, // Anthropic custom message
/^context length exceeded/ // OpenAI context error
]
}
Internally, each request walks this pseudocode:
for (const adapter of adapters) {
  try {
    const res = await adapter.call(prompt, opts)
    if (isHardError(res)) throw new Error(res.error)
    return res
  } catch (e) {
    markFailure(adapter)
    if (failures(adapter) >= failuresBeforeSkip) continue // circuit open: move on
    else if (!adapter.retryable) throw e // fail fast instead of hammering
  }
}
throw new Error('All models failed')
One subtlety: retryable is intentionally false for local models by default, to avoid hammering your GPU.
Cost dynamics: what actually happens to your bill
The optimizer in me wanted to stack cheap models first. That fails the moment latency matters. After a week of logs I learned:
- OpenAI did 95 % of traffic despite occasional spikes.
- Anthropic picked up 3.5 % during OpenAI’s 02:00–02:10 UTC brownouts.
- Groq got 1 % mostly on weekend rate limits.
- Ollama maybe 0.2 %, usually when my home internet DNS broke.
Effective blended cost: $0.0051 / 1K tokens — only 2 % over pure OpenAI, totally acceptable. The takeaway: make the primary the best quality you can afford. Failover almost never dominates the spend.
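Sanity-checking the blend is a one-liner. Using the per-1K input prices from the config and the traffic shares from my logs (input pricing only, so it lands a touch under the observed bill):

```javascript
// Blended per-1K-token cost = sum(share_i * cost_i), shares from the week of logs.
const mix = [
  { provider: 'openai',    share: 0.95,  cost: 0.005 },
  { provider: 'anthropic', share: 0.035, cost: 0.0038 },
  { provider: 'groq',      share: 0.01,  cost: 0.0011 },
  { provider: 'ollama',    share: 0.002, cost: 0 },
]
const blended = mix.reduce((sum, m) => sum + m.share * m.cost, 0)
console.log(blended.toFixed(4)) // prints "0.0049" on input prices alone
```

Output tokens are priced separately, which is what nudges the real figure up to $0.0051 — either way, failover barely moves the blend.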
An edge case: some teams use streaming mode to slash latency. The first chunk comes from provider A, but if it fails mid-way, OpenClaw can’t switch mid-stream. You’ll eat the whole request cost and a timeout. The fix (added in 2.4.0-rc1) is chunkedResume, but it’s still experimental.
Setting up a local model as the final fallback
Running a GPU node is optional; a CPU box works for emergencies. I use Ollama 0.1.36 with llama-3-8B-Instruct-Q4.
- Install Ollama: $ curl -fsSL https://ollama.ai/install.sh | sh
- Pull the model: $ ollama pull llama3:8b-instruct-q4_0
- Expose on LAN (optional): $ OLLAMA_HOST=0.0.0.0:11434 ollama serve (the bind address is set via the OLLAMA_HOST env var, not a flag)
- Add the adapter block shown earlier.
Gotchas:
- Ollama uses /api/generate, not /v1/chat/completions. OpenClaw’s built-in adapter handles this; if you rolled your own before 2.3, delete it.
- Latency on CPU is 8–12× slower. The gateway front-ends this behind the same timeoutMs, so increase it to 40 s for the fallback adapter or you'll never reach it.
- Memory footprint is ~10 GB for Q4 quantization. A cheap m6i.2xlarge spot box does fine.
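On the first gotcha: if you do end up writing your own shim (again, the built-in adapter already handles this), the shape difference is small. A hypothetical translation from a chat-completions-style request to an Ollama /api/generate body — toOllamaGenerate is my name, the payload fields are Ollama's:

```javascript
// Hypothetical shim: chat-completions-style request -> Ollama /api/generate body.
// Ollama's generate endpoint takes a flat prompt, not a messages array.
function toOllamaGenerate(req) {
  const prompt = req.messages
    .map((m) => `${m.role}: ${m.content}`)
    .join('\n')
  return {
    model: req.model,                               // e.g. 'llama3:8b-instruct-q4_0'
    prompt,
    stream: false,                                  // no mid-stream resume anyway
    options: { num_predict: req.max_tokens ?? 2048 } // Ollama's output-token cap
  }
}
```

POST that to http://127.0.0.1:11434/api/generate and the answer comes back in the response field of the JSON body.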
I also enable metrics.fallbackAlerts, which sends a Slack DM whenever the local model is used more than twice in ten minutes.
Monitoring and testing your failover chain
Dry-run using the CLI
The simplest sanity check:
$ npx openclaw ask "say hi" --trace
The trace shows which adapter responded. Shut down OpenAI network access (sudo iptables -A OUTPUT -p tcp --dport 443 -m string --string "api.openai.com" --algo kmp -j DROP) and rerun to see Anthropic kick in.
Unit tests
import { ask } from 'openclaw'
import nock from 'nock'

test('falls back on 429', async () => {
  // Primary: OpenAI is rate-limited
  nock('https://api.openai.com')
    .post('/v1/chat/completions')
    .reply(429, { error: 'rate_limit' })
  // Secondary: mock Anthropic too, or the test hits the real network
  nock('https://api.anthropic.com')
    .post('/v1/messages')
    .reply(200, { content: [{ type: 'text', text: 'hi' }] })
  const res = await ask('hello')
  expect(res.adapter.provider).toBe('anthropic')
})
Grafana dashboards
OpenClaw exports Prometheus metrics at /metrics. Key counters:
- openclaw_adapter_requests_total{provider="openai"}
- openclaw_adapter_failures_total{provider="openai"}
- openclaw_failover_events_total
Alert when the failover counter jumps: increase(openclaw_failover_events_total[5m]) > 10.
Common pitfalls and what the community learned
- Timeouts too short — folks copy the 5 s examples from old tutorial posts; Anthropic can take 8–10 s on 1,000-token prompts.
- Ordering by price — leads to unexpected regressions when secondary quality is lower. Front-load quality, let fallback be cheap.
- Environment variable leaks — make sure printEnv is false in prod logs; otherwise your Anthropic key gets dumped during verbose retries.
- Not all errors are equal — OpenAI “context_length_exceeded” is fatal; Anthropic returns 400 with no JSON body and breaks the parser unless you upgrade to adapter 2.3.1.
- Redis mismatch — mixing gateway v2.4 with daemon v2.3 silently disables health persistence; both need to match.
The most upvoted GitHub discussion (#1732) points out that failover only helps availability, not consistency. If your prompt depends on deterministic function calls, keep the provider constant per conversation by pinning session.modelPoolIndex.
Next step: test in production traffic shadows
Enable failover in mirror mode first: set mirrorOnly: true on secondary adapters. They’ll run in the background and log latency/cost without affecting the user. Once you’re confident, flip the flag, restart, and sleep better at night.
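Based on that description, a shadowed secondary would look something like this — treat the exact placement of the flag as an assumption and verify against your version:

```javascript
// Assumed mirror-mode config: mirrorOnly on the adapter entry itself.
adapters: [
  { provider: 'openai', model: 'gpt-4o-2024-05-13', apiKey: process.env.OPENAI_KEY },
  {
    provider: 'anthropic',
    model: 'claude-3-sonnet-20240229',
    apiKey: process.env.ANTHROPIC_KEY,
    mirrorOnly: true // runs in the background; logged, never returned to users
  }
]
```

A week of mirror logs gives you real latency and cost numbers for the secondary before a single user request ever depends on it.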
Questions, corrections, or war stories? Drop them in #models on the OpenClaw Discord or file a GitHub issue. I read every single one.