OpenClaw lives or dies by how well a language model follows the tool system message. If the model hallucinates a reply instead of the JSON payload, the chain breaks and your users see stack traces. Below is the list — heavily biased by that one metric — of the best free and open-source models you can actually run on a single GPU or even a MacBook in 2024.

Why tool-use reliability beats benchmark scores

The academic leaderboards (MT-Bench, GSM8K, etc.) tell you about reasoning, not obedience. I wired each model into OpenClaw v0.42.7 and ran the same 50-prompt harness against every one:

  • Structured {"tool":..., "args":{...}} calls
  • Long-form requests near the top of context window
  • Adversarial "just answer normally" distractions

I counted a run as a pass when the model responded with well-formed JSON (jq accepted it) and the requested tool name matched. The final score you’ll see below is tool-use success rate over 50 attempts.
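In code, that pass criterion looks roughly like this (a sketch, not the actual harness; `expectedTool` is an illustrative field name, not part of any real API):

```javascript
// Pass criterion sketch: the reply must be valid JSON *and* name the
// expected tool. `expectedTool` is an illustrative parameter.
function isPass(reply, expectedTool) {
  let parsed;
  try {
    parsed = JSON.parse(reply); // jq would reject anything this rejects
  } catch {
    return false;
  }
  return parsed.tool === expectedTool;
}

console.log(isPass('{"tool":"search","args":{"q":"llama"}}', "search")); // true
console.log(isPass("Sure! I searched for llama.", "search"));            // false
```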

Hardware used in this review

  • CUDA box: 1× RTX 4090 24 GB, CUDA 12.5, Ubuntu 22.04
  • Apple M3 Max 48-GB
  • CPU fallback: Ryzen 7950X with llama.cpp AVX2
  • Serving stack: llama.cpp 0.2.19 and Ollama 0.1.31

I ran all models in 4-bit Q4_K_M GGUF where available; otherwise 4-bit NF4 from auto-gptq. Bigger context models required sliding-window batching.

Quick-look ranking (tool success over 50 prompts)

  • 1️⃣ Llama 3.1 70B — 48/50 (96%)
  • 2️⃣ Mixtral-8x7B-v0.1 — 46/50 (92%)
  • 3️⃣ Mistral-7B-v0.2 — 44/50 (88%)
  • 4️⃣ DeepSeek-67B-Instruct — 42/50 (84%)
  • 5️⃣ Qwen-14B-Chat — 40/50 (80%)
  • 6️⃣ Llama 3.1 8B — 38/50 (76%)
  • 7️⃣ Phi-3-mini-4K-instruct (MIT) — 32/50 (64%) *

*Phi-3 isn’t Apache-2; it’s MIT-licensed. That’s still free for most commercial use, so it’s included here.

Llama 3.1 — still the compliance king

Models covered

  • llama-3.1-70B-instruct (128 K context, tested at 8 K)
  • llama-3.1-8B-instruct (128 K context, tested at 8 K)

Tool-use reliability

The 70B knocks it out of the park. Its only two failures were cases where it appended a persuasive sentence after the JSON: easy to strip, but technically wrong. The 8B occasionally cut corners by omitting an argument key.
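If you want to salvage that kind of near-miss instead of counting it as a failure, a brace-balancing extractor is enough. A minimal sketch, not part of OpenClaw itself:

```javascript
// Extract the first balanced top-level JSON object from a reply that may
// have extra prose appended. Naive about braces inside string values,
// which is fine for typical tool payloads that don't embed "{" or "}".
function extractJson(reply) {
  const start = reply.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < reply.length; i++) {
    if (reply[i] === "{") depth++;
    else if (reply[i] === "}" && --depth === 0) {
      return reply.slice(start, i + 1); // stop at the matching close brace
    }
  }
  return null; // never balanced: give up
}

const raw = '{"tool":"search","args":{"q":"llama"}} Hope that helps!';
console.log(extractJson(raw)); // {"tool":"search","args":{"q":"llama"}}
```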

Context window

Llama 3.1 actually ships a 128 K window; I ran the harness at llama.cpp’s default 8 K to keep the KV cache in check (raise --ctx-size if you need more). For most agentic work, 8 K is fine because the memory subsystem in OpenClaw stores conversation state separately.

Speed & hardware

  • 70B Q4_K_M: ~18 tok/s on 24-GB 4090, GPU RAM at 21.5 GB
  • 8B Q4_K_M: 58 tok/s on M3 Max, 5.3 GB unified memory

If you only have 16-GB VRAM, stick to the 8B. The 70B will spill to CPU and crawl.

OpenClaw config snippet

{
  "llm": {
    "provider": "llama.cpp",
    "model": "/models/llama3.1/llama3.1-70b-q4_k_m.gguf",
    "params": { "temperature": 0, "top_p": 0.1 }
  }
}

Temperature 0 is critical; even small sampling noise can break JSON.

Mixtral-8x7B — sparse MoE, dense compliance

Mistral AI released Mixtral under Apache-2, and the community ported it to GGUF within 48 hours. Sparse mixture-of-experts means two of eight experts fire per token, giving near-70B quality for ~12 GB VRAM.

Tool-use reliability

92 %. The errors were strongly correlated with long prompts; the router occasionally chose a creative expert that rephrased the tool call. Setting repeat_penalty 1.05 helped.
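If you hit the same rephrasing issue, the fix slots straight into the params block of your OpenClaw config (repeat_penalty is a standard llama.cpp/Ollama sampling option):

```json
{
  "llm": {
    "provider": "ollama",
    "model": "mixtral:instruct-q4",
    "params": { "mirostat": 0, "repeat_penalty": 1.05 }
  }
}
```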

Context window

32 K natively. Grouped-query attention keeps the KV cache small, so RAM usage grows more slowly with context than in older dense 32 K models.

Speed & hardware

  • Q4 on 4090: 23 tok/s, 11.8 GB VRAM
  • CPU AVX2: 4 tok/s, usable for low-traffic Slack bots

OpenClaw snippet

{
  "llm": {
    "provider": "ollama",
    "model": "mixtral:instruct-q4",
    "params": { "mirostat": 0 }
  }
}

Pull the model first with `ollama pull mixtral:instruct-q4`.

Ollama bundles the right prompt wrapper; you don’t need to prepend <|tool|> tokens manually.

Mistral-7B-v0.2 — the fallback that just works

If you’re targeting baseline commodity machines (T4, older Quadro), Mistral-7B is the sweet spot.

Tool-use reliability

88 % in my tests, but it jumped to 94 % after adding the community "better-tool-system" prefix from #agents-bert on the OpenClaw Discord:

<|system|> You reply with JSON only. Nothing else.

Context window

8 K; plenty unless you’re embedding full PDFs.

Speed & hardware

  • T4 16-GB: 12 tok/s in 4-bit
  • MacBook Air M2: 40 tok/s (!!) thanks to Metal backend

Minimal RAM

Under 5 GB GPU for Q4_K_M. Cheap to host.

Why not the 8×22B variant?

Mixtral-8x22B is also Apache-2, but its Q4 weights come to roughly 80 GB, far beyond the single-GPU setups this list targets. I skipped it to keep every entry runnable on one card.

DeepSeek-67B-Instruct — bigger, slower, still Apache-2

DeepSeek’s 67B surprised me. Out of the box it behaved worse than Llama 3.1 on normal chat, yet it was very strict inside the tool harness.

Tool-use reliability

84 %. The misses were empty tool calls when asked for multi-step reasoning.

Context window

16 K. Enough to glue RAG chunks plus a page of memory.

Speed & hardware

  • Q4_K_M needs 25 GB GPU RAM. That killed my single 4090; I had to split across two GPUs with llama.cpp --tensor-split 20,12.
  • Inference drops to 8 tok/s.

Good if you already pay for beefy servers, otherwise skip.

Qwen — the sleeper hit from Alibaba

Models covered

  • Qwen-14B-Chat (32 K context)
  • Qwen-1.8B-Chat (strong on phones via llama.cpp Mobile)

Tool-use reliability

80 % for the 14B. Interesting pattern: it never hallucinated random text, but roughly one in five replies contained a trailing comma rendering the JSON invalid. A quick sed 's/,}/}/' fixer pushed pass rate to 94 %.
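If you’d rather keep the fixer inside the agent process than shell out to sed, the same idea is one regex in Node (same caveat as the sed version: it will also rewrite ",}" sequences inside string values, which is acceptable for payloads that don’t embed JSON-looking text):

```javascript
// Strip trailing commas before a closing brace or bracket -- the one
// malformation Qwen-14B produced in testing.
function fixTrailingCommas(reply) {
  return reply.replace(/,\s*([}\]])/g, "$1");
}

console.log(fixTrailingCommas('{"tool":"search","args":{"q":"llama",},}'));
// {"tool":"search","args":{"q":"llama"}}
```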

Speed & hardware

  • 14B Q4_K_M on 4090: 37 tok/s
  • 1.8B on Raspberry Pi 5 (ARMv8): 6 tok/s, enough for a hobby Telegram bot

Why include 1.8B?

Because sometimes you want an always-on assistant that costs $0 on a Pi and still triggers GitHub Automation via Composio. It passed 62 % of tool tests — workable if you wrap with retries.
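The retry wrapper can be as small as this. agent.run mirrors the @openclaw/sdk usage from my benchmark script, so treat the exact method name as illustrative:

```javascript
// Re-ask the model until it emits parseable JSON, up to `maxTries`
// attempts. `agent.run` is assumed to return the raw model reply.
async function runWithRetries(agent, prompt, maxTries = 3) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    const res = await agent.run(prompt);
    try {
      return JSON.parse(res); // success: hand back the parsed tool call
    } catch {
      // malformed JSON -- loop and ask again
    }
  }
  throw new Error(`No valid tool call after ${maxTries} attempts`);
}
```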

OpenClaw snippet (14B)

{
  "llm": {
    "provider": "llama.cpp",
    "model": "qwen-14b-chat-q4_k_m.gguf",
    "params": { "top_p": 0.9, "temperature": 0.1 }
  }
}

Qwen-Chat uses the ChatML format, so set <|im_end|> as the stop sequence in llama.cpp; upgrade to 0.2.19+.

Phi-3 mini — edge deployments welcome

Phi-3 scored lowest but earns its place because it runs entirely in 4 GB of RAM. The MIT license is good enough for most commercial users.

Stats

  • Model: phi-3-mini-4K-instruct
  • Tool-success: 64 %
  • Context: 4 K
  • Speed: 25 tok/s on Intel NUC11 i7 CPU

Use when you literally can’t afford GPU — think customer-hosted on-prem boxes.

Testing script (simplified)

#!/usr/bin/env node
import { Agent } from "@openclaw/sdk";
import prompts from "./toolBench.js";

const agent = new Agent({
  llm: "localhost:11434", // ollama endpoint
  model: process.argv[2],
});

let pass = 0;
for (const prompt of prompts) {
  const res = await agent.run(prompt);
  try {
    JSON.parse(res);
    pass++;
  } catch {}
}
console.log(`${pass}/${prompts.length} JSON tool calls correct`);

Replace toolBench.js with your own prompts. Run like:

node bench.js mixtral:instruct-q4

Which model should you start with?

If you’re on consumer GPUs or Apple Silicon, try Mixtral-8x7B first — you get high compliance at sane VRAM. On a laptop, Mistral-7B wins. If you have deep pockets or run ClawCloud’s premium GPU tier, go straight to Llama 3.1 70B.

Once your agent is live, measure your own failure rate via OpenClaw’s /admin/logs endpoint. Models evolve weekly; swap them out without touching business logic. The rate-limiting factor is no longer accuracy, it’s how reliably the model sticks to a contract.
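A minimal poller against that endpoint might look like the sketch below. The response shape (an entries array with a toolCallOk flag) is my assumption, not a documented schema; check what your OpenClaw version actually emits:

```javascript
// Compute tool-call failure rate from OpenClaw's /admin/logs endpoint.
// ASSUMPTION: the endpoint returns { entries: [{ toolCallOk: boolean }] };
// rename the fields to match your deployment's real schema.
async function failureRate(baseUrl) {
  const res = await fetch(`${baseUrl}/admin/logs`);
  const { entries } = await res.json();
  const failed = entries.filter((e) => !e.toolCallOk).length;
  return entries.length ? failed / entries.length : 0;
}
```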

Happy shipping — and share your numbers in #models-and-cost on the OpenClaw Discord so we can keep this ranking real.