If you are googling "how to scale OpenClaw to hundreds of agents in the cloud", you probably just hit the wall where pm2 start gateway.js -i max is no longer cute. This piece documents the architecture we ended up shipping in production (currently 640 concurrent agents) and names every sharp edge we bled on. Everything below is tested on OpenClaw v4.8.3 (Node 22.2.0) and Kubernetes 1.29.

The real scaling pain shows up at ~40 agents

OpenClaw spawns a WebSocket connection per channel, keeps a memory store per agent, and opens a Chromium instance every time the browser tool gets invoked. On a single c6i.4xlarge we could run roughly 35–40 agents before GC pauses went through the roof. Beyond that, we had to answer four questions:

  • Where do new agents live? Pod, container, process?
  • How do we roll agents without dropping chats?
  • How do we share expensive stuff? Vector DB, headless browsers, GPU slots.
  • How much is this going to cost? We needed CFO-friendly numbers.

Moltbook model: reference architecture for 100–1000 agents

The community keeps mentioning the Moltbook model (credit to @lena-s on GitHub) — think of it as a notebook where each sheet can be torn out and replaced without killing the binder. Applied to OpenClaw, that means:

  • Stateless agent pods that can be killed anytime.
  • External state stores (Postgres for metadata, Redis for ephemeral cache, Pinecone for vector memory).
  • Sidecar pattern for the daemon that restarts the gateway if it panics.
  • Horizontal Pod Autoscaler (HPA) driven by queue depth, not CPU.

We will wire all of that up in the next sections. You do not have to copy the exact stack, but keep the binder/sheet metaphor in mind whenever you make a design call.

Cluster baseline: Kubernetes primitives that actually matter

Namespaces and resource quotas

Carve out one namespace per environment (openclaw-prod, openclaw-staging), otherwise your staging agents will happily grab GPUs meant for paying customers when you run load tests.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openclaw-prod
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: prod-budget
  namespace: openclaw-prod
spec:
  hard:
    requests.cpu: '200'
    requests.memory: 800Gi
    limits.cpu: '400'
    limits.memory: 1600Gi
```

Custom Resource Definition: ClawAgent

Instead of scripting kubectl scale deployment calls, expose a CRD that captures everything that makes an agent unique: channels, tools, memory profile. Our trimmed version:

```yaml
apiVersion: clawcloud.ai/v1
kind: ClawAgent
metadata:
  name: support-bot-eu
spec:
  image: ghcr.io/openclaw/gateway:4.8.3
  connections:
    - type: slack
      tokenRef: slack-support-token
  tools:
    - gmail
    - notion
  memory:
    mode: postgres
    encrypted: true
```

The Operator we’ll write next listens for these objects and spins up a Deployment per agent. Why a Deployment and not a Job? Because agents are long-lived.

Building the Agent Deployment Controller

You can get away with a shell script for 20 agents, but not for 200. We used kopf (a Python-based Kubernetes operator framework) because it ships hot reload and doesn't care that we are a Node shop.

```python
# handlers.py
import kopf
import kubernetes
import yaml
from jinja2 import Template

@kopf.on.create('clawcloud.ai', 'v1', 'clawagents')
def on_create(body, **_):
    name = body['metadata']['name']
    spec = body['spec']
    tmpl = Template(open('deployment.yaml.j2').read())
    manifest = yaml.safe_load(tmpl.render(name=name, spec=spec))
    kubernetes.client.AppsV1Api().create_namespaced_deployment(
        namespace='openclaw-prod', body=manifest)
```

The Jinja template sets resources.limits, injects secrets via envFrom, and mounts a ConfigMap containing agent.yaml (OpenClaw reads that at boot).
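A minimal deployment.yaml.j2 along those lines might look like the sketch below. The labels, resource numbers, and mount path are illustrative, not the exact template we ship:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ name }}
  labels:
    app: openclaw-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      agent: {{ name }}
  template:
    metadata:
      labels:
        app: openclaw-gateway
        agent: {{ name }}
    spec:
      containers:
        - name: gateway
          image: {{ spec.image }}
          resources:
            limits:
              cpu: '1'
              memory: 512Mi
          envFrom:
            # Secrets referenced by the CR's connections, e.g. Slack tokens.
            - secretRef:
                name: {{ spec.connections[0].tokenRef }}
          volumeMounts:
            - name: agent-config
              mountPath: /etc/openclaw
      volumes:
        - name: agent-config
          configMap:
            name: {{ name }}-config
```

The ConfigMap holds the rendered agent.yaml that OpenClaw reads at boot.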

Rolling upgrades without chat drops

Slack and Discord webhooks retry on 5xx, but WhatsApp sessions die if the socket disappears for more than 30 seconds. Our fix:

  1. Set terminationGracePeriodSeconds: 35.
  2. Enable lifecycle.preStop hook that calls POST /internal/agent/drain on the gateway; that flips a flag so new inbound messages immediately 302 to the backup agent.
  3. The gateway exits once the active conversation counter hits 0.

The Operator then rolls to the new image tag.
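In pod-spec terms, steps 1 and 2 boil down to something like this. The drain port and curl invocation are assumptions; point it at wherever your gateway actually listens:

```yaml
spec:
  terminationGracePeriodSeconds: 35
  containers:
    - name: gateway
      image: ghcr.io/openclaw/gateway:4.8.3
      lifecycle:
        preStop:
          exec:
            # Flip the drain flag, then let the gateway exit on its own
            # once the active conversation counter hits zero.
            command: ["curl", "-fsS", "-X", "POST", "http://localhost:8080/internal/agent/drain"]
```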

State, memory, and other shared services

Postgres — metadata & long-term memory

We run one RDS Postgres 14.11 cluster per region. Put PgBouncer in front once you go north of 300 connections (each agent holds two).
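A PgBouncer config for that setup can start as small as this. Hostnames and pool sizes are placeholders; transaction pooling is what collapses a thousand-plus agent connections into a few dozen server-side ones:

```ini
; pgbouncer.ini -- sketch, not our production file
[databases]
openclaw = host=openclaw-prod.cluster-xyz.eu-west-1.rds.amazonaws.com dbname=openclaw

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000   ; 640 agents x 2 connections, with headroom
default_pool_size = 50   ; actual server connections per db/user pair
```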

Redis — short-term memory, rate limits

We started on a single cache.t3.medium. At 500 agents we needed cluster mode (3 shards, 2 read replicas each). The SDK config goes in OPENCLAW_REDIS_URL.

Vector store

Pinecone, Qdrant, or pgvector all work. The hot path looks like:

  • Agent emits embedding request to RabbitMQ.
  • Worker pool crunches the 768-dim vector on GPU when available, CPU otherwise.
  • Store returns top-k chunks; agent passes them to LLM call.

We moved embedding out of the agent pod to keep it stateless. That shaved 300 MiB RAM per agent.
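To make the worker's retrieval step concrete, here is a toy version with the embedder stubbed out. The function names are ours, not OpenClaw's, and the real worker calls the 768-dim model instead of the hash below:

```python
import math

def embed(text):
    # Stand-in for the real 768-dim embedding model: a tiny 2-dim
    # "vector" so the ranking logic is runnable without a GPU.
    return [sum(ord(c) for c in text) % 97 + 1, len(text)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_k(query, store, k=3):
    # store: list of (chunk_text, vector) pairs, as the vector DB returns them.
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The agent side only ever sees the returned chunks, which is what keeps the pod stateless.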

Browser automation pool

If your agents ever screen-scrape, don’t bundle Chromium with every pod. We run browserless/chrome:1.61 as a Deployment with --max-concurrency=40. Agents hit it over websocket. Slash RAM by another ~400 MiB.

Observability: metrics, traces, logs, synthetic conversations

Metrics

Prometheus gets you 80% of the way there. The must-have dashboards:

  • LLM round-trip latency per provider key.
  • Token consumption (prompt, completion) per agent.
  • Queue depth for embedding workers.

Sample ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openclaw-agents
spec:
  selector:
    matchLabels:
      app: openclaw-gateway
  endpoints:
    - port: metrics
      interval: 15s
```

Traces

Enable OpenTelemetry in OpenClaw v4.8+: --otel-endpoint=http://tempo:4317. We push traces to Grafana Tempo. It is the only way to debug why the memory vector lookup adds 1.2 s on Tuesdays but not Wednesdays.

Logs

Vector (the Rust one, not the DB) tails /var/log/containers/*gateway*.log and ships to Loki. Keep 3-day retention in cluster, archive to S3 thereafter.

Synthetic conversations

We use Checkly to post “hi” to a canary agent via Slack every five minutes, measuring response, sentiment, and LLM latency end to end. Alert on p95 > 6 s.
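A p95 threshold is easy to get wrong by one index, so here is the percentile math the alert rests on, as a sketch. We use the nearest-rank method here; that choice is ours, not something Checkly prescribes:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile for p in (0, 100]: the smallest sample
    # such that at least p% of the data is <= it.
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

def should_alert(latencies_s, threshold_s=6.0):
    # Fire when the 95th-percentile end-to-end latency crosses the threshold.
    return percentile(latencies_s, 95) > threshold_s
```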

Cost levers: bin-packing GPUs, spot pools, cold storage

CPU vs. GPU embedding

With GPT-4o we rarely need our own GPU inference for chat anymore, but embeddings come out cheaper than text-embedding-3-large if you run a local model on a single A10G (~$1.20/hr). We batch 128 documents per call. The autoscaler spins up GPU nodes on demand:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: a10g-spot
spec:
  requirements:
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [g5]
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot]
  ttlSecondsAfterEmpty: 60
```
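The 128-document batching is a one-liner, but worth pinning down, because an off-by-one here silently undershoots GPU utilization. The helper name is ours:

```python
def batches(docs, size=128):
    # Yield consecutive slices of at most `size` documents;
    # only the final batch may be smaller.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]
```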

Chromium pool sizing

Every page scrape blocks one concurrency slot. We watch the browserless_active metric and scale the Deployment when p95 utilization stays above 0.8 for 5 minutes. Running 3 × c7g.2xlarge saved $480/mo compared to c5n.
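Assuming you expose browserless_active to the HPA via prometheus-adapter (or KEDA), that scaling rule can be written as a standard autoscaling/v2 object; the replica bounds below are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chrome-pool
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chrome
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: External
      external:
        metric:
          name: browserless_active
        target:
          type: AverageValue
          averageValue: "32"   # 0.8 x --max-concurrency=40 per replica
```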

Per-agent memory footprint

Gateway alone: 180 MiB. Add shell tool: +30 MiB. Add Slack RTM: +25 MiB. Browser access: +400 MiB unless offloaded. Our hard target: <300 MiB median. Anything fatter gets flagged in CI.
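The CI flagging can be as simple as summing the per-feature increments against the 300 MiB target. The numbers come from the paragraph above; the lookup-table shape is our sketch, not OpenClaw tooling:

```python
# MiB cost per feature, from profiling; update whenever a tool's
# footprint changes.
FOOTPRINT_MIB = {
    "gateway": 180,
    "shell": 30,
    "slack_rtm": 25,
    "browser": 400,  # drops to ~0 when offloaded to the shared pool
}

BUDGET_MIB = 300  # hard median target per agent

def footprint(features):
    return sum(FOOTPRINT_MIB[f] for f in features)

def within_budget(features):
    return footprint(features) <= BUDGET_MIB
```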

LLM provider coupon juggling

Enterprises usually have Azure OpenAI credits that expire quarterly. We added a Cost Router: agents pick a key from the llm_providers table ordered by cost descending, then freshness ascending (soonest-expiring credits first): SELECT api_key FROM llm_providers WHERE enabled = true ORDER BY cost DESC, freshness ASC LIMIT 1. That alone knocked 18% off Q1 spend.
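Here is the prose ordering (cost descending, freshness ascending) run against a throwaway SQLite table. The schema and column semantics are our guess — we read cost as remaining credit value and freshness as days until expiry — and production runs Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE llm_providers (
        api_key   TEXT,
        enabled   INTEGER,
        cost      REAL,     -- remaining credit value, USD (assumed semantics)
        freshness INTEGER   -- days until the credits expire (assumed)
    )
""")
conn.executemany(
    "INSERT INTO llm_providers VALUES (?, ?, ?, ?)",
    [
        ("azure-q1",    1, 5000.0, 20),    # big credit pool, expiring soon
        ("azure-q2",    1, 5000.0, 110),   # same size, expires later
        ("openai-payg", 1, 0.0,    9999),  # pay-as-you-go fallback
        ("anthropic",   0, 800.0,  30),    # disabled
    ],
)
row = conn.execute(
    "SELECT api_key FROM llm_providers WHERE enabled = 1 "
    "ORDER BY cost DESC, freshness ASC LIMIT 1"
).fetchone()
print(row[0])  # the soonest-expiring of the largest credit pools
```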

Step-by-step: spin up your own 200-agent fleet

This is the recipe we hand interns. Feel free to diff against yours.

  1. Provision base infra
    • AWS EKS 1.29, managed node group (c6i.large) min=3.
    • RDS Postgres 14, 2 x db.r6g.large, Multi-AZ.
    • ElastiCache Redis cluster mode, 3 shards.
    • S3 bucket openclaw-chat-logs.
  2. Install cluster addons
    • cert-manager 1.14
    • aws-load-balancer-controller 2.7
    • karpenter 0.35
    • prometheus-operator 58.1
    • grafana-tempo-helm 1.7
  3. Deploy browserless pool: helm install chrome browserless/chrome --set replicas=3
  4. Push ClawAgent CRD and Operator
  5. Create first agent (kubectl apply -f support-bot.yml)
  6. Watch logs: kubectl logs -f deploy/support-bot-eu -c gateway
  7. Scale to 200: generate 199 more CRs, kubectl apply -f agents/
  8. Observe HPA: kubectl get hpa -n openclaw-prod
  9. Ship Grafana dashboard to execs.
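Step 7's “generate 199 more CRs” is worth scripting rather than hand-editing. A throwaway generator along these lines works; the spec fields mirror the trimmed ClawAgent example earlier, so adapt them to yours:

```python
import pathlib

# Hypothetical CR template; token names and fields are placeholders.
TEMPLATE = """\
apiVersion: clawcloud.ai/v1
kind: ClawAgent
metadata:
  name: support-bot-{i:03d}
spec:
  image: ghcr.io/openclaw/gateway:4.8.3
  connections:
    - type: slack
      tokenRef: slack-token-{i:03d}
  memory:
    mode: postgres
    encrypted: true
"""

def generate_agents(out_dir, count=200, start=2):
    """Write ClawAgent CRs support-bot-002 .. support-bot-<count> into out_dir."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(start, count + 1):
        (out / f"support-bot-{i:03d}.yaml").write_text(TEMPLATE.format(i=i))
    return count - start + 1
```

Then a single kubectl apply -f agents/ pushes the whole fleet.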

When not to DIY: ClawCloud’s hosted offering

All of the above is fun if you like YAML. If you just want agents, ClawCloud spins one in ~60 s: pick a region, name your agent, paste Slack token. We still run our staging clusters ourselves because we like to peek under the hood, but customer workloads ride on ClawCloud so that on-call is their problem, not ours.

Next step: run the 30-minute chaos drill

Create a script that deletes a random ClawAgent every five minutes. If any chat gets dropped or any on-call phone rings, you’re not production-ready. Fix that, then come brag on the GitHub Discussions board.