Most OpenClaw projects work fine out-of-the-box — until the seventh or eighth hour-long conversation thread. Then the prompt balloons past the model’s context limit, responses lag, and your API bill looks like a Bitcoin chart from 2021. This post is a field report on how we keep multi-day conversations flowing in production agents running on ClawCloud and on-prem, using nothing fancier than Node 22, OpenClaw 0.34.3, and some disciplined token math.
Why session management matters in OpenClaw
The OpenClaw gateway stores message history in session objects. Every inbound or outbound message is appended so the LLM gets the full dialogue on the next request. That’s great for small chats, but the moment you integrate WhatsApp groups or a busy Slack channel, the context window becomes the bottleneck:
- Anthropic’s Claude 3 Sonnet: 200k tokens
- OpenAI GPT-4o: 128k tokens
- Mixtral 8x22B on Ollama: 32k tokens after rope scaling
Add system prompts and function/tool declarations, and you can burn through 10k-20k tokens before the user even types hello. Without pruning, sessions either blow up or you start truncating randomly and lose coherence.
Session lifecycle: the defaults and the landmines
The gateway’s default SessionStore implementation keeps everything in Redis with no TTL. A background job (sessionSweeper) is off by default. If you launch using claw run on ClawCloud today, you get:
- Unlimited lifetime sessions
- History capped only by the context limit of your selected model
- No cost guardrails unless you set them
I learned this the expensive way: a support agent summarizing PDF manuals chewed through 9.3M tokens over a weekend because a customer pasted the entire ISO 26262 spec twice. The fix is to treat session management as first-class infrastructure, not an afterthought.
Measuring your context window: token math 101
Before you optimize, you need numbers. I export three metrics per request:
- prompt_tokens – tokens sent to the model
- completion_tokens – tokens in the model’s answer
- session_tokens_total – running total for that session
The cheapest way to count is tiktoken. A tiny helper inside the agent looks like this:
import { encoding_for_model } from 'tiktoken';
const enc = encoding_for_model('gpt-4o');
// The WASM-backed encoder holds native memory; call enc.free() at shutdown.
export const countTokens = (text) => enc.encode(text).length;
Wrap every incoming and outgoing message. Ship the numbers to Prometheus. Grafana heatmaps will show when you’re creeping toward the limit long before the API 400s.
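Before wiring the numbers into Prometheus, it helps to see the bookkeeping itself. Here is a minimal in-process ledger for the three metrics above; TokenLedger and its method names are hypothetical, not part of OpenClaw:

```javascript
// Per-session token ledger (hypothetical helper; metric names mirror
// the three exported above).
class TokenLedger {
  constructor() {
    this.sessions = new Map(); // sessionId -> running totals
  }

  record(sessionId, promptTokens, completionTokens) {
    const s = this.sessions.get(sessionId) ?? {
      prompt_tokens: 0,
      completion_tokens: 0,
      session_tokens_total: 0,
    };
    s.prompt_tokens += promptTokens;
    s.completion_tokens += completionTokens;
    s.session_tokens_total += promptTokens + completionTokens;
    this.sessions.set(sessionId, s);
    return s;
  }

  // Fraction of the model's context limit this session has consumed.
  utilization(sessionId, contextLimit) {
    const s = this.sessions.get(sessionId);
    return s ? s.session_tokens_total / contextLimit : 0;
  }
}

const ledger = new TokenLedger();
ledger.record('sess-1', 1200, 300);
ledger.record('sess-1', 1500, 250);
console.log(ledger.utilization('sess-1', 128000)); // 3250 / 128000 ≈ 0.0254
```

The `utilization` ratio is exactly what the Grafana heatmap plots: once it creeps past 0.8 for a session, you are in alert territory.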
Practical token budgeting for Node.js agents
1. Hard cap per request
In gateway.config.mjs:
export default {
  llm: {
    provider: 'openai',
    model: 'gpt-4o-mini',
    maxPromptTokens: 120000, // hard safety margin
    maxCompletionTokens: 4096
  }
}
This forces the gateway to refuse building a prompt larger than maxPromptTokens. It fails loudly, which is what you want in prod.
2. Soft budget per session
Add a mid-tier budget (e.g., 50% of max). Once the running total passes that, start summarizing old turns (we’ll get to that in a sec). Store the threshold in Redis so the daemon and any horizontally scaled gateways share state:
redis.set('budget:gpt-4o-mini', 60000);
3. Per-tenant or per-user quotas
If you’re SaaS-ing your agent, expose quotas in your billing DB and clip the session when they’re out. Users are remarkably good at creating degenerate prompt loops when they’re not paying for it.
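The quota check itself is a few lines. A sketch, assuming a per-tenant daily token allowance loaded from your billing DB (the Map here stands in for that lookup; checkQuota is a hypothetical name):

```javascript
// Stand-in for a billing-DB lookup: tokens allowed per tenant per day.
const tenantQuotas = new Map([['acme', 500_000]]);

function checkQuota(tenantId, usedToday, requestTokens) {
  const quota = tenantQuotas.get(tenantId) ?? 0; // unknown tenant -> no budget
  if (usedToday + requestTokens > quota) {
    // Clip the session instead of building the prompt.
    return { allowed: false, remaining: Math.max(0, quota - usedToday) };
  }
  return { allowed: true, remaining: quota - usedToday - requestTokens };
}

console.log(checkQuota('acme', 499_000, 2_000));
// { allowed: false, remaining: 1000 }
```

Run the check before prompt assembly so a blocked request costs you nothing but a Redis round-trip.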
Long-conversation strategy: rolling window, summarize, archive
I’ve tested five retention strategies. Three survive real traffic:
- Rolling window — keep the last N turns verbatim, drop older ones.
- Summarize & replace — swap blocks of dialogue with LLM summaries.
- Vector recall — embed every message, dump from prompt, re-inject only the ones nearest to the current query.
Rolling window config
Fast, deterministic, zero extra cost, but the model can forget important early context. Ideal for support chats where the user restates context often anyway.
// gateway.config.mjs
export default {
  session: {
    strategy: 'window',
    turns: 12 // keep last 12 messages (user+assistant)
  }
}
Summarize & replace
OpenClaw ships a convenience helper summarizeHistory(). My patch adds a guard so we don’t spend more than 1% of a session’s token budget summarizing it:
if (summaryCost > sessionBudget * 0.01) {
  // skip summarization this turn
}
Store summaries prefixed with [context] so the model treats them as references, not fresh dialogue.
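The prune-and-replace step looks roughly like this. A sketch, not OpenClaw's actual summarizeHistory() internals: the summarize callback is a stub standing in for the LLM call, and keepLast plays the role of the window size:

```javascript
// Replace old turns with a single [context]-prefixed summary message.
function summarizeAndReplace(messages, keepLast, summarize) {
  if (messages.length <= keepLast) return messages; // nothing to prune yet
  const old = messages.slice(0, messages.length - keepLast);
  const recent = messages.slice(-keepLast);
  const summary = summarize(old);
  // [context] tells the model this is reference material, not fresh dialogue.
  return [{ role: 'system', content: `[context] ${summary}` }, ...recent];
}

const history = [
  { role: 'user', content: 'My order #123 arrived damaged.' },
  { role: 'assistant', content: 'Sorry to hear that. Can you share photos?' },
  { role: 'user', content: 'Uploaded.' },
  { role: 'assistant', content: 'Refund issued.' },
];
const pruned = summarizeAndReplace(history, 2, (old) =>
  `User reported damage on order #123 and was asked for photos (${old.length} turns).`
);
console.log(pruned.length); // 3: one summary message + last 2 verbatim turns
```

In production you would await the summarizer and charge its cost against the 1% guard above before committing the replacement.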
Vector recall with Composio Postgres
If you already run pgvector for tool embeddings, piggy-back on it. Every time you prune a message, embed it with OpenAI’s text-embedding-3-small (1,536-dimensional vectors). At inference time:
SELECT content
FROM message_embeddings
ORDER BY embedding <-> :current_query_vector
LIMIT 6;
Inject the top 6 chunks at the tail of the prompt under a header like Relevant past facts. Works great for personal assistant agents that must remember a user’s preferences from months ago.
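To make the ranking concrete, here is an in-process toy version of what the `<->` operator does server-side. The two-dimensional embeddings are made up for illustration; real vectors have 1,536 dimensions and the sort happens inside Postgres:

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSim(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored rows against the query vector and keep the top k.
function recall(queryVec, rows, k) {
  return rows
    .map((r) => ({ ...r, score: cosineSim(queryVec, r.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.content);
}

const rows = [
  { content: 'prefers dark mode', embedding: [0.9, 0.1] },
  { content: 'lives in Berlin', embedding: [0.1, 0.9] },
  { content: 'uses vim keybindings', embedding: [0.8, 0.3] },
];
console.log(recall([1, 0], rows, 2));
// ['prefers dark mode', 'uses vim keybindings']
```

The returned strings are what you concatenate under the Relevant past facts header before the current user turn.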
Implementing session pruning hooks
The gateway exposes two hooks:
- beforePromptAssemble(session)
- afterCompletion(session, messages)
Glue code lives in src/hooks/session.js:
export async function beforePromptAssemble(session) {
  const { totalTokens } = session.meta;
  if (totalTokens > budget.hard) {
    // Check the hard cap first -- in an else-if chain behind the soft cap,
    // this branch would never fire, since soft < hard.
    throw new Error('Context window exhausted');
  } else if (totalTokens > budget.soft) {
    await summarizeAndPrune(session);
  }
}
Because this runs in the same process that renders the prompt, keep it < 20 ms or the UX stutters. Heavy lifting like vector search belongs in the async worker pool.
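One way to stay under the 20 ms budget is to enqueue the heavy work rather than awaiting it. A sketch of that shape; the array here stands in for a real job queue (BullMQ, Redis streams, etc.), and the function names are hypothetical:

```javascript
// In-memory stand-in for a real job queue consumed by the worker pool.
const jobQueue = [];

function scheduleVectorRecall(session) {
  jobQueue.push({ type: 'vector-recall', sessionId: session.id });
}

// Hook body stays synchronous and cheap: counter comparisons only,
// no embeddings, no network calls on the hot path.
function fastBeforePromptAssemble(session) {
  if (session.meta.totalTokens > session.meta.softCap) {
    scheduleVectorRecall(session); // worker pool picks this up async
  }
}

fastBeforePromptAssemble({ id: 's1', meta: { totalTokens: 70000, softCap: 60000 } });
console.log(jobQueue.length); // 1
```

The recalled chunks then land in the prompt on the *next* turn, which in practice users never notice.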
Cost control: avoiding surprise API bills
Two levers saved us 32% last quarter:
- Early exit on no-op commands — If the user types thanks and the agent has no memory update or tool call, skip the LLM round-trip.
- Model tier downgrade — For sessions that haven’t crossed a critical complexity threshold, start with GPT-3.5 Turbo 1106 on a 16k window. Promote to GPT-4o only when a summary or toolchain call fails a QA check. Do it transparently and users rarely notice.
Code sketch:
// Skip the round-trip for short acknowledgements like "thanks"
if (message.trim().length <= 8 && !needsAction(message)) {
  return sendTypingIndicatorOnly();
}
if (!session.meta.highComplexity) {
  llm.model = 'gpt-3.5-turbo-1106'; // promote to gpt-4o only on QA failure
}
Monitoring & metrics you actually look at
A Grafana dashboard that nobody checks is useless. Hook alerts to #ops-alerts Slack when:
- session_tokens_total > 80% of the max for 5 min
- request latency p95 > 5 s
- cost per tenant per day > $1 (adjust to your margins)
Prometheus rule example:
alert: OpenClawContextNearLimit
expr: session_tokens_total / model_context_limit > 0.8
for: 5m
labels:
  severity: warning
annotations:
  summary: 'Session {{ $labels.session_id }} near context limit'
Session config cheatsheet (copy/paste)
// session.config.mjs
export default {
  ttlSeconds: 86400, // prune inactive after 24h
  strategy: 'hybrid', // window + summarize + vector
  windowTurns: 8, // last 8 verbatim
  summaryBatch: 4, // summarize every 4 old turns
  vectorStore: 'postgres',
  hardCapTokens: 120000,
  softCapTokens: 60000
}
Enable sessionSweeper in daemon.config.mjs so zombie sessions don’t pile up:
export default {
  jobs: {
    sessionSweeper: {
      enabled: true,
      runEvery: '15m'
    }
  }
}
Next steps
Start with the cheatsheet config, flip on metrics, and let it run for 48 hours. Your real traffic pattern will tell you which strategy (window, summarize, vector) burns the least tokens for acceptable recall. Tune the soft cap until the alert noise stops, then lock that into CI. Future you — and your finance team — will thank you.