Most OpenClaw projects work fine out-of-the-box — until the seventh or eighth hour-long conversation thread. Then the prompt balloons past the model’s context limit, responses lag, and your API bill looks like a Bitcoin chart from 2021. This post is a field report on how we keep multi-day conversations flowing in production agents running on ClawCloud and on-prem, using nothing fancier than Node 22, OpenClaw 0.34.3, and some disciplined token math.
Why session management matters in OpenClaw
The OpenClaw gateway stores message history in session objects. Every inbound or outbound message is appended so the LLM gets the full dialogue on the next request. That’s great for small chats, but the moment you integrate WhatsApp groups or a busy Slack channel, the context window becomes the bottleneck:
- Anthropic’s Claude 3 Sonnet: 200k tokens
- OpenAI GPT-4o: 128k tokens
- Mixtral 8x22B on Ollama: 32k tokens after rope scaling
Add system prompts and function/tool declarations, and you can burn through 10k-20k tokens before the user even types hello. Without pruning, sessions either blow up or you start truncating randomly and lose coherence.
Session lifecycle: the defaults and the landmines
The gateway’s default SessionStore implementation keeps everything in Redis with no TTL. A background job (sessionSweeper) is off by default. If you launch using claw run on ClawCloud today, you get:
- Unlimited lifetime sessions
- History capped only by the context limit of your selected model
- No cost guardrails unless you set them
I learned this the expensive way: a support agent summarizing PDF manuals chewed through 9.3M tokens over a weekend because a customer pasted the entire ISO 26262 spec twice. The fix is to treat session management as first-class infrastructure, not an afterthought.
Measuring your context window: token math 101
Before you optimize, you need numbers. I export three metrics per request:
- prompt_tokens – tokens sent to the model
- completion_tokens – tokens in the model’s answer
- session_tokens_total – running total for that session
The cheapest way to count is tiktoken. A tiny helper inside the agent looks like this:
import { encoding_for_model } from 'tiktoken';
const enc = encoding_for_model('gpt-4o');
// The WASM-backed encoder holds native memory; call enc.free() at shutdown.
export const countTokens = (text) => enc.encode(text).length;
Wrap every incoming and outgoing message. Ship the numbers to Prometheus. Grafana heatmaps will show when you’re creeping toward the limit long before the API 400s.
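Before wiring the numbers into Prometheus, it helps to see the bookkeeping itself. Here is a minimal in-process ledger for the three metrics above; TokenLedger and its method names are hypothetical, not part of OpenClaw:

```javascript
// Per-session token ledger (hypothetical helper; metric names mirror
// the three exported above).
class TokenLedger {
  constructor() {
    this.sessions = new Map(); // sessionId -> running totals
  }

  record(sessionId, promptTokens, completionTokens) {
    const s = this.sessions.get(sessionId) ?? {
      prompt_tokens: 0,
      completion_tokens: 0,
      session_tokens_total: 0,
    };
    s.prompt_tokens += promptTokens;
    s.completion_tokens += completionTokens;
    s.session_tokens_total += promptTokens + completionTokens;
    this.sessions.set(sessionId, s);
    return s;
  }

  // Fraction of the model's context limit this session has consumed.
  utilization(sessionId, contextLimit) {
    const s = this.sessions.get(sessionId);
    return s ? s.session_tokens_total / contextLimit : 0;
  }
}

const ledger = new TokenLedger();
ledger.record('sess-1', 1200, 300);
ledger.record('sess-1', 1500, 250);
console.log(ledger.utilization('sess-1', 128000)); // 3250 / 128000 ≈ 0.0254
```

The `utilization` ratio is exactly what the Grafana heatmap plots: once it creeps past 0.8 for a session, you are in alert territory.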
Practical token budgeting for Node.js agents
1. Hard cap per request
In gateway.config.mjs:
export default {
  llm: {
    provider: 'openai',
    model: 'gpt-4o-mini',
    maxPromptTokens: 120000, // hard safety margin
    maxCompletionTokens: 4096
  }
}
This forces the gateway to refuse building a prompt larger than maxPromptTokens. It fails loudly, which is what you want in prod.
2. Soft budget per session
Add a mid-tier budget (e.g., 50% of max). Once the running total passes that, start summarizing old turns (we’ll get to that in a sec). Store the threshold in Redis so the daemon and any horizontally scaled gateways share state:
redis.set('budget:gpt-4o-mini', 60000);
3. Per-tenant or per-user quotas
If you’re SaaS-ing your agent, expose quotas in your billing DB and clip the session when they’re out. Users are remarkably good at creating degenerate prompt loops when they’re not paying for it.
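The quota check itself is a few lines. A sketch, assuming a per-tenant daily token allowance loaded from your billing DB (the Map here stands in for that lookup; checkQuota is a hypothetical name):

```javascript
// Stand-in for a billing-DB lookup: tokens allowed per tenant per day.
const tenantQuotas = new Map([['acme', 500_000]]);

function checkQuota(tenantId, usedToday, requestTokens) {
  const quota = tenantQuotas.get(tenantId) ?? 0; // unknown tenant -> no budget
  if (usedToday + requestTokens > quota) {
    // Clip the session instead of building the prompt.
    return { allowed: false, remaining: Math.max(0, quota - usedToday) };
  }
  return { allowed: true, remaining: quota - usedToday - requestTokens };
}

console.log(checkQuota('acme', 499_000, 2_000));
// { allowed: false, remaining: 1000 }
```

Run the check before prompt assembly so a blocked request costs you nothing but a Redis round-trip.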
Long-conversation strategy: rolling window, summarize, archive
I’ve tested five retention strategies. Three survive real traffic:
- Rolling window — keep the last N turns verbatim, drop older ones.
- Summarize & replace — swap blocks of dialogue with LLM summaries.
- Vector recall — embed every message, dump from prompt, re-inject only the ones nearest to the current query.
Rolling window config
Fast, deterministic, zero extra cost, but the model can forget important early context. Ideal for support chats where the user restates context often anyway.
// gateway.config.mjs
export default {
  session: {
    strategy: 'window',
    turns: 12 // keep last 12 messages (user+assistant)
  }
}
Summarize & replace
OpenClaw ships a convenience helper summarizeHistory(). My patch adds a guard so we don’t spend more than 1% of a session’s token budget summarizing it:
if (summaryCost > sessionBudget * 0.01) {
  // skip summarization this turn
}
Store summaries prefixed with [context] so the model treats them as references, not fresh dialogue.
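The prune-and-replace step looks roughly like this. A sketch, not OpenClaw's actual summarizeHistory() internals: the summarize callback is a stub standing in for the LLM call, and keepLast plays the role of the window size:

```javascript
// Replace old turns with a single [context]-prefixed summary message.
function summarizeAndReplace(messages, keepLast, summarize) {
  if (messages.length <= keepLast) return messages; // nothing to prune yet
  const old = messages.slice(0, messages.length - keepLast);
  const recent = messages.slice(-keepLast);
  const summary = summarize(old);
  // [context] tells the model this is reference material, not fresh dialogue.
  return [{ role: 'system', content: `[context] ${summary}` }, ...recent];
}

const history = [
  { role: 'user', content: 'My order #123 arrived damaged.' },
  { role: 'assistant', content: 'Sorry to hear that. Can you share photos?' },
  { role: 'user', content: 'Uploaded.' },
  { role: 'assistant', content: 'Refund issued.' },
];
const pruned = summarizeAndReplace(history, 2, (old) =>
  `User reported damage on order #123 and was asked for photos (${old.length} turns).`
);
console.log(pruned.length); // 3: one summary message + last 2 verbatim turns
```

In production you would await the summarizer and charge its cost against the 1% guard above before committing the replacement.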
Vector recall with Composio Postgres
If you already run pgvector for tool embeddings, piggy-back on it. Every time you prune a message, embed it with OpenAI’s text-embedding-3-small (1,536-dimensional vectors). At inference time:
SELECT content
FROM message_embeddings
ORDER BY embedding <-> :current_query_vector
LIMIT 6;
Inject the top 6 chunks at the tail of the prompt under a header like Relevant past facts. Works great for personal assistant agents that must remember a user’s preferences from months ago.
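To make the ranking concrete, here is an in-process toy version of what the `<->` operator does server-side. The two-dimensional embeddings are made up for illustration; real vectors have 1,536 dimensions and the sort happens inside Postgres:

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSim(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored rows against the query vector and keep the top k.
function recall(queryVec, rows, k) {
  return rows
    .map((r) => ({ ...r, score: cosineSim(queryVec, r.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.content);
}

const rows = [
  { content: 'prefers dark mode', embedding: [0.9, 0.1] },
  { content: 'lives in Berlin', embedding: [0.1, 0.9] },
  { content: 'uses vim keybindings', embedding: [0.8, 0.3] },
];
console.log(recall([1, 0], rows, 2));
// ['prefers dark mode', 'uses vim keybindings']
```

The returned strings are what you concatenate under the Relevant past facts header before the current user turn.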
Implementing session pruning hooks
The gateway exposes two hooks:
- beforePromptAssemble(session)
- afterCompletion(session, messages)
Glue code lives in src/hooks/session.js:
export async function beforePromptAssemble(session) {
  const { totalTokens } = session.meta;
  if (totalTokens > budget.hard) {
    // Check the hard cap first -- in an else-if chain behind the soft cap,
    // this branch would never fire, since soft < hard.
    throw new Error('Context window exhausted');
  } else if (totalTokens > budget.soft) {
    await summarizeAndPrune(session);
  }
}
Because this runs in the same process that renders the prompt, keep it < 20 ms or the UX stutters. Heavy lifting like vector search belongs in the async worker pool.
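One way to stay under the 20 ms budget is to enqueue the heavy work rather than awaiting it. A sketch of that shape; the array here stands in for a real job queue (BullMQ, Redis streams, etc.), and the function names are hypothetical:

```javascript
// In-memory stand-in for a real job queue consumed by the worker pool.
const jobQueue = [];

function scheduleVectorRecall(session) {
  jobQueue.push({ type: 'vector-recall', sessionId: session.id });
}

// Hook body stays synchronous and cheap: counter comparisons only,
// no embeddings, no network calls on the hot path.
function fastBeforePromptAssemble(session) {
  if (session.meta.totalTokens > session.meta.softCap) {
    scheduleVectorRecall(session); // worker pool picks this up async
  }
}

fastBeforePromptAssemble({ id: 's1', meta: { totalTokens: 70000, softCap: 60000 } });
console.log(jobQueue.length); // 1
```

The recalled chunks then land in the prompt on the *next* turn, which in practice users never notice.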
Cost control: avoiding surprise API bills
Two levers saved us 32% last quarter:
- Early exit on no-op commands — If the user types thanks and the agent has no memory update or tool call, skip the LLM round-trip.
- Model tier downgrade — For sessions that haven’t crossed a critical complexity threshold, start with GPT-3.5 Turbo 1106 on a 16k window. Promote to GPT-4o only when a summary or toolchain call fails a QA check. Do it transparently and users rarely notice.
Code sketch:
// Skip the round-trip for short acknowledgements like "thanks"
if (message.trim().length <= 8 && !needsAction(message)) {
  return sendTypingIndicatorOnly();
}
if (!session.meta.highComplexity) {
  llm.model = 'gpt-3.5-turbo-1106'; // promote to gpt-4o only on QA failure
}
Monitoring & metrics you actually look at
A Grafana dashboard that nobody checks is useless. Hook alerts to #ops-alerts Slack when:
- session_tokens_total > 80% of the max for 5 min
- request latency p95 > 5 s
- cost per tenant per day > $1 (adjust to your margins)
Prometheus rule example:
alert: OpenClawContextNearLimit
expr: session_tokens_total / model_context_limit > 0.8
for: 5m
labels:
  severity: warning
annotations:
  summary: 'Session {{ $labels.session_id }} near context limit'
Session config cheatsheet (copy/paste)
// session.config.mjs
export default {
  ttlSeconds: 86400, // prune inactive after 24h
  strategy: 'hybrid', // window + summarize + vector
  windowTurns: 8, // last 8 verbatim
  summaryBatch: 4, // summarize every 4 old turns
  vectorStore: 'postgres',
  hardCapTokens: 120000,
  softCapTokens: 60000
}
Enable sessionSweeper in daemon.config.mjs so zombie sessions don’t pile up:
export default {
  jobs: {
    sessionSweeper: {
      enabled: true,
      runEvery: '15m'
    }
  }
}
Next steps
Start with the cheatsheet config, flip on metrics, and let it run for 48 hours. Your real traffic pattern will tell you which strategy (window, summarize, vector) burns the least tokens for acceptable recall. Tune the soft cap until the alert noise stops, then lock that into CI. Future you — and your finance team — will thank you.