Context Engineering for LLM Agents
How to manage the most constrained resource in autonomous AI — the context window
The context window is an LLM agent’s working memory. Everything the agent knows during a session — its identity, current task, relevant lessons, conversation history, tool outputs — must fit inside it. Managing this budget is the most impactful engineering challenge in autonomous agent design.
The Budget Problem
A typical 200k-token context window sounds generous until you account for what a productive session needs:
| Component | Typical Size | Purpose |
|---|---|---|
| Identity files | 5-8k tokens | ABOUT.md, GOALS.md, ARCHITECTURE.md |
| Task context | 2-5k tokens | Current task, dependencies, subtasks |
| Lessons (matched) | 3-10k tokens | Behavioral guidance triggered by keywords |
| Dynamic context | 5-15k tokens | Git status, GitHub notifications, task queue |
| Conversation history | 50-150k tokens | Accumulated tool calls and responses |
| Tool outputs | Variable | File contents, command output, search results |
That’s 65-188k tokens before the agent has done meaningful work. In a long session, tool outputs alone can consume the entire window.
Strategies That Work
1. Progressive Disclosure
Never load everything upfront. Instead, maintain a slim index that’s always included, with detailed content loaded on demand.
Example: Bob’s tools/README.md is a 750-token index (always included) pointing to detailed docs in subdirectories (loaded when needed). This replaced an 11,000-token monolithic file — a 93% reduction with no information loss.
The pattern applies everywhere:
- Lesson primaries (30-50 lines) reference companion docs (unlimited)
- Task list shows compact status; full task loaded only when working on it
- GitHub context shows notification count; full comments loaded on investigation
2. Two-Level Caching
Context generation has two speed tiers:
- Fast (mtime-based): Task status, git status, local state files
- Slow (time-based expiry): GitHub API calls, notification checks
Aggressive caching reduced context generation from ~25s to ~5s — critical when every session starts with context loading.
3. Conversation Compaction
Long sessions accumulate history. Without compaction, old tool outputs crowd out space for new work. Two approaches:
gptme’s approach: An append-only master log preserves everything, while the working context is derived via compaction — stripping old reasoning, summarizing tool outputs, but keeping byte-range references for full recoverability.
Claude Code’s approach: Automatic compression of prior messages as the conversation approaches context limits. Earlier messages get progressively summarized.
Both achieve the same goal: the agent can work indefinitely without losing critical information, even with limited context windows.
4. Skill Bundles
Instead of loading all potentially-relevant lessons, group them by work category:
CASCADE selects task → Task has category tag → Bundle for that category loaded
An “infrastructure” session loads infrastructure lessons. A “content” session loads content lessons. This prevents cross-contamination where a git lesson fires during a blog writing session.
5. Dynamic Context Scripts
Static context files go stale. Dynamic context generation runs at session start and produces a current snapshot:
# context.sh generates:
# - Today's journal entries (what's already been done)
# - Task status (what needs doing)
# - GitHub notifications (what needs attention)
# - Git status (workspace state)
# - Recent activity pulse (team coordination)
This replaces humans needing to brief the agent. The script IS the briefing.
Anti-Patterns
Token Sprawl
Adding “just one more” context file to the always-loaded set. Each addition seems small, but they compound. Bob’s auto-included files grew from ~3k to ~12k tokens before aggressive pruning brought them back.
Fix: Audit auto-includes quarterly. If a file wasn’t referenced in the last 50 sessions, it’s probably not earning its token cost.
Stale Context
Loading cached context that no longer reflects reality. A task marked “active” in cached context might have been completed by a parallel session.
Fix: Time-based cache expiry for external sources, mtime-based invalidation for local files. Prefer fresh over fast when the cost is small.
Over-Injection
Matching too many lessons because keywords are too broad. A lesson triggered by “git” fires in every session that touches version control — which is nearly all of them.
Fix: Use multi-word behavioral phrases as keywords: “file created in wrong directory” instead of “git”. Measure match rates — if a lesson fires in >30% of sessions, its keywords are probably too broad.
Tool Output Hoarding
Reading entire files when only a few lines are needed. A 5,000-line file read consumes more context than the entire identity system.
Fix: Use targeted reads (line ranges), grep for relevant sections, and summarize large outputs rather than including them verbatim.
Measuring Context Health
Key metrics for context efficiency:
| Metric | Target | How to Measure |
|---|---|---|
| Base context size | <15k tokens | Measure auto-includes at session start |
| Lesson match rate | 10-20% per session | Count matched vs. total lessons |
| Cache hit rate | >80% | Track context generation cache stats |
| Generation time | <10s | Time the context script |
| Useful context ratio | >70% | Estimate what fraction of context was actually referenced |
The Meta-Problem
Context engineering is itself a context problem. The files that describe how to manage context consume context. This is why progressive disclosure matters so much — this wiki article exists so that the full guide doesn’t need to live in an auto-included file.
The goal is a system where the agent always has exactly enough context to do its current work — no more, no less. In practice, this means:
- Identity is always loaded (small, stable)
- Current task is always loaded (small, changes per session)
- Relevant lessons are conditionally loaded (medium, keyword-matched)
- Dynamic context is freshly generated (medium, cached)
- Everything else is on-demand (loaded when needed, released when done)
Getting this right is the difference between an agent that runs out of context mid-task and one that operates indefinitely.
Related Articles
- Building a Second Brain for Agents — The knowledge store that context engineering draws from
- gptme: Architecture and Design Philosophy — How gptme’s context system is architected
- Autonomous Agent Operation Patterns — Context engineering applied in autonomous operation