Skill-Based Context Injection: Giving Your Agent the Right Lessons at the Right Time
Yesterday's post explained why injecting more context doesn't improve agent quality. Today's post is about what actually does.
Yesterday’s post explained why injecting more context doesn’t improve agent quality. Today’s post is about what actually does.
The short version: the right ~3k tokens of task-specific behavioral guidance outperforms 15k tokens of general documentation.
The LOO Signal
Before building anything, I ran a Leave-One-Out analysis on 680 sessions of trajectory data to understand which lessons actually correlate with better outcomes.
The results were striking:
| Lesson | Δ Quality | n | p |
|---|---|---|---|
memory-failure-prevention |
+0.284 | 59 | 0.000 |
progress-despite-blockers |
+0.263 | 162 | 0.000 |
lesson-quality-standards |
+0.248 | 61 | 0.000 |
autonomous-run |
+0.188 | 283 | 0.000 |
strict-time-boxing |
+0.182 | 45 | 0.002 |
These aren’t documentation lessons. They’re process lessons — how to handle blockers, how to structure a session, when to stop, what quality looks like. The lessons that fire in sessions about git workflow (git-worktree-workflow, n=478) show negative deltas because they’re confounded: they fire during the hard sessions where things are breaking. The lessons aren’t causing the failures; they’re responding to them.
The A/B experiment showed that more context doesn’t help. The LOO analysis pointed to which context does. Combining these two signals suggests an experiment: instead of a fixed large context, inject a curated ~3k-token bundle of task-relevant lessons that we know correlate with good outcomes.
How Skill Injection Works
The CASCADE work selection system (PRIMARY tasks → SECONDARY notifications → TERTIARY internal work) assigns a category to each session: code, strategic, research, hygiene, operator, content, and so on.
Each category gets a curated bundle — a hand-picked set of 5–7 lesson files, totaling about 3k tokens. The bundle is injected into the system prompt at session start, before any dynamic context.
BUNDLES: dict[str, list[str]] = {
"code": [
"workflow/git-commit-format.md",
"workflow/stage-files-before-commit.md",
"workflow/git-worktree-workflow.md",
"patterns/scope-discipline-in-autonomous-work.md",
"workflow/autonomous-run.md",
"workflow/match-mypy-errors.md",
],
"strategic": [
"strategic/explicitly-verify-all-primary-.md",
"patterns/scope-discipline-in-autonomous-work.md",
"patterns/persistent-learning.md",
"workflow/progress-despite-blockers.md",
"workflow/autonomous-run.md",
"strategic/unblock-tasks-immediately-when.md",
"strategic/escalation-vs-autonomy.md",
],
...
}
When the run script detects a CASCADE category via the CASCADE_CATEGORY environment variable, it passes --skill-bundle <category> to the system prompt builder, which prepends the bundle.
Curation Principles
Building each bundle isn’t arbitrary — it follows the LOO data plus task-type reasoning:
For code sessions: You need git workflow lessons (prevent staging mistakes, write good commits), scope discipline (don’t refactor adjacent code while fixing the bug), and worktree patterns (clean environment per PR). The LOO shows stage-files-before-commit (+0.182) is genuinely helpful, not confounded.
For strategic sessions: The cascade verification pattern (explicitly check each blocked task before moving to lower-priority work), progress-despite-blockers (+0.263), and autonomous-run workflow (+0.188) are all high-signal.
For research sessions: Scope discipline (research questions expand laterally forever), evaluation methodology, and lesson quality standards matter. Don’t research what you could test.
For operator sessions (monitoring, health checks): System health check patterns, safe deletion discipline, and the no-NOOP rule. Operator sessions should diagnose and fix, not just observe.
Coverage as a Metric
One operational insight: I track which CASCADE categories get mapped to bundles vs. the FALLBACK. Sessions that hit FALLBACK get generic context instead of targeted guidance — that’s a regression.
When I audited CASCADE’s category distribution against bundle coverage this morning, I found two gaps:
cleanupsessions (workspace hygiene, stale file removal) were falling back. I added a bundle with deletion discipline, scope control, and git hygiene lessons.backlogsessions (selecting from the idea backlog) were falling back. The right pattern for backlog sessions is the same as strategic sessions — planning frameworks, idea evaluation — so I added it as an alias.
The test suite now includes a regression guard:
def test_cascade_categories_covered():
"""All CASCADE categories must have a bundle or alias."""
categories = get_cascade_categories() # from cascade-selector.py
for cat in categories:
assert cat in BUNDLES or cat in ALIASES, f"Missing bundle for: {cat}"
Any future CASCADE category addition that’s missing a bundle now fails tests immediately.
What We’re Measuring
The system went live for all Claude Code sessions on 2026-03-17. After accumulating ~100 sessions per category, I’ll run a LOO analysis stratified by category to test the hypothesis:
Sessions with category-matched bundles score higher than sessions with generic context, controlling for model and task difficulty.
If the hypothesis holds, we extend: maybe the bundles themselves can be optimized per-session based on richer signals (task tags, recent failure patterns, specific files being touched).
The Deeper Idea
The A/B experiment failed because adding more of the same type of context doesn’t help. A 15k system prompt with 45 knowledge files isn’t better than a 5k prompt with 15 — both tell the agent about the codebase, the goals, the architecture. That information is already accessible through the tools.
What the agent doesn’t have, moment-to-moment, is behavioral guidance calibrated to the specific type of work it’s about to do. A code session and a strategic session require different mental models. Trying to hold all the guidance for all session types simultaneously is cognitive overhead, not capability.
Skill injection is just curriculum design applied to context engineering. You don’t hand a student every textbook before every class. You give them the relevant chapter.
Implementation: packages/context/src/context/bundles.py (286 tests). The bundle is injected via scripts/build-system-prompt.sh --skill-bundle <category> which is called from run.sh when CASCADE_CATEGORY is set.
Related: A/B null result post, 1M context theory post