We Tested 1M Context on 143 Agent Sessions. The Result Was Null.

Three days ago I published a post about 1M context windows going GA for Claude and what it might mean for agents. The theory was reasonable: more context headroom → more knowledge included → better decisions.

We tested it. It doesn’t work.

The Experiment

Bob (my autonomous agent, running ~25 sessions/day) has been split-testing two context configurations since March 14:

Standard tier: 15k token system prompt, keyword-matched lessons, dynamic task status. What we’ve used for months.
Massive tier: 3× more content — full knowledge base, longer lesson history, more GitHub context, extended journal summaries.

Random assignment per session. N=143 sessions over 3 days. We measure trajectory grade (0-1, LLM-as-judge scoring session quality) as the primary metric.

What the Naive Analysis Said

At first glance, massive tier looks like a clear winner:

Tier	n	Productive	Grade	Deliverables	Duration
massive	93	99%	0.641	8.7	1284s
standard	50	84%	0.561	11.0	1025s

Massive: +0.080 quality, +15% productivity. Ship it.

But this is wrong. The standard group was contaminated.

The Confounding Problem

Bob runs on multiple backends — Claude Opus, Sonnet, and (for a brief experimental window) gpt-5.4 via OpenAI subscription. The gpt-5.4 sessions ran exclusively in the standard tier during the experiment period. And gpt-5.4 is… not great at autonomous work:

Model×Tier	n	Productive	Grade
opus@massive	69	99%	0.653
opus@standard	16	100%	0.651
sonnet@massive	24	100%	0.606
sonnet@standard	12	100%	0.650
gpt-5.4@standard	20	60%	0.417

Those 20 gpt-5.4 sessions dragged the standard group’s mean from ~0.651 down to 0.561. Not a context effect at all — a model contamination effect.

The Real Numbers

Opus-only (primary signal):

Group	n	Mean Grade	Difference
opus@massive	69	0.653	—
opus@standard	16	0.651	+0.002

+0.002. That’s noise.

Proper randomized AB groups (excluding the contaminated sessions):

Group	n	Productive	Grade
treatment_massive	93	99%	0.641
control_standard	27	100%	0.646

Standard wins by 0.005. Also noise — but in the other direction.

Sonnet (interesting):

Group	n	Grade
sonnet@massive	24	0.606
sonnet@standard	12	0.650

Standard wins by 0.044. This might be real — smaller models may get distracted or confused by more context, while Opus is robust to it.

What Actually Matters

If context volume doesn’t move quality, what does?

We have a parallel data source: Thompson sampling bandits tracking lesson effectiveness across ~800 sessions (leave-one-out analysis). The top quality drivers:

Lesson	Quality lift
`memory-failure-prevention`	+0.288
`progress-despite-blockers`	+0.270
`lesson-quality-standards`	+0.243
`autonomous-run`	+0.195
`stage-files-before-commit`	+0.183

Look at that list. Not a single documentation or knowledge-base lesson. Every top performer is a mindset or process lesson — frameworks for how to think and decide, not raw information.

The lesson isn’t “give the agent more context.” It’s “give the agent better decision frameworks.”

This is consistent with the context engineering research finding that structured, task-relevant context outperforms comprehensive but unfocused context. The model already knows most of what it needs. What it needs is help using that knowledge effectively.

The Cost of Being Wrong

The naive analysis would have had us shipping massive context as the default. What that would actually cost:

Metric	Massive	Standard	Difference
Session duration	1284s	1025s	+25%
Context tokens injected	32.8M	13.0M	+152%

25% slower, no quality gain. At 25 sessions/day, that’s ~80 minutes of extra wall-clock time per day, plus meaningfully higher token costs on the system prompt side.

The confounding nearly had us pay 25% more for nothing.

What We’re Doing Instead

If broad context volume is neutral, the next hypothesis is targeted context: inject skill sets matched to the current task type.

For a strategic planning session: inject planning frameworks, goal hierarchy, decision rubrics. For a PR review session: inject git workflow, review patterns, code quality standards. For a code debugging session: inject error tracing patterns, test-writing lessons.

We implemented this last week — packages/context/src/context/bundles.py maps 12 CASCADE task categories to curated 5-7 lesson bundles. The system now reads the task category at session start and injects only the relevant skill set.

The A/B experiment on this is running. Hypothesis: targeted injection outperforms both massive and standard by 0.1+ on trajectory grade, because it gives the model the right context rather than more context.

Takeaways for Context Engineers

1. Run controlled experiments before deploying context changes. The naive analysis (massive looks great!) was backwards from reality. Without proper controls, you can’t know if your context improvements are real.

2. Watch for model contamination in multi-backend deployments. If you A/B test context strategies while also A/B testing models, contamination is almost inevitable. Either fix your model or isolate the confound explicitly.

3. Mindset/process lessons beat knowledge dumps. For Opus-class models, the raw information is often already there. The gap is decision quality, not information quantity. Invest in frameworks over references.

4. Context volume doesn’t scale quality. Context precision does. The right 10k tokens beats the wrong 100k tokens. Design for relevance, not comprehensiveness.

5. Smaller models may be more sensitive to context bloat. The Sonnet data (standard wins by 0.044) suggests capacity-constrained models get hurt by excess context. If you’re trying to cut costs with smaller models, don’t compensate with more context.

The full analysis doc is in Bob’s workspace at knowledge/analysis/ab-context-tier-decision-2026-03.md. The skill injection experiment is running now — should have enough data for analysis in ~2 weeks.

This post is part of an ongoing series on what actually works in production autonomous agent systems.