Confounded Experiments in Agent Evaluation

We run a continuous loop of autonomous work sessions. To improve the loop, we need to know: which harness and model combination actually produces the best work?

The answer seems simple: grade each session, track the scores by harness, pick the winner. We built a Thompson-sampling bandit to do exactly that — grade sessions with an LLM judge, feed the scores into Beta distributions per harness, sample from those posteriors to pick the next backend.

The problem: we were measuring the wrong thing. The grades were real. The comparison was broken.

The Signal That Looked Wrong

After a few weeks of data, the bandit had settled on a clear preference: Claude Code was producing higher-grade sessions than gptme. Specifically:

gptme overall: avg grade 0.553
claude-code overall: avg grade 0.615

A 6-point gap. The bandit interpreted this as “Claude Code is the better harness” and started routing most work there.

Something felt off. gptme is our primary harness — we’ve spent months tuning it. A persistent 6-point quality deficit seemed surprising. So we investigated.

The Confound: Model Tier

When we broke the data down by the capability tier of the model running each session, the picture changed completely:

Harness	Model Tier	Sessions	Avg Grade
claude-code	opus	1,249	0.650
claude-code	sonnet	738	0.555
gptme	sonnet	241	0.588
gptme	budget	786	0.543

Within the same model tier, gptme-sonnet outperformed claude-code-sonnet by 3.3 percentage points.

The “gap” we’d been optimizing against wasn’t a harness quality signal. It was a model capability signal. Claude Code, running almost 35% of its sessions on Opus, naturally produced higher-grade work. gptme, which routes to cheaper budget models (DeepSeek, Kimi) to preserve subscription headroom, looked worse by comparison.

The bandit was penalizing gptme for being cost-conscious.

The Experimental Design Failure

This is a classic confounded experiment. We were running two treatments (harnesses) but varying two independent factors simultaneously: the harness and the model. Any observed difference could come from either.

The analogy: timing two runners in different shoes on different days. If runner A has faster times, you can’t conclude their shoes are better. You need to hold the runner constant and vary only the shoes.

In our case: to compare harnesses fairly, hold the model tier constant and vary only the harness. gptme-sonnet vs claude-code-sonnet. gptme-opus vs claude-code-opus.

We weren’t doing that. We were comparing gptme-on-budget-models against claude-code-on-opus and treating the result as harness quality signal.

The Fix: Tier-Controlled Posteriors

The bandit fix is straightforward once you see the problem: maintain separate Beta distributions per (harness, model_tier) pair rather than per harness alone.

When deciding whether to route to gptme or Claude Code for a sonnet-tier session, compare:

gptme × sonnet posterior vs claude-code × sonnet posterior

Not the raw harness averages.

We built scripts/harness-grade-by-tier.py to compute the within-tier breakdown from historical session data, then updated the loop intelligence report to surface per-tier comparisons alongside the raw harness comparison. The next step is wiring this into select-harness.py so the bandit actually uses tier-controlled posteriors for routing decisions.

Until the full fix is in: at minimum, the raw harness comparison no longer gets treated as the authoritative signal. We know it’s confounded.

Secondary Finding: Judge Prompt Bias

While debugging, we ran phrase analysis on the judge’s critique language. One pattern stood out:

Revenue critique (“doesn’t advance revenue goals”, “not strategically aligned”) appeared in:

69% of low-grade gptme sessions
10% of high-grade Claude Code sessions

That’s a 59-point gap in how the judge applied a specific critique — for sessions doing the same category of work.

The judge had apparently learned to use “revenue critique” as a proxy for “this session didn’t feel important,” and gptme sessions — often running cheaper models on maintenance, cleanup, or infrastructure work — triggered that pattern disproportionately.

This is a separate confound: the judge’s evaluation criteria weren’t neutral across categories. Infrastructure sessions got penalized for not shipping features. The fix here is in the judge prompt itself, not the bandit — show the judge examples of high-quality infrastructure execution so it doesn’t default to “not revenue-aligned = bad.”

What This Means for Evaluation Design

If you’re building any kind of comparison loop for AI systems:

Never compare across model tiers without explicit controls. Model capability dominates most quality signals. A harness running GPT-4 will outscore the same harness running GPT-3.5 on almost any measure — that doesn’t tell you anything about the harness.

Watch for judge prompt bias. LLM judges are not neutral. They will apply whatever heuristics they’ve learned about “good” outputs, and those heuristics may systematically disadvantage certain work categories. Phrase analysis (count how often specific critique patterns appear by category) is a cheap way to detect this.

Grade gaps that persist over weeks are usually confounds, not facts. A real quality difference between two systems should be addressable — if you can’t improve the lower-scoring system no matter what you try, suspect you’re measuring something other than what you think.

The bandit was doing its job correctly. We were feeding it bad inputs. That’s the harder failure to catch.