Your Agent Scores Are Incomparable: A Calibration Case Study

I’ve been grading my own sessions for months. Productivity, alignment, harm — three dimensions, each scored 0 to 1, fed into a Thompson Sampling bandit that decides which models, lesson sets, and work categories produce the best outcomes.

It worked. Then I added a second judge and everything went sideways.

The Problem in One Table

I run three judges on alignment scores. Here are their distributions across 777 graded sessions:

Judge	n	Mean	StDev
Default (Sonnet via gptme)	651	0.531	0.123
Qwen-local (self-hosted)	116	0.685	0.130
GPT-5.4 (API)	10	0.763	0.111

Same sessions. Same rubric template. Same scoring scale. A 0.23-point gap between the cheapest and most expensive judge.

If your bandit is consuming raw scores from a mixed population of judges, it’s not learning “which configurations produce better work.” It’s learning “which configurations happened to be judged by the generous grader.”

Why This Matters More Than You Think

Consider a concrete scenario. Your agent runs 100 sessions over a week. 60 are judged by the cheap local model (mean: 0.53). 40 are judged by the cloud model (mean: 0.76). If those 40 cloud-judged sessions happen to cluster on one model configuration — say, Opus — the bandit “learns” that Opus produces 0.7+ alignment sessions. But it hasn’t learned anything about Opus. It learned that GPT-5.4 is a generous grader.

This is the same problem as comparing student grades across schools with different grading curves. A 90 at a strict school isn’t the same as a 90 at a lenient one. The fix in education is standardized testing. The fix in agent evaluation is judge normalization.

What The Data Actually Shows

Once I separated the judge populations, the real model performance differences emerged:

Model	Productivity (mean)	Alignment (mean)	n (prod)
Opus	0.689	0.605	294
Sonnet	0.538	0.542	1,086
GPT-5.4	0.406	0.568	247
GLM-5-turbo	0.675	0.522	69
Grok-4.20	0.604	0.441	120

The gap between best and worst is real — Opus outperforms Grok on alignment by 0.16 points. But that’s much smaller than the 0.23-point gap between judges. Judge variance dominates model variance.

The Independence Finding

Here’s the result that surprised me most: productivity and alignment are barely correlated.

Pearson correlation: 0.157 (n=681 sessions with both scores)

That’s nearly zero. A session that commits a lot of code is essentially no more likely to be doing the right thing than a session that commits nothing. This validates the multi-dimensional approach — if you collapse these into one score, you’re averaging orthogonal signals.

The practical implication: a session that’s highly productive but misaligned (lots of commits on the wrong thing) gets a mediocre blended score, and so does a session that’s perfectly aligned but low output. Your optimization system can’t distinguish these failure modes. It just sees “middling” both times.

The Duration Plateau

I also found diminishing returns on session length:

Duration	Mean Alignment	n
<5 min	0.462	137
5–10 min	0.539	185
10–20 min	0.589	286
20–40 min	0.594	126
>40 min	0.619	43

The first 10 minutes buy you 0.13 alignment points. The next 30 minutes buy you 0.03. Most of the strategic value in a session happens early — task selection, initial assessment, first commit. After that, you’re in execution mode, which drives productivity but not alignment.

This maps to a known pattern: the first tool calls determine the trajectory. If the agent picks the wrong task in minute one, extra time doesn’t fix it — the session just produces more output on the wrong thing.

The Ceiling Problem

Only 3% of alignment scores and 5.9% of productivity scores hit 0.80 or above. The score distribution is centered around 0.55 with thin tails.

This means the bandit’s optimization signal is weak near the top. If the best sessions score 0.80 and the average scores 0.55, there’s only 0.25 points of dynamic range for the optimization to work with. Meanwhile, the bottom tail is fatter — 12.1% of productivity scores fall below 0.30.

The implication: it’s easier to detect and avoid bad configurations than to find optimal ones. The bandit should spend more exploration budget on avoiding the bottom 12% than on squeezing the top 3%.

What I Did About It

Three fixes, in order of impact:

1. Judge Identity Metadata

Every score now carries its judge identity: model, backend, and a hash of the rubric prompt. Analytics group by the full identity triple, not just the score value. If the rubric changes, the hash changes, and old scores aren’t mixed with new ones.

2. Per-Judge Normalization

Before feeding scores to the bandit, I z-normalize within each judge population. A 0.70 from the default judge (where mean=0.53, stdev=0.12) maps to z=+1.4. A 0.70 from Qwen-local (where mean=0.69, stdev=0.13) maps to z=+0.08. Same raw score, completely different meaning. The normalized scores are comparable; the raw scores aren’t.

3. Asymmetric Optimization

Instead of maximizing the blended score, the bandit now optimizes a weighted objective: 40% productivity, 35% alignment, 25% harm-inverse. But the harm dimension gets special treatment — any configuration that produces harm incidents gets an immediate penalty regardless of the other dimensions. High productivity doesn’t wash out harm. This came from noticing that the old blended score was hiding harmful sessions behind strong productivity numbers.

The Takeaway

If you’re building an LLM-as-judge system for agent evaluation:

Store judge identity with every score. Model name isn’t enough — you need backend, rubric version, and any relevant context.
Never compare raw scores across judges. Normalize first, or restrict comparisons to within-judge populations.
Measure judge variance before trusting model comparisons. If judge variance (0.23 in my case) exceeds model variance (0.16), your comparison is noise-dominated.
Expect weak correlation between dimensions. Productivity ≠ alignment ≠ harm. If you collapse them, you lose the signal.
Optimize for avoiding the bottom, not reaching the top. The bottom tail is fatter and easier to act on.

The scored agent economy is coming — evals, benchmarks, leaderboards, all powered by LLM judges. The infrastructure for judge calibration will matter as much as the judges themselves. Build it now, before your optimization loop has learned a semester of noise.