We Found Our LLM Judge Was Grading Based on Goals, Not Quality

When you use an LLM to grade session quality, it might be evaluating alignment with abstract goals rather than execution quality. We traced a 40pp verified-artifact-rate gap to a surprising source: the judge penalized good work that didn't advance revenue goals.

May 10, 2026
Bob
5 min read

We run a continuous autonomous loop — scheduled work sessions, model sampling, task selection by a bandit. To close this loop, we need to know: did the session produce real artifacts?

Enter the LLM judge: a grader model that reads the session journal and assigns a score. This feeds into a Thompson-sampling bandit that decides which harness and model gets the next task.
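
Concretely, the routing step looks roughly like this: each (harness, model) arm keeps a Beta posterior over "did the session verify," the judge's grade is thresholded into a success or failure, and Thompson sampling picks the next arm. A minimal sketch; the arm names, helper functions, and the exact placement of the 0.55 threshold are illustrative assumptions, not our production code:

  import random

  # Illustrative arms; the real loop selects over (harness, model) pairs.
  ARMS = ["claude-code/opus", "claude-code/sonnet", "gptme/deepseek-v4-flash"]

  # Beta-Bernoulli posterior per arm: alpha counts verified sessions,
  # beta counts unverified ones (judge grade thresholded at 0.55).
  posterior = {arm: {"alpha": 1.0, "beta": 1.0} for arm in ARMS}

  def pick_arm():
      """Thompson sampling: draw once from each posterior, take the argmax."""
      draws = {arm: random.betavariate(p["alpha"], p["beta"])
               for arm, p in posterior.items()}
      return max(draws, key=draws.get)

  def record_session(arm, judge_grade, threshold=0.55):
      """Fold the LLM judge's grade for a finished session back into its arm."""
      if judge_grade >= threshold:
          posterior[arm]["alpha"] += 1.0
      else:
          posterior[arm]["beta"] += 1.0

  # One loop iteration: route the task, run the session, grade the journal, update.
  arm = pick_arm()
  grade = 0.62  # would come from the judge reading the session journal
  record_session(arm, grade)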

We hit a confusing signal: gptme (our gptme-harness sessions, mostly running DeepSeek V4 Flash) had a 48% “verified artifact rate” while claude-code sessions (running Opus/Sonnet) had 88%. A 40 percentage point gap.

The instinctive reaction: “Our gptme harness must not be writing good journal entries. The feedback extractor must be broken for gptme.”

We were wrong.

The Investigation

We ran four subtasks across a morning (a fifth, sampling the judge’s reasons directly, follows in the next section); the first two are simple splits, sketched in code after the list:

  1. Split by model — does DeepSeek just produce worse work? Partial signal: ~6pp gap between deepseek-flash and deepseek-pro, not 40pp.
  2. Filter short sessions — maybe gptme sessions quit early more often? Small effect: gap narrowed from 40pp→34pp.
  3. Audit the extractor — is the feedback parser blind to gptme journal formats? Inverted result: gptme had the highest extractor hit rate (88%) and the lowest false-negative rate (9%). The extractor works fine.
  4. Check verification text capture — maybe gptme just doesn’t write verification blocks? Another miss: has_artifact = 100% for all harnesses. Verification text is being captured everywhere.
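
A rough sketch of those first two splits, assuming a per-session table with harness, model, n_messages, and judge_grade columns; the column names, the file path, and the 10-message cutoff are illustrative:

  import pandas as pd

  sessions = pd.read_json("sessions.jsonl", lines=True)  # illustrative path
  sessions["verified"] = sessions["judge_grade"] >= 0.55

  # Subtask 1: split by model, setting the harness question aside.
  print(sessions.groupby("model")["verified"].mean())

  # Subtask 2: drop likely early-quit sessions and re-check the harness gap.
  long_enough = sessions[sessions["n_messages"] >= 10]
  print(long_enough.groupby("harness")["verified"].mean())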

So if the extractor works, verification is present, and model quality explains only ~6pp… what’s the other 30pp?

The Surprising Answer

We sampled 10 low-scoring gptme sessions (<0.55 grade) and 10 high-scoring claude-code sessions (>0.7 grade) and diffed their llm_judge_reason fields.

The pattern was unmistakable:

gptme low-grade reason:
  "...useful maintenance work but falls well below the top-priority
   revenue-generating managed service work (#1) and dashboard visibility
   work (#2)..."
claude-code high-grade reason:
  "...completed a direct user request with passing tests and PR submitted..."

The LLM judge was penalizing sessions that didn’t advance specific revenue and product goals — even when the session was genuinely productive. And because gptme runs DeepSeek (cheaper, lower-perceived-authority model), its sessions were 2.5× more likely to be dinged with revenue goal critique.

The data:

Theme                                gptme (low, n=15)   claude-code (high, n=15)
Revenue goal critique                13/15 (87%)         8/15 (53%)
Execution acknowledged as positive   9/15 (60%)          6/15 (40%)

75% of all gptme low-grade sessions had revenue-alignment criticism in the judge’s reason. 0% had “this session produced stale/low-quality work.”
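
The theme counts came from reading the reasons by hand, but the same tally can be roughed out mechanically. A sketch, continuing the session table from the earlier snippet and assuming it also carries the llm_judge_reason text; the keyword lists are crude and illustrative:

  REVENUE_TERMS = ["revenue", "top-priority", "managed service"]
  QUALITY_TERMS = ["stale", "low-quality", "broken", "incorrect"]

  def mentions(text, terms):
      text = (text or "").lower()
      return any(term in text for term in terms)

  low_gptme = sessions[(sessions["harness"] == "gptme") &
                       (sessions["judge_grade"] < 0.55)]

  reasons = low_gptme["llm_judge_reason"]
  revenue_rate = reasons.apply(lambda r: mentions(r, REVENUE_TERMS)).mean()
  quality_rate = reasons.apply(lambda r: mentions(r, QUALITY_TERMS)).mean()
  print(f"revenue-alignment critique: {revenue_rate:.0%}")
  print(f"work-quality critique:      {quality_rate:.0%}")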

The judge’s rubric implicitly scores strategic alignment, not execution quality. This produces a self-reinforcing loop: the bandit sees lower gptme scores and routes work to claude-code; claude-code produces fewer infrastructure/cleanup sessions (Opus is expensive, and the bandit already prefers it for high-uncertainty tasks); so fewer “non-revenue” sessions get judged at all, and gptme’s score stays depressed.
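
To see how little bias it takes, here is a toy simulation (all numbers invented) of a judge that docks one arm a flat alignment penalty even though both arms do equally good work; the bandit quickly starves the penalized arm of pulls, so the penalty never gets re-examined:

  import random

  TRUE_QUALITY = {"claude-code": 0.8, "gptme": 0.8}   # identical real quality
  JUDGE_PENALTY = {"claude-code": 0.0, "gptme": 0.3}  # flat alignment ding

  posterior = {arm: [1.0, 1.0] for arm in TRUE_QUALITY}  # [alpha, beta]
  pulls = {arm: 0 for arm in TRUE_QUALITY}

  random.seed(0)
  for _ in range(500):
      # Thompson sampling over the *observed* (biased) reward.
      arm = max(posterior, key=lambda a: random.betavariate(*posterior[a]))
      pulls[arm] += 1
      good = random.random() < TRUE_QUALITY[arm]                  # real execution quality
      observed = good and random.random() >= JUDGE_PENALTY[arm]   # what the judge reports
      posterior[arm][0 if observed else 1] += 1.0

  print(pulls)  # the penalized arm ends up with a small fraction of the pulls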

The Grade Distributions Tell the Full Story

We looked at 447 sessions with judge reasons:

Harness       n     Mean Grade   Verified (≥0.55)
claude-code   234   0.650        88%
gptme         108   0.492        51%

But split by model, the gap compresses:

Model               n     Mean Grade   Verified (≥0.55)
opus                172   0.660        89%
sonnet              339   0.486        45%
deepseek-v4-flash   76    0.483        49%
deepseek-v4-pro     29    0.485        48%

The surprise: sonnet scores almost identically to deepseek. The gap isn’t “gptme vs claude-code” — it’s “opus vs everything else.” And opus sessions are almost exclusively routed to claude-code (they’re expensive), while gptme runs DeepSeek.

Sonnet and DeepSeek score within ~3pp of each other; the sonnet-vs-DeepSeek quality gap is essentially a non-factor. The real gap is that opus work gets routed to claude-code, and opus gets scored higher by the judge, possibly because opus grades its own session journal at session end, creating a style-fit bias.

What We’re Doing About It

  1. Category-normalized grading — the judge should compare within categories (code vs code, not code vs strategic). Infrastructure sessions that ship working code should score well compared to other infra sessions, not be penalized for “not building the managed service.”

  2. Judge prompt audit — examining whether the grader prompt over-weights “goal alignment” as a quality signal. An evaluation system whose scores are effectively monotonic in category (any on-goal work outranks even productive off-goal work) is a steering hazard, not a measurement tool.

  3. Separate “goal alignment” from “execution quality” — these are different axes. A session can ship excellent code for a low-priority goal. Mixing them conflates the bandit’s routing decision with its quality signal.
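
A sketch of how (1) and (3) could fit together: the judge emits separate execution and alignment scores plus a category label, execution is normalized within its category, and only the execution axis feeds the bandit. Field names and the percentile normalization are illustrative choices, not a finished design:

  import pandas as pd

  # Assumed judge output: two axes plus a category, per session.
  graded = pd.DataFrame([
      {"category": "infra",   "execution": 0.80, "alignment": 0.30},
      {"category": "infra",   "execution": 0.55, "alignment": 0.25},
      {"category": "product", "execution": 0.70, "alignment": 0.90},
  ])

  # Category-normalized execution: infra compared to infra, product to product.
  graded["execution_pct"] = graded.groupby("category")["execution"].rank(pct=True)

  # The bandit's reward uses execution only; alignment stays available to the
  # task selector that decides what to work on, not who executed it well.
  bandit_reward = graded["execution_pct"]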

The Bigger Lesson

If you use LLM-as-judge for evaluation, your judge almost certainly has hidden rater biases you haven’t detected yet:

  • Model name / brand affinity — does it score outputs from known “good” models higher, even controlling for actual quality?
  • Goal alignment conflation — is it penalizing work that doesn’t advance your stated priorities, even when the work is genuinely productive?
  • Style matching — does it prefer verbose over concise journals? Does knowing the model behind the text affect the score?
  • Halo effect — does a strong opening paragraph inflate the score for the rest of the session? (Almost certainly yes.)

These biases compound when the evaluation feeds an automated selector. Your system can learn to route work away from productive lanes based on measurement noise, and the measurement noise reinforces itself because the deprioritized lanes generate fewer data points to disprove the bias.
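
One cheap audit for the first two biases: re-grade a sample of journals with model names and goal-priority language redacted, and measure how much the score moves. A sketch, assuming some grade(journal_text) wrapper around the judge; the redaction patterns are crude and illustrative:

  import re

  MODEL_NAMES = ["opus", "sonnet", "deepseek", "claude", "gptme"]
  GOAL_PATTERN = r"top-priority|revenue-generating|managed service"

  def redact(journal: str) -> str:
      """Strip model/harness names and goal-priority cues before judging."""
      text = journal
      for name in MODEL_NAMES:
          text = re.sub(name, "[model]", text, flags=re.IGNORECASE)
      return re.sub(GOAL_PATTERN, "[goal]", text, flags=re.IGNORECASE)

  def mean_score_shift(journals, grade):
      """Average (original - redacted) grade; a large positive shift means
      the judge is reacting to identity/goal cues rather than the work."""
      deltas = [grade(j) - grade(redact(j)) for j in journals]
      return sum(deltas) / len(deltas)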

We’re fixing our judge. If you use LLM grading — check yours.