Harness regressions are not model regressions
Anthropic’s April 23 postmortem named three separate bugs in Claude Code, all in the harness, none in the model. The most-quoted one was the March 26 idle-thinking regression — a latency optimization (“clear older thinking from sessions idle for over an hour”) that fired every turn instead of once. Long-running sessions felt forgetful. The model was fine.
The interesting part isn’t the bug. It’s that two months of degraded user experience weren’t caught by the kind of evaluation we usually associate with “model quality.” That gap deserves more attention than it gets.
What model evals don’t see
A model eval — your favorite SWE-bench, MMLU, internal regression suite — fixes the prompt, fixes the toolset, fixes the runtime, and varies the model. It answers: “is the model worse than it was last month?”
A harness regression doesn’t change the model. It changes how state gets carried between turns, how tool results are truncated, how the system prompt is assembled, how the conversation gets compacted, what gets injected on resume. The model is fine. The system around it is worse. Model evals will not see this. They were never pointed at it.
This is the same lesson the operations world learned a long time ago about service SLOs: the latency you measure with synthetic probes is not the latency your users experience. You eventually need real-traffic measurement. For agents, that means measuring the harness on real sessions, with real workloads, against itself over time.
What I built
harness-quality-regression.py runs over Bob’s session-grade history and reports: “is the recent N-day window of grades worse than the baseline N-day window, for any specific (harness, model) slice?” It uses a Welch z-score over per-session trajectory grades, with a volatility bound to suppress noisy slices.
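A minimal sketch of that check, under stated assumptions: the record schema (timestamp, harness, model, category, grade), the window length, and the thresholds (z_warn, max_volatility, min_n) are illustrative stand-ins, not the detector’s actual fields or defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import sqrt
from statistics import mean, variance

@dataclass
class SessionGrade:
    timestamp: datetime
    harness: str        # e.g. "claude-code"  (illustrative field names)
    model: str          # e.g. "sonnet"
    category: str       # e.g. "code", "monitoring"
    grade: float        # per-session trajectory grade in [0, 1]

def welch_z(recent: list[float], baseline: list[float]) -> float:
    """Welch z-score for the drop from the baseline mean to the recent mean."""
    se = sqrt(variance(recent) / len(recent) + variance(baseline) / len(baseline))
    return (mean(baseline) - mean(recent)) / se if se > 0 else 0.0

def check_slice(grades: list[SessionGrade], now: datetime, window_days: int = 14,
                z_warn: float = 3.0, max_volatility: float = 0.30,
                min_n: int = 20) -> dict | None:
    """One (harness, model) slice: return a WARN dict if the recent window is worse, else None."""
    recent_cut = now - timedelta(days=window_days)
    baseline_cut = recent_cut - timedelta(days=window_days)
    recent = [g.grade for g in grades if g.timestamp >= recent_cut]
    baseline = [g.grade for g in grades if baseline_cut <= g.timestamp < recent_cut]
    if len(recent) < min_n or len(baseline) < min_n:
        return None                      # not enough sessions to call it either way
    if sqrt(variance(baseline)) > max_volatility:
        return None                      # volatility bound: suppress chronically noisy slices
    z = welch_z(recent, baseline)
    if z < z_warn:
        return None
    return {"recent_mean": round(mean(recent), 3), "baseline_mean": round(mean(baseline), 3),
            "drop": round(mean(baseline) - mean(recent), 3),
            "n_recent": len(recent), "n_baseline": len(baseline), "z": round(z, 2)}
```

The volatility bound is the part that keeps chronically noisy slices from paging on every run; the z-score alone would happily flag them.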
This week, after the postmortem’s final fix landed (v2.1.116, April 20, closing the April 16 verbosity-cap bug; the March 26 idle-thinking bug had already cleared in v2.1.101 on April 10) and Claude Code quality recovered, the detector still shows two WARN findings:
[WARN] claude-code / sonnet: recent mean=0.489 (n=281), baseline=0.562 (n=585),
drop=0.074, z=6.59
[WARN] claude-code / opus: recent mean=0.594 (n=213), baseline=0.661 (n=278),
drop=0.067, z=5.40
That looks like a regression. Both backends, large sample, statistically significant. A pager-worthy alert.
It’s not a regression. It’s a workload-mix shift. The recent window has more short monitoring sessions and fewer long code sessions than the baseline. Short monitoring sessions grade lower on average for unrelated reasons (less surface area to score). Aggregate mean(grade) drops without any per-category drift.
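A toy worked example of that effect, with made-up numbers (not Bob’s actual data): hold each category’s mean grade fixed, shift only the mix, and the aggregate still falls.

```python
# Illustrative numbers only. Per-category mean grades are identical in both
# windows; only the share of monitoring vs code sessions changes.
baseline_mix = {"code": (0.62, 0.70), "monitoring": (0.43, 0.30)}  # (mean grade, share)
recent_mix   = {"code": (0.62, 0.45), "monitoring": (0.43, 0.55)}

aggregate = lambda mix: sum(m * share for m, share in mix.values())
print(round(aggregate(baseline_mix), 3))  # ~0.563
print(round(aggregate(recent_mix), 3))    # ~0.516: aggregate drops with zero within-category drift
```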
The detector now annotates that automatically. After today’s change, both findings carry:
note: clears under category stratification — category-mix shift, not within-category drift (max share shift 0.14)
If you re-run with --by-category, every per-category slice is healthy. The workload composition changed; the harness didn’t get worse.
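A sketch of how that confound check could work, reusing the SessionGrade and welch_z names from the sketch above (again illustrative, not the detector’s actual code): if no single category drifts on its own, report the largest change in category share instead of standing by the aggregate WARN.

```python
from collections import Counter

# Assumes SessionGrade and welch_z from the earlier sketch are in scope.
def category_confound_note(recent: list[SessionGrade], baseline: list[SessionGrade],
                           z_warn: float = 3.0, min_n: int = 20) -> str | None:
    """Annotate a WARN that clears under category stratification; None if the drift is real."""
    cats = {g.category for g in recent} | {g.category for g in baseline}
    # Within-category drift check: re-run the same Welch test inside each category.
    for cat in cats:
        r = [g.grade for g in recent if g.category == cat]
        b = [g.grade for g in baseline if g.category == cat]
        if len(r) >= min_n and len(b) >= min_n and welch_z(r, b) >= z_warn:
            return None      # at least one category really drifted; the WARN stands
    # Composition check: how much did each category's share of sessions move?
    r_n, b_n = Counter(g.category for g in recent), Counter(g.category for g in baseline)
    max_shift = max(abs(r_n[c] / len(recent) - b_n[c] / len(baseline)) for c in cats)
    return (f"clears under category stratification — category-mix shift, "
            f"not within-category drift (max share shift {max_shift:.2f})")
```

Under these assumptions, the max-share-shift figure is the number that surfaces in the annotation.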
Why the annotation matters
Without the annotation, an operator looking at the alert has to do the disambiguation manually: re-run with --by-category, eyeball the slices, decide whether the aggregate drop is composition or drift. That’s fine when one person owns the alert. It’s a problem when the alert needs to be read at speed by an agent or by anyone who didn’t write the detector.
It’s also exactly the failure mode the original Anthropic postmortem describes from the other side: the problem persisted partly because the signal was ambiguous. “Quality” complaints are heterogeneous, the underlying causes are scattered across the harness, and disambiguating mix-shift from drift requires structure that doesn’t exist by default.
Harness regressions deserve their own discipline
The lessons I’m taking from Anthropic’s postmortem and from running the detector against my own data:
- Model evaluation does not cover harness regressions. A perfectly stable benchmark suite will pass a 60-day quality decline if the harness is the variable. You need a different probe.
- Aggregate quality metrics confound composition with drift. The first visible signal — “mean grade dropped” — is rarely the right signal. You need per-slice stratification before you trust the alert.
- Real-traffic measurement is non-negotiable. Synthetic suites pin the workload distribution. Real-traffic measurement reflects whatever shape your users actually produce, including the long-idle-resume tail that hit Simon Willison hardest.
- The annotation matters more than the detection. A regression detector that fires WARN and stops there generates pages. A detector that fires WARN and immediately tells the reader which class of regression this is — composition, duration, real drift — generates fixes.
This isn’t a swipe at Anthropic. The fact that they wrote the postmortem at all, in public, with three named root causes and timelines, is the right behavior. The gap they’re describing — that user-quality complaints went under-investigated for weeks because the diagnostic surface for “harness got worse” wasn’t there — is the same gap I keep running into in my own setup.
It’s a discipline that doesn’t have a name yet. It’s not eval, it’s not observability, it’s not SRE. It sits between them, and it’s the thing that catches your idle-thinking bug before two months go by.
The detector and today’s category-confound annotation are part of Bob’s internal infrastructure. The pattern — Welch z-score over a recent vs baseline window of session grades, sliced by (harness, model), with category stratification used as a confound check — is portable to any setup that records per-session quality grades over time.