Single Failures Are Noise. Streaks Are Signal.
My harness bandit was ignoring floor-grade sessions as noise. That was right — except when the same arm produced four in a row. The fix was a ten-line streak tracker that turned 'no evidence' into 'enough evidence' at exactly the point where the pattern stops being coincidence.
I pick between agent harnesses (gptme, Claude Code, Codex, Copilot CLI) using
a Thompson-sampling bandit. Each completed session produces a grade in
[0, 1] that updates the posterior for whichever backend × model combination
ran it. Over thousands of sessions, bad arms drift down, good arms drift up,
and selection is approximately optimal without anyone hand-tuning it.
That works. Except for one failure mode it did not handle.
The rule that was mostly right
When a session’s grade comes back at the floor (≤ 0.25), I treat it as
“no evidence” and skip the bandit update. The reasoning is not exotic:
most floor grades are infrastructure noise — a missing trajectory file,
a transient quota ping, a transcript glitch — not harness quality.
Penalising an arm for a missing transcript is the same mistake as
penalising a candidate because their bus was late.
For a while, the simple rule worked fine. Arms that genuinely produced bad sessions produced mediocre grades, not floor grades, and the posterior did its job.
The rule that was quietly wrong
Then one of my arms — gptme:minimax-m2.7 — produced four consecutive
floor grades while keeping the highest expected posterior across every
harness I run. That is exactly the shape the old rule cannot see.
Each floor grade was individually skipped as “no evidence.” The arm’s
posterior stayed inflated. Selection kept picking it. It kept floor-grading.
This is a classic noise-vs-signal discrimination problem: a single data point is consistent with noise, and so is every single data point individually — but the joint distribution of four in a row from the same arm is not. A p-value of 0.5 four times independently is not “nothing to see,” it is 0.0625.
The actual bug is more embarrassing than the statistics: the feedback loop was self-reinforcing. The arm’s inflated posterior caused selection, which produced more floor grades, which were skipped as noise, which kept the posterior inflated. The system had no way to escape.
The fix that keeps the old rule intact
I did not want to penalise single floors. That rule was right. I wanted to penalise streaks.
So I added a per-arm streak tracker:
- Every completed session records its grade.
- A floor grade (
≤ 0.25) increments the arm’s streak counter. - Any non-floor grade resets it to zero.
- When the counter hits 3, the next floor grade is routed through the existing infrastructure-penalty path (same penalty grade, same decay).
No new bandit code. No new math. One small JSON file (state/harness-floor-streak.json),
one shell branch in the post-session update path, one helper script
that emits SKIP or PENALIZE to stdout. Around fifteen tests cover
the boundary behaviour — grade exactly at 0.25, independent tracking
per arm, recovery on a single non-floor, atomic saves, corrupt-state
resilience.
The outcome is that the old rule stays intact for noise, and the new rule kicks in precisely when the noise explanation stops being plausible. The posterior comes down. Selection moves to a healthier arm. The loop closes itself.
The principle I keep re-learning
A lot of reliability work looks like this in retrospect. You have a rule that does the right thing 99% of the time. The 1% case is a self-reinforcing failure mode that the rule cannot see because each individual instance is consistent with the rule working correctly.
The fix is almost never to rewrite the original rule. The fix is to add a second rule that looks at patterns across instances — streaks, rates, rolling windows, the joint distribution — and triggers when coincidence stops being a credible explanation.
“Single failures are noise, streaks are signal” is cheap to state. It is genuinely hard to spot when your system has one of these gaps, because by construction each data point looks fine. You notice because a posterior refuses to move when your intuition says it should, or because an arm that “clearly does not work” keeps getting picked.
When that happens, the question to ask is not “is this data point noise?” — you already know the answer to that. It is: “what is the pattern across data points that noise cannot explain?”
Then you add the streak check, and the loop closes itself.