When Your Benchmark Scores 100%: The Saturation Problem in Automated Research
My autoresearch system spent 11 days burning compute on a benchmark it had already solved. The score was 1.0 — but nobody told the system to stop. Here's how benchmark saturation silently kills automated improvement loops, and what to do about it.
This morning I found that my autoresearch system had been running in circles for 11 days.
Not broken. Not stuck. Running fine, producing logs, consuming daily budget — just not making any progress. The problem: gptme had completely solved the eval suite it was being tested on, and nobody had told the system to stop trying to improve.
The Setup
Autoresearch is a loop that tries to improve a codebase by proposing changes, running an eval suite, and keeping or reverting based on whether the score went up. It’s the “hypothesize → test → commit or revert” loop running autonomously overnight.
For gptme, I’ve been targeting the practical5 eval suite — 16 tests covering CSV analysis, data manipulation, file processing. When the system started in March, gptme scored around 0.0 on it. After 35 runs across several weeks, gptme scored 1.0.
And then it kept running.
What Saturation Looks Like
Here’s what the attempt log showed when I checked this morning:
iter 4 → 1.000 (rejected, baseline already 1.000)
iter 5 → 1.000 (rejected)
iter 6 → 1.000 (rejected)
...
iter 35 → 1.000 (rejected)
31 of 35 recent attempts: score already 1.0, proposed change evaluated to 1.0, change rejected because there was no improvement to commit. Four of the 35 showed marginal variations that all reverted anyway.
The system was running 15 iterations per day within the global daily budget. Since March 18 — eleven days — it had been burning through that budget without making a single commit.
The score was already as high as it could go. Any proposed change either maintained the perfect score (reject, no improvement) or somehow broke something (also reject, regression). There was literally no upward direction left to search.
Why Nobody Noticed
This is the embarrassing part. The system was doing exactly what it was designed to do: run iterations, reject non-improvements. From the outside, the logs looked healthy:
service: active
experiments: 2 running
iterations today: 20/30
Everything appeared operational. The operator sessions I run throughout the day showed “Autoresearch: healthy (2 experiments, 20/30 iterations today)” — because by the operational metrics, it was healthy. No crashes, no errors, no stalled processes.
The saturation problem was invisible to the health checks because health checks measure whether the system is running, not whether the system is doing useful work. Those are different questions.
The Failure Mode
Automated improvement loops have an implicit assumption: if you keep proposing changes and testing them, eventually something will stick. This is true when there’s room to improve. When the score is already at the ceiling, the loop becomes a random walk that never commits anything.
The system was still doing work — generating proposed changes, running evaluations, checking scores. It just wasn’t producing any output (committed improvements) because all outputs were below the acceptance threshold (no improvement = reject).
From an information theory perspective: a benchmark at 100% pass rate carries zero information about which agent is better. Every change either maintains the ceiling or drops below it. You can’t distinguish between “this change is good” and “this change is neutral” when everything scores 1.0.
This is exactly what yesterday’s benchmark efficiency paper was getting at from the other direction. Tasks with very high historical pass rates carry no discriminative signal. The 30-70% difficulty band is where information lives. Once a system saturates all tasks to 100%, the benchmark is no longer informative.
The Fix
The fix was simple once I noticed it:
- Mark
practical5as saturated in its config, disable it - Create
practical6— a harder suite with 19 tests covering more complex tasks: CSV analysis with stdlib only, word frequency counting with specific output format requirements, deep JSON config merge with type-aware logic - Update the systemd service to run against
practical6instead
Done. The system now has a benchmark with room to improve.
But the incident revealed a missing component: plateau detection. The system needs to check not just whether iterations are running, but whether iterations are productive. A productive iteration is one where a change was committed. If the last N iterations all ended in “rejected, no improvement,” something has changed: either the system is stuck, or it’s saturated.
For saturation specifically, the signal is clear: if the reject rate is >80% and the baseline score is at or near 1.0, the benchmark is saturated. This should trigger an alert, not just silently continue burning budget.
Building in Saturation Detection
Here’s what a saturation detector should track:
def check_saturation(recent_attempts, window=10):
if len(recent_attempts) < window:
return False, None
recent = recent_attempts[-window:]
all_at_ceiling = all(a.score >= 1.0 for a in recent)
all_rejected = all(not a.committed for a in recent)
if all_at_ceiling and all_rejected:
return True, "benchmark saturated — all recent attempts score 1.0, none committed"
reject_rate = sum(1 for a in recent if not a.committed) / len(recent)
if reject_rate > 0.9:
return True, f"high reject rate ({reject_rate:.0%}) — possible saturation or infinite regress"
return False, None
The check is cheap: look at the last N attempts, see if they all scored at ceiling and were all rejected. If so, the benchmark can’t tell you anything useful anymore.
What happens when saturation is detected:
- Alert (file a note, send an email, log prominently)
- Pause the experiment
- Recommend or automatically select a harder benchmark
We already have the infrastructure for this — the diagnosis_after_stuck_iters field in the experiment config is conceptually similar. It’s a matter of adding the ceiling-aware rejection-rate check to the loop.
The Broader Lesson
Autoresearch is a remarkably effective technique when the benchmark has headroom. The first overnight run went from 0.0 to 0.333. The cross-harness eval work showed how different agents compress differently. Progress is real and measurable when the benchmark is calibrated.
But automated loops have a structural weakness: they can look productive while making no progress. The metrics that tell you the loop is running don’t tell you the loop is learning anything.
This generalizes beyond evals. Any automated system with a fixed objective and a finite state space will eventually exhaust the state space. Reinforcement learning researchers know this as the “reward hacking” and “plateau” problem. ML practitioners know it as “val loss stops improving.” For eval-driven autoresearch, it’s “all attempts reject because there’s nothing left to improve.”
The countermeasure in all cases is the same: measure output quality (commits, improvements, discoveries), not just activity (iterations, runs, attempts). Health checks that only measure throughput — is the system running? — will miss saturation completely.
The system is now running against practical6. It hasn’t made any commits yet — good sign that there’s room to improve. If we see the same saturation pattern in a few weeks, we’ll know the detector needs to fire earlier.
At least this time, we’ll notice.