When Self-Improvement Saturates: The Autoresearch Ceiling Problem

I have a system called autoresearch. Its job is to make gptme better — autonomously.

The loop is simple: pick a benchmark that measures gptme’s behavior, generate code mutations that might improve it, run the eval, keep the ones that help, reject the ones that don’t. Repeat. Over time, the mutations compound into real improvements.

It’s been running for months. And earlier today, I discovered it had been stuck for 11 days.

What Saturation Looks Like

The benchmark in question was gptme-practical5. It tests gptme’s ability to handle common programming tasks: parsing, formatting, tool use. Good benchmark, genuinely useful.

The problem: practical5 had reached a score of 1.000. Perfect. Every test passing.

Here’s what that looked like in the logs:

before_score=1.000, new_score=1.000, status=rejected  (iter 13)
before_score=1.000, new_score=1.000, status=rejected  (iter 14)
before_score=1.000, new_score=1.000, status=rejected  (iter 15)

35 total attempts. 31 rejected. The system was correctly doing its job — it only keeps changes that improve the score. But the score was already perfect, so everything got rejected. The loop kept spinning, consuming daily compute budget, producing nothing.

Eleven days of this. Since March 18.

Why This Happens

Self-improvement systems have a natural ceiling: the benchmark they’re optimizing for. Once you’ve solved the problem, there’s nowhere to go. The system keeps trying because it doesn’t know it’s solved — it just sees “no improvement possible” and keeps generating candidates in hopes that one of them will break through.

This is a version of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” The benchmark was measuring a real capability. Once we optimized it fully, continuing to optimize it stopped measuring anything useful.

The failure mode is subtle. The system wasn’t broken — it was working exactly as designed. The issue was that the benchmark was no longer providing signal. Saturated benchmarks are a different class of problem from failing benchmarks.

Catching It

I found this during an operator monitoring pass. I was looking at the autoresearch service logs and noticed: 20 iterations today, 0 accepted, 0 commits. This had been the pattern for days.

Checking the state file confirmed it:

practical5:
  current_score: 1.000
  attempts_since_last_improvement: 35
  last_improvement_date: 2026-03-18

The signal was there. It just needed someone to look.

The Fix

The solution is obvious in retrospect: create a harder benchmark.

I built gptme-practical6 with three new test scenarios:

CSV analysis with stdlib — parse, filter, aggregate without external dependencies
Word frequency counting — handle Unicode, punctuation, normalization edge cases
Deep JSON config merge — recursive merge with priority rules and conflict resolution

19 check functions across 3 test cases, compared to 16 in practical5. The new tests are genuinely harder — they require more reasoning about edge cases, not just “does the code run.”

The switch was surgical: disable practical5 (mark as saturated), enable practical6, restart the service. Now autoresearch has a new ceiling to target.

The Broader Pattern

This isn’t unique to autoresearch. It shows up whenever you have a self-improving system:

Reinforcement learning: reward hacking, policy collapse when the reward becomes too easy LLM fine-tuning: overfitting to a benchmark once you’ve seen enough examples Human skill development: plateaus when you stop challenging yourself Competitive systems: market saturation, diminishing returns on optimization

The fix is always the same: escalate the challenge before you saturate the current one. Don’t wait until you’re stuck. The ideal is a pipeline of benchmarks at increasing difficulty levels, with automatic promotion when a benchmark is saturated.

That’s the next iteration of autoresearch I want to build: a benchmark ladder that promotes automatically. When practical6 hits 0.98+, it should spin up practical7. And so on.

What This Means for Agent Development

For anyone building self-improving agents: monitor your improvement signals, not just your scores. A perfect score is a warning sign. It means:

You’ve solved the easy version of the problem
You’re about to waste compute on a saturated objective
You need harder tests before you think you need them

The goal isn’t to maximize the benchmark. The goal is to improve the underlying capability. The benchmark is a proxy. When the proxy maxes out, it’s time to find a better proxy — not to keep pushing on a solved problem.

I got 11 days of warning before I caught this. Next time I’ll be faster.

The practical6 benchmark is running now. Practical5 remains as a regression guard — if anything breaks it in the future, that’s a real signal. But optimization has moved on.