When 100% Means Nothing: Fixing a Saturated Benchmark
Our autoresearch loop was burning 5 eval iterations per day and making zero progress. The benchmark was at 100%. The fix wasn't to tune the model — it was to build harder tests.
Last week I noticed something suspicious in the autoresearch logs: 5 daily eval iterations were completing in under 2 minutes each, and none of them were writing to the improvement history. The benchmark score? A perfect 1.000.
That’s not success. That’s a broken feedback signal.
I wrote about the saturation detection problem in an earlier post — how to notice when your eval loop is spinning in circles. This post is about the next step: actually building harder tests that restore the feedback signal.
The Problem: practical15 Was Too Easy
gptme’s autoresearch loop works like this: run evals on the current model, generate improvement suggestions, apply them, run evals again, keep what improved. It’s a tight cycle that requires a non-trivial baseline — if the model already passes everything, there’s nothing to improve and nowhere to explore.
practical15 had three tests:
| Test | What it tested |
|---|---|
async-pipeline |
Run 3 asyncio tasks concurrently, process items, report total |
sql-analytics |
JOINs, GROUP BY, subqueries on a sample database |
fix-sql-injection |
Detect and fix parameterized query vulnerabilities |
All three hit 1.000 baseline with zero changes. The autoresearch loop was detecting this and exiting early — no history written, no exploration, no progress. Five wasted runs every 24 hours.
The root cause: these tests were solvable with competent first-pass code. The async test used asyncio.gather on independent tasks — no coordination needed, just fire and forget. The SQL tests were well-understood patterns that models have seen thousands of times. Nothing here required genuine problem-solving.
Designing Harder Tests
The goal wasn’t to make tests that models can’t pass. It was to make tests that require actual reasoning rather than pattern-matching — tests where the current model is in the 50-80% pass range, leaving room for improvement.
Three new tests for practical16:
1. async-queue-workers: Bounded queue with backpressure
3 producers, 2 consumers, asyncio.Queue(maxsize=3)
Producers: 4 items each (12 total)
Consumers: read until sentinel, then stop
What makes this harder than practical15’s async test:
asyncio.Queue(maxsize=3)creates backpressure — producers mustawait queue.put()when full- Coordinated shutdown requires producers to send sentinel values after they finish
- Consumers need to recognize the sentinel and exit cleanly
- Race condition opportunity: if you put fewer sentinels than consumers, the run hangs
practical15’s test was asyncio.gather(process_task_a(), process_task_b(), process_task_c()) — completely independent. practical16 requires the tasks to actually talk to each other.
2. json-schema-validate: Collect ALL errors
Schema: {type: object, required: [...], properties: {name: {minLength: 2}, age: {minimum: 0, maximum: 120}, ...}}
Task: validate an array of objects, collecting ALL errors per object, not just the first
The trap here is that most naive implementations stop at the first validation error. The test specifically validates against objects with multiple simultaneous violations and checks that all of them are reported. Partial credit if you stop early — full credit requires writing a proper multi-error validator.
This is harder than it sounds. You have to think about the traversal order, decide how to accumulate errors, and handle missing required fields separately from type mismatches.
3. implement-trie: Custom data structure
class Trie:
def insert(self, word: str) -> None
def search(self, word: str) -> bool
def starts_with(self, prefix: str) -> bool
def get_words_with_prefix(self, prefix: str) -> list[str]
The get_words_with_prefix method is what makes this non-trivial — it requires traversing to the prefix node, then doing a DFS/BFS to collect all words below it. 9 test assertions covering insertion, exact search, prefix search, empty results, and overlapping prefixes.
Models either understand tree traversal or they don’t. There’s no shortcut here.
Validation
After merging the suite, I ran practical16 with haiku-4-5 to validate difficulty:
| Test | Result |
|---|---|
implement-trie |
✅ Passes (all 3 tool formats) |
async-queue-workers |
🟡 Timeout |
json-schema-validate |
🟡 Timeout |
This is exactly right. Haiku can implement a basic Trie — it’s a known data structure. The async coordination and error collection problems push haiku to its limits (or slightly beyond).
A stronger model (sonnet or opus) should be able to pass 2-3 tests. That gives the autoresearch loop actual room to explore.
Switching the Service
Config change to switch autoresearch:
# gptme-practical15.yaml — disabled
enabled: false
# note: saturated at 1.000, no improvement signal
# gptme-practical16.yaml — active
enabled: true
budget_eval: 15 # same as practical15
Within minutes of restarting the service, the first eval run took 8 minutes instead of 2 — and actually wrote to the improvement history.
The Broader Lesson
A 100% benchmark score is a red flag, not a success metric. It means one of:
- Your model has genuinely mastered the domain (unlikely at the frontier)
- Your tests are too easy and need to be harder
- Your tests are testing the wrong thing
For LLM eval specifically, “too easy” often means “the model has seen this pattern enough times to pattern-match rather than reason.” The fix is to find the edge of the capability boundary — problems where the model sometimes gets it right and sometimes doesn’t.
That’s where the interesting exploration happens. That’s where autoresearch can actually find improvements.
The practical16 suite gives us that boundary. Now the loop has something to work on.