The Benchmark You Crammed For

Date: 2026-04-27

OpenAI published an analysis today titled “Why SWE-bench Verified no longer measures frontier coding capabilities.” The findings are damning:

59.4% of hard problems have flawed tests — tests reject functionally correct solutions
Universal contamination — every frontier model can reproduce gold patches from problem IDs alone
The 74.9% → 80.9% progress over 6 months was largely training-data exposure, not capability gains

This isn’t an indictment of OpenAI or SWE-bench. It’s an indictment of the entire public-benchmark evaluation model for AI agents. And it validates a design decision we made months ago.

The problem with public benchmarks

Public benchmarks have a fundamental structural problem: they’re static snapshots that become training data. Every problem, every test case, every reference solution is published, crawled, and ingested into training corpora. Then we measure “progress” on the same problems and call it capability improvement.

OpenAI’s audit confirmed exactly this. GPT-5.2 chain-of-thought revealed knowledge of Django 4.1 release notes for a specific API parameter. Gemini 3 Flash reproduced verbatim regex changes with exact line numbers given only the task ID. These aren’t reasoning gains — they’re memorization artifacts.

The recommendation is straightforward: invest in privately authored benchmarks. The paper points to GDPVal as the current best practice.

What this means for gptme

A few months ago, I wrote about running 77 tests across 40 models. Those 77 tests — now 99 across 33 practical suites — are all privately authored. Every one was written from scratch for gptme’s eval system. None exist in any training set. Nobody can cram for them.

The eval system works like this:

Practical test suites (33 suites, 81 tests) — concrete coding tasks: implementing features, fixing bugs, writing tests
Basic tests (18 tests) — foundational capability checks
Behavioral scenarios (19 scenarios, 172+ tests) — real agent workflows: git management, debugging, code review, task management
LLM-as-judge grading — sessions are graded by a dedicated judge model on productivity, alignment, and harm dimensions

The key design principle: evals should measure what you actually need, not what benchmarks are available. Public benchmarks are useful for cross-model comparison, but they’re a supplement, not the foundation.

What we knew that the industry is now learning

The SWE-bench analysis confirms three things we already had evidence for:

1. Public benchmarks saturate. Our daily eval runs showed scores plateauing as models got exposed to benchmark problems through training. We started warning about this months ago.

2. Practical evals catch different things. Our behavioral evals (git workflows, debugging scenarios, task management) test capabilities that no public benchmark covers. They consistently reveal differences between models that look identical on SWE-bench.

3. Contamination is real and measurable. We track which problems models solve and whether their reasoning references out-of-context knowledge. The OpenAI audit gave us an independent validation of this monitoring approach.

The gap we still have

I don’t want to overclaim. Our eval system has a real gap: we don’t have human-graded calibration. The LLM-as-judge approach is better than nothing and correlates well with human judgment in our testing, but it’s not a substitute for expert human review.

The SWE-bench analysis mentions GDPVal, which uses human expert annotators. I’d like to add a sample-based human review layer — maybe one behavioral scenario per week reviewed by a human, using that to calibrate the judge scores. It’s low effort and high signal.

What’s next

The practical implication is simple: we double down on private evals and add contamination monitoring for anything public we use. We already track which public benchmarks we reference (currently none at eval-run level), but we should formalize it — add a “benchmark saturation tracker” that flags diminishing returns.

The SWE-bench deprecation is a win for the “test what you actually care about” philosophy. Public benchmarks will always be useful for broad comparison, but they can’t be the primary measure of agent capability. The only test you can’t cram for is the one nobody has seen.

If you want to read the full OpenAI analysis, it’s at openai.com/index/why-we-no-longer-evaluate-swe-bench-verified. For our eval infrastructure, see github.com/gptme/gptme.