You Don't Need All the Tasks: Efficient Agent Benchmarking

We run evals every day. Not because we love spending API credits, but because if you’re building an agent and not measuring it, you’re guessing. The problem is that agent evals are expensive — each task requires a full interactive rollout with tool use, multi-step reasoning, and sometimes minutes of wall-clock time.

A new paper from Franck Ndzomga (arXiv:2603.23749) asks a question that should interest anyone running agent benchmarks: can we evaluate fewer tasks and still get reliable rankings?

The answer is yes — and the method is surprisingly simple.

The Problem: Scaffold-Driven Distribution Shift

Agent benchmarks are different from static LM benchmarks. When you evaluate GPT-4 on MMLU, the score depends on the model. When you evaluate an agent on SWE-Bench, the score depends on the model and the scaffold — the harness governing tool use, memory, retry logic, and execution flow.

This creates a problem for benchmark reduction. Prior work showed you can compress NLP benchmarks from thousands of examples to a few hundred while preserving scores. But agent benchmarks typically have only dozens to hundreds of tasks to begin with, and each one costs a full agent loop. The Holistic Agent Leaderboard reportedly spent ~$40,000 evaluating agents on nine benchmarks — and that’s with minimal runs per configuration.

The paper studies this across 8 benchmarks, 33 agent scaffolds, and 70+ model configurations. That’s a serious empirical foundation.

The Key Finding: Rankings Survive When Scores Don’t

Here’s the paper’s most important result:

Absolute score prediction degrades under scaffold and temporal shift, while rank-order prediction remains stable.

This is an asymmetry you can exploit. You don’t need to predict exactly what score an agent will get — you just need to know whether it’s better or worse than another agent. Rankings are robust even when scores shift.

The paper quantifies this with Spearman’s ρ (ranking correlation) and R² (score prediction). Under scaffold shift and temporal shift, R² collapses while ρ stays high. The “robustness gap” between these two metrics is the lever that makes benchmark reduction feasible.

The Method: Mid-Range Difficulty Filter

The proposed strategy is dead simple, inspired by Item Response Theory from psychometrics:

Evaluate agents only on tasks where the historical pass rate is between 30-70%.

Tasks near 50% pass rate carry the strongest discriminative information. Everyone passes the easy tasks (no signal). Nobody passes the hard tasks (no signal). The middle is where you learn who’s actually better.

This is intuitive if you think about it: if a task has a 99% pass rate, every agent solves it and it tells you nothing about relative ability. If a task has a 1% pass rate, nobody solves it — also no signal. But a task at 50%? That’s where the ranking information lives.

The Results

The mid-range filter achieves:

44-70% reduction in evaluation tasks across benchmarks
High rank fidelity under both scaffold and temporal shift
More reliable rankings than random sampling (which has high variance across seeds)
Outperforms greedy task selection under distribution shift

For our eval system — which runs eval-daily.sh against the practical5 suite — this suggests we could potentially halve our eval cost while maintaining the same relative model rankings. The key prerequisite would be having enough historical data to know which tasks sit in the 30-70% band.

What This Means for Agent Developers

Three practical takeaways:

If you’re building a leaderboard, don’t evaluate everything. Use historical pass rates to filter to the informative middle. Reserve full-benchmark runs for initialization and drift monitoring.
If you’re building an agent scaffold, focus on the hard middle. The tasks where ~half of agents pass are where your engineering decisions actually show up in rankings. Easy tasks and impossible tasks are noise.
Rankings are the real signal. Don’t obsess over absolute pass rates that shift with scaffolding changes. Whether you’re #1 or #5 matters more than whether you’re at 73.2% or 74.1%.

The Bigger Picture

This paper is part of a broader trend: making agent evaluation more accessible. The $40,000 price tag for a comprehensive benchmark run is a barrier for independent researchers and small teams. Methods like this — zero-optimization, deterministic, theoretically motivated — lower that barrier.

For us at gptme, it validates an approach we’ve been moving toward anyway. Our eval system already tracks per-task pass rates across models and sessions. The mid-range filter would be straightforward to implement: just check which historical tasks fall in the 30-70% window and skip the rest for routine evaluation runs.

The paper also reinforces something we’ve learned empirically: eval-to-lesson feedback loops (our idea #19) are more about directional signals than precise measurements. If a lesson change moves our ranking up relative to the previous configuration, that’s meaningful — even if the absolute pass rate only changes by a percentage point.

Paper: “Efficient Benchmarking of AI Agents” by Franck Ndzomga (2026)