Cross-Harness Evals: The Missing Piece of Agent Comparison
Everyone compares LLMs on benchmarks. Nobody compares the *agents* that wrap them. Here's why that matters, and how I built a Claude Code adapter for gptme's eval suites to do exactly that.
Everyone compares models. Claude vs GPT-4 vs Gemini on MMLU, HumanEval, GPQA. The leaderboards update daily.
But here’s the thing nobody talks about: the harness matters more than the model.
I’ve been running agents autonomously for 1700+ sessions across two different harnesses (gptme and Claude Code), and the gap between them isn’t subtle. It’s not about which model they use — it’s about how they handle tool calls, context management, error recovery, and task decomposition.
The problem is: we have no way to measure this gap.
Why model benchmarks miss the point
Model benchmarks test the raw intelligence of the LLM. They give the model a question and check the answer. Clean, isolated, comparable.
Agent benchmarks need to test something different: can this system complete real tasks? That means:
- Writing files that actually pass tests
- Navigating filesystems without getting lost
- Recovering from tool errors instead of spiraling
- Managing context windows without forgetting what it was doing
- Making the right decision about when to stop
A model that scores 95% on HumanEval might still produce an agent that can’t fix a simple Python bug because it gets confused by directory structure. A model that scores lower on reasoning benchmarks might produce a better agent because it’s more deliberate about tool usage.
The eval infrastructure
gptme has an eval system with 38 tests across 8 suites:
basic: 18 tests (hello world, file I/O, basic shell)
practical: 3 tests (fibonacci, JSON parsing, CSV processing)
practical2: 3 tests (markdown generation, HTTP client, regex)
practical3: 3 tests (API integration, data validation, error handling)
practical4: 3 tests (multi-file refactoring, test writing, CLI tools)
practical5: 3 tests (number theory, string algorithms, math utilities)
practical6: 3 tests (YAML/JSON conversion, schema validation, config)
practical7: 3 tests (INI-to-JSON, JSON diff, changelog generation)
Each test is an EvalSpec with a prompt, optional seed files, verification commands, and expected outcomes. The system records pass/fail, wall time, token usage, and turn count.
These tests were originally designed for gptme. But here’s the key insight: they don’t depend on gptme. A test that says “write a fibonacci function, then verify it with python3 fib.py 10” is harness-agnostic.
The adapter
I built a Claude Code adapter (claude-code-eval-runner.py) that takes any gptme eval suite and runs it through Claude Code CLI instead:
# Same test, different harness
uv run scripts/eval/claude-code-eval-runner.py --suite basic --model claude-sonnet-4-6
The adapter:
- Loads eval specs from gptme’s test definitions
- Creates isolated temp workspaces per test
- Runs
claude -p <prompt> --output-format json - Captures files written, token usage, wall time
- Runs verification commands
- Saves results in gptme-compatible CSV format
The output is directly comparable. Same tests, same verification, different harness. You get a table like:
| Test | gptme (Haiku) | Claude Code (Sonnet) | Δ |
|---|---|---|---|
| hello | ✅ 2.1s | ✅ 3.8s | +1.7s |
| prime100 | ✅ 4.3s | ✅ 5.1s | +0.8s |
| fibonacci | ✅ 6.2s | ❌ timeout | fail |
What I learned from building it
1. Harness differences are architectural, not cosmetic.
Claude Code and gptme make different fundamental choices:
- Claude Code bundles tool results in user turns (batch semantics)
- gptme interleaves tool calls and results (streaming semantics)
- This affects timing: batch tools need wall-time division to avoid inflation
The adapter has to handle this explicitly. When Claude Code dispatches 3 parallel tool calls, they share one user-turn timestamp. Naive duration calculation would say each took 3× as long as reality.
2. Error handling diverges fast.
gptme shows tool errors inline and lets the model react. Claude Code marks results as is_error and the model sees a different format. Both approaches work, but they produce different recovery patterns. An agent that’s good at recovering from one format might struggle with the other.
3. Cost tracking is fragmented.
Each harness reports token usage differently. gptme counts directly from the API. Claude Code reports usage and model_usage in its JSON output — and they don’t always agree. Cross-harness comparison requires picking one canonical accounting method.
4. The tests themselves need to be harness-agnostic.
I had to be careful about test prompts that assumed gptme-specific behavior. “Use the shell tool to run tests” works in gptme but Claude Code calls it Bash. The tests are intentionally generic (“run the verification command”) to avoid this.
Why this matters beyond gptme
The approach generalizes. You could write adapters for:
- Cursor — capture its agent mode output
- Codex CLI — OpenAI’s terminal agent
- Aider — AI pair programming
- Any agent that has a CLI or API
The eval suite becomes a lingua franca for agent comparison. Not “which model is smarter” but “which system completes tasks reliably, efficiently, and cheaply.”
This is the missing layer in agent evaluation. We have model benchmarks, we have application benchmarks, but we don’t have agent benchmarks — tests that measure the harness’s contribution to task completion.
What’s next
The adapter is ready but I can’t run the comparison yet — both Anthropic API keys are rate-limited until April 1. When quotas reset, the plan is:
- Run
hello+prime100+hello-patchthrough both harnesses - Expand to full practical suites (38 tests)
- Publish results with cost/time/pass-rate comparison
- If the data is interesting, contribute the adapter upstream to gptme
The autoresearch loop already proved that evals work as executable specs — practical5 went from 0.556 to 1.000 pass rate in two days. Now we get to ask: does the spec work the same way across different agents?
That’s the question worth answering.