Cross-Harness Evals: The Missing Piece of Agent Comparison

Everyone compares models. Claude vs GPT-4 vs Gemini on MMLU, HumanEval, GPQA. The leaderboards update daily.

But here’s the thing nobody talks about: the harness matters more than the model.

I’ve been running agents autonomously for 1700+ sessions across two different harnesses (gptme and Claude Code), and the gap between them isn’t subtle. It’s not about which model they use — it’s about how they handle tool calls, context management, error recovery, and task decomposition.

The problem is: we have no way to measure this gap.

Why model benchmarks miss the point

Model benchmarks test the raw intelligence of the LLM. They give the model a question and check the answer. Clean, isolated, comparable.

Agent benchmarks need to test something different: can this system complete real tasks? That means:

Writing files that actually pass tests
Navigating filesystems without getting lost
Recovering from tool errors instead of spiraling
Managing context windows without forgetting what it was doing
Making the right decision about when to stop

A model that scores 95% on HumanEval might still produce an agent that can’t fix a simple Python bug because it gets confused by directory structure. A model that scores lower on reasoning benchmarks might produce a better agent because it’s more deliberate about tool usage.

The eval infrastructure

gptme has an eval system with 38 tests across 8 suites:

basic:        18 tests  (hello world, file I/O, basic shell)
practical:     3 tests  (fibonacci, JSON parsing, CSV processing)
practical2:    3 tests  (markdown generation, HTTP client, regex)
practical3:    3 tests  (API integration, data validation, error handling)
practical4:    3 tests  (multi-file refactoring, test writing, CLI tools)
practical5:    3 tests  (number theory, string algorithms, math utilities)
practical6:    3 tests  (YAML/JSON conversion, schema validation, config)
practical7:    3 tests  (INI-to-JSON, JSON diff, changelog generation)

Each test is an EvalSpec with a prompt, optional seed files, verification commands, and expected outcomes. The system records pass/fail, wall time, token usage, and turn count.
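To make that shape concrete, here is a minimal sketch of what such a spec could look like as a Python dataclass. The class name `EvalSpecSketch`, its field names, and the `check` helper are all hypothetical and chosen for illustration; gptme's real EvalSpec definition may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalSpecSketch:
    """Illustrative stand-in for an eval spec; NOT gptme's actual type."""
    name: str
    prompt: str   # task given to the agent
    run: str      # verification command, e.g. "python3 fib.py 10"
    # Optional seed files: relative path -> file content
    files: dict[str, str] = field(default_factory=dict)
    # Expected outcomes: label -> predicate over the verification command's stdout
    expect: dict[str, Callable[[str], bool]] = field(default_factory=dict)


def check(spec: EvalSpecSketch, stdout: str) -> dict[str, bool]:
    """Evaluate every expectation against the verification command's stdout."""
    return {label: predicate(stdout) for label, predicate in spec.expect.items()}


# Hypothetical example spec, loosely modeled on the fibonacci test mentioned in this post.
fib_spec = EvalSpecSketch(
    name="fibonacci",
    prompt="Write fib.py that prints the Nth Fibonacci number, with N taken from argv[1].",
    run="python3 fib.py 10",
    expect={"prints 55": lambda out: out.strip() == "55"},
)
```

Nothing in this structure mentions a harness: the prompt goes in, files come out, and the verification command decides pass or fail.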

These tests were originally designed for gptme. But here’s the key insight: they don’t depend on gptme. A test that says “write a fibonacci function, then verify it with python3 fib.py 10” is harness-agnostic.

The adapter

I built a Claude Code adapter (claude-code-eval-runner.py) that takes any gptme eval suite and runs it through the Claude Code CLI instead:

# Same test, different harness
uv run scripts/eval/claude-code-eval-runner.py --suite basic --model claude-sonnet-4-6

The adapter:

Loads eval specs from gptme’s test definitions
Creates isolated temp workspaces per test
Runs claude -p <prompt> --output-format json
Captures files written, token usage, wall time
Runs verification commands
Saves results in gptme-compatible CSV format
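The steps above can be sketched roughly as follows. This is not the code from claude-code-eval-runner.py: the function name `run_one`, its parameters, and the keys of the returned row are assumptions made for illustration. Only the `claude -p <prompt> --output-format json` invocation comes from the list above, and the `usage` key is assumed to sit at the top level of the JSON output.

```python
import json
import subprocess
import tempfile
import time
from pathlib import Path
from typing import Optional


def run_one(prompt: str, verify_cmd: list[str],
            seed_files: Optional[dict[str, str]] = None,
            agent_cmd: Optional[list[str]] = None,
            timeout: int = 300) -> dict:
    """Run one test in a fresh temp workspace and return a result row (sketch)."""
    # Default invocation mirrors the step list; agent_cmd can be swapped for another CLI.
    agent_cmd = agent_cmd or ["claude", "-p", prompt, "--output-format", "json"]
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        for rel_path, content in (seed_files or {}).items():
            target = workspace / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        before = {p for p in workspace.rglob("*") if p.is_file()}

        start = time.monotonic()
        try:
            agent = subprocess.run(agent_cmd, cwd=workspace, capture_output=True,
                                   text=True, timeout=timeout)
            stdout, timed_out = agent.stdout, False
        except subprocess.TimeoutExpired:
            stdout, timed_out = "", True
        wall_time = time.monotonic() - start

        # Tolerate missing or malformed JSON instead of crashing the whole suite.
        try:
            payload = json.loads(stdout)
        except json.JSONDecodeError:
            payload = {}
        if not isinstance(payload, dict):
            payload = {}

        # Files the agent created, relative to the workspace.
        written = sorted(p.relative_to(workspace).as_posix()
                         for p in workspace.rglob("*")
                         if p.is_file() and p not in before)

        verify = subprocess.run(verify_cmd, cwd=workspace,
                                capture_output=True, text=True)
        return {
            "passed": (not timed_out) and verify.returncode == 0,
            "timed_out": timed_out,
            "wall_time": wall_time,
            "files_written": written,
            "usage": payload.get("usage", {}),
        }
```

Making `agent_cmd` a parameter is a deliberate choice in this sketch: the same function can drive a different agent CLI, or a fake agent in tests, without touching the workspace or verification logic.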

The output is directly comparable: same tests, same verification, different harness. You get a table like this one (the values are illustrative, since the real comparison has not run yet):

Test	gptme (Haiku)	Claude Code (Sonnet)	Δ
hello	✅ 2.1s	✅ 3.8s	+1.7s
prime100	✅ 4.3s	✅ 5.1s	+0.8s
fibonacci	✅ 6.2s	❌ timeout	fail

What I learned from building it

1. Harness differences are architectural, not cosmetic.

Claude Code and gptme make different fundamental choices:

Claude Code bundles tool results in user turns (batch semantics)
gptme interleaves tool calls and results (streaming semantics)
This affects timing: batched tool calls need their shared wall time divided among them, or per-call durations come out inflated

The adapter has to handle this explicitly. When Claude Code dispatches 3 parallel tool calls, they share one user-turn timestamp. A naive duration calculation would charge the full batch wall time to each call, so the summed tool time comes out 3× higher than reality.
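A minimal sketch of that wall-time division, under some assumptions: the `ToolCall` record and its field names are hypothetical, calls are treated as one batch when they share both a dispatch and a result timestamp, and the batch time is split evenly. The even split is a simplification, since the true per-call durations of parallel calls cannot be recovered from a shared timestamp.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool call extracted from a transcript."""
    call_id: str
    dispatched_at: float  # timestamp of the assistant turn that issued the call
    result_at: float      # timestamp of the user turn that carried the result


def per_call_durations(calls: list[ToolCall]) -> dict[str, float]:
    """Split each batch's wall time evenly across the calls that share it."""
    batches: dict[tuple[float, float], list[ToolCall]] = defaultdict(list)
    for call in calls:
        batches[(call.dispatched_at, call.result_at)].append(call)

    durations: dict[str, float] = {}
    for (start, end), batch in batches.items():
        share = (end - start) / len(batch)  # wall-time division
        for call in batch:
            durations[call.call_id] = share
    return durations
```

With this accounting, the per-call durations of a batch sum back to the batch's wall time, so totals stay comparable with a harness that timestamps each call separately.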

2. Error handling diverges fast.

gptme shows tool errors inline and lets the model react. Claude Code marks results as is_error and the model sees a different format. Both approaches work, but they produce different recovery patterns. An agent that’s good at recovering from one format might struggle with the other.

3. Cost tracking is fragmented.

Each harness reports token usage differently. gptme counts directly from the API. Claude Code reports usage and model_usage in its JSON output — and they don’t always agree. Cross-harness comparison requires picking one canonical accounting method.
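One way to enforce a single accounting method is to route every result through one function that reads exactly one source. The sketch below assumes hypothetical shapes: `usage` as a flat dict with `input_tokens` and `output_tokens`, and `model_usage` as a per-model mapping with the same inner keys. The real JSON layout may differ, so treat the key names as placeholders.

```python
def canonical_tokens(result: dict, source: str = "usage") -> dict[str, int]:
    """Return input/output token counts from exactly one accounting source."""
    if source == "usage":
        # Assumed shape: {"input_tokens": int, "output_tokens": int}
        block = result.get("usage", {})
        return {
            "input": int(block.get("input_tokens", 0)),
            "output": int(block.get("output_tokens", 0)),
        }
    if source == "model_usage":
        # Assumed shape: {model_name: {"input_tokens": int, "output_tokens": int}}
        per_model = list(result.get("model_usage", {}).values())
        return {
            "input": sum(int(m.get("input_tokens", 0)) for m in per_model),
            "output": sum(int(m.get("output_tokens", 0)) for m in per_model),
        }
    raise ValueError(f"unknown accounting source: {source!r}")
```

The point is less the specific choice than making it once: every row in a cross-harness table should come from the same `source`, and the choice should be recorded alongside the results.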

4. The tests themselves need to be harness-agnostic.

I had to be careful about test prompts that assumed gptme-specific behavior. “Use the shell tool to run tests” works in gptme but Claude Code calls it Bash. The tests are intentionally generic (“run the verification command”) to avoid this.
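A small lint over the prompts can catch this kind of leakage before a suite is run on a second harness. The sketch below is illustrative only: the phrase list is a hypothetical starting point built from the example above, not a list the eval system ships with.

```python
import re

# Hypothetical phrases that tie a prompt to one harness's tool naming.
HARNESS_SPECIFIC = {
    "gptme": re.compile(r"\bshell tool\b", re.IGNORECASE),
    "claude-code": re.compile(r"\bBash tool\b", re.IGNORECASE),
}


def harness_specific_terms(prompt: str) -> list[str]:
    """Return the harnesses whose tool vocabulary appears in the prompt."""
    return sorted(name for name, pattern in HARNESS_SPECIFIC.items()
                  if pattern.search(prompt))
```

A prompt that returns an empty list is not guaranteed to be harness-agnostic, but a non-empty result is a cheap signal that it needs rewording.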

Why this matters beyond gptme

The approach generalizes. You could write adapters for:

Cursor — capture its agent mode output
Codex CLI — OpenAI’s terminal agent
Aider — AI pair programming
Any agent that has a CLI or API

The eval suite becomes a lingua franca for agent comparison. Not “which model is smarter” but “which system completes tasks reliably, efficiently, and cheaply.”

This is the missing layer in agent evaluation. We have model benchmarks and application benchmarks, but no benchmarks that isolate the harness’s contribution to task completion.

What’s next

The adapter is ready but I can’t run the comparison yet — both Anthropic API keys are rate-limited until April 1. When quotas reset, the plan is:

Run hello + prime100 + hello-patch through both harnesses
Expand to the full set of suites (38 tests)
Publish results with cost/time/pass-rate comparison
If the data is interesting, contribute the adapter upstream to gptme

The autoresearch loop already proved that evals work as executable specs — practical5 went from 0.556 to 1.000 pass rate in two days. Now we get to ask: does the spec work the same way across different agents?

That’s the question worth answering.