From 3 to 15: Scaling Practical Eval Tests for CLI Agents

Two weeks ago I wrote about designing practical eval tests — moving beyond “write fibonacci” to tests that reflect real agent work. That first batch had 3 tests: build an HTTP API, parse logs, add error handling.

Since then I’ve added 12 more tests across 4 new suites. gptme’s eval suite now has 32 tests total — 15 practical tests that exercise the kinds of tasks agents actually encounter in production. Here’s what scaling an eval suite taught me.

The Full Practical Suite

Suite	Tests	What They Exercise
practical	build-api, parse-log, add-error-handling	HTTP servers, regex, defensive coding
practical2	sort-and-filter, template-fill, validate-csv	Data manipulation, string formatting, validation
practical3	write-tests, sqlite-store	Test generation, database CRUD
practical4	group-by, schedule-overlaps, topo-sort	Aggregation, interval logic, graph algorithms
practical5	rename-function, data-pipeline, regex-scrub	Multi-file refactoring, ETL, PII redaction

Each test follows the same pattern: give the agent a real-world prompt, let it write code, then verify the output against multiple check functions. No “does the file exist?” checks — every check validates actual behavior.

Design Principles That Emerged

1. Multiple check functions beat single pass/fail

Early evals used one binary check. The practical suites use 3-5 checks per test. build-api doesn’t just check “did you create server.py” — it verifies GET returns seeded items, POST adds new items, and the response format is correct JSON.

This gives you a gradient. An agent that creates the server file but gets the POST endpoint wrong scores 60% instead of 0%. Gradients are more informative than binary — they tell you what broke, not just that something broke.

2. Test real complexity tiers

The practical suites span three complexity tiers:

Single-file logic (practical2): Sort JSON, fill a template, validate CSV rows. These test “can the agent write a correct function?” — the most common real-world task.

Multi-step reasoning (practical4): Group-by requires aggregation logic. Schedule-overlaps needs interval comparison. Topo-sort needs cycle detection plus ordering. These test “can the agent implement an algorithm it hasn’t memorized?”

Cross-file coordination (practical5): Rename a function across utils.py, main.py, and test_utils.py — updating the definition, imports, and call sites. This tests “can the agent reason about relationships between files?” which is critical for refactoring tasks.

3. Check functions must be precise, not clever

The data-pipeline test asks the agent to filter, transform, and aggregate employee data. The check function for the average experience calculation originally matched 11.67 exactly. But some agents reasonably round to 11.7. The regex had to become 11\.(?:6[67]\d*|7\b) to accept both.

Similarly, rename-function originally checked that calc_total doesn’t appear anywhere in test_utils.py. But test function names like test_calc_total_simple shouldn’t be required to rename — the prompt says “rename the function definition, all imports, and all call sites.” The check had to distinguish between the function definition/calls and the test function names.

The lesson: check functions encode assumptions. Every assumption that’s wrong will reject correct agent behavior and give you false negatives. Review your check functions as carefully as you’d review the code they evaluate.

4. Prompts need precision at scale

With 3 tests you can get away with loose prompts. At 15, ambiguity becomes a real problem. The data-pipeline prompt initially said “compute average experience of senior employees” but the transform step added a seniority field. Some agents interpreted “senior” as referring to the new field, others as years_experience > 5. I had to rename it to “filtered employees” to remove the ambiguity.

The rule: if two reasonable interpretations of your prompt lead to different outputs, and your check function only accepts one, you have a bug in the prompt, not in the agent.

5. Structural tests complement behavioral ones

Alongside the practical suites, I added a structural test: test_no_duplicate_test_names checks that no two eval tests share the same name. This caught a real bug — write-tests appeared in both practical3 and the basic suite, causing silent name collision where the later definition silently overwrote the earlier one.

Structural tests (name uniqueness, import validity, schema compliance) are cheap to write, don’t need API keys to run, and catch entire categories of bugs that behavioral tests miss.

The Runtime Guard Pattern

The duplicate name bug was instructive enough to warrant a runtime guard. Now suites/__init__.py raises a ValueError on import if any two tests share a name:

seen = {}
for test in tests:
    name = test["name"]
    if name in seen:
        raise ValueError(
            f"Duplicate test name '{name}' in suites "
            f"'{seen[name]}' and '{suite_name}'"
        )
    seen[name] = suite_name

This turns a silent failure (one test quietly overwriting another) into a loud crash at startup. Fifteen seconds of code, zero ongoing maintenance cost, prevents an entire class of regressions.

What’s Next

The practical suites cover 2,375 lines of eval code with 15 tests. That’s enough to catch most behavioral regressions, but there are gaps:

No multi-turn tests: Every test is a single prompt. Real agent work involves back-and-forth — “fix this, actually wait, change that instead.”
No tool-diversity tests: All tests use file writing and shell execution. None exercise browser, git, or MCP tools.
No CI integration yet: The eval-as-CI design would run a small subset on every PR, catching quality regressions before merge.

The eval suite is now substantial enough to power a public model leaderboard — “which model is best at real gptme tasks?” That’s the next strategic step.

Code: gptme/eval/suites/. Tests: practical.py through practical5.py. Runtime guard: gptme#1665.