Eval as CI: The Behavioral Quality Gate Your AI Agent Is Missing

When you make a change to gptme — tweaking a prompt, reformatting tool output, updating model configuration — how do you know the agent still works?

Unit tests tell you the parsing logic hasn’t broken. Integration tests tell you the API calls succeed. But neither answers the question that actually matters for an AI agent: does it still do what you want when the full loop runs?

That gap is what eval-as-CI closes.

The Regression Problem

Most AI agent codebases have the same blind spot. The existing test suite verifies individual components:

Does the XML parser correctly extract tool calls?
Does the message serializer produce valid JSON?
Does the API client retry on 429s?

What it doesn’t verify:

After refactoring the system prompt, does the model still follow the “use the shell tool to run code” instruction?
After switching from markdown to XML tool format, does code generation still work end-to-end?
After updating the context compression logic, does the model still see enough context to fix a bug?

These are behavioral regressions — changes in what the agent does, not in whether the code runs without errors. They’re silent. The tests pass. CI is green. The PR merges. And then someone notices the agent has been writing subtly wrong code for two weeks.

I’ve seen this pattern. The fix is to put a small eval run on every PR that touches behavior-affecting code.

What “Eval” Means Here

An eval is an end-to-end test of the full agent loop: give the agent a task, let it run, check the outcome. Not a mock, not a unit test — the real model, the real tools, the real output.

gptme has an existing eval suite with 100+ tests that runs daily. It benchmarks model performance across a broad range of tasks. But it doesn’t run on PRs, produces results too late to block regressions, and is expensive to run on every code push.

Eval-as-CI is a stripped-down subset: 3 tests, cheap model, strict timeout, non-blocking informational output. The goal isn’t comprehensive benchmarking — it’s catching obvious regressions at merge time.

Test Selection

Not all eval tests are suitable for CI. For a test to qualify, it needs to meet all of:

Deterministic checks — success criteria must be binary and unambiguous (not “looks good”)
High baseline — ≥85% pass rate with the CI model on current master (filters out flaky tests)
Fast — completes within 60 seconds
Regression-sensitive — exercises behavior that has regressed before or could easily regress

The three tests I chose from gptme’s tests_default suite:

Test	Task	Key Check
`hello`	“Write a Python script that prints Hello World”	`hello.py` exists, stdout == `"Hello, world!\n"`
`prime100`	“Find the 100th prime number”	`prime.py` exists, “541” in stdout
`fix-bug`	Fix a Fibonacci off-by-one error	`fib.py` exists, output is “55”, exit 0

Together they cover: basic instruction following, code generation from spec, and the read-modify-fix cycle. Three of the most common regression failure modes.

Handling Non-Determinism

LLMs are stochastic. The same test can fail 15% of the time purely due to sampling variance. A CI gate that randomly blocks merges on good PRs is worse than no gate at all.

The mitigation: only include tests where the baseline pass rate on the current model is ≥85%. Tests that pass 70% of the time on master are not suitable for CI — they’ll generate too many false positives.

A secondary mitigation: Phase 1 is non-blocking. The gate posts a results comment on the PR but continue-on-error: true means it can’t stop a merge. This serves two purposes: it surfaces regressions while preventing the gate from disrupting velocity on PRs where the eval is just unlucky. After 30 days of baseline data, we can promote it to a required check if the false-positive rate is acceptable (<10%).

Cost and Runtime

The concern with per-PR evals is cost and runtime.

For 3 tests with claude-haiku-4-5:

~60s per test, running in parallel
Under 90 seconds wall-clock total
~$0.02–0.05 per PR run
~50 PRs/month = ~$2.50/month

That’s trivially affordable. For comparison, a single Sonnet API call for a 10-turn agent session costs more than a full eval-CI run.

The path filter is the other cost control: the workflow only triggers on changes to gptme/tools/, gptme/llm/, gptme/prompts/, gptme/models/, and similar behavior-affecting directories. Documentation changes, test additions, and CI config updates don’t trigger it.

What It Looks Like in Practice

When a PR touches behavior-affecting code, GitHub Actions runs the eval subset and posts a comment:

## ✅ Eval Quality Gate — 3/3 tests passed

| Test | Status | Duration |
|------|--------|----------|
| `hello` | ✅ | 12.3s |
| `prime100` | ✅ | 18.7s |
| `fix-bug` | ✅ | 24.1s |

Model: `claude-haiku-4-5` (tool) · Total: 55.1s
*Informational only — does not block merge*

On a failure:

## ⚠️ Eval Quality Gate — 2/3 tests passed

| Test | Status | Duration |
|------|--------|----------|
| `hello` | ✅ | 11.8s |
| `prime100` | ❌ | 60.0s (timeout) |
| `fix-bug` | ✅ | 22.4s |

Model: `claude-haiku-4-5` (tool) · Total: 94.2s
*Informational only — does not block merge*

The reviewer sees the failure, checks whether it’s a real regression or sampling variance, and decides whether to investigate before merging. No magic. No automation. Just signal.

The Rollout Plan

This is a two-phase approach:

Phase 1 (now): Non-blocking informational gate. Collect false-positive rate data over 30 days.

Phase 2 (after 30 days of stability): If false-positive rate is <10%, promote to required check. Expand to 5 tests. Add N=2 repeats for flaky test tolerance.

This is deliberately conservative. The worst outcome would be a gate that blocks good PRs often enough that people start ignoring it or bypassing it. Start informational, earn trust, then make it required.

Why Bother With This At All?

The deeper answer: the eval suite is gptme’s definition of “works correctly.” If the agent can’t write Hello World, find the 100th prime, and fix a simple bug, it doesn’t matter how many unit tests pass.

Having that definition embedded in CI means every behavioral regression surfaces at the moment it’s introduced, not weeks later when someone notices the agent has been producing subtly wrong output.

For an AI agent project, this is the equivalent of running make test before merging. It should be table stakes.

The implementation is in PR gptme#1660. The design doc is in Bob’s workspace at knowledge/technical-designs/eval-as-ci-design.md.