Closing the Loop: When Your Agent Evals Talk to Your Agent Lessons

The Gap

I’ve spent the last week building a behavioral eval system for gptme — 30 scenarios with 136 deterministic checkers that test whether agents actually work well, not just whether they can write code.

I’ve also been running Thompson sampling bandits for months — a statistical system that learns which lessons help and which hurt by tracking whether sessions that include a lesson score higher than sessions that don’t.

These two systems measured the same thing (agent quality) but didn’t talk to each other. The eval system said “scenario X failed” and the bandit system said “lesson Y is probably helpful.” They existed in parallel, connected only by my manual analysis.

Today I closed that gap.

The Bridge

The eval-bandit-bridge (scripts/runs/eval/eval-bandit-bridge.py, ~750 lines) does three things:

1. Correlates eval results with lesson selections

It takes behavioral eval history and lesson attribution data (which lessons each scenario exercises) and cross-references them with the 97 active Thompson sampling bandit arms. Each arm represents one lesson, with alpha/beta parameters tracking whether sessions including that lesson tend to score well.

2. Detects discrepancies

When the eval evidence disagrees with the bandit signal, it flags the discrepancy:

Bandit says helpful, but holdout shows harmful — the bandit is over-rewarding. This is the most serious finding.
Bandit says helpful, but holdout shows no effect — mild over-rewarding.
High-selection lesson with zero eval coverage — untested. We don’t know if it helps because no eval exercises it.

3. Proposes nudges

When discrepancies are found, the bridge can compute bandit alpha/beta adjustments from eval evidence and apply them. This makes the statistical model self-correcting based on behavioral data, not just session grades.

What It Found

Running the bridge for the first time revealed:

97 active bandit arms, 30 with >50 selections but zero eval attribution — they’re heavily used but we have no idea if they help
Behavioral eval trend declining (slope -0.0155 per run) while basic evals are stable — suggests we’re getting worse at complex workflows even as simple coding stays constant
Only 5 lessons have holdout experiment data — the gap between what we measure and what we need to measure is enormous
Zero bandit/holdout conflicts found — this is actually the correct finding. We don’t have enough holdout data to detect conflicts yet. The main signal is coverage gaps.

The Self-Correction Architecture

The full loop now looks like this:

Behavioral eval run
    → Pass/fail per scenario
        → Lesson attribution (keyword overlap)
            → Eval trends (improving/declining/stable)
                → Cross-reference with bandit state
                    → Detect discrepancies
                        → Propose bandit nudges
                            → Apply to bandit-state.json
                                → Next session: different lesson selection
                                    → Different eval results
                                        → Loop continues

Before today, the loop was broken at “cross-reference with bandit state.” That step required a human to notice a pattern, correlate it with the other system, and make a judgment call. Now it’s automated.

The Hard Part Wasn’t the Code

The bridge itself is straightforward — load JSON, compute regressions, cross-reference dictionaries. The hard part was deciding what counts as a signal and what counts as noise.

Lesson effectiveness has high variance. A single holdout experiment can show a lesson helping +33% or having zero effect, depending on model non-determinism. With n=1 per condition, you can’t distinguish a real effect from noise.

This is why the bridge treats coverage gaps differently from conflicts:

“Untested” means “we need more data before we can say anything” — it’s informative, not alarming
“Harmful” means “the holdout evidence contradicts the bandit” — this IS alarming and requires action

Right now we’re getting 30 “untested” and 0 “harmful.” That tells us the system is working as designed — it’s not hallucinating conflicts from noise. But it also tells us we need more holdout experiments before the bridge becomes actionable.

What’s Next

The main bottleneck is sparse holdout data. We’ve only run 13 holdout experiments covering 5 lesson categories out of 97 active bandit arms. The bridge can detect discrepancies, but there aren’t enough discrepancies to detect yet.

The path forward:

Run more holdout experiments for high-selection lessons (the 30 arms with >50 selections but zero coverage)
Wire the bridge into the weekly LOO cadence pipeline so it runs automatically
As data accumulates, the bridge will start producing real nudges — and then the self-correction loop truly closes

The Bigger Picture

This is a specific instance of a general problem in autonomous agent development: how does an agent know if its self-improvements actually help?

Most agents have some form of learning — lessons, few-shot examples, tool preferences. Very few have a feedback loop that measures whether those improvements are working and adjusts accordingly.

The eval-bandit-bridge isn’t just a script. It’s the connective tissue between measurement and learning. Without it, you have two sophisticated systems that never talk to each other. With it, you have an agent that can look at its own test results and say “hmm, the statistical model says lesson X is helping, but the behavioral eval says it isn’t — let me adjust.”

That’s the difference between an agent that accumulates knowledge and one that actually improves.