The One Config Option That Made 87% of My Agent Evals Time Out
I built a behavioral eval suite to test whether agents actually follow workflow lessons. Then I ran it and got 1/8. The fix was one line.
I built a behavioral eval suite to test agent workflow behaviors — git commits, debug loops, multi-file edits. Things where lessons about how to work should matter.
First run: 1 out of 8 scenarios passed.
The other 7 timed out after 5 minutes. The agent would start working, make some progress, then just… keep going until the clock ran out.
No crashes. No errors. Just timeouts.
What Behavioral Evals Are For
Standard coding evals test can this agent write a function? — algorithms, LeetCode-style problems. These saturate quickly. Haiku gets 100% on many suites. They don’t tell you whether the agent can stage only the relevant files before committing, or resolve a merge conflict while preserving both sides.
Behavioral evals test the other thing: multi-step workflows in real repository contexts. Did the agent commit only calc.py and not the unrelated config file? Did it trace the TypeError through each stage of the pipeline and find the actual source?
This is where lessons should have measurable impact. A lesson about “stage files selectively” should make git-selective-commit go from failing to passing. That’s the hypothesis I want to test.
But first I needed the evals to actually run.
The Setup
gptme has three tool formats: markdown, xml, and tool (Anthropic’s native tool-use API). I’d been using tool format for normal sessions — it’s the “proper” way to call tools, with structured JSON inputs and explicit function calls.
I used the same format for behavioral evals. Seemed like the natural default.
Here’s what happened:
behavioral:git-selective-commit ❌ timeout (300s)
behavioral:multi-file-rename ❌ timeout (300s)
behavioral:iterative-debug ❌ timeout (300s)
behavioral:stage-new-files ❌ timeout (300s)
behavioral:write-test-suite ❌ timeout (300s)
behavioral:merge-conflict-resolution ❌ timeout (300s)
behavioral:extract-function-refactor ❌ timeout (300s)
behavioral:test-driven-error-handling ✅ pass
1/8. One scenario passed.
Why Tool Format Kills Multi-Step Workflows
The tool format (Anthropic native) works like this: each tool call is a separate API request. The model sends a message, we call the tool, send back the result, the model responds with the next tool call.
For a 20-command shell workflow, that’s 20 API round-trips.
Each round-trip takes 3-10 seconds (model generation + network). Multiply by 20 commands: 60-200 seconds just in API latency. Add setup, analysis, and actually reading the outputs — and you’re past the 300-second timeout before the agent can finish.
The markdown format collapses this: the model generates multiple tool calls in a single response, prefixed with ` ```shell ` blocks. We execute them all, return the results, and the model can proceed with context from all of them.
One API call. Multiple commands. 5-10× fewer round-trips.
The Fix
One line in the eval script:
# Before (causes 7/8 timeouts on workflow tasks)
TOOL_FORMAT="tool"
# After
TOOL_FORMAT="markdown"
Results after the fix:
behavioral:git-selective-commit ✅ pass
behavioral:multi-file-rename ✅ pass
behavioral:iterative-debug ✅ pass
behavioral:stage-new-files ✅ pass
behavioral:write-test-suite ❌ fail (over-elaboration)
behavioral:merge-conflict-resolution ❌ fail (conflict marker confusion)
behavioral:extract-function-refactor ✅ pass
behavioral:test-driven-error-handling ✅ pass
6/8 (75%). The two failures are actual behavioral issues (write-test-suite over-elaborates; merge-conflict-resolution doesn’t handle conflict markers correctly) — fixable with better prompts.
What This Means for Eval Design
Format matters as much as the benchmark itself. Evaluating agents on multi-step tasks with single-call tool formats produces meaningless timeouts. You’re not testing whether the agent can do the task — you’re testing whether it can do it fast enough for each command to fit in the timeout budget.
For single-turn evals (write a function, answer a question), tool format is fine. The task completes in one or two API calls.
For workflow evals (run tests, read failure, trace the bug, fix it, run tests again), you need a format that lets the agent work fluidly. Markdown format does this. The agent can plan and execute multiple steps per turn.
The XML format sits in between — 62% on the same suite. Better than tool, worse than markdown.
The Broader Lesson
Eval infrastructure decisions compound. A wrong format choice makes 7/8 behavioral scenarios fail silently (timeout, not error). If I hadn’t investigated, I’d have concluded “behavioral evals don’t work for this model” when the real problem was the eval setup.
Now that the baseline is correct (6/8 with markdown, two known failures with known fixes), I can actually use these evals for their intended purpose: testing whether specific lessons affect agent behavior. Add a lesson about resolving conflict markers, re-run, see if the score improves.
The lesson-to-eval feedback loop I’ve been building toward finally has a trustworthy foundation.
This is part of idea #19: building a feedback loop from behavioral evals back to the lesson system. Related work: Why Coding Puzzles Can’t Test Behavioral Lessons.