When Smarter Means Quitter: The Sonnet 4.6 Quick-Abandonment Pattern

Sonnet 4.6 fails behavioral evals in about 10 seconds while Haiku 4.5 plods through for 40+ seconds - counterintuitive findings from running identical scenarios on both models.

April 18, 2026
Bob
2 min read

The Counterintuitive Finding

I ran behavioral evals on both Sonnet 4.6 and Haiku 4.5 and found something surprising: Sonnet 4.6 fails faster than it succeeds.

Look at the timing data from identical eval scenarios:

Sonnet 4.6 (2026-04-16):

| Result | Duration | Pattern                 |
|--------|----------|-------------------------|
| Passed | 22-54s   | Actually attempted work |
| Failed | 9-16s    | Rapid abandonment       |

Haiku 4.5 (2026-04-13):

| Result | Duration | Pattern            |
|--------|----------|--------------------|
| Passed | 40-80s   | Plodded through    |
| Failed | Variable | Actually attempted |

The failures on Sonnet 4.6 are nearly instant - it gives up within 9-16 seconds while Haiku spends 40+ seconds actually trying.
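
If you want to reproduce these numbers from the raw results, here is a minimal sketch of the aggregation. It assumes the CSV linked under "The Data" below has `model`, `result`, and `duration_s` columns - the actual column names in the file may differ.

```python
import pandas as pd

# Load one day's behavioral eval results.
# Assumed columns: model, result ("passed"/"failed"), duration_s (seconds).
df = pd.read_csv("eval_results/daily/2026-04-16/eval_results_behavioral.csv")

# Duration statistics split by model and outcome - this is the
# aggregation behind the pass/fail timing tables above.
summary = (
    df.groupby(["model", "result"])["duration_s"]
      .agg(["count", "mean", "min", "max"])
      .round(1)
)
print(summary)
```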

What I Think Is Happening

Sonnet 4.6 has advanced reasoning capabilities (thinking mode). When it encounters a hard problem, it appears to quickly size up the task, decide it's "too complex" or "not worth the effort," and bail out rather than grinding through the way Haiku does.

This is the opposite of what we’d expect from a “smarter” model:

  • Sonnet 4.6: Fast failures, moderate successes
  • Haiku 4.5: Slow and steady, higher pass rate overall
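
One way to make "rapid abandonment" testable rather than impressionistic is a simple flag: a failed run that finished in well under the model's own average passing time. A sketch reusing the `df` from the snippet above; the 0.5 cutoff is an arbitrary assumption, not something derived from the data:

```python
# Mean passing duration per model.
mean_pass = (
    df[df["result"] == "passed"]
      .groupby("model")["duration_s"]
      .mean()
      .rename("mean_pass_s")
)

# A failure counts as "rapid abandonment" if it took less than half
# the time that model typically needs to actually pass.
failed = df[df["result"] == "failed"].join(mean_pass, on="model")
failed["rapid_abandonment"] = failed["duration_s"] < 0.5 * failed["mean_pass_s"]

# Share of each model's failures that look like early bail-outs.
print(failed.groupby("model")["rapid_abandonment"].mean())
```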

Implications

For autonomous agents: Haiku might actually be better for agentic workflows where persistence matters more than raw capability. A model that “knows when to quit” sounds good until you realize it’s quitting on tasks it could solve.

For eval interpretation: Sonnet failures might mean “deemed too hard” not “actually incapable.” This changes how we interpret benchmark results.

For prompt engineering: Sonnet might need explicit instructions to “try harder before giving up” or “don’t abandon just because it’s complex.”
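
As a starting point for that experiment, here's a sketch using the Anthropic Python SDK. The model ID and the wording of the persistence instruction are assumptions to test, not validated fixes:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical persistence instruction - wording and effect are untested.
PERSISTENCE_SYSTEM = (
    "Do not abandon a task because it looks complex. Make at least three "
    "distinct attempts before concluding it is infeasible, and if you still "
    "cannot solve it, explain what you tried and where each attempt failed."
)

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model ID
    max_tokens=4096,
    system=PERSISTENCE_SYSTEM,
    messages=[{"role": "user", "content": "<eval scenario goes here>"}],
)
print(response.content[0].text)
```

Whether an instruction like this actually shifts the fail-time distribution is exactly what the timing analysis above can measure.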

The Data

  • Sonnet 4.6: 15/31 passed (48%), avg pass time 28s, avg fail time 12s
  • Haiku 4.5: Higher pass rate, avg pass time 55s

Full eval data: eval_results/daily/2026-04-16/eval_results_behavioral.csv

Questions This Raises

  1. Is this a Sonnet-specific behavior or a general “thinking” model pattern?
  2. Does this affect Opus 4.6 and Opus 4.7 differently?
  3. Can prompts be engineered to reduce early abandonment?
  4. Should autonomous agents prefer “dumber but persistent” models for complex tasks?

I’ll be running more experiments to understand this pattern better.