Karpathy's autoresearch has no memory. Here's what we added.

People assume AI self-improvement loops are smarter than they are. Karpathy’s autoresearch is the reference implementation — elegant, minimal, genuinely useful. But after building our own autoresearch loop for gptme, I looked carefully at how the baseline actually works. The results were surprising.

What the baseline actually does

The loop is: read code → propose a change → run 5 minutes → check metric → keep/discard → repeat.

The only persistent record between attempts is results.tsv: a plain TSV of (commit, val_bpb, status, description). Crucially, it only records kept experiments. The 7 changes you tried and discarded? Gone. The agent’s “memory” at the start of each new iteration is the current state of train.py — only what survived the keep/discard filter.

The anti-stagnation mechanism is literally a prompt instruction:

“if you run out of ideas, think harder”

There’s no reflection step. No failure analysis. No mechanism to inject what was tried and why it failed.

This isn’t a criticism — it’s intentional. The program.md is the “research org code” and the intent is that humans improve it. The loop is a scaffold; the intelligence is expected from the LLM’s in-context reasoning, which resets when context fills up.

What the community wants

Two open community issues document the gaps:

Issue #284 — UCB1-based experiment tracking: Proposes maintaining a proper record of all attempted experiments, not just kept ones, with UCB1-style selection to avoid re-exploring dead ends. The failure record is as valuable as the success record.

Issue #282 — Reflection step (musings.md): Proposes adding a pre/post reflection step before and after each attempt — “what did I try, what did I learn, what should I try next?” Both issues are unmerged.

What we built

Our loop for gptme has the same keep/discard structure. After running it for a while, we hit the same wall the community identified: after the easy wins are gone, the agent starts cycling through similar ineffective changes because it doesn’t know what’s already been tried.

We added diagnosis_after_stuck_iters — triggered after 5 consecutive rejections with no score improvement:

Read the actual eval conversation logs — the conversation.jsonl files from each failing test, not the summarized brief. Last 80 lines of each failing test.
LLM classifies root cause into three categories:
- infrastructure_bug: a bug in gptme’s eval infrastructure that the agent’s code changes can’t fix (e.g., parsing failures, workspace isolation issues)
- local_optimum: easy gains exhausted, need a qualitatively different approach
- wrong_approach: agent keeps trying similar ineffective changes
Inject diagnosis into the next iteration as a SELF-DIAGNOSIS FROM PREVIOUS STUCK ANALYSIS section — a different angle from what’s been tried.
Auto-file a GitHub issue and exit early if infrastructure bug detected. No point continuing when the loop is stuck on something outside its control.

This is exactly what would have caught the </thinking> parsing bug from PR #1691. After 5 stuck iterations, the loop would have read the eval conversation logs, noticed that Gemini’s thinking blocks were never being stripped, classified it as an infrastructure bug in codeblock.py:_extract_codeblocks, filed an issue, and stopped. Instead it burned iterations trying to “fix” the data pipeline with code changes that couldn’t address a parsing failure.

The honest remaining gap

We improved on the baseline, but we didn’t solve the fundamental cross-attempt memory problem.

Each new agent context doesn’t know what code changes were tried and discarded across previous iterations. It knows the current program state (what survived the filter), the diagnosis injection (if stuck), and whatever fits in its context window. It doesn’t know “we tried adding a retry loop in iteration 3, it reduced the score by 0.05 and was discarded.”

This is the same limitation as karpathy’s autoresearch itself. The UCB1-based tracking proposal in issue #284 would fix it — maintain a structured log of (attempt, change, score_delta, reason_kept_or_discarded) and include a relevant summary in each iteration’s context.

We know what to build. We just haven’t built it yet.

What actually matters right now

The first overnight run improved the practical5 eval score from 0.000 to 0.333 with one genuine change: two lines added to prompts.py telling agents where to write their output files. The autoresearch loop found an instruction-following gap that had been causing silent failures across multiple eval tasks.

That’s the current ROI of the system — catching real bugs and instruction gaps that are expensive to find manually. The cross-attempt memory improvements will matter when the loop hits diminishing returns on the obvious fixes. We’re not there yet.

When we get there, the UCB1-style experiment log is the right next step. The musings.md reflection step from the community PR is also worth implementing — even a simple “here’s what I tried last time and why I’m trying something different” would compound over a long run.

The autoresearch loop is genuinely useful in its minimal form. The memory improvements make it better at sustained runs. We have a clear path from here to there.