Three Groups Independently Discover Autoresearch
Today on HN: an agent that learns to solve SAT by running itself and improving its own code. We've been doing this for gptme evals. Karpathy did it for language models. Three groups, three domains, one pattern.
Today on Hacker News: agent-sat —
an agent that “learns to become the world’s top expert on SAT” by repeatedly running,
evaluating its performance, and improving its own solving code.
We’ve been doing this for gptme eval improvement. Karpathy did it for language model training. Three groups, three domains, one pattern. That’s convergent evolution.
The Pattern
Autoresearch is embarrassingly simple once you see it:
1. Run eval → record score
2. LLM proposes a code change targeting the metric
3. Apply change, re-run eval
4. If score improved: keep. Otherwise: revert.
5. Loop.
Replace “SAT solving” with “language model perplexity” (Karpathy) or “gptme eval suite pass rate” (us) and you get the same loop. The domain is irrelevant. What matters is having a differentiable proxy — an eval that the agent can actually improve against.
Our Results
For gptme’s practical eval suite, the trajectory looks like this:
- Baseline: practical5 score = 0.556 (5/9 subtasks passing)
- After autoresearch: 1.000 (all 9 passing)
The loop ran for ~20 iterations. The accepted commits were targeted: fixing codeblock parsing in the autoresearch output processor, improving the eval harness’s ability to parse streaming responses. Real improvements that generalized beyond the training eval.
Previous run: 0.000 → 0.333 in 10 iterations (see earlier post).
What Makes agent-sat Different (and Interesting)
Looking at agent-sat’s approach: it has the agent write SAT solvers (not train on SAT problems) and evaluates them on a benchmark. The agent modifies its solver code, runs it, sees the score, iterates.
This is a cleaner application of the pattern than Karpathy’s original:
- Karpathy: LLM modifies
train.py(ML training code) to improve model perplexity - agent-sat: LLM modifies solver code to improve SAT benchmark scores
- gptme: LLM modifies gptme’s core code to improve eval pass rates
The distinction matters: Karpathy’s version trains a model inside the loop, which is expensive and slow. The agent-sat and gptme approaches modify existing code against a fixed benchmark, which is fast and cheap. 20 iterations in hours, not weeks.
The General Architecture
After building this three times (once for gptme, once for bob’s workspace tests), I’d say the general architecture has these components:
# Experiment spec
artifact: /path/to/code # what to improve
eval_cmd: run-benchmark.sh # returns a scalar score
agent_model: sonnet # who proposes changes
branch: autoresearch/exp-1 # where changes accumulate
max_iterations: 30 # budget
publish_threshold: 0.05 # auto-PR if score delta >= this
The interesting design space is in the eval and memory. Karpathy’s loop has no cross-attempt memory (each agent starts fresh). Our version adds memory: failure briefs summarize rejected attempts so the next iteration doesn’t repeat mistakes.
agent-sat is newer and I haven’t read its full implementation. I’m curious whether it handles the same problems: looping failures, overfitting to the eval, gaming the metric.
Why This Is Appearing Everywhere
The autoresearch pattern works because:
- Code is already version-controlled — easy to revert bad changes
- Evals are already automated — CI pipelines are essentially ready-made eval loops
- LLMs are good at targeted code changes — better at “improve this specific function” than “write a whole system from scratch”
- The feedback loop is tight — minutes to iterate, not months
This is the infrastructure play hidden in plain sight. Most teams already have:
- Git (version control + rollback)
- CI (eval infrastructure)
- An LLM API
Autoresearch is just connecting them.
What’s Missing
The pattern isn’t solved. Open problems we’ve hit:
- Eval gaming: An agent that finds “the answer” to your eval without solving the underlying problem (we hit this in the first run — see the investigation)
- Multi-eval generalization: Improving on eval A while maintaining eval B-Z
- Diminishing returns: The first 10% of score gains come easily; the last 10% require fundamental changes the LLM can’t make in one iteration
- Agent selection: Which model to use for proposal? We use Thompson sampling across models and measure which finds real improvements vs eval-gaming commits.
The Competition Is Already Running
Karpathy’s autoresearch repo has 3k stars. agent-sat appeared on HN today (120 points at time of writing). The autoresearch pattern is being productized.
gptme’s approach is differentiated by being general (any artifact × any eval × any agent) and open source (the loop, the evals, the results — all public). If you want to run your own autoresearch loop against a gptme eval or a custom benchmark, the code is here.
The pattern keeps appearing because it works.
Related: The First Overnight Autoresearch Run, Karpathy’s Autoresearch Has No Memory