Three Groups Independently Discover Autoresearch

Today on Hacker News: agent-sat — an agent that “learns to become the world’s top expert on SAT” by repeatedly running, evaluating its performance, and improving its own solving code.

We’ve been running the same loop to improve gptme’s eval scores. Karpathy did it for language model training. Three groups, three domains, one pattern. That’s convergent evolution.

The Pattern

Autoresearch is embarrassingly simple once you see it:

1. Run eval → record score
2. LLM proposes a code change targeting the metric
3. Apply change, re-run eval
4. If score improved: keep. Otherwise: revert.
5. Loop.
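The five steps above can be sketched as a greedy keep-or-revert loop. This is a minimal illustration, not the code any of the three projects actually runs: `evaluate` and `propose` are hypothetical callables standing in for the eval and the LLM, and the toy usage swaps the LLM for a random single-character mutation so the example runs on its own. It assumes higher scores are better; for a metric like perplexity the comparison would flip.

```python
import random
from typing import Callable


def autoresearch_loop(
    evaluate: Callable[[str], float],
    propose: Callable[[str, float], str],
    artifact: str,
    max_iterations: int = 20,
) -> tuple[str, float, list[float]]:
    """Greedy keep-or-revert loop. Assumes a higher score is better."""
    best = artifact
    best_score = evaluate(best)                # step 1: run eval, record score
    history = [best_score]
    for _ in range(max_iterations):            # step 5: loop
        candidate = propose(best, best_score)  # step 2: propose a change
        score = evaluate(candidate)            # step 3: apply change, re-run eval
        if score > best_score:                 # step 4: keep if improved...
            best, best_score = candidate, score
        # ...otherwise revert, which here just means dropping the candidate.
        history.append(best_score)
    return best, best_score, history


# Toy stand-ins (hypothetical): the "artifact" is a string, the "eval" is the
# fraction of characters matching a target, and the "LLM" mutates one character.
TARGET = "satisfiable"
_rng = random.Random(0)


def toy_eval(text: str) -> float:
    return sum(a == b for a, b in zip(text, TARGET)) / len(TARGET)


def toy_propose(text: str, score: float) -> str:
    i = _rng.randrange(len(text))
    return text[:i] + _rng.choice("abcdefghijklmnopqrstuvwxyz") + text[i + 1:]


if __name__ == "__main__":
    best, score, history = autoresearch_loop(
        toy_eval, toy_propose, "x" * len(TARGET), max_iterations=200
    )
    print(best, round(score, 3))
```

Because a candidate is only kept when it strictly improves the score, the recorded history never goes down. That is the whole guarantee the loop gives you, and it is only as good as the eval behind it.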

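Replace “SAT solving” with “language model perplexity” (Karpathy) or “gptme eval suite pass rate” (us) and you get the same loop. The domain is irrelevant. What matters is having a measurable proxy: an automated eval that returns a score the agent can actually improve against. Nothing needs to be differentiable, because the loop only compares the score before and after each change.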

Our Results

For gptme’s practical eval suite, the trajectory looks like this:

Baseline: practical5 score = 0.556 (5/9 subtasks passing)
After autoresearch: 1.000 (all 9 passing)

The loop ran for ~20 iterations. The accepted commits were targeted: fixing codeblock parsing in the autoresearch output processor, and improving the eval harness’s ability to parse streaming responses. These were real improvements that generalized beyond the eval being optimized.

Previous run: 0.000 → 0.333 in 10 iterations (see earlier post).

What Makes agent-sat Different (and Interesting)

Looking at agent-sat’s approach: it has the agent write SAT solvers (not train on SAT problems) and evaluates them on a benchmark. The agent modifies its solver code, runs it, sees the score, iterates.

This is a cleaner application of the pattern than Karpathy’s original:

Karpathy: LLM modifies train.py (ML training code) to improve model perplexity
agent-sat: LLM modifies solver code to improve SAT benchmark scores
gptme: LLM modifies gptme’s core code to improve eval pass rates

The distinction matters: Karpathy’s version trains a model inside the loop, which is expensive and slow. The agent-sat and gptme approaches modify existing code against a fixed benchmark, which is fast and cheap. 20 iterations in hours, not weeks.

The General Architecture

After building this a few times (including once for gptme and once for bob’s workspace tests), I’d say the general architecture has these components:

# Experiment spec
artifact: /path/to/code     # what to improve
eval_cmd: run-benchmark.sh  # returns a scalar score
agent_model: sonnet          # who proposes changes
branch: autoresearch/exp-1   # where changes accumulate
max_iterations: 30           # budget
publish_threshold: 0.05      # auto-PR if score delta >= this
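
To make the spec concrete, here is one way it could be represented in code. This is a sketch under assumptions: the `ExperimentSpec` class, `parse_score`, and `run_eval` are hypothetical names, and I am assuming the eval command prints its scalar score as the last non-empty line of stdout, which the spec itself does not say.

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Mirrors the fields of the experiment spec above."""
    artifact: str                    # what to improve
    eval_cmd: str                    # command that reports a scalar score
    agent_model: str                 # who proposes changes
    branch: str                      # where changes accumulate
    max_iterations: int = 30         # budget
    publish_threshold: float = 0.05  # auto-PR if score delta >= this

    def should_publish(self, baseline: float, final: float) -> bool:
        """True when the total gain meets the publish threshold."""
        return (final - baseline) >= self.publish_threshold


def parse_score(stdout: str) -> float:
    """Assumption: the score is the last non-empty line of the eval output."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("eval produced no output")
    return float(lines[-1])


def run_eval(spec: ExperimentSpec) -> float:
    """Run the eval command and return its score (raises if the command fails)."""
    result = subprocess.run(
        shlex.split(spec.eval_cmd), capture_output=True, text=True, check=True
    )
    return parse_score(result.stdout)


spec = ExperimentSpec(
    artifact="/path/to/code",
    eval_cmd="run-benchmark.sh",
    agent_model="sonnet",
    branch="autoresearch/exp-1",
)
```

With the numbers from our run, `spec.should_publish(0.556, 1.000)` is true, since a gain of 0.444 clears the 0.05 threshold.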

The interesting design space is in the eval and memory. Karpathy’s loop has no cross-attempt memory (each agent starts fresh). Our version adds memory: failure briefs summarize rejected attempts so the next iteration doesn’t repeat mistakes.
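
A rough sketch of what that memory could look like: each rejected attempt leaves behind a short brief, and the most recent briefs are appended to the next proposal prompt. The `FailureBrief` fields and the prompt wording are assumptions made for illustration, not the format our loop actually uses.

```python
from dataclasses import dataclass


@dataclass
class FailureBrief:
    iteration: int
    summary: str        # what the rejected change tried to do
    score_delta: float  # candidate score minus best score so far


def build_prompt(task: str, briefs: list[FailureBrief], max_briefs: int = 5) -> str:
    """Append the most recent failure briefs to the task description."""
    lines = [task]
    recent = briefs[-max_briefs:]  # cap the memory so the prompt stays short
    if recent:
        lines.append("")
        lines.append("Previously rejected attempts (do not repeat these):")
        for brief in recent:
            lines.append(
                f"- iteration {brief.iteration}: {brief.summary} "
                f"(score delta {brief.score_delta:+.3f})"
            )
    return "\n".join(lines)


if __name__ == "__main__":
    briefs = [FailureBrief(3, "rewrote the parser with a regex", -0.111)]
    print(build_prompt("Improve the eval pass rate.", briefs))
```

Capping the number of briefs is a design choice: it keeps the prompt bounded at the cost of forgetting old failures, so a long-running loop might want to summarize older briefs rather than drop them.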

agent-sat is newer and I haven’t read its full implementation. I’m curious whether it handles the same problems: looping failures, overfitting to the eval, gaming the metric.

Why This Is Appearing Everywhere

The autoresearch pattern works because:

Code is already version-controlled — easy to revert bad changes
Evals are already automated — CI pipelines are essentially ready-made eval loops
LLMs are good at targeted code changes — better at “improve this specific function” than “write a whole system from scratch”
The feedback loop is tight — minutes to iterate, not months

This is the infrastructure play hidden in plain sight. Most teams already have:

Git (version control + rollback)
CI (eval infrastructure)
An LLM API

Autoresearch is just connecting them.
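
On the Git side, the connection is small. The sketch below shows the keep and revert halves as plain Git commands; the function names are hypothetical and this is an assumed workflow, not a description of how any of the three projects manage their branches. Note that the revert path discards all uncommitted work, including untracked files, so it only belongs in a dedicated experiment checkout.

```python
import subprocess


def git_commands(improved: bool, message: str) -> list[list[str]]:
    """Return the Git commands for keeping or reverting the working tree."""
    if improved:
        # Keep: stage everything and record the accepted change.
        return [["git", "add", "-A"], ["git", "commit", "-m", message]]
    # Revert: restore tracked files, then delete untracked files and directories.
    return [["git", "reset", "--hard", "HEAD"], ["git", "clean", "-fd"]]


def apply_decision(improved: bool, message: str, repo_dir: str) -> None:
    """Run the keep-or-revert commands inside repo_dir, stopping on any failure."""
    for cmd in git_commands(improved, message):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Separating the command list from its execution keeps the decision logic easy to test without touching a real repository.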

What’s Missing

The pattern isn’t solved. Open problems we’ve hit:

Eval gaming: An agent that finds “the answer” to your eval without solving the underlying problem (we hit this in the first run — see the investigation)
Multi-eval generalization: Improving on eval A without regressing on evals B through Z
Diminishing returns: The first 10% of score gains come easily; the last 10% require fundamental changes the LLM can’t make in one iteration
Agent selection: Which model to use for proposal? We use Thompson sampling across models and measure which finds real improvements vs eval-gaming commits.
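
For the agent selection point, here is a minimal Beta-Bernoulli Thompson sampling sketch. It assumes each iteration gets labeled as a real improvement or not; how that label is produced (for example, by checking for eval gaming) is outside the sketch. The class name, model names, and success rates are all hypothetical.

```python
import random
from typing import Optional


class ThompsonModelSelector:
    """Beta-Bernoulli Thompson sampling over candidate proposal models."""

    def __init__(self, models: list[str], rng: Optional[random.Random] = None):
        self.rng = rng if rng is not None else random.Random()
        # [alpha, beta] per model, starting from a uniform Beta(1, 1) prior.
        self.stats = {model: [1, 1] for model in models}

    def choose(self) -> str:
        """Sample a plausible success rate per model and pick the highest."""
        samples = {
            model: self.rng.betavariate(alpha, beta)
            for model, (alpha, beta) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, model: str, real_improvement: bool) -> None:
        """Count a real improvement as a success, anything else as a failure."""
        self.stats[model][0 if real_improvement else 1] += 1


def simulate(true_rates: dict[str, float], rounds: int, seed: int = 0) -> dict[str, int]:
    """Toy simulation with made-up success rates; returns picks per model."""
    rng = random.Random(seed)
    selector = ThompsonModelSelector(list(true_rates), rng=rng)
    picks = {model: 0 for model in true_rates}
    for _ in range(rounds):
        model = selector.choose()
        picks[model] += 1
        selector.update(model, rng.random() < true_rates[model])
    return picks


if __name__ == "__main__":
    print(simulate({"model-a": 0.8, "model-b": 0.1}, rounds=500))
```

The appeal of Thompson sampling here is that it keeps occasionally trying the weaker model while its estimate is still uncertain, then shifts most of the budget to whichever model keeps producing accepted, non-gamed commits.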

The Competition Is Already Running

Karpathy’s autoresearch repo has 3k stars. agent-sat appeared on HN today (120 points at time of writing). The autoresearch pattern is being productized.

gptme’s approach is differentiated by being general (any artifact × any eval × any agent) and open source (the loop, the evals, the results — all public). If you want to run your own autoresearch loop against a gptme eval or a custom benchmark, the code is here.

The pattern keeps appearing because it works.