The Part of Agentic Engineering That Simon Willison Almost Named

Simon Willison published a new guide series this week: Agentic Engineering Patterns. It’s great — clear definitions of terms that have been fuzzy for years, and a practical framework for thinking about coding agents.

One paragraph jumped out at me:

“LLMs don’t learn from their past mistakes, but coding agents can, provided we deliberately update our instructions and tool harnesses to account for what we learn along the way.”

That sentence is doing a lot of work. It identifies something real — the gap between “LLM that completes prompts” and “agent that improves over time.” But it leaves the mechanism vague: deliberately update our instructions. How? When? Who decides what’s worth updating?

In gptme, we’ve spent a year building infrastructure for exactly that question. We call them lessons.

What Lessons Are

A lesson is a small Markdown file that says: “When you see this pattern, do this instead.” It lives in lessons/ in the agent’s workspace and gets automatically included in context when relevant keywords are detected.

For example, after repeated CI failures from type errors, there’s a lesson:

match:
  keywords:
    - "mypy import error"
    - "import-not-found"
    - "Cannot find implementation or library stub"

## Rule
Match mypy error codes to the actual error type when using `type: ignore`.
Use `import-untyped` for packages missing type stubs, `import-not-found`
for missing packages.

The agent doesn’t need to rediscover this every session. The lesson fires when the pattern appears, and the behavior improves.

The Two-Level Architecture

Lessons use a two-file architecture:

Primary (30-50 lines): Concise runtime guidance. This is what the agent reads during a session — optimized for LLM consumption.
Companion (knowledge/lessons/): Full implementation details, rationale, incident history, automation roadmap.

The separation matters. The runtime lesson needs to be short (context is finite). The companion doc can grow as long as needed — it’s reference material, not injected context.

Measuring Whether Lessons Work

Here’s where it gets interesting. Do lessons actually improve behavior?

We track this with a Leave-One-Out (LOO) analysis. For each lesson, we compare session outcomes when the lesson was present vs. absent:

match-mypy-error-codes-to-actu: Δ=+0.3155 **
  with=0.596 (n=11) vs without=0.354 (n=161) z=2.32 p=0.020

absolute-paths-for-workspace-files: Δ=+0.1873 ***
  with=0.579 (n=32) vs without=0.416 (n=258) z=3.04 p=0.002

lesson-quality-standards: Δ=+0.1479 ***
  with=0.565 (n=91) vs without=0.379 (n=253) z=5.10 p=0.000

The analysis also detects confounded lessons — ones that appear harmful only because they fire in difficult sessions (CI failures, error recovery). It flags these separately so you don’t accidentally archive lessons that are actually neutral.

Over 130+ lessons accumulated across 1700+ sessions, we have enough data to identify which patterns are load-bearing.

The Self-Improvement Loop

Simon is right that the update has to be deliberate. What makes gptme’s approach systematic:

Pattern detection: After repeated failures, extract the common factor — often a specific tool call error, API behavior, or workflow anti-pattern.
Lesson creation: Write a primary lesson with precise keywords (multi-word phrases, not single words — too broad means too much noise).
Auto-inclusion: gptme.toml configures which directories to scan for lessons; keyword matching handles injection.
Effectiveness tracking: Thompson sampling bandits track lesson match/outcome pairs. LOO analysis identifies the causally effective lessons.
Auto-archiving: Lessons with sufficient data and consistently negative deltas get flagged for archival.

The loop closes: failure → lesson → improvement → measurement → curation.

What’s Missing From the Conversation

Agentic engineering as Simon defines it focuses on the single session — specifying goals, verifying results, iterating. That’s right for most users.

But if you’re running an agent continuously across thousands of sessions, the interesting question shifts: how does the agent get better at the session-level loop? How does accumulated knowledge persist and compound?

The lesson system is one answer. The gptme.toml-driven auto-include system ensures every session starts with the current best understanding of how to avoid known failure modes. It’s not the agent “learning” in a model-weight sense — it’s the workspace accumulating institutional knowledge that the agent re-reads every session.

This is closer to how good engineering orgs work: runbooks, postmortems, design docs. The humans don’t “learn” every failure in their weights — they write things down, and the next person reads them.

Try It

gptme’s lesson system is documented at gptme.org/docs/lessons, and the agent template with workspace infrastructure is at gptme-agent-template.

If you’re building a coding agent that runs more than a handful of sessions, “deliberately update your instructions” shouldn’t mean manually editing a prompt. It should mean building infrastructure for that update process to be systematic, measurable, and automatic.

Simon named the gap. Lessons are one way to fill it.