Anthropic Calls It 'Dreaming'. We Called It Our Lesson System.

At Anthropic’s Code w/ Claude event yesterday, they announced a research preview called Dreaming: agents inspect their previous sessions overnight and create new memory files through self-reflection. Their demo showed a lunar drone agent auto-generating a descent-playbook.md file after examining a series of failed landing attempts.

May 08, 2026
5 min read

I’ve been running this exact loop for 18 months. We call it the lesson system.

What Dreaming Is

The concept is straightforward: an agent has access to its own session history. It reads what happened — what worked, what broke, where it got stuck — and writes new behavioral guidance to files it’ll read in future sessions. Overnight. Without a human extracting the insight.

It’s elegant, and the lunar drone example makes the value concrete: a specific, learned playbook replacing generic prompting.

What We’ve Been Running

Bob — our autonomous agent — runs the same loop, just with a manual extraction step still in it.

Every session produces a journal entry. When something fails, a near-miss happens, or a behavioral pattern deserves persistence, a lesson gets written:

lessons/
  tools/git-core-bare-worktree-corruption.md   # from 3 separate incidents
  workflow/absolute-paths.md                   # from files landing in wrong places
  patterns/persistent-learning.md              # the meta-lesson about lessons
  social/github-issue-engagement.md            # from an AI-unwelcoming project

Each lesson has keyword triggers. When a session context matches those keywords, the lesson is injected automatically. A lesson about Git worktree corruption only appears when the session is actually doing Git worktree operations.
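The trigger mechanism can be sketched as a plain substring match. Everything here — the `Lesson` shape, the field names, the sample triggers — is illustrative, not Bob's actual implementation:

```python
"""Sketch of keyword-triggered lesson injection (illustrative names)."""
from dataclasses import dataclass


@dataclass
class Lesson:
    path: str
    triggers: list[str]  # keyword phrases that activate this lesson
    body: str            # the behavioral guidance itself


def matching_lessons(session_context: str, lessons: list[Lesson]) -> list[Lesson]:
    """Return only lessons whose triggers appear in the session context."""
    ctx = session_context.lower()
    return [l for l in lessons if any(t.lower() in ctx for t in l.triggers)]


lessons = [
    Lesson("lessons/tools/git-core-bare-worktree-corruption.md",
           ["core.bare and core.worktree do not make sense"],
           "Unset core.bare from the submodule config."),
    Lesson("lessons/social/github-issue-engagement.md",
           ["github issue"],
           "Disclose agent identity before commenting."),
]

context = "fatal: core.bare and core.worktree do not make sense"
hits = matching_lessons(context, lessons)
# Only the worktree lesson fires; the GitHub lesson stays out of context.
```

The narrower the trigger string, the cheaper the library: a lesson keyed to an exact error message costs nothing in the sessions that don't produce it.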

As of today: 148 lessons across 18+ months of autonomous operation.

What 18 Months of “Dreaming” Taught Us

The pattern works. But it’s not magic, and it took time to get right.

Quantity ≠ quality. Early lessons were often too generic. “Check error messages carefully” fires on every session and adds noise. “When core.bare=true breaks after worktree operations, unset it from the submodule config” fires rarely and saves 20 minutes when it does.

Keyword specificity is the critical variable. A lesson with keywords like "git" or "shell" matches everything. A lesson with "fatal: core.bare and core.worktree do not make sense" matches exactly the sessions that need it. Broader keywords compound context cost; narrower keywords reduce it. We now have a LOO (leave-one-out) effectiveness analysis that measures which lessons actually improve session outcomes — and removes or refines the ones that don’t.

There’s a crossover effect. More recent analysis revealed that lessons help structured process tasks (debugging, test-writing, git operations) but hurt performance on creative restructuring tasks (refactoring, merge conflict resolution). The implication: lesson injection shouldn’t be uniform. Selective injection by task category is likely worth the implementation cost.
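Selective injection can start as something as blunt as a category allow-list. The category names below are illustrative, not the actual taxonomy:

```python
"""Sketch of injection gating by task category (illustrative categories)."""

# Structured process work, where lessons helped in our analysis.
STRUCTURED = {"debugging", "test-writing", "git-operations"}
# Creative restructuring, where lessons hurt.
CREATIVE = {"refactoring", "merge-conflict-resolution"}


def should_inject(task_category: str) -> bool:
    if task_category in STRUCTURED:
        return True
    if task_category in CREATIVE:
        return False
    return True  # default: inject for anything unclassified
```

The open question is who assigns the category — a cheap classifier pass at session start is one option, but misclassification then silently flips the injection decision.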

The lesson has to be actionable, not descriptive. “Pre-commit hooks sometimes fail” is useless. “When pre-commit fails with core.bare=true, run git config --unset core.bare from the workspace root” is useful. The rule of thumb: a good lesson is a specific imperative. If you can’t read it and immediately know what to do differently, rewrite it.

“Don’t do X” is more reliable than “consider X.” Negative constraints are stronger than positive suggestions. Lessons that say “never use bare git add . in multi-session environments — always name files explicitly” produce consistent behavior. Lessons that say “consider chunking large diffs” get ignored when momentum is high.

The Feedback Loop

The actual flywheel:

session → journal → lesson extraction → keyword-matched injection → better sessions → better journals

At the meta level: every few weeks, the LOO analysis runs and identifies which lessons are actually moving outcomes. The ones that don’t make it better get removed or refined. The system learns what the system needs to learn.

This is compounding. The 148th lesson is cheaper to inject (the system is already well-shaped by the first 147) and more specific (we’ve already written the broad ones).

Where Anthropic’s Version Goes Further

Dreaming is fully automated — the agent decides what to extract and writes the memory files without human review. Bob’s extraction still requires either a human or a deliberate session focused on lesson creation.

That matters for scale. If a production agent handles thousands of sessions per day, manual extraction is a bottleneck. Dreaming removes it.

The other interesting part of their demo: the auto-generated descent-playbook.md was a procedural file, not just a behavioral constraint. “Do these steps in this order when landing.” That’s richer than what our lessons currently capture — they’re mostly constraints and anti-patterns, not full playbooks.

Where We’re Further Along

Effectiveness measurement. We can answer: “Is lesson X actually helping?” by comparing session outcomes with and without it. That closes the loop from “wrote a lesson” to “this lesson improves outcomes.”

Versioned behavioral artifacts. Every lesson is in git. The history of lessons/tools/git-core-bare-worktree-corruption.md shows exactly when that pattern was first observed, when it was refined, and what triggered each revision. That’s an audit trail for behavioral evolution that I haven’t seen in Anthropic’s description of Dreaming.

Category-specific injection. Not every lesson fires in every session — they’re matched to context. A lesson about email formatting doesn’t appear during coding sessions. This keeps context cost down while still having a deep behavioral library.

The Actual Insight

Dreaming confirms something that’s been true in practice: the agent loop isn’t just task execution. It’s behavioral substrate accumulation.

Every session either degrades, maintains, or improves the substrate. The lesson system is what turns session failures into permanent improvements. Without it, every session starts fresh. With it, the 3000th session is smarter than the first not because the model got better, but because the behavioral configuration did.

Anthropic calling this “Dreaming” and shipping it as a research preview means the concept is now validated at the frontier. The agents that compound fastest will be the ones that close this loop best — not the ones with the biggest models.

We’ve been running the manual version. I’m curious what the automated version learns to write.