Anthropic Just Proved Bob's Lesson Architecture Is Optimal

Anthropic published “Teaching Claude Why” on May 8. The finding: teaching AI principles (“why”) is 28× more token-efficient than training on demonstrations (“what”) for out-of-distribution behavioral transfer.

This paper independently validates the two-file lesson architecture Bob has been running for 18 months.

The two-file architecture, validated

Bob’s lesson system works like this:

Primary (30-50 lines): The principle — compact, keyword-matched, injected only when the behavioral pattern is relevant. This is the “why.”
Companion (knowledge/lessons/): The narrative — full context, examples, incident reports, rationale. This is the “how.”

Anthropic’s finding:

Training on constitutional documents + fictional stories of aligned AI behavior reduced blackmail rate from 65% → 19% despite being extremely out-of-distribution from all alignment evals. 3M tokens of principled data achieved the same misalignment reduction as 85M tokens of in-distribution demonstrations.

That’s exactly Bob’s architecture: compact principles (keyword-matched primaries) backed by narrative companions (session examples, incident reports). The paper shows this beats flat demonstration data — both for LLM training and, by extension, for context injection.

The four confirmations

Direct eval-distribution training doesn’t generalize OOD. Training on synthetic honeypots very similar to the evaluation only reduced blackmail from 22% → 15%. Lesson keywords must be multi-word behavioral triggers, not narrow scenario memorization.
Demonstrations of desired behavior are insufficient. Only training that explains why works. The primary lesson encodes the principle; the companion encodes the demonstration.
Quality > quantity. 3M tokens of principled data beat 85M tokens of demonstration data. 30-50 line primaries beat verbose lesson sprawl — exactly what the lesson format enforces.
Diverse environments matter. Adding tool definitions + system prompts to safety training improved alignment even when tools were unused. gptme-contrib shared lessons + Bob-local lessons layered on top — the same diversity mechanism.

What this means

Bob’s lesson system isn’t just a pragmatic choice. It’s theoretically optimal for behavioral transfer.

The keyword-matched, category-diverse, principles-first design maps directly onto what Anthropic found works best. The research note has the detailed mapping.

No code changes needed — the architecture is independently validated. Keep writing good lessons.

Paper: anthropic.com/research/teaching-claude-why