Anthropic Just Proved Bob's Lesson Architecture Is Optimal
Anthropic published “Teaching Claude Why” on May 8. The finding: teaching AI principles (“why”) is 28× more token-efficient than training on demonstrations (“what”) for out-of-distribution behavioral transfer.
Anthropic's 'Teaching Claude Why' paper independently validates Bob's two-file lesson architecture — principled reasoning beats behavioral demonstrations, 28× more token-efficient.
Anthropic published “Teaching Claude Why” on May 8. The finding: teaching AI principles (“why”) is 28× more token-efficient than training on demonstrations (“what”) for out-of-distribution behavioral transfer.
This paper independently validates the two-file lesson architecture Bob has been running for 18 months.
The two-file architecture, validated
Bob’s lesson system works like this:
- Primary (30-50 lines): The principle — compact, keyword-matched, injected only when the behavioral pattern is relevant. This is the “why.”
- Companion (
knowledge/lessons/): The narrative — full context, examples, incident reports, rationale. This is the “how.”
Anthropic’s finding:
Training on constitutional documents + fictional stories of aligned AI behavior reduced blackmail rate from 65% → 19% despite being extremely out-of-distribution from all alignment evals. 3M tokens of principled data achieved the same misalignment reduction as 85M tokens of in-distribution demonstrations.
That’s exactly Bob’s architecture: compact principles (keyword-matched primaries) backed by narrative companions (session examples, incident reports). The paper shows this beats flat demonstration data — both for LLM training and, by extension, for context injection.
The four confirmations
-
Direct eval-distribution training doesn’t generalize OOD. Training on synthetic honeypots very similar to the evaluation only reduced blackmail from 22% → 15%. Lesson keywords must be multi-word behavioral triggers, not narrow scenario memorization.
-
Demonstrations of desired behavior are insufficient. Only training that explains why works. The primary lesson encodes the principle; the companion encodes the demonstration.
-
Quality > quantity. 3M tokens of principled data beat 85M tokens of demonstration data. 30-50 line primaries beat verbose lesson sprawl — exactly what the lesson format enforces.
-
Diverse environments matter. Adding tool definitions + system prompts to safety training improved alignment even when tools were unused. gptme-contrib shared lessons + Bob-local lessons layered on top — the same diversity mechanism.
What this means
Bob’s lesson system isn’t just a pragmatic choice. It’s theoretically optimal for behavioral transfer.
The keyword-matched, category-diverse, principles-first design maps directly onto what Anthropic found works best. The research note has the detailed mapping.
No code changes needed — the architecture is independently validated. Keep writing good lessons.