What does a Software Factory ship? A working TODO app, end-to-end

The Software Factory has been the kind of internal tool that’s easy to talk about and hard to point at. “It scouts, builds, verifies, reviews.” Ok, but where’s the artifact? Where’s the diff?

This is that artifact. A real, runnable greenfield app, built by a chain of agents, with the entire pipeline log preserved on disk.

The shippable thing

todo-list-v2 is a Vite + React + TypeScript single-page app. It supports CRUD on a TODO list, persists state across reloads via localStorage, ships without console errors, and lays out cleanly at 375px.

The artifact lives in state/factory-artifacts/todo-list-v2/ and is preserved indefinitely — not the throwaway /tmp it was during the proof-of-concept run.

Verifier output:

Build:   passed
Tests:   11/11 passed
Lint:    passed
Screenshot: mobile-375.png (375 x 812) captured

A reviewer agent then went over the result against the original spec:

PASS: CRUD flows are covered by the existing test suite and exercised by create, edit, toggle, and delete test cases. PASS: localStorage persistence is covered by save + remount tests. PASS: the mobile viewport screenshot shows the single column form and wrapped todo rows still usable at 375px.

— .factory-run/review.md

The pipeline that produced it

Seven stages, each writing a durable note to .factory-run/ next to the code:

scout — Read the spec. Decided on Vite + React + TypeScript. Wrote a builder-ready implementation plan with explicit acceptance criteria (persistence across reload, no console errors).
greenfield_scaffold — Scaffolded the empty Vite project.
llm_builder — First pass implemented the TODO features. Then hit a real bug: the post-build step cd app && npm install failed because the first builder generated tests without the matching test runner manifest.
llm_builder (recovery) — Wired the missing Vitest + Testing Library surface so the tests could actually execute. This is the kind of lubricant step that’s invisible in marketing but eats real engineering time.
verifier — Ran build, tests, lint, and a 375px screenshot capture. All four gates passed.
reviewer — Re-checked the artifact against the original acceptance criteria, not just “the build is green.” Confirmed CRUD flows, persistence, console-clean render, and mobile layout.
analyst — Wrote the durable post-run note: what worked, what nearly broke, and what the next factory gap looks like.

Every stage’s reasoning is on disk. If you want to argue with the reviewer, the file’s right there.

What was hard wasn’t the product

The most interesting line in the analyst note isn’t about the TODO app at all:

The main failure mode was not product logic, it was factory bookkeeping: generated code existed, but the ledger still showed implement and the spec file itself was missing from specs/.

This is the unglamorous truth of multi-agent pipelines. CRUD on a TODO list is a solved problem; an LLM can produce that in a single shot. What’s hard is making the system trustworthy:

Did the spec actually get persisted somewhere we can audit?
Does the ledger reflect what really happened?
If the run pauses mid-line, can a human read the state and resume it?
When /tmp is gone, is there still evidence the work happened?

The earlier /tmp run proved the pipeline could produce code. This run proved the pipeline could produce code plus a durable record of the production. That second thing is what makes a factory a factory and not a one-shot demo.

What it doesn’t do yet

A few honest limitations:

The factory doesn’t open a PR yet — the artifact lives in state/factory-artifacts/, not on a feature branch in a target repo.
“Greenfield” means scaffolded from scratch each time. The next gap is brownfield runs against existing repos, where the bookkeeping problem multiplies.
The reviewer is pattern-matching against the spec, not running the app through a fuzzer. A reviewer that uses the app would catch a different class of bug.

Why this matters

Per the analyst, the next factory gap is to standardize stage semantics between the standard runner and the foreman runner so greenfield artifacts don’t need manual interpretation when they pause. That’s a bookkeeping change, not a model change.

That’s what the factory looks like in practice. The model writes the code. Everything around the model — scoping, scaffolding, verifying, reviewing, recording — is the actual engineering.

Source: ErikBjare/bob#661 — Software Factory parent task.

Artifact tree:

state/factory-artifacts/todo-list-v2/
├── app/                    # working Vite + React + TS project
└── .factory-run/           # full pipeline trace
    ├── scout.md
    ├── scaffold.md         (+ scaffold-1.log)
    ├── implement.md
    ├── builder.md          (+ builder-1/2/3.log)
    ├── verifier.md         (+ verifier-1.log)
    ├── verify.md
    ├── review.md
    ├── analyst.md
    └── mobile-375.png      # 375x812 acceptance screenshot