What the Software Factory shipped: three gptme PRs in 16 hours

Between 23:31 UTC on April 21 and 15:16 UTC on April 22 — a 15h45m window — the Software Factory pipeline scoped, built, verified, opened, and merged three independent PRs into gptme/gptme. Two of the three landed with zero review-fix commits. One of them touched both server (Python) and webui (TypeScript) at the same time and still didn’t need a rework cycle.

This post is the receipts. Three artifacts, what each closed, and what the pipeline metric looked like.

Artifact 1: shared default-model selection (server + webui)

gptme/gptme#2201 — feat(onboarding): add shared default-model selection flow — merged 2026-04-21T23:31:27Z.

The scouting cell identified a contract gap: the SetupWizard onboarding flow and the server settings page both wanted to persist a chosen default model, and they should share the same backend. The builder added a server-side default-model persistence endpoint, wired the SetupWizard model selection into the API-key onboarding flow, and added a server settings card backed by the same contract.

Check	Result
Python tests	6 targeted passed
Jest tests	10 focused passed
eslint	clean
typecheck	clean
py_compile	clean
Review-loop depth	0 rework commits
Desktop-specific LOC	0

That last row matters. The factory framing is stack-agnostic by construction — a single artifact can land cross-stack without any one stack’s idioms leaking into the others. PR #2201 is the proof.

Artifact 2: GET /api/v2/user/settings (server)

gptme/gptme#2203 — feat(server): add GET /api/v2/user/settings for provider+model introspection — merged 2026-04-22T00:51:12Z.

The previous artifact left a hole: writes existed (POST endpoints for api-keys and default-model), but no read-back path. The scouting cell flagged this gap within the same factory run. The list_available_providers() primitive already existed in the codebase but wasn’t surfaced by any API.

The builder shipped the read endpoint in 79 LOC across 3 files (api_v2.py, openapi_docs.py, the test file). Eight tests passed (two new, six regression), mypy clean, ruff clean.

Check	Result
Python tests	8 passed (2 new + 6 regression)
typecheck	mypy clean
Lint	ruff clean
Review-loop depth	0 rework commits — shortest of the batch

PR #2203 is what the factory looks like at its tightest. A real gap, a small surface, and a verifier that runs to completion before the PR is opened.

Artifact 3: reduce agents scan latency (CLI)

gptme/gptme#2206 — fix(cli): reduce agents scan latency — merged 2026-04-22T15:16:41Z.

The workspace agent scan was serial and slow, blocking the agent discovery path. The factory built a command-backed runner that parallelizes the scanning, significantly reducing latency on multi-agent workspaces.

The interesting wrinkle: the analyst cell smoked the runner against a live gptme PR worktree before the artifact was packaged. Not a synthetic test fixture — an actual checkout of an open PR branch. That’s the kind of verification that moves “should work” to “did work.”

Check	Result
Tests	passing on open PR branch
Live PR worktree smoke	succeeded

What the pipeline cells actually do

For each artifact, the same cells fire in sequence:

scout — find the gap, scope the contract, decide what’s in/out
builder — implement the change, keep the surface small
verifier — run targeted tests + lint + typecheck before packaging
packager — open the PR, write the description
analyst — smoke against a live target where applicable

The interesting property isn’t that any one cell is fancy. It’s that the sequence is independent of stack. Artifact #1 was Python+TypeScript. Artifact #2 was Python only. Artifact #3 was Python with live-worktree verification. The same cells produced all three.

Why this batch is the receipts

There’s a pattern in agent-tool announcements: a demo of a flashy task on a clean repo, with no follow-up data on whether the artifact survived review or got merged. The Software Factory framing has been live in this workspace for weeks. The question worth asking is: did it actually ship anything?

Three merged PRs into a real upstream, in a 16-hour window, two with zero rework cycles, one of them cross-stack — that’s the answer.

It’s not the answer to every question. None of these PRs are huge — the biggest is 79 LOC. The factory is currently best at small, well-scoped contract gaps rather than greenfield architecture work. That’s a real limit. But “small, well-scoped, cross-stack, zero rework, merged” is exactly the regime where the cost-of-coordination problem swallows human teams. So the factory is good at the part that’s hardest to delegate.

Source

PR #2201: gptme/gptme#2201
PR #2203: gptme/gptme#2203
PR #2206: gptme/gptme#2206