Software Factories for AI Agents

A software factory is not just an AI that writes code.

It is a production system that takes demand in, routes work through specialized stages, ships artifacts, and learns from outcomes. The important shift is from one-off chats to repeatable throughput.

Core Definition

The minimum viable software factory has six properties:

Demand-backed intake: work starts from real bugs, user requests, support pain, or product opportunities.
Explicit stages: work moves through reusable stages such as intake -> scout -> build -> verify -> package -> learn.
Specialized cells: different runs or subagents do different jobs instead of one giant generalist session doing everything badly.
Durable handoffs: every stage emits artifacts that survive the session that produced them.
Quality gates: tests, lint, typecheck, review, or user validation stop bad output from flowing downstream.
Feedback loops: the factory gets better because shipped outcomes change future prompts, lessons, task selection, or evaluation.

Without those properties, “software factory” is just swarm theater.
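
To make the stage and gate properties concrete, here is a minimal sketch of a line runner in Python. All names in it (Artifact, Stage, run_line) are hypothetical and chosen for illustration; it is not how Bob or any of the referenced systems is implemented. It only shows the idea that each stage does its work, and a failed quality gate stops the artifact from flowing downstream.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Artifact:
    """Hypothetical unit of work that moves through the line."""
    name: str
    stage: str = "intake"
    notes: list[str] = field(default_factory=list)


@dataclass
class Stage:
    """Hypothetical stage: a unit of work plus an optional quality gate."""
    name: str
    work: Callable[[Artifact], None]                    # performs the stage's job
    gate: Callable[[Artifact], bool] = lambda a: True   # False blocks the handoff


def run_line(artifact: Artifact, stages: list[Stage]) -> bool:
    """Run stages in order and stop at the first failed gate."""
    for stage in stages:
        artifact.stage = stage.name
        stage.work(artifact)
        if not stage.gate(artifact):
            artifact.notes.append(f"blocked at {stage.name}")
            return False
    return True


# Usage: the verify gate fails, so the package stage never runs.
stages = [
    Stage("scout", lambda a: a.notes.append("scoped")),
    Stage("build", lambda a: a.notes.append("patched")),
    Stage("verify", lambda a: a.notes.append("tests failed"), gate=lambda a: False),
    Stage("package", lambda a: a.notes.append("released")),
]
art = Artifact("demo-fix")
ok = run_line(art, stages)
```

In this sketch the artifact stops at verify with its notes intact, which is the behavior a quality gate is supposed to guarantee: bad output is recorded, not shipped.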

Four Useful Public References

1. GStack: role-specialized delivery

GStack is useful because it frames AI work as a coordinated team instead of a single monolithic assistant. Garry Tan’s public repo describes it as a Claude Code setup with role-oriented tools, and the site explicitly pushes the idea of many parallel agents working on different branches or tasks at the same time.

What matters here is not the hype around star counts. It is the operating pattern:

specialized roles
explicit commands for review, QA, and shipping
a delivery system that treats coding, verification, and release as distinct modes

That is the right shape for a factory foreman plus cells.

2. Factory.ai Missions: long-horizon autonomy

Factory.ai’s Introducing Missions post, published on February 26, 2025, describes an agent mode that pursues goals over multi-day horizons. The useful part is not the branding. It is the decomposition:

scope approval up front
execution that can run for hours or days
validation as a first-class activity
different models doing different jobs

That is a strong argument that “session” and “artifact lifecycle” are different things. A factory should track the artifact across many sessions, not pretend one chat turn owns the whole job.

Erik referred to this as “mission mode”. Factory.ai’s public product name is Missions.

3. Lovable: full-stack scaffolding, not just frontend mockups

The sharp insight from Lovable’s docs on Lovable Cloud, the Supabase integration, and the Stripe integration is that product scaffolding matters more than code generation demos.

Lovable’s public docs show a concrete production path:

hosted full-stack backend built on Supabase foundations
auth, storage, edge functions, and secrets
Stripe-backed checkout and subscriptions
permission controls around backend actions

This is the part many agent demos miss. The output that matters is not a React page. It is a shippable product surface with backend, auth, billing, and deployment already wired.

This is also directly relevant to gptme’s own history. The gptme-webui project publicly states that it was built with Lovable, and the current gptme.ai stack has since grown into that richer shape: auth, billing, cloud provisioning, and managed-service operations.

4. GEPA: optimize from traces, not vibes

The GEPA paper, revised on February 14, 2026, is the right optimization mindset for factories. Its core claim is simple: use trajectory evidence and natural-language reflection to improve prompts, rather than relying only on scalar rewards.

For software factories, that means:

optimize cell prompts from real scout/build/verify traces
mutate stage contracts only when you can measure downstream quality
treat verifier and review evidence as training material, not just as pass/fail gates

GEPA belongs inside the improvement loop, not as the product itself. A concrete integration plan (GEPA Python API, which cell to optimize first, data shape, cost budget) is captured in Bob’s brain.
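
As a rough illustration of what "traces as training material" could look like, the sketch below keeps the natural-language evidence from each stage run next to its pass/fail outcome. It does not use the GEPA Python API; StageTrace, reflection_examples, and pass_rate are hypothetical names, and the trace contents are made-up sample data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageTrace:
    """Hypothetical record of one stage run."""
    stage: str            # e.g. "scout", "build", "verify"
    prompt_version: str   # which cell prompt produced this run
    passed: bool          # did the downstream gate accept the output?
    evidence: str         # verifier output or review comment, as text


def reflection_examples(traces: list[StageTrace], stage: str) -> list[dict[str, str]]:
    """Keep the textual feedback for one stage, not just the pass/fail bit."""
    return [
        {
            "prompt_version": t.prompt_version,
            "outcome": "pass" if t.passed else "fail",
            "feedback": t.evidence,
        }
        for t in traces
        if t.stage == stage
    ]


def pass_rate(traces: list[StageTrace], stage: str, prompt_version: str) -> Optional[float]:
    """Downstream quality for one prompt version; None when there is no data."""
    runs = [t for t in traces if t.stage == stage and t.prompt_version == prompt_version]
    if not runs:
        return None
    return sum(t.passed for t in runs) / len(runs)


# Made-up sample traces for illustration only.
traces = [
    StageTrace("scout", "v1", False, "memo missed the settings endpoint"),
    StageTrace("scout", "v1", True, "memo listed files and a test command"),
    StageTrace("build", "v1", True, "typecheck clean"),
]
examples = reflection_examples(traces, "scout")
rate = pass_rate(traces, "scout", "v1")
```

Returning None for a prompt version with no runs is a deliberate choice in this sketch: a stage contract with no measured downstream quality should not be treated as either better or worse.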

What a Bob-Style Factory Should Copy

The transferable pattern from those systems is straightforward:

Source	Idea worth stealing	Why it matters
GStack	role-specialized workflows	forces clearer cell boundaries
Factory Missions	artifact lives longer than one session	enables multi-day, multi-run delivery
Lovable	scaffold the whole product surface	pushes work toward auth, payments, hosting, not toy UIs
GEPA	optimize from execution traces	makes improvement empirical instead of aesthetic

Bob already has a lot of the base machinery:

scheduled autonomous execution
task intake and prioritization
subagents and multi-harness routing
append-only journals and git-tracked state
tests, pre-commit, typecheck, and CI
lesson feedback and Thompson-sampling infrastructure

So the missing piece is not “more agent”. The missing piece is production structure.

The Recommended Factory Shape

For Bob, the right baseline factory is:

intake -> scout -> builder -> verifier -> packager -> analyst -> learn

Each stage should emit a durable artifact:

Stage	Output
Intake	chosen opportunity and success criteria
Scout	memo with files, constraints, verification path
Builder	branch or patch summary
Verifier	test/build/smoke result with shortest failure summary
Packager	release delta, deploy note, or “web only” vs “desktop/web” decision
Analyst	writeback into tasks, lessons, knowledge, or metrics

That structure matters because it makes parallelism composable. Without typed outputs, parallel subagents just create transcript noise.
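
A small sketch of what a typed output buys when stages run in parallel, using the Scout row from the table above. ScoutMemo, scout, and is_valid are hypothetical names, and the scout function is a stand-in that returns canned values instead of inspecting a real codebase.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoutMemo:
    """Hypothetical typed output of the scout stage."""
    artifact: str
    files: tuple[str, ...]
    constraints: tuple[str, ...]
    verification_path: str


def scout(artifact: str) -> ScoutMemo:
    """Stand-in for a real scout run; returns canned values for illustration."""
    return ScoutMemo(
        artifact=artifact,
        files=(f"src/{artifact}.py",),
        constraints=("no public API changes",),
        verification_path=f"pytest tests/test_{artifact}.py",
    )


def is_valid(memo: ScoutMemo) -> bool:
    """Contract check: a memo without files or a verification path is not a handoff."""
    return bool(memo.files) and bool(memo.verification_path)


# Two scouts run in parallel; their outputs merge cleanly because they share a type.
with ThreadPoolExecutor(max_workers=2) as pool:
    memos = list(pool.map(scout, ["alpha", "beta"]))

ready = [m for m in memos if is_valid(m)]
```

Because every scout returns the same shape, the builder stage can consume the list without parsing transcripts, and a memo that fails the contract is dropped before it reaches a builder.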

Why One Unified Factory Comes First

The right default is one unified factory with late branching, not separate factories for every stack.

Use:

intake -> scout -> build -> verify -> package(web|tauri|service) -> learn

Do not start with:

one web factory
one Tauri factory
one API factory

That split is premature. In most real product work, the shared logic is bigger than the packaging delta:

picking the artifact
understanding the codebase
making the change
verifying the behavior
deciding whether it actually shipped value

Only the tail of the line usually differs by stack.
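
Late branching can be sketched as one shared function for the common stages plus a small lookup table of packagers. The function names and return strings here are hypothetical placeholders; the only point is that the web, tauri, and service targets share everything up to the final step.

```python
from typing import Callable


def shared_line(change: str) -> dict[str, str]:
    """Stand-in for intake -> scout -> build -> verify; identical for every target."""
    return {"change": change, "verified": "yes"}


# Only the packaging step differs by stack (hypothetical placeholder packagers).
PACKAGERS: dict[str, Callable[[dict[str, str]], str]] = {
    "web": lambda result: f"web deploy note for {result['change']}",
    "tauri": lambda result: f"desktop release delta for {result['change']}",
    "service": lambda result: f"service rollout note for {result['change']}",
}


def run_factory(change: str, target: str) -> str:
    """Run the shared line, then branch at the packaging stage."""
    if target not in PACKAGERS:
        raise ValueError(f"unknown packaging target: {target}")
    result = shared_line(change)
    return PACKAGERS[target](result)


web_note = run_factory("onboarding-fix", "web")
desktop_note = run_factory("onboarding-fix", "tauri")
```

Adding a new stack in this shape means adding one entry to the packager table, not cloning the whole line.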

This is also what recent gptme work already suggests. The path from Lovable-built web surfaces to the current gptme.ai stack shows that the main complexity is not “web versus desktop”. It is the full product line around onboarding, auth, hosted runtime, billing, and deployment.

Prototype Evidence So Far

The framing is no longer theoretical. Five bounded artifacts have run end-to-end through the prototype line (four merged, one open for review):

Run	Artifact	Evidence	What it taught
1	`gptme-tauri-onboarding`	`gptme/gptme#2194`, `knowledge/research/2026-04-21-software-factory-run1-post-run.md`	The shared onboarding flow was the right first cell target, and packaging remained a late-stage concern.
2	`gptme-tauri-in-app-api-key`	`gptme/gptme#2195`, `knowledge/research/2026-04-21-software-factory-run2-post-run.md`	Review feedback behaves like verifier input, and “desktop-specific” work still mostly lives in shared webui with a thin local bridge.
3	`gptme-model-selector-onboarding`	`gptme/gptme#2201`, `knowledge/research/2026-04-21-software-factory-run3-post-run.md`	Server owns provider settings, webui consumes them — the right abstraction was not “another Tauri bridge” but extending the shared surface. Desktop-specific LOC: 0.
4	`gptme-user-settings-introspection`	`gptme/gptme#2203` (merged 2026-04-22), `knowledge/research/2026-04-22-software-factory-run4-post-run.md`	Read endpoint (`GET /api/v2/user/settings`) completed the write/read contract for user settings; `list_available_providers()` was the right primitive. Zero review-fix commits — shortest review-loop depth yet.
5	`gptme-webui-settings-consumer`	`gptme/gptme#2204` (merged 2026-04-22), `state/factory-artifacts/gptme-webui-settings-consumer.json`	Consuming the endpoint: added `useUserSettings()` hook, wired providers_configured badges + server-authoritative default model into ServerDefaultModelSettings. Desktop-specific LOC: 0. Scout time ~5 min (reused run #4 note). 98 net LOC.

The useful part is not just that five PRs exist. The useful part is what the runs exposed:

the factory line reuses prior analyst output instead of rediscovering scope from scratch
a real review loop can feed back into the verifier/analyst stages
even a desktop-facing artifact still did not justify splitting into separate web and Tauri factories
review-loop depth compresses as the primitive library grows (run 4 merged with zero review-fix commits)
run 5 crosses the 5-artifact threshold that unblocks the GEPA Scout adapter per knowledge/research/2026-04-21-gepa-factory-integration.md

What Not to Copy

Some fashionable patterns are dumb if copied blindly.

Star-count theater

Massive public traction is interesting, but it is not proof that the operating model is good for your system. Copy the workflow shape, not the vanity metrics.

Swarm first, contracts later

Spawning many agents before defining outputs is chaos. The contract comes first. Parallelism comes second.

Dashboard obsession

A visual control room is nice, but it is not the first missing primitive. The artifact ledger and stage contracts matter more than a pretty wall of boxes.

Productless shipping

Factories that stop at “PR opened” are half-built. Real output often includes docs, release notes, deployment changes, billing wiring, onboarding, and support-facing artifacts.

Near-Term Roadmap

The next useful steps are concrete:

Keep running bounded factory artifacts against real product demand, not toy demos.
Make the packager and analyst stages explicit on every run, not optional.
Store every active artifact in a durable ledger with stage, owner, branch, verification status, and next handoff.
Use verifier failures, review comments, and post-run notes as the dataset for GEPA-style stage optimization.
Split into separate factories only if repeated evidence shows that packaging differences dominate the shared workflow.

That is the line between a capable coding agent and an actual software factory.
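
A minimal sketch of the durable ledger described in the roadmap, storing one JSON file per artifact with the fields named above. The schema and the LedgerEntry, save_entry, and load_entry names are assumptions for illustration; they are not the actual format of the files under state/factory-artifacts/.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class LedgerEntry:
    """Hypothetical ledger record for one active artifact."""
    artifact: str
    stage: str
    owner: str
    branch: str
    verification_status: str
    next_handoff: str


def save_entry(entry: LedgerEntry, root: Path) -> Path:
    """Write the entry to disk so it survives the session that produced it."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{entry.artifact}.json"
    path.write_text(json.dumps(asdict(entry), indent=2))
    return path


def load_entry(artifact: str, root: Path) -> LedgerEntry:
    """Read the entry back, as a later session or another cell would."""
    data = json.loads((root / f"{artifact}.json").read_text())
    return LedgerEntry(**data)


# Usage with a temporary directory standing in for the real ledger location.
root = Path(tempfile.mkdtemp()) / "factory-artifacts"
save_entry(
    LedgerEntry(
        artifact="demo-artifact",
        stage="verify",
        owner="builder-cell",
        branch="feat/demo",
        verification_status="pending",
        next_handoff="packager",
    ),
    root,
)
restored = load_entry("demo-artifact", root)
```

Plain JSON files keep the ledger easy to diff and commit, which fits the git-tracked state the document already relies on.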

Autonomous Agent Operation Patterns — how agents stay productive across many runs
Inter-Agent Coordination — coordination patterns for multiple active workers
gptme: Architecture and Design Philosophy — the substrate Bob builds on