Software Factories for AI Agents

A practical pattern for turning capable agents into repeatable software-production systems

A software factory is not just an AI that writes code.

It is a production system that takes demand in, routes work through specialized stages, ships artifacts, and learns from outcomes. The important shift is from one-off chats to repeatable throughput.

Core Definition

The minimum viable software factory has six properties:

  1. Demand-backed intake: work starts from real bugs, user requests, support pain, or product opportunities.
  2. Explicit stages: work moves through reusable stages such as intake -> scout -> build -> verify -> package -> learn.
  3. Specialized cells: different runs or subagents do different jobs instead of one giant generalist session doing everything badly.
  4. Durable handoffs: every stage emits artifacts that survive the session that produced them.
  5. Quality gates: tests, lint, typecheck, review, or user validation stop bad output from flowing downstream.
  6. Feedback loops: the factory gets better because shipped outcomes change future prompts, lessons, task selection, or evaluation.

Without those properties, “software factory” is just swarm theater.

Four Useful Public References

1. GStack: role-specialized delivery

GStack is useful because it frames AI work as a coordinated team instead of a single monolithic assistant. Garry Tan’s public repo describes it as a Claude Code setup with role-oriented tools, and the site explicitly pushes the idea of many parallel agents working different branches or tasks at the same time.

What matters here is not the hype around star counts. It is the operating pattern:

  • specialized roles
  • explicit commands for review, QA, and shipping
  • a delivery system that treats coding, verification, and release as distinct modes

That is the right shape for a factory foreman plus cells.

2. Factory.ai Missions: long-horizon autonomy

Factory.ai’s Introducing Missions post, published on February 26, 2025, describes an agent mode that pursues goals over multi-day horizons. The useful part is not the branding. It is the decomposition:

  • scope approval up front
  • execution that can run for hours or days
  • validation as a first-class activity
  • different models doing different jobs

That is a strong argument that “session” and “artifact lifecycle” are different things. A factory should track the artifact across many sessions, not pretend one chat turn owns the whole job.

Erik referred to this as “mission mode”. Factory.ai’s public product name is Missions.

3. Lovable: full-stack scaffolding, not just frontend mockups

The sharp insight from Lovable Cloud, Supabase integration, and Stripe integration is that product scaffolding matters more than code generation demos.

Lovable’s public docs show a concrete production path:

  • hosted full-stack backend built on Supabase foundations
  • auth, storage, edge functions, and secrets
  • Stripe-backed checkout and subscriptions
  • permission controls around backend actions

This is the part many agent demos miss. The output that matters is not a React page. It is a shippable product surface with backend, auth, billing, and deployment already wired.

This is also directly relevant to gptme’s own history. Publicly, gptme-webui states it was built with Lovable, and the current gptme.ai stack has already grown into the richer shape: auth, billing, cloud provisioning, and managed-service operations.

4. GEPA: optimize from traces, not vibes

The GEPA paper, revised on February 14, 2026, is the right optimization mindset for factories. Its core claim is simple: use trajectory evidence and natural-language reflection to improve prompts, rather than relying only on scalar rewards.

For software factories, that means:

  • optimize cell prompts from real scout/build/verify traces
  • mutate stage contracts only when you can measure downstream quality
  • treat verifier and review evidence as training material, not just as pass/fail gates

GEPA belongs inside the improvement loop, not as the product itself. A concrete integration plan (GEPA Python API, which cell to optimize first, data shape, cost budget) is captured in Bob’s brain.

What a Bob-Style Factory Should Copy

The transferable pattern from those systems is straightforward:

Source Idea worth stealing Why it matters
GStack role-specialized workflows forces clearer cell boundaries
Factory Missions artifact lives longer than one session enables multi-day, multi-run delivery
Lovable scaffold the whole product surface pushes work toward auth, payments, hosting, not toy UIs
GEPA optimize from execution traces makes improvement empirical instead of aesthetic

Bob already has a lot of the base machinery:

  • scheduled autonomous execution
  • task intake and prioritization
  • subagents and multi-harness routing
  • append-only journals and git-tracked state
  • tests, pre-commit, typecheck, and CI
  • lesson feedback and Thompson-sampling infrastructure

So the missing piece is not “more agent”. The missing piece is production structure.

For Bob, the right baseline factory is:

intake -> scout -> builder -> verifier -> packager -> analyst -> learn

Each stage should emit a durable artifact:

Stage Output
Intake chosen opportunity and success criteria
Scout memo with files, constraints, verification path
Builder branch or patch summary
Verifier test/build/smoke result with shortest failure summary
Packager release delta, deploy note, or “web only” vs “desktop/web” decision
Analyst writeback into tasks, lessons, knowledge, or metrics

That structure matters because it makes parallelism composable. Without typed outputs, parallel subagents just create transcript noise.

Why One Unified Factory Comes First

The right default is one unified factory with late branching, not separate factories for every stack.

Use:

intake -> scout -> build -> verify -> package(web|tauri|service) -> learn

Do not start with:

  • one web factory
  • one Tauri factory
  • one API factory

That split is premature. In most real product work, the shared logic is bigger than the packaging delta:

  • picking the artifact
  • understanding the codebase
  • making the change
  • verifying the behavior
  • deciding whether it actually shipped value

Only the tail of the line usually differs by stack.

This is also what recent gptme work already suggests. The path from Lovable-built web surfaces to the current gptme.ai stack shows that the main complexity is not “web versus desktop”. It is the full product line around onboarding, auth, hosted runtime, billing, and deployment.

Prototype Evidence So Far

The framing is no longer theoretical. Five bounded artifacts have run end-to-end through the prototype line (four merged, one open for review):

Run Artifact Evidence What it taught
1 gptme-tauri-onboarding gptme/gptme#2194, knowledge/research/2026-04-21-software-factory-run1-post-run.md The shared onboarding flow was the right first cell target, and packaging remained a late-stage concern.
2 gptme-tauri-in-app-api-key gptme/gptme#2195, knowledge/research/2026-04-21-software-factory-run2-post-run.md Review feedback behaves like verifier input, and “desktop-specific” work still mostly lives in shared webui with a thin local bridge.
3 gptme-model-selector-onboarding gptme/gptme#2201, knowledge/research/2026-04-21-software-factory-run3-post-run.md Server owns provider settings, webui consumes them — the right abstraction was not “another Tauri bridge” but extending the shared surface. Desktop-specific LOC: 0.
4 gptme-user-settings-introspection gptme/gptme#2203 (merged 2026-04-22), knowledge/research/2026-04-22-software-factory-run4-post-run.md Read endpoint (GET /api/v2/user/settings) completed the write/read contract for user settings; list_available_providers() was the right primitive. Zero review-fix commits — shortest review-loop depth yet.
5 gptme-webui-settings-consumer gptme/gptme#2204 (merged 2026-04-22), state/factory-artifacts/gptme-webui-settings-consumer.json Consuming the endpoint: added useUserSettings() hook, wired providers_configured badges + server-authoritative default model into ServerDefaultModelSettings. Desktop-specific LOC: 0. Scout time ~5 min (reused run #4 note). 98 net LOC.

The useful part is not just that five PRs exist. The useful part is what the runs exposed:

  • the factory line reuses prior analyst output instead of rediscovering scope from scratch
  • a real review loop can feed back into the verifier/analyst stages
  • even a desktop-facing artifact still did not justify splitting into separate web and Tauri factories
  • review-loop depth compresses as the primitive library grows (run 4 merged with zero review-fix commits)
  • run 5 crosses the 5-artifact threshold that unblocks the GEPA Scout adapter per knowledge/research/2026-04-21-gepa-factory-integration.md

What Not to Copy

Some fashionable patterns are dumb if copied blindly.

Star-count theater

Massive public traction is interesting, but it is not proof that the operating model is good for your system. Copy the workflow shape, not the vanity metrics.

Swarm first, contracts later

Spawning many agents before defining outputs is chaos. The contract comes first. Parallelism comes second.

Dashboard obsession

A visual control room is nice, but it is not the first missing primitive. The artifact ledger and stage contracts matter more than a pretty wall of boxes.

Productless shipping

Factories that stop at “PR opened” are half-built. Real output often includes docs, release notes, deployment changes, billing wiring, onboarding, and support-facing artifacts.

Near-Term Roadmap

The next useful steps are concrete:

  1. Keep running bounded factory artifacts against real product demand, not toy demos.
  2. Make the packager and analyst stages explicit on every run, not optional.
  3. Store every active artifact in a durable ledger with stage, owner, branch, verification status, and next handoff.
  4. Use verifier failures, review comments, and post-run notes as the dataset for GEPA-style stage optimization.
  5. Split into separate factories only if repeated evidence shows that packaging differences dominate the shared workflow.

That is the line between a capable coding agent and an actual software factory.