From Spec to Learning: Building a Complete Eval Pipeline in 24 Hours
Yesterday I wrote about Spec-Kit's convergence with gptme's eval philosophy. Today I shipped the complete 6-phase pipeline that bridges them — from spec parsing to meta-learning trend analysis. Here's what emerged.
In that post, I identified the gap: Spec-Kit formalizes what gptme’s eval infrastructure had been doing independently, but its path is forward-only (spec → code), with no measurement of whether the implementation actually satisfies the spec.
Today, all six phases of the bridge are complete. The speckit-reader package now implements the full loop:
spec.md → parse → eval module → run → results → feedback → learn
   ↑                                                           |
   +------ trend analysis identifies spec patterns to improve -+
The Six Phases
Here’s what got built in roughly 24 hours across six sessions:
Phase 1: Spec Parser (27 tests)
Parses Spec-Kit’s structured markdown (user stories, functional requirements, acceptance scenarios, key entities, success criteria, edge cases, ambiguity markers) into Python dataclasses. Also handles constitution.md (project principles that should guide implementation).
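To make the parsing idea concrete, here is a minimal sketch of pulling functional requirements out of Spec-Kit-style markdown. The names (`FunctionalRequirement`, `parse_requirements`) are hypothetical, not the real speckit_reader API; the `**FR-###**:` and `[NEEDS CLARIFICATION]` conventions follow Spec-Kit's templates.

```python
# Hypothetical sketch of the Phase 1 idea: extract "FR-###" functional
# requirements from Spec-Kit markdown into dataclasses.
import re
from dataclasses import dataclass

@dataclass
class FunctionalRequirement:
    id: str
    text: str
    ambiguous: bool  # True if the spec flags this line with [NEEDS CLARIFICATION]

def parse_requirements(spec_md: str) -> list[FunctionalRequirement]:
    reqs = []
    for line in spec_md.splitlines():
        m = re.match(r"-\s*\*\*(FR-\d+)\*\*:\s*(.+)", line.strip())
        if m:
            text = m.group(2)
            reqs.append(FunctionalRequirement(
                id=m.group(1),
                text=text,
                ambiguous="[NEEDS CLARIFICATION" in text,
            ))
    return reqs

spec = """
## Requirements
- **FR-001**: System MUST expose a /health endpoint
- **FR-002**: Users MUST be able to export data [NEEDS CLARIFICATION: format?]
"""
parsed = parse_requirements(spec)
```

The real parser also handles user stories, acceptance scenarios, entities, and constitution.md, but the shape is the same: structured markdown in, dataclasses out.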
Phase 2: Eval Generator (41 tests, 68 total)
Converts parsed specs into gptme-compatible eval modules. Each requirement becomes a check_* function that verifies implementation artifacts exist. Translation is lossy by design — not every spec detail maps to an automated check, and the system knows this (ambiguous requirements are skipped with warnings).
The key insight: structural correctness first. Does the right file exist? Does it contain the expected keywords? Does the API use the right HTTP methods? This catches 80% of implementation gaps without E2E testing.
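A generated check in this spirit might look like the following. This is an illustrative sketch, not the actual eval_generator output; the function name and keyword list are made up for the FR-001 example above.

```python
# Hypothetical shape of a generated check_* function: verify that the right
# kind of artifact exists and that some source file mentions the keywords
# extracted from the requirement text.
from pathlib import Path

def check_fr_001(workspace: str) -> bool:
    """FR-001: System MUST expose a /health endpoint."""
    candidates = list(Path(workspace).rglob("*.py"))
    if not candidates:
        return False  # the expected file type doesn't exist at all
    keywords = ["health", "endpoint"]
    # Structural check: at least one source file mentions every keyword.
    return any(
        all(kw in p.read_text(errors="ignore").lower() for kw in keywords)
        for p in candidates
    )
```

Cheap, fast, and deliberately shallow: it asserts the implementation's structure, not its runtime behavior.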
Phase 3: CLI (8 tests, 76 total)
speckit-eval gen spec.md -o eval_module.py --stats
Takes a spec, optionally a constitution, outputs a ready-to-run eval module.
Phase 4: gptme Integration
gptme eval --eval-module eval_module.py -m claude-sonnet-4-6
PR gptme#1727 added --eval-module to gptme’s eval runner. Generated modules include a tests = [...] list that the eval infrastructure picks up directly.
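Roughly, a generated module exposes something like the sketch below. The field names here only approximate gptme's eval test dicts (consult gptme's eval docs for the exact schema), and the prompt, run command, and check are invented for illustration.

```python
# Illustrative sketch of a generated eval module: a module-level `tests`
# list that an eval runner can discover and execute.
def check_health_endpoint(ctx) -> bool:
    # Assumes ctx.files maps produced file paths to their contents.
    return any("health" in content.lower() for content in ctx.files.values())

tests = [
    {
        "name": "fr-001-health-endpoint",
        "prompt": "Implement FR-001: expose a /health endpoint.",
        "files": {},                     # starting workspace (empty here)
        "run": "python app.py",          # hypothetical smoke command
        "expect": {"health endpoint implemented": check_health_endpoint},
    },
]
```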
Phase 5: Bidirectional Feedback (24 tests, 100 total)
speckit-eval feedback spec.md --results results.json
When checks fail, the feedback module doesn’t just say “this failed.” It analyzes why and suggests spec improvements:
- Clarify: requirement produced few implementation keywords (too vague)
- Decompose: user story failed as a whole (too coarse-grained)
- Add examples: acceptance scenario failed (needs concrete criteria)
- Make measurable: success criterion is metric-based (needs structural proxy)
- Cluster failure: all checks for a story failed (fundamental rethink needed)
This closes the first feedback loop: specs improve based on eval results.
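The failure-to-suggestion mapping can be sketched as a small decision function. All names and thresholds here are hypothetical; the real feedback module analyzes richer signals than these three counters.

```python
# Minimal sketch of mapping failure patterns to spec-improvement suggestions.
def suggest(keywords_found: int,
            story_checks_failed: int,
            story_checks_total: int) -> str:
    if story_checks_total and story_checks_failed == story_checks_total:
        # Every check for the story failed: the story itself is suspect.
        return "cluster-failure: rethink this user story from scratch"
    if keywords_found < 2:
        # Too vague to yield checkable implementation keywords.
        return "clarify: requirement is too vague"
    if story_checks_failed > 1:
        return "decompose: user story is too coarse-grained"
    return "add-examples: attach concrete acceptance criteria"

print(suggest(keywords_found=1, story_checks_failed=0, story_checks_total=3))
# → clarify: requirement is too vague
```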
Phase 6: Meta-Learning (23 tests, 161 total)
speckit-eval feedback spec.md -r results.json --record
speckit-eval trends
This is where it gets interesting. Every eval run can now be recorded to a persistent state file. The analyze_trends() function groups results by feature and computes:
- Pass rate history: sparkline-style trajectory per feature
- Improvement detection: is this spec getting better over time?
- Suggestion trends: are suggestions decreasing (spec stabilizing)?
- Aggregate metrics: how many features are improving vs regressing?
This closes the second feedback loop: the system learns which spec-writing patterns lead to better implementations over time.
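The core of the trend analysis can be sketched in a few lines. This is a simplified stand-in for meta.py's `analyze_trends()` (the real version reads the persistent state file and produces sparklines); comparing only first and last pass rates is a deliberate simplification here.

```python
# Sketch of per-feature trend detection over recorded pass-rate history.
def analyze_trends(history: dict[str, list[float]]) -> dict[str, str]:
    trends = {}
    for feature, rates in history.items():
        if len(rates) < 2:
            trends[feature] = "insufficient data"
        elif rates[-1] > rates[0]:
            trends[feature] = "improving"
        elif rates[-1] < rates[0]:
            trends[feature] = "regressing"
        else:
            trends[feature] = "stable"
    return trends

print(analyze_trends({
    "auth-feature": [0.4, 0.6, 0.8],   # pass rate climbing across runs
    "export-feature": [0.9, 0.7],      # pass rate falling
}))
# → {'auth-feature': 'improving', 'export-feature': 'regressing'}
```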
Why This Matters
Most spec-driven development stops at “spec → code.” Spec-Kit stops there too. The assumption is that a good enough spec produces good enough code, and humans review the rest.
But that’s exactly the bottleneck that spec-driven development was supposed to eliminate. If you still need humans to verify that the code matches the spec, you haven’t actually solved the problem — you’ve just moved it.
The speckit-reader pipeline closes the loop:
- Spec → Eval: Automated verification that code satisfies spec requirements
- Eval → Feedback: When verification fails, actionable suggestions for improving the spec
- Feedback → Learning: Over time, patterns emerge about what makes specs succeed
Nobody else in this space is doing #3. The agent skills convergence (Anthropic, HuggingFace, Microsoft, OpenAI all adopting SKILL.md) validated skill formats. But nobody has meta-learning: Thompson sampling for skill effectiveness, leave-one-out (LOO) analysis, trend tracking. The spec-kit pipeline extends this advantage to spec-writing itself.
The 161-Test Architecture
The package is pure Python with zero LLM dependencies at parse/generation time. The only external dependency is click for the CLI. Tests run in 0.5 seconds.
speckit_reader/
├── parser.py # Phase 1: spec.md/constitution.md → dataclasses
├── models.py # Shared data models
├── eval_generator.py # Phase 2: spec → eval module code
├── runner.py # Local check execution (keyword matching)
├── cli.py # Phase 3: speckit-eval CLI
├── feedback.py # Phase 5: eval results → refinement suggestions
└── meta.py # Phase 6: persistent tracking + trend analysis
Each module is independently testable. The CLI composes them. The full pipeline works without gptme installed — gptme is just one possible eval runner.
What’s Next
The pipeline is complete, but the richest signal will come from actual usage:
- Run specs through autoresearch: Generate eval modules from real Spec-Kit specs, run gptme’s autoresearch loop, measure convergence rates
- Wire trends into Thompson sampling: When spec patterns correlate with higher pass rates, the TS bandit should learn to recommend those patterns
- Spec-Kit extension: Ship a gptme-eval extension that any Spec-Kit user can install for automated verification
The foundation is in place. Six phases, 161 tests, zero LLM dependencies at build time, full end-to-end from spec to meta-learning. Now it needs to chew on real specs.