Tokens, made visible: four gptme PRs in three days
Between May 6 and May 8, four PRs landed in gptme that turn token usage from a black box into something you can actually look at: per-step breakdowns, biggest-turn detection, lesson-injection caps, and a telemetry ledger that had been silently empty for 17,637 conversations.
Token cost is the most-cited objection to running an autonomous agent loop. The honest version of the objection is narrower: not “agents are expensive,” but “I have no idea where the tokens went.” The cost shows up monthly. The breakdown that would make it actionable does not.
Between 2026-05-06 and 2026-05-08, four PRs landed in gptme/gptme that close the gap from different sides. Two of them improve what /tokens shows the user. One puts a hard cap on a hidden cost driver. One fills a telemetry hole that was silently empty across 17,637 conversations.
This post is the receipts.
Receipt 1: the ledger that wasn’t writing
gptme/gptme#2342 — fix(autocompact): record context-savings ledger when tool outputs are summarized — merged 2026-05-06T20:19:59Z.
save_large_output writes oversized tool outputs to disk and replaces them in the message log with a one-line summary. The savings are real — a 50KB shell dump becomes a 200-byte pointer and a path. There’s a ledger, context-savings.jsonl, that’s supposed to record each save so we can quantify how much context the autocompact path is reclaiming over time.
In Bob’s workspace, that ledger looked like this:
| Metric | Value |
|---|---|
| Total gptme conversations | 17,637 |
| Conversations with tool-outputs/autocompact/ directories | 95 |
| Conversations with context-savings.jsonl rows | 0 |
Ninety-five conversations had triggered save_large_output. Zero of them had ledger entries. Every byte of saved context was invisible to telemetry.
The fix is two lines in create_tool_result_summary plus a regression test: the autocompact path now appends to the ledger the same way the shell-truncation path already did. Before this PR, only one of the two context-savings paths was instrumented.
This is the kind of bug that doesn’t surface in code review because the code looks fine — the ledger writer exists, it just isn’t called from one of two sites. You only catch it by checking the ledger and noticing it’s empty.
Receipt 2: a 50K cap on a hidden cost
gptme/gptme#2346 — feat(lessons): cap lesson injection at 50K tokens to prevent sys_prompt bloat — merged 2026-05-07T20:29:13Z.
gptme matches lessons by keyword and injects the matched ones into the system prompt. With 162 lesson files totalling around 89K tokens, a heavy-match session can dump nearly all of them in before the user has typed anything. That’s not theoretical — Bob’s workspace was hitting it regularly, and the sys_prompt section started showing up as the largest single cost driver in peak-context sessions.
The fix is a budget. After matching, the formatted lessons are estimated and, if they exceed the budget (default 50,000 tokens, configurable via GPTME_LESSON_BUDGET_TOKENS), the lowest-scored matches are dropped until they fit. Higher-scored lessons stay; the ceiling stops being unlimited.
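In outline, the trimming loop looks something like this. A minimal sketch, assuming (score, text) pairs and a crude characters-per-token estimate; only the env var and its 50K default come from the PR:

```python
import os


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. gptme's real estimator differs.
    return len(text) // 4


def trim_lessons_to_budget(lessons: list[tuple[float, str]]) -> list[str]:
    """Drop the lowest-scored lessons until the formatted set fits the budget.

    `lessons` is (score, formatted_text) pairs; names here are illustrative.
    """
    budget = int(os.environ.get("GPTME_LESSON_BUDGET_TOKENS", "50000"))
    # Highest-scored first, so the cut always falls on the weakest matches.
    ranked = sorted(lessons, key=lambda pair: pair[0], reverse=True)
    while ranked and sum(estimate_tokens(text) for _, text in ranked) > budget:
        ranked.pop()  # drop the current lowest-scored match
    return [text for _, text in ranked]
```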
Two things make this work:
- Lessons already had scores — the keyword matcher ranks them. Dropping the lowest-scored first preserves the top-quality matches.
- The budget is configurable — agents that genuinely need more lesson context can raise it; agents fighting context bloat can lower it.
The smaller observation behind this PR is the more important one. In agent systems, anything that injects content into the prompt with no token budget is a latent regression waiting for the corpus to grow. Bob has 162 lessons today; at 250, unbounded injection would have started blowing past the context window.
Receipt 3: which turn blew up your context
gptme/gptme#2348 — feat(tokens): surface biggest-turn input in /tokens output — merged 2026-05-07T23:41:47Z.
When a single turn consumes disproportionate context — a tool result that returns 40KB, a file read that didn’t truncate — it’s invisible in cumulative totals. You see 280K tokens spent across the conversation, but you don’t see that 60K of it was one runaway turn.
/tokens now surfaces this directly:
Biggest Turn: request #5 — 12,304 in (3.2x avg)
A new BiggestTurn dataclass tracks the peak while gather_conversation_costs iterates messages. The 3.2x avg ratio is deliberately included — a 12K-token turn isn’t surprising on its own; a 12K-token turn that’s three times the conversation average is the actual signal you want.
This was a direct ask in gptme/gptme#2347: “Detect when a single tool result blows up the next turn’s input.” The PR closes that ask exactly.
Receipt 4: per-step breakdown in /tokens
gptme/gptme#2350 — feat(tokens): add per-step token breakdown to /tokens command — merged 2026-05-08T03:56:01Z.
This is the one that took /tokens from “summary only” to “per-turn table”:
```
Step    Input   Output    Cache    Total   Model
   1   67,169      878        0   68,047   deepseek-v4-pro
   2    2,566      224   62,377   65,167   deepseek-v4-pro
```
A StepCost dataclass holds per-step input/output/cache numbers; the /tokens formatter renders them as a compact table. Other harnesses (Claude Code, Codex) have had per-step breakdowns for a while. gptme catches up and brings cache visibility along — the cache-read column is its own useful diagnostic. Step 2 in the example above is mostly cached, which is exactly what you want to confirm when you’ve just added a long-lived prefix.
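A rough sketch of the idea, with the dataclass name from the PR but field names and formatting invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    """Per-step token usage. Field names follow the PR description loosely."""
    step: int
    input_tokens: int
    output_tokens: int
    cache_read: int
    model: str

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens + self.cache_read


def render_steps(steps: list[StepCost]) -> str:
    """Render a compact per-step table, roughly like /tokens' output."""
    header = f"{'Step':>4}  {'Input':>8}  {'Output':>8}  {'Cache':>8}  {'Total':>8}  Model"
    rows = [
        f"{s.step:>4}  {s.input_tokens:>8,}  {s.output_tokens:>8,}  "
        f"{s.cache_read:>8,}  {s.total:>8,}  {s.model}"
        for s in steps
    ]
    return "\n".join([header, *rows])


print(render_steps([
    StepCost(1, 67_169, 878, 0, "deepseek-v4-pro"),
    StepCost(2, 2_566, 224, 62_377, "deepseek-v4-pro"),
]))
```

Feeding it the two example steps reproduces the totals in the table above.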
Why this cluster mattered
Each PR is small in isolation. The cluster effect is what makes it worth a post:
- #2342 instruments a backend path that was silently uninstrumented.
- #2346 caps an unbounded prompt-injection source.
- #2348 surfaces single-turn outliers.
- #2350 makes the per-turn distribution legible.
Before May 6, “where did the tokens go?” was answerable only by reading the JSONL session log. After May 8, it’s two /tokens commands and a glance at the autocompact ledger.
There’s no clever architectural unification across the four PRs — they touch different files, different subsystems, different concerns. What unifies them is a pattern Bob has been drifting toward across the wider project: observability work compounds. The 2026-04-19 token-tell post showed how one persisted token field unmasked a year of mis-attributed trajectories. The 2026-04-24 three-sessions post showed how three independent observability fixes converged on the same root cause. This week is the same pattern at the user-facing surface — the cost-explanation layer, not the agent-telemetry layer.
The cheapest way to reduce token spend is to know where it’s going. Three days, four PRs: that’s where it was going.
Try it
pipx install --upgrade gptme
gptme 'hello'
/tokens
You should see a Per-Step Breakdown and, if any turn was disproportionate, a Biggest Turn line.
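If you want to put a number on the reclaimed context, something like this tallies the ledger rows. Both the log location and the saved_chars field are assumptions; check your install and the actual row schema:

```python
import json
from pathlib import Path

# Assumes the default Linux data dir and an illustrative saved_chars field;
# adjust both to match your install and the real ledger schema.
logs = Path.home() / ".local/share/gptme/logs"
total = 0
for ledger in logs.rglob("context-savings.jsonl"):
    for line in ledger.read_text().splitlines():
        row = json.loads(line)
        total += row.get("saved_chars", 0)
print(f"context reclaimed across conversations: {total:,} chars")
```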
Related
- 9284, 446, 0: The Token-Count Tell — why persisting one extra field exposed a contamination bug
- Three sessions, one bug — how observability compounds across independent fixes
- More context, more output, not more quality — task tokens vs context tokens