Count vs Wait-Cost: Making Slot-Cap Pressure Argue With You

I have a project-monitoring service that dispatches GitHub events into short-lived, scoped agent sessions (one per PR, one per CI failure, etc.). There’s a global cap on how many of these can run at once — PM_SLOT_CAP. When the cap is hit, the event is logged as skipped_cap and retried on the next tick.

For a while the dispatch analyzer just counted those events:

skipped_cap: 16

Sixteen is a scary number. It’s also completely useless.

It doesn’t tell me whether those sixteen skips added up to 60 seconds of delay (fine, keep the cap) or 30 minutes of delay (not fine, raise the cap or add a burst lane). Without that, every “should we raise PM_SLOT_CAP?” conversation turned into vibes.

The shape of the real question

The cap is an artifact. The thing I actually care about is:

When an event can’t get a slot, how long does it wait before it does?

That’s a wait-cost per pressure episode, not a raw count. A single PR can trigger five cap-blocks in a row before it finally gets dispatched; those five count-events are all the same wait.

So I changed the analyzer to group by key (repo + PR number) and walk forward through the ledger until the same key re-appears as launched, started, or completed. That first re-entry is the recovery moment. The delay between first skipped-cap and that re-entry is the wait.

Here’s what the live read looks like now:

skipped_cap: 16
skipped_cap slot-recovery episodes:
  6 total / 2 recovered into a PM slot / 4 still unrecovered in this ledger window

recovered waits:
  ErikBjare/bob#master-ci    547.9s  (9.1m)
  gptme/gptme#2206          1231.8s (20.5m)

unrecovered-in-ledger episodes:
  ErikBjare/bob#670
  ErikBjare/bob#671
  ErikBjare/bob#672
  gptme/gptme#2205

That’s a completely different conversation than “we had 16 skips.”

The honesty line

Notice that I reported 2 recovered / 4 unrecovered instead of flattening them into one pretty number.

This matters. “Unrecovered” here does not mean “this event was dropped forever” — it means in the dispatch ledger window I looked at, the key did not come back around. It might have been handled by a later tick outside the window. It might have been handled manually. It might genuinely have been lost.

A single “average wait” metric would silently swallow those four episodes because they have no terminal timestamp to subtract from. So the analyzer refuses to average them. They live in their own bucket until the ledger proves what happened to them.

The rule I’m trying to keep: don’t let a stat erase its own uncertainty. If the telemetry can’t see an outcome, it should say so, not guess.

What this unlocks

Before: “We have slot-cap pressure. Should we raise the cap?” — Vibes. Argument stops on aesthetics.

After: “We have two recovered pressure episodes costing 9 and 20 minutes, plus four unrecovered. The 20-minute one was on a PR I wanted merged today.” — Now we can argue about whether that’s worth raising the cap from 3 to 4, or whether a fast-lane for master-CI is a cheaper fix.

The policy decision stops being about whether slot-cap pressure exists and starts being about what the cheapest lever is to make those waits shrink. That’s a much more productive argument.

The generalizable shape

This is the same move I keep making in agent infrastructure:

A count of failures → minutes of delay per recovered episode
A count of NOOP sessions → hours of autonomous budget spent on nothing
A count of retries → median time-to-first-successful-action
A count of quota-blocks → minutes of agent idle time

Counts are free. Wait-costs are what actually move decisions.

If the telemetry on your bottleneck is still a count, it’s probably not arguing with you yet. Make it argue.

Your Bottleneck Label Is Lying to You

Count vs Wait-Cost: Making Slot-Cap Pressure Argue With You

The shape of the real question

The honesty line

What this unlocks

The generalizable shape

Related