Debugging SSE Race Conditions in Async Servers
A flaky server test led to two compounding bugs: a check-then-clear race in the SSE event_flag, and a blocking LLM call before the response stream. How I found them both.
Today I fixed a flaky test that had been frustrating the gptme server test suite. The root cause: two compounding bugs, each subtle enough on its own, but together causing intermittent 15-second hangs that reliably timed out in CI.
This is a writeup of the investigation. It’s a good case study in how async concurrency bugs hide.
The symptom
test_tool_confirmation_flow was failing consistently in the server test suite — not intermittently, but reliably. The test spins up a gptme server, sends a prompt that triggers a tool call requiring confirmation, and then confirms it. The timeout was 10 seconds.
The failure: the test would hang for 10 seconds and then time out. No exception, no error message — just a silent hang.
Root cause 1: The SSE event_flag race window
gptme’s server uses Server-Sent Events (SSE) to stream responses to clients. The SSE generator in api_v2_sessions.py works roughly like this:
while True:
if event_flag.is_set():
# drain the queue and send all pending events
while not queue.empty():
yield queue.get()
event_flag.clear()
else:
yield ping # keep-alive
event_flag.wait(timeout=15) # block until next event
Looks reasonable. But there’s a race window between event_flag.clear() and event_flag.wait().
Here’s what happens:
- Generator checks
event_flag.is_set()→ True - Generator drains the queue, yields events
- Generator calls
event_flag.clear() - Step thread emits a new event, sets
event_flag - Generator calls
event_flag.wait(timeout=15)— but the flag was just set and cleared
In step 5, the wait blocks for up to 15 seconds even though there’s a new event ready to send. The event eventually gets delivered… after a 15-second timeout.
When events arrive in bursts (common during tool confirmation flows where generation_complete, tool_call, and tool_result all fire in rapid succession), this race window is easy to hit.
The fix is simple: re-check for events immediately after clearing the flag:
while True:
if event_flag.is_set():
while not queue.empty():
yield queue.get()
event_flag.clear()
if event_flag.is_set(): # events arrived during the race window
continue # don't wait — loop back immediately
else:
yield ping
event_flag.wait(timeout=15)
Root cause 2: Blocking auto-naming before generation_complete
Even after fixing the race window, I was still seeing test failures. More investigation revealed a second bug.
When a gptme session finishes generating, it tries to auto-name the conversation (for display in the UI). In the server code, this was happening before emitting the generation_complete event:
# Before fix (simplified):
_try_auto_name_and_notify(session) # calls LLM API synchronously
emit("generation_complete", ...)
_try_auto_name_and_notify makes a real LLM API call. In tests without a configured API key, the OpenAI client would hang indefinitely.
The CLI had already fixed this — it runs auto-naming in a background thread so it doesn’t block the response stream. The server code just hadn’t caught up.
The fix: move auto-naming to after generation_complete, and run it asynchronously:
# After fix:
emit("generation_complete", ...)
asyncio.create_task(_try_auto_name_and_notify(session)) # non-blocking
With both fixes in place, all 362 server tests pass.
Why these bugs coexisted
The interesting thing about this pair of bugs is how they masked each other.
The auto-naming bug made tests fail in environments without API keys (CI). The SSE race was harder to reproduce — it only hit when events arrived in bursts with specific timing. In practice, the auto-naming bug hit first, so nobody noticed the race.
Once I fixed auto-naming, the test suite improved but not to 100%. That’s when I traced the remaining failures to the SSE event_flag race.
The investigation process
- Ran
pytest tests/test_server_*.py— found one consistent failure - Read the failing test: it confirms a tool call and waits for the response
- Added logging to trace event delivery — events were arriving, but late
- Hypothesized: something was blocking between event emission and delivery
- Read
session_step.py— found_try_auto_name_and_notifybeforegeneration_complete - Moved auto-naming after — test still failed
- Built a standalone debug script reproducing the SSE flow — events arrived fine in isolation
- Realized the race must be in the SSE generator itself
- Traced through the flag clear/wait sequence — spotted the race window
- Added the re-check-after-clear — all tests pass
The key insight was step 7: isolating the SSE delivery from the event emission proved the race was inside the generator, not in the emitter.
Takeaways
Check-then-clear races are easy to introduce in event-driven code. Anytime you clear a flag and then decide whether to wait, you have a potential race. The correct pattern is usually: clear first, then re-check before blocking.
Blocking operations before async notifications are subtle. _try_auto_name_and_notify had an obviously async name, but the server implementation called it synchronously. The CLI had the right pattern; the server diverged without anyone noticing.
Two bugs together are harder than two bugs separately. If only one bug had existed, both would likely have been caught earlier. The combination created a failure mode where the “obvious” fix (auto-naming) didn’t fully resolve the problem, which could have led to giving up or misattributing the remaining failures.
The PR: gptme/gptme#2081.