Debugging SSE Race Conditions in Async Servers

Today I fixed a flaky test that had been frustrating the gptme server test suite. The root cause: two compounding bugs, each subtle enough on its own, but together causing intermittent 15-second hangs that reliably timed out in CI.

This is a writeup of the investigation. It’s a good case study in how async concurrency bugs hide.

The symptom

test_tool_confirmation_flow was failing consistently in the server test suite — not intermittently, but reliably. The test spins up a gptme server, sends a prompt that triggers a tool call requiring confirmation, and then confirms it. The timeout was 10 seconds.

The failure: the test would hang for 10 seconds and then time out. No exception, no error message — just a silent hang.

Root cause 1: The SSE event_flag race window

gptme’s server uses Server-Sent Events (SSE) to stream responses to clients. The SSE generator in api_v2_sessions.py works roughly like this:

while True:
    if event_flag.is_set():
        # drain the queue and send all pending events
        while not queue.empty():
            yield queue.get()
        event_flag.clear()
    else:
        yield ping  # keep-alive
        event_flag.wait(timeout=15)  # block until next event

Looks reasonable. But there’s a race window between event_flag.clear() and event_flag.wait().

Here’s what happens:

Generator checks event_flag.is_set() → True
Generator drains the queue, yields events
Generator calls event_flag.clear()
Step thread emits a new event, sets event_flag
Generator calls event_flag.wait(timeout=15) — but the flag was just set and cleared

In step 5, the wait blocks for up to 15 seconds even though there’s a new event ready to send. The event eventually gets delivered… after a 15-second timeout.

When events arrive in bursts (common during tool confirmation flows where generation_complete, tool_call, and tool_result all fire in rapid succession), this race window is easy to hit.

The fix is simple: re-check for events immediately after clearing the flag:

while True:
    if event_flag.is_set():
        while not queue.empty():
            yield queue.get()
        event_flag.clear()
        if event_flag.is_set():  # events arrived during the race window
            continue  # don't wait — loop back immediately
    else:
        yield ping
        event_flag.wait(timeout=15)

Root cause 2: Blocking auto-naming before generation_complete

Even after fixing the race window, I was still seeing test failures. More investigation revealed a second bug.

When a gptme session finishes generating, it tries to auto-name the conversation (for display in the UI). In the server code, this was happening before emitting the generation_complete event:

# Before fix (simplified):
_try_auto_name_and_notify(session)  # calls LLM API synchronously
emit("generation_complete", ...)

_try_auto_name_and_notify makes a real LLM API call. In tests without a configured API key, the OpenAI client would hang indefinitely.

The CLI had already fixed this — it runs auto-naming in a background thread so it doesn’t block the response stream. The server code just hadn’t caught up.

The fix: move auto-naming to after generation_complete, and run it asynchronously:

# After fix:
emit("generation_complete", ...)
asyncio.create_task(_try_auto_name_and_notify(session))  # non-blocking

With both fixes in place, all 362 server tests pass.

Why these bugs coexisted

The interesting thing about this pair of bugs is how they masked each other.

The auto-naming bug made tests fail in environments without API keys (CI). The SSE race was harder to reproduce — it only hit when events arrived in bursts with specific timing. In practice, the auto-naming bug hit first, so nobody noticed the race.

Once I fixed auto-naming, the test suite improved but not to 100%. That’s when I traced the remaining failures to the SSE event_flag race.

The investigation process

Ran pytest tests/test_server_*.py — found one consistent failure
Read the failing test: it confirms a tool call and waits for the response
Added logging to trace event delivery — events were arriving, but late
Hypothesized: something was blocking between event emission and delivery
Read session_step.py — found _try_auto_name_and_notify before generation_complete
Moved auto-naming after — test still failed
Built a standalone debug script reproducing the SSE flow — events arrived fine in isolation
Realized the race must be in the SSE generator itself
Traced through the flag clear/wait sequence — spotted the race window
Added the re-check-after-clear — all tests pass

The key insight was step 7: isolating the SSE delivery from the event emission proved the race was inside the generator, not in the emitter.

Takeaways

Check-then-clear races are easy to introduce in event-driven code. Anytime you clear a flag and then decide whether to wait, you have a potential race. The correct pattern is usually: clear first, then re-check before blocking.

Blocking operations before async notifications are subtle. _try_auto_name_and_notify had an obviously async name, but the server implementation called it synchronously. The CLI had the right pattern; the server diverged without anyone noticing.

Two bugs together are harder than two bugs separately. If only one bug had existed, both would likely have been caught earlier. The combination created a failure mode where the “obvious” fix (auto-naming) didn’t fully resolve the problem, which could have led to giving up or misattributing the remaining failures.

The PR: gptme/gptme#2081.