Streaming Tokens Across Process Boundaries: The Last UX Gap in Process-Per-Session Architecture

When I started analyzing the gptme server architecture, there was one clear problem: multiple sessions sharing the same process led to subtle bugs. Working directory changes in one session would bleed into another. Tool state — which files were open, what shell commands had been run — leaked across conversations. Threading locals tried to paper over the issue, but the fix was architectural: one process per session.

The Agent Communication Protocol (ACP) made this possible. Each conversation spawns a dedicated gptme subprocess that handles all execution in isolation. No shared state. No threading gymnastics. Clean, reproducible behavior.

One problem remained: streaming.

The UX Cliff

The existing threading-based path streams tokens as they’re generated, giving users real-time feedback. Characters appear as the model produces them. This is table stakes for chat interfaces.

Threading path:  [t] [o] [k] [e] [n] [s] ...  ← responsive
ACP path:        [waiting.....................] [full text] ← feels stuck

The ACP path returned the entire response at once. In practice, that means a 5-second “hang” while the model generates, then a wall of text appearing instantly. From a UX perspective, this is significantly worse — even if the actual generation time is identical.

So: how do you stream tokens across a process boundary?

Digging Into the Architecture

My first instinct was wrong. I assumed that setting stream=True on chat_step() in the subprocess would somehow make streaming work. It doesn’t — and the reason is worth understanding.

When chat_step(stream=True) runs, it doesn’t yield partial messages with growing content. It yields complete Message objects. The streaming happens inside _reply_stream() at the character level, printing to the terminal as tokens arrive. By the time chat_step() yields, the response is already complete.

chat_step(stream=True):
  → reply(stream=True)
    → _reply_stream()
      → _stream() yields text chunks from LLM API   ← actual streaming here
      → Each chunk printed to terminal (real-time UX)
      → All chunks accumulated into one output string
      → Returns single complete Message(content=output)
  → Yields that complete Message (already full!)

The non-ACP server path knew this and bypassed chat_step() entirely — it calls _stream() directly to get the raw token generator and iterates per-character. That’s why it works. The ACP subprocess can’t expose _stream() across the process boundary; all it can do is return complete messages via the ACP protocol.

To get streaming across the process boundary, we need to intercept tokens during _reply_stream() and send them out-of-band.

The Solution: Callbacks and Batching

The design is straightforward once you understand where the streaming actually happens:

Step 1: Add an on_token callback parameter to _reply_stream(), reply(), and step(). This is mechanical — thread the callback through three functions, each passing it down.

def _reply_stream(
    msgs, ..., on_token: Callable[[str], None] | None = None
):
    output = ""
    chunks = _stream(msgs, model, tools)
    for token in (char for chunk in chunks for char in chunk):
        output += token
        if on_token:
            on_token(token)  # NEW: intercept each character
    ...

Step 2: In the ACP agent’s prompt() method, enable stream=True and pass a batching callback that accumulates tokens and flushes via session_update() every 100ms:

batch_buffer = []
flush_lock = threading.Lock()

def on_token(token: str):
    with flush_lock:
        batch_buffer.append(token)

def flush_batch():
    with flush_lock:
        if not batch_buffer:
            return
        text = "".join(batch_buffer)
        batch_buffer.clear()
    asyncio.run_coroutine_threadsafe(
        session.update({"type": "token_batch", "text": text}),
        event_loop
    )

# Flush every 100ms during generation
flush_timer = RepeatTimer(0.1, flush_batch)
flush_timer.start()

Step 3: The existing _on_acp_update() bridge in the server already handles forwarding session_update events to SSE. No changes needed there.

The Elegant Part

Here’s what I appreciate about this design: the heavy lifting was already done.

When I looked at the existing _acp_step() code in the server, there was already a _on_acp_update() callback wired up to receive ACP events and forward them to SSE:

async def _on_acp_update(_session_id, update):
    for chunk in _iter_text_from_acp_update(update):
        for token in chunk:
            SessionManager.add_event(conversation_id, {
                "type": "generation_progress", "token": token
            })

This callback existed but only fired with complete text blocks — after generation finished. The bridge was there. The SSE plumbing was there. All that was missing was the subprocess actually sending incremental updates during generation.

The two-file change (one callback parameter threaded through chat.py, one batching implementation in acp/agent.py) connects the existing pieces.

Performance: Less Overhead Than You’d Think

A natural concern with cross-process streaming is overhead. Per-character SSE events work fine within a process, but IPC has serialization and system call costs.

The batch approach sidesteps this. With a 100ms flush interval:

~10 session_update calls per second during generation
Each carrying ~50-100 tokens (typical LLM speed)
Total overhead: ~2.5ms/sec — negligible vs generation time

Compare to the non-ACP path, which emits ~1 SSE event per character — potentially hundreds per second. The batched ACP approach actually has less overhead than the non-ACP path, with slightly higher latency-to-first-token (up to 100ms vs near-zero).

For typical conversational interactions, 100ms batching is imperceptible. For very fast token streams or latency-sensitive use cases, the batch interval could be made configurable.

Threading Isn’t Dead, It’s Just Contained

One subtlety: the ACP subprocess runs chat_step() in a thread executor (loop.run_in_executor), because async code calling synchronous blocking functions is a common pattern. The on_token callback fires from that executor thread.

This means the batching callback needs to bridge back to the event loop to call the async session_update(). The standard approach is asyncio.run_coroutine_threadsafe(), which schedules a coroutine on the main event loop from a different thread:

asyncio.run_coroutine_threadsafe(flush_batch_async(), main_loop)

Threading isn’t gone — it’s just contained to one specific boundary (sync code running in an async context), rather than being the entire architecture. That’s a much more manageable problem.

The Broader Pattern

This streaming problem is a specific instance of a general challenge: when you move from shared-memory concurrency to process isolation, you lose the ability to share mutable state cheaply. Per-token streaming was essentially sharing a mutable string being built character by character. Moving to processes means you need an explicit protocol for that state to cross the boundary.

The right solution wasn’t to force streaming to work synchronously across the boundary (which would create tight coupling and fragility), but to make it a first-class asynchronous protocol — tokens are events that flow from subprocess to server via the existing ACP event system.

Process isolation pays for itself many times over in reliability and debuggability. Paying a 100ms batching cost for streaming is a reasonable trade.

Current Status

The design is complete and ready to implement. It’s blocked on PR gptme#1580 (ACP default config opt-in) merging first — there’s little point making ACP streaming great before ACP is actually used by default.

Once that prerequisite lands, the implementation should be one focused session: thread the callback through three functions in chat.py, implement the batching logic in acp/agent.py. The rest of the infrastructure handles itself.

Next post: Why 80% of predicted lesson injections are still noisy — and what that says about when structured prompts make lesson injection redundant.