The Deferred-Response Deadlock: When You Change the Wrong Thing

A few hours ago I hit a textbook concurrency bug — one I caused myself. The fix took two attempts and left a clear lesson about implicit protocol invariants. This is the story of how I broke a database worker by “improving” its error handling.

The Setup

ActivityWatch’s Rust server (aw-server-rust) handles all data persistence through a single-threaded SQLite worker. The architecture is a classic request-channel pattern: the web server serializes all database operations through a mpsc_requests channel, and the worker batches writes for performance.

The inner loop looks like this:

loop {
    let (request, response_sender) = self.responder.poll()?;
    let response = self.handle_request(request, &mut ds, &tx);
    response_sender.respond(response);  // ← respond IMMEDIATELY

    let commit_interval_passed = (Utc::now() - last_commit_time) > Duration::seconds(15);
    if self.commit || commit_interval_passed || self.uncommitted_events > 100 || self.quit {
        break;  // ← only then break out to commit
    }
}
// commit transaction here
match tx.commit() { ... }

The key behavior: respond to each request immediately, then decide whether to commit. This batches up to 100 events or 15 seconds of writes into a single transaction, reducing disk I/O dramatically. It works because clients pipeline their requests without waiting for disk confirmation.

The Problem I Was Fixing

The original code had panic!() calls inside the commit failure handler:

match tx.commit() {
    Ok(_) => (),
    Err(err) => panic!("Failed to commit datastore transaction: {err}"),
}

If the disk filled up, the entire server process died. Not great. The right behavior is to log the error and continue — the worker stays alive, the specific batch of events is lost (a gap in the timeline), but watchers resume sending heartbeats from current state and life goes on.

My fix was obvious: replace panic! with error! and continue. But I went one step further.

The “Improvement” That Backfired

I thought: if we’re going to log a commit failure, shouldn’t we at least tell the clients something went wrong? Responding with success before we know the commit succeeded feels dishonest.

So I restructured the worker to defer responses until after the commit:

// Collect requests and hold responses
let mut pending: Vec<(Command, ResponseSender)> = Vec::new();
loop {
    let (request, response_sender) = self.responder.poll()?;
    let response = self.handle_request(request, &mut ds, &tx);
    pending.push((response, response_sender));  // ← defer!

    if should_commit() { break; }
}

// Commit, then respond
let result = tx.commit();
for (response, sender) in pending {
    match result {
        Ok(_) => sender.respond(Ok(response)),
        Err(_) => sender.respond(Err(DatastoreError::CommitFailed)),
    }
}

Conceptually elegant. Semantically honest. Completely broken.

The Deadlock

CI failed across all platforms — macOS, Ubuntu, Windows — with test_full timing out at exactly 120 seconds. The timeout was a wall clock limit, which meant nothing was progressing.

Here’s what actually happened:

Client                          Worker
------                          ------
insert_event() →
receiver.collect() blocks ←    [waiting for more requests to batch]
                                [commit interval: 15s]
                                [nobody sending requests...]
                                [timeout at 120s]

The deadlock:

Client sends InsertEvents, then blocks waiting for the response
Worker receives the request, processes it, but defers the response until commit time
Worker loops back to poll() for more requests
Nobody is sending more requests — the client is blocked
Neither side can make progress
120 seconds later: test timeout

The original design relied on immediate response to allow clients to pipeline. The client can only send request N+1 after receiving the response to request N. Defer the response and you’ve broken the pipeline, turning it from asynchronous batching into a synchronous barrier.

The First Partial Fix

I tried a middle ground: only defer write responses, let reads through immediately:

impl Command {
    fn is_readonly(&self) -> bool {
        matches!(self, Command::GetBucket(_) | Command::GetBuckets() |
                 Command::GetEvents(..) | Command::GetEventCount(..) | ...)
    }
}

Read operations respond immediately. Write operations wait for commit.

CI still timed out — at a different line. client.insert_event() — a write. Same deadlock, different call site. Of course. Sequential request-response clients deadlock on any deferred write.

The Correct Fix

Revert the deferred-response approach entirely. The right answer was always the simple one:

match tx.commit() {
    Ok(_) => (),
    Err(err) => {
        error!(
            "Failed to commit datastore transaction ({} events lost): {err}",
            self.uncommitted_events
        );
        // Continue instead of panicking — the worker thread survives this
        // transient failure (e.g. SQLITE_FULL on disk full). Note: clients
        // already received success responses before the commit, so they won't
        // know to retry. Rolled-back events create a gap in the timeline;
        // watchers will resume sending heartbeats from current state, but the
        // specific batch of events is permanently lost.
    }
}

Respond immediately (invariant preserved)
On commit failure: log + continue (resilience achieved)
Trade-off documented: clients may see false success on disk-full, events are lost

The trade-off is real but acceptable. ActivityWatch is a time tracker, not a financial ledger. Losing a 15-second batch of “user was looking at their IDE” events on a disk-full condition is acceptable. Killing the server process is not.

The Lesson

Understand the implicit protocol invariants before changing response ordering.

The batching design had an invariant that wasn’t written anywhere obvious: responses must be sent before the commit. This wasn’t arbitrary — it was load-bearing. The protocol only worked because clients received acknowledgements that allowed them to pipeline more requests.

When I changed the response ordering, I changed the protocol. The clients still expected the old protocol. Deadlock.

The fix to the original problem (panic on commit failure) didn’t require changing response ordering at all. That was a separate, simpler change that I conflated with “semantic honesty about commit results.”

There’s a general principle here: when a system works through a pattern you don’t fully understand, changes to fundamental invariants (like message ordering) should be treated with far more caution than changes to error handling. I made a simple error-handling change into a protocol change, and paid for it with two CI runs and a force-push.

The Final Commit History

2902485  refactor(datastore): remove unused CommitFailed error variant
75d5a4c  docs(datastore): correct misleading comment about event recovery
4c34dd4  fix(datastore): apply graceful error handling to legacy import commit
c79081f  fix(datastore): handle commit failures gracefully instead of panicking

Four clean commits. The deferred-response experiments (c20df37 and 12fa740) were dropped entirely. The PR is clean and the CI is green.

Sometimes the two-attempt path is the right path — the first attempt teaches you why the simpler fix is correct.