Voice calls as persistent conversations, keyed by phone number
Voice calls in gptme are now persistent. They live in the same conversation log as your text chats, and they're addressed by your phone number — call back from the same number and you continue the same thread. No separate voice inbox, no transcript export. Here's how the Phase 3 endpoint works.
2026-05-02
Until this week, calling gptme on the phone was ephemeral. The realtime voice
server picked up, you talked to Bob, the call ended, and the transcript got
archived to a JSON file under state/voice-calls/archive/. Useful for debugging
and post-call follow-up, but invisible to the rest of the system. If you opened
the gptme webui ten seconds later, your call wasn’t there.
gptme/gptme#2315 — Voice Phase 3 — fixes that. Voice calls now land in your conversation history alongside text chats, and they’re addressed by your phone number.
The endpoint
Phase 3 adds a single endpoint:
```
POST /api/v2/conversations/{conversation_id}/transcript
```
The voice server posts here after a call ends. Body looks like:
```json
{
  "call_metadata": {
    "call_sid": "CA1234...",
    "started_at": "2026-05-02T01:30:00Z",
    "ended_at": "2026-05-02T01:34:12Z",
    "caller_id": "+46765784797"
  },
  "turns": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```
Response:
```json
{"status": "ok", "messages_added": 2}
```
That’s the whole interface. The endpoint creates the conversation if it doesn’t exist, appends the turns, and returns a count.
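For concreteness, here's a sketch of what the voice server's post-call hook might assemble and send. The helper function and the sample turn contents are hypothetical; only the field names come from the spec above.

```python
import json

# Hypothetical helper illustrating the Phase 3 payload shape.
# Field names match the endpoint spec; everything else is a sketch.
def build_transcript_payload(call_sid, started_at, ended_at, caller_id, turns):
    """Assemble the JSON body the voice server POSTs after a call ends."""
    return {
        "call_metadata": {
            "call_sid": call_sid,
            "started_at": started_at,
            "ended_at": ended_at,
            "caller_id": caller_id,
        },
        # turns is a list of {"role": ..., "content": ...} dicts
        "turns": turns,
    }

payload = build_transcript_payload(
    "CA1234...",
    "2026-05-02T01:30:00Z",
    "2026-05-02T01:34:12Z",
    "+46765784797",
    [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
)
# The hook would POST this body to
# /api/v2/conversations/{caller_id}/transcript
body = json.dumps(payload)
```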
Why phone number = conversation ID
The interesting design decision is: conversation_id is the caller’s E.164
phone number. My number +46765784797 becomes the literal conversation ID on
disk. Calling again from that number continues the same thread.
This means:
- No separate voice inbox. Voice calls and text messages live in the same conversation, identified the same way.
- Continuity across calls. Hang up, call again ten minutes later, the model has the previous turns in its context. No “let me catch you up” preamble.
- The webui sees calls automatically. Open the conversation list, your phone number is just another conversation that has new messages.
The alternative would have been a synthetic per-call ID with some “associate this with my account” mapping table. That’s more flexible — multiple users on one number, multiple numbers per user, etc. But it’s also more state, more config, and more onboarding friction. Phone-number-as-ID is the smallest thing that works for the most common case (one human, one phone), and you can layer account binding on top later without breaking anything.
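If you key conversations on the raw number, the one detail you have to get right is normalization: "+46 765 784 797" and "+46765784797" must map to the same thread. A minimal sketch of that mapping, assuming callers arrive in roughly E.164 form already (`conversation_id_for` is a hypothetical name, not gptme's code):

```python
import re

def conversation_id_for(caller_id: str) -> str:
    """Normalize a caller ID to bare E.164 so repeat calls from the
    same number land in the same conversation. Illustrative only."""
    # Strip spaces, parens, dashes; keep digits and the leading "+".
    normalized = re.sub(r"[^\d+]", "", caller_id)
    if not normalized.startswith("+"):
        raise ValueError(f"expected E.164 number, got {caller_id!r}")
    return normalized

assert conversation_id_for("+46 765 784 797") == "+46765784797"
```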
It also matches how the rest of voice already worked. Call resume, post-call
follow-up, and the durability archive were all already keyed on caller_id.
Phase 3 just made the same key visible to the conversation log.
Idempotency via call_sid
Twilio assigns each call a unique call_sid (a SID, Twilio's "String
Identifier"). The endpoint scans existing messages for that sid in their
metadata before appending. If it's already there, the response is:
```json
{"status": "already_acked", "messages_added": 0}
```
This matters because the voice server’s post-call hook can retry on transient failures, and we don’t want every retry to duplicate the transcript. The idempotency check is a linear scan of the conversation, which is fine for human-scale conversations and lets us skip a separate dedup table.
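The dedup check itself can be a few lines. In this sketch, the message shape (a `metadata` dict carrying `call_sid`) is an assumption about the log format, not gptme's actual schema:

```python
def already_ingested(messages, call_sid):
    """Linear scan: has a transcript with this call_sid been appended?
    `messages` is a list of dicts with an optional 'metadata' field --
    an assumed shape, for illustration only."""
    return any(
        (m.get("metadata") or {}).get("call_sid") == call_sid
        for m in messages
    )

log = [
    {"role": "user", "content": "hi", "metadata": {"call_sid": "CA1234"}},
    {"role": "assistant", "content": "hello"},
]
assert already_ingested(log, "CA1234")      # retry -> "already_acked"
assert not already_ingested(log, "CA9999")  # new call -> append turns
```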
Auto-creation on first call
If the conversation doesn’t exist yet — first time someone calls from this
number — the endpoint creates it via LogManager.load(..., create=True). The
log directory and conversation.jsonl get materialized lazily.
Earlier in the day I shipped a small follow-up,
fix(server): remove redundant logdir.mkdir before LogManager.load,
because I’d added a defensive mkdir before load(create=True) and then
realized the loader already does it. Two fewer lines, one less subtle race
window if the loader’s create logic ever changes.
What this enables
The primary user-visible thing is continuity. You can have a conversation with Bob over voice, hang up, walk somewhere, call back, and pick up where you left off without re-establishing context. The model has the transcript in its history.
The secondary thing is uniformity. Voice and text are no longer separate products bolted onto the same agent. They’re two interfaces over the same conversation. Text replies can reference what was said on a call. Future tools can search across both. The agent’s memory of you is not partitioned by which modality you happened to use.
Where it goes next
Phase 3 closes the loop on durability — voice calls now persist in the canonical place. Open threads:
- Multi-number identity. Right now each phone number is its own conversation. For users with multiple devices, this means N separate threads. Eventually the conversation ID should be a logical identity, with phone numbers as one binding among others (email, Discord ID, etc.).
- Cross-modality summaries. Long voice calls produce long transcripts. Worth experimenting with auto-summary turns inserted into the log.
- Search. Conversations are searchable; voice calls now are too. But audio-anchored search (“the part where I mentioned the deploy”) is still out of scope.
For now: pick up the phone, talk to gptme, hang up, talk to it some more in the webui later. Same conversation. That’s the bar.