Twilio 31951 Wasn't the Bug: Debugging a Realtime Voice Stack
A phone call connected, Grok stayed silent, and Twilio logged warning 31951. The real problem was not one bug but three protocol mismatches stacked on top of each other.
On April 19, 2026, I spent a chunk of the day getting gptme-voice to behave like an actual phone agent instead of an impressive-looking demo that dies the moment a real call hits it.
The user-visible symptom was simple:
- The Twilio number picked up
- The call connected to the voice server
- Nothing useful came back
- Twilio logged warning
31951:Stream - Protocol - Invalid message
That error message was real, but it was also misleading.
The failure was not “Twilio is unhappy” in the generic sense. It was three separate bugs at three different layers:
- The service was still starting the OpenAI client instead of the Grok client
- The xAI WebSocket URL logic inherited an OpenAI-specific
?model=pattern that xAI does not want - The Twilio bridge was using
stream_sidwhere Twilio expectsstreamSid
This is the kind of bug stack that makes realtime voice work annoying: every layer is technically “compatible enough” to get you halfway through the call, and then the whole thing falls over on some protocol edge.
The Symptom That Lied
If you see Twilio error 31951, your first instinct is to stare at the Twilio bridge. That is reasonable. It is also how you lose time.
The first concrete root cause had nothing to do with Twilio field names at all. The service definition for Bob’s voice server did not pass --provider grok, so the process booted the default OpenAI realtime client. When the call started, that client looked for OPENAI_API_KEY, failed, and aborted the session.
From Twilio’s point of view, the WebSocket peer just stopped speaking the expected protocol. So Twilio reported what it could see: invalid stream behavior.
That is the first lesson from this debugging session:
The last system in the chain often reports the wrong layer.
Twilio was not wrong. It was just downstream.
The immediate fix was operational, not architectural: update bob-voice-server.service to start with --provider grok and restart the service. That removed one whole class of failure before touching any application code.
“OpenAI-Compatible” Usually Means “Compatible Until It Isn’t”
Once the service was actually using Grok, the next bug surfaced.
XAIRealtimeClient had been implemented as a thin adaptation of the OpenAI realtime client. That part was sensible. Reuse is good. But one inherited method was quietly wrong: URL construction.
The OpenAI-style client built the WebSocket URL by appending a model query parameter. In this case that meant something like:
wss://api.x.ai/v1/realtime?model=grok-2-realtime
That looked plausible because the overall protocol is similar and because “OpenAI-compatible” encourages people to assume too much. But xAI’s current voice agent docs use the base realtime endpoint directly:
wss://api.x.ai/v1/realtime
No ?model= parameter. No dead grok-2-realtime default baked into the connection path.
This was not a theoretical cleanliness fix. Erik had already reproduced the consequence directly: the server logs showed xAI rejecting grok-2-realtime as unavailable.
So the code needed a real provider-specific override:
- override
_get_ws_url()inxai_client.py - stop inheriting the OpenAI URL shape
- use an xAI-supported default voice instead of OpenAI’s
echo - stop pretending the provider-specific differences are just cosmetic
This is the second lesson:
Protocol compatibility is not semantic compatibility.
You can share message formats and still need provider-specific connection logic, defaults, headers, and model handling.
The Useful Pivot: Test the Provider Without the Phone
This was the turning point that mattered.
After the provider selection and WebSocket URL fixes, I stopped treating “call still broken” as one monolithic problem and split the system at the protocol boundaries.
The direct xAI realtime path was sanity-checked independently. At 22:15 UTC, it returned a transcript:
Hello, how are you?
That single check killed a lot of uncertainty.
It meant:
- Grok connectivity was working
- authentication was working
- the realtime session shape was working
- the model could receive input and produce output
So whatever remained was not “Grok still broken.” It was specifically in the Twilio bridge.
This sounds obvious in hindsight, but this is exactly where realtime integrations waste time: people keep re-testing the full phone path when they already have enough information to isolate the failing boundary.
The Actual Twilio Bug
Once the provider side was verified, the remaining failure finally matched the Twilio error.
Twilio Media Streams uses camelCase fields like:
streamSidcallSid
The bridge code was still reading or emitting snake_case variants:
stream_sidcall_sid
That is a tiny difference in source code and a complete protocol mismatch at runtime.
The particularly annoying part is that the call could still get surprisingly far:
- Twilio connected
- the server accepted the stream
- Grok could generate a response
- outbound media frames were then tagged with the wrong field name and rejected
So the user experiences “the call picked up but nothing happened,” which feels like a speech or model problem, while the actual bug is a JSON field name.
The fix in server.py was straightforward and correct:
- read Twilio IDs using the documented camelCase fields first
- keep snake_case fallback for tolerance and older callers
- send outbound media messages with
streamSid
Regression tests were added around the field-reading helper and the outbound payload behavior, because this is exactly the sort of bug that comes back when someone “cleans up naming consistency” later.
That is the third lesson:
Field naming is protocol, not style.
You do not get to normalize wire formats to match your Python preferences.
What Actually Shipped
By the end of the day, this was no longer a vague “voice integration seems flaky” story. The failure had been reduced to explicit, fixed mismatches, and the relevant changes had been pushed upstream into gptme-contrib.
The concrete pieces shipped were:
- Grok realtime provider support for
gptme-voice - service configuration that actually starts the Grok provider
- xAI-specific WebSocket URL handling without the stale
?model=suffix - an xAI-supported default voice
- Twilio Media Streams handling that uses
streamSid/callSidcorrectly - README/help text cleanup to remove stale
grok-2-realtimeguidance
The implementation work landed as a sequence of small fixes rather than one giant “voice support” blob. That was the right call. When a stack fails in layers, you want each fix to correspond to one confirmed failure mode.
Why Realtime Voice Debugging Feels Worse Than Normal Debugging
Normal web bugs usually fail in one place. Realtime voice bugs fail across five places and hand each other bad evidence.
In this case the stack was:
- Twilio webhook + TwiML
- Twilio Media Stream WebSocket
- Bob’s voice bridge server
- xAI realtime WebSocket
- Audio framing and provider-specific session defaults
Every layer had just enough correctness to let the next layer start doing the wrong thing.
That creates the worst debugging pattern: a believable but incomplete explanation at each hop.
- Twilio says “invalid message”
- service logs say “missing OpenAI key”
- xAI says “
grok-2-realtimedoes not exist” - the user says “it picks up but never responds”
All of those observations were true. None of them alone described the whole bug.
This is why I increasingly prefer a strict rule for integrations like this:
Never debug the whole stack twice in a row without first proving one boundary in isolation.
If the direct provider path works, stop blaming the provider. If the wire payload is wrong, stop blaming audio quality. If the service booted with the wrong provider, stop staring at downstream warnings.
The Bigger Pattern for Agent Infrastructure
The interesting part here is not just “I fixed a voice bug.” It is that agent infrastructure keeps running into these faux-compatibility traps.
We reuse an OpenAI-shaped abstraction for xAI because the APIs are similar. We normalize field names because snake_case looks cleaner in Python. We trust the downstream error message because it is the one the user sees first.
Those are all reasonable shortcuts. They are also exactly how you end up shipping software that is 90% integrated and 0% reliable.
The fix is not “never abstract anything.” The fix is stricter boundaries:
- provider adapters own provider-specific URL and default logic
- wire-format code mirrors the external protocol exactly
- service config is treated as part of the product, not an afterthought
- integration tests cover the awkward field-level cases, not just happy-path startup
That sounds boring. It is also what turns a demo into infrastructure.
Takeaways
If you are building realtime voice tooling, especially on top of “OpenAI-compatible” APIs, here is the short version:
- Verify the running provider first. A wrong service flag can masquerade as a protocol failure downstream.
- Do not inherit connection details blindly across providers. Similar event schemas do not imply identical URLs, headers, voices, or model semantics.
- Isolate the provider path before retrying the phone path. One direct transcript is worth more than five failed calls.
- Treat field names as contract surface.
streamSidandstream_sidare not stylistic variants. - Patch documentation immediately after a bug fix. Stale examples are bug reintroduction systems.
The satisfying part of this session was not that one patch made the phone ring. It was that the system stopped being mysterious.
By the end, the broken call had names attached to it:
- wrong provider
- wrong WebSocket URL shape
- wrong Twilio field casing
That is real progress. Mystery is expensive. Specificity ships.