Yeah, that's a really good way of framing the argument; I wish I'd written that. The way robots listen/respond is bounded by compute, not time. Buffering audio isn't a great experience for humans but definitely works for robots.
To clarify, I meant waiting an extra 200ms if the alternative was dropping part of the prompt. During periods of zero congestion, the latency would be the same.
1. Of course users want lower latency, but they also want fewer instances where the LLM "misheard" them. It would be amazing to run A/B experiments on the trade-off between latency and quality, but WebRTC makes that knob difficult to turn.
2. I'm obviously not a TTS expert, but what benefit is there to trickling out the result? The silicon doesn't care how quickly the timestamp increments, right?
3. Yeah, sometimes the client is aware when its IP changes and can do an ICE renegotiation. But often it isn't aware and would normally rely on the server detecting the change, which isn't possible with your LB setup. It's not a big deal, just unfortunate given how many hoops you have to jump through already.
4. Okay, so that draft means 7 RTTs instead of 8? Again, some can be pipelined, so the real number is a bit lower. But the real issue is the mandatory signaling server, which causes a double TLS handshake just in case P2P is being used.
5. Of course WebRTC is easier for a new developer because it's a black box conferencing app. But for a large company like OpenAI, that black box starts to cause problems that really could be fixed with lower level primitives.
I absolutely think you should mess around with RTP over QUIC and would love to help. If you're worried about code size, the browser (and one day the OS) provides the QUIC library. And if you switch to something closer to MoQ, QUIC handles fragmentation, retransmissions, congestion control, etc. Your application ends up being surprisingly small.
The main shortcoming with RoQ/MoQ is that we can't implement GCC (Google Congestion Control) because QUIC is already congestion controlled (including datagrams). We're stuck with CUBIC/BBR when sending from the browser for now.
Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.
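A minimal browser-side sketch of that approach, assuming WebCodecs for the encoding (the endpoint and bitrate are made up, and MediaStreamTrackProcessor is Chromium-only today):

```ts
// Hypothetical endpoint; any WebSocket server that accepts binary frames works.
const ws = new WebSocket("wss://example.com/audio");
ws.binaryType = "arraybuffer";

// Encode the mic with WebCodecs and ship each Opus packet the moment it exists.
// If a packet is lost on the wire, TCP retransmits it; under congestion,
// packets queue up instead of vanishing.
const encoder = new AudioEncoder({
  output: (chunk) => {
    const buf = new ArrayBuffer(chunk.byteLength);
    chunk.copyTo(buf);
    if (ws.readyState === WebSocket.OPEN) ws.send(buf);
  },
  error: (e) => console.error(e),
});
encoder.configure({ codec: "opus", sampleRate: 48000, numberOfChannels: 1, bitrate: 24_000 });

const media = await navigator.mediaDevices.getUserMedia({ audio: true });
const processor = new MediaStreamTrackProcessor({ track: media.getAudioTracks()[0] });
for await (const frame of processor.readable) {
  encoder.encode(frame);
  frame.close();
}
```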
This is how Nova Sonic works. Having done some implementations, it's trickier than you might like (e.g. the Python library for Sonic had problems with echoes and we had to use the Java library).
You ultimately still need a jitter buffer large enough to absorb retransmissions; otherwise you've got stuttering audio. And dynamically adjusting this jitter buffer is hard.
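For what it's worth, here's a toy fixed-delay jitter buffer just to make that concrete (the 20ms/100ms numbers are invented; real implementations adapt the delay to measured jitter, which is exactly the hard part):

```ts
interface Packet { seq: number; payload: Uint8Array }

class JitterBuffer {
  private packets = new Map<number, Packet>();
  private nextSeq = 0;
  private baseMs: number | null = null; // arrival time of the first packet

  constructor(private frameMs = 20, private targetDelayMs = 100) {}

  push(p: Packet, nowMs: number) {
    if (this.baseMs === null) this.baseMs = nowMs;
    if (p.seq >= this.nextSeq) this.packets.set(p.seq, p); // ignore too-late duplicates
  }

  // Called by the playout clock. Each packet's slot is its ideal time plus
  // the target delay, which gives retransmissions a window to slot back in.
  // Returns null before the slot is due, or null (a stutter) if the packet
  // still hasn't arrived by then.
  pop(nowMs: number): Packet | null {
    if (this.baseMs === null) return null;
    const due = this.baseMs + this.nextSeq * this.frameMs + this.targetDelayMs;
    if (nowMs < due) return null;
    const p = this.packets.get(this.nextSeq) ?? null;
    this.packets.delete(this.nextSeq);
    this.nextSeq++; // advance even on loss so playout never stalls forever
    return p;
  }
}
```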
I'm not an expert. Can't we exploit the fact that LLMs don't need to receive audio as a continuous, uninterrupted stream? Couldn't we just send the data and pipe it into the LLM, with deduplication if resends happen?
You’re absolutely correct. A jitter buffer is necessary for a human listener, but an LLM isn’t aware of a time lapse, just like it isn’t aware of the time since your last message in the conversation (unless the chat harness explicitly informs it).
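So the receive path can collapse to dedupe + reorder, with no playout clock at all. A sketch, with `sendToModel` standing in for whatever ingestion API you're using:

```ts
const seen = new Set<number>();
const pending = new Map<number, Uint8Array>();
let next = 0;

// Call this for every chunk that arrives, including resends.
function onChunk(seq: number, payload: Uint8Array, sendToModel: (b: Uint8Array) => void) {
  if (seen.has(seq)) return; // duplicate from a retransmission: drop it
  seen.add(seq);
  pending.set(seq, payload);
  // Flush everything contiguous so far; the LLM doesn't care that chunk 7
  // sat around for 400ms waiting on chunk 6.
  while (pending.has(next)) {
    sendToModel(pending.get(next)!);
    pending.delete(next);
    next++;
  }
}
```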
1.) Latency vs. quality doesn't come up enough to make people want to A/B test it, unfortunately. At work I'd say ~5 people care about WebRTC vs. QUIC vs. X. All the effort goes into the models (how can I provide tools to support those doing that work?).
2.) The model isn't processing just text anymore. It's also taking breathing/emotion etc. into account, not just spitting out big responses. As it generates a response, it's taking the user's speech into account.
3.) It works with the LB setup today. Clients are sending ICE traffic; if one roams, we look up the ufrag and route appropriately.
4.) With DTLS 1.3 it's 1 RTT with SNAP [0] for a WebRTC session. SCTP info goes in the Offer/Answer, and DTLS is packed into ICE. You're totally right about signaling though! [1] was my answer for doing WebRTC without signaling; I couldn't get anyone to care though.
5.) I don't have anything that I need to tune. If I want to increase (or decrease) latency, [3] is something I put into a Transceiver. Otherwise I can't think of any 'change this WebRTC behavior' request that users/developers have made.
Human spoken conversation doesn’t really work like file buffering.
People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.
But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.
For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.
> People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.
LLMs are surprisingly good at this, too.
This entire blog post is based on the assumptions that
1) WebRTC garbling is common
2) LLMs fall apart if there are any audio glitches
I would bet money that OpenAI explored both of those and has statistics on how they impact the service. That's more than this blogger, who heaps snark upon snark to avoid having a realistic conversation about the pros and cons.
If I'm talking to a friend or peer and I'm on a crappy link, we can probably work it out. If I'm calling my lawyer from prison with my "one call" I really want my lawyer to get my instructions clearly and correctly, ideally the first time without a lot of coaching.
Where on this scale does "person talking to LLM" fit?
I believe there's a ton of research into the Shannon limit and human speech. You can trivially observe how much redundancy there is by listening to a podcast at 1x, 1.2x, 1.5x, 2x, etc.; when you can't follow what's going on, you've found the "redundancy" built into that language. This number falls way off when you're listening to a person with an accent or when the recording is noisy or whatever.
You'll also find that your tolerance for lossy media is radically different based on latency and echoes and jitter in the audio (which I believe is the point of the original "don't use webrtc" article...)
Finally, people may tolerate this, but the "phoneme to token" thinger may be less tolerant, and it certainly won't be able to magic correct meaning out of lost packets. And if the resulting exchange is extremely expensive or important (from the lawyer and the "I'm in jail in Poughkeepsie; I need bail!" exchange), you really want to take the time to get it right, not make things guess.
Misunderstanding the user comes down to how the user prompts and what kind of responses they get in return. I'm wondering if, for anything code-related, the LLM could have an interpreter that first reads what the user wrote, translates it into proper sentence structure, and then returns a truer value. I think the LLM has an understanding issue because everyone has a unique signature in how they explain something. That signature operates like a personal language: most of us run through different scenarios and reach a conclusion from within that personal signature/language. And since LLMs are tuned to get to the answer faster using fewer tokens, they probably pick the average high-level signature that can be used across many users.
Hello, Mr. Author here. Apologies that my comment replies aren't as funny.
Every low-latency application has to decide the user-experience trade-off between quality and latency. Congestion causes queuing (aka latency), and to avoid that, something needs to be skipped (lower quality).
The WebRTC latency vs. quality knob is fixed. It's great at minimizing latency, but suffers from a lack of flexibility. We still (try to) use WebRTC anyway, because like you implied, browser support has made it one of the only options.
Until now of course! WebTransport means you can achieve WebRTC-like behavior via a generic protocol. Choose how long you want to wait before dropping/resetting a stream, instead of that decision being made for you.
And yeah my point in the blog is that often the user wants streaming, but not dropping. Obviously you can stream audio input/output without WebRTC. The application should be able to decide when audio packets are lost forever... is it 50ms or 500ms or 5000ms? My argument is that voice AI shouldn't pick the 50ms option.
QUIC libraries work by looping over pending streams (in priority order) to determine which UDP packet to send next. If there's more stream data than the congestion window allows, the data sits in the stream's send buffer.
Either side can abort a stream if it's taking too long, clearing the send buffer and officially dropping the data. It's a lot more flexible than opaque UDP send buffers and random packet loss.
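Here's roughly what that looks like from the browser with WebTransport (a sketch, not production code; `deadlineMs` is the knob, and whether it's 50, 500, or 5000 is now the application's call):

```ts
// Send one audio chunk on its own unidirectional stream. If it hasn't
// finished within the deadline, abort the stream: QUIC clears the send
// buffer and stops retransmitting, and the receiver sees a reset.
async function sendWithDeadline(wt: WebTransport, chunk: Uint8Array, deadlineMs: number) {
  const stream = await wt.createUnidirectionalStream();
  const writer = stream.getWriter();
  const timer = setTimeout(() => writer.abort("deadline exceeded"), deadlineMs);
  try {
    await writer.write(chunk);
    await writer.close(); // made it: the data was delivered reliably
  } catch {
    // Aborted: this chunk is officially dropped and the receiver moves on.
  } finally {
    clearTimeout(timer);
  }
}
```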
FEC would make the most sense at the QUIC level because random packet loss is primarily hop-by-hop. But I'm not aware of any serious efforts to do that. There's a lot of ideas out there, but TBH MoQ is too young to have the production usage required to evaluate a FEC scheme.
But a huge difference is that there's a plan for congestion. We heavily rely on QUIC to drain network queues and prioritize/queue media based on importance. It's doable with multicast+unicast, but complicated.
You can convert any push-based protocol into a pull-based one with a custom protocol to toggle sources on/off. But it's a non-standard solution, and soon enough you have to control the entire stack.
The goal of MoQ is to split WebRTC into 3-4 standard layers for reusability. You can use QUIC for networking, moq-lite/moq-transport for pub/sub, hang/msf for media, etc. Or don't! The composability depends on your use case.
And yeah, lemme know if you want some help/advice on your QUIC-based solution. Join the Discord and DM @kixelated.
Hey lewq, 40Mbps is an absolutely ridiculous bitrate. For context, Twitch maxes out around 8.5Mb/s for 1440p60. Your encoder was poorly configured, that's it. Also, it sounds like your mostly static content would greatly benefit from VBR; you could get the bitrate down to 1Mb/s or something for screen sharing.
And yeah, the usual approach is to adapt your bitrate to network conditions, but it's also common to modify the frame rate. There's actually no requirement for a fixed frame rate with video codecs. It also means you could do the same "encode on demand" approach with a codec like H.264, provided you're okay with it being low FPS on high-RTT connections (poor Australians).
Overall, using keyframes only is a very bad idea. It's how the low quality animated GIFs used to work before they were secretly replaced with video files. Video codecs are extremely efficient because of delta encoding.
But I totally agree with ditching WebRTC. WebSockets + WebCodecs is fine provided you have a plan for bufferbloat (e.g. adaptive bitrate (ABR), GoP skipping).
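One possible shape for that plan, assuming WebCodecs over a WebSocket (the endpoint, codec string, and 256 KiB threshold are invented for illustration): check the socket's send buffer before encoding each frame, and drop raw frames when it's backed up rather than letting encoded data queue behind a slow link.

```ts
const ws = new WebSocket("wss://example.com/video");
ws.binaryType = "arraybuffer";

const encoder = new VideoEncoder({
  output: (chunk) => {
    const buf = new ArrayBuffer(chunk.byteLength);
    chunk.copyTo(buf);
    ws.send(buf);
  },
  error: (e) => console.error(e),
});
encoder.configure({ codec: "avc1.42E01E", width: 1280, height: 720, bitrate: 1_000_000 });

// bufferedAmount is the bytes queued locally but not yet handed to the
// network. If it's growing, the link is congested: skip frames instead of
// adding latency.
const MAX_BUFFERED = 256 * 1024;
function onFrame(frame: VideoFrame) {
  if (ws.bufferedAmount > MAX_BUFFERED) {
    frame.close(); // dropped before encoding, so the delta chain stays intact
    return;
  }
  encoder.encode(frame);
  frame.close();
}
```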