Hey HN, we've been working with OpenAI for the past few months on the new Realtime API.
The goal is to give everyone access to the same stack that underpins Advanced Voice in the ChatGPT app.
Under the hood it works like this:
- A user's speech is captured by a LiveKit client SDK in the ChatGPT app
- Their speech is streamed using WebRTC to OpenAI’s voice agent
- The agent relays the speech prompt over websocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over websocket) back to the agent
- The agent relays generated speech using WebRTC back to the user’s device
The Realtime API that OpenAI launched is the websocket interface to GPT-4o. This backend framework covers the voice agent portion. Besides having additional logic like function calling, the agent fundamentally proxies WebRTC to websocket.
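As a rough illustration of that proxying, here is a minimal Python sketch. It is not the actual agent code: the `asyncio.Queue` objects are stand-ins for the WebRTC and websocket transports, and every name in it (`pump`, `fake_model`, `run_agent`, `relay`) is hypothetical.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def pump(src, dst):
    """Forward frames from one transport to another until END is seen."""
    while (frame := await src.get()) is not END:
        await dst.put(frame)
    await dst.put(END)


async def fake_model(to_model, from_model):
    """Hypothetical stand-in for the model behind the websocket interface."""
    while (frame := await to_model.get()) is not END:
        await from_model.put(b"reply:" + frame)
    await from_model.put(END)


async def run_agent(from_user, to_user, to_model, from_model):
    """Agent: relay user -> model and model -> user concurrently."""
    await asyncio.gather(
        pump(from_user, to_model),   # WebRTC side -> websocket side
        pump(from_model, to_user),   # websocket side -> WebRTC side
    )


async def _demo(frames):
    from_user, to_user, to_model, from_model = (asyncio.Queue() for _ in range(4))
    for frame in frames:
        from_user.put_nowait(frame)
    from_user.put_nowait(END)

    await asyncio.gather(
        run_agent(from_user, to_user, to_model, from_model),
        fake_model(to_model, from_model),
    )

    # Drain what the "user device" would have received.
    received = []
    while (frame := to_user.get_nowait()) is not END:
        received.append(frame)
    return received


def relay(frames):
    """Run the whole pipeline for a list of byte frames."""
    return asyncio.run(_demo(frames))
```

Calling `relay([b"a", b"b"])` pushes two frames through the user -> agent -> model -> agent -> user path. A real agent would swap the queues for actual WebRTC tracks and a websocket connection, and add logic like function calling in between.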
The reason is that websocket isn’t the best choice for client-server communication. The vast majority of packet loss occurs between a server and a client device, and websocket, which runs over TCP, doesn’t give you programmatic control or a way to intervene in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.
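To make the latency point concrete, here is a toy Python model (a sketch with made-up numbers, not a measurement). It compares a reliable, ordered stream such as the TCP connection under a websocket with per-packet delivery such as the UDP-based media transport WebRTC typically uses. The function name `delivery_times` and the arrival times are assumptions for illustration only.

```python
def delivery_times(arrivals, in_order):
    """Return when each packet is handed to the application.

    arrivals[i] is the time (ms) packet i finally reaches the receiver,
    including any retransmission delay.

    in_order=True  models a reliable ordered stream: packet i cannot be
                   delivered before packets 0..i-1 have been delivered.
    in_order=False models per-packet delivery: each packet is handed over
                   as soon as it arrives.
    """
    if not in_order:
        return list(arrivals)

    delivered, latest = [], 0
    for t in arrivals:
        latest = max(latest, t)  # a late packet holds back everything behind it
        delivered.append(latest)
    return delivered


# Hypothetical 20 ms audio frames; packet 2 is lost once and its
# retransmission only arrives at 290 ms.
arrivals = [50, 70, 290, 110, 130]

ordered = delivery_times(arrivals, in_order=True)     # [50, 70, 290, 290, 290]
unordered = delivery_times(arrivals, in_order=False)  # [50, 70, 290, 110, 130]
```

In the ordered case, one lost packet stalls packets 3 and 4 as well, which shows up as added latency. In the unordered case they arrive on time, and the receiver is free to decide what to do about the missing packet (for example, conceal or skip it) instead of waiting.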
Or, have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill, which gets very annoying.
So many use cases for this.