Hacker News

An audio-to-audio model is definitely a step forward. And I do think that's where things are going to go, generally speaking.

For context on real-time voice AI: once you're below ~800ms round-trip, things feel naturally responsive for most people and use cases.

The GPT-4o announcement page says they average ~320ms time to first token from an audio prompt, which is next-level and really exciting. You can't get to 800ms with any pipeline that includes GPT-4 Turbo today, so this is a big deal.

It's possible to do ~500ms time to first token by pipelining today's fastest transcription, inference, and tts models. (For example, Deepgram transcription, Groq Llama-3, Deepgram Aura voices.)
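A back-of-the-envelope sketch of why pipelining gets you to ~500ms: in a streaming pipeline, each stage starts as soon as it receives the first chunk from the previous stage, so time to first audio is roughly the sum of per-stage first-chunk latencies, not total durations. The numbers below are illustrative assumptions, not measured figures for the services named above.

```python
def time_to_first_audio(stages):
    """Streaming pipeline: each stage starts on the previous stage's
    first chunk, so time to first audio is the sum of per-stage
    first-chunk latencies."""
    return sum(first_chunk for first_chunk, _ in stages)

def serial_latency(stages):
    """Non-streaming pipeline: each stage waits for the previous one
    to finish, so latency is the sum of total stage durations."""
    return sum(total for _, total in stages)

# (time-to-first-chunk, total duration) per stage, in ms.
# Illustrative numbers only:
stages = [
    (150, 400),  # streaming speech-to-text
    (200, 900),  # LLM time to first token
    (150, 700),  # text-to-speech first audio chunk
]

print(time_to_first_audio(stages))  # streamed: 500 ms
print(serial_latency(stages))       # serial: 2000 ms
```

The gap between the two numbers is the whole argument for streaming every stage rather than waiting on full transcripts or full completions.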




Every opening phrase is a platitude like 'sure, let's do it'. So OpenAI's real latency is probably higher, and they're using clever orchestration to generate filler tokens that mask it. It's unlikely the initial response is coming from the main model.


I'm familiar with Deepgram, Groq, and Eleven Labs. I recently built something on them and the latency really isn't too bad. But OpenAI has shown that audio-to-audio can't be beat.


Gazelle is (or will be) significantly faster than that.



