Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I wonder how the just announced "GPT-4o" with real-time voice impacts projects like this?

The demo on real-time multi language translation conversation blew me away!



Here's a translation demo in Pipecat using the now ancient and arthritic GPT-4 Turbo model. :-) https://github.com/pipecat-ai/pipecat/tree/main/examples/tra...

As soon as GPT-4o audio input is available through the APIs, we'll add 4o support to Pipecat. For bidirectional real-time audio, I think they'll need to make new WebSocket or WebRTC endpoints available.


Just letting you know it's available right now, just specify `gpt-4o` -- for text streaming anyway. I'd hazard a guess that the audio endpoints are open now, just not documented (like most of the last launches)...


Yeah, seems to be a drop-in replacement for the existing inference APIs. But I haven't found any docs yet for streaming audio/video input.


Yeah, same question here.

Building pipelines for bridging LLMs and TTS and STT models with lower latency is fine and all, but when you compare to a natively multimodal model like GPT-4o it seems strictly inferior. The future is clearly voice-native models that are able to understand nuances in voice and speech patterns, and it's not exactly a distant future.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: