
Isn’t GPT-4o voice not audio-to-audio, but audio-to-text-to-audio?



It isn't released yet, but the new one they demoed is audio-to-audio. That's why it can do things like sing songs and detect emotion in voices.

The one that you can currently access in the ChatGPT app (with a subscription) is the old one, which is a cascaded ASR -> LLM -> TTS pipeline.
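To make the distinction concrete, here's a minimal sketch of that kind of cascaded pipeline built on the public OpenAI API (Whisper for transcription, a chat model for the reply, the TTS endpoint for speech). The model names, file names, and voice choice are illustrative assumptions, not what the ChatGPT app actually runs:

    # Hypothetical sketch of a cascaded ASR -> LLM -> TTS voice pipeline.
    # Model names and file paths are illustrative, not ChatGPT's internals.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # 1. ASR: transcribe the user's speech to text (tone, speakers,
    #    and background sound are lost at this step).
    with open("user_question.wav", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    # 2. LLM: generate a text reply from the transcript alone.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = reply.choices[0].message.content

    # 3. TTS: synthesize the reply; prosody is picked by the TTS model,
    #    not the LLM, which is why this mode can't really "express" emotion.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )
    with open("assistant_reply.mp3", "wb") as out:
        out.write(speech.read())

Each stage only sees the previous stage's text output, which is exactly the information loss the blog post quoted below describes.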


Are we sure it’s a single model behind the scenes doing that?

Practically it doesn’t really matter, but I’d like to know for sure.


It's the second paragraph in their announcement blog post. https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


I'm pretty sure you can already use the new GPT-4o audio-to-audio model, even without a subscription. You can even use the "Sky" voice if you didn't update your app.



