Sam tweeted today: "also for clarity: the new voice mode hasn't shipped yet (though the text mode of GPT-4o has). what you can currently use in the app is the old version. the new one is very much worth the wait!"
Can any practitioners explain how this works for a natively multimodal model that handles text, voice, and vision? How do you ship only part of that? In my testing of GPT-4o, it insists that it uses DALL-E 2 to generate images, and the images show artifacts similar to those I've seen from DALL-E 2. My browser is also hitting an ab.chatgpt.com endpoint during conversations. Am I part of an experiment, or just under-informed on this subject?