AudioPaLM: A large language model that can speak and listen

valine · on June 26, 2023

Preserving the speaker's voice after translation is super cool. It's one of the things you don't really think about, but voice inflection and identity are missing from our current translation tools.

I like to watch political speeches from non-English-speaking politicians, and the speaker's tone can easily be lost in translation. Emphasis is hard to discern when you don't know which spoken word maps to which word in the subtitles. Dubbed speeches are even worse in that respect.

brucethemoose2 · on June 26, 2023

Uh oh, here come the multimedia multimodels.

I see a lot of talk about transformer LLMs being close to "topping out," which I am skeptical of for many reasons, not the least of which is prompting/outputs other than pure text.

mupuff1234 · on June 26, 2023

Is there really a difference between a multimodal model and a text LLM + Stable Diffusion?

valine · on June 26, 2023

Multimodal can refer to a lot of different types of models, but feeding LLM text into Stable Diffusion definitely doesn't count.

LLaVA is the first one that comes to my mind; it takes images and text as input and outputs text.

There's an unreleased version of GPT-4 that can do the same thing.

mupuff1234 · on June 26, 2023

Sure, technically it's not the same, but won't the effect be the same?

How do our brains work? Isn't there a separation between image and text processing?

valine · on June 26, 2023

Surely there needs to be some amount of training with both models in the loop before it can be considered a multimodal system.

jamilton · on June 26, 2023

What do you mean?

mupuff1234 · on June 26, 2023

That if you have three separate models (text -> text, image -> text, and text -> image), you can just glue them together to make the whole thing behave like a multimodal model.

(Just like GPT-4 is rumored to be a few different submodels and not just one giant model.)

rhogar · on June 26, 2023

Additional discussion from a post two days ago: https://news.ycombinator.com/item?id=36443676

ml_basics · on June 26, 2023

Demo website with speech-to-speech translation examples: https://google-research.github.io/seanet/audiopalm/examples/

leobg · on June 26, 2023

Absolutely amazing. Anything like this available to play with?