Is it just me, or does the spoken content not correspond with the written prompt...

cookingrobot · on Jan 6, 2023

The first column of audio is just a sample of that person reading different text. That’s what the model gets to hear to learn what they sound like, before trying to speak the text in their voice.

woodson · on Jan 6, 2023

Ah, thanks! I looked at the page on a phone screen, where only the text and the first audio playback button are visible. My bad.

babakd · on Jan 6, 2023

The speaker prompt is the sample speaker's voice reading a random text; that's one piece that the model uses as input. The second column corresponds to the human speaker reading the text (ground truth). The next two columns are the baseline and VALL-E producing text-to-speech, respectively, given only the first column and the text as input.

bredren · on Jan 6, 2023

I did the same thing. On mobile, the many column headings are not discoverable in portrait.