Why is it that we still can't have perfect or near-perfect text-to-speech, given all the astonishing advances in ML taking place? Is TTS an area nobody is really interested in, or is it harder than generating beautiful pictures and sophisticated writing?
This thing by Apple already sounds way better than the best I'd heard previously (NextUp Ivona), but it is not an instant-result offline tool yet, and that's sad.
It's an extremely hard problem that lots of people are working on.
The trick is that we have "pretty good" results for TTS as-is, but they have significant shortcomings that are more visible in certain use cases. The operative word is "prosody" - the cadence, rhythm, and pauses that come naturally when speaking and that depend heavily on context and content.
Prosody is incredibly important to making natural utterances - TTS models that do not model prosody end up sounding very "flat", which describes most of the heavily used TTS engines out there right now. This is less glaring for short responses like what you would get from a voice assistant, but it becomes a huge, grating problem when you try to do long-form text reading.
The trick with prosody is that it often requires information and context not contained in the text to be read. You would apply a different rhythm and different stresses to a horror story than you would to a conference keynote speech, for example. It also requires a more sophisticated understanding of the content of the text, rather than simply its constituent words, in order to figure out proper stresses and pauses.
All of this is eminently solvable (as demonstrated here with the book voices) but is... rather difficult. I suspect we're not terribly close to a product where you can just feed it raw text (without annotating or otherwise providing additional data as context) and get a great result.
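To make the "annotating" part concrete: the usual workaround today is markup like SSML, where a human (or some upstream system) supplies the pacing and emphasis that the plain text doesn't carry. A minimal sketch below uses Amazon Polly only because it is an engine that accepts SSML; the voice name and prosody values are arbitrary placeholders, not a recommendation:

    import boto3

    # Plain text carries no pacing; the SSML tags supply it explicitly.
    ssml = """
    <speak>
      <prosody rate="slow" pitch="-10%">
        The house had been silent for years.
      </prosody>
      <break time="700ms"/>
      <emphasis level="moderate">Until tonight.</emphasis>
    </speak>
    """

    # Placeholder voice; any SSML-capable engine works the same way in principle.
    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",
        VoiceId="Joanna",
        OutputFormat="mp3",
    )

    with open("horror_line.mp3", "wb") as f:
        f.write(response["AudioStream"].read())

The point isn't the specific engine; it's that every bit of that markup is context a human had to add because the raw sentence doesn't contain it.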
I wonder how effective it would be to feed the book to some other AI model first that reads the whole thing and figures out the necessary context that it can then go back and feed into the TTS model
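Purely as a sketch of how that two-pass idea could work (the prompt, model name, and tag set here are made up for illustration): a first pass asks an LLM for coarse narration notes per chapter, which a second pass could then translate into prosody markup like the SSML example above.

    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set; the model name below is a placeholder

    def annotate_chapter(chapter_text: str) -> dict:
        """First pass: ask an LLM for coarse reading directions for one chapter."""
        prompt = (
            "You are preparing narration notes for an audiobook. "
            "For the chapter below, return JSON with keys "
            "'mood', 'pace' (slow|medium|fast), and 'notes' for the narrator.\n\n"
            + chapter_text
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    # Second pass (not shown): map 'mood' and 'pace' onto prosody markup
    # paragraph by paragraph and hand that to the TTS engine.

Whether the gain justifies a full extra pass over the book is an open question, but the context-extraction step itself is cheap compared to the synthesis.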
I wanted to make a human-like reading feature for our language-learning software. Training a model isn't too hard using something like https://github.com/coqui-ai/TTS.
The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20+ hours of material from varied sources, recorded in a good recording environment with a good mic, etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.
Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.
Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town...):
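For anyone curious what the data needs to look like: Coqui's recipes generally expect the LJSpeech layout, a folder of WAV clips plus a pipe-separated metadata.csv, and once you have a trained (or pretrained) model the high-level API is only a couple of lines. This is a sketch; the paths and model name are placeholders:

    # Expected LJSpeech-style layout:
    #   jenny_dataset/
    #     wavs/clip_0001.wav, clip_0002.wav, ...
    #     metadata.csv  ->  clip_0001|Raw transcription|Normalized transcription

    from TTS.api import TTS

    # Placeholder model name; swap in your own trained checkpoint or any
    # pretrained model listed by TTS().list_models().
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text="Testing the trained voice.", file_path="out.wav")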
Are visual generative models really that much more advanced, or could this simply be an artifact of how they're used?
With generative visual art, people usually spend considerable time fine-tuning the results, and we don't get to see all the prompts that didn't work out (except if the failure is notable in some way).
Try e.g. illustrating a book, but using only your first prompt for each image. I think the quality would be in the same ballpark as having Siri narrate the corresponding audiobook.
You’re describing the effects of familiarity with a subject.
Stable Diffusion, Midjourney, etc. look really pretty to the average person, but on closer inspection the results rarely hold up out of the box. If you're an experienced artist, you pick up on all the flaws right away.
ChatGPT and Copilot are similar. The answers seem confident, but the more familiar you are with the domain of the answer, the quicker you are to see how flawed the results are.
Now, going back to TTS: you've spent your whole life knowing what speech sounds like. Unlike those other domains, which take extra knowledge to judge, everyone innately knows the sound of humans speaking. So you're effectively, and subconsciously, a domain expert.
This is essentially the uncanny valley effect but for other areas.
ChatGPT and Stable Diffusion aren't perfect. They still produce weird responses or visual artifacts sometimes. But it can be easy to move past these idiosyncrasies.
I think the brain is just more sensitive to speech, because inflection and tone are a key part of communication. So even subtle artifacts in the generated voice are really obvious and annoying.
Plus, as another commenter mentioned, books are long. An issue in 1 out of 10,000 words will be enough to break immersion.
I don't find it easy to look past their idiosyncrasies at all, although they can produce impressive results with fiddling and luck.
These samples still sound robotic to me after just 10 seconds of listening. I can't imagine wanting to listen to a whole book like this given the option of an even modestly competent voice actor.
My uneducated opinion on the matter is that we are more tolerant of subtle errors in pictures and writing than we are in sound. Subtle variations of tone can change the meaning of a conversation in ways that words on paper just can't convey.
As a person who has listened to a number of non-fiction books narrated by Microsoft Sam, I don't really mind "subtle variations of tone" :-) This Apple thing will already satisfy me if they release it as an offline app for converting plain text files into audio files.
Because to understand intonation and rhythm you need to perfectly understand context and emotions. I don't doubt these things will be added soon enough, so I expect perfect reading by the end of this year, and perfect reading in anyone's voice from just a few samples in 2024.
> Why is it that we still can't have perfect or near-perfect text-to-speech
Define perfect ;) Two different people will read the same text slightly (or not slightly) differently.
A great example is this brilliant and funny rendition of "To be or not to be" by Tim Minchin, Benedict Cumberbatch, Judi Dench, David Tennant and others. Sorry for the Facebook link, but it's very hard to find this video anywhere else: https://www.facebook.com/watch/?v=585252039999241
> is it harder than generating beautiful pictures and sophisticated writing
I think one difference between pictures and audio is that pictures are two-dimensional and we can't take in the whole image at once. This makes it easy to overlook flaws without careful inspection. And I find that although there has been some amazing AI-generated art, there are still a lot of rough edges and tweaking required to get really clean images.
As far as writing goes, I suspect that the rules of written language are easier to learn and violations easier to overlook than with generated audio.