I’ve always been curious about why seemingly so little work has gone into speech generation, compared to graphics, physics, and other technologies.
Problems with voice actors:
• The cost, which is often unaffordable for solo developers and small teams.
• All the text and scripts have to be finalized and set in stone. You can’t change much, if anything, after you’ve finished recording everything. Being able to re-hire the same actors for future sequels isn’t guaranteed either.
• Complete lack of real-time customization, for example in MMORPGs and other online games. We’re stuck with a few set phrases or with using our own voice (which isn’t desirable all the time for everyone). Imagine if you could customize your voice as fully as your character’s appearance and have it realistically speak any text you type, or NPCs actually speaking your name instead of a generic placeholder.
• Players expect voice acting in all “major” titles. Plenty of games with great stories get skipped over if they don’t have voice acting.
Wow, it was not on my radar that AI could replace voice actors' jobs, but your comment has opened my eyes to the idea.
To give an example: in Spain (I'm Spanish), the much beloved voice actor for Homer Simpson died after several seasons of The Simpsons. They had to replace him, of course, with another actor. Yet I know that many people in my circle stopped watching The Simpsons at that point because they couldn't stand the new voice. Regardless of whether the new voice was "worse" or "better", it was certainly different, and we humans get weirded out by that.
In light of this, I could see animated pictures with fully AI-rendered voices. Incredible.
Lyrebird allows you to create a digital voice that sounds like you with only one minute of audio.
The fake Donald Trump voice is frighteningly close.
This definitely has repercussions for fake news and trust in general. We just won't be able to trust what we see (computer-rendered images) or hear (rendered voices).
That does raise the question, though: would a voice actor have legal grounds to stop such a reproduction?
The issue comes first from the lack of assets. Voice samples you can use for speech synthesis aren't easily available, apart from your own voice and maybe a few friends'. Then there's always been the issue that speech synthesis wasn't very good (we had working speech synthesis back in 1985 on the Amiga Workbench, but it was pretty much what you'd expect in terms of quality), and "not good enough" can break the experience.
On top of that, speech synthesis is not an easy problem; it's squarely in the realm of machine learning, and video game development hasn't touched machine learning much so far. I think there are a ton of applications for machine learning in games (such as less stupid, more human-like AI), and speech synthesis is definitely one of them.
Though speech synthesis specifically might be a good use case for simpler ML, such as forming the final sound from processed phonemes (i.e., acting as a fuzzy lookup table).
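To make that concrete, here's a minimal sketch of the fuzzy-lookup idea (all data and names are made up, and this is essentially unit-selection synthesis rather than anything Tacotron-like): describe each phoneme in context with a small feature vector, pick the stored audio snippet whose features are closest, and concatenate.

    import numpy as np

    # Hypothetical database of recorded units: a feature vector per unit
    # (phoneme identity, context, pitch, ...) and a short waveform snippet.
    unit_features = np.random.rand(500, 16)                  # 500 units, 16 features each
    unit_audio = [np.random.randn(800) for _ in range(500)]  # ~50 ms each at 16 kHz

    def synthesize(phoneme_features):
        """Concatenate the closest stored snippet for each input phoneme."""
        out = []
        for f in phoneme_features:
            dists = np.linalg.norm(unit_features - f, axis=1)  # the "fuzzy" match
            out.append(unit_audio[int(np.argmin(dists))])
        return np.concatenate(out)

    utterance = synthesize(np.random.rand(10, 16))  # 10 phonemes in, raw audio out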
AoE 2 let players make their own custom AIs. They took one of the best fan-made AI scripts and made it the default AI in the new release, and players loved it, because it provided a much harder, more human-like challenge and didn't cheat like the old AI did.
Obviously you shouldn't just train an NN to instantly headshot players with optimal strategy. You can add realistic handicaps to the AI, like noisy controls and human reaction times, so it isn't superhuman.
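As a rough illustration of what such handicaps might look like (all numbers and names here are invented tuning parameters, not from any real engine):

    import random

    REACTION_TIME = 0.25    # seconds before the bot starts responding at all
    MAX_TURN_SPEED = 180.0  # degrees per second it is allowed to turn
    AIM_JITTER_STD = 2.0    # degrees of Gaussian noise on the perceived target

    class HumanizedAim:
        def __init__(self):
            self.time_target_visible = 0.0
            self.aim_angle = 0.0

        def update(self, dt, true_target_angle):
            self.time_target_visible += dt
            # Do nothing until the simulated reaction time has elapsed.
            if self.time_target_visible < REACTION_TIME:
                return self.aim_angle
            # Perceive the target with noise, then turn toward it no faster
            # than a human could flick the mouse.
            perceived = true_target_angle + random.gauss(0.0, AIM_JITTER_STD)
            error = perceived - self.aim_angle
            max_step = MAX_TURN_SPEED * dt
            self.aim_angle += max(-max_step, min(max_step, error))
            return self.aim_angle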
GOAP is still pretty much the end-all of game AI, but it can be pretty tricky to work with.
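For anyone unfamiliar with GOAP, the core idea is small: actions declare preconditions and effects over a symbolic world state, and a planner searches for an action sequence that reaches the goal. A toy sketch of that mechanism (real GOAP implementations usually search backwards from the goal with A* and per-action costs; this forward BFS just shows the idea):

    from collections import deque
    from typing import NamedTuple

    class Action(NamedTuple):
        name: str
        preconditions: dict  # facts that must hold before the action
        effects: dict        # facts the action sets afterwards

    def plan(start, goal, actions):
        """Breadth-first forward search; returns the shortest action list."""
        frontier = deque([(start, [])])
        seen = {tuple(sorted(start.items()))}
        while frontier:
            state, path = frontier.popleft()
            if all(state.get(k, False) == v for k, v in goal.items()):
                return path
            for a in actions:
                if not all(state.get(k, False) == v for k, v in a.preconditions.items()):
                    continue
                nxt = {**state, **a.effects}
                key = tuple(sorted(nxt.items()))
                if key not in seen:
                    seen.add(key)
                    frontier.append((nxt, path + [a.name]))
        return None

    actions = [
        Action("pick_up_axe", {"axe_nearby": True}, {"has_weapon": True}),
        Action("approach_enemy", {}, {"in_range": True}),
        Action("attack", {"has_weapon": True, "in_range": True}, {"enemy_dead": True}),
    ]
    print(plan({"axe_nearby": True}, {"enemy_dead": True}, actions))
    # -> ['pick_up_axe', 'approach_enemy', 'attack']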
While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises.
Anyone remember the company?
These results seem fantastic nonetheless! Looking at how WaveNet progressed, I anticipate they'll be able to optimize the system to real-time generation or better within the next year.
I do think the difference would become obvious with a paragraph or more of speech, though. It's difficult to judge what the correct intonation should be for these single sentences without context. Ultimately, correct intonation requires a complete understanding of meaning, which is still out of reach. An audiobook read by Tacotron 2 would still sound strange.
I thought the same thing initially -- I guess it fooled a few of us with that first one!
It's 1,000 hours of audiobook readings, segmented by sentence, with transcripts. All from Project Gutenberg, so maybe a little heavy on Victorian bodice rippers and such, but certainly a great trove of training data...
Very cool stuff.
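If that's the LibriSpeech corpus (which matches the description), torchaudio ships a loader for it, so getting sentence-level audio and transcripts is only a few lines; the split name and field order below follow torchaudio's documented API, but double-check against your version:

    import torchaudio

    # "train-clean-100" is the smallest split (~100 h); there are also
    # train-clean-360 and train-other-500 for the full ~1000 hours.
    dataset = torchaudio.datasets.LIBRISPEECH("./data", url="train-clean-100",
                                              download=True)

    waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
    print(sample_rate, transcript)  # 16 kHz audio plus the aligned text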
"We manually analyze the error modes of our system on the custom 100-sentence test set from Appendix E of . Within the audio generated from those sentences, 0 contained repeated words, 6 contained mispronunciations, 1 contained skipped words, and 23 were subjectively decided to contain unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. In one case, the longest sentence, end-point prediction failed."
For a production GCP API, I think faster-than-real-time generation would be necessary.
For example, WaveNet took a year to go from research to production in Google Assistant: https://deepmind.com/blog/wavenet-launches-google-assistant/
"We train all models on an internal US English dataset, which contains 24.6 hours of speech from a single professional female speaker."
For ~24 hours that would be about 7 million frames of 80 values each, i.e. roughly 560 million data points, or about 2 GB of raw data (assuming 32-bit floats).
E.g. ImageNet is 50 GB of compressed data, and there are many much larger datasets in practical use.
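A quick back-of-envelope check of that estimate, assuming Tacotron 2's 12.5 ms mel-spectrogram hop and 80 mel channels stored as 32-bit floats:

    hours = 24.6
    frames = hours * 3600 / 0.0125   # ~7.1 million spectrogram frames
    values = frames * 80             # ~570 million mel values
    gigabytes = values * 4 / 1e9     # ~2.3 GB of raw features
    print(f"{frames:.2e} frames, {values:.2e} values, {gigabytes:.1f} GB")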