Tacotron 2: Generating Human-Like Speech from Text (googleblog.com)
290 points by stablemap on Dec 19, 2017 | 54 comments

One of the domains where realistic and fully customizable speech synthesis could usher in a renaissance of innovation would be video game development.

I’ve always been curious about how so seemingly little work has gone into speech generation, compared to graphics and physics and other technologies.

Problems with voice actors:

• The costs, often unaffordable by solo developers and small teams.

• All the text and the scripts have to be finalized and set in stone. You can’t change much, if anything, after you’ve finished recording everything. Being able to re-hire the same actors for future sequels is also not guaranteed.

• Complete lack of real-time customization, for example in MMORPGs and other online games, so we’re stuck with a few set phrases or using our own voice (not desirable all the time for everyone). Imagine if you could customize your voice as fully as your character’s appearance and have it realistically speak any text you type, or have NPCs actually speak your name instead of using generic placeholders.

• Players expect voice acting in all “major” titles. Plenty of games with great stories get skipped over if they don’t have voice acting.

> Being able to re-hire the same actors for future sequels is also not guaranteed.

Wow, it was not on my radar that AI could replace voice actors' jobs, but your comment has opened me up to this idea.

To give an example: in Spain (I'm Spanish), the much-beloved voice actor for Homer Simpson died after several seasons of The Simpsons. Of course, they had to replace him with another actor. Yet I know that many people in my circle stopped watching The Simpsons at that point because they couldn't stand the new voice. Regardless of whether the new voice was "worse" or "better", it was certainly different, and we humans get weirded out by that.

In light of this, I could see animated pictures with fully AI-rendered voices. Incredible.

I seem to recall a company doing this for NPR personalities. It makes a lot of sense -- you have one person doing a boatload of speaking in front of a QA-controlled mic, and you generally have their transcripts. Pretty straightforward dataset for training. Sadly, I can't find the podcast that discussed this right now.

I assume you know about https://lyrebird.ai/

Lyrebird allows you to create a digital voice that sounds like you with only one minute of audio.

Yes, I just tried to record my voice, but the service didn't finish. It could be a temporary glitch.

The fake Donald Trump voice is terribly close.

This definitely has repercussions for fake news and overall trust. We will just not be able to trust what we see (computer-rendered images) or hear (rendered voice).

Related: I've been looking for a good solution to voice synthesis with style transfer. One obstacle many potential Fallout 4 mods will face is that the protagonist is fully voiced. Imagine a mod author being able to create new voice lines in the same tone+style as the original voice actor.

That does raise the question, though: would a voice actor have legal grounds to stop such reproduction?

Or better yet, just deploy the NN as a renderer for sounds, along with a bunch of text to render...

> I’ve always been curious about how so seemingly little work has gone into speech generation, compared to graphics and physics and other technologies.

The issue comes first from the lack of assets. Voice samples that you can use for speech synthesis are not easily available, apart from your own voice and maybe a few friends'. Then there's always been the issue that speech synthesis was not very good (we had working speech synthesis back in 1985 on the Amiga Workbench, but it was pretty much what you'd expect in terms of quality), and "not good enough" can break the experience.

On top of that, speech synthesis is not an easy problem. It's definitely in the realm of machine learning, and video game development has not touched much on machine learning so far. I think there's a ton of applications for machine learning in games (such as making less stupid AI, or more human-like AI), and speech synthesis is definitely one of them.

Machine learning really doesn't have many uses for games outside of the development cycle (or non-critical gimmicks). If it's exposed to player input, or especially random input, it absolutely must have robust, stable and controllable behaviour, at least on paper. Most of the time getting any of those out of an NN is either difficult or blatantly impossible.

Though speech synthesis specifically might be a good use case for some simpler ML, such as to form the final sound from processed phonemes (i.e., acting as a fuzzy lookup table).

It may not be applicable in all games. But there are many games where the AI skill matters a lot to players. Or where they just get annoyed or bored with stupid AI. Games with bad AI (which is most games) frustrate me a great deal.

AoE 2 let players make their own custom AIs. They took one of the best fan-made AI scripts and made it the default AI in the new release. And players loved it, because it provided a much harder and more human-like challenge, and didn't cheat like the old AI did.

Obviously you shouldn't just train an NN to instant headshot players with optimal strategy. You can add realistic handicaps to the AI like noisy controls and human reaction times, so it isn't superhuman.
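To make the "realistic handicaps" idea concrete, here is a minimal sketch, assuming a bot that aims along a single axis and a game loop that ticks at a fixed rate. The class name `HandicappedAim` and its parameters are hypothetical and not taken from any particular engine: the bot only sees target positions that are `reaction_ticks` old, and its aim is perturbed with Gaussian jitter.

```python
import random
from collections import deque


class HandicappedAim:
    """Wraps a perfect aim signal with a reaction delay and aim jitter.

    Hypothetical helper for illustration; not from any real game engine.
    """

    def __init__(self, reaction_ticks, jitter_std, seed=None):
        self.reaction_ticks = reaction_ticks  # how many ticks the bot lags behind
        self.jitter_std = jitter_std          # standard deviation of the aim noise
        self.rng = random.Random(seed)
        self.pending = deque()                # observations the bot has not "seen" yet

    def update(self, target_x):
        """Feed the true target position for this tick.

        Returns the bot's noisy aim, or None while it is still reacting.
        """
        self.pending.append(target_x)
        if len(self.pending) <= self.reaction_ticks:
            return None  # nothing old enough to have been perceived
        seen = self.pending.popleft()  # position from reaction_ticks ago
        return seen + self.rng.gauss(0.0, self.jitter_std)


if __name__ == "__main__":
    # Assumed numbers: 60 ticks per second, so 15 ticks is a 250 ms reaction time.
    aim = HandicappedAim(reaction_ticks=15, jitter_std=0.5, seed=42)
    for tick in range(20):
        print(tick, aim.update(float(tick)))
```

Both knobs can be tuned per difficulty level, which keeps the behaviour controllable regardless of how the underlying policy was produced.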

Enemy AI is most certainly the perfectly wrong place to apply ML due to the wide range of possible inputs (scenarios) and being in the utterly wrong complexity class (I mean, it's ostensibly doable with an RNN, but definitely not worth it).

GOAP is still pretty much the end-all of game AI, but it can be pretty tricky to work with.

I feel like once there's something open source/free which works serviceably for a big enough range of voices, you might see some mod projects jumping into it. Certainly the idea of being able to put together new, fully voiced single-player content solo sounds appealing as hell to me.

I can see trouble in the political arena when "recordings" are generated. Not for legal reasons but some voters might be easily duped.

https://grail.cs.washington.edu/projects/AudioToObama/ is a nice example. The tech is already there; we can expect the psyops people of any major nation-state and/or any political campaign with significant financial resources to be able to generate such fake content in 2018.

Combined with realistic “Animoji”-like virtual face masks for digital doppelgängers, yes.

Yes, games. And the other thing the internet is for.

Huge kudos to the authors for being upfront about what doesn’t work. I am getting pretty tired of people not doing this more consistently and only putting out the most attractive results, even when they must have observed many not-so-good results.

> While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises.

A while ago, someone posted this sample from their TTS startup: https://instaud.io/KlA

Anyone remember the company?

Might it have been Lyrebird[0]? As far as I can tell, they have the best text-to-speech that you can actually use. It's kinda annoying when Google makes these announcements about their advances in TTS only to reveal that it's not actually something you can make use of, and no, the dataset they used to train their model is not available.

0. https://lyrebird.ai/demo/

Nope, that is definitely not it. It might have been a smaller, AFAIR French, startup. The (English) demos were all weather related.

Wow! I can't believe that's an AI.

You can hear the chirps occasionally.

I am constantly surprised by how robust and versatile Mel spectrograms (and Mel frequency cepstrum) are, despite the filterbanks and the transformations being relatively arbitrarily engineered and performed on evenly spaced frames.

These results seem fantastic nonetheless! I anticipate they'll be able to optimize the system to real-time generation or better within the next year, looking at WaveNet.
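For readers who haven't seen how a mel filterbank is laid out, here is a minimal sketch of the band spacing using the common 2595 * log10(1 + f/700) mel formula. The 80 bands match the figure quoted from the paper elsewhere in this thread; the 125 Hz to 7600 Hz range and the function names are illustrative assumptions, and this only computes band edges, not a full spectrogram.

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, fmin_hz, fmax_hz):
    """Return n_bands + 2 frequencies in Hz, evenly spaced on the mel scale.

    Triangular filter i would span edges[i] .. edges[i + 2],
    with its peak at edges[i + 1].
    """
    lo, hi = hz_to_mel(fmin_hz), hz_to_mel(fmax_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Assumed, illustrative range; only the band count comes from the thread.
edges = mel_band_edges(80, 125.0, 7600.0)
widths = [edges[i + 2] - edges[i] for i in range(80)]

if __name__ == "__main__":
    print("lowest band width (Hz):", round(widths[0], 1))
    print("highest band width (Hz):", round(widths[-1], 1))
```

The bands are narrow at low frequencies and widen steadily toward the top, which is the hand-engineered, roughly perceptual warping the comment above is referring to.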

This is the first time I can't tell the difference between the synthesized voice and a human voice.

If you listen carefully, it's possible with some of the samples to hear that the human is stressing a different word (e.g. "that girl" vs "that girl", or "too busy for romance" vs "too busy for romance"), but I couldn't tell which was the real recording based on that alone.

My take is that the human voices have a more emotional and weightier tone, while the robot is flat and direct.

I thought it was pretty easy to pick the audio clips with greater variance in pitch and pronunciation. 2-2-2-1 were the human voices.

Haha, try again, the human is 1,2,2,1 according to the filenames (I was fooled too).

I do think the difference would become obvious with a paragraph or more of speech, though. It's difficult to judge what the correct intonation should be on these single sentences without context. Ultimately, correct intonation requires a complete understanding of meaning, which is still out of reach. An audiobook read by Tacotron 2 would still sound strange.

Depends on the audiobook. I think technical docs would be alright, which is mostly what I want this for. There are lots of technical docs I'd like to listen to while I work out.

Looks like the real ones are actually 1-2-2-1. The file names of the samples end in either "gt" or "gen", which kinda gives it away.

I thought the same thing initially -- I guess it fooled a few of us with that first one!

I thought the first one was the clearest once you've read that the synthesised voice attempts to guess which words should be stressed from syntax: sentences beginning with the word "that" often should stress "that" because they're distinguishing that choice from some other, but probably not for this particular instance where it's an off hand reference to some girl from some video...

I've been wondering lately, it seems like audio books might be an amazing training resource for models like these, if you could get the script that the reader was working from!

Your wish, granted. http://www.openslr.org/12/

It's 1000 hours of audiobook readings, segmented by sentence, with transcripts. All from Project Gutenberg, so maybe a little bit heavy on Victorian bodice rippers and such, but certainly a great trove of training data...

That data is no good for this purpose, as it’s from a lot of different speakers and does not have speaker labels, i.e., you can’t tell which sentences were spoken by which speaker.

I wonder how hard it is to get this effect in other languages. "The Google Translate lady" is really helpful for my foreign language studies, but I wonder how robotic it sounds to a native speaker.

Ars Technica mentioned this research in an approachable article [1] on the challenges of getting speech recognition and generation working for more languages. I found it fascinating.

Very cool stuff.

[1] https://arstechnica.com/information-technology/2017/12/teach...

I suppose an API for this service will come to Google Cloud Platform sooner or later? Or is it still on the research side (not production-ready)?

It's still a research project and not a production system:

"We manually analyze the error modes of our system on the custom 100-sentence test set from Appendix E of [11]. Within the audio generated from those sentences, 0 contained repeated words, 6 contained mispronunciations, 1 contained skipped words, and 23 were subjectively decided to contain unnatural prosody, such as emphasis on the wrong syllables or words, or unnatural pitch. In one case, the longest sentence, end-point prediction failed."

To add: > Also, our system cannot yet generate audio in realtime.

For a production GCP API, I think faster than real-time would be necessary.

For example, WaveNet took a year to go from research to production in Google Assistant: https://deepmind.com/blog/wavenet-launches-google-assistant/

On the audio samples pages, at the end, which one do you think is the human one?


It'd be nice if they could make this a paid service at least.

Maybe services like AWS Polly would upgrade over time to be as good as this. (Sadly, Polly is currently worse than the standard macOS TTS, IMHO..)

I'm incredibly disappointed that this isn't related to tacos or robots, more importantly robots making tacos.

A taco-making robot is already at the level of a university student project.

This is great - but I imagine they are trying to get this to work in real time.

Ooooh, good point.

Is this available for use by non-Googlers?

Apparently not? I imagine it would have to be replicated by someone else. One also assumes that the volume of training data - I don't actually know - is not insignificant!

From the paper, the training data is surprisingly small:

"We train all models on an internal US English dataset, which contains 24.6 hours of speech from a single professional female speaker."

Uncompressed, that's still quite a lot of data - perhaps not relative to other ML projects, but still.

They say they split the audio into an "80-dimensional audio spectrogram with frames computed every 12.5 milliseconds". The picture from the post supports that.

For 24 hours that would be roughly 7 million frames of 80 values, or about 560 million data points, which is around 2 GB of raw data (assuming 4-byte floats).

So, something that easily fits into RAM. One might even keep a copy of the whole dataset in GPU memory to avoid copying to and from, even if matrix operations are done on smaller batches.
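The back-of-the-envelope estimate above can be checked with a few lines of Python. This is a rough sketch under stated assumptions: it uses the 24.6 hours, 80 values per frame and 12.5 ms frame step quoted from the paper, assumes 4-byte floats (the paper does not state a storage format), and ignores edge effects at utterance boundaries.

```python
# Figures quoted from the paper in this thread.
hours_of_speech = 24.6
values_per_frame = 80
frame_step_s = 0.0125

# Assumption: each value is stored as a 32-bit (4-byte) float.
bytes_per_value = 4

frames = round(hours_of_speech * 3600 / frame_step_s)
data_points = frames * values_per_frame
size_gb = data_points * bytes_per_value / 1e9

if __name__ == "__main__":
    print(f"frames:      {frames:,}")
    print(f"data points: {data_points:,}")
    print(f"raw size:    {size_gb:.2f} GB")
```

That works out to about 7.1 million frames, about 567 million values and roughly 2.3 GB, in line with the estimate above.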

E.g. ImageNet is 50 GB of compressed data, and there are many much larger datasets in practical use.

Can the research team find a way to display a blog entry without requiring JavaScript? Perhaps some sort of markup language would be ideal. The current version requires all sorts of crap from 30 different domains and at least 2 trackers. By default with uBlock I got just the header.
