
Tacotron 2: Generating Human-Like Speech from Text - stablemap
https://research.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html
======
Razengan
One of the domains where realistic and fully customizable speech synthesis
could usher in a renaissance of innovation is video game development.

I’ve always been curious about how seemingly little work has gone into
speech generation, compared to graphics, physics, and other technologies.

Problems with voice actors:

• The cost, often unaffordable for solo developers and small teams.

• All the text and scripts have to be finalized and set in stone. You
can’t change much, if anything, after you’ve finished recording everything.
Being able to re-hire the same actors for future sequels is also not guaranteed.

• Complete lack of real-time customization, for example in MMORPGs and other
online games. We’re stuck with a few set phrases or using our own voice
(not desirable all the time for everyone). Imagine if you could customize
your voice as fully as you can your character’s appearance, and have it
realistically speak any text you type, or NPCs actually speaking your name
instead of using generic placeholders.

• Players expect voice acting in all “major” titles. Plenty of games with
great stories get skipped over if they don’t have voice acting.

~~~
juanmirocks
> Being able to re-hire the same actors for future sequels is also not
> guaranteed.

Wow, it was not on my radar that AI could replace voice actors' jobs, but your
comment has opened me up to the idea.

To give an example: in Spain (I'm Spanish), after several seasons of The
Simpsons, the much-beloved voice actor for Homer Simpson died. They of course
had to replace him with another actor. Yet I know that many people in my
circle stopped watching The Simpsons at that point because they couldn't
stand the new voice. Regardless of whether the new voice was "worse" or
"better", it was certainly different, and we humans get weirded out by that.

In light of this, I could see animated films with fully AI-rendered voices.
Incredible.

~~~
nl
I assume you know about [https://lyrebird.ai/](https://lyrebird.ai/)

 _Lyrebird allows you to create a digital voice that sounds like you with only
one minute of audio._

~~~
juanmirocks
Yes, I just tried to record my voice, but the service didn't finish. It could
be a temporary glitch.

The fake Donald Trump voice is eerily close.

This definitely has repercussions for fake news and trust in general. We will
simply no longer be able to trust what we see (computer-rendered images) or
hear (rendered voices).

------
sytelus
Huge kudos to the authors for being upfront about what doesn’t work. I am
getting pretty tired of people not doing this consistently, putting out only
the most attractive results even when they must have observed many
not-so-good ones.

 _While our samples sound great, there are still some difficult problems to be
tackled. For example, our system has difficulties pronouncing complex words
(such as “decorum” and “merlot”), and in extreme cases it can even randomly
generate strange noises._

------
minxomat
A while ago, someone posted this sample from their TTS startup:
[https://instaud.io/KlA](https://instaud.io/KlA)

Anyone remember the company?

~~~
clickok
Might it have been Lyrebird[0]? As far as I can tell, they have the best
text-to-speech that you can actually use. It's kind of annoying when Google
makes these announcements about its advances in TTS, only to reveal that it's
not actually something you can make use of, and no, the dataset they used to
train their model is not available.

0\. [https://lyrebird.ai/demo/](https://lyrebird.ai/demo/)

~~~
minxomat
Nope, that is definitely not it. AFAIR it was a smaller French startup. The
(English) demos were all weather-related.

------
hackpert
I am constantly surprised by how robust and versatile Mel spectrograms (and
the Mel-frequency cepstrum) are, despite the filterbanks and transformations
being somewhat arbitrarily engineered and applied to evenly spaced frames.
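For anyone curious what those evenly spaced frames and triangular filterbanks actually look like, here's a minimal NumPy sketch of a mel spectrogram. The 300-sample hop (12.5 ms at 24 kHz) matches the frame rate quoted in the paper; the FFT size and the toy 440 Hz input are my own arbitrary choices, not anything from the Tacotron 2 implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, ctr):                     # rising slope
            fb[i, b] = (b - lo) / max(ctr - lo, 1)
        for b in range(ctr, hi):                     # falling slope
            fb[i, b] = (hi - b) / max(hi - ctr, 1)
    return fb

def mel_spectrogram(y, sr=24000, n_fft=1024, hop=300, n_mels=80):
    # hop=300 samples = 12.5 ms at 24 kHz, the frame rate from the paper
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # per-frame power spectrum
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)

# One second of a 440 Hz tone at 24 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
S = mel_spectrogram(tone)
print(S.shape)  # → (80, 77)
```

The "arbitrary" part is visible here: the number of filters, the FFT size, and the mel break frequencies (700 Hz, the 2595 constant) are all engineering conventions rather than anything derived from first principles.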

These results seem fantastic nonetheless! Looking at WaveNet's trajectory, I
anticipate they'll be able to optimize the system to real-time generation or
better within the next year.

------
billconan
This is the first time I can't tell the difference between a synthesized
voice and a human voice.

~~~
justonepost
I thought it was pretty easy to pick the audio clips with greater variance in
pitch and pronunciation. 2-2-2-1 were the human voices.

~~~
modeless
Haha, try again, the human is 1,2,2,1 according to the filenames (I was fooled
too).

I do think the difference would become obvious with a paragraph or more of
speech, though. It's difficult to judge what the correct intonation should be
for these single sentences without context. Ultimately, correct intonation
requires a complete understanding of meaning, which is still out of reach. An
audiobook read by Tacotron 2 would still sound strange.

~~~
justonepost
Depends on the audiobook. I think technical docs would be alright, which is
mostly what I want this for. There are lots of technical docs I'd like to
listen to while I work out.

------
sixdimensional
I've been wondering lately: audiobooks seem like they'd be an amazing
training resource for models like these, if you could get the script the
reader was working from!

~~~
sdenton4
Your wish, granted. [http://www.openslr.org/12/](http://www.openslr.org/12/)

It's 1,000 hours of audiobook readings, segmented by sentence, with
transcripts. It's all from Project Gutenberg, so it may be a little heavy on
Victorian bodice rippers and such, but it's certainly a great trove of
training data...

~~~
j_s
Project Common Voice
[https://news.ycombinator.com/item?id=14794654](https://news.ycombinator.com/item?id=14794654)

~~~
woodson
That data is no good for this purpose, as it’s from a lot of different
speakers and does not have speaker labels, i.e., you can’t tell which
sentences were spoken by which speaker.

------
SurrealSoul
I wonder how hard it is to get this effect in other languages. "The Google
Translate lady" is really helpful for my foreign-language studies, but I
wonder how robotic she sounds to a native speaker.

------
ansonhoyt
Ars Technica mentioned this research in an approachable article [1] on the
challenges of getting speech recognition and generation working for more
languages. I found it fascinating.

Very cool stuff.

[1] [https://arstechnica.com/information-technology/2017/12/teach...](https://arstechnica.com/information-technology/2017/12/teaching-old-virtual-assistants-new-language-tricks/)

------
nblavoie
I suppose an API for this service will come to Google Cloud Platform sooner
or later? Or is it still on the research side (not production ready)?

~~~
applejinn
It's still a research project and not a production system:

"We manually analyze the error modes of our system on the custom 100-sentence
test set from Appendix E of [11]. Within the audio generated from those
sentences, 0 contained repeated words, 6 contained mispronunciations, 1
contained skipped words, and 23 were subjectively decided to contain unnatural
prosody, such as emphasis on the wrong syllables or words, or unnatural pitch.
In one case, the longest sentence, end-point prediction failed."

~~~
thesandlord
To add: > Also, our system cannot yet generate audio in realtime.

For a production GCP API, I think faster-than-real-time generation would be
necessary.

For example, WaveNet took a year to go from research to production in Google
Assistant: [https://deepmind.com/blog/wavenet-launches-google-assistant/](https://deepmind.com/blog/wavenet-launches-google-assistant/)

------
lobo_tuerto
On the audio samples page, at the end, which one do you think is the human
one?

[https://google.github.io/tacotron/publications/tacotron2/ind...](https://google.github.io/tacotron/publications/tacotron2/index.html)

------
justonepost
It'd be nice if they could make this a paid service at least.

~~~
petercooper
Maybe services like AWS Polly will improve over time to be as good as this.
(Sadly, Polly is currently worse than the standard macOS TTS, IMHO.)

------
pssdbt
I'm incredibly disappointed that this isn't related to tacos or robots, more
importantly robots making tacos.

~~~
nine_k
A taco-making robot is already at the level of a university student project.

------
alttab
This is great - but I imagine they are trying to get this to work in real time.

~~~
exikyut
Ooooh, good point.

------
plg
is this available to use by non-googlers?

~~~
mhh__
Apparently not? I imagine it would have to be replicated by someone else. One
also assumes that the volume of training data (I don't actually know) is not
insignificant!

~~~
heinrichf
From the paper, the training data is surprisingly small:

"We train all models on an internal US English dataset, which contains 24.6
hours of speech from a single professional female speaker."

~~~
mhh__
Uncompressed, that's still quite a lot of data - perhaps not relative to other
ML projects, but still.

~~~
21
They say they split the audio into an "80-dimensional audio spectrogram with
frames computed every 12.5 milliseconds". The picture from the post supports that.

For 24 hours, that would be about 7 million frames of 80 values each, ~500
million data points, or ~2 GB of raw data (assuming 32-bit floats).
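A quick sanity check of that arithmetic, using the 24.6 hours quoted in the paper rather than a round 24:

```python
# Back-of-envelope size of the training set as mel-spectrogram frames:
# 24.6 h of speech, 80 mel channels, one frame every 12.5 ms (per the paper).
hours = 24.6
frame_s = 0.0125          # 12.5 ms per frame
n_mels = 80
bytes_per_value = 4       # float32

frames = hours * 3600 / frame_s           # ~7.1 million frames
values = frames * n_mels                  # ~570 million values
gigabytes = values * bytes_per_value / 1e9

print(f"{frames / 1e6:.1f} M frames, {values / 1e6:.0f} M values, "
      f"{gigabytes:.1f} GB")              # prints "7.1 M frames, 567 M values, 2.3 GB"
```

So the ballpark above holds: roughly 2 GB of raw float32 spectrogram data, before counting the raw waveform the WaveNet vocoder is trained against.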

~~~
PeterisP
So it's something that easily fits into RAM. One might even keep a copy of
the whole dataset in GPU memory to avoid copying to and from the device, even
if matrix operations are done on smaller batches.

E.g. ImageNet is 50 GB of compressed data, and there are many much larger
datasets in practical use.

------
flukus
Can the research team find a way to display a blog entry without requiring
JavaScript? Perhaps some sort of markup language would be ideal. The current
version pulls in all sorts of crap from 30 different domains and at least two
trackers. By default with uBlock, I got just the header.

