
WaveNet launches in the Google Assistant - stablemap
https://deepmind.com/blog/wavenet-launches-google-assistant/
======
DanAndersen
I'm most interested in its potential for audiobooks, especially of the vast
array of old and less-common books that don't have human-made audiobook
equivalents. I find myself constrained by the limits of audiobook choices,
which tend toward best-seller lists or pop-sci. Current attempts to use
text-to-speech to generate audiobooks result in something that's frankly
unlistenable. If TTS could get to a "good enough" point for audiobooks, that
would open up a huge range of less-common content.

~~~
exhilaration
One of my co-workers told me in 2012 that he was doing exactly this. He used
an ebook reader to download free ebooks from Gutenberg and then the IVONA text
to speech engine for Android to listen to them on his drives. He had already
finished a few classics like Treasure Island this way.

I'm sure things have improved significantly since 2012 so what you're looking
for is probably easily done.

~~~
degenerate
If your co-worker could share how he did this, it would be appreciated
(specifically, what apps/code needs to be run).

~~~
thorum
Not the coworker, but I've been doing this for a couple years. If you're on
iOS, the simplest option is Voice Dream Reader [1].

It can read .epub files directly, along with text files and webpages, and
integrates with Dropbox, Pocket, Gutenberg, etc.

On Android, you can get Voice Dream Reader (though it has fewer features than
on iOS) or @Voice Aloud Reader, which is free [2].

[1] [http://www.voicedream.com/reader/](http://www.voicedream.com/reader/)

[2] [http://www.hyperionics.com/atVoice/](http://www.hyperionics.com/atVoice/)

~~~
woodson
It also works reasonably well for academic papers as PDF. Equations get
butchered, of course, but overall it's not bad (you can set PDF margins so
that headers/footers are not read aloud on every page). I use it to proofread
my own writing; it helps you spot things that spelling/grammar checkers
miss.

------
mannigfaltig
I am wondering what their baseline is. They call it "Current Best Non-
WaveNet". Quite frankly, Apple's most recent deep learning-based speech
synthesis sounds superior, but there aren't enough samples for a proper
comparison: [https://machinelearning.apple.com/2017/08/06/siri-
voices.htm...](https://machinelearning.apple.com/2017/08/06/siri-voices.html)

~~~
microcolonel
It could just be a matter of opinion, but I prefer both Google's unit
selection synthesis, and their WaveNet synthesis. The prosody in Apple's
latest method is still annoying, nowhere near as good as the Google models of
2015 and 2016, and not remotely comparable to the WaveNet models.

Apple's change in voice talent is an improvement though, and they may have
more units than before, which is helpful. I believe their model also works
offline, which is a huge plus (though I think Google's prior model works
offline as well).

~~~
CobrastanJorji
> prosody

I learned a useful new word, thank you!

------
microcolonel
Wow, I hear a _huge_ improvement in the Japanese model: the difference between
a robot in person and a young woman on the phone.

~~~
aikinai
It's a huge improvement, but still nowhere near the state of the art for
Japanese TTS which has always been ahead of English (since the phonetics are
much simpler).

Listen to the samples here for example:
[http://voicetext.jp/samplevoice/](http://voicetext.jp/samplevoice/)

~~~
microcolonel
Mm, I see what you mean. It seems like voicetext.jp has more "correct"
prosody and seems to reproduce vocal fry correctly; I'm noticing now that
that's missing from the WaveNet sample. There's still some work to be done on
filtering with voicetext, though: I hear pretty glaring artifacts between the
units, whereas WaveNet doesn't produce any such artifacts.

------
tymekpavel
I've always wondered why companies don't just take all the closed-captioned
TV streams and use that as training data for their voice models. Seems like it
would create a much more natural sounding voice model (at least as far as
humans are accustomed to).

~~~
epmaybe
What happens if the CC isn't entirely in sync with the video or audio?

~~~
gwern
Strictly speaking, the text is never 'entirely in sync' because spoken words
inherently blur together and are seamless; individual letters in the text do
not start and end at precise intervals. This is one of the things that makes
speech recognition so hard: letters, syllables, and words do not really exist
as discrete things on the raw audio level. So this problem exists for any
speech transcription dataset. To provide a loss function, then, you would use
something like CTC:
[http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2...](http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2D37798AA9ABA95F00F67FA604F1AD?doi=10.1.1.75.6306&rep=rep1&type=pdf)
Fortunately, NNs are good at handling noisy data, and in practice they work
very well for speech recognition/transcription.
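
To make the CTC idea concrete, here's a minimal sketch of CTC's greedy decoding rule in Python (the label values and blank symbol are illustrative, not from the paper): the network may emit a label or a special "blank" at every audio frame, and any frame-level sequence that collapses to the target transcript counts as a correct alignment, so no exact timing is needed.

```python
def ctc_collapse(frame_labels, blank=0):
    """Greedy CTC decoding: merge repeated labels, then drop blanks.

    CTC sidesteps the need for exact alignment by letting the network
    emit a label (or a 'blank') at every audio frame; any frame-level
    sequence that collapses to the target transcript is a valid alignment.
    """
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frame-level output "h h e _ l l _ l o" collapses to "hello"
# (0 = blank, letters numbered 1-26):
frames = [8, 8, 5, 0, 12, 12, 0, 12, 15]
print(ctc_collapse(frames))  # [8, 5, 12, 12, 15]
```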

------
komali2
I still don't understand the difference between Google Assistant and Google
Now (or whatever), other than when I accidentally launch Assistant instead of
Now, my commands are never understood.

~~~
2600
I don't understand the difference either other than Assistant seems to be
replacing Now. The issue, I'm finding, is that Google Assistant requires that
you turn on the recording of usage data and history to your account, such as
Web & App activity, Location History, Voice & Audio activity. All recorded and
tied to your account. I don't believe this was the case with Google Now.

Google, to their credit, normally allows you to erase all this data and turn
recording off, but with Google Assistant, they require it all to be on and
recording. I avoid using it because of this restriction.

~~~
gvurrdon
IIRC one could use Google Now with voice recordings off and the extra-spying
"web and app activity" setting off (though web and app activity had to be on
for many features). All this must now be on for the Assistant, as you say.

I did try out Google Now for a time due to having an Android Wear watch and
found that, as time went on, more functionality required turning these
settings on. One particular oddity for a time was that "OK Google, Navigate Home"
would produce only a complaint about web and app activity being off, but "OK
Google, navigate to $HOME_STREET, $HOME_TOWN" was fine.

I've abandoned Android Wear because of this.

------
mrguyorama
To be honest, it still obviously sounds machine-generated. I guess it's a
slight improvement, but the examples shown do not include any challenging
words or phrases. We've been able to generate adequate-sounding speech for
simple phrases for quite some time. I bet this was really fun to work on,
though.

~~~
sgk284
If you ever listen to people try to record sound that's clear and precise,
they actually sound fairly robotic. See this Google 20% Project where they
explore Google Assistant's voice creation:
[https://youtu.be/qnGNfz7JiZ8?t=5m23s](https://youtu.be/qnGNfz7JiZ8?t=5m23s)

WaveNet is probably modeling the source data very well. It sounds like they
just need more data with emotion and inflection, rather than source data
that is optimized for monotony and precision.

~~~
Yhippa
The improved voices sound exactly like the audio from CBTs I've taken in the
past. Could possibly have fooled me.

~~~
spuz
What is CBT?

~~~
anakron
CBT is commonly understood to mean Cognitive Behavioral Therapy, but in this
case I believe OP means computer-based training.

------
lawlessone
This would be great for games with lots of speaking characters and other NPCs.

You could vary so much

------
Tade0
I hope somebody uses this to immortalize Sir David Attenborough or the sadly
departed Don LaFontaine.

~~~
spuz
Oh man, I can already see the court cases arguing that a certain robot's
voice sounds a little too similar to a deceased human's, and that it should
owe royalty payments.

~~~
Tade0
Heh, in my original post I wanted to follow up that paragraph with a question:
Is someone's voice their IP?

Would e.g. Don LaFontaine's family go all J.R.R. Tolkien and forbid the use of
his voice in certain contexts?

Then again, for now, it seems that it's easy to get away with synthesizing
voices of popular characters as long as you don't use copyrighted names:

[https://acapela-box.com/AcaBox/index.php](https://acapela-
box.com/AcaBox/index.php)

(choose English(USA) - "Little Creature". Borked on Firefox but works in
Chrome)

"Little Creature" my ass.

Apparently, names are IP, voices not so much. But we can't have nice things
because like you said - eventually, someone's going to figure out how to file
an effective lawsuit.

~~~
hughes
I don't see it as being very different than using someone's face to promote
something.

For example, should a lifelike digital model (or even a still image) of an
actor's face still generate royalties for the actor's family after their
death? Both cases are using a unique attribute of the person to promote
something.

------
kuschku
So, when can we expect an open source release of the tech and model?

~~~
Sir_Cmpwn
Never, and it fucking sucks.

~~~
imaginenore
Do you know any higher quality speech synthesis software?

~~~
Sir_Cmpwn
Nope.

------
alexasmyths
These things are highly nuanced.

The amount of tweaking and finessing that goes on under the hood for specific
scenarios can change outcomes quite a lot.

And from a product perspective, it can be taken quite far - for example, most
of the 'common' things Siri says are not synthesized; they're literally
recordings of the voice-over artists. The more arcane stuff is synthesized.

It's always comparing apples to oranges to bananas unless you really know what
they're doing, even then it's hard.

------
StavrosK
I would love something like this for computer notifications, or any sort of
automated system notification. For a concrete example, my RC controller has
the ability to do voice prompts (e.g. "landing gear down"), and it would be
great if I could get a high-quality voice to speak all this.

Hell, I'd settle for an API where I could send text and get high-quality voice
back. Maybe I can somehow hack the Google Assistant app to do it...

~~~
craigforster
I think AWS Polly ([https://aws.amazon.com/blogs/aws/polly-text-to-speech-
in-47-...](https://aws.amazon.com/blogs/aws/polly-text-to-speech-in-47-voices-
and-24-languages/)) can do this. I'm not sure of the quality compared to
Google's or Apple's TTS though.
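
For what it's worth, a minimal sketch of the "send text, get voice back" call against Polly (the "Joanna" voice and mp3 output are assumptions on my part; the live call needs `boto3` installed and AWS credentials configured):

```python
def build_polly_request(text, voice_id="Joanna", output_format="mp3"):
    """Keyword arguments for Polly's synthesize_speech API call."""
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}

def synthesize_to_file(text, path):
    """Send text to AWS Polly and save the returned audio stream.

    Requires `pip install boto3` and AWS credentials configured locally.
    """
    import boto3
    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(text))
    with open(path, "wb") as f:
        f.write(response["AudioStream"].read())

# e.g. synthesize_to_file("Landing gear down", "prompt.mp3")
print(build_polly_request("Landing gear down"))
```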

~~~
StavrosK
Apparently it's semi-decent, thank you:

[https://aws.amazon.com/polly/](https://aws.amazon.com/polly/)

~~~
dx034
I was hoping that the voice in the video was created by Polly. That would've
been amazing but the samples they provide sound quite robotic.

~~~
StavrosK
Yeah, they do, unfortunately :/ Hopefully they'll improve it, or some other
API will come along that can provide this.

------
svantana
While the 100x speedup sounds impressive, a raw speed number without details
regarding hardware is pretty meaningless. I'm guessing they got the 20x
realtime speed from running it on their new TPU hardware, which they say can
do 180 Tflops. That means you would need 9 Tflops of computing power to run
this in realtime -- still pretty far away from running on a phone, or PC for
that matter.
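
The back-of-the-envelope arithmetic behind that guess, as a quick sketch (it assumes the model saturates the full 180 Tflops of the TPU, which is itself a guess):

```python
tpu_tflops = 180.0       # claimed throughput of the TPU hardware
realtime_factor = 20.0   # audio generated 20x faster than realtime

# Tflops needed to synthesize at exactly 1x realtime under these assumptions:
tflops_for_realtime = tpu_tflops / realtime_factor
print(tflops_for_realtime)  # 9.0
```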

~~~
chmod775
They literally said it's launching in the Google Assistant (i.e. running on
phones) now.

In fact that's the title of the article.

~~~
wmf
Does TTS run on the phone or the cloud?

~~~
chmod775
It runs on the phone afaik, while speech recognition is supported by stuff
Google runs in the cloud.

Edit: I might be wrong, at the end of a paragraph they say it runs on Google’s
TPU cloud infrastructure, though it isn't clear to me whether they just use
that for training.

Edit 2: I just tried it on my phone. At least stuff like asking it to "Turn on
WiFi" works without an internet connection, and yields a TTS response.

~~~
spuz
> I just tried it on my phone. At least stuff like asking it to "Turn on WiFi"
> works without an internet connection, and yields a TTS response.

But this is the status quo. You would not expect Google to disable offline TTS
just for slightly improved quality. The real question is whether it's running
WaveNet offline or the previous version of its TTS engine.

~~~
trevorstrohman
Today's announcement is cloud-only. We also support an older algorithm for
offline use that's less computationally intensive.

~~~
modeless
Is the offline one improving also? Google Maps often falls back to it (much
more often than necessary for some reason) and it sounds completely different
and far worse.

------
CoolGuySteve
I'm interested in using similar generative networks to reduce artifacting in
video streams. For example, highly compressed streams tend to
show blocking artifacts on dark scenes, gradients, and static that could be
smoothed in the decoder.

I haven't actually done much about it yet, but I'm interested.

~~~
nicklovescode
Adding matching of the Laplacian to the optimization could be really helpful
for this. I found this paper the other day and really enjoyed it:
[https://arxiv.org/abs/1707.01253](https://arxiv.org/abs/1707.01253)

------
taf2
Is there an api for this like amazon Polly?

~~~
vagab0nd
I was wondering about that too. Google already has speech recognition APIs. I
don't understand why they don't provide TTS.

------
baxuz
Awesome, I guess. Hard to know, when everything AI/Assistant-related is
currently focused on the US market.

------
philtar
Maybe I'm imagining it, but don't the WaveNet versions sound very formal and
a little depressed? As if it were a reflection of our society. Listen to them
again: very formal, lifeless, and depressed.

------
supermdguy
Wow. I can't believe they got such a significant speed increase.

~~~
jay-anderson
Hopefully there will be enough details on how they accomplish this when they
release their paper.

------
Abishek_Muthian
For those of you who would like to see how Microsoft Cognitive Services' TTS
fares compared to Google's TTS:

We launched a bot that gives a voice summary of web content on Messenger,
Slack, Telegram & Twitter, with an in-line audio player on the first three.
It's great for sharing audio summaries with our visually impaired friends.

Check it out here -
[https://larynx.io/#larynxBot](https://larynx.io/#larynxBot)

------
visarga
Can I get this on Mac OS as system voice?

------
0xbear
Apple’s iOS 11 voice is pretty amazing too. I wonder what they are using.

------
samstave
Beautiful, but certain voices still sound quite computerized... except
Japanese.

I'd like to be able to record my voice and have it transcribed to text, then
compare how it sounds via both engines vs. my own voice/cadence.

------
dzhiurgis
Any ideas when Assistant for iOS will be released outside the US?

------
dharma1
wow. I hope they will publish details on the optimisations

------
denfromufa
is there open-source implementation on github?

~~~
erichocean
There are two reachable from here:
[https://arxiv.org/abs/1611.09482](https://arxiv.org/abs/1611.09482)

------
imaginenore
Oh, how much I wish Google had paid Morgan Freeman or David Attenborough to
have their voice as an option.

~~~
peteretep
Wonder -- in all seriousness -- if the BBC will do Attenborough. If you're
British and you've grown up with his documentaries, all other nature voices
simply sound wrong.

------
zebrafish
I think the logical next step for this is voice recognition.

