
Google’s DeepMind Achieves Speech-Generation Breakthrough - jrcii
http://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough
======
runesoerensen
Also
[https://news.ycombinator.com/item?id=12455510](https://news.ycombinator.com/item?id=12455510)

~~~
aab0
This Bloomberg article adds nothing, and is less informative & interesting
(doesn't have the audio samples for starters) than the original DeepMind blog
post, and shouldn't be on the front page.

------
e0m
Here's the link to the paper:
[https://drive.google.com/file/d/0B3cxcnOkPx9AeWpLVXhkTDJINDQ...](https://drive.google.com/file/d/0B3cxcnOkPx9AeWpLVXhkTDJINDQ/view)

And the WaveNet site with audio samples:
[https://deepmind.com/blog/wavenet-generative-model-raw-audio...](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

The comparisons against state-of-the-art parametric and concatenative methods
are pretty mind-blowing.

Particularly listen to the music samples. That's a generated piano piece that
sounds quite musical.

They even include breaths and other auditory signals that really make for a
convincing speech sample.

~~~
exDM69
> Particularly listen to the music samples. That's a generated piano piece
> that sounds quite musical.

That blew my mind. I've heard computer generated music before, but it's been
just synthesized with usual methods while the computer is just the "composer".

I find the crackling and buzzing a bit awkward, though. It's probably an
artifact of the algorithm and can probably be mitigated with some simple
filtering.

------
Xcelerate
This raises some interesting questions. I've always thought that recording
conversations would be sufficient "proof" of what someone said. It seems that
soon audio alone will not be sufficient — video will be necessary as well.

~~~
veeragoni
Sorry! Look at this video and restate your statement about "video".
[https://www.youtube.com/watch?v=ohmajJTcpNk](https://www.youtube.com/watch?v=ohmajJTcpNk)

~~~
BatFastard
Wow!! That is amazing!

~~~
AJRF
They used this technique in Mr. Robot to emulate Obama talking about the E Corp
hack.

~~~
kkhire
WOW

I figured, knowing Obama's interest in good TV shows, he might have done the
cameo! This is really neat stuff.

------
nxzero
It'll be interesting to see, once "true" voice emulators become mainstream, how
they're exploited for good and evil.

Reminds me of "virtual" kidnapping scams, where an attacker knows the victim's
phone will be unreachable and calls a relative demanding they wire funds or
the victim will be killed. The attacker plays back a voice sample they've
captured from the victim that makes it sound like they're in trouble and need
help. Basically, the attacker calls the victim and says something like "May I
help you?" repeatedly until the victim responds with something like "No, I
don't need your help!" in a panicked voice, which is then edited to say
"Help! I need your help! Help!"

------
byebyetech
On a fluff note: if this system is called WaveNet, the next weather
prediction system based on deep learning should be called SkyNet.

~~~
jacobkranz
SpaceX's server farm is called exactly that. Supposedly Elon hates the name...

~~~
mtgx
I can almost hear him: "SkyNet is no joke, guys..."

------
perseusprime11
Here are some fun samples for the curious:

[https://deepmind.com/blog/wavenet-generative-model-raw-audio...](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

------
sharemywin
Always wondered if something like this could be done. Now just train it on a
famous individual's WAV data and you've got a free celebrity endorsement.

~~~
MrZongle2
I think that something rather different may take place: celebrity voices will
be copyrighted (or granted similar protection under law) by the individual
themselves or their estate, and licensed for use.

David Attenborough, Morgan Freeman and Billy West: call your offices.

~~~
jfoster
What if they are trained on soundalikes' voices? The tricky part would be that
you probably couldn't label it as the celebrity's voice.

~~~
sharemywin
The more I think about it, even if you used originals (say you licensed John
Wayne's voice from Disney, or whoever owns clips of it), you couldn't imply
there was an endorsement; that would be fraud.

------
asah
I want to see WaveNet used to create a synthetic William Shatner, trained on
old episodes of Star Trek, TJ Hooker and priceline.com commercials!!!

~~~
mtgx
Patrick Stewart and Morgan Freeman's voices should definitely be replicated.

------
rocky1138
What's the point of authoring an article like this and not including samples
so we can listen to them?

~~~
Jarwain
The original blog post has samples and everything, resulting in a much better
article:

[https://deepmind.com/blog/wavenet-generative-model-raw-audio...](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)

------
inputcoffee
Really this is natural text-to-speech.

16,000 analyses per second. So, given Moore's law, around the iPhone 9?

~~~
aab0
90 minutes of computation for 1 second of audio on DeepMind's GPUs:
[https://twitter.com/hardmaru/status/773968758519902208](https://twitter.com/hardmaru/status/773968758519902208)

That's a lot of cranks of Moore's law. Better hope for considerable
algorithmic improvements. (Raw is probably overkill anyway.)

~~~
inputcoffee
5400 seconds! Let's say it halves every year. To get it to one second you
need about 12 years. But if it drops to 1/3 every year, it would take about 8.

Not stating anything especially mind-blowing. Just restating the shocking
speed of exponential growth/shrinkage.
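
For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the figure quoted above (90 minutes of computation per 1 second of audio) and a constant yearly speedup factor. The function name `years_to_real_time` and the speedup factors are illustrative assumptions, not anything from DeepMind or from Moore's law itself.

```python
import math


def years_to_real_time(slowdown, yearly_speedup):
    """Years until `slowdown` seconds of compute per second of audio drops to 1,
    assuming compute time is divided by `yearly_speedup` once per year."""
    if yearly_speedup <= 1:
        raise ValueError("yearly_speedup must be greater than 1")
    if slowdown <= 1:
        return 0.0
    # Solve slowdown / yearly_speedup**years == 1 for years.
    return math.log(slowdown) / math.log(yearly_speedup)


# 90 minutes of computation per 1 second of audio (figure from the tweet above).
slowdown = 90 * 60  # seconds of compute per second of audio

for speedup in (2, 3):  # hypothetical yearly improvement factors
    years = years_to_real_time(slowdown, speedup)
    print(f"{speedup}x faster per year: {years:.1f} years")
```

Under those assumptions, halving every year gives roughly 12.4 years, and cutting to a third every year gives roughly 7.8 years. Any algorithmic improvement would just show up as a smaller starting `slowdown`.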

