
WaveNet: A Generative Model for Raw Audio - benanne
https://deepmind.com/blog/wavenet-generative-model-raw-audio/
======
augustl
The music examples are utterly fascinating. It sounds insanely natural.

The only thing I can hear that sounds unnatural is the way the reverberation
in the room (the "echo") immediately gets quieter when the raw piano sound
itself gets quieter. In a real room, if you produce a loud sound followed
immediately by a soft one, the reverberation of the loud sound remains.
But since this network only models "the sound right now", the volume of the
reverberation follows the volume of the piano sound.

To my ears, this is most prevalent in the last example, which starts out loud
and gradually becomes softer. It sounds a bit like they are cross-fading
between multiple recordings.

Regardless, the piano sounds completely natural to me, I don't hear any
artifacts or sounds that a real piano wouldn't make. Amazing!

There are also fragments that sound inspiring and very musical to my ears,
such as the melody and chord progression after 00:08 in the first example.

~~~
DarkTree
It shot me forward to a time when people just click a button to generate
music they want to listen to. If you really like a generation, you save it
and share it. It wouldn't have all of the other aspects we derive from
human-produced music, like soul/emotion (because we know it's coming from a
human, not because of how it sounds), but it would be a cool application idea
anyway.

~~~
JasonStorey
Have you tried [https://www.jukedeck.com](https://www.jukedeck.com) ? AI
composed music at the touch of a button.

------
erichocean
This can be used to implement seamless voice performance transfer from one
speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or for something totally
new, train a WaveNet with a bunch of different speakers until you get one you
like. This becomes the _target WaveNet_.

3. Record raw audio from the source speaker.

Fun fact: any algorithmic process that "renders" something given a set of
inputs can be "run in reverse" to recover those inputs given the rendered
output. In this case, we now have raw audio from the source speaker that, in
principle, _could_ have been rendered by the source speaker's WaveNet, and we
want to recover the inputs that would have rendered it, had we done so.

To do that, usually you convert all numbers in the forward renderer into Dual
numbers and use automatic differentiation to recover the inputs (in this case,
phonemes and what not).

4. Recover the inputs. (This is computationally expensive, but not difficult
in practice, especially if WaveNet's generation algorithm is implemented in
C++ and you've got a nice black-box optimizer to apply to the inputs, of which
there are many freely available options. A toy sketch of this step appears
below.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's
WaveNet, and record the resulting audio.

Result: The resulting raw audio will have the same overall performance and
speech as the source speaker, but rendered completely naturally in the target
speaker's voice.
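
To make step 4 concrete, here's a toy sketch. It has nothing to do with the
actual WaveNet code: `render` is a stand-in for the source speaker's trained
generator run deterministically on its inputs, and the optimizer is a generic
off-the-shelf black-box method.

```python
# Toy sketch of step 4: recover a renderer's inputs from its output with a
# black-box optimizer. "render" stands in for a trained, deterministic generator.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = 0.3 * rng.normal(size=(64, 8))      # fixed weights of the toy "renderer"

def render(inputs):
    """Deterministic forward process: inputs -> 'audio'."""
    return np.tanh(W @ inputs)

true_inputs = rng.normal(size=8)        # the inputs we want to recover
observed_audio = render(true_inputs)    # the recorded output

def loss(candidate):
    return float(np.sum((render(candidate) - observed_audio) ** 2))

result = minimize(loss, x0=np.zeros(8), method="Nelder-Mead",
                  options={"maxiter": 20000, "fatol": 1e-12, "xatol": 1e-9})
print("max recovery error:", np.abs(result.x - true_inputs).max())
```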

~~~
itcrowd
Another fun fact: this _actually_ happens with (cell) phone calls.

You don't send your speech over the line; instead, you send some parameters
which are then, at the receiving end, fed into a white(-ish) noise generator
to reconstruct the speech.

Edit: not by using a neural net or deep learning, of course.

~~~
JonnieCache
In case anyone is wondering, the technique is called linear predictive coding.

~~~
AstralStorm
Predictive coding. Linear is a specific variant of it used in older codecs.
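
The linear flavor is easy to illustrate. A toy numpy sketch (not a real
codec): fit coefficients so each sample is predicted from the previous p
samples; a codec then only has to ship the coefficients plus a compact
description of the residual/excitation.

```python
# Toy linear prediction: each sample approximated from the previous p samples.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

p = 10  # prediction order
# Regression problem: signal[n] ~= sum_k a[k] * signal[n - k - 1]
X = np.column_stack([signal[p - k - 1 : len(signal) - k - 1] for k in range(p)])
y = signal[p:]
a, *_ = np.linalg.lstsq(X, y, rcond=None)

residual = y - X @ a
print("signal RMS:    ", np.sqrt(np.mean(y ** 2)))
print("prediction RMS:", np.sqrt(np.mean(residual ** 2)))   # much smaller
```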

------
rdtsc
Wonder if there are any implications here for breaking (MitM) the ZRTP protocol.

[https://en.wikipedia.org/wiki/ZRTP](https://en.wikipedia.org/wiki/ZRTP)

At some point, to authenticate, both parties verify a short message by reading
it to each other.

However, the NSA already tried to MitM that about 10 years ago using voice
synthesis. It was deemed inadequate at the time. I wonder whether TTS
improvements like these change that game and make it a more plausible scenario.

~~~
luckystarr
This will make private, in-person key exchange way more important, especially
as the attack vector is so cheap (software).

------
dharma1
The samples sound amazing. These causal convolutions look like a great idea;
I'll have to re-read the paper a few times. All the previous generative audio
from raw audio samples I've heard (using LSTMs) has been super noisy. These
are crystal clear.

Dilated convolutions are already implemented in TF; I look forward to someone
implementing this paper and publishing the code.
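
In the meantime, here's my minimal sketch of the core building block using
Keras layers (not the authors' code, and with plain ReLUs in place of the
paper's gated units): stacking causal convolutions and doubling the dilation
rate each layer makes the receptive field grow exponentially with depth.

```python
# Minimal sketch of stacked dilated causal convolutions (not the authors' code).
# padding="causal" means each output sample sees only current and past inputs;
# doubling the dilation rate per layer grows the receptive field exponentially.
import tensorflow as tf

def dilated_causal_stack(num_layers=8, channels=32, kernel_size=2):
    inputs = tf.keras.Input(shape=(None, 1))          # (time, 1) raw-audio-like input
    x = inputs
    for i in range(num_layers):
        x = tf.keras.layers.Conv1D(
            filters=channels,
            kernel_size=kernel_size,
            dilation_rate=2 ** i,                     # 1, 2, 4, 8, ...
            padding="causal",
            activation="relu",                        # the paper uses gated units instead
        )(x)
    logits = tf.keras.layers.Conv1D(256, 1)(x)        # per-step logits over 256 values
    return tf.keras.Model(inputs, logits)

model = dilated_causal_stack()
model.summary()
# Receptive field here: 1 + (kernel_size - 1) * (2**num_layers - 1) = 256 samples.
```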

~~~
kastnerkyle
I did a review of PixelCNN as part of my summer internship; it covers a bit of
how careful masking can be used to create a chain of conditional
probabilities [0], which AFAIK is exactly how this "causal convolution" works
(it can't have dependencies on the 'future'). The PixelCNN and PixelRNN papers
also cover this in a fair bit of detail. Ishaan Gulrajani's code is also a
great implementation reference for PixelCNN / masking [1].

[0]
[https://github.com/tensorflow/magenta/blob/master/magenta/re...](https://github.com/tensorflow/magenta/blob/master/magenta/reviews/pixelrnn.md)

[1]
[https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...](https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.py)
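
To make the masking point concrete, a tiny numpy sketch (not from either
codebase): zero the kernel taps that would look at the current or future
positions, and each output can only depend on strictly earlier inputs, which
is what lets the outputs be read as a chain of conditionals p(x_t | x_<t).

```python
# Tiny masked-convolution sketch: zero the "current and future" taps so that
# each output depends only on strictly earlier inputs.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)            # a toy 1D sequence
kernel = rng.normal(size=7)        # full kernel, nominally centered at index 3

mask = np.zeros_like(kernel)
mask[:3] = 1.0                     # keep only the taps that look at the past
masked_kernel = kernel * mask

# "same"-size correlation; with this mask, y[t] uses only x[t-3], x[t-2], x[t-1].
y = np.convolve(x, masked_kernel[::-1], mode="same")

# Causality check: perturbing x at position 20 must not change y at t <= 20.
x2 = x.copy()
x2[20] += 1.0
y2 = np.convolve(x2, masked_kernel[::-1], mode="same")
print("first output index affected:", np.flatnonzero(~np.isclose(y, y2))[0])  # prints 21
```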

~~~
dharma1
Heh, just read it! Very useful, will have to go through in detail

------
novalis78
What's really intriguing is the part in their article where they explain the
"babbling" of WaveNet when they train the network without the text input.

That sounds just like a small kid imitating a foreign (or their own) language.
My kids are growing up bilingual, and I hear them attempt something similar
when they are really small. I guess it's like listening in on their neural
network modelling the sound of the new language.

~~~
sjwright
To my Australian English ears, the babbling sounded vaguely Scandinavian.

~~~
novalis78
Indeed. I was surprised by that as well. Sounded like a Dutch speaker with a
muffled voice behind a screen.

~~~
vintermann
Especially funny as the main authors are Dutch.

~~~
rattray
Ah. Perhaps it was trained on Dutch speakers, not English.

~~~
sjwright
That would explain it. Would be interesting to hear babbling trained with
other languages and accents.

------
noonespecial
So when I get the AI from one place, train it with the voices of hundreds of
people from dozens of other sources, and then have it read a book from Project
Gutenberg to an mp3... who owns the mechanical rights to that recording?

~~~
kuschku
Every single person who had rights to the source audio you used.

For the same reason, Google training neural networks on user data is legally
very doubtful – they changed the ToS, but also used data collected before the
ToS change for that.

~~~
feral
>Every single person who had rights on the sources for audio you used.

What if my 'AI' was a human who learned to speak by being trained with the
voices of hundreds of people from dozens of other sources? What's the
difference?

Those waters seem muddy. I think that'd be an interesting copyright case; I
don't think it's self-evident.

~~~
kuschku
So if I remix just 200 songs together, the result is not copyright protected
anymore?

~~~
noonespecial
No, it's not like remixing. It's more like listening to 200 songs and then
writing one that sounds _just_ like them.

More like turning the songs into a series of numbers (say, 44,100 of these
numbers per second) and then using an AI to predict which number comes next to
make a song that sounds something like the 200. The result is not possible
without ingesting the 200 songs, _but_ the 200 songs are not "contained" in
the net and then sampled to produce the result, like stitching together a
recording from other recordings by copying little bits.

The hairs split too fine at the bottom for our current legal system to really
handle. That's why it's interesting.

~~~
kuschku
In the US legal system, that’d still be a derivative work.

This might be an interesting read for you:
[http://ansuz.sooke.bc.ca/entry/23](http://ansuz.sooke.bc.ca/entry/23)

------
jay-anderson
Any suggestions on where to start learning how to implement this? I understand
some of the high level concepts (and took an intro AI class years ago -
probably not terribly useful), but some of them are very much over my head
(e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts
of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would
be useful as I attempt to understand it. (EDIT: section numbers refer to their
paper)
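
For what it's worth, here is my current (possibly wrong) reading of those two
sections as a toy numpy sketch: 2.2 seems to be mu-law companding the waveform
to 256 discrete values so each sample can be modelled with a 256-way softmax,
and 2.3 seems to be an elementwise tanh/sigmoid gate applied to the
convolution outputs. Shapes below are made up.

```python
# Toy sketch of my reading of sections 2.2 and 2.3 (shapes made up).
import numpy as np

# 2.2 Softmax distributions: mu-law compand samples in [-1, 1] to 256 classes.
def mu_law_quantize(x, mu=255):
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)   # integers in [0, 255]

audio = np.sin(np.linspace(0, 20 * np.pi, 1000))                # toy waveform
targets = mu_law_quantize(audio)                                # softmax targets

# 2.3 Gated activation units: z = tanh(filter_out) * sigmoid(gate_out),
# where filter_out and gate_out come from two parallel convolutions.
def gated_activation(filter_out, gate_out):
    return np.tanh(filter_out) * (1.0 / (1.0 + np.exp(-gate_out)))

rng = np.random.default_rng(0)
filter_out = rng.normal(size=(1000, 32))    # stand-ins for the two conv outputs
gate_out = rng.normal(size=(1000, 32))
z = gated_activation(filter_out, gate_out)
print(targets.min(), targets.max(), z.shape)   # 0 255 (1000, 32)
```

Does that match other people's reading?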

~~~
visarga
The best advice is to wait for a version to pop up on GitHub. It's hard to
implement such a paper as a beginner.

~~~
datenwolf
Well, I think since we now have frameworks for doing this kind of stuff
(TensorFlow and similar), the barrier to entry is much, much lower. Also, the
computing power required to build the models can be found in commodity GPUs.

On a hunch I'd say an absolute beginner may be able to get good results with
these tools, just not as quickly as experts in the field who already know how
to use the tools properly. That's why I'm going to wait for something to pop
up on GitHub: I have zero practical experience with these things, but I can
read these papers comfortably without needing to look up every other term.

There are a number of applications I'd like to throw at deep learning to see
how it performs. Most notably, I'd like to see how well a deep learning system
can extract features from speckle images. At the moment you have to average
out the speckles from ultrasound or OCT images before you can feed them to a
feature recognition system. Unfortunately this kind of averaging eliminates
certain information you might want to process further down the line.

------
chestervonwinch
Is it possible to use the "deep dream" methods with a network trained for
audio such as this? I wonder what that would sound like, e.g., beginning with
a speech signal and enhancing with a network trained for music or vice versa.

~~~
dontreact
We tried this but with less success than what wavenet did.
[https://wp.nyu.edu/ismir2016/wp-
content/uploads/sites/2294/2...](https://wp.nyu.edu/ismir2016/wp-
content/uploads/sites/2294/2016/08/ardila-audio.pdf)

~~~
dontreact
There is a link to examples at the end

~~~
chestervonwinch
Interesting! So if I understand correctly, much of the noise in the generated
audio is due to the noise in the learned filters?

I assume some regularization is added to the weights during training, say L1
or L2? If so, this is essentially equivalent to assuming the weight values are
distributed i.i.d. Laplacian or Gaussian. It seems you could learn less noisy
filters by using a prior that assumes dependency between values within each
filter, thereby enforcing smoothness or piecewise smoothness of each filter
during training.
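
Something like this, as a rough sketch of the penalty I mean (generic numpy,
not tied to any particular training setup): penalize squared differences
between adjacent taps of each filter instead of (or in addition to) a plain L2
term.

```python
# Rough sketch: plain L2 penalty vs. a smoothness penalty on 1D filters.
import numpy as np

def l2_penalty(filters):
    # i.i.d. Gaussian prior on the taps (the usual weight decay).
    return np.sum(filters ** 2)

def smoothness_penalty(filters):
    # filters: (num_filters, filter_length). Penalizing differences between
    # neighboring taps corresponds to a prior that correlates adjacent values,
    # pushing the learned filters toward smooth shapes.
    return np.sum(np.diff(filters, axis=1) ** 2)

filters = np.random.default_rng(0).normal(size=(16, 64))
print("L2:", l2_penalty(filters))
print("smoothness:", smoothness_penalty(filters))
```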

~~~
dontreact
Yes. Working on some different regularization techniques.

------
fastoptimizer
Do they say how much time the generation takes?

Is this insanely slow to train but extremely fast at generation?

~~~
georgehm
"After training, we can sample the network to generate synthetic utterances.
At each step during sampling a value is drawn from the probability
distribution computed by the network. This value is then fed back into the
input and a new prediction for the next step is made. Building up samples one
step at a time like this is computationally expensive, but we have found it
essential for generating complex, realistic-sounding audio."

So it looks like generation is a slow process.
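
In other words, the loop looks roughly like this (`predict_distribution` is a
placeholder for one forward pass of the trained network, not a real API), so
generating one second of 16 kHz audio means 16,000 network evaluations:

```python
# Rough shape of the sampling loop described above: one forward pass per sample.
import numpy as np

rng = np.random.default_rng(0)

def predict_distribution(history):
    """Placeholder for the trained network: distribution over 256 values."""
    logits = rng.normal(size=256)
    return np.exp(logits) / np.exp(logits).sum()

history = [128]                       # arbitrary seed value in [0, 255]
for _ in range(16000):                # one second of audio at 16 kHz
    probs = predict_distribution(history)
    next_value = rng.choice(256, p=probs)   # draw from the predicted distribution
    history.append(int(next_value))         # feed it back in as the next input
print(len(history) - 1, "samples generated")
```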

------
ronreiter
Please please please someone please share an IPython notebook with something
working already :)

~~~
ThePhysicist
I have some iPython notebooks for speech analysis using a Chinese corpus. I
used those for a tutorial on machine learning with Python and unfortunately
they are still a bit incomplete, but maybe you find them useful nevertheless
(no deep learning involved though). What I do in the tutorial is to start from
a WAV file and then go through all the steps required for analyzing the data
(using a "traditional" approach), i.e. generate the Mel-Cepstrum coefficients
of the segmented audio data and then train a model to distinguish individual
words. Word segmentation is another topic that I touch a bit, and where we can
also use machine learning to improve the results.

Here's a version with very simple speech training data (basically just
different syllables with different tones):

[https://github.com/adewes/machine-learning-
chinese/blob/mast...](https://github.com/adewes/machine-learning-
chinese/blob/master/training/part_3/Chinese%20Speech%20Recognition.ipynb)

More complex speech training data (from a real-world Chinese speech corpus
[not included but downloadable]):

[https://github.com/adewes/machine-learning-
chinese/blob/mast...](https://github.com/adewes/machine-learning-
chinese/blob/master/training/part_3/old/Chinese%20Speech%20Recognition.ipynb)

There are other parts of the tutorial that deal with Chinese text and
character recognition as well, if you're interested:

[https://github.com/adewes/machine-learning-
chinese](https://github.com/adewes/machine-learning-chinese)

For part 2 I also train a simple neural network with lasagne (a Python library
for deep learning), and I plan to add more deep learning content and do a
clean write-up of the whole thing as soon as I have some more time.

~~~
ronreiter
Thanks! Will take a look.

------
grandalf
This is incredible. I'd be worried if I were a professional audiobook reader
:)

~~~
espadrine
I wouldn't. The results they offer are excellent, but what's missing to reach
human level is producing the correct intonation, which requires accurately
understanding the material. That is still at least ten years in the future, I
expect.

~~~
syllogism
Not really. They're training directly on the waveform, so the model can learn
intonation. They just need to train on longer samples, and perhaps augment
their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it
don't really predict people's behaviour very well. It's also very expensive to
get people to do the prosody annotations accurately under whatever theory you
pick.

Predicting the raw audio directly cuts out this problem. The "theory" of
prosody can be left latent, rather than specified explicitly.

~~~
ycombinatorMan
There's zero chance of effective intonation and tone without understanding of
the material.

~~~
syllogism
I think your use of the term "understanding" is very unhelpful here. It's
better to think about what you need to condition on to predict correctly.

In fact most intonation decisions are pretty local, within a sentence or two.
The most important thing is the given/new contrast, i.e. the information
structure. This is largely determined by the syntax, which we're doing pretty
well at predicting, and which latent representations in a neural network can
be expected to capture adequately.

~~~
espadrine
The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it
was explained in the previous chapter that a serial killer is hiding in it.

On the other hand, if the shed contains a shovel that is urgently needed to
dig up a treasure that has been the subject of the novel since page 1, you
will convey urgency.

------
JoshTriplett
How much data does a model take up? I wonder if this would work for
compression? Train a model on a corpus of audio, then store the audio as text
that turns back into a close approximation of that audio. (Optionally store
deltas for egregious differences.)

~~~
kastnerkyle
Decompression would be slow with current models / hardware, due to the
sequential dependencies in generation, but very efficient information-wise
(you only have to send text, which itself can be compressed!).

I am sure people will start trying to speed this up, as it _could_ be a game
changer in that space with a fast enough implementation. Google also has a lot
of great engineers with direct motivation to get it working on phones, and a
history of porting recent research in to the Android speech pipeline.

The results speak for themselves - step 1 is almost always "make it work"
after all, and this works amazingly well! Step 2 or 3 is "make it fast",
depending who you ask.

~~~
Houshalter
We've known for decades that neural networks are really good at image and
video compression. But as far as I know, this has never been used in practice,
because the compression and decompression times are ridiculous. I imagine this
would be even more true for audio.

~~~
dharma1
The Magic Pony guys (who sold to Twitter) have patents on and implementations
of a super-resolution CNN for real-time video.

[http://www.cv-
foundation.org/openaccess/content_cvpr_2016/pa...](http://www.cv-
foundation.org/openaccess/content_cvpr_2016/papers/Shi_Real-
Time_Single_Image_CVPR_2016_paper.pdf)

------
bbctol
Wow! I'd been playing around with machine learning and audio, and this blows
even my hilariously far-future fantasies of speech generation out of the
water. I guess when you're DeepMind, you have both the brainpower and the
resources to tackle sound right at the waveform level and rely on your
increasingly-magical NNs to rebuild everything else you need. Really amazing
stuff.

------
fpgaminer
I'm guessing DeepMind has already done this (or is already doing it), but
conditioning on a video is the obvious next step. It would be incredibly
interesting to see how accurate it can get generating the audio for a movie.
Though I imagine for really great results they'll need to mix in an
adversarial network.

~~~
visarga
Oh yes, extract voice and intonation from one language, and then synthesize it
in another language -> we get automated dubbing. Could also possibly try to
lipsync.

------
JonnieCache
Wow. I badly want to try this out with music, but I've taken little more than
baby steps with neural networks in the past: am I stuck waiting for someone
else to reimplement the stuff in the paper?

IIRC someone published an OSS implementation of the deep dreaming image
synthesis paper fairly quickly...

~~~
kastnerkyle
Re-implementation will be hard. Several people (including me) have been
working on related architectures, but WaveNet has a few extra tricks that seem
to make all the difference, on top of what I assume is "monster-scale
training, tons of data".

The core ideas from this can be seen in PixelRNN and PixelCNN, and there are
discussions and implementations for the basic concepts of those out there
[0][1]. Not to mention the fact that conditioning is very interesting / tricky
in this model, at least as I read it. I am sure there are many ways to do it
wrong, and getting it right is crucial to having high quality results in
conditional synthesis.

[0]
[https://github.com/tensorflow/magenta/blob/master/magenta/re...](https://github.com/tensorflow/magenta/blob/master/magenta/reviews/pixelrnn.md)

[1]
[https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...](https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.py)
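
On the conditioning point specifically, my reading is that the conditioning
signal h just enters as extra terms inside the gated activation, roughly
z = tanh(W_f * x + V_f * h) * sigmoid(W_g * x + V_g * h). A minimal numpy
sketch of that idea (shapes invented; h broadcasts over time for global
conditioning):

```python
# Minimal sketch of global conditioning inside a gated unit (shapes invented).
import numpy as np

rng = np.random.default_rng(0)
T, C, H = 100, 32, 16                  # time steps, channels, conditioning size
x = rng.normal(size=(T, C))            # conv output at each time step
h = rng.normal(size=(H,))              # global conditioning vector, e.g. a speaker embedding

W_f, W_g = rng.normal(size=(C, C)), rng.normal(size=(C, C))
V_f, V_g = rng.normal(size=(H, C)), rng.normal(size=(H, C))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# h @ V_f has shape (C,) and broadcasts across the T time steps.
z = np.tanh(x @ W_f + h @ V_f) * sigmoid(x @ W_g + h @ V_g)
print(z.shape)   # (100, 32)
```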

~~~
JonnieCache
Is there any usable example code out there I can play with? I don't care if it
sounds noisy and weird, it's all grist for the sampler anyway.

------
visarga
And when you think of all those Hollywood SF movies where the robot could
reason and act quite well but spoke in a tinny voice. How wrong they got it:
we can simulate high-quality voices, but we can't have our reasoning, walking
robots.

~~~
ilaksh
Depending on what you mean by 'reasoning, walking robots', then not really
yet... but every few weeks or months another amazing deep learning / NN result
comes out in a different domain. So these types of techniques seem to have
very broad application.

Of course, if you mean 'walking' in a literal sense, there are a number of
impressive walking robots such as Atlas
[https://www.youtube.com/watch?v=rVlhMGQgDkY](https://www.youtube.com/watch?v=rVlhMGQgDkY),
HRP-2
[https://www.youtube.com/watch?v=T6BSSWWV-60](https://www.youtube.com/watch?v=T6BSSWWV-60)
or HRP 4C
[https://www.youtube.com/watch?v=YvbAqw0sk6M](https://www.youtube.com/watch?v=YvbAqw0sk6M),
etc. Also, there are many types of useful reasoning systems. I am guessing you
are thinking of language understanding and generation, but I believe these
types of techniques are being applied quite impressively in that area also,
from DeepMind or Watson
[https://www.youtube.com/watch?v=i-vMW_Ce51w](https://www.youtube.com/watch?v=i-vMW_Ce51w)
etc.

------
ericjang
"At Vanguard, my voice is my password..."

------
kragen
This is amazing. And it's not even a GAN. Presumably a GAN version of this
would be even more natural — or maybe they tried that and it didn't work so
they didn't put it in the paper?

Definitely the death knell for biometric word lists.

------
imaginenore
Please make it sound like Morgan Freeman.

~~~
TeeWEE
Morgan Freeman +1

------
banach
I hope this shows up as a TTS option for VoiceDream
([http://www.voicedream.com/](http://www.voicedream.com/)) soon! With the best
voices they have to offer (currently, the ones from Ivona), I can suffer
through a book if the subject is really interesting, but the way the samples
sounded here, the WaveNet TTS could be quite pleasant to listen to.

------
imurray
Would delete this post if I could. Was a request to fix a broken link. Now
fixed.

~~~
andrew3726
It seems fixed now.

------
rounce
So when does the album drop?

~~~
rounce
In case the above came across as an example of bad sarcasm, I'm very serious.
I've a somewhat lazy interest in generative music, and found the snippets in
the paper quite appealing.

Though, as was mentioned in a previous comment, due to copyright (attribution
based on training data sources, blah blah) I might already have an answer. :(

~~~
b0ner_t0ner
“Is this Hiromi Uehara or WaveNet?”

------
nitrogen
I wonder how a hybrid model would sound, where the net generates parameters
for a parametric synthesis algorithm (or a common speech codec) instead of
samples, to reduce CPU costs.

------
partycoder
The first to do semantic style transfer on audio gets a cookie!

------
mtgx
When can we expect this to be used in Google's TTS engine?

------
tunnuz
Love the music part! Mmmh ... infinite jazz.

------
AstralStorm
Finally a convincing Simlish generator!

------
billconan
Hope they release some source code.

Wonder how many GPUs are required to hold this model.

------
baccheion
I suppose it's impressive in a way, but when I looked into "smoothing out"
text-to-speech audio a few years ago, it seemed fairly straightforward. I was
left wondering why it hadn't been done already, but alas, most engineers at
these companies are either politicking know-nothing idiots or are constantly
being roadblocked, preventing them from making any real advancements.

