
WaveNet implementation in Keras - basve
https://github.com/basveeling/wavenet/
======
svantana
The most interesting thing here is the note at the bottom regarding
computational cost: "A recent macbook pro reaches about 5 samples per second."

This shows how far this model is from realtime usage. However, I'm sure
Deepmind researchers are already looking into making this block-based or
applying some other optimization strategy.

~~~
basve
And I should add that this was measured using a downsized model (just two
blocks of dilated convolutions and a sampling rate of 4 kHz). Deepmind's paper
does not report how many stacks are used to generate the samples, but I assume
it's quite a few more.
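
As a rough back-of-the-envelope sketch (using the ~5 samples/second figure
above and the 4 kHz rate; the numbers are illustrative only):

    # Wall-clock time to generate 1 second of audio, sample by sample.
    sample_rate = 4000        # samples per second of audio (4 kHz)
    generation_speed = 5      # samples generated per wall-clock second

    wall_clock_seconds = sample_rate / generation_speed   # 800 s
    print(wall_clock_seconds / 60)   # ~13.3 minutes per 1 s of audio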

~~~
unlikelymordant
They (Deepmind) reported via tweet that it took 90 minutes of processing to
generate 1 s of speech. Hopefully this comes down in the future.

~~~
espadrine
Do you have a link?

This implementation says: “A Tesla K80 needs around ~4 minutes for generating
a second of audio at a sampling rate of 4000hz”, which is significantly
faster.

~~~
basve
90 minutes for 1 s of audio was reported by someone from Google on Twitter,
but the tweet has since been deleted. I've clarified in the readme that my
measurements are for a much lighter/smaller model than Deepmind's :).

------
robeastham
I was very impressed with the TTS examples in the original DeepMind article
([https://deepmind.com/blog/wavenet-generative-model-raw-audio/](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)).

Can someone elaborate on the usefulness of this implementation for Text-to-
Speech?

I'm keen to experiment with voice synthesis. I want to create dialog, from
multiple voice sources, for some characters in a VR application that I'm
working on.

Perhaps this lib is a better option for TTS:

[https://github.com/ibab/tensorflow-wavenet](https://github.com/ibab/tensorflow-wavenet)

I guess I could do with an ELI5 on how I'd approach this with either of these
libraries. I'm not familiar with any deep learning frameworks, but I am pretty
handy with Python and have implemented scikit-learn stuff.

Also thinking this will give me a reason to try an Azure K80 instance vs the
AWS GPU instances I've been using for other stuff. That said, is a Tesla K80
the only option for WaveNet? I'm guessing I could run it on other GPUs, but I
had read that memory might be an issue on some cards. If so, what's the lowest
card I can run it on, and will one of the AWS GPU instances suffice? I also
have a GTX 970 at home, but I'm guessing that won't cut it.

~~~
hiddencost
Short answer: Don't use this for practical purposes. It takes 90 minutes to
generate 1 second of audio.

Here's a good TTS system:

[http://www.cstr.ed.ac.uk/projects/festival/](http://www.cstr.ed.ac.uk/projects/festival/)

~~~
robeastham
But does anyone know if it's possible to do TTS with the recently released
libraries?

Thanks for the links, but to my ear the samples at those links don't hit the
mark. The WaveNet samples in the original article cross the threshold for me.
So I'd like to try some short-length dialog tests, especially as I've read
elsewhere that 1 second only takes 4 minutes on a K80.

Any light anyone else can shed on this would be great.

~~~
basve
Afaik none of the released libraries support the TTS experiment described in
the paper. Deepmind used pre-computed linguistic features to guide the system
in generating natural-sounding speech, so your output will probably depend on
the quality of those features. For the sake of not spreading misinformation:
the 4 minutes was measured using a small model with a sampling rate of 4 kHz;
this would not generate anything sounding like the samples from Deepmind.
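
To give a concrete (and heavily simplified) picture of what those features
are, here is a toy sketch; this is not what any of the released repos
implement, and the paper actually feeds the features into each gated
activation rather than simply concatenating them with the input:

    # Toy illustration only: frame-level linguistic features (phoneme
    # identity, F0, etc.) upsampled to the audio rate and attached to the
    # waveform input. All shapes and feature counts here are invented.
    import numpy as np

    audio = np.zeros((1, 4000, 256))                # 1 s of one-hot mu-law audio at 4 kHz
    ling = np.zeros((1, 200, 20))                   # 200 frames of linguistic features
    ling_up = np.repeat(ling, 4000 // 200, axis=1)  # naive upsampling to 4000 timesteps
    conditioned = np.concatenate([audio, ling_up], axis=-1)
    print(conditioned.shape)                        # (1, 4000, 276)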

~~~
robeastham
Thanks for the clarification and for spotting the 4 kHz error. This is
fascinating stuff.

Looks like I'll have to concede that voice acting is much more practical, for
now at least.

------
alphaBetaGamma
Slightly off-topic question about WaveNet:

In the paper, they say that they double the dilation factor up to a limit and
then repeat: 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512

The doubling of the dilation factor makes sense to me, but what is happening
with the "repeat" part? I don't understand what they are trying to do.
Wouldn't it make more sense to continue doubling?

~~~
basve
My intuition is that the doubling up to 512 does increase the receptive field,
but you're essentially building a non-linear convolutional filter with an
effective kernel size of 1024. The network benefits from stacking multiple of
these groups, because each group can again convolve over the previous outputs
at every temporal distance, which allows for learning deeper/higher-level
features. It is similar to the stacked 2D convolutions used for images, where
every subsequent convolutional layer learns more abstract, higher-level
features/attributes of the data. This is just intuition though; there is no
evidence yet that this holds for WaveNet's architecture.
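
To make the "repeat" part concrete, here is a minimal sketch in current Keras
(not this repo's code, which predates this API; the layer width, plain ReLU
activation and missing residual/skip connections are stand-ins for the real
gated WaveNet block):

    # Only illustrates the dilation schedule; not a full WaveNet block.
    from tensorflow.keras import layers, models

    nb_stacks = 3                              # the "repeat": run 1..512 three times
    dilations = [2 ** i for i in range(10)]    # 1, 2, 4, ..., 512

    inputs = layers.Input(shape=(None, 256))   # e.g. one-hot mu-law channels
    x = inputs
    for _ in range(nb_stacks):
        for d in dilations:
            # causal padding keeps the model autoregressive
            x = layers.Conv1D(32, kernel_size=2, dilation_rate=d,
                              padding='causal', activation='relu')(x)
    model = models.Model(inputs, x)

    # Receptive field with kernel size 2: one 1..512 stack covers
    # 1 + sum(dilations) = 1024 samples; each repeat adds another 1023,
    # so 3 stacks -> 1 + 3 * 1023 = 3070 samples.
    print(1 + nb_stacks * sum(dilations))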

------
toth
Tangentially related, but I was not aware of the Python package sacred that
they used. Seems pretty useful for organizing data science runs.
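
For the curious, a minimal sacred experiment looks roughly like this (the
experiment name and config values are invented for illustration, not taken
from this repo):

    # Minimal sacred sketch; names and values here are made up.
    from sacred import Experiment

    ex = Experiment('wavenet_toy')

    @ex.config
    def cfg():
        nb_stacks = 2        # values declared here become tracked config entries
        sample_rate = 4000

    @ex.automain
    def run(nb_stacks, sample_rate):
        print('training with', nb_stacks, 'stacks at', sample_rate, 'Hz')

Config entries can then be overridden from the command line (e.g.
`python train.py with nb_stacks=3`, where train.py is whatever the script is
called), and each run's configuration and results can be recorded by an
observer such as the MongoDB one.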

