WaveNet: A Generative Model for Raw Audio (deepmind.com)
627 points by benanne on Sept 8, 2016 | 145 comments

The music examples are utterly fascinating. It sounds insanely natural.

The only thing I can hear that sounds unnatural is the way that the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and immediately after a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.

To my ears, this is most prevalent in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.

Regardless, the piano sounds completely natural to me, I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!

There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.

I can hear some distortion in the piano notes - which may be an audio compression artefact, or it may be the output of the resynthesis process.

If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.

Piano music is very idiomatic, so you'll capture some typical piano gestures that way.

But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines.)

Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.

NN synthesis could be an interesting thing though. If you trained an NN on sounds at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.

Even if the output isn't very realistic, you'd still get something unusual and interesting.

The samples are uncompressed WAV files, so everything you hear is a direct result of the synthesis process. Some of the distortion is a result of the 16kHz sample rate; it's not 44.1kHz CD quality.

It's quantized to just 256 values though, which could be causing some of the distortion.
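For reference, the 256 values are not spread evenly over the amplitude range: the paper describes a mu-law companding step before quantization, which keeps more resolution for quiet samples than for loud ones. A minimal sketch of that kind of encoding follows. The function names are made up for illustration, and this is not the authors' code.

```python
import math

MU = 255  # 256 quantization levels, as in 8-bit mu-law

def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1) / 2 * mu))

def mu_law_decode(q, mu=MU):
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2 * q / mu - 1
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)

# Round-trip error is much smaller for quiet samples than for loud ones.
quiet_error = abs(mu_law_decode(mu_law_encode(0.01)) - 0.01)
loud_error = abs(mu_law_decode(mu_law_encode(0.5)) - 0.5)
```

Comparing `quiet_error` with `loud_error` shows where the rounding noise ends up, which may be part of the distortion people are hearing on louder notes.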

It shot me forward to a time where people just click a button to generate music they want to listen to. If you really like the generation, you save it and share it. It wouldn't have all of the other aspects that we derive from human-produced music like soul/emotion (because we know it's coming from a human, not because of how it sounds), but it would be a cool application idea anyway.

Have you tried https://www.jukedeck.com ? AI composed music at the touch of a button.

This reminds me of the Library of Babel short story.

I agree, the samples sound very natural. I ask myself though how similar they are to the data that has been used for training, as it would be trivial to rearrange individual pieces of a large training set in ways that sound good (especially if a human selects the good samples for presentation afterwards).

What I'd really like to see therefore is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.

A nice property of the model is that it is easy to compute exact log-likelihoods for both training data and unseen data, so one can actually measure the degree of overfitting (which is not true for many other types of generative models). Another nice property of the model is that it seems to be extremely resilient to overfitting, based on these measurements.
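To make the log-likelihood point concrete: an autoregressive model assigns each sample a probability given the samples before it, so the log-likelihood of a whole sequence is just a sum of per-step log-probabilities, and comparing the average on training data against held-out data gives a direct overfitting measure. The sketch below uses a smoothed bigram counter as a hypothetical stand-in for the network; only the bookkeeping is meant to carry over.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(seq, n_classes, alpha=1.0):
    """Toy stand-in for the network: p(x_t | x_{t-1}) from smoothed counts."""
    counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1

    def prob(prev, cur):
        total = sum(counts[prev].values()) + alpha * n_classes
        return (counts[prev][cur] + alpha) / total

    return prob

def avg_log_likelihood(prob, seq):
    """Exact average of log p(x_t | x_{t-1}) over a sequence, in nats per step."""
    terms = [math.log(prob(p, c)) for p, c in zip(seq, seq[1:])]
    return sum(terms) / len(terms)

train = [0, 1, 2, 3] * 50        # data the model is fitted on
held_out = [0, 1, 2, 3] * 10     # unseen data from the same source
unrelated = [3, 1, 0, 2] * 10    # unseen data with different structure

prob = fit_bigram(train, n_classes=4)
ll_train = avg_log_likelihood(prob, train)
ll_held = avg_log_likelihood(prob, held_out)
ll_other = avg_log_likelihood(prob, unrelated)
```

A small gap between `ll_train` and `ll_held` is what "resilient to overfitting" would look like in these terms; a model that had only memorised its training set would score held-out data far worse.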

Good point! Are (some of) the chords completely made up, for example, or is it only using chords it has heard before?

Filtering out certain notes from a piano chord can be done by e.g. Melodyne, but that seems far from what's necessary to generate speech, so it would surprise me if WaveNet could do that.

Decades ago, I was testing an LPC-10 vocoder. I discovered many new and strange sounds by playing with the input mike, such as blowing into it, or rubbing it. As with the LPC-10, I wonder what untapped musical possibilities this allows.

That seems completely tractable by simply adding a bit of the right reverb to the generated sample, more or less "in post".

Good point! Just train it with recordings that have no reverberation, and add it later.

It's quite difficult to have no reverberation, but not too bad at all to keep to a minimum. But reverb plus reverb equals reverb, so it's just a matter of finding one that sounds good.

It'd also be interesting to know if this technique could solve the "de-reverberation" problem.

This can be used to implement seamless voice performance transfer from one speaker to another:

1. Train a WaveNet with the source speaker.

2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.

3. Record raw audio from the source speaker.

Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.

To do that, usually you convert all numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and what not).

4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options.)

5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.

Result: The resulting raw audio will have the same overall performance and speech as the source speaker, but rendered completely naturally in the target speaker's voice.

Another fun fact: this actually happens with (cell) phone calls.

You don't send your speech over the line, instead you send some parameters over the line which are then, at the receiving end, fed into a white(-ish) noise generator to recover the speech.

Edit: not by using a neural net or deep learning, of course.

In case anyone is wondering, the technique is called linear predictive coding.

Predictive coding. Linear is a specific variant of it used in older codecs.
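For anyone who wants to see the idea in code, here is a rough sketch of linear prediction: estimate a few predictor coefficients from a signal, then regenerate a similar signal by driving an all-pole filter with noise. It assumes the textbook autocorrelation method with the Levinson-Durbin recursion. Real codecs work frame by frame and also transmit gain and pitch information, none of which is shown, and all names here are illustrative.

```python
import random

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a sequence."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin: coefficients a[k] with x[n] ~ sum_k a[k] * x[n-1-k]."""
    r = autocorr(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1 - k * k
    return a[1:], err

def synthesize(coeffs, excitation):
    """All-pole filter: each output is the prediction plus the excitation."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(c * out[n - 1 - k]
                   for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(pred + e)
    return out

# "Sender": a signal produced by a known 2-tap filter driven by white noise.
random.seed(0)
true_coeffs = [1.3, -0.6]
source = synthesize(true_coeffs, [random.gauss(0, 1) for _ in range(5000)])

# Only these few numbers would need to cross the line, not the 5000 samples.
est_coeffs, residual_energy = lpc(source, order=2)

# "Receiver": feed fresh noise through the estimated filter.
rebuilt = synthesize(est_coeffs, [random.gauss(0, 1) for _ in range(5000)])
```

The rebuilt signal is not sample-for-sample identical to the source, but it has roughly the same spectral shape, which is the property this kind of coding relies on.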

What's the difference in bandwidth?

> Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse"

Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).
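A toy version of both halves of this argument, as a sketch: dual numbers give the exact derivative of a forward function, gradient descent then searches for an input that reproduces a target output, and whether it finds one depends on where it starts. The `render` function is a made-up cubic, not anything from WaveNet, and the step size and iteration count are arbitrary.

```python
class Dual:
    """Number val + der*eps with eps**2 == 0; der carries the derivative."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = self._lift(other)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __sub__(self, other):
        o = self._lift(other)
        return Dual(self.val - o.val, self.der - o.der)

    def __mul__(self, other):
        o = self._lift(other)
        return Dual(self.val * o.val, self.val * o.der + self.der * o.val)

    __rmul__ = __mul__

def render(x):
    """Hypothetical forward 'renderer'; works on floats and on Dual."""
    return x * x * x - 3 * x

def recover_input(target, x0, lr=0.001, steps=5000):
    """Gradient descent on (render(x) - target)**2 using Dual derivatives."""
    x = x0
    for _ in range(steps):
        err = render(Dual(x, 1.0)) - target
        loss = err * err
        x -= lr * loss.der
    return x

found = recover_input(target=5.0, x0=3.0)    # reaches an input that renders 5
stuck = recover_input(target=5.0, x0=-2.0)   # settles in a local minimum
```

With these settings `found` ends up at an input whose output matches the target, while `stuck` settles near x = -1, where the output is about 2 rather than 5. That is the local-minimum failure described above, in one dimension.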

I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse, and inference is solved.

This also makes me think of "inverse problems", in the context of mathematics, physics.

E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.

The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.

Inverse problems are typically much harder to deal with, and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation aka introducing strong structural assumptions about what the expected solution should be like. This can be quite reasonable from a Bayesian perspective.
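A tiny numerical illustration of that last paragraph, as a sketch: a nearly singular 2x2 forward model, a measurement that is off by 0.001, and the difference a small ridge (Tikhonov) penalty makes. The matrix, the noise, and the penalty weight are all invented for the example.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a @ x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]

def ridge_2x2(a, b, lam):
    """Minimise |a @ x - b|^2 + lam * |x|^2 via the normal equations."""
    ata = [[sum(a[k][i] * a[k][j] for k in range(2)) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    atb = [sum(a[k][i] * b[k] for k in range(2)) for i in range(2)]
    return solve_2x2(ata, atb)

A = [[1.0, 1.0], [1.0, 1.001]]   # nearly singular forward model
x_true = [1.0, 1.0]              # the "initial conditions" we hope to recover
b_noisy = [2.0, 2.002]           # exact data would be [2.0, 2.001]

naive = solve_2x2(A, b_noisy)                   # direct inversion
regularised = ridge_2x2(A, b_noisy, lam=0.01)   # inversion with a prior
```

Direct inversion turns the 0.001 measurement error into an answer that is off by about 1 in each component, while the penalised version stays close to `x_true`. The penalty plays the role of the structural assumption: it says solutions with small norm are preferred.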


Maybe I'm reading this paper incorrectly, but it seems that in this system "voice" is part of the model parameters not inputs. What they did was train the same model with multiple reader voices while using one of the inputs to keep track of which voice the model was currently trained on. So the model can switch between different voices, but only between those which it was trained on.

"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."

Am I missing something?
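For readers unfamiliar with the term, a one-hot speaker vector is just a list with a single 1 at the speaker's index. The sketch below uses the speaker count from the quote; the function name is made up, and how the vector is combined with the audio inside the network is not shown.

```python
N_SPEAKERS = 109  # number of speakers in the quoted dataset

def one_hot(speaker_id, n=N_SPEAKERS):
    """Return a length-n vector with a single 1.0 at position speaker_id."""
    if not 0 <= speaker_id < n:
        raise ValueError(f"speaker_id must be in [0, {n}), got {speaker_id}")
    vec = [0.0] * n
    vec[speaker_id] = 1.0
    return vec

speaker_7 = one_hot(7)
```

There is simply no slot for a 110th voice in this encoding, which is consistent with the reading that the model can only switch between speakers it was trained on.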

These are the "inputs" I'm talking about recovering (from the link):

"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."

The raw audio from Step 3 was (in principle) generated by that input on a properly trained WaveNet. We need to recover that so we can transfer it to the target WaveNet.

How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.

Oh, pair this with facial mapping[1] and you pretty much have an "impersonate any famous person" system.

[1] http://www.graphics.stanford.edu/~niessner/thies2016face.htm...

Yup, I work in virtual filmmaking and there are tons of ways to use this stuff.

I give us 10-15 years before it's not possible to trust anything you see or hear that's recorded.

Really? I haven't trusted anything recorded in years.

Speech production is incredibly hard to fake at the moment.

> Speech production is incredibly hard to fake at the moment.

Sound-alikes have been used in the music industry since forever.

Or transmitted from one place to another :(

Basically the same idea as style transfer with image algorithms. Looking forward to Abraham Lincoln reading audiobooks to me.

That would require audio recordings of Abraham Lincoln's voice. Not sure recording technology existed back then.

Audio quality does leave something to be desired. https://vimeo.com/47987691

Lincoln died before Edison invented the phonograph. That's a hoax.

Lincoln died in 1865, but the oldest recordings are from the 1860s. The video is definitely a hoax (http://www.firstsounds.org/research/others/lincoln.php), but it's at least theoretically possible his voice could have been recorded. In fact I believe there are some even older recordings from the 1850s, but I don't think those have been successfully recovered yet.

These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.

Ah I stand corrected, thanks.

It seems like you're using WaveNet to do speech-to-text when we have better tools for that. To transfer text from Trump to Clinton, first run speech-to-text on Trump speech and then give that to a WaveNet trained on Clinton to generate speech that sounds like her but says the same thing as Trump.

> It seems like you're using WaveNet to do speech-to-text

I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).

In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.

I see. I still think it's easier to apply deepmind's feature transform on text rather than to try to invert a neural network. Armed with a network trained on Trump, deepmind's feature transform from text->network inputs, you should be able to make him say whatever you want, right?

Text -> features -> TrumpWaveNet -> Trump saying your text

> Armed with a network trained on Trump, deepmind's feature transform from text->network inputs, you should be able to make him say whatever you want, right?

Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.

Sounds like a very fancy way to do compression with a massive custom dictionary.

Thanks for the tl;dr. However, the fun fact is not true for surjective functions, IIRC, in which case multiple inputs may relate to one output, if this is relevant for WaveNets.

Nitpicking: surjective functions do not relate to uniqueness of outputs; you'd rather talk about non-injective functions. I agree with your point, though.

(surjective != non-injective, in the same way that non-increasing != decreasing)

Wonder if there are any implications here for breaking (MitM) the ZRTP protocol.


At some point, to authenticate, both parties verify a short message by reading it to each other.

However, the NSA already tried to MitM that about 10 years ago using voice synthesis. It was deemed inadequate at the time. I wonder if TTS improvements like these change that game and make it a more plausible scenario.

This will make private in person key exchange way more important. Especially as the attack vector is so cheap (software).

The samples sound amazing. These causal convolutions look like a great idea, will have to re-read a few times. All the previous generative audio from raw audio samples I've heard (using LSTM) has been super noisy. These are crystal clear.

Dilated convolutions are already implemented in TF, look forward to someone implementing this paper and publishing the code.

I did a review for PixelCNN as a part of my summer internship, it covers a bit about how careful masking can be used to create a chain of conditional probabilities [0], which AFAIK is exactly how this "causal convolution" works (can't have dependencies in the 'future'). The PixelCNN and PixelRNN papers also cover this in a fair bit of detail. Ishaan Gulrajani's code is also a great implementation reference for PixelCNN / masking [1].

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...
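In one dimension the "no dependencies on the future" idea is easy to show directly. This is a plain-Python sketch of a causal, dilated convolution, not code from the paper or from either repository above; the function names and the example weights are illustrative.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k*dilation], treating x as 0 before t=0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:          # never reads x[t+1], x[t+2], ...
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Number of past-and-present inputs one output of the stack can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
out = causal_dilated_conv(signal, [0.5, 0.5], dilation=2)

# Changing only the last input leaves every earlier output untouched.
changed = signal[:-1] + [100.0]
out_changed = causal_dilated_conv(changed, [0.5, 0.5], dilation=2)

# Doubling the dilation per layer grows the receptive field exponentially.
field = receptive_field(2, [2 ** i for i in range(10)])
```

Because output t only looks backwards, all outputs can be computed at once during training, and stacking layers with doubling dilations covers a long history with few layers.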

Heh, just read it! Very useful, will have to go through in detail

What's really intriguing is the part in their article where they explain the "babbling" of wavenet, when they train the network without the text input.

That sounds just like a small kid imitating a foreign (or their own) language. My kids grow up bilingual and I hear them attempt something similar when they are really small. I guess it's like listening in to their neural network modelling the sound of the new language.

To my Australian English ears, the babbling sounded vaguely Scandinavian.

Indeed. I was surprised by that as well. Sounded like a Dutch speaker with a muffled voice behind a screen.

It might just be that English is fairly close to German and the like, but as English speakers it doesn't sound like English to us: because we know English, it gets mapped to a similar but different language.

Confirms my thought that Dutch sounds like unintelligible babbling :)

Especially funny as the main authors are Dutch.

Ah. Perhaps it was trained on Dutch speakers, not English.

That would explain it. Would be interesting to hear babbling trained with other languages and accents.

To my German ears it sounded definitely English, not Dutch, like a very hard to understand dialect.

So when I get the AI from one place, train it with the voices of hundreds of people from dozens of other sources, and then have it read a book from Project Gutenberg to an mp3... who owns the mechanical rights to that recording?

> who owns the mechanical rights to that recording?

The monkey who shot the picture. https://en.wikipedia.org/wiki/Monkey_selfie

good point ... I am pretty sure there are a thousand audible products waiting to be launched.

Every single person who had rights on the sources for audio you used.

For the same reason, Google training neural networks with userdata is very legally doubtful – they changed the ToS, but also used data collected before the ToS change for that.

>Every single person who had rights on the sources for audio you used.

What if my 'AI' was a human who learned to speak by being trained with the voices of hundreds of people from dozens of other sources? What's the difference?

Those waters seem muddy. I think that'd be an interesting copyright case, don't think it's self evident.

So if I remix just 200 songs together, the result is not copyright protected anymore?

No, it's not like remixing. It's more like listening to 200 songs and then writing one that sounds just like them.

More like turning the songs into series of numbers (say 44100 of these numbers per second) and then using an AI to predict which number comes next to make a song that sounds something like the 200. The result is not possible without ingesting the 200 songs, but the 200 songs are not "contained" in the net and then sampled to produce the result, like stitching together a recording from other recordings by copying little bits.

The hairs split too fine at the bottom for our current legal system to really handle. That's why it's interesting.

In the US legal system, that’d still be a derived work.

This might be an interesting read for you: http://ansuz.sooke.bc.ca/entry/23


Any suggestions on where to start learning how to implement this? I understand some of the high level concepts (and took an intro AI class years ago - probably not terribly useful), but some of them are very much over my head (e.g. 2.2 Softmax Distributions and 2.3 Gated Activation Units) and some parts of the paper feel somewhat hand-wavy (2.6 Context Stacks). Any pointers would be useful as I attempt to understand it. (EDIT: section numbers refer to their paper)

Best advice is to wait for a version to pop up on github. It's hard to implement such a paper as a beginner.

Well, I think since we now have frameworks for doing this kind of stuff (Tensorflow and similar) the barrier to entry is much, much lower. Also the computing power required to build the models can be found in commodity GPUs.

On a hunch I'd say an absolute beginner may be able to get good results with these tools, just not as quickly as experts in the field who already know how to use the tools properly. That's why I'm going to wait for something to pop up on GitHub, because I have zero practical experience with these things, but I can read these papers comfortably without the need to look up every other term.

There are a number of applications I'd like to throw at deep learning to see how it performs. Most notably I'd like to see how well a deep learning system can extract features from speckle images. At the moment you have to average out the speckles from ultrasound or OCT images before you can feed them to a feature recognition system. Unfortunately this kind of averaging eliminates certain information you might want to process further down the line.

Agreed, there's a lot of breadth here. I'm coming from the opposite end, with some experience in "manual" concatenative speech synthesis and very little in the ML area. You'd need to be cross-disciplined from the get-go.

https://github.com/ibab/tensorflow-wavenet - looks like they're starting to show up.

Is it possible to use the "deep dream" methods with a network trained for audio such as this? I wonder what that would sound like, e.g., beginning with a speech signal and enhancing with a network trained for music or vice versa.

We tried this but with less success than what wavenet did. https://wp.nyu.edu/ismir2016/wp-content/uploads/sites/2294/2...

There is a link to examples at the end

Interesting! So if I understand correctly, much of the noise in the generated audio is due to the noise in the learned filters?

I assume some regularization is added to the weights during training, say L1 or L2? If this is the case, this is essentially equivalent to assuming the weight values are distributed i.i.d. Laplacian or Gaussian. It seems you could learn less noisy filters by using a prior that assumes dependency between values within each filter, thereby enforcing smoothness or piecewise smoothness of each filter during training.
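Spelled out, the equivalence between weight penalties and priors is the standard MAP argument. The notation below (D for the training data, sigma and b for the prior scales) is mine, not the paper's.

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \arg\max_{w} \big[ \log p(D \mid w) + \log p(w) \big] \\[4pt]
p(w) = \prod_i \mathcal{N}(w_i;\, 0, \sigma^2)
  &\;\Rightarrow\; -\log p(w) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const}
  \quad \text{(L2 penalty)} \\[4pt]
p(w) = \prod_i \frac{1}{2b}\, e^{-|w_i|/b}
  &\;\Rightarrow\; -\log p(w) = \frac{1}{b} \sum_i |w_i| + \text{const}
  \quad \text{(L1 penalty)}
\end{aligned}
```

Both priors factorise over the individual weights, which is the i.i.d. assumption in question. A prior that couples neighbouring taps within a filter would add cross terms to the penalty instead.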

Yes. Working on some different regularization techniques.

The piano stuff already seemed like 'dream music', as did the 'babble' examples. I found myself terribly frustrated by how short all those examples were. I wanted lots more :)

Do they say how much time generation takes?

Is this insanely slow to train but extremely fast to do generation?

"After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."

So it looks like generation is a slow process.

Relatively, training is fast (due to parallelism / masking so you don't have to sample during training) but during generation sampling is a sequential process. They talk about it a bit in the previous papers for PixelCNN and PixelRNN.
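The sequential part looks roughly like the loop below: each new sample needs a full forward pass conditioned on everything generated so far, so at a 16kHz sample rate one second of audio means 16,000 passes that cannot run in parallel. `toy_model` is a hypothetical stand-in that returns a probability distribution; it has nothing to do with the real network.

```python
import random

def toy_model(history, n_classes=4):
    """Stand-in for the network: a distribution over the next value."""
    if not history:
        return [1.0 / n_classes] * n_classes
    probs = [0.1] * n_classes
    probs[(history[-1] + 1) % n_classes] = 0.7   # favour "last value + 1"
    return probs

def generate(model, n_steps, seed=0):
    """Draw one value at a time and feed it back in as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_steps):
        probs = model(samples)                   # one forward pass per sample
        nxt = rng.choices(range(len(probs)), weights=probs)[0]
        samples.append(nxt)                      # the next step depends on this
    return samples

SAMPLE_RATE = 16000                              # samples per second of audio
clip = generate(toy_model, n_steps=100)
passes_per_second_of_audio = SAMPLE_RATE         # one model call per sample
```

During training the true previous samples are already known, so every step's prediction can be computed at once; that shortcut is not available when generating.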

According to 3rd hand reports I've heard (apply copious amounts of salt), it may take 1 hour of CPU time to generate 1 second of speech.

I was wondering the same. They don't mention anything about how long it took on what kind of system. Even for a first beta it would give us some ballpark idea of how slow it is -- because it's clearly slow, they're just holding back exactly how slow, so it's probably bad.

Please please please someone please share an IPython notebook with something working already :)

I have some IPython notebooks for speech analysis using a Chinese corpus. I used those for a tutorial on machine learning with Python and unfortunately they are still a bit incomplete, but maybe you'll find them useful nevertheless (no deep learning involved though). What I do in the tutorial is to start from a WAV file and then go through all the steps required for analyzing the data (using a "traditional" approach), i.e. generate the Mel-Cepstrum coefficients of the segmented audio data and then train a model to distinguish individual words. Word segmentation is another topic that I touch on a bit, and where we can also use machine learning to improve the results.

Here's a version with very simple speech training data (basically just different syllables with different tones):


More complex speech training data (from a real-world Chinese speech corpus [not included but downloadable]):


There are other parts of the tutorial that deal with Chinese text and character recognition as well, if you're interested:


For part 2 I also train a simple neural network with lasagne (a Python library for deep learning), and I plan to add more deep learning content and do a clean write-up of the whole thing as soon as I have some more time.

Thanks! will take a look.

It takes 90 minutes to synthesize 1 second. Sorry, no laptop version yet.


I second that!

This is incredible. I'd be worried if I were a professional audiobook reader :)

I worked for Audible for five years, and this exact conversation was had often in my division (ACX.com - Audible's "Audiobook Creation Exchange".)

Audible brought ACX together in order to bolster its catalog. The company-wide initiative was called PTTM ('pedal to the metal') and ACX was Audible's secret weapon to gain an enormous competitive foothold over the rest of the audiobook industry. Because we paid amateurs dirt-cheap rates to record horrible, self-published crap (to which Amazon, Audible's parent company had the exclusive rights), Audible was able to bolster its numbers substantially in a short period of time.

The dirty not-so-secret behind this strategy was: nobody bought these particular audiobooks. These audio titles were not really made to be "purchased," but rather to bulk up Audible's bottom line. We knew that the ACX titles were not popular, because the amateur narrators' acting talents and audio production skills were remarkably subpar.

Neural nets may be able to narrow the gap between the pros and the lowest-common-denominator to the point where they can become the next "ACX," but frankly, it won't matter to audiobook listeners, because audiobook listeners don't buy "ACX" audiobooks. Books, even in audio form, are a major intellectual and temporal commitment (not to mention -- they tend to be pricey.) Customers will always want to buy the human-narrated version of a book - the professional production of a book. If that stops being offered, Audible will anger a lot of customers and I think Bezos has better shit to worry about than his puny audiobooks subsidiary.

Despite that, user-generated content is a secret weapon that a lot of websites wield effectively - including HN - but this is beginning to shed its effectiveness. Indeed, the next generation of cost-slashing-while-polluting-the-quality-of-your-catalog will belong to the neural nets. They may be able to get better sales than ACX titles do today with AI-generated audio content, but the actors are going nowhere.

I've listened to some LibriVox recordings of public domain works, notably A Princess of Mars. The price was right at the time, though the quality was, as you say, remarkably subpar. If I could have had a neural net read me the book instead of having to adjust to a new narrator every chapter, that would have been preferable.

That said, I have money now, so give me Todd McLaren narrating Altered Carbon for the cost of an Audible Credit every time.

I wouldn't. The results they offer are excellent, but the missing points they need to achieve human level are related to producing the correct intonation, which requires accurate understanding of the material. That is still at least ten years in the future, I expect.

Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis.

A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.

Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.

There's zero chance of effective intonation and tone without understanding of the material.

I think your use of the term "understanding" is very unhelpful here. It's better to think about what you need to condition on to predict correctly.

In fact most intonation decisions are pretty local, within a sentence or two. The most important things are given/new contrasts, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.

The same sentence can have a very nonlocal difference in intonation.

Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.

On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which is the subject of the novel since page 1, you will imply urgency.

With enough labor, you could annotate enough sentences to cover a lot of dialogue cases. Sections like "'Stop!', he said angrily/dryly/mockingly" are probably fairly common. You'd be modeling the next most probable inflection given previous words and selected tones.

What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as well as sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).

And who says it can't understand the material? There have been recurrent networks trained that can translate between languages, or predict the next word in a sentence, at remarkable accuracy. Combined with wavenet this could be quite effective.

There could be cases where the intonation is dependent on things entirely outside of the book. If say a politician does something in the writing that is far from what we would expect them to do in today's world.

How about we allow annotation of text with prosody cues? Mark the words you want stressed. We already use question and exclamation marks.

I'd love that. Writing is a poor representation of language. It'd be nice to bring it up a notch. Here's a suggestion in a paper I wrote on better second language acquisition. https://www.researchgate.net/publication/261022308_BETTER_SE...

Like traditional audio books can capture perfectly what you're referring to...

They can, though?

I don't see why many aspects of intonation couldn't be taught the same way ...

I think the point is that different parts of the story need different intonation patterns (reading a scary part vs a boring part, etc.).

So in theory, it could be achieved by having multiple training sets (for the different intonation styles), along with analysis of the text to direct which part of the text needs what intonation. You might even be able to blend intonations.

Or just pay MTurk workers to annotate texts with intonation cues.

I kinda doubt that would be profitable relative to just hiring readers, but in general you don't need to replace workers completely to cannibalize some of their wages/jobs.

Or treat it as part of the original author's job. When you write a piece of music you add tempo and intensity metadata to the score, so why not do the same when writing a novel?

Or the author could just add that information to the text. This way there's no need to "understand" it.

There have been significant advances in sentiment analysis too. Trading bots use sentiment analysis as some of the input for their time series prediction algorithms. I would not say 10 years.

What about auto-tuning? I can do a pretty good reading-with-intention but I don't have the melt-your-brain-rich tones of Stephen Fry or Ian McKellen.

That is so exciting for me. I love listening to audiobooks when I'm walking my dog, or driving, or something boring that doesn't need my brain but does need my arms.

The issue is the selection is so much smaller than the selection of books.

Indeed. It also sounds like it could be trained to correctly read math or code, two things that take enough expertise to pronounce properly that most text-to-speech engines fail miserably at them.

Something like:

"a times the quantity b plus c"

If read with proper inflection, this would be a vast improvement and could open up all sorts of technical material to people for whom audio learning is preferred.
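The text side of this (deciding where "the quantity" goes) can be done mechanically from the expression's parse tree, before any speech model gets involved. A small sketch using Python's own parser follows; it handles only names, numbers and the four basic operators, and the wording rules are my own guess at a reasonable reading convention.

```python
import ast

WORDS = {ast.Add: "plus", ast.Sub: "minus", ast.Mult: "times", ast.Div: "over"}
PREC = {ast.Add: 1, ast.Sub: 1, ast.Mult: 2, ast.Div: 2}

def speak(expr):
    """Turn an arithmetic expression string into an unambiguous spoken phrase."""
    return _speak(ast.parse(expr, mode="eval").body)

def _speak(node, parent_prec=0, right_side=False):
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in PREC:
        prec = PREC[type(node.op)]
        left = _speak(node.left, prec)
        right = _speak(node.right, prec, right_side=True)
        phrase = f"{left} {WORDS[type(node.op)]} {right}"
        # Say "the quantity" exactly where written parentheses were required.
        grouped = prec < parent_prec or (right_side and prec == parent_prec)
        return f"the quantity {phrase}" if grouped else phrase
    raise ValueError("unsupported expression")

spoken = speak("a*(b+c)")
```

Here `spoken` comes out as the phrase quoted above, while `speak("a*b+c")` gives "a times b plus c" with no grouping words. Getting the pauses and stress right when that phrase is read aloud is the part a model like this would have to learn.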

I think back to the first math teacher I had whose pronunciation of the notation was precise and unambiguous enough that one didn't really have to be watching the board. This is a rare gift. It is possible in many areas of math, yet few teachers master it (or realize how helpful it is).
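As a rough illustration of the "a times the quantity b plus c" idea above, here is a minimal sketch that turns a small arithmetic expression into an unambiguous spoken string. It only covers the text side; the inflection would still have to come from the speech model. The function names `speak` and `read_aloud` and the word choices ("over" for division, "the quantity" for a grouped operand) are assumptions made for this example, and only `+`, `-`, `*`, `/`, names and numbers are handled.

```python
import ast

# Spoken word and precedence for each operator this sketch supports.
OPERATORS = {
    ast.Add: ("plus", 1),
    ast.Sub: ("minus", 1),
    ast.Mult: ("times", 2),
    ast.Div: ("over", 2),
}


def speak(node, min_prec=0):
    """Return a spoken form of a parsed arithmetic expression."""
    if isinstance(node, ast.Expression):
        return speak(node.body)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
        word, prec = OPERATORS[type(node.op)]
        # The right operand gets a stricter threshold so that
        # "a - (b - c)" is not read the same way as "a - b - c".
        phrase = "{} {} {}".format(
            speak(node.left, prec), word, speak(node.right, prec + 1)
        )
        # Announce a grouped operand the way parentheses show it on the page.
        return "the quantity " + phrase if prec < min_prec else phrase
    raise ValueError("unsupported expression: " + ast.dump(node))


def read_aloud(expression):
    return speak(ast.parse(expression, mode="eval"))


print(read_aloud("a * (b + c)"))
print(read_aloud("a * b + c"))
```

Where the grouped part comes first, as in "(a + b) * c", the words alone are still ambiguous about where the quantity ends, which is where the proper inflection mentioned above would have to do the work.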

I'm an audiobook junkie and as far as professional narrators go, I think it'd be hard to replace a high-end performance with something computer generated and end up with the level of quality offered by the likes of a great narrator like Scott Brick. I mention him by name because it was him that made me realize how important good quality narration is. I had purchased a book at an airport bookstore on a whim and while waiting for a plane was so disgusted with the poor quality of the writing that I actually threw the book out[0]. Years later, I had grabbed an audio book by an author I hadn't heard of simply because it was read by Scott Brick and recommended to "Read Next". Two hours in and I realized the book I had been enjoying so much was the same terrible book I had thrown out years before[1].

While I don't doubt it'll be possible for a computer to match it with enough input data (both in voice and human adjustment), it'll probably be a while before we'll be there and when we are there it'll likely require a lot of adjustment on the part of a professional. A big part of narration is knowing when and where a part of the story requires additional voice acting (and understanding what is required). A machine generated narration would have to understand the story sufficiently to be able to do that correctly. They might be able to get the audio to sound as good as it would sound if I narrated it, but someone with talent in the area is going to be hard to match.

All of that aside, it's getting pretty close to "good enough". When it reaches that point, my hope is that more books will have audio versions available[2]. In all likelihood, some books that would be narrated by a person today will be narrated by technology instead, limiting human narration to the top x% of books.

[0] I always resell books or donate them. This book was so bad that the half-hour it took from my life felt like a tragedy. I threw it out to prevent someone from experiencing its awfulness -- even for free.

[1] I realized it was the same book at the point a story was told that I had only read in the first book (and found mildly humorous). The reason I hated the other book was that it was written in the first person as a New York cop. I couldn't form a mental picture and the character was entirely unbelievable and one dimensional. When narrated properly, that problem was eliminated.

[2] I "speed read" (not gimmicky ... scan/skimming) and consume a ton of text. I've been doing it for 20 years or so and find it difficult to read word-for-word as is required for enjoyment of fiction, so to "force" it, I stick with audio books for fiction and love them.

I too greatly appreciate highly skilled readers. It's another layer of creativity and inspiration in addition to the text, and when done well adds a lot to the book.

I only fell in love with the voice of a single audiobook narrator. I checked, and yes, he was Scott Brick. I think he adds about 50% on top of the value of the written book by his interpretation.

He's incredible - some people complain that he's a little bit of a slow reader, but audio book apps usually have a speed option. He enunciates well and adds a depth of feeling to the work that can take a book that's average up several notches.

He's also the only narrator that I can name[0].

[0] Who isn't well known for other things -- Douglas Adams narrated his entire series, and some actors are also regular audio book narrators, but Scott Brick is purely a narrator (or at least, was when I last looked).

How much data does a model take up? I wonder if this would work for compression? Train a model on a corpus of audio, then store the audio as text that turns back into a close approximation of that audio. (Optionally store deltas for egregious differences.)

Decompression would be slow with current models and hardware, because generation is sequential, but it would be very efficient information-wise: you only have to send text, which can itself be compressed!
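To make the "model plus deltas" idea concrete, here is a minimal sketch under heavy assumptions: the `predict` function is a hypothetical stand-in for a trained network (it just extrapolates from the last two samples), the samples are plain integers, and the text conditioning is left out entirely. It only shows that storing the differences between the signal and the model's guesses allows exact reconstruction, and that decoding has to proceed one sample at a time.

```python
def predict(history):
    """Stand-in for a trained model: guess the next sample from past samples.

    Hypothetical placeholder; a real system would query a learned network.
    """
    if len(history) < 2:
        return history[-1] if history else 0
    # Linear extrapolation from the last two samples.
    return 2 * history[-1] - history[-2]


def encode(samples):
    """Store only the difference between each sample and the model's guess."""
    residuals = []
    for i, sample in enumerate(samples):
        residuals.append(sample - predict(samples[:i]))
    return residuals


def decode(residuals):
    """Rebuild the signal sample by sample; each step needs the earlier output."""
    samples = []
    for residual in residuals:
        samples.append(predict(samples) + residual)
    return samples


signal = [0, 3, 6, 9, 12, 14, 16, 18, 19, 20]
deltas = encode(signal)
assert decode(deltas) == signal
print(deltas)
```

The better the predictor, the more of the deltas are zero or near zero and the cheaper they are to store. The loop in `decode` is also why decompression would be slow: every sample needs a fresh model evaluation that depends on the samples before it.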

I am sure people will start trying to speed this up, as it could be a game changer in that space with a fast enough implementation. Google also has a lot of great engineers with direct motivation to get it working on phones, and a history of porting recent research into the Android speech pipeline.

The results speak for themselves - step 1 is almost always "make it work" after all, and this works amazingly well! Step 2 or 3 is "make it fast", depending who you ask.

We've known for decades that neural networks are really good at image and video compression. But as far as I know, this has never been used in practice, because the compression and decompression times are ridiculous. I imagine this would be even more true for audio.

The Magic Pony guys (who sold to Twitter) have patents and implementations of a super-resolution CNN for realtime video.


Wow! I'd been playing around with machine learning and audio, and this blows even my hilariously far-future fantasies of speech generation out of the water. I guess when you're DeepMind, you have both the brainpower and resources to tackle sound right at the waveform level, and rely on how increasingly-magical your NNs seem to rebuild everything else you need. Really amazing stuff.

I'm guessing DeepMind has already done this (or is already doing), but conditioning on a video is the obvious next step. It would be incredibly interesting to see how accurate it can get generating the audio for a movie. Though I imagine for really great results they'll need to mix in an adversarial network.

Oh yes, extract voice and intonation from one language, and then synthesize it in another language -> we get automated dubbing. Could also possibly try to lipsync.

Wow. I badly want to try this out with music, but I've taken little more than baby steps with neural networks in the past: am I stuck waiting for someone else to reimplement the stuff in the paper?

IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...

Re-implementation will be hard. Several people (including me) have been working on related architectures, but they have a few extra tricks in WaveNet that seem to make all the difference, on top of what I assume is "monster scale training, tons of data".

The core ideas from this can be seen in PixelRNN and PixelCNN, and there are discussions and implementations for the basic concepts of those out there [0][1]. Not to mention the fact that conditioning is very interesting / tricky in this model, at least as I read it. I am sure there are many ways to do it wrong, and getting it right is crucial to having high quality results in conditional synthesis.

[0] https://github.com/tensorflow/magenta/blob/master/magenta/re...

[1] https://github.com/igul222/pixel_rnn/blob/master/pixel_rnn.p...
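For anyone wanting a feel for the basic building block, here is a minimal sketch of a dilated causal convolution, the ingredient WaveNet shares in spirit with the masked convolutions of PixelCNN: each output may only depend on the current and earlier inputs. The function names and the weights used here are made up for illustration, and none of the gating, residual connections, conditioning or training is included.

```python
def causal_conv(signal, weights, dilation=1):
    """1-D causal convolution: output[t] uses signal[t], signal[t - d], ... only."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start act as zero padding
                total += w * signal[idx]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Number of past-and-present samples one output of the stack can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
h = x
for d in (1, 2, 4):  # doubling the dilation in each layer
    h = causal_conv(h, [0.5, 0.5], dilation=d)

print(h)
print(receptive_field(2, [1, 2, 4]))
```

Doubling the dilation per layer is what lets a short stack cover a long stretch of audio: with kernel size 2 and dilations 1, 2, 4 the last output already sees 8 samples.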

Is there any usable example code out there I can play with? I don't care if it sounds noisy and weird, it's all grist for the sampler anyway.

And when you think of all those Hollywood SF movies where the robot could reason and act quite well but spoke in a tinny voice. How wrong they got it. We can simulate high-quality voices but we can't build reasoning, walking robots.

Depending on what you mean by 'reasoning, walking robots', not really, not yet. But every few weeks or months another amazing deep learning/NN result comes out in a different domain, so these types of techniques seem to have very broad application.

Of course, if you mean 'walking' in a literal sense, there are a number of impressive walking robots such as Atlas https://www.youtube.com/watch?v=rVlhMGQgDkY, HRP-2 https://www.youtube.com/watch?v=T6BSSWWV-60 or HRP-4C https://www.youtube.com/watch?v=YvbAqw0sk6M, etc. There are also many types of useful reasoning systems. I am guessing you are thinking of language understanding and generation, but I believe these types of techniques are being applied quite impressively in that area too, by DeepMind or Watson https://www.youtube.com/watch?v=i-vMW_Ce51w etc.

"At Vanguard, my voice is my password..."

This is amazing. And it's not even a GAN. Presumably a GAN version of this would be even more natural — or maybe they tried that and it didn't work so they didn't put it in the paper?

Definitely the death knell for biometric word lists.

Please make it sound like Morgan Freeman.

Morgan Freeman +1

I hope this shows up as a TTS option for VoiceDream (http://www.voicedream.com/) soon! With the best voices they have to offer (currently, the ones from Ivona), I can suffer through a book if the subject is really interesting, but the way the samples sounded here, the WaveNet TTS could be quite pleasant to listen to.

Would delete this post if I could. Was a request to fix a broken link. Now fixed.

It seems fixed now.

So when does the album drop?

In case the above came across as an example of bad sarcasm, I'm very serious. I've a somewhat lazy interest in generative music, and found the snippets in the paper quite appealing.

Though, as was mentioned in a previous comment, due to copyright (attribution based on training data sources, blah blah) I might already have an answer. :(

“Is this Hiromi Uehara or WaveNet?”

I wonder how a hybrid model would sound, where the net generates parameters for a parametric synthesis algorithm (or a common speech codec) instead of samples, to reduce CPU costs.
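A minimal sketch of what such a hybrid could look like, with everything made up for illustration: `predict_frame_parameters` is a hypothetical stand-in for the network and emits one (pitch, amplitude) pair per 5 ms frame, and a plain sine oscillator plays the role of the parametric synthesizer. The sample rate and frame size are assumptions as well. The point is only the cost structure, one model evaluation per frame instead of one per sample.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)
FRAME_SIZE = 80      # samples per parameter frame, i.e. 5 ms (assumed)


def predict_frame_parameters(num_frames):
    """Hypothetical stand-in for the network: one (pitch_hz, amplitude) per frame."""
    return [(220.0 + 2.0 * i, 0.5) for i in range(num_frames)]


def synthesize(frames):
    """Cheap parametric synthesis: a sine oscillator driven by frame parameters."""
    samples = []
    phase = 0.0
    for pitch_hz, amplitude in frames:
        step = 2.0 * math.pi * pitch_hz / SAMPLE_RATE
        for _ in range(FRAME_SIZE):
            samples.append(amplitude * math.sin(phase))
            phase += step  # carrying the phase across frames avoids clicks
    return samples


frames = predict_frame_parameters(200)
audio = synthesize(frames)
print(len(frames), "model evaluations for", len(audio), "samples")
```

The catch is that the output can only be as good as the synthesizer allows, which is presumably the trade-off against predicting raw samples directly.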

The first to do semantic style transfer on audio gets a cookie!

When can we expect this to be used in Google's TTS engine?

Love the music part! Mmmh ... infinite jazz.

Finally a convincing Simlish generator!

I hope they can release some source code.

I wonder how many GPUs are required to hold this model.

I suppose it's impressive in a way, but when I looked into "smoothing out" text-to-speech audio a few years ago, it seemed fairly straightforward. I was left wondering why it hadn't been done already, but alas, most engineers at these companies are either politicking know-nothing idiots, or are constantly being roadblocked, preventing them from making any real advancements.
