The only thing I can hear that sounds unnatural is the way the reverberation in the room (the "echo") immediately gets quieter when the raw piano sound itself gets quieter. In a real room, if you produce a loud sound and immediately afterwards a soft sound, the reverberation of the loud sound remains. But since this network only models "the sound right now", the volume of the reverberation follows the volume of the piano sound.
To my ears, this is most noticeable in the last example, which starts out loud and gradually becomes softer. It sounds a bit like they are cross-fading between multiple recordings.
Regardless, the piano sounds completely natural to me, I don't hear any artifacts or sounds that a real piano wouldn't make. Amazing!
There are also fragments that sound inspiring and very musical to my ears, such as the melody and chord progression after 00:08 in the first example.
If you train NNs at the phrase level and overfit, then you get something that is indeed more or less the same as cross-fading at random between short sections.
Piano music is very idiomatic, so you'll capture some typical piano gestures that way.
But I'd be surprised if the music stays listenable for long. Classical music has big structures, and there's a difference between recognising letters (notes), recognising phrases (short sentences), recognising paragraphs (phrase structures), and parsing an entire piece (a novel or short story with characters and multiple plot lines).
Corpus methods don't work very well for non-trivial music, because there's surprisingly little consistency at the more complex levels.
NN synthesis could be an interesting thing though. If you trained an NN on *sounds* at various pitches and velocity levels, you might be able to squeeze a large and complex collection of samples into a compressed data set.
Even if the output isn't very realistic, you'd still get something unusual and interesting.
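A minimal sketch of that idea, purely illustrative: train a small network to map (pitch, velocity, time) to an amplitude, so the whole sample library is "stored" in the weights. The shapes, layer sizes and the random placeholder data below are my own assumptions, not anything from the paper.

    import numpy as np
    import tensorflow as tf

    # Placeholder training data standing in for a real sample library:
    # for each recorded note we have (pitch, velocity), the time t into the
    # note, and the target amplitude at that time.
    n_examples = 10_000
    pitch = np.random.randint(21, 109, size=n_examples)      # MIDI note numbers
    velocity = np.random.randint(1, 128, size=n_examples)
    t = np.random.uniform(0.0, 2.0, size=n_examples)         # seconds into the note
    X = np.stack([pitch / 127.0, velocity / 127.0, t], axis=1).astype("float32")
    y = np.random.uniform(-1.0, 1.0, size=(n_examples, 1)).astype("float32")

    # A tiny MLP: the entire "sample set" ends up compressed into these weights.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", input_shape=(3,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1, activation="tanh"),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=1, batch_size=256, verbose=0)

    # "Playing" a note is then just evaluating the network over time.
    note = np.stack([np.full(44100, 60 / 127.0),              # middle C
                     np.full(44100, 100 / 127.0),             # velocity
                     np.arange(44100) / 44100.0], axis=1).astype("float32")
    waveform = model.predict(note, verbose=0)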
What I'd really like to see therefore is a systematic comparison of the generated music to the training set, ideally using a measure of similarity.
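One rough way such a comparison could be done, as a sketch only: summarise each clip by its MFCC statistics and measure the cosine distance between the summaries. This assumes librosa and scipy, the file names are made up, and MFCC statistics are only a crude proxy for musical similarity.

    import numpy as np
    import librosa
    from scipy.spatial.distance import cosine

    def mfcc_signature(path):
        """Summarize a clip by the mean and std of its MFCCs."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Hypothetical file names: one generated clip vs. one training-set clip.
    generated = mfcc_signature("generated_piano.wav")
    training = mfcc_signature("training_excerpt.wav")

    # Cosine distance: 0 means identical statistics, larger means less similar.
    print("distance:", cosine(generated, training))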
It'd also be interesting to know if this technique could solve the "de-reverberation" problem.
1. Train a WaveNet with the source speaker.
2. Train a second WaveNet with the target speaker. Or for something totally new, train a WaveNet with a bunch of different speakers until you get one you like. This becomes the target WaveNet.
3. Record raw audio from the source speaker.
Fun fact: any algorithmic process that "renders" something given a set of inputs can be "run in reverse" to recover those inputs given the rendered output. In this case, we now have raw audio from the source speaker that, in principle, could have been rendered by the source speaker's WaveNet, and we want to recover the inputs that would have rendered it, had we done so.
To do that, usually you convert all numbers in the forward renderer into Dual numbers and use automatic differentiation to recover the inputs (in this case, phonemes and what not).
4. Recover the inputs. (This is computationally expensive, but not difficult in practice, especially if WaveNet's generation algorithm is implemented in C++ and you've got a nice black-box optimizer to apply to the inputs, of which there are many freely available options; a sketch follows after these steps.)
5. Take the recovered WaveNet inputs, feed them into the target speaker's WaveNet, and record the resulting audio.
Result: the raw audio will have the same overall performance and speech as the source speaker's, but rendered completely naturally in the target speaker's voice.
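Here is a minimal sketch of the recovery in step 4, under a big assumption: that the forward renderer is differentiable, so automatic differentiation and gradient descent can be used instead of a black-box optimizer. The `render` function below is a made-up stand-in for a WaveNet, not anything from the paper.

    import tensorflow as tf

    def render(inputs):
        """Stand-in for a differentiable synthesizer: maps conditioning inputs to audio.
        A real WaveNet would go here; this toy version just mixes a few sinusoids."""
        t = tf.range(16000, dtype=tf.float32) / 16000.0
        freqs = 100.0 + 1000.0 * tf.nn.sigmoid(inputs)        # one frequency per input
        return tf.reduce_sum(tf.sin(2.0 * 3.14159265 * freqs[:, None] * t[None, :]), axis=0)

    # "Recorded" audio we want to explain, produced from some unknown inputs.
    true_inputs = tf.constant([0.3, -1.2, 0.7])
    target_audio = render(true_inputs)

    # Recover the inputs by gradient descent on the reconstruction error.
    guess = tf.Variable(tf.zeros(3))
    opt = tf.keras.optimizers.Adam(0.05)
    for _ in range(2000):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(render(guess) - target_audio))
        opt.apply_gradients([(tape.gradient(loss, guess), guess)])

    print(guess.numpy())  # hopefully close to true_inputs (local minima permitting)

In practice this kind of search can get stuck in local minima, which is part of why a black-box optimizer is suggested in step 4.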
You don't send your speech over the line; instead, you send some parameters, which are then, at the receiving end, fed into a white(-ish) noise generator to recover the speech.
Edit: not by using a neural net or deep learning, of course.
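That's classic LPC-vocoder territory. A rough sketch of the idea, assuming librosa and scipy and a hypothetical speech file; a real codec does this frame by frame with pitch/voicing decisions and proper gain coding, so this is only illustrative:

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    # Sender side: fit an all-pole (LPC) model to a short frame of speech.
    y, sr = librosa.load("speech.wav", sr=8000)        # hypothetical speech clip
    frame = y[8000:8000 + 240]                          # one 30 ms frame
    a = librosa.lpc(frame, order=12)                    # all-pole filter coefficients
    gain = np.std(lfilter(a, [1.0], frame))             # energy of the prediction residual

    # What actually goes "over the line": ~13 numbers instead of 240 samples.
    params = (a, gain)

    # Receiver side: excite the all-pole filter with white(-ish) noise.
    excitation = np.random.randn(len(frame)) * gain
    reconstructed = lfilter([1.0], a, excitation)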
Now wait a minute, most algorithms cannot be run in reverse! The only general way to reverse an algo is to try all possible inputs, which has exponential complexity. That's the basis of RSA encryption. Maybe you're thinking about automatic differentiation, a general algo to get the gradient of the output w.r.t. the inputs. That allows you to search for a matching input using gradient descent, but that won't give you an exact match for most interesting cases (due to local minima).
I'm not trying to nitpick -- in fact I believe that IF algos were reversible then human-level AI would have been solved a long time ago. Just write a generative function that is capable of outputting all possible outputs, reverse, and inference is solved.
E.g. a forward problem might be to solve some PDE to simulate the state of a system from some known initial conditions.
The inverse problem could be to try to reverse engineer what the initial conditions were given the observed state of the system.
Inverse problems are typically much harder to deal with, and much harder to solve. E.g. perhaps they don't have a unique solution, or the solution is a highly discontinuous function of the inputs, which amplifies any measurement errors. In practice this can be addressed by regularisation aka introducing strong structural assumptions about what the expected solution should be like. This can be quite reasonable from a Bayesian perspective.
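A tiny numpy illustration of that point: a 1-D smoothing "forward problem", the noise amplification you get from naive inversion, and a Tikhonov-regularised inverse. The operator and constants are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x_true = np.zeros(n)
    x_true[40:60] = 1.0                                  # the unknown "initial condition"

    # Forward problem: smooth x with an averaging operator A, then add measurement noise.
    A = 0.5 * np.eye(n) + 0.25 * np.eye(n, k=1) + 0.25 * np.eye(n, k=-1)
    y = A @ x_true + 0.01 * rng.standard_normal(n)

    # Naive inversion amplifies the noise badly (the problem is ill-conditioned).
    x_naive = np.linalg.solve(A, y)

    # Tikhonov regularisation: encode the assumption that the solution is not wild.
    lam = 1e-2
    x_reg = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

    print("naive error:      ", np.linalg.norm(x_naive - x_true))
    print("regularised error:", np.linalg.norm(x_reg - x_true))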
"The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector. The dataset consisted of 44 hours of data from 109 different speakers."
Am I missing something?
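For what it's worth, the mechanics described in that sentence are simple. A toy sketch in numpy; the feature dimensions are my own guesses, not the paper's:

    import numpy as np

    num_speakers = 109
    speaker_id = 42

    # Global conditioning: a one-hot vector identifying the speaker...
    speaker_onehot = np.eye(num_speakers)[speaker_id]

    # ...broadcast over time and appended to whatever local features drive the model.
    timesteps = 16000
    local_features = np.random.randn(timesteps, 20)      # e.g. linguistic features
    conditioning = np.concatenate(
        [local_features, np.tile(speaker_onehot, (timesteps, 1))], axis=1)
    print(conditioning.shape)   # (16000, 129)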
"In order to use WaveNet to turn text into speech, we have to tell it what the text is. We do this by transforming the text into a sequence of linguistic and phonetic features (which contain information about the current phoneme, syllable, word, etc.) and by feeding it into WaveNet."
The raw audio from Step 3 was (in principle) generated by that input on a properly trained WaveNet. We need to recover that so we can transfer it to the target WaveNet.
How a specific WaveNet instance is configured (as you point out, it's part of the model parameters) is an implementation detail that is irrelevant for the steps I proposed.
I give us 10-15 years before it's not possible to trust anything you see or hear that's recorded.
Sound-alikes have been used in the music industry since forever.
These early recordings are incredibly crude, and they did not have the technology at the time to play them back. They were just experiments in trying to view sound waves, not attempts to preserve information for future generations.
I'm proposing reducing a vocal performance into the corresponding WaveNet input. At no point in that process is the actual "text" recovered, and doing so would defeat the whole purpose, since I don't care about the text, I care about the performance of speaking the text (whatever it was).
In your example, I can't force Trump to say something in particular. But I can force myself, so I could record myself saying something I wanted Clinton to say [Step 3] (and in a particular way, too!), and if I had a trained WaveNet for myself and Clinton, I could make it seem like Clinton actually said it.
Text -> features -> TrumpWaveNet -> Trump saying your text
Yes, that should work, and by tweaking the WaveNet input appropriately, you could also get him to say it in a particular way.
(surjective != non-injective, in the same way that non-increasing != decreasing)
At some point to authenticate both parties verify a short message by reading it to each other.
However, the NSA already tried to MitM that about 10 years ago using voice synthesis; it was deemed inadequate at the time. I wonder whether TTS improvements like these change that game and make it a more plausible scenario.
Dilated convolutions are already implemented in TF, look forward to someone implementing this paper and publishing the code.
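For example, with today's TF API a dilated causal convolution stack (the core WaveNet building block) can be sketched like this; the layer sizes and dilation schedule are placeholders, not the paper's exact configuration:

    import tensorflow as tf

    def wavenet_stack(x, filters=32, dilations=(1, 2, 4, 8, 16, 32)):
        """Stack of dilated causal 1-D convolutions; receptive field doubles per layer."""
        for d in dilations:
            x = tf.keras.layers.Conv1D(filters, kernel_size=2, dilation_rate=d,
                                       padding="causal", activation="relu")(x)
        return x

    inputs = tf.keras.Input(shape=(16000, 1))                          # one second at 16 kHz
    outputs = tf.keras.layers.Conv1D(256, 1)(wavenet_stack(inputs))    # 256-way sample logits
    model = tf.keras.Model(inputs, outputs)
    model.summary()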
That sounds just like a small kid imitating a foreign (or their own) language. My kids are growing up bilingual, and I heard them attempt something similar when they were really small. I guess it's like listening in on their neural network modelling the sound of the new language.
The monkey who shot the picture. https://en.wikipedia.org/wiki/Monkey_selfie
For the same reason, Google training neural networks with user data is legally very doubtful – they changed the ToS, but also used data collected before the ToS change for that.
What if my 'AI' was a human who learned to speak by being trained with the voices of hundreds of people from dozens of other sources? What's the difference?
Those waters seem muddy. I think that'd be an interesting copyright case, don't think it's self evident.
More like turning the songs into series of numbers (say, 44100 of these numbers per second) and then using an AI to predict which number comes next, to make a song that sounds something like the 200. The result is not possible without ingesting the 200 songs, but the 200 songs are not "contained" in the net and then sampled to produce the result, like stitching together a recording from other recordings by copying little bits.
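To make that framing concrete, here is a sketch of what the data looks like: mu-law quantization to 256 values (as in the paper) and (context, next-sample) training pairs. The context length and the sine-wave stand-in for a song are my own toy choices.

    import numpy as np

    def mu_law_encode(audio, mu=255):
        """Compand audio in [-1, 1], then quantize to 256 discrete levels."""
        companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
        return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

    # A stand-in for one of the 200 songs: one second of 44.1 kHz audio in [-1, 1].
    song = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
    samples = mu_law_encode(song)

    # Training pairs: "given the last 1024 numbers, which of 256 numbers comes next?"
    context = 1024
    X = np.stack([samples[i:i + context] for i in range(5000)])   # first 5000 windows only
    y = samples[context:context + 5000]
    print(X.shape, y.shape)   # (5000, 1024) (5000,)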
The hairs split too fine at the bottom for our current legal system to really handle. That's why it's interesting.
This might be an interesting read for you: http://ansuz.sooke.bc.ca/entry/23
On a hunch I'd say an absolute beginner may be able to get good results with these tools, just not as quickly as experts in the field who already know how to use the tools properly. That's why I'm going to wait for something to pop up on GitHub, because I have zero practical experience with these things, but I can read these papers comfortably without needing to look up every other term.
There are a number of applications I'd like to throw at deep learning to see how it performs. Most notably, I'd like to see how well a deep learning system can extract features from speckle images. At the moment you have to average out the speckles from ultrasound or OCT images before you can feed them to a feature recognition system. Unfortunately this kind of averaging eliminates certain information you might want to process further down the line.
I assume some regularization is added to the weights during training, say L1 or L2? If so, this is essentially equivalent to assuming the weight values are distributed i.i.d. Laplacian or Gaussian. It seems you could learn less noisy filters by using a prior that assumes dependency between values within each filter, thereby enforcing smoothness or piecewise smoothness of each filter during training.
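One way such a prior could be expressed, as a purely illustrative sketch: a hand-rolled Keras regularizer that penalizes differences between adjacent taps of each 1-D conv filter, i.e. an informal smoothness prior on the filter shape.

    import tensorflow as tf

    class SmoothnessRegularizer(tf.keras.regularizers.Regularizer):
        """Penalize differences between adjacent taps of each conv filter."""
        def __init__(self, strength=1e-4):
            self.strength = strength

        def __call__(self, w):
            # For Conv1D, w has shape (kernel_size, in_channels, out_channels);
            # differencing along axis 0 compares neighbouring taps.
            return self.strength * tf.reduce_sum(tf.square(w[1:] - w[:-1]))

        def get_config(self):
            return {"strength": self.strength}

    layer = tf.keras.layers.Conv1D(32, 9, kernel_regularizer=SmoothnessRegularizer(1e-4))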
Is this insanely slow to train but extremely fast to do generation?
So it looks like generation is a slow process.
Here's a version with very simple speech training data (basically just different syllables with different tones):
More complex speech training data (from a real-world Chinese speech corpus [not included but downloadable]):
There are other parts of the tutorial that deal with Chinese text and character recognition as well, if you're interested:
For part 2 I also train a simple neural network with lasagne (a Python library for deep learning), and I plan to add more deep learning content and do a clean write-up of the whole thing as soon as I have some more time.
Audible brought ACX together in order to bolster its catalog. The company-wide initiative was called PTTM ('pedal to the metal') and ACX was Audible's secret weapon to gain an enormous competitive foothold over the rest of the audiobook industry. Because we paid amateurs dirt-cheap rates to record horrible, self-published crap (to which Amazon, Audible's parent company had the exclusive rights), Audible was able to bolster its numbers substantially in a short period of time.
The dirty not-so-secret behind this strategy was: nobody bought these particular audiobooks. These audio titles were not really made to be "purchased," but rather to bulk up Audible's bottom line. We knew that the ACX titles were not popular, because the amateur narrators' acting talents and audio production skills were remarkably subpar.
Neural nets may be able to narrow the gap between the pros and the lowest-common-denominator to the point where they can become the next "ACX," but frankly, it won't matter to audiobook listeners, because audiobook listeners don't buy "ACX" audiobooks. Books, even in audio form, are a major intellectual and temporal commitment (not to mention -- they tend to be pricey.) Customers will always want to buy the human-narrated version of a book - the professional production of a book. If that stops being offered, Audible will anger a lot of customers and I think Bezos has better shit to worry about than his puny audiobooks subsidiary.
Despite that, user-generated content is a secret weapon that a lot of websites wield effectively - including HN - but it is beginning to lose its effectiveness. Indeed, the next generation of cost-slashing-while-polluting-the-quality-of-your-catalog will belong to the neural nets. They may be able to get better sales than ACX titles do today with AI-generated audio content, but the actors are going nowhere.
That said, I have money now, so give me Todd McLaren narrating Altered Carbon for the cost of an Audible Credit every time.
A big problem with generating prosody has always been that our theories of it don't really provide a great prediction of people's behaviours. It's also very expensive to get people to do the prosody annotations accurately, using whatever given theory.
Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.
In fact most intonation decisions are pretty local, within a sentence or two. The most important things are given/new contrasts, i.e. the information structure. This is largely determined by the syntax, which we're doing pretty well at predicting, and which latent representations in a neural network can be expected to capture adequately.
Say, “They went in the shed”. You won't pronounce it in a neutral voice if it was explained in the previous chapter that a serial killer is in it.
On the other hand, if the shed contains a shovel that is quickly needed to dig out a treasure, which is the subject of the novel since page 1, you will imply urgency.
What would require understanding would be novel arrangements and metaphor to indicate emotional state. On-the-fly variations to avoid monotony might also be difficult, as would sarcasm or combinations/levels (e.g. she spoke matter-of-factly but with mirth lightly woven through).
So in theory, it could be achieved by having multiple training sets (for the different intonation styles), along with analysis of the text to direct which part of the text needs what intonation. You might even be able to blend intonations.
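Mechanically, the blending part could be as simple as interpolating the conditioning vectors. A sketch; the style names and weights here are entirely made up:

    import numpy as np

    # Suppose the model was trained with a one-hot "intonation style" conditioning vector.
    styles = {"neutral": 0, "urgent": 1, "ominous": 2}

    def style_vec(name):
        v = np.zeros(len(styles))
        v[styles[name]] = 1.0
        return v

    # Text analysis decides the target style per passage; blending is a weighted sum.
    blended = 0.7 * style_vec("urgent") + 0.3 * style_vec("ominous")
    print(blended)   # fed to the synthesizer alongside the text features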
I kinda doubt that would be profitable relative to just hiring readers, but in general you don't need to replace workers completely to cannibalize some of their wages/jobs.
The issue is the selection is so much smaller than the selection of books.
If read with proper inflection, this would be a vast improvement and could open up all sorts of technical material to people for whom audio learning is preferred.
I think back to the first math teacher I had whose pronunciation of the notation was precise and unambiguous enough that one didn't really have to be watching the board. This is a rare gift; it is possible in many areas of math, yet few teachers master it (or realize how helpful it is).
While I don't doubt it'll be possible for a computer to match it with enough input data (both in voice and human adjustment), it'll probably be a while before we get there, and when we do, it'll likely require a lot of adjustment on the part of a professional. A big part of narration is knowing when and where a part of the story requires additional voice acting (and understanding what is required). A machine-generated narration would have to understand the story sufficiently to do that correctly. It might be able to get the audio to sound as good as it would if I narrated it, but someone with talent in the area is going to be hard to match.
All of that aside, it's getting pretty close to "good enough". When it reaches that point, my hope is that more books will have audio versions available, and in all likelihood some books that would have been narrated by a person today will instead be narrated by technology, limiting human narration to only the top x% of books.
 I always resell books or donate them. This book was so bad that the half-hour it took from my life felt like a tragedy. I threw it out to prevent someone from experiencing its awfulness -- even for free.
I realized it was the same book at the point where it told a story I had only read in the first book (and found mildly humorous). The reason I hated the other book was that it was written in the first person as a New York cop; I couldn't form a mental picture, and the character was entirely unbelievable and one-dimensional. When narrated properly, that problem was eliminated.
 I "speed read" (not gimmicky ... scan/skimming) and consume a ton of text. I've been doing it for 20 years or so and find it difficult to read word-for-word as is required for enjoyment of fiction, so to "force" it, I stick with audio books for fiction and love them.
He's also the only narrator that I can name.
 Who isn't well known for other things -- Douglas Adams narrated his entire series, and some actors are also regular audio book narrators, but Scott Brick is purely a narrator (or at least, was when I last looked).
I am sure people will start trying to speed this up, as it could be a game changer in that space with a fast enough implementation. Google also has a lot of great engineers with direct motivation to get it working on phones, and a history of porting recent research in to the Android speech pipeline.
The results speak for themselves - step 1 is almost always "make it work", after all, and this works amazingly well! Step 2 or 3 is "make it fast", depending on who you ask.
IIRC someone published an OSS implementation of the deep dreaming image synthesis paper fairly quickly...
The core ideas from this can be seen in PixelRNN and PixelCNN, and there are discussions and implementations of the basic concepts of those out there. Not to mention the fact that conditioning is very interesting/tricky in this model, at least as I read it. I am sure there are many ways to do it wrong, and getting it right is crucial to having high-quality results in conditional synthesis.
Of course, if you mean 'walking' in a literal sense, there are a number of impressive walking robots such as Atlas https://www.youtube.com/watch?v=rVlhMGQgDkY, HRP-2 https://www.youtube.com/watch?v=T6BSSWWV-60 or HRP-4C https://www.youtube.com/watch?v=YvbAqw0sk6M, etc. Also there are many types of useful reasoning systems. I am guessing you are thinking of language understanding and generation, but I believe these types of techniques are being applied quite impressively in that area also, from DeepMind or Watson https://www.youtube.com/watch?v=i-vMW_Ce51w etc.
Definitely the death knell for biometric word lists.
Though, as was mentioned in a previous comment, due to copyright (attribution based on training data sources, blah blah) I might already have an answer. :(
I wonder how many GPUs are required to hold this model.