Extending “Let It Go” with recurrent neural network (snucse.org)
131 points by elnn on June 11, 2015 | 45 comments

If you are into this kind of stuff (time series density/probability modeling), there are several places to go beyond this.

The simplest example is the RNN-RBM (restricted Boltzmann machine) from the Theano deep learning tutorials [1]. There are follow-up papers by Kratarth Goel extending this to LSTM-DBN, and RNN-NADE (neural autoregressive density estimator) has also been explored in this space. The results for MIDI are pretty good.

Alex Graves [2] has done a huge amount of work in sequence modeling for classification and generation. His work with LSTM-GMM for online handwriting modeling can also be applied to speech features (vocoded speech), and seems to work well. There is no official publication to date on the speech experiments themselves, but the procedure is well documented in his paper [3].

The most recent thing is our paper "A Recurrent Latent Variable Model for Sequential Data" [4]. It operates on raw time series (!) even though all intuition would say that an RNN shouldn't handle this well. Adding a recurrent prior seems to preserve style in the handwriting generations at least - I may try it on this "Let It Go" task and see what happens! We are also exploring conditional controls and higher-level speech features, which should result in samples as good as or better than LSTM-GMM.

[1] http://deeplearning.net/tutorial/rnnrbm.html#rnnrbm

[2] https://www.youtube.com/watch?v=-yX1SYeDHbg#t=36m50s

[3] http://arxiv.org/abs/1308.0850

[4] http://arxiv.org/abs/1506.02216

The infinite jukebox[0] is another interesting one to look at.

[0] http://labs.echonest.com/Uploader/index.html

Wow, that is quite good at what it does.

I love the potentially-infinite loops that are hit on "Rock the Casbah". "ROCK THE CASBAH ROCK THE CASBAH ROCK THE CASBAH ROCK THE CASBAH ROCK THE CASBAH".

Specifically, for these kinds of sequences (chord sequences, etc.), check out models that use note-level features but with a joint probability over notes (an RNN with sigmoid outputs and cross-entropy loss, RBM, NADE, etc.). It should do even better!
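To make "sigmoid outputs and cross-entropy loss" concrete, here's a minimal numpy sketch for one piano-roll frame (the 88-key vector and the names are just for illustration): each key gets an independent sigmoid, so several notes - a chord - can be "on" at once, which a 1-of-K softmax can't express.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def note_loss(logits, target):
    """Per-timestep loss for a piano-roll frame: 88 independent
    sigmoids, summed binary cross-entropy against the 0/1 target."""
    p = sigmoid(logits)
    eps = 1e-9  # avoid log(0)
    return -np.sum(target * np.log(p + eps)
                   + (1 - target) * np.log(1 - p + eps))

# toy frame: model is confident about the right chord
target = np.zeros(88)
target[[60, 64, 67]] = 1.0              # C major triad (MIDI note numbers)
logits = np.where(target == 1, 5.0, -5.0)
good = note_loss(logits, target)        # small loss
bad = note_loss(-logits, target)        # confidently wrong -> large loss
```

The RNN's hidden state would produce the logits at each timestep; the loss just treats each key as its own binary prediction.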

So I took a class in college (15 years ago) where we learned how to do some backprop (in plain Matlab). I've been seeing impressive demonstrations of RNNs in recent weeks. How (if at all) can a Matlab-Fortran-SAS grunt start from the beginning to at least understand the rudiments of how these work?

There is a nice book by my profs (Y. Bengio, A. Courville) and a former student of the lab (I. Goodfellow) here: http://www.iro.umontreal.ca/~bengioy/DLbook/ The chapter on RNNs is pretty enlightening. You probably want to start with something that goes through feedforward networks, convolutional networks, etc. first, though.

Andrew Ng's Coursera course does a super basic NN in MATLAB which might be good to shake the rust off [1]. Hugo Larochelle's youtube course [2], Geoff Hinton's Coursera course [3], and Nando's youtube Deep Learning course [4] are all very good. There are also some great lectures from Aaron Courville from the Representation Learning class that just finished here at UdeM [5].

If you want to do this for real (not self-teaching RNN backprop on toy examples), tool-wise you generally go with either Theano (my preference) or Torch - implementing the backward pass for these by hand (and verifying the gradients are correct) is not pretty.

Theano does it almost for free, and Torch can do it at the module level. Either way you definitely want one of the higher order tools!

[1] https://www.coursera.org/learn/machine-learning

[2] https://www.youtube.com/watch?v=SGZ6BttHMPw&list=PL6Xpj9I5qX...

[3] https://www.coursera.org/course/neuralnets

[4] https://www.youtube.com/watch?v=PlhFWT7vAEw

[5] https://ift6266h15.wordpress.com/

Thanks for the references!

It's really hard to connect "Let It Go" with the results that the program generates. Maybe if the song started with a recognisable line of tones and the algorithm put in some recognisable parts here and there, it would become more clear? Not sure. But to my untrained ears it was not possible to recognise the connection to "Let It Go".

My six-year-old identified it as reminding him of "Let It Go" with no prompting from me other than "What does this make you think of?" (He also said "It's great!" about four times before that.)

Not sure how many songs kids are exposed to these days, but maybe it's easier for young children to be reminded of "Let It Go" since it's very popular and they haven't heard a lot of other songs?

Having a 3-year-old who cannot get enough of "Snowman!", and having as a result listened to this song more times than Long.MAX_VALUE, I'd say this is a good continuation of the version which appears in the movie.

For me it actually noodled around for a while, played a haunting MINOR KEY version of the entire chorus, then went back to noodling.

After reading some of the comments here I can hear vestiges of the song a bit more now. The rhythms are pretty similar to the rhythms in the song. And it does run through the chorus changes (starting in m11) in the relative minor key which is interesting.

I would like to see this done with just a melody line, or a melody and a bass line, or a melody and chord changes - as doing it based on a full piano score is probably quite a bit trickier for a neural network to handle intelligibly.

It's a bit tough because MIDI-style playback engines all sound pretty rubbish, but it definitely sounds like something that could appear in an extended jazz version.

Well, the rhythm can kinda be traced back, but the training piece likely had the accompaniment and melody parts smushed together, so it is quite difficult.

The one I just generated seems to be randomly banging out minor chords.

Are you folks sure anything is being generated on the fly here? I get a pre-generated version that is exactly the same every time I load the page.

I'd like to see some sort of explanation of how the neural network knew how to produce unique chord progressions pleasing to human ears, even adapting the lead to fit these new chords, for a song that contains none of those chord progressions. What is the other unrelated music it used as learning material for this?

I'd ask OP to fill us in, since it's his site; but it seems he perhaps prefers not to waste his time translating back and forth to English, considering his historical lack of comments on HN. Oh well.

Aha, figured it out. When you switch away from the tab, Chrome reduces the play speed, which is why it sounded different and like slowly banged out minor chords.

I found the "Extrapolated art" link at the bottom of that page more impressive: http://extrapolated-art.com/

Starry Night, the very last image, is particularly good. Would love to have that on display and see if people notice anything strange.

That page is showcasing examples of Content-Aware Fill in Photoshop. If you have a copy, then it's very easy to try yourself. Edit -> Fill, then select "Content-Aware".


Wow, following this link[1] on their website, some of those examples are pretty impressive stuff.

[1] : http://extrapolated-art.com/

The Resynthesizer plugin for GIMP [1] by Dr. Paul Harrison lets you do that and more (e.g. removing an object from a scene). Also look at Mathematica's digital inpainting function [2], which achieves the same thing.

[1]: http://www.logarithmic.net/pfh/resynthesizer

[2]: http://blog.wolfram.com/2014/12/01/extending-van-goghs-starr...

If you just want to "extend" it, there are easier ways. http://labs.echonest.com/Uploader/index.html?trid=TRDSMSO14C... I know "easier" wasn't the point of this project, I just think the Infinite Jukebox is cool!

I've been listening to Infinite Jukebox's rendition of "Frontier Psychiatrist" for the last 10 minutes. Pretty awesome.

I've got something similar on my to-do list: feed the raw wave data from some pop songs into an RNN and then see what the resulting music sounds like.

Am I a fool for trying? It would be a lot of data compared to a midi track.

I'd suggest feeding the FFT of the signal into the RNN, then translating back to normal sound afterwards. Raw sound has an awful lot of just bouncing up and down... an RNN isn't going to be any better than you at figuring out what's going on there, whereas the FFT view of a song is much more meaningful... I can glance at one and at least have an idea of what the song is doing.

I'm still not all that optimistic, frankly, but it'll be better than raw data. Some further experimentation with different encoding might be necessary.

That said, I'd bet $10 that just putting an FFT representation through wouldn't necessarily produce a new pop song, but it would sound uniquely spooky. If nothing else you might have something you can hook up to some speakers next Halloween and make some kids cry.
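A rough numpy sketch of that kind of framing (magnitude spectra only, so phase is thrown away - frame and hop sizes are just illustrative choices): the RNN would then see one spectral frame per timestep instead of thousands of raw samples.

```python
import numpy as np

def stft_frames(signal, frame_len=512, hop=256):
    """Slice a 1-D signal into overlapping windowed frames and take
    the magnitude spectrum of each."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-redundant half of the spectrum
    return np.abs(np.fft.rfft(frames, axis=1))

# e.g. one second of a 440 Hz tone at 16 kHz
t = np.arange(16000) / 16000.0
spec = stft_frames(np.sin(2 * np.pi * 440 * t))
# spec has shape (n_frames, frame_len // 2 + 1), with a peak
# near bin 440 / (16000 / 512) ~= 14 in every frame
```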

This is pretty hard - we use raw data for speech [1, talked about in a comment above], but it still needs some work to do really good synthesis. FFT is not really the way to go either - then you need to deal with the problems of complex-valued data, which is very, very unpleasant. Most people use FFT -> IDCT (cepstrum) or a filtered version (mel-frequency cepstral coefficients, MFCC). This can work, but it requires a lot of domain knowledge.

One thing we tried in early testing, but did not pursue further, was vector quantized X (where X is MFCCs, LPC, LSF, FFT, cepstrum). Basically you use K-means to find clusters for some large K, then simply assign every real value (or real-valued vector) to the closest cluster. The cluster mapping becomes a codebook, and your problem goes from input vectors like [0.2, 0.7, 0.111, ...] to [0, 1, 0, ...], where the length of the vector of 0s and 1s is the number of clusters K.

This is a much easier learning problem, and closely corresponds to most "bag-of-words" or word-level models. The quantization is lossy but for large enough K I do not think it would be noticeable. After all, we listen to discrete audio every day, all the time in wav format :)

To synthesize, you can either map codebook points back to the corresponding cluster center, or as most people do, map it to the cluster center with some small variance so you have a little bit of interesting variation.
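A minimal numpy sketch of the quantization step (plain Lloyd's K-means; all names here are made up for illustration, and a real setup would use far larger K and real feature frames):

```python
import numpy as np

def kmeans_codebook(frames, K, n_iter=20, seed=0):
    """Fit K cluster centers to feature frames with Lloyd's
    algorithm -- the centers become the codebook."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), K, replace=False)]
    for _ in range(n_iter):
        # assign each frame to its nearest center
        d = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned frames
        for k in range(K):
            if (labels == k).any():
                centers[k] = frames[labels == k].mean(axis=0)
    return centers, labels

def quantize(frames, centers):
    """Map real-valued frames to one-hot codebook indices."""
    d = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    onehot = np.zeros((len(frames), len(centers)))
    onehot[np.arange(len(frames)), idx] = 1.0
    return onehot

# toy data: two well-separated blobs of 4-D "feature" vectors
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 0.1, (50, 4)),
                       rng.normal(5, 0.1, (50, 4))])
centers, _ = kmeans_codebook(data, K=2)
codes = quantize(data, centers)  # each row is one-hot over the codebook
```

The RNN then predicts these one-hot codes like a word-level language model, and synthesis maps each code back to its cluster center.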

[1] http://arxiv.org/abs/1506.02216

Thank you for expanding on my uninformed, off-the-cuff comment like that.

I was going to try the same thing but ran into trouble installing the RNN software dependencies.

Please try it and let us know how it turns out, good or bad.

This sounds pretty decent actually. I remember Stephen Wolfram's "A new kind of music" project which tried to generate music using cellular automata, with limited success.

This sounds much better in comparison!

A few weeks ago I used the same RNN that was used to produce kernel source code to produce MIDI files. It wasn't really a success, but it did generate mostly-valid MIDI files and independently discovered avant-garde prog jazz.

Try https://soundcloud.com/david-given/unsigned-composition-1 or https://soundcloud.com/david-given/procedural-jazz-sting.

This reminds me that what makes the human brain unique compared to other systems is the capability of jumping out of the logic system. When composing a piece of music, the best a current AI system can do is to compose by continually adding the most logical (or intuitive, depending on how you look at it) notes. A human composer, on the other hand, can from time to time jump out of that logic and inspect the notes from completely different perspectives (other than whether the note sequence is logical), such as the emotion it conveys, whether it is innovative, what style it feels like, etc.

These things you mention are conditional variables - from a modeling perspective it would be perfectly plausible to build: think something like text-to-speech, but instead of the sentence you input some characters representing the dynamics/feel of the desired song. You also have the issue of RNNs being deterministic - even sampling from the output softmax probably doesn't give enough room for interesting variations if you run the network multiple times from the same seed. Also, choosing argmax(prob(y | x)) at each step does not guarantee that the generated path is maximally likely - for that you need a beam search or something like it. But I don't think any of that is the key problem here.
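On the determinism point, the usual cheap trick is to sample from the softmax with a temperature instead of taking the argmax at every step. A numpy sketch (this gives variation run-to-run; it doesn't address the argmax-vs-beam-search issue):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Draw one symbol from an RNN's output distribution.
    temperature < 1 sharpens it (closer to argmax, less variation);
    temperature > 1 flattens it (more surprises)."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# with a low temperature, the most likely symbol dominates
logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(0)
samples = [sample_with_temperature(logits, 0.5, rng) for _ in range(1000)]
```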

In this case, an RNN meant for 1-of-K output is not well suited to outputting chords. It worked fine for single notes however! Check out this link from Joao Felipe Santos https://soundcloud.com/seaandsailor/sets/char-rnn-composes-i... . These are pretty cool - and he generated the titles to boot.

For chords you really can't model all possible combinations (2^88 for MIDI) naively, and you also can't really model notes independently - chords are highly structured and follow specific rules! Even bounding it to chords of only 3 or 4 notes still makes for a pretty large output space, which means more training data is needed, it is harder to optimize, etc.

You should be much better off with some kind of conditional/factorized model strapped to the output of an RNN - this is the idea for RNN-RBM, RNN-NADE, LSTM-DBN, etc. You could also just try to model the audio representation directly using LSTM-GMM or VRNN, but this is pretty hard and an active area of research.


I've been itching to mess around with dynamically-generated music lately. Can anyone recommend any (js/webaudio) libraries or resources to check out?

It's not web-based or js, but give Sonic Pi (http://sonic-pi.net/) a go. Really simple and easy to get started.

Thanks! Not what I was thinking of but it looks really interesting.

Yeah, to be honest it's a nice try but it didn't work out.

Maybe all it needs is more training time and a lower temperature (and maybe some more Disney songs, or related ones).

If you add K:Ab above the first line, you'll hear it in its actual key. It sounds unmistakably like "Let it Go."

This should be the top comment - changes everything.

Makes me wonder if the key change was a mistake, or a deliberate attempt to make it sound less like the original?

Does this do any better than a Markov chain?

For some reason it's not working well for me in Safari. I tried it in Chrome and it sounded pretty good.

This kind of sounds like Castlevania (i.e. an improvement) with the general MIDI playback.

I quite liked the score sheet and playback plugin.
