

Extending “Let It Go” with recurrent neural network - elnn
http://elnn.snucse.org/sandbox/music-rnn/#hn

======
kastnerkyle
If you are into this kind of stuff (time series density/probability modeling),
there are several places to go beyond this.

The simplest example is the RNN-RBM (restricted boltzmann machine) from the
Theano deep learning tutorials [1]. There are follow-up papers by Kratarth
Goel extending this to LSTM-DBN, and RNN-NADE (neural autoregressive density
estimator) has also been explored in this space. The results for MIDI are
pretty good.

Alex Graves [2] has done a huge amount of work in sequence modeling for
classification and generation. His work with LSTM-GMM for online handwriting
modeling can also be applied to speech features (vocoded speech), and seems to
work well. There is no official publication to date on the speech experiments
themselves, but the procedure is well documented in his paper [3].

The most recent thing is our paper "A Recurrent Latent Variable Model for
Sequential Data" [4]. It operates on raw timeseries (!) even though all
intuition would say that an RNN shouldn't handle this well. Adding a recurrent
prior seems to preserve style in the handwriting generations at least - I may
try it on this "Let It Go!" task and see what happens! We are also exploring
conditional controls and higher-level speech features, which should result in
samples as good or better than LSTM-GMM.

[1]
[http://deeplearning.net/tutorial/rnnrbm.html#rnnrbm](http://deeplearning.net/tutorial/rnnrbm.html#rnnrbm)

[2]
[https://www.youtube.com/watch?v=-yX1SYeDHbg#t=36m50s](https://www.youtube.com/watch?v=-yX1SYeDHbg#t=36m50s)

[3] [http://arxiv.org/abs/1308.0850](http://arxiv.org/abs/1308.0850)

[4] [http://arxiv.org/abs/1506.02216](http://arxiv.org/abs/1506.02216)

~~~
LukeB_UK
The Infinite Jukebox [0] is another interesting one to look at.

[0]
[http://labs.echonest.com/Uploader/index.html](http://labs.echonest.com/Uploader/index.html)

~~~
tgb
Wow, that is quite good at what it does.

------
erikb
It's really hard to connect "Let it Go" with the results that the program
generates. Maybe if the song started with a recognisable line of notes and the
algorithm put in some recognisable parts here and there, the connection would
become clearer? Not sure. But to my untrained ears it was not possible to
recognise the connection to "Let it Go".

~~~
colomon
My six-year-old identified it as reminding him of "Let It Go" with no
prompting from me other than "What does this make you think of?" (He also said
"It's great!" about four times before that.)

~~~
ljk
Not sure how many songs kids these days are exposed to, but maybe it's easier
for young children to be reminded of "Let It Go" since it's very popular and
they haven't heard a lot of other songs?

------
tagawa
I found the "Extrapolated art" link at the bottom of that page more
impressive: [http://extrapolated-art.com/](http://extrapolated-art.com/)

Starry Night, the very last image, is particularly good. Would love to have
that on display and see if people notice anything strange.

~~~
stevenh
That page is showcasing examples of Content-Aware Fill in Photoshop. If you
have a copy, then it's very easy to try yourself. Edit -> Fill, then select
"Content-Aware".

[http://i.lightimg.com/d2112.jpg](http://i.lightimg.com/d2112.jpg)

------
rjaco31
Wow, following this link [1] on their website, some of those examples are
pretty impressive.

[1] : [http://extrapolated-art.com/](http://extrapolated-art.com/)

~~~
oxplot
The Resynthesizer plugin for GIMP [1] by Dr. Paul Harrison lets you do this
sort of thing and more (e.g. removing an object from a scene). Also look at
Mathematica's digital inpainting function [2], which achieves the same thing.

[1]:
[http://www.logarithmic.net/pfh/resynthesizer](http://www.logarithmic.net/pfh/resynthesizer)

[2]:
[http://blog.wolfram.com/2014/12/01/extending-van-goghs-starry-night-with-inpainting/](http://blog.wolfram.com/2014/12/01/extending-van-goghs-starry-night-with-inpainting/)

------
sp332
If you just want to "extend" it, there are easier ways.
[http://labs.echonest.com/Uploader/index.html?trid=TRDSMSO14C...](http://labs.echonest.com/Uploader/index.html?trid=TRDSMSO14CC9DD2123)
I know "easier" wasn't the point of this project, I just think the Infinite
Jukebox is cool!

~~~
yellowapple
I've been listening to Infinite Jukebox's rendition of "Frontier Psychiatrist"
for the last 10 minutes. Pretty awesome.

------
jewel
I've got something similar on my to-do list: feed the raw wave data from some
pop songs into an RNN and then see what the resulting music sounds like.

Am I a fool for trying? It would be a lot of data compared to a MIDI track.

~~~
jerf
I'd suggest feeding the raw FFT transform into the RNN, then translating back
to normal sound afterwards. Raw sound has an awful lot of just bouncing up and
down... an RNN isn't going to be any better than you at figuring out what's
going on from there, whereas the FFT view of a song is much more meaningful...
I can glance at one and at least have an _idea_ of what the song is doing.
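
Roughly the round trip I have in mind, as a very hand-wavy sketch (assuming
SciPy; the nperseg choice is arbitrary, and reusing the original phase is just
a placeholder for real phase reconstruction):

    # Sketch of the FFT-in, FFT-out idea: model the magnitude frames with an
    # RNN, then invert back to audio. Phase handling is the part glossed over.
    import numpy as np
    from scipy.signal import stft, istft

    fs = 22050
    audio = np.random.randn(fs * 2)       # stand-in for real song samples

    _, _, Z = stft(audio, fs=fs, nperseg=1024)
    mag, phase = np.abs(Z), np.angle(Z)   # the RNN would train on mag frames

    predicted_mag = mag                   # identity placeholder for RNN output

    _, reconstructed = istft(predicted_mag * np.exp(1j * phase),
                             fs=fs, nperseg=1024)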

I'm still not all that optimistic, frankly, but it'll be better than raw data.
Some further experimentation with different encodings might be necessary.

That said, I'd bet $10 that just putting an FFT representation through
wouldn't necessarily produce a new pop song, but it would sound uniquely
_spooky_. If nothing else you might have something you can hook up to some
speakers next Halloween and make some kids cry.

~~~
kastnerkyle
This is pretty hard - we use raw data for speech [1, talked about in a comment
above] but it still needs some work to do really good synthesis. The FFT is
not really the way to go either - then you still have to deal with the
problems of complex-valued data, which are very, very unpleasant. Most people
use FFT -> IDCT (cepstrum) or a filtered version (mel-frequency cepstral
coefficients, MFCC). This can work, but it requires a lot of domain knowledge.

One thing we tried in early testing, but did not pursue further, was vector
quantized X (where X is MFCCs, LPC, LSF, FFT, or cepstrum). Basically you use
K-means to find clusters for some large number of clusters K, then simply
assign every real value (or real-valued vector) to the closest cluster. The
cluster mapping becomes a codebook, and your problem goes from input vectors
like [0.2, 0.7, 0.111, ...] to [0, 1, 0, ...], where the length of the vector
of 0s and 1s is the number of clusters K.

This is a much easier learning problem, and closely corresponds to most "bag-
of-words" or word-level models. The quantization is lossy, but for large
enough K I do not think it would be noticeable. After all, we listen to
discrete audio every day, all the time, in WAV format :)

To synthesize, you can either map codebook points back to the corresponding
cluster center, or, as most people do, map them to the cluster center plus
some small variance so you get a little bit of interesting variation.
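
A minimal sketch of the whole loop, assuming scikit-learn's KMeans and using
random data as a stand-in for real MFCC frames:

    # Vector quantization as described above: K-means learns a codebook,
    # frames become 1-of-K targets, and synthesis maps codes back to centers.
    import numpy as np
    from sklearn.cluster import KMeans

    K = 512                                # codebook size
    frames = np.random.randn(10000, 13)    # placeholder for MFCC frames

    km = KMeans(n_clusters=K, n_init=4).fit(frames)

    codes = km.predict(frames)             # index of the nearest center
    one_hot = np.eye(K)[codes]             # 1-of-K targets for the RNN

    # Synthesis: cluster center plus a little variance for variation.
    decoded = km.cluster_centers_[codes] + 0.01 * np.random.randn(len(codes), 13)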

[1] [http://arxiv.org/abs/1506.02216](http://arxiv.org/abs/1506.02216)

~~~
jerf
Thank you for expanding on my uninformed, off-the-cuff comment like that.

------
ThePhysicist
This sounds pretty decent, actually. I remember Stephen Wolfram's "A new kind
of music" project, which tried to generate music using cellular automata, with
limited success.

This sounds much better in comparison!

------
david-given
A few weeks ago I used the same RNN that was used to produce kernel source
code to produce MIDI files. It wasn't really a success, but it did generate
mostly-valid MIDI files and independently discovered avant-garde prog jazz.

Try
[https://soundcloud.com/david-given/unsigned-composition-1](https://soundcloud.com/david-given/unsigned-composition-1)
or
[https://soundcloud.com/david-given/procedural-jazz-sting](https://soundcloud.com/david-given/procedural-jazz-sting).

------
kailuowang
This reminds me that what makes the human brain unique compared to other
systems is the capability of jumping out of the logic system. When composing a
piece of music, the best a current AI system can do is compose by repeatedly
adding the most logical (or intuitive, depending on how you look at it) notes.
A human composer, on the other hand, can from time to time jump out of that
logic and inspect the notes from completely different perspectives (different
from whether the note sequence is logical), such as the emotion they convey,
whether they are innovative, what style they feel like, etc.

~~~
kastnerkyle
The things you mention are conditional variables - from a modeling
perspective it would be perfectly plausible to build - think something like
text-to-speech, but instead of the sentence you input some characters
representing the dynamics/feel of the desired song. You also have the issue of
RNNs being deterministic - even sampling from the output softmax probably
doesn't give enough variety for interesting variations if you run the network
multiple times from the same seed. Also, choosing argmax(prob(y | x)) at each
step does not guarantee that the _path_ generated is maximally likely. For
that you need a beam search or something like it. But I don't think any of
that is the key problem here.
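
To make the argmax-vs-beam-search point concrete, a toy sketch (the step
function here is a made-up stand-in for an RNN's softmax output):

    # Greedy argmax keeps only one prefix per step; beam search keeps the
    # beam_width most likely prefixes, so it can recover a higher-probability
    # *path* that greedy decoding would have discarded early.
    import numpy as np

    def beam_search(step, vocab_size, length, beam_width=3):
        beams = [(0.0, [])]                   # (log prob, token sequence)
        for _ in range(length):
            candidates = []
            for logp, seq in beams:
                probs = step(seq)             # next-token distribution
                for tok in range(vocab_size):
                    candidates.append((logp + np.log(probs[tok] + 1e-12),
                                       seq + [tok]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = candidates[:beam_width]
        return beams[0]

    def step(seq):                            # toy stand-in for an RNN softmax
        return np.array([0.6, 0.4]) if len(seq) % 2 == 0 else np.array([0.1, 0.9])

    print(beam_search(step, vocab_size=2, length=4))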

In this case, an RNN meant for 1-of-K output is not well suited to outputting
chords. It worked fine for single notes, however! Check out this link from
Joao Felipe Santos:
[https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music](https://soundcloud.com/seaandsailor/sets/char-rnn-composes-irish-folk-music).
These are pretty cool - and he generated the titles to boot.

For chords you really can't model all possible combinations (2^88 for MIDI)
naively, and you also can't really model notes independently - chords are
highly structured and follow specific rules! Even bounding to only chords of
up to 3 or 4 notes still leaves a pretty large output space, which means more
training data is needed, it is harder to optimize, etc.
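
For a rough sense of scale:

    # Naive chord space vs. chords bounded to at most 4 simultaneous notes.
    from math import comb

    print(2 ** 88)                                # every on/off note subset
    print(sum(comb(88, k) for k in range(1, 5)))  # 1-4 note chords: ~2.4 million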

You should be much better off with some kind of conditional/factorized model
strapped to the output of an RNN - this is the idea for RNN-RBM, RNN-NADE,
LSTM-DBN, etc. You could also just try to model the audio representation
directly using LSTM-GMM or VRNN, but this is pretty hard and an active area of
research.

------
fenomas
Snazzy.

I've been itching to mess around with dynamically-generated music lately. Can
anyone recommend any (js/webaudio) libraries or resources to check out?

~~~
polite_wine
It's not web-based or JS, but give Sonic Pi
[http://sonic-pi.net/](http://sonic-pi.net/) a go. Really simple and easy to
get started with.

~~~
fenomas
Thanks! Not what I was thinking of but it looks really interesting.

------
raverbashing
Yeah, to be honest it's a nice try, but it didn't work out.

Maybe all it needs is more training time, a lower temperature (and maybe some
more Disney songs or related ones).

------
ericye16
If you add K:Ab above the first line, you'll hear it in its actual key. It
sounds unmistakably like "Let it Go."

~~~
klipt
This should be the top comment - changes everything.

Makes me wonder if the key change was a mistake, or a deliberate attempt to
make it sound less like the original?

------
jimmaswell
Does this do any better than a Markov chain?

------
jldupinet
For some reason it's not working well for me in Safari. I tried it in Chrome
and it sounded pretty good.

------
mpdehaan2
This kind of sounds like Castlevania (i.e. an improvement) with the general
MIDI playback.

------
ff7c11
I quite liked the score sheet and playback plugin.

