The simplest example is the RNN-RBM (Restricted Boltzmann Machine) from the Theano deep learning tutorials. There are follow-up papers by Kratarth Goel extending this to LSTM-DBN, and RNN-NADE (neural autoregressive density estimator) has also been explored in this space. The results for MIDI are pretty good.
Alex Graves has done a huge amount of work in sequence modeling for classification and generation. His work with LSTM-GMM for online handwriting modeling can also be applied to speech features (vocoded speech), and seems to work well. There is no official publication to date on the speech experiments themselves, but the procedure is well documented in his handwriting paper.
The most recent thing is our paper "A Recurrent Latent Variable Model for Sequential Data". It operates on raw time series (!) even though all intuition would say that an RNN shouldn't handle this well. Adding a recurrent prior seems to preserve style in the handwriting generations, at least - I may try this on the "Let It Go!" task and see what happens! We are also exploring conditional controls and higher-level speech features, which should result in samples as good as or better than those from LSTM-GMM.
Andrew Ng's Coursera course does a super basic NN in MATLAB, which might be good for shaking the rust off. Hugo Larochelle's YouTube course, Geoff Hinton's Coursera course, and Nando de Freitas' YouTube deep learning course are all very good. There are also some great lectures from Aaron Courville from the Representation Learning class that just finished here at UdeM.
If you want to do this for real (not self-teaching RNN backprop on toy examples), tool-wise you generally go with either Theano (my preference) or Torch - implementing the backward pass for these by hand (and verifying the gradients are correct) is not pretty.
Theano does it almost for free, and Torch can do it at the module level. Either way, you definitely want one of the higher-order tools!
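To make "almost for free" concrete, here is a minimal Theano sketch - a toy logistic regression rather than an RNN, with made-up shapes and learning rate: you only write the forward computation, and T.grad derives the backward pass symbolically.

    import numpy as np
    import theano
    import theano.tensor as T

    # Toy logistic regression - the point is that only the forward
    # computation is written by hand.
    X = T.matrix("X")
    y = T.vector("y")
    W = theano.shared(np.zeros(5, dtype=theano.config.floatX), name="W")

    p = T.nnet.sigmoid(T.dot(X, W))
    cost = T.nnet.binary_crossentropy(p, y).mean()

    # Theano derives the backward pass symbolically - no hand-written
    # gradients, and no tedious finite-difference gradient checking.
    grad_W = T.grad(cost, W)
    train = theano.function([X, y], cost,
                            updates=[(W, W - 0.1 * grad_W)])

The same idea scales to RNNs: swap the forward computation for a scan over timesteps, and T.grad still handles backpropagation through time for you.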
I would like to see this done with just a melody line, or a melody and a bass line, or melody and chord changes - as doing it based on a full piano score is probably quite a bit trickier for a neural network to handle intelligibly.
I'd like to see some explanation of how the neural network knew how to produce unique chord progressions that are pleasing to human ears - even adapting the lead to fit these new chords - for a song that contains none of those progressions. What other, unrelated music did it use as learning material for this?
I'd ask OP to fill us in, since it's his site; but it seems he perhaps prefers not to waste his time translating back and forth to English, considering his historical lack of comments on HN. Oh well.
Starry Night, the very last image at http://extrapolated-art.com/, is particularly good. Would love to have that on display and see if people notice anything strange.
Am I a fool for trying? It would be a lot of data compared to a MIDI track.
I'm still not all that optimistic, frankly, but it'll be better than raw data. Some further experimentation with different encodings might be necessary.
That said, I'd bet $10 that just putting an FFT representation through wouldn't necessarily produce a new pop song, but it would sound uniquely spooky. If nothing else you might have something you can hook up to some speakers next Halloween and make some kids cry.
One thing we tried in early testing, but did not pursue further, was vector quantized X (where X is MFCCs, LPC, LSF, FFT, cepstrum). Basically you use k-means to find clusters for some large number K, then simply assign every real value (or real-valued vector) to the closest cluster. The cluster mapping becomes a codebook, and your problem goes from input vectors like [0.2, 0.7, 0.111, ...] to 1-of-K vectors like [0, 1, 0, ...], where the length of the vector of 0s and 1s is the number of clusters K.
This is a much easier learning problem, and closely corresponds to most "bag-of-words" or word-level models. The quantization is lossy, but for large enough K I do not think it would be noticeable. After all, we listen to discrete audio every day, all the time, in WAV format :)
To synthesize, you can either map codebook indices back to the corresponding cluster centers, or, as most people do, map to the cluster center with some small added noise so you get a little bit of interesting variation.
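Roughly, the whole pipeline looks like the sketch below, using scikit-learn's KMeans - the frames here are random placeholders for real feature vectors, and K and the noise scale are arbitrary choices:

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder feature frames - imagine (n_frames, n_coeffs) MFCCs.
    frames = np.random.randn(1000, 13)

    K = 512  # codebook size - larger K means less lossy quantization
    km = KMeans(n_clusters=K).fit(frames)

    # Encode: each real-valued frame becomes a single codebook index,
    # i.e. a 1-of-K vector - a much easier target for an RNN.
    codes = km.predict(frames)
    one_hot = np.eye(K)[codes]

    # Synthesize: map indices back to cluster centers, plus a little
    # noise for some interesting variation.
    recon = km.cluster_centers_[codes]
    recon += 0.01 * np.random.randn(*recon.shape)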
Please try it and let us know how it turns out, good or bad.
This sounds much better in comparison!
Try https://soundcloud.com/david-given/unsigned-composition-1 or https://soundcloud.com/david-given/procedural-jazz-sting.
In this case, an RNN meant for 1-of-K output is not well suited to outputting chords. It worked fine for single notes, however! Check out this link from Joao Felipe Santos https://soundcloud.com/seaandsailor/sets/char-rnn-composes-i... . These are pretty cool - and he generated the titles to boot.
For chords you really can't model all possible combinations (2^88 for MIDI) naively, and you also can't really model the notes independently - chords are highly structured and follow specific rules! Even bounding to chords of at most 3 or 4 notes still makes a pretty large output space, which means more training data is needed, it is harder to optimize, etc. etc.
You should be much better off with some kind of conditional/factorized model strapped to the output of an RNN - this is the idea behind RNN-RBM, RNN-NADE, LSTM-DBN, etc. You could also just try to model the audio representation directly using LSTM-GMM or VRNN, but this is pretty hard and an active area of research.
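To give a feel for the factorization - this is not RNN-RBM itself, just the simplest conditionally independent version, with made-up sizes and a random stand-in for the RNN state:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_hidden, n_notes = 100, 88
    rng = np.random.RandomState(0)
    W_out = 0.01 * rng.randn(n_hidden, n_notes)
    b_out = np.zeros(n_notes)

    h_t = rng.randn(n_hidden)  # stand-in for the RNN state at step t

    # Simplest conditional factorization: each of the 88 notes is an
    # independent Bernoulli given h_t - 88 output units instead of a
    # softmax over 2^88 chord classes.
    p_notes = sigmoid(h_t.dot(W_out) + b_out)
    chord = (rng.uniform(size=n_notes) < p_notes).astype(int)

    # RNN-RBM / RNN-NADE go one step further and replace this
    # conditionally independent output with an RBM / NADE over the
    # notes, so within-chord structure (which notes co-occur) is
    # modeled as well.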
I've been itching to mess around with dynamically-generated music lately. Can anyone recommend any (js/webaudio) libraries or resources to check out?
Maybe all it needs is more training time, a lower sampling temperature (and maybe some more Disney songs or related ones).
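For reference, "temperature" here just rescales the output distribution before sampling - a minimal standalone sketch with made-up logits:

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0):
        # Lower temperature sharpens the distribution toward the most
        # likely next symbol; higher temperature gives wilder samples.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return np.random.choice(len(probs), p=probs)

    logits = np.array([2.0, 1.0, 0.5])  # made-up next-note scores
    note = sample_with_temperature(logits, temperature=0.5)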
Makes me wonder if the key change was a mistake, or a deliberate attempt to make it sound less like the original.