This is really cool stuff - the network structure reminds me a lot of Graves' MDRNN [1] and Grid LSTM [2], as well as some work I helped with (ReNet [3]).

I wonder if the structure over frequency/time is too "regular" - in general for sound, the frequency correlation and the time correlation are on wildly different scales.

Also, if you are looking to go further, you might reconsider adding NADE or RBM [4] on top, or latent variables in the hiddens [5][6] to add more stochasticity.

There was some alternate work by Kratarth Goel extending RNN-RBM to LSTM and DBN; it might give you some ideas to look at [7]. I know when we messed with bidirectional LSTM + DBN for MIDI generation it led to this kind of "jumbled/dissonant" sound you seem to be having - don't know what to make of it here. You might consider bi-directionality over the notes, though it makes the generation way more annoying.
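To make the "more annoying" part concrete, here is a toy sketch of why bi-directionality over the notes complicates generation. It is not taken from any of the linked papers or from your code: the scalar tanh RNN, the weights `w_in` and `w_rec`, and the function names are all made up for illustration. The idea is just that once a high-to-low pass is added, changing the top note changes the features at every note below it, so you can no longer sample notes one at a time from low to high.

```python
import math

def rnn_pass(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar tanh RNN; returns the hidden state at every position.
    The weights are arbitrary illustrative values, not trained."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_over_notes(notes):
    """Pair a low-to-high pass with a high-to-low pass for each note."""
    forward = rnn_pass(notes)
    backward = rnn_pass(notes[::-1])[::-1]  # run reversed, then re-align
    return list(zip(forward, backward))

# One timestep of a toy piano roll: 1.0 = note on, 0.0 = note off.
chord = [1.0, 0.0, 0.0, 1.0, 0.0]
changed = chord[:-1] + [1.0]  # flip only the highest note

# Forward-only features: notes below the flipped one are untouched.
fwd_before, fwd_after = rnn_pass(chord), rnn_pass(changed)
forward_affected = [i for i in range(len(chord)) if fwd_before[i] != fwd_after[i]]

# Bidirectional features: see which note positions change.
bi_before = bidirectional_over_notes(chord)
bi_after = bidirectional_over_notes(changed)
bi_affected = [i for i in range(len(chord)) if bi_before[i] != bi_after[i]]

print("forward-only affected:", forward_affected)
print("bidirectional affected:", bi_affected)
```

With the forward pass alone only the flipped note's own position changes, while with the backward pass included every position does, which is why sampling usually has to fall back on something iterative (e.g. resampling notes repeatedly) rather than a single low-to-high sweep.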

Awesome work! I will definitely be sharing around and checking out your code.

[1] http://arxiv.org/pdf/0705.2011.pdf

[2] http://arxiv.org/abs/1507.01526

[3] http://arxiv.org/abs/1505.00393

[4] http://www-etud.iro.umontreal.ca/~boulanni/ICASSP2013.pdf

[5] http://arxiv.org/abs/1411.7610

[6] http://arxiv.org/abs/1506.02216

[7] http://arxiv.org/pdf/1412.6093.pdf