My take (as someone working in the area) on how/why we are encouraged by this progress:

The model is unconditional, i.e. any concept of form at all was learned directly from data, with no real structural hints about what it should learn.

Generally you get much stronger "global structure" the more prior information you provide, either in the model itself or in the input/targets (such as chord constraints). This is one reason that harmonization tasks often sound notably better than pure generation: the "backbone" in harmonization was human provided, whereas in pure generation the model needs to be self-consistent.

The analogous task in language would be character-level language modeling, whereas something like harmonization or conditional generation from chords would be more akin to machine translation. Character-level language models may look a bit wonky when generating, but any structure at all was discovered purely by the model from a simple rule (generally, maximum likelihood), whereas in conditional models a lot of extra help is given by the conditioning variable.
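
To make the unconditional vs. conditional distinction concrete, here is a toy sketch. It is not Performance RNN or any Magenta code: it swaps the RNN for a count-based bigram model, which is the simplest thing that can be fit by maximum likelihood. The function names, the "^" start symbol, and the tiny melody/chord data are all made up for illustration.

```python
from collections import Counter, defaultdict

START = "^"  # hypothetical start-of-sequence symbol


def _normalize(counts):
    # The maximum likelihood estimate for a categorical distribution
    # is just the normalized count table.
    return {
        ctx: {sym: n / sum(c.values()) for sym, n in c.items()}
        for ctx, c in counts.items()
    }


def fit_unconditional(sequences):
    """Estimate p(x_t | x_{t-1}) from the sequences alone."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = START
        for sym in seq:
            counts[prev][sym] += 1
            prev = sym
    return _normalize(counts)


def fit_conditional(pairs):
    """Estimate p(x_t | x_{t-1}, c_t), where c_t is a per-step hint (e.g. a chord)."""
    counts = defaultdict(Counter)
    for cond, seq in pairs:
        prev = START
        for c, sym in zip(cond, seq):
            counts[(prev, c)][sym] += 1
            prev = sym
    return _normalize(counts)


# Toy data: one chord symbol per melody note (invented for this example).
pairs = [("CCFF", "cefa"), ("CCGG", "cegb")]

uncond = fit_unconditional(seq for _, seq in pairs)
cond = fit_conditional(pairs)

print(uncond["e"])       # after "e" the model must guess between "f" and "g"
print(cond[("e", "F")])  # with the chord given, the choice is already made
```

On this data the unconditional model splits its probability after "e" evenly between the two continuations, while the conditional model is certain once it sees the chord. That is the "extra help" from the conditioning variable in miniature: the human-provided backbone resolves ambiguity that an unconditional model has to sort out by itself.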

This model is a pretty big jump in quality from previous efforts in the same vein of unconditional generation, and should combine well with techniques for conditioning in sequence models from all over the research community (translation, speech recognition, question answering, summarization, speech synthesis). On top of that, the model is IMO quite elegant, and much simpler to understand than many other attempts at polyphonic generation including my own.

It is worth comparing this output to some previous efforts (from Magenta and others), some with more complicated internal structure such as [0][1][2][3][4][5][6], and also some models with stronger conditioning or different input representations such as [7][8][9], the extreme case being [10] where a human (Benoît Carré) used a modeling/ML toolkit along with standard arrangement and production approaches to produce a wacky, awesome pop song.

Magenta is putting out lots of neat stuff around interactive usage of these models, with web demos and plugins to standard music tools [11]. Indeed, Doug's quotes in this article [12] sum it up for me, excerpt: “I don’t think that machines themselves just making art for art’s sake is as interesting as you might think,” he explained. “The question to ask is can machines help us make a new kind of art?”

Link to the model in question: Performance RNN, Magenta https://magenta.tensorflow.org/performance-rnn

[0] Work from Doug in his postdoc doing LSTM Blues, http://people.idsia.ch/~juergen/blues/

[1] RNN-RBM from Boulanger-Lewandowski et al., see TF impl http://danshiebler.com/2016-08-17-musical-tensorflow-part-tw...

[2] Early Magenta models, https://magenta.tensorflow.org/2016/07/15/lookback-rnn-atten... , ex https://www.youtube.com/watch?v=nHxr9u9_4_s

[3] Daniel Johnson's Biaxial RNN http://www.hexahedria.com/2015/08/03/composing-music-with-re...

[4] Some experiments only published in my thesis to date, https://www.youtube.com/watch?v=tavHJoum--g&list=PLRMa_gJ8vx... , thesis in question https://github.com/kastnerkyle/udem_masters_thesis

[5] Polyphonic Magenta model https://www.youtube.com/watch?v=s0BVFVqEY4A

[6] RL tuning to improve generation https://magenta.tensorflow.org/2016/11/09/tuning-recurrent-n...

[7] Folk-RNN is similar but uses ABC representation (https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recu...), https://soundcloud.com/seaandsailor/sets/char-rnn-composes-i..., also see related post https://maraoz.com/2016/02/02/abc-rnn/

[8] DeepBach, http://www.flow-machines.com/deepbach-polyphonic-music-gener...

[9] Lawson He's LSTM approach, http://www.lawsonhe.com/music.html

[10] Daddy's Car https://www.youtube.com/watch?v=LSHZ_b05W7o

[11] Interactive Magenta Jam https://magenta.tensorflow.org/2016/12/16/nips-demo

[12] https://www.technologyreview.com/s/604010/google-brain-wants...