As I understand it, the author has trained an LSTM on a single MIDI file -- "And Then I Knew" by Pat Metheny. The network is then asked to generate MIDI notes in sequence.
What this network has been asked to do is to produce an output stream that is statistically similar to the single MIDI input file it has been trained on. It would be more accurate to call this an "And Then I Knew" generator. Its "cost function" -- the function the network is trying to minimize during training -- is exactly how well it reproduced the target song.
Neural networks are "universal function approximators". It's not surprising that given a single input, a network can produce outputs that are statistically similar to it.
A network that could compose novel MIDI jazz would look like this:
* Train a network on a corpus of thousands to hundreds of thousands of MIDI jazz files.
* Add significant regularization and model capacity limits to prevent the network from "memorizing" its inputs.
* Generate music somehow -- the char-RNN approach described here is fine. There are other methods.
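To make the char-RNN idea concrete, here is a minimal sketch (not the author's code; all names are made up for illustration) of the sampling step: a trained model gives you a probability distribution over the next note, and you sample from it with a "temperature" knob that trades off fidelity to the training data against novelty.

```python
# Sketch of char-RNN-style sampling from a next-note distribution.
# Low temperature hugs the training data; high temperature flattens
# the distribution toward uniform randomness.
import math
import random

def sample_next(probs, temperature=1.0):
    """Sample an index from a categorical distribution reshaped by temperature."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    r = random.random() * total
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1

# Toy distribution over 4 "notes"; a real model would emit this
# distribution at every timestep conditioned on the history.
probs = [0.7, 0.2, 0.05, 0.05]
notes = [sample_next(probs, temperature=0.8) for _ in range(16)]
```

With a single training song, no temperature setting helps much: low temperature reproduces the song, high temperature produces noise.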
You want the network to build representations that capture the patterns of jazz well enough to pastiche them, but not representations so high-level that the network is simply humming "And Then I Knew" back at you. This is such a pervasive problem that any paper presenting a novel result in generative modeling pretty much must include a section presenting evidence that the model is not memorizing its inputs.
I can hum a few classic jazz tunes from memory, but that mental process is not jazz composition -- it's reproducing something from memory. If we're going to call a model's output "AI-generated jazz", we need some way to tell the network not to hum a tune it knows and instead to compose a new tune from the principles/patterns it knows. Since we can't speak to our models and tell them to think one way and not the other, part of the trick in this field is coming up with models that can only do one thing and not the other.
Generating random patterns that sound jazz-ish is interesting, but until multiple generators can react to what the other is doing in real time (or to a human participant), it isn't exactly jazz.
I'd equate it to a basketball playing robot. Teaching it to shoot free throws is interesting, but doesn't really take a step towards approximating what basketball is. Can it call for picks, lead passes to cutting teammates, box out for rebounds, force bad shots, etc?
The most interesting thing was reading the commentary on the matches. The announcers were mystified by the computer's moves; 'alien' came up a lot in describing its play-style. We humans can't play Go by evaluating each stone in the game individually. We have to 'chunk' the game. E.g.: these 3 stones are a 'wall' or a 'platoon', this stone is 'hot' and can take your stones, this stone is 'down' and will be used in 3 turns, etc. The computer doesn't have to do that chunking; it evaluates each stone individually. As such, its play-style was totally foreign to people. It did things no player had tried or, importantly, could have thought of, given that our brains are limited to 'chunking' the information.
I would predict that a b-ball-bot would play the same way: in totally strange ways that a human can't think of. E.g.: calculating a reasonably high probability that the ball will bounce off your nose and into the left hand of its teammate, throwing the ball as hard as it can at its own head to make a shot, trying to get past not just 1 opponent but the entire team's right thighs 57 seconds from now, etc.
Similarly with jazz: the computer is a dumb machine that will just do strange things, because humans have to 'chunk'. In music, we play in chords and notes and with rhythm and timing. The computer can evaluate the whole song, and every other song at the same time, and can borrow from all of those. You and I can pull in the feelings of losing a child, or the joy of strawberry ice-cream bars in a Memphis summer -- things a computer never will. But we cannot pull in the obscure Tuvan throat-singing techno remixes on YouTube, the Afro-Thai heavy metal Vimeo channels, the terrible pre-teen angst poems set to crappy guitar, etc., all at once. It can only see what you feed it, but you can feed it the life-outputted-into-music of billions of humans, with live updates. The computer will know more.
But music is emotional and about feelings. The feel of music is most important to us. And I think that a human songwriter is therefore essential, one that cares and puts effort into the work. It connects us, and that is what is important, not the sounds.
Children can play music very emotionally (or rather, in a way that adults associate with emotion) without having any experience of, or real comprehension of, the emotions. Imitation and training are sufficient to be convincing. A program doesn't need to experience emotion, only to know that certain characteristics of the sound are associated with certain emotions.
A conference paper: http://www.aaai.org/Papers/AAAI/2000/AAAI00-100.pdf
For an arbitrarily complex network, it could internally develop independent generators that react to each other.
However, the likelihood of common optimization strategies used for training RNNs (back-propagation through time, foveation/attention, etc.) developing a network like this is probably quite small.
It would be possible for a network designer to come up with a structure (as Hochreiter did with LSTMs) that lends itself to this sort of structure but then you're baking in assumptions about how humans accomplish a task (which comes with trade-offs).
A human composing new creative Jazz is using a much wider set of sources for creativity, not just existing jazz songs.
If so, why do you believe the network is only reproducing statistics rather than having learned the same circuitry humans have when improvising/composing jazz? It's hard to show that it's doing one thing or the other. In this case with n=1, it seems pretty clear it's doing the former.
If not, then it doesn't seem to matter since that's what humans are doing.
Most musicians learn from others and thus develop a style semi inspired by what they listen to.
So add 50,000 more songs and you have something.
Perception is reality.
Musicians do learn from each other, but then they learn to play what they like, or what sounds good. To this model, "what it likes" is a 100% representation of 'And Then I Knew'. You could swap the target song for another, but not for multiple targets at the same time without reworking the logic.
I.e., aren't the mechanics there, and isn't it primarily limited by the size of the feedback loop?
I am genuinely interested in the answer.
You've probably never listened to free jazz (or 100 other genres besides)...
I think people find unfamiliar music difficult to listen to. I don't think it's really about genre or artist.
I suppose some genres are trying to be difficult on some level (rock and roll, punk, metal, and rap each took up that mantle) but all of those were meant to be easy to listen to for a target audience.
Bartok never struck me as super combative. Brainy, perhaps.
I don't mean this as a slight at all, but definitely raise the bar on your experiments.
But then I listened to the original (the track used to train the network) and realized the problem: the network only knows how to write one song. What you hear on SoundCloud is the equivalent of giving someone a 5 paragraph essay, and then telling them to write a 10,000 word paper using only sentences contained in that essay.
Supposing that this program can accept more than 1 song in its training data, I expect it could produce really interesting stuff.
But there's a part of music where the human soul needs to be, and that is interesting too. Some of the expression stuff is harder to do in MIDI land -- you can modulate a filter cutoff or velocity or something -- but compared to a live player there is a LOT of work to do.
Also, unless I missed something the clips just play the network's attempt at duplicating the "head" of the track; not the soloing.
As a jazz musician I find this cool but I also feel safe that it won't be stealing gigs from me anytime soon.
In fast tempo bebop they tend to have relatively equal durations, and in other jazz styles they trend closer to 2/3 + 1/3 of the beat respectively.
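The straight-vs-swung split is easy to see as arithmetic. This is purely illustrative (the function name is made up): straight eighths divide the beat 1/2 + 1/2, while triplet-style swing divides it 2/3 + 1/3, so the second eighth note lands later within the beat.

```python
# Onset times (in beats) of the two eighth notes within one beat,
# parameterized by how much of the beat the first eighth occupies.
def eighth_onsets(beat, swing_ratio):
    """Return onset times of the pair of eighth notes starting at `beat`."""
    return (beat, beat + swing_ratio)

straight = eighth_onsets(0.0, swing_ratio=0.5)    # even split: 1/2 + 1/2
swung = eighth_onsets(0.0, swing_ratio=2 / 3)     # triplet swing: 2/3 + 1/3
```

At fast bebop tempos the ratio drifts back toward 0.5, which matches the "relatively equal durations" observation above.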
In a typical jazz swing drum beat the high-hat is closing on 2 and 4 (the upbeats).
In a typical rock drum beat the bass drum is on 1 and the snare drum is on 3 (the downbeats). There's almost never emphasis on the upbeats.
The two styles are almost completely opposite in feel and that Metheny track is using the rock style.
The two approaches you describe ("emphasis on 2 and 4" and "emphasis on 3") are actually the same thing. You're just counting twice as fast when you identify 3 as the back beat vs 2 & 4. To say that another way, any song that can be notated with the snare on 3 could also be notated with the snare on 2 & 4 by halving the tempo. I think most musicians would notate "And Then I Knew" such that the snare falls on 2 and 4, but that could probably go either way.
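The renotation claim can be sanity-checked with wall-clock arithmetic (illustrative code, names made up): a snare on beat 3 of a 4/4 bar at tempo T lands at the same moments as snares on beats 2 and 4 at tempo T/2, with each halved-tempo bar covering two of the original bars.

```python
# Verify that "snare on 3 at 120 bpm" and "snare on 2 & 4 at 60 bpm"
# produce hits at identical wall-clock times.
def beat_time(beat, bpm):
    """Seconds from the bar start to a given (1-indexed) beat."""
    return (beat - 1) * 60.0 / bpm

bpm = 120
bar_seconds = 4 * 60.0 / bpm
# Snare on 3 across two consecutive bars of 4/4 at 120 bpm:
fast = [beat_time(3, bpm), bar_seconds + beat_time(3, bpm)]
# The same hits notated as snare on 2 and 4 in one bar of 4/4 at 60 bpm:
slow = [beat_time(2, bpm / 2), beat_time(4, bpm / 2)]
# Both give hits at 1.0 s and 3.0 s.
```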
I'd contrast this with classical music (classical as in Mozart), where the emphasis is on beats 1 and 3.
Maybe you are thinking about counting eighth notes on the high hat -- in that case the bass drum would be on the first high hat hit, and the snare would be on the third. However, the counting should always be on the quarter note, i.e., two high hat hits per count -- "One and two and three and four and."
In your "boom...bap...boom...bap", the ellipses are quarter-note rests. Listen to any simple rock tune, say, AC/DC's Back in Black. With BD == Bass Drum, SD == Snare Drum, HH == High Hat, what you get is:
Note | 1 2 3 4
HH   | x x x x
BD   | x
SD   |     x
Not really. There are lots of jazz styles not dependent on rhythm, and whole lot of genres having little to do with rhythm either (e.g. ambient music).
ETA: The default midi sound font doesn't do it any favors, either. I have some software instruments I could throw at this that would make it sound a whole lot better.
"GenJam (short for Genetic Jammer) is an interactive genetic algorithm that learns to improvise jazz."
I stumbled across some music generators. A downloadable one http://duion.com/link/cgmusic-computer-generated-music
Both are "procedurally generated music" so I'm not sure where that falls in the AI spectrum.
I found the quality interesting, and there was some potential there, but at least in these cases there were some issues with the quality of the MIDI instruments, and the song structure was very "same-y".
Anyways, looking forward to poking around in the DeepJazz code.
I started on it recently -- and need to do more work on it -- to do some things in a bit more of an object-oriented way, trying to model more music theory concepts (like scales) as objects; not so much analyzing existing files as making the primitives you might need to build a sequencer (and eventually some generative stuff).
If people are interested check out:
https://github.com/mpdehaan/camp (in the README, there is mailing list info).
The next thing for me is to make an ASCII sequencer so it's a program that can also be used by people who can't code, and then I'll get back more into the generative parts.
I'd be surprised if a current-gen LSTM will be able to generalise music or language rules well enough to piece together music or sentences long enough for a coherent story, or a jazz piece that matches one by a competent human author.
The same author's Endless Traditional Music Session supplies all the Irish session music you could ever need, by mechanical means:
Having said that, and as a Jazz fan, the generated music is horrible. Keep feeding it more jazz tunes :P
Blogpost + music:
As someone who is maybe not as sophisticated in his taste for jazz, this sounds good enough to me. In particular, it could pass as elevator music.
On the other hand, it would be more valuable if there were more than a single file used for seeding. As it is, this is a theme that is listenable but will always have the same style as its seed.
I intend to play with it and see if I can get more interesting melodies.
Until such time as we discover an algorithm that replicates human taste in music, any AI-based approach to composing music will fail because it will not have any feedback about the quality of the music.
It seems like the word 'AI' is getting thrown around.
It would massively improve the quality of the output and make it sound more "humane" IMO.
You can use the samples from www.freesound.org for instance.
From my perspective, AI-generated music at present often falls really short in two areas. The first is instrumentation and dynamics. AI music often sounds "robotic". Better soundsets would probably help some AI examples, but beyond that, I find a lot of AI music "overly quantized" sounding. Humans often don't play the music exactly as written (see: https://en.wikipedia.org/wiki/Expressive_timing); this non-perfect timing is a large part of many works' expressive element.
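A crude version of "de-quantizing" is easy to sketch. This is illustrative only (the function and the onset format are made up): nudge each quantized note-on time by a few milliseconds of bounded random jitter so the grid feels less robotic.

```python
# Sketch of "humanizing" quantized MIDI onsets with bounded timing jitter.
import random

def humanize(onsets_ms, max_jitter_ms=15.0, seed=None):
    """Return onset times (ms) with random offsets in [-max_jitter, +max_jitter]."""
    rng = random.Random(seed)
    return [t + rng.uniform(-max_jitter_ms, max_jitter_ms) for t in onsets_ms]

quantized = [0.0, 500.0, 1000.0, 1500.0]   # perfectly on the grid
played = humanize(quantized, max_jitter_ms=15.0, seed=42)
```

Real expressive timing is systematic, not random (rushing into choruses, laying back on the beat), so this only scratches the surface.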
The second problem to me is that AI music often falls short on overall coherent musical themes. A lot of AI pieces tend to sound "structureless", with no real direction, no thematic elements, nothing that could be called a motif or hook, etc. There are definitely some established "rules and patterns" for music, so it's not as if some of this couldn't be fed into the AI. The best composers, however, bend and play with convention a bit.
I don't like the first one very much, it resembles me improvising some times; randomly repeating patterns without any direction or structure and without going anywhere.
The second one is better, it has some good moments, but still has the same problem, it lacks general structure, and just seems to go from pattern to pattern.
I like the third one, it may help that the form is very formulaic. Some rhythms that it makes are weird but in a good interesting way. The structure is better and seems to be going somewhere but unfortunately it doesn't finish.
Conclusion, if the program could incorporate structure in some way it would make for passable music, but I would say the humans are still safe ;)
I played around with looping different speech synthesizers back into different speech recognizers, kind of like audio or video feedback, but with chaotic noise injected like quirks of the synthesizer, the voice, speech speed and pitch, and the audio environment around the microphone (you could talk over it to interfere with the words it was speaking and lay down new words in the loop), working against the lawful pattern matching and error correction behavior of the speech recognizer, and the HMM language model it was trained with.
It was a lot like beat poetry, in that it tended to rhyme and have the same number of syllables and use plausible sounding sequences of words that didn't actually make any sense, like Sarah Palin.
You can start it out with a sensible sentence, and it will play the telephone game, distorting it again and again. If you slow down the speech rate, words will split into more words or syllables, and if you speed it up, words will collapse into fewer words or syllables, or you can tune the speech rate to maintain the same number of syllables. It's analogous to zooming the video camera in and out with video feedback.
It would wander aimlessly randomly around poetic landscapes, sometimes falling into strange attractors in the speech recognizer's hidden markov model and repeating itself with little or no variation.
At any time you can join in with your own voice and add words during the pause at the end of the loop, or talk over its voice, much the way you can hold things in front of the camera during video feedback to mix them in.
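The loop described above is simple to sketch. Everything here is a stand-in (the real system used actual TTS and speech-recognition engines); `synthesize()` and `recognize()` are simulated with a toy "mishearing" step just to show the telephone-game structure.

```python
# Hand-wavy sketch of a synthesizer -> recognizer feedback loop.
import random

def synthesize(text):
    # Stand-in for a real TTS engine; here "audio" is just the text.
    return text

def recognize(audio, rng):
    # Stand-in for a real recognizer: occasionally "mishears" a word,
    # substituting something from its (biased) training vocabulary.
    vocabulary = ["congressional", "investigation", "burglary", "wiretapping"]
    return " ".join(rng.choice(vocabulary) if rng.random() < 0.2 else w
                    for w in audio.split())

def telephone_game(seed_text, iterations=10, seed=0):
    rng = random.Random(seed)
    text = seed_text
    for _ in range(iterations):
        text = recognize(synthesize(text), rng)   # close the loop
    return text

final = telephone_game("the quick brown fox jumps over the lazy dog")
```

The strange attractors mentioned above correspond to fixed points of this iteration: texts the recognizer keeps transcribing back to themselves.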
Different speech recognizers are better at recognizing different vocabularies, and therefore like to babble about different topics, depending on which data they were trained on -- which we could guess by attempting to psychoanalyze their incoherent babbling.
IBM's ViaVoice was apparently trained on a lot of newspaper articles about the Watergate hearings, as it was quite paranoid, but business like, as if it were dictating a memo, and would start chanting and fixating on phrases like "congressional investigation," and "burglary and wiretapping," and "convicted of conspiracy".
Microsoft's speech recognizer had obviously been trained on newspaper articles about the Clinton Lewinsky scandal, since it was quite obsessed with repeatedly chanting about blow jobs (just like the news of the time), and whenever you mentioned Clinton this or Clinton that, it would rapidly converge on Clinton Lewinsky, Clinton presidency, Clinton impeachment, etc.
What I'd love to have would be a speech recognizer that returns a pitch envelope and timing that you could apply back on the synthesized words, then it could sing to you!
Have my upvote. (It was downvoted when I wrote this.)
And it happens to be mine, also, in regard to the SoundCloud samples. Sure, the project behind it might be mathematically interesting and all... but really now, this ain't music, let alone jazz. In fact, if I came across those samples whilst flipping between radio stations, I would probably hover for at most a second or two, before giving the dial another turn... or turning the damn thing off.
Absolutely unlistenable, in other words.
Your comment on the other hand (aside the complaining about downvoting, which is discouraged in the HN guidelines) was fair and interesting - and I personally upvoted it.
Your comment, on the other hand, provided some more insight into why this might not be notable or impressive.
1. I heard more convincing music composed by AI a decade ago.
2. It suffers from the same pointlessness that seems to plague these attempts: Disjointed, monotonous, doesn't go anywhere.
3. The exercise did not place itself in the context of other approaches to AI music composition.
I apologize for my rudeness and brevity.
(and yet is still more human-sounding than the atonalism that dominates modern orchestra works)
So it does the usual expert system/AI thing of cycling between "Almost music" and "And... lost the plot" over and over.
If it learned from a real musician, one of the first things it would have been taught is how to listen to other musicians' music. And then, how to listen to itself. And then, how to play a much simpler piece, and then to play it right, so it sounds like music... not like someone typing (which is what the SoundCloud samples sound like, to my ears).
But of course machines don't "listen" in any meaningful sense. And they certainly can't tell what it is they like about Pat Metheny's music; or why they "like" his music, but not the music of Billy Joel or Anthony Braxton.
So maybe that's where these researchers should start -- by creating systems that (at least attempt to) understand and evaluate music. And to tell good from bad.
Then, maybe, they can toy around with systems that generate music.
Like pretty much exactly what you'd expect from an entity that thinks music is just about notes and mathematical patterns... and not about emotions, or an experience that you feel in your body.
On a less hand-wavy level, I suspect there are issues of tone and timing embedded in human musical expression that these algorithms don't begin to capture. In the same way that -- no matter how much they keep tweaking their Markov chains and phoneme scales -- we can always tell that a machine-generated voice sounds "off" somehow, literally within tens of milliseconds.
As the saying goes: it don't mean a thing if it ain't got that je ne sais quoi.