MIDI is extraordinarily expressive and has likely been used to sequence the large majority of music produced within the last three decades. A lot of the instruments you hear are synthesizers or samplers driven directly from MIDI. There is a lot more to what MIDI can do, and is used for, than the impression most people have from "canyon.mid" or old website background music. If an AI can do MIDI just fine, then it's an extremely small leap to doing audio just fine.
> If an AI can do MIDI just fine, then it's an extremely small leap to doing audio just fine.
Unfortunately, this is not true. It takes a huge amount of human effort to make MIDI-encoded music sound good. The difference between MIDI and raw audio music generation is the difference between drawing a cartoon and producing a photograph.
To clarify: yes, MIDI can be expressive, but what's being generated when people say "AI generates MIDI music" is basically a piano roll.
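To make that distinction concrete, here is a minimal illustration (assuming the third-party mido library; it's a sketch for this thread, not anyone's actual pipeline) of the same two notes written first as a bare piano roll and then with the kind of continuous controller data a human performance would carry:

```python
# Illustrative only: "piano roll" MIDI versus MIDI with expressive controller data.
# Assumes the mido library (pip install mido).
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)
tick = mid.ticks_per_beat  # 480 by default

# "Piano roll" style: just note on/off events at a fixed velocity.
track.append(mido.Message('note_on', note=60, velocity=80, time=0))
track.append(mido.Message('note_off', note=60, velocity=0, time=tick))
track.append(mido.Message('note_on', note=64, velocity=80, time=0))
track.append(mido.Message('note_off', note=64, velocity=0, time=tick))

# Expressive style: the same notes, plus an expression swell (CC11),
# a touch of mod wheel (CC1), and a pitch bend scooping into the second note.
track.append(mido.Message('note_on', note=60, velocity=64, time=0))
for i in range(16):  # gradual crescendo under the first note
    track.append(mido.Message('control_change', control=11,
                              value=60 + i * 4, time=tick // 16))
track.append(mido.Message('note_off', note=60, velocity=0, time=0))
track.append(mido.Message('pitchwheel', pitch=-2048, time=0))
track.append(mido.Message('note_on', note=64, velocity=96, time=0))
track.append(mido.Message('pitchwheel', pitch=0, time=tick // 8))
track.append(mido.Message('control_change', control=1, value=40, time=0))
track.append(mido.Message('note_off', note=64, velocity=0, time=tick))

mid.save('piano_roll_vs_expressive.mid')
```

The note list in the second half is a small fraction of the data; the controller curves are where most of the "expression" lives, and they are exactly what a bare piano roll omits.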
I'm not familiar enough with existing implementations of such systems to dispute it, but there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc.) in their output. I'm envisioning a DAW set up with several VSTs and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to "tweak the knobs" in search of something pleasing.
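As a rough sketch of what that "tweak the knobs" search could look like (purely illustrative; the fitness function below is a hypothetical placeholder for a human rating, a learned model, or an audio-feature heuristic, none of which are specified in this thread):

```python
# A minimal genetic-algorithm sketch over normalized "knob" parameters
# (filter cutoffs, envelope times, send levels...). Illustrative only.
import random

N_PARAMS = 16
POP_SIZE = 32
GENERATIONS = 50

def fitness(params):
    # Placeholder: in a real system this would render `params` through the
    # DAW/VST chain and score the resulting audio somehow.
    return -sum((p - 0.5) ** 2 for p in params)

def mutate(params, rate=0.1, scale=0.05):
    # Nudge a few knobs by a small Gaussian amount, clamped to [0, 1].
    return [min(1.0, max(0.0, p + random.gauss(0, scale))) if random.random() < rate else p
            for p in params]

def crossover(a, b):
    # Single-point crossover between two parameter vectors.
    cut = random.randrange(1, N_PARAMS)
    return a[:cut] + b[cut:]

population = [[random.random() for _ in range(N_PARAMS)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 4]  # keep the best quarter
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print("best parameter vector:", max(population, key=fitness))
```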
The search space is absolutely enormous, though, so I don't dispute that it's very difficult, but I wouldn't go so far as to say it can't be done. In such a space there are "no wrong answers," so to speak. I have a Python script that creates randomized sequences of notes and rhythms and gives each one a different combination of low-pass/high-pass filters and random envelopes; it's not music, but it takes on a much less mechanical quality by emulating different attacks and timbres over time, even though it's completely random.
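Something along the lines of the script described above might look like this (a hedged reconstruction, not the actual script; it uses only numpy and the standard library, renders square-wave notes, and gives each note a random one-pole low-pass or high-pass filter and a random envelope):

```python
# Sketch: random note/rhythm sequences, each rendered with a random filter
# and envelope, written out as a mono WAV file. Illustrative, not the original script.
import random
import wave
import numpy as np

SR = 44100  # sample rate in Hz

def midi_to_hz(note):
    return 440.0 * 2 ** ((note - 69) / 12)

def one_pole(x, cutoff_hz, highpass=False):
    # Crude one-pole filter, enough to color the timbre per note.
    a = np.exp(-2 * np.pi * cutoff_hz / SR)
    y = np.empty_like(x)
    prev = 0.0
    for i, s in enumerate(x):
        prev = (1 - a) * s + a * prev
        y[i] = s - prev if highpass else prev
    return y

def envelope(n, attack, release):
    # Linear attack and release ramps around a sustained middle section.
    a, r = max(1, int(attack * n)), max(1, int(release * n))
    env = np.ones(n)
    env[:a] = np.linspace(0.0, 1.0, a)
    env[n - r:] = np.linspace(1.0, 0.0, r)
    return env

def render_note(note, dur):
    t = np.arange(int(dur * SR)) / SR
    tone = np.sign(np.sin(2 * np.pi * midi_to_hz(note) * t))  # square wave
    tone *= envelope(len(tone), random.uniform(0.01, 0.4), random.uniform(0.1, 0.5))
    return one_pole(tone, random.uniform(200, 8000), highpass=random.random() < 0.5)

def random_sequence(n_notes=16, scale=(60, 62, 64, 65, 67, 69, 71, 72)):
    return np.concatenate([render_note(random.choice(scale),
                                       random.choice((0.125, 0.25, 0.5)))
                           for _ in range(n_notes)])

if __name__ == "__main__":
    audio = random_sequence()
    audio = (audio / np.max(np.abs(audio)) * 0.8 * 32767).astype(np.int16)
    with wave.open("random_sequence.wav", "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(SR)
        f.writeframes(audio.tobytes())
```

Even with completely random notes, the per-note envelopes and filters give each event its own attack and timbre, which is the "less mechanical" quality described above.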
In fact, I'd be genuinely surprised if algorithmic composition and production hasn't been used, to an extent significantly greater than "basically a piano roll," in at least some of the past decade's Top 40 radio music.
> there's no fundamental reason algorithmic composition systems could not include modulation parameters of all kinds (pitch/breath/effects/synthesizer controls/etc.) in their output
There is such a reason: lack of training data. Very few high-quality, detailed MIDI samples exist to train machine learning models like AudioLM.
For the state of the art in MIDI generation, take a look at what https://aiva.ai/ produces (it's free for personal use). There you can compare the raw MIDI output to an automatically generated mp3 rendering (using "VSTs and samplers with routing and effects in place, then using some combination of genetic algorithms and other methods to 'tweak the knobs' in search of something pleasing").
The mp3 version will sound much better than the raw MIDI, but (usually) significantly worse than music recorded in a studio and arranged/processed by a human.
As a classically trained pianist who then got into electronica and synthesis, I found it mind-blowing that people could wrangle expression and phrasing from a MIDI sequencer.
That particular niche has had some pretty amazing successes already. It's coming.
We can't produce arbitrary media streams with many "stack layers" of meaning and detail yet, but we can do a lot of specific instrumental transformations...