We don't. Most of the applications of ANNs to music that I've seen so far are solving problems that either don't exist or have been very successfully solved by much simpler methods with much better results.
If you want to apply AI to something useful in music, how about:
- Machine learning for reverb generation. I.e. being able to "record" the reverb of a real space and apply it to a sound.
- AI/ML for high-quality sample transposition.
- AI for pure sound synthesis. Imagine something that does PCA on a set of sounds and then allows you to generate new sounds by changing components with a bunch of knobs (a rough sketch of this follows below the list).
- AI for real-time sample morphing. For example, being able to synthesize a choir that sings the words you need. Or, how about transforming your own voice into a quality choir vocals? Hey, style transfer!
- AI for high-quality music transcription (sound to notes or MIDI). This is being done to some extent, but I don't think it's very good yet.
And so on.
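To make the PCA-with-knobs idea concrete, here is a minimal sketch, assuming a hypothetical sounds/ folder with at least eight short WAV clips; it fits PCA on magnitude spectrograms and resynthesizes with Griffin-Lim, so phase and quality are rough, and every filename and parameter is a placeholder rather than anything from a real tool.

```python
# Sketch only: PCA over magnitude spectrograms of a small sound set,
# then "knob" synthesis by moving along the principal components.
# Assumes a hypothetical ./sounds/ folder with at least 8 short WAV clips.
import glob
import numpy as np
import librosa
from sklearn.decomposition import PCA

SR, N_FFT = 22050, 1024
specs, shape = [], None
for path in glob.glob("sounds/*.wav"):
    y, _ = librosa.load(path, sr=SR)
    y = librosa.util.fix_length(y, size=SR)      # force every clip to 1 second
    S = np.abs(librosa.stft(y, n_fft=N_FFT))     # magnitude spectrogram
    shape = S.shape
    specs.append(S.flatten())

pca = PCA(n_components=8).fit(np.array(specs))   # 8 components = 8 "knobs"

def synth(knobs):
    """Map an 8-vector of knob positions back to audio via inverse PCA + Griffin-Lim."""
    S = pca.inverse_transform(np.asarray(knobs)).reshape(shape)
    return librosa.griffinlim(np.maximum(S, 0.0), n_fft=N_FFT)

# Start from the first sound's coordinates and push the second knob.
knobs = pca.transform(np.array(specs[:1]))[0]
knobs[1] += 2.0
new_sound = synth(knobs)
```

A serious version would want a learned decoder instead of Griffin-Lim, but even this toy gives a feel for what "component knobs" over a sound set could mean.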
I can assure you, every single one of these would be something musicians would be willing to pay good money for.
- I haven't heard the term "transposition" used for samples - is this the name for moving samples in pitch and compensating in time so that they sound undistorted?
- On pure sound synthesis, see NSynth. It's an early step toward this with abstract sounds.
- Sample morphing and changing the words with voices is pretty tough (it overlaps heavily with neural TTS, which has only recently started to work well), but if you can synthesize a single voice decently, the rest seems doable with manual intervention. Check the Neural Parametric Singing Synthesizer, or try the online demo.
- Every few months there is a new convnet which improves the baselines for transcription, hopefully it is only a matter of time. The current best is quite a bit better than only a few years ago.
There is still a huge gap between "what is possible in research" and "what is easily usable in a practical editing/creation workflow" on a typical DAW, but hopefully that gap will diminish over time.
There are also a lot of tools from statistical signal processing that were already doing these tasks to some degree; digging them back up and "neuralizing" the probabilistic parts with neural nets is a promising way to get quality improvements, and would mirror what has been successful in neural TTS so far.
Yes, that's exactly it. There are traditional algorithms that do it, but most are variations on time stretching and granular synthesis. They aren't as good as they could be. A good algorithm would be able to take in a handful of examples and generate a model that seamlessly interpolates across different pitches and velocities, accounting for differences in timbre, attack, decay, noises and so on.
Thank you for the links.
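For what it's worth, the traditional baseline in that comparison (a phase-vocoder stretch plus resampling) is only a few lines these days; a learned model would be judged on how much better it handles timbre, attack, and noise across the range. Filenames and step values below are arbitrary placeholders.

```python
# The traditional baseline: phase-vocoder based pitch shifting of one sample.
# A learned model would instead interpolate timbre/attack/noise across
# pitches and velocities, rather than warping a single recording.
import librosa
import soundfile as sf

y, sr = librosa.load("piano_C4.wav", sr=None)            # hypothetical source sample
for steps in (-5, -2, 2, 5, 7):                          # transposition in semitones
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    sf.write(f"piano_C4_shift_{steps:+d}.wav", shifted, sr)
```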
The only limit to this is how close to a perfect impulse you can generate.
Also, there is more to reverb than just impulse response. A more sophisticated reverb simulator would map out the space and allow you to choose where the "listener" and the source are located in that environment.
The nearby comment about moving obstructions is on-point as well.
There is ample space for application of machine learning here.
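The impulse-response half of this is already cheap to prototype: capture an IR (clap or sine sweep) in the space and convolve. A rough sketch, with placeholder filenames and assuming mono WAVs; the genuinely ML-shaped part would be interpolating IRs across listener and source positions, which this doesn't attempt.

```python
# Plain convolution reverb from a measured impulse response (mono WAVs assumed).
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_vocal.wav")           # hypothetical dry recording
ir, sr_ir = sf.read("stairwell_ir.wav")      # hypothetical recorded impulse response
assert sr == sr_ir, "resample one of the files first"

wet = fftconvolve(dry, ir)[: len(dry)]       # convolve, trim the tail
wet /= np.max(np.abs(wet)) + 1e-9            # normalize to avoid clipping
sf.write("vocal_in_stairwell.wav", 0.7 * dry + 0.3 * wet, sr)
```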
By that I mean, we have huge masses of music theory, applied across genres and focusing on differences between genres, in terms of heuristics for both analysis & composition, and because of a tradition of procedural generation of music that goes back a couple centuries, a lot of it is fairly easy to translate into computer programs. (For instance, end-to-end serialist composition is easier for computers than it is for people, while canons and other mechanisms for creating permutations of melodies are equally straightforward.)
This doesn't translate into a straightforward method for putting in two wav files and producing a third with transferred style, but it does mean that a sufficiently motivated person can write something that translates notes between two known genres with greater ease than they would with images.
(Text is somewhere in the middle. I've worked on a couple attempts at 'style transfer' for text -- mostly using word2vec.)
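To illustrate how mechanical some of this translation is: a canon entry is just the same melody delayed, optionally transposed or inverted, which is a few lines over (onset, pitch, duration) tuples. The melody and numbers here are made up.

```python
# How mechanical "canon" generation is: each voice is the same melody,
# delayed and transposed. Notes are (onset_beats, midi_pitch, duration) tuples.
melody = [(0, 60, 1), (1, 62, 1), (2, 64, 1), (3, 65, 1), (4, 67, 2)]

def canon_voice(notes, delay_beats, transpose=0, invert_around=None):
    out = []
    for onset, pitch, dur in notes:
        if invert_around is not None:
            pitch = 2 * invert_around - pitch        # melodic inversion
        out.append((onset + delay_beats, pitch + transpose, dur))
    return out

voices = [
    melody,
    canon_voice(melody, delay_beats=2, transpose=-12),     # canon at the octave
    canon_voice(melody, delay_beats=4, invert_around=60),  # inverted entry
]
```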
There are a lot of regularities and patterns in harmony and harmonic sequences which music theory covers, but there is also a combinatorial explosion of melodies that will be justified by music theory in a particular harmonic context. The choice of which melodic path to go down is very poorly constrained by music theory.
This is one reason many people in the space are focused on new tools for creators, instead of "automated creation" - what is "cool/interesting/listenable" is ill-defined, but making something which allows creators to explore the sonic landscape in really different ways seems a lot more plausible to me - more "weird synthesizers with neural networks" than "robo-artists".
If you write a rule-based counterpoint solver - this has been done, with varying levels of success - you'll find that not only do you have to include an extra set of rules that aren't defined in any of the standard texts, but that the best output you can expect is musically mediocre.
The other approach is to create a patchwork of idiomatic cliches. That usually sounds more convincing, but still isn't musically interesting. And you can usually hear where the edges are glued together.
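As an example of what those solvers encode, one of the textbook rules (no parallel perfect fifths or octaves between two voices) takes only a few lines; the point is that the rules the texts don't write down, and the musical judgment beyond them, are the hard part. Pitches are MIDI numbers, chosen arbitrarily.

```python
# One of the textbook rules a counterpoint solver encodes: flag parallel
# perfect fifths/octaves between two voices given as lists of MIDI pitches.
def parallel_perfects(upper, lower):
    flagged = []
    for i in range(1, min(len(upper), len(lower))):
        prev = (upper[i - 1] - lower[i - 1]) % 12
        curr = (upper[i] - lower[i]) % 12
        moved = upper[i] != upper[i - 1] and lower[i] != lower[i - 1]
        if moved and prev == curr and curr in (0, 7):   # octave/unison or fifth
            flagged.append(i)
    return flagged

print(parallel_perfects([67, 69, 71], [60, 62, 64]))    # -> [1, 2]: parallel fifths
```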
It turns out that there is no theory of "good" music. It literally doesn't exist. All the standard texts for each style - and it doesn't matter which style you pick, from Palestrina to pop - are very incomplete guides that rely on human intelligence and creativity to fill in the gaps.
Throwing neural networks at this problem doesn't make it any easier, because no one knows what to look for.
Features in artistic style transfer are easy to parse - shape, texture, and that's pretty much it, all packaged in a ready-to-go 2D distribution.
What are the musical features that define not just one possible musical style, but all of them, and would allow anyone to morph smoothly from Wagner to Taylor Swift to Balinese Gamelan to DubStep?
I think there is potential to do something by blending all 3 (rules and constraint checking, rewards (perhaps based on idioms, riffs, and cliches), and some additional discriminator/critic/metric learning) if you narrow the criteria sufficiently (3-voice counterpoint in the style of Josquin des Prez, for example).
I would really be pleased to see something that can fill out "theory 101" counterpoint worksheets against a cantus firmus. Even if the results would be graded poorly on "style", it is somewhere to start getting feedback and collecting data to try to quantify that "it" factor for a narrow, narrow subsection of the wide world of music.
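A toy version of that worksheet filler, just to show the shape of the blend: brute-force note choices over a cantus firmus, drop candidates that break hard interval rules, and rank the survivors with a crude hand-written preference score - the score being exactly the part one would hope to replace with something learned. The cantus, scale, and weights are invented for the example.

```python
# Toy "worksheet filler": brute-force first-species counterpoint above a
# cantus firmus, keep candidates that pass hard interval rules, rank the
# rest with a crude hand-written score (the part one would want to learn).
from itertools import product

CANTUS = [60, 62, 64, 62, 60]                    # hypothetical cantus firmus in C
SCALE = [60, 62, 64, 65, 67, 69, 71, 72, 74, 76]
CONSONANT = {0, 3, 4, 7, 8, 9}                   # intervals mod 12 treated as consonant

def legal(cp):
    for note, cf in zip(cp, CANTUS):
        if note <= cf or (note - cf) % 12 not in CONSONANT:
            return False
    for i in range(1, len(cp)):
        if abs(cp[i] - cp[i - 1]) > 12:          # no leaps larger than an octave
            return False
    return True

def score(cp):
    steps = sum(1 for a, b in zip(cp, cp[1:]) if abs(a - b) in (1, 2))
    contrary = sum(1 for i in range(1, len(cp))
                   if (cp[i] - cp[i - 1]) * (CANTUS[i] - CANTUS[i - 1]) < 0)
    return steps + contrary                      # prefer stepwise, contrary motion

candidates = [cp for cp in product(SCALE, repeat=len(CANTUS)) if legal(cp)]
print(sorted(candidates, key=score, reverse=True)[:3])
```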
It's not so much that I'm overestimating music theory, but that the state of really concrete analytic structures in other arts (in terms of being well-positioned to produce new works from PRNG output) is pretty bad.
It's pretty straightforward to construct a program that produces mediocre music that really resembles human-written mediocre music. Computers are also really good at producing poetry that is indistinguishable from the work of mediocre human poets. Computers are not good at producing images that look like they were hand-drawn by a 12-year-old DeviantArt user, or at producing stories that read like they were taken from fanfiction.net.
All lots of fun.
In music, the "style" is the content in some sense. For example, jazz has a very different "style" from classical at many levels (key and tempo choice / mode choice / melodic intervals / motifs / how much a motif repeats and how it varies / harmonization and chord choice / global structure (the AABA form)), and it isn't easy to separate which of those pieces make it "jazz" and which don't (which factors of variation matter).
The equivalent in images would be replacing objects as well as texture, to form a new image that is reminiscent of the original but also novel at multiple scales - think of a Simpsons "Last Supper" as the goal of a style transfer.
It is also hard because as consumers we are used to hearing high quality versions of these types of "style transfer" for some styles all the time - and we even have a name for it ... "muzak".
With music, it's much more complicated due not only to the time domain, but also to the emotional content, the interrelationships of rhythm, tempo, chordal progressions, expression, timbre, transitions from the simple to the complex, the way the music is layered, and how all these things are perceived by the listener. Like human faces, music elicits an emotional response that runs deep, and whereas the emotional content of a picture is relatively static, music's emotional content is dynamic and evolving. Music has its own "uncanny valley" between the machine and the human.
Think of a MIDI track that is generated by precisely laying out notes and velocities on a piano roll, vs. a track that is recorded live and captures the dynamics of the player. The difference is immediately obvious, and that's with a track composed by a human.
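Part of that gap is the mechanical part - quantized timing and constant velocities - which a crude "humanize" pass can jitter, and part is phrasing, which it can't. A toy sketch over (onset, pitch, velocity) tuples, with arbitrary jitter amounts:

```python
# Crude "humanize" pass over quantized note events: random timing and
# velocity jitter narrows the robotic feel a little, but captures none
# of the phrasing a live player brings.
import random

def humanize(events, timing_jitter=0.02, vel_jitter=8):
    """events: list of (onset_seconds, midi_pitch, velocity) tuples."""
    out = []
    for onset, pitch, vel in events:
        onset = max(0.0, onset + random.uniform(-timing_jitter, timing_jitter))
        vel = min(127, max(1, vel + random.randint(-vel_jitter, vel_jitter)))
        out.append((onset, pitch, vel))
    return out

quantized = [(i * 0.5, 60 + i, 96) for i in range(8)]   # rigid piano-roll input
print(humanize(quantized))
```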
I wrote this blog post, with some data that might help improve that.