
A reflection about automatic transcription of music (2018) - whocansay
https://www.seventhstring.com/resources/autotrans.html
======
ekelsen
Transcribing solo piano to MIDI, even very complex polyphonic pieces, is
fairly reasonable these days. You can try a demo of a not-quite-state-of-the-
art system here: [https://piano-scribe.glitch.me/](https://piano-scribe.glitch.me/)
(full disclosure: I was one of the researchers who produced this system).

I would argue that there are many more people who can take the MIDI output
and turn it into a reasonable score than there are people who can listen to
the raw audio and turn it into a reasonable approximation of the notes.
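To make that MIDI-to-score step concrete: one small piece of it is snapping performed onsets onto a metric grid. A toy sketch (the function name, grid choice, and tick values are mine, not from any particular system):

```python
# Quantize note onsets (in MIDI ticks) to the nearest sixteenth-note
# grid position -- one small part of turning raw MIDI into a score.
PPQ = 480        # ticks per quarter note, a common MIDI resolution
GRID = PPQ // 4  # sixteenth-note grid

def quantize(onset_ticks):
    """Snap an onset time to the nearest sixteenth-note grid line."""
    return round(onset_ticks / GRID) * GRID

# A slightly "human" performance: onsets drift a few ticks off the grid.
performed = [2, 478, 965, 1437]
print([quantize(t) for t in performed])  # -> [0, 480, 960, 1440]
```

Real scoring systems of course also have to infer tempo, meter, and voicing, which is where it gets hard.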

Papers / dataset:

[https://arxiv.org/pdf/1810.12247.pdf](https://arxiv.org/pdf/1810.12247.pdf)
[New dataset and slightly improved network]

[https://arxiv.org/pdf/1710.11153.pdf](https://arxiv.org/pdf/1710.11153.pdf)
[Original Network]

~~~
jacquesm
Have you made any effort at extracting insights from the models?

------
adrianh
I agree 100% with this essay. I’ve been transcribing music manually for just
over 20 years, and I’ve dabbled with automatic transcription products and
algorithms. A few months ago, I attended my first ISMIR conference (the mostly
academic community of researchers working on these problems).

With that background, I’ve come to believe automatic transcription is still so
far from good that we’re better off creating tools that make it easier for
humans to transcribe. That’s our philosophy behind Soundslice
([https://www.soundslice.com/transcribe/](https://www.soundslice.com/transcribe/)),
which combines a music notation editor with transcription tools. For anybody
interested in transcribing music, I encourage you to give it a try.

------
skybrian
It seems like an interesting machine learning problem rather than an
AI-complete one. There are lots of finicky details when transcribing speech as
well, but speech recognition is apparently not AI-complete. You do need a
large corpus so the algorithm knows what's typical.

If I were going to work on this, I would work on generating lead sheets from
YouTube videos. Recognizing chords seems like a useful thing to solve?
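Chord recognition is often framed as template matching over a 12-bin pitch-class (chroma) profile. A purely illustrative sketch, with made-up chroma values (real systems extract chroma from audio and smooth over time):

```python
# Toy chord recognition: score a 12-bin pitch-class profile against
# major and minor triad templates rooted at each of the 12 semitones.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F',
              'F#', 'G', 'G#', 'A', 'A#', 'B']

def best_chord(chroma):
    """Return the triad label whose template overlaps chroma the most."""
    best, best_score = None, -1.0
    for root in range(12):
        for quality, intervals in (('maj', (0, 4, 7)), ('min', (0, 3, 7))):
            score = sum(chroma[(root + i) % 12] for i in intervals)
            if score > best_score:
                best, best_score = NOTE_NAMES[root] + quality, score
    return best

# Energy concentrated on C, E, and G should be labelled C major:
chroma = [1.0, 0, 0, 0, 0.9, 0, 0, 0.8, 0, 0, 0, 0]
print(best_chord(chroma))  # -> 'Cmaj'
```

The hard part isn't this matching step, it's getting a clean chroma out of a polyphonic mix in the first place.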

~~~
unlinked_dll
>Recognizing chords seems like a useful thing to solve?

It's also fairly easy for a practicing musician to learn and to teach, but
much more difficult to teach a machine, since you run into issues with blind
source separation.

Speech transcription is _good enough_ provided we have enough processing
power, can assume a single speaker, know the language beforehand, and have
trained the model on a large number of previous speakers with the same
dialect/accent.

In music, you don't know how many speakers there are (instruments playing),
the dialect/accent (orchestration/chord voicing) changes on the fly,
representations are non-unique and contextual, and artists intentionally
subvert expected results to make good music.

Humans are just better at this and easier to train to do it than computers,
for the moment.

~~~
yorwba
On the other hand, most instruments are built for repeatability, so you can
play the same note twice and have it sound pretty much the same. Humans
produce speech with a resonant cavity made of soft tissue, so a sound like
"aaah" corresponds to a fuzzy range of tongue shapes and so on, and saying the
same thing twice is pretty much guaranteed to sound quite different.

Being able to separate individual notes of a musical piece into sharply
defined buckets (keys of a piano) or one-dimensional subspaces (finger
position on stringed instruments like guitars) simplifies the source
separation problem a lot.
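The "sharply defined buckets" point is easy to make concrete: any detected fundamental frequency snaps to the nearest equal-tempered key. A small sketch (assuming A4 = 440 Hz and standard MIDI numbering):

```python
import math

def nearest_key(freq_hz):
    """Map a frequency to the nearest MIDI note number (A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

# A note played slightly sharp still lands in the right bucket:
print(nearest_key(440.0))   # -> 69 (A4)
print(nearest_key(446.0))   # -> 69 (A4, about 23 cents sharp)
print(nearest_key(261.63))  # -> 60 (middle C)
```

Nothing comparable exists for vowels: there is no canonical grid to round an "aaah" to.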

That representations are contextual and subject to interpretation by the
artist is a harder problem (as discussed in TFA), but it should be possible to
treat it separately from the pure chord recognition problem. (E.g. it would be
easy to take notation and a matching MIDI file and then pretend that it's the
output of the recognition step which the original notation should be recovered
from.)

~~~
unlinked_dll
The buckets are more flat than sharply defined (ha). That's part of the
problem. Information is lost going from notation to pitch to recording, and I
suspect one could prove that these transforms don't preserve topology and are
not homeomorphisms that can be easily inverted.

To use your example, finger position is not a one-dimensional subspace on a
guitar. There are between one and six ways to play a given pitch, even
assuming standard tuning on a six-string guitar, which is itself not a safe
assumption to make.
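That multiplicity is easy to enumerate. A sketch, assuming standard tuning and a 22-fret neck (both assumptions, as noted above):

```python
# Enumerate every (string, fret) position producing a given MIDI pitch
# on a six-string guitar in standard tuning, low E to high E.
OPEN_STRINGS = [40, 45, 50, 55, 59, 64]  # E2 A2 D3 G3 B3 E4 as MIDI notes
MAX_FRET = 22

def positions(midi_pitch):
    """All ways to finger midi_pitch: list of (string index, fret)."""
    return [(s, midi_pitch - open_note)
            for s, open_note in enumerate(OPEN_STRINGS)
            if 0 <= midi_pitch - open_note <= MAX_FRET]

# E4 (MIDI 64) can be fingered in five different places:
print(positions(64))  # -> [(1, 19), (2, 14), (3, 9), (4, 5), (5, 0)]
# Low E (MIDI 40) in only one:
print(positions(40))  # -> [(0, 0)]
```

Going from recording back to notation means choosing among these, which is exactly the kind of information the forward transform throws away.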

But the issue with chord transcription sits at the heart of source separation
outside of VC demos: the causality problem that no one likes to talk about. To
separate a signal into N sources, you need to know a priori that there are N
sources to separate into. This is not a trivial thing to predict in chord
voicings, where N changes and is not predictable. Then you need to make a best
guess at which instrument each pitch belongs to, and pitches may be shared
between instruments.

This is something that even humans fall victim to. Untrained listeners are
very bad at counting the sources in an ensemble, and even trained listeners
struggle to notate chord voicings with decent accuracy.

------
tomcam
All true, although one of the things that bothers me is that even if you can
read music, you need to understand the historical context in order to play a
piece properly. For example, you would play a Bach piece with a lighter touch
and less legato than you would a Beethoven piece with the same notes. And of
course a jazz musician looking at music from a certain era is almost certainly
going to play what look like eighth notes with swing, which is a completely
different feel and should really be notated differently as well.

~~~
meowface
This is why I tend to prefer "auteur" music; music composed, arranged,
produced, performed, and ideally even mixed by the same person/small group of
people.

It's a shame we'll never get to hear Beethoven or Mozart playing their pieces
themselves, in their intended locations and with their preferred instruments
and seats and such. I suspect it'd sound much better than any other performer
who's interpreted their work.

~~~
jacquesm
There are piano-roll recordings of Rachmaninoff, and also some very precious
early recordings:

[https://www.youtube.com/watch?v=pBx-tr1FDvY](https://www.youtube.com/watch?v=pBx-tr1FDvY)

------
dekhn
Rather than targeting formal music notation, I think it makes a lot more sense
to use modern digital audio workstation representation (horizontal bars
representing individual tracks). I believe this can represent everything music
notation can (for practical purposes) and has the advantage of being easily
visually readable by people with little musical training (I've spent the time
to learn basic music notation and I find it has a very high cognitive overhead
for people who didn't learn it during early childhood).

Here's a lovely representation:
[https://www.youtube.com/watch?v=G2tEVVeGCk0&feature=share](https://www.youtube.com/watch?v=G2tEVVeGCk0&feature=share)

It seems likely that, given enough training data (labelled scores), you could
train a net that takes raw, complex music in and generates that representation
out.
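The DAW-style "horizontal bars" view is simple enough to render even in text. A toy sketch (the function and note format are invented for illustration):

```python
# Render a tiny ASCII "piano roll": one row per note, one column per
# sixteenth-note step, '=' where the note is sounding.
def piano_roll(notes, steps=16):
    """notes: list of (name, start_step, length_steps), high to low."""
    rows = []
    for name, start, length in notes:
        row = ['.'] * steps
        for t in range(start, min(start + length, steps)):
            row[t] = '='
        rows.append(f'{name:>3} |{"".join(row)}|')
    return '\n'.join(rows)

# An ascending C-E-G arpeggio, each bar's length showing its duration:
print(piano_roll([('G4', 8, 8), ('E4', 4, 4), ('C4', 0, 4)]))
```

Pitch is position on the vertical axis and duration is bar length, so no symbol vocabulary (clefs, flags, dots) has to be learned first.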

------
bertr4nd
I’d love to work on automatic transcription for drum set. I trained in
classical percussion, but I’m self-taught when it comes to rock, so I have
trouble figuring out how to reproduce fills that I want to play.

I remember using Transcribe!, created by the author of this article, to try to
figure out some of the fills in ZZ Top's La Grange, which isn't exactly a
complicated song, but I can't figure out how to make the fills feel right.
Transcribe! is great, btw.

Anyway, I'd be very curious to hear if there's any existing work in this
area. I feel like drum transcription could be easier in some ways than
"pitched" instruments, but possibly harder in some unique ways too.

~~~
adrianh
Yes, there is indeed work in this area! Check out the academic papers
published in association with ISMIR.

------
jacquesm
So, I've been working on just this for the last 5 months. It's getting there.
I'm still a few percent behind what 'Onsets and Frames' (see the comment by
ekelsen) can do, but the gap is shrinking and my approach is orders of
magnitude faster.

That said, I'm not sure yet if I'll be able to close the gap, and it's a hard
problem to solve. And yes, as the author correctly identifies: this is for
solo piano, or harpsichord.

------
kazinator
Transcription to accurate MIDI rather than notation is useful: MIDI that
captures portamento, legato, vibrato, and similar details of expression.
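In standard MIDI, vibrato and portamento are conventionally carried by pitch-bend messages. A sketch of just the bend curve for a vibrato (the rate/depth numbers are illustrative; actual message emission via a sequencer library is left out):

```python
import math

# Sketch a vibrato as a stream of MIDI pitch-bend values: a 5.5 Hz sine
# at +/-40 cents depth, sampled every 10 ms. 8192 is the centre (no bend),
# and the common +/-2-semitone bend range gives 4096 units per semitone.
CENTRE, SEMITONE = 8192, 4096

def vibrato_bends(rate_hz=5.5, depth_cents=40, dur_s=1.0, dt_s=0.01):
    n = int(dur_s / dt_s)
    depth = depth_cents / 100 * SEMITONE  # cents -> bend units
    return [CENTRE + round(depth * math.sin(2 * math.pi * rate_hz * i * dt_s))
            for i in range(n)]

bends = vibrato_bends()
print(min(bends), max(bends))  # stays within about +/-40 cents of centre
```

Notation would collapse all of this to a single "vib." marking; the MIDI stream keeps the actual shape of the expression.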

