Researchers’ AI aligns sheet music with MIDI audio (venturebeat.com)
97 points by Bud 11 days ago | 64 comments





Does this really need AI? Couldn't you just take the maximum of the fourier transform of the rendering multiplied with the audio? Maybe across multiple dimensions.
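One way to read this suggestion is as a cross-correlation: slide the rendered signal along the recording and keep the offset where the sum of products is largest. A minimal sketch in plain Python follows, using the direct sum rather than an FFT (an FFT-based version computes the same quantity faster). The function name and the toy signals are made up for illustration.

```python
def best_lag(reference, recording):
    """Return the lag (in samples) at which `reference` best matches `recording`.

    Direct cross-correlation: for each candidate lag, sum the products of the
    overlapping samples and keep the lag with the largest sum.
    """
    n, m = len(reference), len(recording)
    best, best_score = 0, float("-inf")
    for lag in range(m - n + 1):
        score = sum(reference[i] * recording[lag + i] for i in range(n))
        if score > best_score:
            best, best_score = lag, score
    return best


# Toy data (hypothetical): a short pattern buried in silence at offset 5.
pattern = [1.0, -1.0, 2.0, -2.0]
recording = [0.0] * 5 + pattern + [0.0] * 5
lag = best_lag(pattern, recording)
```

Note that this only recovers a single rigid offset. It assumes the rendering and the audio run at exactly the same tempo, which is the easy case.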

This is one of the variants of an area called "score following" and it's really hard. I remember about 10-15 years ago there was this craze in start-up land which led to many corpses of start-ups who thought solving semantic music problems would be easy with FFTs, but forgot to talk to an actual MIR (music information retrieval) person or musicologist. I remember a friend of mine working for one about 15 years ago and when she described their deal I was like "have you talked to a real music person? Because that is only going to get you a tiny part of the way there". Nope. And sure enough, dead in the water a couple of years later. I'm studying music cognition right now in a cross-discipline masters (music, comp sci) and it is crazy complicated.

In a nutshell: the human brain has an amazing ability to separate a garbage pail mush of audio into meaningful streams that we perceive as different event streams. This enabled us to do things like notice (as my prof says) the difference in white noise between the nearby river and the rustle in the leaves from something else, like an approaching animal. Score following requires doing this to separate the aggregate noise into different instruments and into voices within that instrument, suitable for note by note analysis. This, it turns out, is a wicked hard problem.


That makes sense when you're dealing with raw audio and need to disentangle it. But here we can work in the other direction, using the structure of the sheet to find the right spot in the mess on the other side.

I'm sure there are further difficulties, just saying that your answer isn't entirely satisfying.


Sheet music has far less structure than you think it does.

Can you explain what you mean by that? Sheet music would seem to be incredibly structured to me. Unless we’re talking about sheet music hand scribbled on paper?

Consider the same piece of "sheet music" delivered by different conductors, eg., Beethoven's 5th:

https://www.youtube.com/watch?v=1lHOYvIhLxo&t=24s https://www.youtube.com/watch?v=9aDEq3u5huA

Sheet music is a guide for human beings who understand "musical context" to, with much leeway, create a cohesive sound.

The sound created does not have a deterministic relationship to what's written, or else, why have any live music at all?

And, more importantly, even if it did deterministically produce the same "sound", that sound is layered to us by our prior familiarity with instruments (room geometry, etc.).

We aren't deconstructing it by "mere frequency", but by meaningfully parsing out & grouping frequencies. There is no naive algorithm to do this.


I'm guessing you've never taken an academic instrumentation or orchestration course. Sheet music is like HL7, a bunch of common practices that at a passing glance look like a standard but have enough exceptions and idiomatic variance to drive you to drink.

This kind of problem has been extensively studied for years and it turns out that it is not that easy. The fundamental is not the strongest harmonic, the harmonics don't line up, what happens when you play multiple notes at the same time, etc. It is a big stinking mess.

There's a game called Rocksmith that has been on the market for many years that basically does this in real time on live guitar playing with far more than enough accuracy to make the game fun.

And that is doing it basically per-note! This task (aligning sheet music with MIDI) would allow for global optimization approaches which should be far easier and more likely to work.


Electric guitar going through a clean (ish) input is a pretty narrowly scoped problem that won't necessarily generalize

MIDI music is far cleaner than analogue input from a guitar.

I'm no expert, but isn't that just autocorrelation? (https://en.wikipedia.org/wiki/Autocorrelation)

Autocorrelation is the algorithm used for Radar (at least, that's what my professors told me). When radar bounces off of another object, there are different "echoes" of that radar (ex: an object was 10 miles away and 20 miles away: the 20-mile object will take 2x longer to come back). It's messy and everything.

The radar input is very messy, full of reflections, echoes, and more. But autocorrelation takes all of that information, and tells you where the objects were.


Autocorrelation is a good start, IF there is only one note playing at a time, and then it SOMETIMES works. You also get multiple peaks in the autocorrelation and it is not obvious which one to choose as the fundamental frequency.
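The multiple-peaks problem shows up even in the simplest case. Here is a rough sketch, assuming a pure tone with a period of 20 samples (a made-up test signal, far cleaner than real audio). The helper names are illustrative only.

```python
import math


def autocorrelation(signal, max_lag):
    """Unnormalised autocorrelation r[lag] for lag = 0..max_lag."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def local_peaks(values):
    """Indices whose value is strictly larger than both neighbours."""
    return [
        i for i in range(1, len(values) - 1)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


# Toy signal (hypothetical): a pure tone with a period of 20 samples.
period = 20
tone = [math.sin(2 * math.pi * i / period) for i in range(200)]

peaks = local_peaks(autocorrelation(tone, 70))
```

With this input the peaks land at lags 20, 40 and 60, i.e. at every multiple of the period, so something still has to decide which one corresponds to the fundamental. With several notes sounding at once the picture gets much murkier.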

There is an equivalent problem in audio, which is source localization. That's much simpler than note cognition.

And that's before even separating out by instrument! It's really quite incredible what goes on between the ear and the brain for this kind of thing.

Correlative methods are the wrong model by themselves but could be one part of the right model. Cognition happens by correlation, but also by denoising, by adaptive filtering, by interference cancellation, even by predictive generation. Listening is not a passive process.

Care to elaborate?

If you already have a MIDI why would you need the sheet music?

The first example I can think of, because I run into it often, is when you use syncopation. It's made even worse if the syncopated line contains chords, and then even more-so if the note lengths within the chord are not robotically static.

Writing that on the MIDI piano roll is slobber proof. But if you take music written that way and view it as sheet music, it is nearly impossible to read because of the many ties that don't even necessarily begin or even end in the same place as the others.

It all ends up looking like layered sheets of snow sliding off your roof. A ribbon of differently-sagging tie-lines making it not only unsightly, but impossible to read.

Why sheet music if you have MIDI already? Because as someone who records, I use MIDI for writing, but I don't expect other musicians to follow my piano rolls when recording guitar, bass, vocals, etc. and I wouldn't want them ever to sound as robotic as MIDI tends to be.

Drums however - that's one area where MIDI piano roll to sheet music actually works flawlessly, even if the notes don't have the usual x| appearance.


Does your midi software properly translate drum rudiments that have specific notation symbols? That would be quite impressive actually. Drumming has a lot of technique-specific and informal stylistic things that would be hard to translate.

MIDI is a very poor substitute for sheet music, and vice versa. Think of MIDI as a performance and sheet music as the text being performed.

If you want to analyze the music or perform it, you want the sheet music. If you want a computer to perform the music, you want the MIDI.


What information isn't captured in the MIDI but is captured in the electronic sheet music?

It's more the other way around. The MIDI has extra information that obfuscates the original sheet music. Most commonly, timing variation which can be intentional and quite large, but also other kinds of variation and even mistakes.

Additionally, the system described in the article works on scanned paper sheet music, not electronic sheet music, so you also have to factor in OCR issues. All things considered it would be very difficult to manually write an algorithm to do this reliably.
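To make the timing-variation point concrete, here is a minimal sketch that snaps performed onsets to a 16th-note grid and reports what the notated version would throw away. The resolution of 480 ticks per quarter note and the onset values are assumptions chosen for illustration, not anything taken from the article.

```python
TICKS_PER_QUARTER = 480          # assumed resolution; real files declare their own
GRID = TICKS_PER_QUARTER // 4    # one 16th note = 120 ticks


def quantize(onset_ticks, grid=GRID):
    """Snap an onset to the nearest grid line (ties round up)."""
    return (onset_ticks + grid // 2) // grid * grid


# Hypothetical performed onsets, in ticks, slightly off the grid.
performed = [0, 130, 240, 355, 490]
quantized = [quantize(t) for t in performed]
deviations = [t - q for t, q in zip(performed, quantized)]
```

The `quantized` list is roughly what a score would show; `deviations` is the part that exists only in the performance data, whether you call it information or noise.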


technically, that would be a case of the MIDI data having noise in it, not extra information.

It seems unreasonable to call something "noise" if it's meant to be kept. Most traditional definitions of that word would not apply to things like syncopation and variation

"Timing variations" as mentioned in the comment I was replying to was, in context, "noise". If the performer was adding syncopation or other "planned" variation, then sure, it's information, but I don't think that is what was meant.

That is what was meant, in part, as indicated by the comment "which can be intentional". And even unintentional variation can be desired, as it makes the sound less robotic. Noise can be information too.

My reading of this was different. I felt that the commenter was noting how the timing information in MIDI is not restricted to the note types represented in sheet music, and can include swing and shuffle that could not be represented in that format.

If those timing variations were intentional, then I agree they represent information. If they are just sloppy playing, they are noise. And if the performer is just applying a stylistic form ("swing these bars"), then it's an inefficient encoding of the musical intent.


Consistent microtiming deviations or other forms of "expressive timing" are an essential element to "groove" or rhythmic "feel" in almost all modern popular styles of music. Things can get even crazier with world music styles like Samba, and let's not even get started with classical rubato.

And yet notating any of that information in every bar as, say, three-quarters to one-and-a-half hemidemisemiquavers early or late to each offbeat defeats the purpose of having a readable score.

The score is a reduction that's intentionally lacking tons of information that defines the essence of a performance. That's not a bug, but a feature.


>The score is a reduction that's intentionally lacking tons of information that defines the essence of a performance.

This is precisely part of my point, or would be if you corrected it. The score contains, either implicitly or explicitly, global performance clues that will tell the performer what to do about timing (and if it doesn't that's because it is expected to come from a conductor or other similar source). It's a highly efficient mechanism precisely because it is a global property of the score (or perhaps locally scoped to sections). Much better than providing timing information for every note (as MIDI would do). The MIDI version is an inefficient means of information transfer.

BTW, things do not get even crazier with "world music styles like samba". Samba is an extremely regular groove that is very easy to understand. This is true of most Afro-Cuban derived rhythmic structures - the complexities come from layering a set of very simple patterns. Things do get "even crazier" with rhythmic traditions from India, Bali and some parts of Africa, places where conventional western ways of describing things really don't do a good job at all.


I was remarking on your comment that timing variation is noise, as you're relegating far too much into that category due to your narrow view on what counts as musical intent. Rhythmic variations can be highly irregular and even seemingly random, yet follow an underlying logic and be stylistically essential to the performance, which means they would merit notation in some form.

And I'm sorry, but regarding samba, you have absolutely no idea what you're talking about. I assume you're referring only to the surdo's backbeat, but the essence of the samba rhythm is the sixteenth-note groove played by the pandeiro, and that sound is pretty much as far from "extremely regular" as you can get while still maintaining a consistent pulse. I found the style relevant to bring up, as it's a commonly given basic example of a groove featuring microrhythmic variation.


> Rhythmic variations can be highly irregular and even seemingly random, yet follow an underlying logic and be stylistically essential to the performance, which means they would merit notation in some form.

and yet ... in almost all the musical forms where this happens, it isn't notated.

I've played samba (surdo and tamborim parts, mostly). I have many friends who play Brasilian music in general. I think we have a problem with definitions, because the pandeiro part precisely fits my definition of "extremely regular timing". When playing samba, unlike various jazz influenced forms, you do not play ahead or behind the groove. The variation still uses a 16th note grid, albeit with lots of freedom of which parts of the grid to play or not play.


I see. I took "extremely regular" to mean conforming to an equidistant grid structure, while you seem to have been talking about the consistency in repetition that makes it a groove.

Regardless, the context of this discussion was determining whether MIDI data can hold "extra information" that the score does not, and I can't agree with most of your statements about that.

> technically, that would be a case of the MIDI data having noise in it, not extra information.

> and yet ... in almost all the musical forms where this happens, it isn't notated.

There's a term for this type of thinking, and it's called "notational centricity."

> "musicological methods tend to foreground those musical parameters which can be easily notated" such as pitch relationships or the relationship between words and music. On the other hand, historical musicology tends to "neglect or have difficulty with parameters which are not easily notated", such as tone colour or non-Western rhythms.

So any given parameter not being notated with ease or in detail doesn't prove anything about its role as an intentional, stylistic element of the music. That is, not being notated doesn't make a parameter any more likely to be "noise," because what does get notated is not a "core representation" of the musical text. In fact, the distinction between the two mostly comes down to historical coincidence or other non-musical factors.

MIDI or other more granular performance capture standards have plenty of "extra information" to offer that is not noise, and I'd even say it's mostly the case that you'll see a robust SNR there.


> Samba is an extremely regular groove that is very easy to understand. This is true of most Afro-Cuban derived rhythmic structures

Samba isn’t Afro-Cuban derived, whether in its rhythmic structures or otherwise.


Sorry, I wrote that down in a misleading way. Afro/Cuban might have been better, but what I really meant was "rhythmic structures influenced by African and/or Cuban musical culture", which samba definitely was. That's redundant in a way, because Cuban forms themselves are deeply indebted to (if not directly playing) African forms, but my understanding is that there was a particular development/expansion of things in Cuba (much as in Brasil) that is perhaps best understood as a new branch of the tree. Samba doesn't owe anything in particular to Cuban traditions, but does draw on other African percussive structures.

There's a lot of "annotations"[0] in sheet music which allow the performer to use their judgement for playing.

Sure you can extract these patterns from MIDI based on things like velocity curves, but by default raw MIDI won't tell you that.

[0] https://en.wikipedia.org/wiki/Dynamics_(music)


Tons. Nearly everything. Sheet music is very expressive. So is MIDI. It’s kind of like asking, “What information isn’t captured in the movie but is captured in the novel?” They’re just radically different ways of expressing music, and much is lost when you translate from one to the other.

If you are curious, try flipping through a book on orchestration at the library sometime, or a book on music theory. Even at its most basic, a score records which notes to play, but MIDI doesn’t even do that—MIDI records the note values only, so you can’t tell the difference between G# and Ab. Then add in all the articulations, dynamics, and arbitrary instructions that a composer can put into a score.
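The G#/Ab point is easy to see in a small sketch. This assumes the common convention that middle C (C4) is MIDI note 60 (some software labels octaves differently), and the function name is made up for illustration.

```python
# Semitone offsets of the natural note names within an octave.
NATURALS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {"": 0, "#": 1, "b": -1}


def midi_number(name, octave):
    """Map a spelled pitch such as ("G#", 4) to a MIDI note number.

    Assumes the convention that middle C (C4) is note 60.
    """
    letter, accidental = name[0], name[1:]
    return 12 * (octave + 1) + NATURALS[letter] + ACCIDENTALS[accidental]


g_sharp = midi_number("G#", 4)
a_flat = midi_number("Ab", 4)
```

Both spellings collapse to the same number, so going from the number back to the spelling requires guessing from context (key, harmony, voice leading).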


> you can’t tell the difference between G# and Ab

That's a problem with MIDI-the-standard, not MIDI-the-idea. In the 1.x standard, it's not just a simplistic definition; the "known workaround" (pitch bend) for when you actually need to express this difference in a played note, is global per instrument. But it's nothing that can't be fixed with a protocol that follows the same principles, but has a richer data model (like OSC).

> What information isn’t captured in the movie but is captured in the novel?

This, absolutely this. And it also doesn't mean we shouldn't be getting better cameras.


Depends a lot on the instrument.

For a keyboard instrument, played conventionally, MIDI does a fairly good job of capturing everything that a composer could tell a performer (and vice-versa).

Slightly less true for percussion, but still somewhat true.

Rather untrue for any instrument where technique can be (ab)used to alter timbre significantly (e.g. most reeds, most strings)

Very untrue for any instrument capable of continuous pitch generation (e.g. unfretted strings).

Obviously "arbitrary instructions" are out, but then they are also not formally part of "sheet music", but more "additional written material from the composer", which can accompany MIDI in various ways too.


> Very untrue for any instrument capable of continuous pitch generation (e.g. unfretted strings).

What are the dimensions of a note that can be varied on a violin other than pitch and loudness?


In a piano, the relevant acoustic properties controllable by the player are key down (with differing amounts of force) and key up, with the pedals effectively acting as a way to delay key up (although one pedal causes the hammers to hit two instead of three strings, which largely manifests as quieting the piano). The simple note on/off + note velocity of MIDI models this well, and it's basically the use case MIDI was designed for.
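For reference, this is roughly what that model looks like on the wire in MIDI 1.0: a status byte carrying the message type and channel, followed by a note number and a velocity. A small sketch, with helper names that are purely illustrative:

```python
def note_on(channel, note, velocity):
    """MIDI 1.0 Note On: status 0x90 | channel, then note and velocity (7 bits each)."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


def note_off(channel, note, velocity=0):
    """MIDI 1.0 Note Off: status 0x80 | channel, then note and release velocity."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])


# Middle C pressed fairly hard on the first channel, then released.
key_down = note_on(0, 60, 100)
key_up = note_off(0, 60)
```

Three bytes per event: which key, how hard, and when (the timing comes from when the message is sent or stamped). That is a good fit for a piano key and leaves no room for anything else about how the note was produced.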

By contrast, the technique involved in playing the violin family is very varied:

* Which string are you playing the note on? You get different sympathetic vibrations depending on which one you choose.

* Are you moving the bow up or down? Direction matters!

* How are you changing bow movement between notes? Keeping the same direction, or changing? Resting your bow on the string the entire time, or bouncing the bow?

* Where are you playing the bow? Down by the bridge, or up by the fingerboard?

* Are you even using the bow to play the notes? You can pluck it instead!

* Fingering too, you can make many small motions with your finger to give it a vibrating quality.

* On a related note, the pitch you play is a continuous quality: you can slide from one note to the other and hit all the notes in between. Unlike a piano (and MIDI), where notes are discrete pitches.

* You can also adjust tuning of the strings (though not on the fly), or dampen them with a mute (which can be done during a long rest in a piece).

There's probably a few more expressive techniques I've forgotten, and I've definitely forgotten all of the fancy Italian names for these techniques.

You can look at the violin sections of Saint-Saëns' Danse Macabre (https://www.youtube.com/watch?v=71fZhMXlGT4) to see how different bowing techniques can produce rather stark effects.


I'm partial to glissando.

In general, think of this: You are touching a musical instrument. You are physically touching the strings, and you can undoubtedly imagine many different ways of touching the strings that would have different effects on the sound produced. If you were to sit down and imagine different ways to play violin, you would probably come up with techniques that already have standardized names (in Italian, of course) and perhaps even standardized symbols in notation. The possibilities are AMAZING and definitely not limited to simple pitch/loudness (I mean, obviously!)

If you want to see a reference, go to your local library and find a book like The Study of Orchestration (https://www.amazon.com/Study-Orchestration-Fourth-Samuel-Adl...) and flip to the section on strings. I own this book, but I don’t know where it is at the moment, so… off the top of my head, here are other dimensions besides pitch and loudness:

- Notes can be connected in different ways. Legato, détaché, martelé, staccato, spiccato, sautillé, jeté/ricochet, tremolo, pizzicato, louré, marcato

- Bowing direction: up/down (they sound different)

- Adjust bow position: sul ponticello, sul tasto

- Natural and artificial harmonics

- Use of mutes

- “Extended techniques”—altering the tuning, col legno, etc.

These all affect the sound. Note that the difference between staccato and legato is NOT accounted for solely by the length of the notes as in a MIDI file. You also might be surprised how many of these have really boring, everyday notational conventions. As in, a violinist would look at sheet music and say, “obviously, technique X is used for this note”, but that would not be encoded in the MIDI file at all.

All of the above techniques are explained and demonstrated in YouTube videos if you are interested.


Not a violin, but on mandolin I can greatly change the tone (which is somehow the overtones) by changing where and how the pick hits the strings. A good violin player can do a lot with the bow.

Not relevant to the line quoted above, but ... you can apply more or less bow pressure, resulting in different timbres. You can pick/pluck the strings rather than bowing them at all.

Dynamics. MIDI has volume, but that's not all there is.

Legato vs detache. Articulation.

All sorts of esoteric annotations, e.g. "Fire the cannon!", or "Release the penguins."


Sheet music always has room for human interpretation.

Sheet music is human readable, MIDI is for computers.

MIDI was designed around the piano, which is a very low-dimensional instrument. Due to how it works (hammers that hit the string with a given physical velocity), the only things that affect how a note sounds are: when you play it, what key you play, how hard you hit it, and the exact position of the damper pedal (and other pedals if used)%. Pianos have a mechanical system that abstracts your playing and confines it to these dimensions. These are the exact dimensions that MIDI covers. And this is why these days we have extremely accurate piano samples as well as physical modeling synths that sound amazing with MIDI input.

However, on other instruments, you have a much more direct relationship with the sound. With a violin or guitar, you physically touch the strings - directly or with some device. That already means you have something like 6 degrees of freedom in physical space, not counting fretting. Plus if you use a finger directly, your playing won't sound the same as someone else's with different fingers. The whole action of playing a note is a complex process in time, not a single event, so you can't capture it as one set of parameters. Then there's things like tapping on the body of the instrument, not the strings. You'd basically need a model of the contact surface that touches the string, its physical parameters, and its precise motion in time at high resolution, to accurately capture all possible playing techniques that people actually use.

What we've done with sheet music is standardize many of these techniques into instructions that players can perform. It's obviously only a much lower information version, but it's there.

Of course, if your question is "can we encode everything in sheet music with MIDI" the answer is obviously yes, but it's not standardized. A given virtual instrument could hypothetically implement anything you could write a score for (perhaps with an obscene number of samples or a really clever modeling synth) and be controlled by MIDI, but only a tiny subset of these options have vaguely standardized mappings to MIDI. MIDI is a very limited standard with tons of room for expansion, and only the most basic MIDI features interoperate between manufacturers. You can make MIDI music sound amazing with specific setups (and this is effectively how a lot of modern music is produced), but you can't make General MIDI music (the standardized subset) sound amazing.

The other issue is that sheet music is at a higher abstraction level than MIDI. It leaves more to the interpretation of the performer, but it's designed to be easily interpretable. For the portions of this that MIDI does express at a lower level, it's very hard to infer the higher level representation.

% This is excluding prepared pianos and techniques like reaching in and touching the strings, which are rare; of course technically you can do crazy things to a piano, but in practice we don't.


Think of it as the difference between a computer algebra system and LaTeX. Sure, they're rudimentarily both about mathematics. But the former is about building software that can do interesting computations with algebraic expressions. The latter is about building graphical representations expressly for humans.

That's the difference between sheet music and MIDI. MIDI is extremely powerful, but it's not really about setting out instructions, or rather guidelines, for human performers who will ultimately use their judgment on how to interpret and perform it -- often in large groups. Compared to that, MIDI is not just way more low level, but simply for a different purpose.


See this thread for an example:

https://news.ycombinator.com/item?id=24193405


Perhaps they used MIDI only for reference during development. In an actual application they wouldn't need the MIDI. Just my guess.

If anyone bothered to click the link, it's not MIDI but MIDI-generated audio (title should be changed). They used MIDI audio because likely it simplified the problem along a number of dimensions: rhythm and timbre are fixed, and there is no noise. It also means it's cheating and it will not work on acoustic performances, but it's a good start. I mean if you can't solve the problem for MIDI audio you will never solve it for anything else.

Handing a MIDI printout to a musician who expects sheet music would be like handing a novel to a movie director who expects a script.

You may have composed something you need to provide to classical musicians to perform.

What I don't get is why the emphasis on sheet music as an image?

I get that mapping from an auditory stream onto a score is an interesting problem, but isn't it naturally decoupled from the visual processing problem of reading and segmenting images of sheet music?

Wouldn't it make more sense to deal primarily with a digital intermediate encoding of the score?

What am I missing?


I'm going through various books by Arnold Schoenberg. He adds lots of tiny examples that I don't have the skill to play nor the ear to "listen" to in my head.

HN: Is there a tool/service/app that allows me to point at a piece of sheet music and it just plays it? I've tried a couple apps but they don't work well.



Intuitively (to me), it feels like a clever variation of something like Needleman-Wunsch could pull this off without using machine learning (as long as you have the right encodings). Can someone enlighten me?

The sheet music is input as an image, which is not the right encoding for comparison with MIDI, so you need some kind of image recognition in the pipeline. I do think that they could replace the recurrent part of their network by dynamic programming and only use machine learning for feature extraction.
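For what the dynamic-programming part could look like once both sides are already symbolic, here is a minimal Needleman-Wunsch sketch over note numbers. This is not the method from the article: it assumes the image has already been turned into a note sequence, and the scoring values and toy sequences are arbitrary choices for illustration.

```python
def align(score_notes, played_notes, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two note sequences.

    Returns a list of (score_note, played_note) pairs; None marks a gap.
    """
    n, m = len(score_notes), len(played_notes)
    # table[i][j] = best score aligning the first i score notes
    # with the first j played notes.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if score_notes[i - 1] == played_notes[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + step,  # pair the two notes
                table[i - 1][j] + gap,       # score note with no played counterpart
                table[i][j - 1] + gap,       # extra played note
            )
    # Trace back from the bottom-right corner to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if score_notes[i - 1] == played_notes[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + step:
                pairs.append((score_notes[i - 1], played_notes[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            pairs.append((score_notes[i - 1], None))
            i -= 1
        else:
            pairs.append((None, played_notes[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Toy data (hypothetical): MIDI note numbers, one note missing from the performance.
score_notes = [60, 62, 64, 65, 67]
played_notes = [60, 62, 65, 67]
pairs = align(score_notes, played_notes)
```

Here the dropped note ends up paired with a gap and everything else lines up. The hard part in practice is getting from pixels and audio to sequences this clean in the first place.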

Sheet music is a sequence of images. Maybe you thought sheet music was more like a text file?

I should have been more clear. This is what I meant by "correct encodings". I would be surprised (but am open to being wrong) if a large amount of sheet music hadn't been encoded somehow.

Sheet music isn't mathematical, ie., there isn't a numerical "encoding". It has a fair amount of natural semantics (eg., "play like a fantasy").

There is only an embedding or re-presentation in some less informative form. ie., you have to choose some partial measurement system and apply this to the images of scores to produce a numeric representation.

This is, in part, why a midi-file is always going to sound worse than an orchestra operating from sheet music. The musicians understand the intentions of the composer.


“Digitized” just means PDF, at best.


