I wonder if software exists that is aware of the overtones of each particular note of a given instrument. Theoretically, this would be a more precise way to pull out a chord from a recording without having to manually compensate for them later. Standard Fourier analysis is based on sine waves, which aren't the best model for complex tones like those of a piano or a 12-string guitar.
Mathematically, the best approach I'm aware of in the long run would be to study the harmonic spectrum of the exact instrument as closely as possible, and then take inner products against the spectrum of the mystery sound being analyzed. There's no guarantee that the note-vectors would be orthogonal (probably not), so the problem becomes more interesting in that you might want to maximize the sparsity of the resulting output - that is, look for the smallest number of notes capable of matching the recording within epsilon error in sample (or spectrum) space. Exact sparsity (L0) minimization is NP-hard, so a good way to approximate it is to perform L1-minimization using linear programming instead. Another approach might be to look for a balance between the absolute error term and the sparsity term.
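As a toy sketch of that last idea (everything here is synthetic: the "instrument" is an idealized harmonic dictionary with 1/k overtone rolloff snapped to a 10 Hz grid, which is my assumption, not a measured spectrum), L1-minimization over a non-negative note vector reduces to a plain linear program:

```python
import numpy as np
from scipy.optimize import linprog

# Coarse frequency grid and a one-octave chromatic scale from A2 (110 Hz).
grid_hz = 10.0
n_bins = 200                                   # bins up to 2 kHz
fundamentals = 110.0 * 2.0 ** (np.arange(12) / 12.0)

def note_spectrum(f0, n_harmonics=8):
    """Idealized magnitude spectrum of one note: overtones at k*f0, amplitude 1/k."""
    spec = np.zeros(n_bins)
    for k in range(1, n_harmonics + 1):
        idx = int(round(k * f0 / grid_hz))
        if idx < n_bins:
            spec[idx] += 1.0 / k
    return spec

# Dictionary: one column per candidate note. Columns are NOT orthogonal --
# harmonics of different notes share bins.
A = np.column_stack([note_spectrum(f0) for f0 in fundamentals])

# "Mystery" recording: note 0 at full volume plus note 7 at half volume.
x_true = np.zeros(12)
x_true[0], x_true[7] = 1.0, 0.5
b = A @ x_true

# With x >= 0, ||x||_1 == sum(x), so L1-minimization is the LP:
#   minimize sum(x)  subject to  A x = b,  x >= 0
res = linprog(c=np.ones(12), A_eq=A, b_eq=b, bounds=(0, None))
print(np.round(res.x, 3))        # recovers the two notes and their amplitudes
```

In practice you'd relax the equality constraint to an epsilon tolerance, or go straight to the error-plus-sparsity tradeoff mentioned above (which is essentially lasso / basis pursuit denoising).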
The harmonic spectrum of each instrument will be extremely difficult to get, though, as each note can have a different waveform and will be affected by unknown non-linear effects both in the instrument itself and in the mixing (e.g. compression or distortion).
If you know the make of the guitar, the strings, where each note was picked, the position of each tone and volume knob, and the setting of the pickup switches... then maybe.
>unknown non-linear effects...distortion

Tube amps do some really funky mixing, which can trip up FFTs, giving weird/wrong results.
I want to take a piece of Salsa music and extract the pattern each of the instruments makes: figure out whether the clave is 2-3 or 3-2, which patterns the bongo plays, and where the '1' (the phrase start) is. And - of course - I want this all to happen auto-magically.
It sounds like there must be a solution for it, but I did not find a ready-made answer. And my own music skills are below useful.
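Not the auto-magic part, but as a tiny sketch of the final step: once you have onsets quantized to a sixteenth-note grid and aligned to the '1' (both big assumptions, and the genuinely hard part of the problem), telling 2-3 from 3-2 son clave is just template matching:

```python
import numpy as np

# Son clave on a 16-step grid of sixteenth notes (one bar per 16 steps).
# 3-2 direction: three strokes in the first half of the bar, two in the second.
CLAVE_32 = np.zeros(16)
CLAVE_32[[0, 3, 6, 10, 12]] = 1.0
CLAVE_23 = np.roll(CLAVE_32, 8)          # 2-3 is the same figure, halves swapped

def classify_clave(onset_grid):
    """onset_grid: binary array of length 16 with detected clave onsets,
    already aligned so index 0 is the '1'. Returns '3-2' or '2-3'."""
    s32 = float(np.dot(onset_grid, CLAVE_32))
    s23 = float(np.dot(onset_grid, CLAVE_23))
    return "3-2" if s32 >= s23 else "2-3"

# A noisy 2-3 bar: the five clave strokes plus one stray percussion hit.
bar = np.roll(CLAVE_32, 8).copy()
bar[5] = 1.0
print(classify_clave(bar))   # -> 2-3
```

Note that without knowing where the '1' is, 2-3 and 3-2 are rotations of each other and indistinguishable, so in practice you'd anchor the downbeat first (e.g. from the bass tumbao).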
I say "seemingly compelling" because much of this is over my head!
"The track was from a CD and therefore suffers from compression issues. That is, some of the frequencies may be missing due to the method employed to fit everything on the CD"
Programmers dicking around with audio and getting it all wrong is the new programmers dicking around with graphic design/UX and getting it all wrong.
FFTs really don't work for this stuff. There's not enough frequency resolution at the low end. Even higher up, the most common error is an octave error, which can be caused by even-harmonic distortion. Also, the brain is great at filling in a missing fundamental - mixing engineers use this to clear the low-frequency mud out of a mix.
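A quick illustration of the octave-error point, on a synthetic signal: a 110 Hz tone whose second harmonic has been boosted above the fundamental, which is the kind of spectrum even-harmonic distortion produces. Naively taking the loudest FFT bin lands an octave high:

```python
import numpy as np

fs = 44100
t = np.arange(fs) / fs          # one second of signal

# A 110 Hz "note" whose 2nd harmonic is louder than the fundamental.
x = 0.4 * np.sin(2 * np.pi * 110 * t) + 1.0 * np.sin(2 * np.pi * 220 * t)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
peak = freqs[np.argmax(spectrum)]
print(peak)                     # ~220 Hz -- an octave above the true pitch
```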
Tinker with AutoTune for even a few minutes and you'll quickly learn that the name itself is a lie. It doesn't Auto-matically Tune anything. It generates a crude, crappy attempt at tracking pitch (monophonic pitch only), which a trained musician then has to wiggle against with a mouse in a tiny window for hours and hours. Get it wrong and it sounds like a robot. Most people get it wrong. "Wrong" has become a sound. Now people get it wrong on purpose.
Melodyne is more interesting but also trips over itself.
This is a hard, hard problem. Even if you can find the loudest frequency and map it to a note, it doesn't mean it's the note in question.
When you find the loudest frequency and map it to a note, it is most definitely the note in question (in the mathematical sense). But cultural and stylistic expectations of the human brain might not agree with the math.
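For what it's worth, the "map it to a note" step is the easy, purely mathematical part. A minimal sketch, assuming equal temperament with A4 = 440 Hz:

```python
import math

def freq_to_note(f):
    """Map a frequency to the nearest equal-tempered note name (A4 = 440 Hz)."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    midi = round(69 + 12 * math.log2(f / 440.0))   # 69 is MIDI A4
    return names[midi % 12] + str(midi // 12 - 1)

print(freq_to_note(440.0))    # A4
print(freq_to_note(261.63))   # C4
print(freq_to_note(82.41))    # E2 (low E on a guitar)
```

Everything debated above - octave errors, missing fundamentals, cultural expectations - happens before this function ever gets called.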
That's why I think, somewhat surprisingly, audio engineering will remain a human career for as long as music remains a human art. You can't say that about many automatable professions.
Even if the instrument and signal path cooperate in capturing this intent (despite the inharmonicity of the string itself, the nonlinearities of the microphones and amplifiers, and the resonances of the instrument soundboards and of the rooms themselves), you're already chasing ghosts!
Then add in a mixing engineer boosting and reducing frequencies, a tape machine varying in speed ever so slightly, and a mastering compression step throwing away amplitude detail -- and you've really got to wonder what the hell someone thought they were doing by running FFTs on a 50 year old analog mix and pretending they've got insight into the original solo instruments.
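To put a number on "the inharmonicity of the string itself": the partials of a stiff string sit sharp of the ideal integer multiples, roughly f_n = n * f0 * sqrt(1 + B * n^2). The B value below is purely an illustrative assumption, though it's in the ballpark commonly quoted for piano strings:

```python
import math

f0 = 220.0     # nominal fundamental
B = 4e-4       # inharmonicity coefficient (assumed, illustrative)

for n in range(1, 9):
    ideal = n * f0
    actual = n * f0 * math.sqrt(1 + B * n * n)
    cents = 1200 * math.log2(actual / ideal)
    print(f"partial {n}: {actual:7.1f} Hz ({cents:+.1f} cents sharp)")
```

By the 8th partial the string is already over 20 cents sharp of the "mathematical" overtone, so a transcription tool built on exact integer harmonics is fighting the physics from the start.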
Instead of trying to reconstruct the sound directly, you could use the data to get hints about which instruments were involved (by looking at their characteristic spectral distributions). From there you could then try to "fill" the observed frequency pattern with the instruments involved, at the notes that seem most plausible.
Still much easier said than done, but that's what comes to mind.
If anyone's ever interested in doing this kind of analysis, I can't recommend this tool enough; it's one of the easiest to use, yet very versatile and helpful when dealing with these kinds of problems. It is also free.
SPEAR: http://www.klingbeil.com/spear/
Take a concrete example. Let's assume that I have a recording in PCM at 44.1kHz, and I want to see all of the frequencies present, to a resolution of 10Hz. Nyquist tells us that we only have 22.05kHz of signal present in the sample (well, one can hope; if the recording didn't filter out everything above that when coding to PCM, we're going to have a mess of aliasing going on, which is outside the scope of this post - just assume the person doing the encoding knew what they were doing, mkay...). So, 22.05kHz. And we want a resolution of 10Hz, so we're going to need 22.05kHz / 10Hz = 2205 frequency bins.
Of course, we actually need double that in the time domain (an FFT of N real samples yields only N/2 independent frequency bins; the upper half just mirrors the lower), so we need 4410 samples in the time domain to get our 2205 bins in the frequency domain. 4410 samples at 44.1kHz sampling is 0.1 seconds.
If I decide that I want better frequency resolution, say down to 1Hz, you quickly arrive at the conclusion that you need a full second of samples. In other words, your resolution in the time domain decreases as you improve your resolution in the frequency domain.
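The arithmetic above, as a quick numpy check: with a 4410-sample (0.1 s) window the bin spacing is exactly 10 Hz, so two low tones 10 Hz apart land in separate, adjacent bins:

```python
import numpy as np

fs, N = 44100, 4410             # 0.1 s of audio -> 44100/4410 = 10 Hz per bin
t = np.arange(N) / fs

# Two low tones exactly 10 Hz apart, right at the resolution limit.
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 110 * t)

spectrum = np.abs(np.fft.rfft(x))
print(fs / N)                     # 10.0 Hz bin spacing
print(np.argsort(spectrum)[-2:])  # the two biggest bins: 10 and 11
```

Shrink the window and the two tones merge into one smeared peak, which is exactly the low-end resolution problem complained about upthread.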
You can improve this situation a bit by using a sliding window with a carefully chosen windowing function. The frequency information will still be smeared out in time, but the window tapers the samples toward the edges of each frame, which suppresses spectral leakage and sharpens up the smear a bit.
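A sketch of what the window function buys you: a tone that falls between bins (here 105 Hz, my pick, with 10 Hz bin spacing) leaks energy all over a rectangular-windowed spectrum, while a Hann window confines it near the tone:

```python
import numpy as np

fs, N = 44100, 4410
t = np.arange(N) / fs

# 105 Hz falls halfway between the 100 Hz and 110 Hz bins -- worst case.
x = np.sin(2 * np.pi * 105 * t)

rect = np.abs(np.fft.rfft(x))               # rectangular (no) window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))

# Worst-case leakage far from the tone (bins from ~500 Hz upward):
print(rect[50:].max(), hann[50:].max())     # Hann is orders of magnitude lower
```

The Hann window widens the main peak slightly in exchange for sidelobes that die off much faster, which is the usual tradeoff when choosing a window.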