
Pitch Detection with Convolutional Networks - zeroxfe
https://0xfe.blogspot.com/2020/02/pitch-detection-with-convolutional.html
======
oever
Just the other day I discovered what a big research field Music Information
Retrieval is.

Here is the video archive of a recent conference on the topic.

[https://ismir2019.ewi.tudelft.nl/](https://ismir2019.ewi.tudelft.nl/)

There are some FOSS applications, e.g.
[https://www.sonicvisualiser.org/](https://www.sonicvisualiser.org/), but I'm
surprised at how bad the analysis results are. Intuitively, it seems like such
a simple problem.

~~~
Cactus2018
On the theme of visualizing audio, Audacity has a built-in Spectrogram.

[https://manual.audacityteam.org/man/audio_track_dropdown_men...](https://manual.audacityteam.org/man/audio_track_dropdown_menu.html#spgram)

[https://en.wikipedia.org/wiki/Spectrogram](https://en.wikipedia.org/wiki/Spectrogram)

------
unlinked_dll
Using synthesized audio from MIDI seems like an atrocious way to train a
neural net for pitch detection. It's also not particularly difficult to detect
pitch in such sounds; you should at least mess with them a bit to remove the
fundamental, add inharmonic tones, noise, vibrato, etc. Just counting zero
crossings is enough for most of that data.
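For what it's worth, the zero-crossing baseline really is a few lines; a sketch in numpy (the function name and the 440 Hz test tone are my own, not from the thread):

```python
import numpy as np

def zero_crossing_pitch(signal, sample_rate):
    """Estimate pitch from the zero-crossing rate (two crossings per period).
    Only reliable for clean, nearly sinusoidal tones."""
    crossings = np.count_nonzero(np.diff(np.sign(signal)))
    return crossings / 2 / (len(signal) / sample_rate)

sr = 44100
t = np.arange(sr) / sr                   # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)
print(zero_crossing_pitch(tone, sr))     # ~440 Hz
```

It falls apart as soon as overtones or noise add extra crossings, which is the commenter's point about needing harder training data.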

Side note: cepstral processing is going to be a lot more effective than
spectrograms, and preprocessing is cheaper than training the ANN.

~~~
zeroxfe
> Using synthesized audio from midi seems like an atrocious way to train a
> neural net for pitch detection.

Not sure why you think so. Almost everything you suggested (missing
fundamentals, noise, vibrato, reverb, velocity, distortion, etc.) can be
synthesized with tools like sox. It worked very well for me. :-)
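A rough numpy sketch of a few such augmentations, purely as illustration (sox does the equivalent with effects like highpass, reverb, and overdrive; all names and parameters below are made up). Each variant keeps the same 220 Hz pitch label:

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
f0 = 220.0  # the training label stays 220 Hz for every variant below

# A toy "note": five harmonics with geometrically decaying amplitudes.
partials = [0.5 ** i * np.sin(2 * np.pi * f0 * (i + 1) * t) for i in range(5)]
note = sum(partials)

rng = np.random.default_rng(0)
noisy = note + 0.02 * rng.standard_normal(note.size)   # background noise
missing_fund = sum(partials[1:])                       # fundamental stripped
lfo = 0.003 * np.sin(2 * np.pi * 5.0 * t)              # 5 Hz pitch wobble
vibrato = sum(0.5 ** i * np.sin(2 * np.pi * f0 * (i + 1) * (t + lfo))
              for i in range(5))
```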

> cepstral processing is going to be a lot more effective than spectrograms,
> and preprocessing is cheaper than training the ANN.

I did do this in my initial attempts, and found no improvement over
spectrograms. Turns out NNs can learn log nonlinearities quite easily. (EDIT:
to be precise, I calculated the mel-cepstrum and fed it to the network.)

~~~
matheist
> _to be precise, I calculated the mel-cepstrum and fed it to the network_

The mel scale cepstrum is inappropriate for pitch detection. You want to use
the cepstrum _without_ scaling frequencies. Take the fourier transform,
normalize, take logarithms[+], then (inverse) fourier transform in the
frequency domain.

The advantage of using the cepstrum for pitch detection is that most signals
you're looking at will be harmonic — equally spaced overtones — and so when
you take a fourier transform in the frequency domain you'll get a peak
corresponding to that equal spacing, which will provide you with the
fundamental frequency. (Even if it's missing!)

Using the mel scale totally wrecks that periodicity and throws away the pitch
information. (Which is part of why it's used for speech-to-text! In those
cases you _want_ to throw away pitch information. Unless you're processing a
tonal language, in which case probably don't use the mel scale.)

[+] Why logarithms? In the frequency domain, most harmonic sounds look like
the _product_ of a high frequency periodic "signal" (the fundamental and its
overtones) with a slowly varying signal (frequencies which are emphasized or
de-emphasized, like e.g. formants in the case of speech). Taking the logarithm
splits that into the sum of a high frequency signal (overtones) and a low
frequency signal (formants), and since the fourier transform is linear,
that'll show up as a single peak in the cepstrum corresponding to the gap in
overtones (ie the fundamental) and some stuff in the low-frequency bins
corresponding to the formants.
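A minimal numpy sketch of that recipe -- window, log magnitude, inverse transform, peak quefrency. The pitch search range and the test tone are arbitrary choices; keeping the range narrow also sidesteps the octave ("rahmonic") errors a real detector would need to handle:

```python
import numpy as np

def cepstral_pitch(signal, sr, fmin=150.0, fmax=450.0):
    """f0 via the cepstrum: FFT -> log magnitude -> inverse FFT,
    then pick the peak quefrency in the plausible pitch range."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    log_spec = np.log(spectrum + 1e-12)           # products -> sums (see above)
    cepstrum = np.fft.irfft(log_spec)
    qmin, qmax = int(sr / fmax), int(sr / fmin)   # quefrency = lag in samples
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return sr / peak

# Harmonics of 200 Hz with the fundamental itself absent:
sr = 16000
t = np.arange(4096) / sr
tone = sum(np.sin(2 * np.pi * f * t) for f in (400, 600, 800, 1000))
print(cepstral_pitch(tone, sr))   # close to 200 Hz despite the missing fundamental
```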

~~~
zeroxfe
Thanks, that's very helpful, and it explains why the mel-cepstrum didn't work
so well. So far I'm getting the best results with FFT + log scale,
particularly on audio with the fundamentals stripped.

------
dentalperson
I think this is a nice writeup, but I would argue with the claim [The error
is] "Pretty much exactly the resolution of the FFT we used. It's very hard to
do better given the inputs."

The 19 Hz error being explained by the FFT resolution only makes sense for a
classification-based loss/error that used the (FFT_bins / 2) as classes.

Since the proposed network is using regression, even though you have a
frequency resolution of 19 Hz, you should be able to estimate pitch with finer
resolution if you are using any popular non-rectangular window, because it can
be fit to match the shape of the main lobe. You would only expect such a large
error at very low frequencies, where there isn't much to interpolate on
because the next harmonic would overlap.

For an example, see figure 4 in PARSHL (one of the original sinusoidal
analysis frameworks, where the frequency of each harmonic is estimated by
fitting a parabola):
[https://ccrma.stanford.edu/~jos/parshl/parshl.pdf](https://ccrma.stanford.edu/~jos/parshl/parshl.pdf)

A neural network should be able to do much better than parabolic fitting.
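The parabolic refinement itself is tiny; a numpy sketch (the test tone and frame size are arbitrary):

```python
import numpy as np

def parabolic_peak_freq(signal, sr):
    """Refine the FFT peak by fitting a parabola to the log magnitude
    of the peak bin and its two neighbours."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    k = int(np.argmax(mag))
    a, b, c = np.log(mag[k - 1:k + 2])
    offset = 0.5 * (a - c) / (a - 2 * b + c)   # parabola vertex, in bins
    return (k + offset) * sr / len(signal)

sr, n = 44100, 2048                 # ~21.5 Hz bin spacing
t = np.arange(n) / sr
est = parabolic_peak_freq(np.sin(2 * np.pi * 440.7 * t), sr)
print(est)                          # within a fraction of a Hz of 440.7
```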

~~~
zeroxfe
Thanks, yes, this is totally correct -- the FFT uses a Tukey window, which
should be able to match the main lobe. I got some other feedback that
perceptual training tasks have a very long tail, so it's possible that the
network will learn better if I run it for a few hours (I only ran it for about
10 minutes.)

I'll give it a shot (and edit the post.)

------
oever
Did you consider using Constant Q Transform instead of STFT?

mpv and ffmpeg come with CQT visualization:

mpv --lavfi-complex="[aid1]asplit[ao][a]; [a]showcqt[vo]" "$@"

You can even get it from microphone with some piping:

parec --latency-msec=1 | sox -V --buffer 32 -t raw -b 16 -e signed -c 2 -r 44100 - -r 44.1k -b 16 -e signed -c 2 -t wav - | ffplay -fflags nobuffer -f lavfi 'amovie=pipe\\\:0,asplit=2[out1][a],[a]showcqt[out0]'

~~~
zeroxfe
Have not tried it -- def worth investigating. Thanks.

~~~
oever
The Constant Q Transform uses bins that are spaced on a log scale, like
musical notes, so the bins correspond better to how humans perceive pitch.
You won't waste hundreds of bins on the high frequencies.

Calculating the CQT can be roughly as fast as an FFT.

[http://academics.wellesley.edu/Physics/brown/pubs/effalgV92P...](http://academics.wellesley.edu/Physics/brown/pubs/effalgV92P2698-P2701.pdf)
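The geometric spacing is easy to see from the definition; a small numpy illustration (the starting note and bin count are arbitrary):

```python
import numpy as np

bins_per_octave = 12                     # one bin per semitone
f_min = 55.0                             # A1, an arbitrary starting note
k = np.arange(4 * bins_per_octave)       # four octaves of bins
freqs = f_min * 2.0 ** (k / bins_per_octave)

# Constant ratio of centre frequency to bandwidth -- the "Q" in CQT:
Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)   # ~16.8 at 12 bins/octave
print(freqs[:3], Q)
```

Compare with the FFT's linearly spaced bins, where the top octave alone consumes half of them.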

And here are some real musical samples you can use instead of the artificial
MIDI notes:

[http://virtualplaying.com/virtual-playing-orchestra/](http://virtualplaying.com/virtual-playing-orchestra/)

------
BookPage
Nice post! I'm working on tempo detection with deep networks atm and found the
synthesis section very helpful. I'm wondering if you read many of the recent
publications on pitch detection + deep learning to guide your model building.
At least for tempo detection there's an abundance of material there to use as
starting points which can help bootstrap the network build.

~~~
zeroxfe
I didn't look very hard for deep learning approaches to pitch detection,
mainly because I was really interested in chord recognition (which is a far
more interesting problem, IMO.) I didn't find any good research here (would
love pointers.)

I did, though, spend a lot of time studying non-DL approaches to pitch
detection, mainly because I wanted better real-time performance for my game
Pitchy Ninja ([https://pitchy.ninja](https://pitchy.ninja)).

~~~
BookPage
Yeah, chord recognition is a much meatier problem for sure. Anything where the
pitch signal gets blended with others is pretty cool. I haven't dived deep
into pitch work yet, but this[1] is a fairly solid recent review paper that I
followed with good results for tempo stuff.

[1] [https://arxiv.org/abs/1905.00078](https://arxiv.org/abs/1905.00078)

------
jmwilson
Resolution in the frequency domain can be significantly improved over the
natural resolution of the DFT (19 Hz in your case). If the fundamental
frequency exactly matches one of the DFT bin frequencies, it would mean the
frame size is an exact multiple of the period, so the DFT of successive frames
would look the same, and there would be no phase shift at the fundamental
component. (Or if overlapping frames were analyzed, there would be an expected
phase shift in proportion to the amount of overlap.) In the more likely case
that the fundamental doesn't match a bin frequency, there's a phase shift
between successive frames that's proportional to the difference between the
fundamental and the associated bin frequency. This can be extracted to get a
more accurate estimate of the actual fundamental. I wrote an STFT-based
chromatic tuner as a hobby project using this technique, and it would easily
resolve better than 0.1 Hz changes using 10(ish) ms frame sizes.
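A numpy sketch of that phase-difference trick (the frame/hop sizes and test tone are arbitrary, not from the tuner described above):

```python
import numpy as np

def phase_refined_freq(signal, sr, n_fft=1024, hop=256):
    """Refine the strongest bin's frequency from the phase advance
    between two overlapping frames."""
    w = np.hanning(n_fft)
    X0 = np.fft.rfft(signal[:n_fft] * w)
    X1 = np.fft.rfft(signal[hop:hop + n_fft] * w)
    k = int(np.argmax(np.abs(X0)))
    expected = 2 * np.pi * k * hop / n_fft        # phase advance at the bin freq
    dphi = np.angle(X1[k]) - np.angle(X0[k]) - expected
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    return (k + dphi * n_fft / (2 * np.pi * hop)) * sr / n_fft

sr = 44100
t = np.arange(2048) / sr
est = phase_refined_freq(np.sin(2 * np.pi * 440.3 * t), sr)
print(est)   # resolves 440.3 Hz, far below the ~43 Hz bin spacing
```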

------
knzhou
It would be interesting to see how this performs, compared to simpler methods,
on hard cases like a missing fundamental. (If you play a sound with power at
200 Hz, 300 Hz, 400 Hz, ..., i.e. all the multiples of 100 Hz but not 100 Hz
itself, humans always perceive a pitch of 100 Hz.)
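For contrast with the naive baseline: the strongest FFT bin lands on an overtone rather than the perceived pitch. A quick numpy illustration (amplitudes chosen arbitrarily so the 200 Hz partial is clearly strongest):

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr                    # one second -> 1 Hz bins
amps = (1.0, 0.9, 0.8, 0.7)
tone = sum(a * np.sin(2 * np.pi * f * t)
           for a, f in zip(amps, (200, 300, 400, 500)))

mag = np.abs(np.fft.rfft(tone))
naive = np.argmax(mag) * sr / len(tone)   # strongest component
print(naive)                              # 200.0, not the perceived 100 Hz
```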

~~~
zeroxfe
I have training data which includes examples with missing fundamentals
(synthetically removed), and the network does learn to recognize this. (The
heavy regularization also helped a lot.)

Although I was able to test with real instruments (and my grotesque voice), I
didn't find any good live examples of audio with missing fundamentals to test
with. It did recognize held-out synthesized data correctly though.

------
amylene
Could this be used on human voices to detect tones like anger, openness, etc?
If so, it might be monetizable as a sales value add.

~~~
zeroxfe
That's definitely interesting, never thought about it. Prob hard to find
"angry voice" training data though :-)

~~~
keenmaster
You can scrape a database of movie scripts, tie each word in the script to a
moment in the movie (accurate to the second), and extract recordings that are
supposed to demonstrate an emotion. You can even use the modifiers that are in
a script, such as “ _very_ angrily” to train on various degrees of each
emotion. If emotions are not inscribed into the script, you can use textual
affect detection tools. There’s got to be some way to do this without having
mTurks label every second of a recording.

------
rkagerer
What kind of latency was achieved on the detection side? Could you use this
for real-time applications?

~~~
zeroxfe
About 10ms/sample (including pre-processing inputs, and generating
spectrograms) on a current-gen GPU. It's a bit too high for real-time
detection (compared to well-known pitch estimation approaches.)

------
anonytrary
I have no idea why we need to do this when we can just compute the strongest
frequency components with a Fourier transform. Can anyone explain the
advantages of this seemingly expensive method over simple (and more complex)
Fourier analysis?

~~~
zeroxfe
See the section "On Pitch Estimation" which addresses exactly that:

> Pitch detection (also called fundamental frequency estimation) is not an
> exact science. What your brain perceives as pitch is a function of lots of
> different variables, from the physical materials that generate the sounds to
> your body's physiological structure.
>
> One would presume that you can simply transform a signal to its frequency
> domain representation, and look at the peak frequencies. This would work for
> a sine wave, but as soon as you introduce any kind of timbre (e.g., when you
> sing, or play a note on a guitar), the spectrum is flooded with overtones and
> harmonic partials.

All this said, you don't need deep learning for decent pitch detection -- it's
a solved problem, and there are lots of well-known algorithms for it. Deep
learning is useful for more advanced music info retrieval such as interval and
chord recognition, which was one of my goals with this experiment.

~~~
AstralStorm
You can use deep networks to get better resolution there, though; the keyword
is spectral reassignment (normal methods use maximum likelihood).

What you will get is a variant spectrogram. The method is related to
edge-directed interpolation and the bilateral transform. (You can do the same
with the log-cepstrum.)

I find this whole linked exercise an undergrad-level toy, even in such a basic
thing as pitch.

The real problem is polyphony and instrument segmentation which this does not
begin to touch.

