
Machine Learning for Drummers - psobot
http://blog.petersobot.com/machine-learning-for-drummers
======
TeMPOraL
Great! Refreshing to see a ML post using some well-understood methods instead
of throwing a random neural net from Kaggle at the problem...

Tangential:

> _Is a given audio file a sample of a kick drum, snare drum, hi-hat, other
> percussion, or something else? (...) Humans have no trouble classifying
> these two sounds, as we’ve likely heard them tens of thousands of times
> before._

Are people taught that in schools or something? Because I personally can't
classify those sounds, don't know these names, and I'm not sure how I was
supposed to learn them, other than by playing in a band.

~~~
ludston
> Are people taught that in schools or something? Because I personally can't
> classify those sounds, don't know these names, and I'm not sure how I was
> supposed to learn them, other than by playing in a band.

This is something that is taught at schools with a music program (although
not necessarily as a discrete subject).

If you are someone who has played music before, it is easy to forget what
music sounded like before your ear was trained (i.e. certain instruments and
harmonies can be indistinguishable without training).

Is it common to have never played on a drum kit in your entire life?

~~~
TeMPOraL
Thanks!

> _Is it common to have never played on a drum kit in your entire life?_

I didn't, not on a real one at least. I know the sounds though; I spent an
ungodly amount of time playing on an electronic keyboard as a kid, where I
could (and often would) change the sounds under keys to drums. However,
nowhere (AFAIR) were the names of those sounds mentioned, and I'm not sure
where I could encounter them.

~~~
jbenner-radham
In my anecdotal experience, I only know the names of various percussion
instruments by sound because I was a drummer in band class at school. My
friends weren't taught that in the general education program, though that
could vary by region.

------
zneveu
Had an idea to do this a couple months ago, but haven't got around to
implementing it yet. I'm curious: did you consider using standard image
processing techniques with spectrograms as an alternative to decision trees? I
know that's how iZotope does their Neutron instrument detection, but I'm not
sure how it would compare performance-wise. Also, have you tried classifying
percussive sounds that aren't actual drums? I'd love to see how it categorizes
various stuff.

~~~
psobot
Hey! In order:

- I did consider using image processing techniques as opposed to decision
trees, but the point here was not to come up with the most advanced and
accurate classifier possible, but rather to build something simple and
explainable to folks without an ML (or even a CA) background.

- I haven't tried this extensively on non-drum-like percussion, but that'd be
a great follow-up post.

~~~
blt
I totally understand your decision, but I bet a 1D deep convolutional network
would do really well at this given a larger dataset. You can also do a lot of
data augmentation by speed changes, filters, adding reverb, etc.
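
To make the augmentation idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the function names, the parameter ranges, and the use of a single feedback delay line as a crude stand-in for real reverb. Samples are assumed to be a list of floats.

```python
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 plays faster (and higher)."""
    if not samples:
        return []
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def low_pass(samples, alpha):
    """One-pole low-pass filter; smaller alpha (0 < alpha <= 1) sounds darker."""
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def add_echo(samples, delay, decay):
    """Crude stand-in for reverb: one feedback delay line (delay in samples)."""
    out = list(samples) + [0.0] * delay
    for i in range(delay, len(out)):
        out[i] += decay * out[i - delay]
    return out


def augment(samples, rng):
    """Return one randomly perturbed copy of a sample (ranges are arbitrary)."""
    out = change_speed(samples, rng.uniform(0.9, 1.1))
    out = low_pass(out, rng.uniform(0.5, 1.0))
    return add_echo(out, delay=rng.randint(20, 200), decay=rng.uniform(0.1, 0.4))


# Toy input: decaying noise, loosely drum-like. Not real data.
rng = random.Random(0)
original = [rng.uniform(-1, 1) * (0.99 ** t) for t in range(1000)]
variants = [augment(original, rng) for _ in range(5)]
```

Each variant keeps the label of the sample it came from, so one labelled recording turns into several training examples.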

------
bagrow
Surprised there's no discussion of FFT, power spectra, etc. Would like to see
someone with an electrical engineering/signal processing background work on
this problem.

~~~
ssalazar
Stock FFT is a really high-dimensional feature vector given the number of
training examples used here, and most of the resolution of the FFT would be
unneeded anyway. "Average loudness in several frequency ranges" captures
spectral information at a granularity much more appropriate to the data and
classification task. For analyzing drum samples you don't need a lot of
frequency resolution, although other low-dimensional spectral features like
MFCCs or flux would probably be useful.
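
As a rough illustration of what "average loudness in several frequency ranges" could look like, here is a stdlib-only sketch. It is not the post's code: the function name, the band edges, and the use of average spectral power as the "loudness" measure are all assumptions, and the naive DFT is only there to keep the example dependency-free.

```python
import cmath
import math


def band_loudness(samples, sample_rate, band_edges_hz):
    """Average spectral power in each [low, high) frequency band."""
    n = len(samples)
    # Naive O(n^2) DFT over the non-negative frequencies; fine for a short
    # sketch, but real code would use an FFT library instead.
    power = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2 / n)
    bin_hz = sample_rate / n
    features = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        bins = [p for k, p in enumerate(power) if low <= k * bin_hz < high]
        features.append(sum(bins) / len(bins) if bins else 0.0)
    return features


# A 100 Hz sine should put nearly all of its energy in the lowest band.
rate = 1000
tone = [math.sin(2 * math.pi * 100 * t / rate) for t in range(200)]
features = band_loudness(tone, rate, [0, 150, 300, 500])
```

Three bands give a three-number feature vector, versus 101 raw FFT bins for the same 200-sample clip.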

------
flashman
Could I use something like this to identify which of two or three people is
speaking in an audio clip? Assume I can label several samples of each person's
speech, then present an unlabeled sample for classification.

~~~
RileyJames
I’m looking for something that can do this as well. Anything out already?

~~~
flashman
I had a go at it by replacing the drum samples with voice samples (both 1-2
seconds and 3-5 seconds), then removing the features concerned with length and
volume. Fiddled with the number of sub-sections per sample, and some of the
random forest settings, but never consistently got higher than 77% accuracy
between the four speakers. Maybe it would do better with two speakers.
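
For anyone who wants to poke at the same idea, here is a toy sketch of length- and volume-independent features, computed per sub-section. It is not what I actually ran: the feature choices (relative RMS and zero-crossing rate) are assumptions, the nearest-centroid classifier is a deliberately simple stand-in for the random forest, and the "voices" are just sine tones.

```python
import math


def section_features(samples, n_sections):
    """Per-section relative energy and zero-crossing rate.

    Energies are divided by the whole clip's RMS, and the clip is always cut
    into n_sections pieces, so overall volume and length drop out.
    """
    total_rms = math.sqrt(sum(x * x for x in samples) / len(samples)) or 1.0
    size = len(samples) // n_sections
    features = []
    for s in range(n_sections):
        chunk = samples[s * size:(s + 1) * size]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
        features += [rms / total_rms, crossings / len(chunk)]
    return features


def nearest_centroid(train, query):
    """train maps label -> list of feature vectors; returns the closest label."""
    def centroid(vectors):
        return [sum(col) / len(col) for col in zip(*vectors)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(train, key=lambda label: dist(centroid(train[label]), query))


def tone(freq, n, gain):
    """Synthetic stand-in for a voice clip, sampled at 8000 Hz."""
    return [gain * math.sin(2 * math.pi * freq * t / 8000) for t in range(n)]


train = {
    "low_voice": [section_features(tone(110, 4000, 1.0), 4)],
    "high_voice": [section_features(tone(220, 4000, 1.0), 4)],
}
# A quieter, longer clip at the low pitch should still match "low_voice".
guess = nearest_centroid(train, section_features(tone(110, 6000, 0.3), 4))
```

Real speech is much messier than two sine tones, so treat this only as a way to see which knobs (number of sub-sections, which features to drop) are being turned.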

