
Finding the genre of a song with Deep Learning - Despoisj
https://medium.com/@juliendespois/finding-the-genre-of-a-song-with-deep-learning-da8f59a61194#.r28dkpg2e
======
iverjo
To the author: Have you tried to use a logarithmic frequency scale in the
spectrogram? [1] That representation is closer to the way humans perceive
sound, and gives you finer resolution in the lower frequencies. [2] If you
want to make your representation even closer to human perception, take a
look at Google's CARFAC research. [3] Basically, they model the ear. I've
prepared a Python utility for converting sound to Neural Activity Pattern
(resembles a spectrogram when you plot it) here:
[https://github.com/iver56/carfac/tree/master/util](https://github.com/iver56/carfac/tree/master/util)

[1] [https://sourceforge.net/p/sox/feature-requests/176/](https://sourceforge.net/p/sox/feature-requests/176/)

[2] [https://en.wikipedia.org/wiki/Mel_scale](https://en.wikipedia.org/wiki/Mel_scale)

[3] [http://research.google.com/pubs/pub37215.html](http://research.google.com/pubs/pub37215.html)
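
For anyone who wants to try it, a minimal sketch of a mel-scaled spectrogram with librosa; the file name and parameters are placeholders, not anything from the article:

```python
# Minimal sketch: mel-scaled spectrogram with librosa (hypothetical file/params).
import numpy as np
import librosa

y, sr = librosa.load("song.mp3", sr=22050, mono=True)  # placeholder file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=441)
S_db = librosa.power_to_db(S, ref=np.max)  # log compression, closer to perception
print(S_db.shape)  # (128 mel bands, n_frames)
```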

~~~
Terribledactyl
I don't think this problem is bound by absolute frequency resolution: the
tightest distance between two adjacent notes on a typical piano is ~2 Hz, and
if you assume a doubling between octaves you're at <90 notes. The temporal
changes and relative chord progressions probably give more info.

~~~
Despoisj
Thanks for your insights! I agree that log/mel spectrograms could be even more
detailed and effective, and could be used with the SoX patch discussed here:
[https://sourceforge.net/p/sox/feature-requests/176/](https://sourceforge.net/p/sox/feature-requests/176/).

------
nkurz
Wow, I find it incredible that this works. As I understand it, the approach is
to do a Fourier transform on a couple seconds of the song to create a 128x128
pixel spectrogram. Each horizontal pixel represents a 20 ms slice in time, and
each vertical pixel represents 1/128 of the frequency domain.

Then, treating these spectrograms as images, you train a neural net to
classify them using pre-labelled samples, take samples from the unknown songs,
and let it classify them. I find it incredible that 2.5 seconds of sound
represented as a tiny picture captures enough information for reliable
classification, but apparently it does!
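
For concreteness, here's a rough sketch of how such a 128x128 patch might be computed (my own reconstruction with librosa, not necessarily the author's exact pipeline; file name and parameters are placeholders):

```python
# Rough reconstruction: one 128x128 spectrogram patch,
# 20 ms per column, 128 frequency bins per row.
import numpy as np
import librosa

y, sr = librosa.load("song.mp3", sr=22050, mono=True)   # placeholder file
hop = int(0.020 * sr)                                   # 20 ms per horizontal pixel
# n_fft=254 gives 254 // 2 + 1 = 128 frequency bins; note the 254-sample
# window is shorter than the 441-sample hop, so this sketch skips some audio.
S = np.abs(librosa.stft(y, n_fft=254, hop_length=hop))
patch = librosa.amplitude_to_db(S[:, :128], ref=np.max)  # ~2.56 s "image"
print(patch.shape)  # (128, 128)
```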

~~~
iammyIP
One reason might be that the mentioned genres are highly formulaic to begin
with. The standard rap song contains about 2 bars of unique music stretched
out over 3 minutes with slight variations. Same with dubstep and techno. All
highly repetitive. Classical music has no drums, so you can detect that. Metal
has guitar distortion all over the spectrum. So with these examples the
spectral images should have enough distinctive features that can be learned.
Why should it be different from 'normal' pictures? Also, it looks like they
take four 128x128 guesses per song.

~~~
dagw
If they can write some code that can classify metal into one of its 72
sub-genres then I'll be truly impressed :)

Although I wonder what that would do to the metal scene if their main topic of
discussion and contention got completely solved.

~~~
Despoisj
Haha, that would be awesome! I guess we'd need a lot of data, and probably a
much more detailed spectrogram (time-wise and frequency-wise).

------
chestervonwinch
1. I wonder how the continuous wavelet transform would compare to the
windowed Fourier transform used here. See [1] for a Python implementation,
for example.

2. The size of the frequency-analysis blocks seems arbitrary. I wonder if
there is a "natural" block size based on a song's tempo, say 1 bar. This would
of course require a priori tempo knowledge or a run-time estimate.

[1]: [https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.signal.cwt.html](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.signal.cwt.html)
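
A quick sketch of point 1 using the scipy function cited in [1]; the signal here is synthetic and the widths are illustrative:

```python
# Continuous wavelet transform of a synthetic signal via scipy.signal.cwt.
import numpy as np
from scipy import signal

fs = 22050
t = np.linspace(0, 1, fs, endpoint=False)
x = np.sin(2 * np.pi * 440 * t)       # stand-in for 1 s of audio

widths = np.arange(1, 129)            # 128 scales -> a 128-row "scalogram"
scalogram = signal.cwt(x, signal.ricker, widths)
print(scalogram.shape)                # (128, 22050)
```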

~~~
Despoisj
The slice size is indeed quite arbitrary, and knowing the BPM would help, but
tempo detection isn't fully reliable either (varying tempos, rubato in
classical music, etc.)
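
For what it's worth, a run-time tempo estimate of the kind mentioned above could be sketched with librosa; the file name is a placeholder, and the result is only as good as the beat tracker:

```python
# Hedged sketch: estimate tempo, then derive a 1-bar block size from it.
import librosa

y, sr = librosa.load("song.mp3", sr=22050)         # placeholder file
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
bar_seconds = 4 * 60.0 / float(tempo)              # one 4/4 bar, if the estimate holds
```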

------
maxerickson
See also [http://everynoise.com/](http://everynoise.com/) which is a view into
how Spotify classifies music.

The creator wrote about it here:

[http://blog.echonest.com/post/52385283599/how-we-understand-music-genres](http://blog.echonest.com/post/52385283599/how-we-understand-music-genres)

and writes a lot about it on their blog:

[http://www.furia.com/page.cgi?terms=noise&type=search](http://www.furia.com/page.cgi?terms=noise&type=search)

Of course, those go in the other direction, not generating the classification
from the data, but it's probably one of the best data sets for classifying
existing music.

~~~
Despoisj
Thanks for the awesome resources! :D

------
jschmitz28
Unless I'm misunderstanding the validation set, I'm skeptical of the ability
of this classifier to tag unlabeled tracks, given that it is only being
trained and tested on tracks which are already known to belong to one of the
few trained genres. I'd be curious to see the performance if you were to
additionally test on tracks that belong to none of the trained genres
(Hardcore, Dubstep, Electro, Classical, Soundtrack, and Rap), with the correct
prediction being no tag.
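
One simple baseline for the "no tag" case would be to reject low-confidence predictions; a minimal sketch, with a hypothetical model output and an illustrative threshold:

```python
import numpy as np

GENRES = ["Hardcore", "Dubstep", "Electro", "Classical", "Soundtrack", "Rap"]
THRESHOLD = 0.5  # illustrative; would need tuning on held-out data

def tag_or_reject(probs):
    """probs: softmax output over the 6 genres for one slice (hypothetical model)."""
    i = int(np.argmax(probs))
    return GENRES[i] if probs[i] >= THRESHOLD else None  # None = "no tag"
```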

~~~
Despoisj
It's true that the validation set only contains the genres I used for
training. I'll try this out this evening ;)

------
iverjo
Nice approach, and well explained! By the way, Niland is a startup that also
does music labeling with the help of deep learning.

Demo available here: [http://demo.niland.io/](http://demo.niland.io/)

For example, it can output Drum Machine: 87%, House: 88%, Female Voice: 55%,
Groovy: 93%

~~~
Despoisj
Thanks for the kind words, I'll take a look!

------
GFK_of_xmaspast
See also Bob Sturm's work on genre classification:
[http://link.springer.com/article/10.1007/s10844-013-0250-y](http://link.springer.com/article/10.1007/s10844-013-0250-y)

~~~
Despoisj
Wow, that's an in-depth analysis of the task! Thanks for sharing.

------
tunesmith
That's pretty cool; I'd like to use something like this to tell me what genre
my own songs are. It's annoying to write a song, upload it to some service or
another, and have no idea what genre to pick. :-) My stuff is somewhere in the
jazz-influenced singer-songwriter American piano-pop realm, which is a
combination that works for me, but it generally feels like I'm selling the
song short if I have to pick only one.

~~~
Despoisj
Yeah, that's a problem I know - I used to make some Electro/Dubstep/Trap music
- and I feel people will always disagree with the genre you pick anyway.

------
return0
Good luck convincing musicians that "THAT's your genre"

~~~
Despoisj
Haha, I should have titled the article "How to build an internet
music-genre-troll-bot"

------
dkarapetyan
Hmm, convolution is a perfectly good operation to run on waveforms as well. In
fact, the Wikipedia article
([https://en.wikipedia.org/wiki/Convolution](https://en.wikipedia.org/wiki/Convolution))
shows the operation on functions, which would correspond to time-domain
waveforms. What is the point of converting everything to pictures and then
using 2D convolutions when that step could have been skipped entirely?

Converting to pictures is unnecessary. It makes the processing harder. The
pooling should just happen on segments of the waveform instead of on the
Fourier-transform (frequency-domain) picture spectrograms.
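
To make the alternative concrete, here's a minimal sketch of a 1D convolutional net on the raw waveform; the layer sizes and the Keras framing are my own illustration, not anything from the article:

```python
# Hypothetical 1D-convolution baseline on raw waveform slices (illustrative sizes).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(55125, 1)),        # ~2.5 s of mono audio at 22.05 kHz
    layers.Conv1D(32, kernel_size=64, strides=8, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=32, strides=4, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(6, activation="softmax"), # one output per genre in the thread
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```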

~~~
highd
The idea is that the vertical axis of the spectrogram is basically already a
hierarchical set of features (in scale/frequency). Convolutions on that are
then a lot like how DenseNets combine hierarchical features.

I agree it seems a little janky, but the features are pretty good, and a lot
of network architectures / training techniques are most practiced in an
image-processing context.

~~~
Despoisj
Thanks for your input! It's true that we can run convolutions on the raw
waveform; however, the main reason I used a spectrogram was to work on
precomputed, relevant features, as highd pointed out, instead of running the
convolution on lots of raw data.

------
jtmarmon
I'm not super familiar with deep learning, so forgive me if I'm missing some
nuance, but what's the purpose of writing/reading to/from images? It seems
like it would add a ton of processing time. Couldn't the CNN just read from a
50-item array of tuples representing the data from the 20 ms slice?

~~~
Despoisj
I'm not sure what you mean, but I chose to store slices on disk so that I
could still take a look at them, rather than keeping the data only in numpy
arrays. That could be optimized for better processing time!
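
For what it's worth, a small sketch contrasting the two storage options discussed above; the file names and the stand-in slice are placeholders:

```python
# Hedged comparison: inspectable PNG on disk vs. fast exact NumPy arrays.
import numpy as np
from PIL import Image

patch = np.random.rand(128, 128).astype(np.float32)  # stand-in spectrogram slice

# Option 1: PNG -- easy to open and look at, but needs decode + rescale on load.
Image.fromarray((patch * 255).astype(np.uint8)).save("slice.png")
loaded_png = np.asarray(Image.open("slice.png"), dtype=np.float32) / 255.0

# Option 2: raw NumPy -- faster to load, exact values, not directly viewable.
np.save("slice.npy", patch)
loaded_npy = np.load("slice.npy")
```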

