
Audio Fingerprinting with Python and Numpy (2013) - sillysaurus3
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
======
loser777
I used dejavu to fingerprint the entire ~2007 snapshot of the modarchive (repo
of tracker-based music) which was about 120,000 individual songs in order to
identify a song I'd first heard back in ~2006. There was definitely something
special about finally identifying the song after almost a decade.

[https://linux.ucla.edu/~eqy/molasses.html](https://linux.ucla.edu/~eqy/molasses.html)

The biggest surprise for me was that the entire process took only a few days
on a single laptop (Sandy Bridge, no SSD).

~~~
bane
It's almost insane how much artistic output the demoscene has produced, and yet it has remained almost entirely underground.

That's almost a year of round-the-clock music, playing 24/7, for free and completely unencumbered by any sort of licensing or playback restrictions. Most modern media players (VLC, for example) will also play the formats back without trouble.

Awesome write-up; I'm looking forward to reading it all.

------
drewbanin
I used dejavu for a college project with some classmates. We made an app
called JamJar which stitches concert videos together based on audio
fingerprints. It worked super well! We've neglected the app for a few months
now, but you can still check it out at
[http://projectjamjar.com](http://projectjamjar.com)

Huge thanks to Deja-Vu! Awesome library

~~~
muzakthings
Hey! Creator here. That's so cool to see! I made Dejavu as a fun side project
in grad school, and it's super fun to see all the cool stuff people make with
it.

I get a couple emails a week about it, but probably the weirdest/coolest was a
guy in Spain who used Dejavu to make two rap-dueling robots who spoke in
Basque.

~~~
peeb
Hey, I wanted to thank you personally, too. As far as I know, yours is the
only completely open source implementation of the “shazam algorithm”, and it
helped me a lot during my thesis work. The blog post was also great.

Actually, I encourage anyone interested to look up Panako[1], which is meant
to be a framework for comparison of different audio fingerprinting techniques.

[1]: [http://panako.be/releases/Panako-latest/readme.html](http://panako.be/releases/Panako-latest/readme.html)

~~~
muzakthings
Oh very cool! It looks like this runs in realtime as well?

------
Cyph0n
Excellent article. The explanation of the theory behind audio fingerprinting was simple, to the point, and, most importantly, very clear. Thanks for submitting this.

~~~
bogomipz
I am curious why streaming services don't incorporate audio fingerprinting a la Shazam into their offerings. Does Shazam hold some patent that would prevent this?

~~~
foobarrio
Yes. Shazam has a bunch of patents and has sued people before:

[https://en.wikipedia.org/wiki/Shazam_(service)#Patent_infrin...](https://en.wikipedia.org/wiki/Shazam_\(service\)#Patent_infringement_lawsuit)

The core algorithm is well known and fun to implement:

[https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf](https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf)
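For the curious, the core of the paper's approach really is compact enough to sketch. A hypothetical, heavily simplified Python version (all function names are made up for illustration): pick the strongest spectral peaks per frame, then hash pairs of nearby peaks into (f1, f2, Δt) triples, as the paper describes:

```python
import hashlib
import numpy as np

def spectrogram_peaks(signal, frame_size=1024, hop=512, peaks_per_frame=3):
    """Return (frame_index, freq_bin) pairs for the strongest bins per frame."""
    peaks = []
    for i, start in enumerate(range(0, len(signal) - frame_size, hop)):
        frame = signal[start:start + frame_size] * np.hanning(frame_size)
        mag = np.abs(np.fft.rfft(frame))
        for f in np.argsort(mag)[-peaks_per_frame:]:
            peaks.append((i, int(f)))
    return peaks

def fingerprints(peaks, fan_out=5):
    """Hash pairs of nearby peaks into (hash, anchor_time) fingerprints."""
    hashes = []
    for j, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[j + 1:j + 1 + fan_out]:
            dt = t2 - t1
            h = hashlib.sha1(f"{f1}|{f2}|{dt}".encode()).hexdigest()[:20]
            hashes.append((h, t1))
    return hashes
```

Matching is then just a lookup of each hash in a database, followed by a check that the (database time − query time) offsets of the hits cluster around a single value.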

One blog posted a simple implementation of it and was contacted by Shazam's
lawyers:

[http://royvanrijn.com/blog/2010/07/patent-infringement/](http://royvanrijn.com/blog/2010/07/patent-infringement/)

~~~
bogomipz
Thanks for the links. I know that in the US patenting an algorithm is legal, but in Europe patenting software is generally not allowed. If that's the case, I wonder why streaming services based in Europe, such as Spotify or SoundCloud, don't offer this, considering the algorithms are well known.

~~~
ptzz
I think the claim that it is impossible to patent algorithms in Europe is a bit overstated. In my experience, you just title it "Method and apparatus that implements <whatever_algorithm>" and it's granted.

Pretty sure Shazam/Philips/Dolby patents cover Europe.

------
glup
Interesting how easy this is, while spoken word recognition is so hard. I
think this is because 1) human speakers vary more in terms of the signal than
our various song-playing devices and 2) spoken word recognition depends
strongly on context, whereas song identity doesn't. Said another way, a lot of
the challenges in language processing are not signal processing problems.

------
soruly
I tried dejavu a few months ago, but dropped my project due to its large
database size. Now I'm trying pyAudioAnalysis:
[https://github.com/tyiannak/pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis)

------
kmm
I'm surprised the fingerprinting is so reliably repeatable that you can use an exact hash like SHA-1. I would have guessed that noise, or especially filtering, could shift the peaks around by a few hertz. Why isn't this a problem?

~~~
ptzz
I guess one explanation for the robustness is that noise is typically additive, and other types of distortion (loudness compression, EQ, reverb, etc.) are usually linear filters, i.e., they only modify the amplitude (and phase) of already existing frequencies. If the underlying peaks are strong, this normally does not change the peak locations.

~~~
muzakthings
Exactly. Typically you won't see radio/streaming services messing around with reverb, but different stereo systems certainly have their own EQ (changes phase and amplitude), and often the format/bitrate (again: phase, amplitude) or loudness (amplitude) will differ.

Most people don't realize this, but artists/labels will usually have a differently mastered version of a song on each platform. Spotify, for instance, has its own normalization algorithm that it puts tracks through to even out the listening experience in terms of loudness (RMS). Of course, artists and their mastering engineers want some control over how that sounds, so they will change it.

------
xvilka
I wish projects like MusicBrainz[1] would be more popular, along with tools
like Picard[2]. They're using AcoustID[3] audio fingerprinting
service/library.

[1] [https://musicbrainz.org/](https://musicbrainz.org/)

[2] [https://picard.musicbrainz.org/](https://picard.musicbrainz.org/)

[3] [https://acoustid.org/](https://acoustid.org/)

~~~
soruly
Some audio fingerprinting databases aim at identifying whole songs, usually for tagging purposes. These are a bit different from music identification apps like TrackID, which only need a few seconds of sample.

------
dylan604
In the past, I've been in discussions on using a fingerprinting technique for
videos. Provide a small clip of a movie, and out pops the title of the movie.
This was always intended to be used on a small and well defined library, and
never intended to be used on something like youtube.

One major problem with video is that you could have SD/HD versions and/or full-frame/original-aspect-ratio variants of the same movie. One idea I wanted to play with was detecting edit points: the number of frames between edits could be used as the fingerprint. The entire concept was never anything more than a thought exercise. For the purposes of the exercise, we had to assume that there was no audio with the picture.

There are a lot of FFT libraries for processing image data, since most image compression techniques use some sort of FFT. Could this same type of fingerprinting be used on a visual image? Could the amplitude of the RGB frequencies be used over time? The data set would grow with three channels of color, but wouldn't that also help decrease false positives by making the combinations more unique?
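The edit-point idea is easy to prototype: a scene cut shows up as a spike in frame-to-frame difference, and the run lengths between cuts form a signature that survives resolution changes. A toy sketch on synthetic "frames" (function and variable names are hypothetical, not from any real system):

```python
import numpy as np

def cut_lengths(frames, threshold=0.3):
    """Return the number of frames between detected cuts.

    A cut is flagged when the mean absolute difference between
    consecutive frames exceeds `threshold`. Because this is a
    per-pixel mean, it behaves the same at SD and HD resolutions.
    """
    diffs = [np.mean(np.abs(b - a)) for a, b in zip(frames, frames[1:])]
    cuts = [i + 1 for i, d in enumerate(diffs) if d > threshold]
    edges = [0] + cuts + [len(frames)]
    return [b - a for a, b in zip(edges, edges[1:])]

# Synthetic clip: three "shots" of 5, 8, and 3 near-constant frames.
rng = np.random.default_rng(1)
frames = (
    [0.2 + 0.01 * rng.standard_normal((48, 64)) for _ in range(5)]
    + [0.8 + 0.01 * rng.standard_normal((48, 64)) for _ in range(8)]
    + [0.4 + 0.01 * rng.standard_normal((48, 64)) for _ in range(3)]
)
```

The resulting list of shot lengths ([5, 8, 3] for this clip) could then be matched against a database of shot-length sequences, much as the audio hashes are.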

~~~
jwatte
SD vs HD is largely a solved problem in academia -- the key phrase is
"resolution independence." Also, some robustness to cropping, scaling, and
flipping is desired in robust fingerprinting systems.

------
rcarmo
I once used a modified version of Echoprint to fingerprint a few million
tracks from a music service we were working on. Most fun bit was maxing out 80
cores and a LAN segment using a mix of Celery workers to fetch tracks and feed
them to a C++ fingerprinter and store the data in Postgres.

The EP fingerprints were a lot smaller, though. IIRC it used a mix of beat and
tonal detection.

~~~
muzakthings
Echoprint is wonderful for fingerprinting speed (C++), but the fingerprint
size is actually smaller in Dejavu (binary(10) field in SQL for each
fingerprint).

The other interesting differences to note are that Echoprint doesn't use a constellation fingerprinting approach with offsets, and its fingerprinting is meant to be identical across all platforms and use cases so the fingerprints can be compared.

As a direct result, you also can't recover the offset in seconds that your query audio corresponds to, like you can with Dejavu.

When I coded up this project, I wanted something more customizable, letting you decide the speed, the number of fingerprints, the size of the fingerprints, etc., to match your own false positive / memory / CPU requirements.

When you do, you sacrifice interoperability between Dejavu installations, but you gain that application-specific performance. Which library is better of course depends on your use case.
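For anyone wondering how that offset recovery works in principle, it can be sketched in a few lines (a hypothetical helper, not Dejavu's actual code): each matched hash contributes a (database time − query time) difference, and the true alignment is the difference that the most hashes agree on.

```python
from collections import Counter

def best_offset(matches):
    """matches: list of (db_time, query_time) pairs for hash hits.

    True matches all agree on one time difference; spurious hash
    collisions scatter across random differences. The mode wins.
    """
    diffs = Counter(db_t - q_t for db_t, q_t in matches)
    offset, votes = diffs.most_common(1)[0]
    return offset, votes

# A query clip starting 30 frames into the track: four hits agree
# on offset 30, plus two random collisions elsewhere.
hits = [(30, 0), (41, 11), (57, 27), (63, 33), (12, 5), (90, 2)]
```

Here `best_offset(hits)` picks offset 30 with 4 votes; the vote count doubles as a confidence score for the match.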

~~~
rcarmo
We modified EP to take multiple fingerprints to achieve the same result with
offsets (down to 10 seconds, I think), and built a web UI prototype for
matching audio from a desktop browser.

It didn't end up becoming a telco service solely due to commercial agreements,
but it was a lot of fun and almost embarrassingly accurate with ABBA songs
(since we ended up trying a lot of variations on the first entries in the
catalogue).

------
nojvek
This is awesome. I'm going to use this to fingerprint a huge library of tracks
in a local language that I have.

------
ttrbls
Would it be possible to create some software that will prevent apps like
Shazam from recognizing the song?

~~~
muzakthings
Not really. It's not like a neural network, which approximates a complicated function over the feature space and leaves brittle points you can exploit to get false positives.

You could, for instance, insert other tracks into the audio additively to try
and confuse the fingerprint retrieval logic into suspecting a different track,
but since this and many other fingerprinting techniques depend on the actual
frequency of the audio emitted, there's no shortcut to obscuring the actual
track.

