
Dejavu: Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
======
muzakthings
Hey! Creator here. Awesome to see this get posted and people excited about the
project.

I made a cool writeup about it here: [http://willdrevo.com/fingerprinting-and-
audio-recognition-wi...](http://willdrevo.com/fingerprinting-and-audio-
recognition-with-python.html)

It's a great little library for doing audio recognition, stream radio
advertisement verification, and all sorts of interesting things people email
me about all the time that I never would have thought of.

It's certainly not as speedy as Echoprint, which is both written in C++ and
doesn't use an FFT for the locality sensitive hashing, but is quite user
friendly. The benefit of doing constellation or time delta based LSH methods
like in Dejavu is that you can actually recover the time at which you matched.
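That offset recovery works by voting on time differences between matched hashes. A minimal sketch of the idea (the data and variable names here are illustrative, not Dejavu's actual API):

```python
from collections import Counter

# Hypothetical matched hashes as (time in database song, time in query
# sample) pairs, measured in fingerprint frames.
matches = [(100, 3), (153, 56), (200, 103), (250, 153), (999, 7)]

# A true match produces many pairs with the same time difference, so the
# mode of the differences is the alignment: where in the song the
# recorded sample starts. Spurious collisions scatter randomly.
diffs = Counter(db_t - query_t for db_t, query_t in matches)
offset, votes = diffs.most_common(1)[0]
print(offset, votes)  # 97 4
```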

If you love it, feel free to dig in and contribute!

~~~
ibisum
Great work! Thanks for open sourcing this - it's very educational.

At the moment I'm using it to process a few hundred gigs of song files that
I've collected as a big furry hairball of a mess over the years - having
multiple iPods and MP3 players, and not doing very good house-keeping in the
move from one to the other (while avoiding things like iTunes where
possible), has meant that I have a lot of files that may contain duplicate
songs - but the filenames and organization don't necessarily reflect that
fact.

So I'm using dejavu right now to clean this up .. I'm assuming you'd be happy
to have a "find_duplicates.py" script added - if so, I'll let you know as soon
as I have one working .. ;)

Thanks again!

~~~
muzakthings
Glad to see it's working well for you!

I'd be curious as well to see how the performance holds up getting into the
terabytes as I haven't tested that. Remember too that there are a lot of
parameters for the matching algorithm here
([https://github.com/worldveil/dejavu/blob/master/dejavu/finge...](https://github.com/worldveil/dejavu/blob/master/dejavu/fingerprint.py))
which allow you to trade off accuracy, speed, and storage in different ways.
I've tried to document it thoroughly.

Finding duplicates is a great one! Actually generating a checksum for each
audio file (minus the header and ID3 tags) and adding this as a column in the
songs table for all the different filetypes Dejavu supports (mp3, wav, etc)
would probably be the best way to do this.

I say this because so many songs today are built on sampling. Mashups and EDM
tracks often sample from other work, and as such, the fingerprints _and_
their alignment can be shared across different songs. Something more clever,
like computing the percentage of a song's hashes that are shared and
comparing against a threshold, might do the trick, though.
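As a sketch of that checksum idea, assuming MP3 files with simple ID3v2/ID3v1 tags (real-world files can carry other metadata blocks this doesn't handle):

```python
import hashlib

def audio_checksum(path):
    # Hash the audio bytes only, skipping a leading ID3v2 tag and a
    # trailing ID3v1 tag, so retagged copies of the same recording
    # produce the same checksum.
    with open(path, "rb") as f:
        data = f.read()
    if data[:3] == b"ID3":
        # ID3v2 size is four 7-bit ("synchsafe") bytes following the
        # 10-byte header.
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        data = data[10 + size:]
    if data[-128:-125] == b"TAG":  # ID3v1 is a fixed 128-byte trailer
        data = data[:-128]
    return hashlib.sha1(data).hexdigest()
```

Duplicate detection is then just grouping the songs table by this column.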

Happy hacking, and feel free to send in a PR! :)

------
rogerbinns
I worked on this too a few years ago. There is a massive patent minefield out
there, including this algorithm.

We were doing recognition of TV shows where you are holding a cell phone in
your hand some distance from the TV. A prototype with this algorithm was okay,
but very easily confused especially as the number of items in the database
increases. You also end up with a lot less amplitude at that distance from
the TV, which makes the source messier. In theory phase shouldn't have had an
effect, but in practice it did, so things had to be run multiple times at
different offsets to improve matching.

Our final algorithm was way better. It was based on what audio codecs do. It
even worked reliably with a 60 dB signal in the presence of a 70 dB
interferer!

------
keypusher
Surprised nobody has mentioned MusicBrainz, the free and open source music
fingerprinting database which powers taggers like Picard, Jaikoz, and Beets.
They have been doing audio fingerprinting for years; you can download the DB
or access it via a web API. The author's solution may work quite well with a
small number of entries to match against, but I suspect the match rate goes
down significantly when lookup is against hundreds of thousands or millions
of other fingerprints.

[https://wiki.musicbrainz.org/Fingerprinting](https://wiki.musicbrainz.org/Fingerprinting)

~~~
lukaslalinsky
Audio fingerprinting as used by MusicBrainz is a slightly different concept.
Because it doesn't need to match short phone-recorded samples, we can use
more efficient algorithms for both the fingerprinting and the matching.
It's usually not the match rate that goes down when dealing with a large
database, but the false match rate that goes up. And of course performance.
Those were my two main things to worry about when I was working on AcoustID
(the current fingerprinting technology used by MusicBrainz).

------
discardorama
Fingerprinting is fine; but the actual value would come from a large database
of all sorts of fingerprints, so it could be used to identify songs, snippets,
movies, etc.

~~~
muzakthings
That's certainly useful, and what Echoprint and MusicBrainz have tried to do.

Unfortunately, many fingerprinting use cases require hashing at different
granularities (i.e., FFT window sizes), or need different collision
guarantees to trade off space vs. accuracy, and so on.

A perfect example is throwing away part of the SHA-1 hash of a fingerprint.
You lose some entropy, but you become more space efficient.

Thus in many cases, while the core algorithm might be the same, the parameters
and constraints of the individual use case often mean that the fingerprints
themselves aren't universal in size or format.
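For instance, a sketch of that truncation trade-off (the landmark string here is illustrative, not Dejavu's internal format):

```python
import hashlib

# A (peak frequency pair, time delta) landmark, serialized as a string.
landmark = "4072|3981|13"

full = hashlib.sha1(landmark.encode()).hexdigest()  # 40 hex chars = 160 bits
stored = full[:20]  # keep only 80 bits: half the index size, at the cost
                    # of a slightly higher collision rate
print(len(full), len(stored))  # 40 20
```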

------
morenoh149
This is amazing! Truly great explanation in the related blog post here:
[http://willdrevo.com/fingerprinting-and-audio-recognition-
wi...](http://willdrevo.com/fingerprinting-and-audio-recognition-with-
python.html) Does anyone know of a good explanation of locality sensitive
hashing? I know there are other applications.

~~~
stevetjoa
Shameless self-promotion: in my Stack Overflow answer [1], I reference good
introductory LSH papers [2-5].

In short, LSH is an algorithm that hashes points that are _nearby_ in a
feature space into the same bin with high probability. Contrast that with
cryptographically secure hashes where the tiniest change in the input is
designed to yield a completely different hash. The point is that, in domains
like multimedia, you want to tolerate some distortions to your signal, e.g.
microphone noise, blur, etc. These minor distortions shouldn't affect your
characterization of the data, e.g. "is this a guitar", "is this a cat", etc.

The advantages are that it's simple to implement, and it has mathematically
provable probability bounds and query complexity.
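A minimal random-hyperplane LSH sketch in Python (this is one common LSH family; the dimensions and parameters here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signature(v, planes):
    # Each bit records which side of a random hyperplane the vector falls
    # on; vectors separated by a small angle agree on most bits with high
    # probability.
    return tuple(bool(b) for b in (planes @ v) > 0)

planes = rng.standard_normal((16, 64))        # 16 hyperplanes over 64-dim points
x = rng.standard_normal(64)                   # some feature vector
x_noisy = x + 0.05 * rng.standard_normal(64)  # small distortion of x
y = rng.standard_normal(64)                   # an unrelated point

def agree(a, b):
    return sum(i == j for i, j in zip(lsh_signature(a, planes),
                                      lsh_signature(b, planes)))

# The distorted copy lands in (nearly) the same bucket, while unrelated
# points agree on only about half of the 16 bits on average.
print(agree(x, x_noisy), agree(x, y))
```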

[1] [http://stackoverflow.com/questions/5751114/nearest-
neighbors...](http://stackoverflow.com/questions/5751114/nearest-neighbors-in-
high-dimensional-data/5773066#5773066)

[2]
[http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bi...](http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/p253-datar.pdf)

[3]
[http://www.vldb.org/conf/1998/p194.pdf](http://www.vldb.org/conf/1998/p194.pdf)

[4]
[http://www.vldb.org/conf/1999/P49.pdf](http://www.vldb.org/conf/1999/P49.pdf)

[5]
[http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf](http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf)

------
UncleChis
I worked for a company doing audio fingerprinting before (as well as
image/video). I have to say it's great that we built a song (audio)
recognition system that works pretty well, but the business never took off.
Maybe it was just my previous company; I'm not sure how Shazam is doing
either, but I haven't heard of them for a while.

~~~
ma2rten
Actually Google is now competing with them:
[https://play.google.com/store/apps/details?id=com.google.and...](https://play.google.com/store/apps/details?id=com.google.android.ears&hl=en)

------
imroot
I've putzed around on something very similar to this for performing analytics
on terrestrial radio stations and their commercials. Great work, and I love
that you've open sourced it. I didn't use Python for mine (C++), but Python
offers a much lower barrier to entry versus my spaghetti code :)

------
doctoboggan
I've written similar code, but using image processing algorithms. You can find
it here:

[https://github.com/jminardi/audio_fingerprinting](https://github.com/jminardi/audio_fingerprinting)

------
Osmium
How well would this work for spoken word instead of music?

~~~
aidos
Probably not particularly well. It's based on variations in pitch over time,
and unfortunately the human voice fills a very narrow frequency band. Maybe
you could limit it to focus on that area of the frequency spectrum, but I
suspect getting good results would require more dramatic changes.

------
ntsb
This is great! I was just starting out on a project in which one component
was to recognize gunshot sounds from varying distances.

~~~
muzakthings
Not sure if you've seen this, but relevant:

[http://www.shotspotter.com/](http://www.shotspotter.com/)

------
crimsonalucard
Rather than just fingerprinting recorded audio, can this thing fingerprint
words and passphrases that the user just says out loud?

~~~
rogerbinns
Not even close. This technique only works when the relative energy of
different frequency buckets stays the same, at the same time offsets (in
milliseconds). You are very unlikely to produce the same fingerprints when
repeating the same words/phrases.
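A toy illustration of why this works for fixed recordings but not re-spoken phrases: a peak-pairing fingerprint (a simplified sketch of the constellation idea, not Dejavu's actual implementation) is invariant to volume but sensitive to any change in which frequencies peak and when.

```python
import numpy as np

def peak_pairs(signal, frame=1024, fan_out=3):
    # Take the loudest frequency bin in each FFT frame, then hash
    # (freq1, freq2, frames_apart) for nearby peak pairs. Real systems
    # pick many peaks per frame, but the principle is the same: hashes
    # encode which frequencies peak and how far apart in time they
    # occur, not absolute loudness.
    peaks = []
    for t in range(0, len(signal) - frame, frame):
        spectrum = np.abs(np.fft.rfft(signal[t:t + frame]))
        peaks.append(int(spectrum.argmax()))
    return [hash((peaks[i], peaks[j], j - i))
            for i in range(len(peaks))
            for j in range(i + 1, min(i + 1 + fan_out, len(peaks)))]

# Two playbacks of the same tone at different volumes yield identical
# hashes; a re-spoken phrase shifts the peaks and breaks the match.
t = np.arange(8192) / 8000.0
tone = np.sin(2 * np.pi * 440 * t)
print(peak_pairs(tone) == peak_pairs(0.3 * tone))  # True
```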

Try using an app that shows the FFT and see if you can get it to show the same
thing twice when speaking. For example on Android this works
[https://play.google.com/store/apps/details?id=org.hermit.aud...](https://play.google.com/store/apps/details?id=org.hermit.audalyzer)

