I made a cool writeup about it here:
It's a great little library for doing audio recognition, stream radio advertisement verification, and all sorts of interesting things people email me about that I never would have thought of.
It's certainly not as speedy as Echoprint, which is written in C++ and doesn't use an FFT for its locality-sensitive hashing, but it is quite user friendly. The benefit of constellation or time-delta-based LSH methods like Dejavu's is that you can actually recover the time at which you matched.
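That time-recovery property falls out of the data naturally: if you store each hash's offset within the track, the most common difference between the offsets of matched hashes both scores the match and tells you where in the song the clip aligns. A toy sketch of the idea (not Dejavu's actual code; all names here are made up):

```python
from collections import Counter

def best_offset(db_peaks, query_peaks):
    """db_peaks / query_peaks are (hash, time_offset) pairs. The most
    common difference between matching hashes' offsets is where the
    query clip aligns inside the stored song."""
    index = {}
    for h, t in db_peaks:
        index.setdefault(h, []).append(t)
    deltas = Counter()
    for h, t in query_peaks:
        for t_db in index.get(h, []):
            deltas[t_db - t] += 1
    return deltas.most_common(1)[0][0] if deltas else None

# A clip recorded starting 20 frames into the song:
song = [("a", 10), ("b", 20), ("c", 30), ("d", 60), ("e", 75)]
clip = [("b", 0), ("c", 10), ("e", 55), ("zz", 3)]  # "zz" = a noise hash
print(best_offset(song, clip))  # -> 20
```

The count behind the winning delta doubles as a confidence score, which is why this style of matching degrades gracefully under noise.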
If you love it, feel free to dig in and contribute!
At the moment I'm using it to process a few hundred gigs of song files that I've collected over the years as a big furry hairball of a mess. Having multiple iPods and MP3 players, not doing very good housekeeping in the move from one to the next, and avoiding things like iTunes where possible has left me with a lot of files that may contain duplicate songs - but the filenames and organization don't necessarily reflect that fact.
So I'm using dejavu right now to clean this up. I'm assuming you'd be happy to have a "find_duplicates.py" script added - if so, I'll let you know as soon as I have one working ;)
I'd be curious as well to see how the performance holds up getting into the terabytes, as I haven't tested that. Remember too that there are a lot of parameters for the matching algorithm here (https://github.com/worldveil/dejavu/blob/master/dejavu/finge...) which allow you to trade off accuracy, speed, and storage in different ways. I've tried to document it thoroughly.
Finding duplicates is a great one! Actually generating a checksum for each audio file (minus the header and ID3 tags) and adding this as a column in the songs table for all the different filetypes Dejavu supports (mp3, wav, etc) would probably be the best way to do this.
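A rough sketch of what that checksum could look like for MP3s - this is not something Dejavu does today, and real tag handling has more edge cases (extended headers, APE tags, etc.), so a tag library like mutagen would be more robust in practice:

```python
import hashlib

def audio_checksum(data: bytes) -> str:
    """Hash an MP3's audio payload, skipping an ID3v2 tag at the front
    and an ID3v1 tag at the back, so re-tagged copies of the same audio
    hash identically."""
    if data[:3] == b"ID3" and len(data) >= 10:
        # bytes 6-9 hold the tag size as a 28-bit "synchsafe" integer
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        data = data[10 + size:]
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]  # drop the fixed-size ID3v1 trailer
    return hashlib.sha1(data).hexdigest()
```

Storing that digest in a new column of the songs table would make exact-duplicate lookup a single indexed query.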
I say this because so many songs today are built on sampling. Mashups and EDM tracks often sample from other work, and as such, fingerprints and their alignment can be shared across different songs. Something more clever, like computing the percentage of hashes per song that are the same and comparing it to a threshold, might do the trick, though.
Happy hacking, and feel free to send in a PR! :)
We were doing recognition of TV shows where you are holding a cell phone in your hand some distance from the TV. A prototype with this algorithm was okay, but very easily confused, especially as the number of items in the database increases. You also end up with a lot less amplitude at that distance from the TV, which makes the source messier. In theory phase shouldn't have had an effect, but in practice it did, so things had to be run multiple times at different offsets to improve matching.
Our final algorithm was way better. It was based on what audio codecs do. It even worked reliably with a 60 dB signal in the presence of a 70 dB interferer!
Unfortunately, many fingerprinting use cases require hashing at different granularities (i.e., FFT windows), or need different collision guarantees to trade off space vs. accuracy, and so on.
A perfect example is throwing away part of the SHA-1 hash of a fingerprint. You lose some entropy, but you become more space efficient.
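Concretely, that truncation trade-off is a one-liner - a sketch along the lines of what's described, with a made-up function name:

```python
import hashlib

def fingerprint_hash(token: str, keep_chars: int = 20) -> str:
    """Hash a fingerprint token (e.g. a pair of spectrogram peaks plus
    their time delta) and keep only the first keep_chars hex characters.
    At 20 of 40 chars you store half the bytes but lose half the bits
    of entropy, so unrelated tokens collide more often."""
    return hashlib.sha1(token.encode("utf-8")).hexdigest()[:keep_chars]
```

The sweet spot depends on database size: with millions of fingerprints, the extra collisions from a shorter hash are absorbed by the alignment step, which filters out spurious matches anyway.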
Thus in many cases, while the core algorithm might be the same, the parameters and constraints of the individual use case often mean that the fingerprints themselves aren't universal in size or format.
In short, LSH is an algorithm that hashes points that are nearby in a feature space into the same bin with high probability. Contrast that with cryptographically secure hashes where the tiniest change in the input is designed to yield a completely different hash. The point is that, in domains like multimedia, you want to tolerate some distortions to your signal, e.g. microphone noise, blur, etc. These minor distortions shouldn't affect your characterization of the data, e.g. "is this a guitar", "is this a cat", etc.
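To make that concrete, here's a toy sketch of one classic LSH family - random hyperplanes (SimHash) for angular similarity. This is not the scheme Dejavu uses, just a minimal illustration of "nearby inputs, same bin":

```python
import random

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls
    on. Vectors a small angle apart agree on most bits, so they tend to
    land in the same bucket; dissimilar vectors disagree on many bits."""
    return tuple(1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
                 for plane in hyperplanes)

random.seed(42)
dim, n_planes = 8, 16
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
```

Note the signature is invariant to positive scaling of the input - turning the volume up or down doesn't change which side of each plane a feature vector lies on, which is exactly the kind of distortion tolerance the paragraph above is describing.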
The advantages are that it's simple to implement, and it has mathematically provable probability bounds and query complexity.
Try using an app that shows the FFT and see if you can get it to show the same thing twice while speaking. For example, on Android this one works: https://play.google.com/store/apps/details?id=org.hermit.aud...