Normalized correlation (which is what `match_template` uses) has been applied outside of image processing before, though. I also tried other image processing techniques like Harris corner detection and SURF feature detection. I didn't write them up, but you can see the code on GitHub:
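A minimal sketch of that normalized-correlation step, assuming scikit-image and SciPy (not the code linked above; the function name and parameters are illustrative):

```python
import numpy as np
from scipy import signal
from skimage.feature import match_template

def find_excerpt(full_audio, excerpt, rate=44100):
    """Locate an excerpt inside a longer recording by matching spectrograms."""
    _, _, full_spec = signal.spectrogram(full_audio, fs=rate)
    _, _, excerpt_spec = signal.spectrogram(excerpt, fs=rate)
    # match_template slides the template over the image and returns a
    # normalized correlation surface; the peak marks the best alignment.
    corr = match_template(full_spec, excerpt_spec)
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return peak, corr.max()
```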
The usefulness of the kinds of experiments in this post, imo, is to investigate whether some things developed as domain-specific features actually belong more in the common foundation part. Sometimes something will be developed initially in computer-music or computer-vision not because it's really music or vision specific, but just due to where a particular person happened to be working. Or where research funding was allocated, for that matter. A funding trick computer-music people like doing lately is to apply their algorithms to bioinformatics as a way of funding research in their real domain of interest.
The next step, of course, would be to try to determine if these image-processing techniques already do exist in audio processing, just under a different name, or in some kind of variant.
I find it amazing that the brain can accept and process large amounts of information if it is just presented in a coherent visual way. Imagine looking at an image vs a printout of the pixel values. Same data but one is designed to be processed by humans.
Good point. I guess I was trying to say that non-visual information processing in the brain is more than just organized similarly to visual processing, in some cases the brain actually uses the visual system to process non-visual information.
For example, you gotta window your FFTs in the spectrogram, kid.
If I were implementing this in production I would certainly apply some sort of window function. However, this blog post was mostly just to show some cool code, and ease of reading was my first priority. Notice also how I kept intermediate arrays around in my functions; this uses more memory but is easier to read.
In the end this approach doesn't seem to scale to the amount of data Shazam must be processing, so it's all moot anyway.
For the curious: http://en.wikipedia.org/wiki/Window_function
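A minimal sketch of what the window buys you, using NumPy's Hann window (frame size and test tone are illustrative):

```python
import numpy as np

frame_size = 1024
t = np.arange(frame_size)
frame = np.sin(2 * np.pi * 440.5 * t / 44100)  # tone that isn't bin-aligned

raw_spectrum = np.abs(np.fft.rfft(frame))
windowed_spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_size)))

# The windowed spectrum concentrates energy near the true frequency;
# the unwindowed one smears it across neighbouring bins (spectral leakage).
```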
Synesthesia, by definition, is the _mixing_ of your senses. Letters having colors, smells having personalities, music tasting like cheeseburgers. NOT using a spectrogram to display audio.
I can sort of see where OP is coming from, but I think it's a bit farfetched to call this synesthesia. As fluidcruft said, a more suitable title would probably be "Audio Analysis with Signal Processing Algorithms" or something along those lines.
Regardless, it's still a pretty impressive project.
Here's what audio data in that compressed domain looks like [you'll need to zoom in]: https://raw.github.com/dchichkov/DeepLearningTutorials/maste...
It's a really basic algorithm, even more basic in 1D. So it would be pretty trivial to just compare the shifted spectrogram profiles for a massive gain in performance.
In fact, you probably don't even need the "shift" bit of the algorithm, because you will end up comparing one frequency to a different frequency, which does not make much sense (outside of Doppler shift calcs). So it's a really long-winded approach to taking the cross-correlation of two spectrograms that introduces a load of unwanted homomorphism.
see also http://dsp.stackexchange.com/questions/736/how-do-i-implemen...
I would like to add that just because an algorithm is in a vision processing library doesn't make it "clever". A basic template matching implementation is 4 nested for loops. Clever template matching uses an FFT to save one or two of those loops, but the results are the same. In audio processing you don't need all those loops because the signal is 1D. So going via a vision processing library just introduces pointless loops that do nothing but worsen results and slow down the code.
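Roughly, the 1D approach I mean looks like this: slide the excerpt's spectrogram along the time axis only and score each offset, with no frequency shifts at all. A NumPy sketch, assuming both spectrograms are (freq_bins, time_frames) with matching frequency axes:

```python
import numpy as np

def best_time_offset(full_spec, excerpt_spec):
    """Score every time offset of excerpt_spec against full_spec."""
    n_full = full_spec.shape[1]
    n_exc = excerpt_spec.shape[1]
    scores = np.empty(n_full - n_exc + 1)
    for offset in range(len(scores)):
        patch = full_spec[:, offset:offset + n_exc]
        # Normalized correlation of the two patches -- the same frequency
        # bins are always compared against each other.
        denom = np.linalg.norm(patch) * np.linalg.norm(excerpt_spec)
        scores[offset] = np.sum(patch * excerpt_spec) / denom if denom else 0.0
    return int(np.argmax(scores)), scores
```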
I won't post any image-to-audio examples since they are generally really annoying, but here is a handful of audio samples converted to images (mainly game samples due to size and features):
http://0x09.net/img/bgm3.png (warning: 22 MB)
It's actually invertible aside from quantization and rounding errors, so there is for example this mp3-compressed Lenna:
http://0x09.net/img/lenna.mp3.png (the colorful fuzz is mainly from quantization/rounding rather than the compression itself)
Or downsampled: http://0x09.net/img/lenna11khz.png
Or upsampled: http://0x09.net/img/lenna96khz.png
I can't imagine this is actually useful for anything but it is pretty neat.
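For the curious, the near-invertibility can be illustrated with a complex STFT round trip (not necessarily the same pipeline as the images above); quantizing the coefficients the way an 8-bit image would is roughly where the fuzz comes from. A SciPy sketch with an illustrative stand-in signal:

```python
import numpy as np
from scipy import signal

rate = 44100
audio = np.random.randn(rate)  # stand-in for a real recording

_, _, stft = signal.stft(audio, fs=rate, nperseg=1024)

# Quantize real and imaginary parts to 8 bits, as storing them in an
# image would -- this is the source of the quantization "fuzz".
scale = np.max(np.abs(stft))
quantized = np.round(stft / scale * 127) / 127 * scale

_, reconstructed = signal.istft(quantized, fs=rate, nperseg=1024)
error = np.max(np.abs(reconstructed[:len(audio)] - audio))
```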
the former's what Aphex Twin used to insert his face into the "Windowlicker" spectrogram: http://en.wikipedia.org/wiki/Windowlicker#Background
edit: his face is actually in "[equation]"; "Windowlicker" has a spiral.
It's interesting (but not surprising) to see that there are many parallels among popular algorithms in computer vision and computer audio.
As far as the cross-domain use of these algorithms goes: I realized recently how much overlap there is between what we perceive as 'different' domains. Both audio and image processing tend to deal with detecting and quantifying patterns. It was a Eureka moment for me, but I'm just a simpleton, so I haven't done much with the new insight :/
For a specific implementation of computer-vision related techniques used for audio fingerprinting, see:
However, a Google Scholar search should produce a lot more examples.
This is computational synesthesia - http://www.youtube.com/watch?v=yVOD0X4KbYk
I'm currently looking for someone to work with on building a hemispheric projection for viewing these videos (in SF). Don't suppose anyone is interested? Will release source code in conjunction with my first installation. It's not real time yet but should be possible after GPU-ifying.
So this is somewhat less synesthesia, and more of an organic, bioinformatic method of processing that data. So very cool.
So this may well be a case of borrowing audio processing techniques from an image processing library and then using them to process audio.
The FFT on the 1D audio would not work because the FFT of a sub-sample would not be immediately recognizable in the FFT of the whole. This is why I took the spectrogram approach, which is many FFTs over time.
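A minimal sketch of what "many FFTs over time" means (not the post's exact code; plain NumPy, with illustrative frame and hop sizes):

```python
import numpy as np

def simple_spectrogram(audio, frame_size=1024, hop=512):
    """Split the signal into short frames, FFT each one, stack the magnitudes."""
    window = np.hanning(frame_size)
    frames = [
        audio[start:start + frame_size] * window
        for start in range(0, len(audio) - frame_size + 1, hop)
    ]
    # Rows are frequency bins, columns are time frames; a short excerpt's
    # spectrogram shows up as a contiguous patch of this array.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```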