Computational Synesthesia: Audio Analysis with Image Processing Algorithms (minardi.org)
102 points by doctoboggan on Oct 1, 2013 | 35 comments

This seems to more accurately be "Audio Analysis with Signal Processing Algorithms". It shouldn't be surprising that Fourier analysis applies to audio... up next: discovering that wavelets apply to audio.

I used the spectrogram just to get the audio in a 2d representation that was insensitive to timeshifts. The image processing part was the `match_template` function from skimage.
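A rough sketch of that idea, not the code from the post: it builds a magnitude spectrogram with plain NumPy (rectangular frames, no overlap) and slides the clip's spectrogram along the time axis with normalized correlation, as a stand-in for skimage's `match_template`. The helper names `spectrogram` and `best_time_offset`, the frame size, and the synthetic "song" are all assumptions made for illustration.

```python
import numpy as np

def spectrogram(x, frame=256, hop=256):
    """Magnitude spectrogram: one FFT per frame, shape (freq_bins, n_frames)."""
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def best_time_offset(image, template):
    """Slide the template along the time axis; return (best frame index, scores)."""
    w = template.shape[1]
    t = template - template.mean()
    scores = []
    for k in range(image.shape[1] - w + 1):
        patch = image[:, k:k + w]
        p = patch - patch.mean()
        denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
        scores.append((p * t).sum() / denom if denom > 0 else 0.0)
    scores = np.array(scores)
    return int(scores.argmax()), scores

# Hypothetical "song": a pure tone that jumps to a new FFT bin every 1024 samples
freqs = 5 + (37 * np.arange(32)) % 90
n = np.arange(1024)
song = np.concatenate([np.sin(2 * np.pi * f * n / 256) for f in freqs])

# Noisy clip whose start is deliberately not aligned to a frame boundary
rng = np.random.default_rng(0)
start = 10 * 1024 + 100
clip = song[start:start + 4096] + 0.1 * rng.standard_normal(4096)

offset, scores = best_time_offset(spectrogram(song), spectrogram(clip))
print(offset, start // 256)   # best-matching frame vs. the frame the clip starts in
```

Because only magnitudes are kept, a clip that starts partway through a frame still lines up with the nearest frame of the full spectrogram.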

Although normalized correlation (which is what `match_template` uses) has been applied outside of image processing before. I also tried other image processing techniques like Harris corner detection and SURF feature detection. I didn't write them up but you can see the code on GitHub:


Using an image algorithm to process audio reminds me of Jeff Hawkins' arguments in On Intelligence, specifically in arguing that the brain processes all types of input in the same way of recognizing a pattern of activation within a sequence. Very interesting book.


The way non-biological (i.e. mathematical, engineering) analyses are conducted has a substantial degree of unification as well. A typical engineering text focused on signal processing will develop an analysis method and then show example applications to audio, to images, to bridge resonance, to electronic circuits, etc. Things do get more specific as you dig further into domain-specific features and heuristics, but there's a large common foundation.

The usefulness of the kinds of experiments in this post, imo, is to investigate whether some things developed as domain-specific features actually belong more in the common foundation part. Sometimes something will be developed initially in computer-music or computer-vision not because it's really music or vision specific, but just due to where a particular person happened to be working. Or where research funding was allocated, for that matter. A funding trick computer-music people like doing lately is to apply their algorithms to bioinformatics as a way of funding research in their real domain of interest.

The next step, of course, would be to try to determine if these image-processing techniques already do exist in audio processing, just under a different name, or in some kind of variant.

Yeah, I was thinking about this algorithm because I realized how much my brain uses the visual system to solve non visual problems. Maybe I am more of a visual thinker than others but I often visualize abstract concepts naturally.

I find it amazing that the brain can accept and process large amounts of information if it is just presented in a coherent visual way. Imagine looking at an image vs a printout of the pixel values. Same data but one is designed to be processed by humans.

Recruiting the visual areas of the brain to process information is actually not a strong support for the argument that information processing in other regions of the brain is organized similarly--if it were, processing could be recruited elsewhere and you would not perceive it as "visual".

> Recruiting the visual areas of the brain to process information is actually not a strong support for the argument that information processing in other regions of the brain is organized similarly

Good point. I guess I was trying to say that non-visual information processing in the brain is more than just organized similarly to visual processing, in some cases the brain actually uses the visual system to process non-visual information.

I keep seeing this book referenced; need to check it out.

I was originally going to call this 'pretty good for amateur hour stuff' but I clicked around the site a little bit and dude's got an EE, which raises the bar of what's acceptable, and the mistakes are more damning.

For example, you gotta window your FFTs in the spectrogram, kid.

Hi sanskrit, thanks for the feedback.

If I was implementing this in production I certainly would apply some sort of window function. However this blog post was mostly just to show some cool code, and ease of reading was my first constraint. Notice also how I kept intermediate arrays around in my functions. This uses more memory but is easier to read.

In the end this approach doesn't seem to scale to the amount of data Shazam must be processing so it's all moot anyway.

I don't think I'd call a rectangular window a "damning mistake." Depending upon the application, maybe the energy spread is acceptable (and computationally quicker).

For the curious: http://en.wikipedia.org/wiki/Window_function
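To put a rough number on the trade-off, here is a small NumPy sketch (my own illustration, not code from the post). It takes a tone that falls between FFT bins and measures how much energy leaks far away from the peak with a rectangular window versus a Hann window. The frame length, tone frequency, and the `leakage_db` helper with its guard width are arbitrary choices for the example.

```python
import numpy as np

N = 1024
n = np.arange(N)
# A tone at 100.5 cycles per frame sits between two bins, the worst case for leakage
x = np.sin(2 * np.pi * 100.5 * n / N)

rect = np.abs(np.fft.rfft(x))                   # rectangular window (i.e. no window)
hann = np.abs(np.fft.rfft(x * np.hanning(N)))   # Hann window

def leakage_db(mag, peak_bin=100, guard=10):
    """Largest magnitude more than `guard` bins from the peak, in dB relative to the peak."""
    far = np.concatenate([mag[:peak_bin - guard], mag[peak_bin + guard + 1:]])
    return 20 * np.log10(far.max() / mag.max())

print(f"rectangular: {leakage_db(rect):.1f} dB")
print(f"hann:        {leakage_db(hann):.1f} dB")
```

The Hann window pushes the far-off leakage much lower at the cost of a wider main lobe, so whether the rectangular window is "good enough" does depend on what the spectrogram is used for.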

As someone who has synesthesia, a background in engineering, and currently studies music technology, I'm pretty confident saying this has nothing to do with the former. I'm genuinely confused why OP called this synesthesia.

Synesthesia, by definition, is the _mixing_ of your senses. Letters having colors, smells having personalities, music tasting like cheeseburgers. NOT, using a spectrogram to display audio.

I can sort of see where OP is coming from, but I think it's a bit farfetched to call this synesthesia. As fluidcruft said, a more suitable title would probably be "Audio Analysis with Signal Processing Algorithms" or something along those lines.

Regardless, it's still a pretty impressive project.

That's pretty cool. Just for the fun of it, I also did something along these lines, trying to make a model of human voice with stacked denoising autoencoders (SDAs, which were used in the image recognition originally) in the compressed audio domain (MDCT, basically MP3, but without Huffman encoding). It worked, but a simple nearest neighbour model was producing a lot more realistic voice.

Here's what audio data in that compressed domain looks like [you'll need to zoom in]: https://raw.github.com/dchichkov/DeepLearningTutorials/maste...

As a related note, nowadays it's becoming more common to use image processing to restore audio from old film reels. It's way easier and more efficient to first remove scratches and enhance contrast of the optical audio track before going through standard audio tools.

Template matching is normally doing the pixelwise correlation between source + offset and dest (in 2D).

It's a really basic algorithm, even more basic in 1D. So it would be pretty trivial to just compare the shifted spectrogram profiles for a massive gain in performance.

In fact, you probably don't even need the "shift" bit of the algorithm, because you will end up comparing one frequency to a different frequency, which does not make much sense (outside of Doppler shift calcs). So it's a really long-winded approach to taking the cross-correlation of two spectrograms that introduces a load of unwanted homomorphism.

see also http://dsp.stackexchange.com/questions/736/how-do-i-implemen...

I would like to add that just because an algorithm is in a vision processing library, that doesn't make it "clever". A basic template matching implementation is 4 nested for loops. Clever template matching uses the FFT to save one or two of those loops, but the results are the same. In audio processing you don't need all those loops because the signal is 1D. So going via a vision processing library is just introducing pointless loops that do nothing but worsen results and slow down code.
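For anyone who wants to check the "results are the same" claim, a minimal NumPy sketch follows. It does plain, unnormalized cross-correlation two ways: with four nested loops, and via the FFT. The function names and the random test image are made up for the example; a library routine like `match_template` additionally normalizes the result, which this sketch leaves out.

```python
import numpy as np

def correlate_loops(image, template):
    """Unnormalized template matching with four nested loops ('valid' positions only)."""
    H, W = image.shape
    h, w = template.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):          # vertical offset
        for j in range(W - w + 1):      # horizontal offset
            for a in range(h):          # template row
                for b in range(w):      # template column
                    out[i, j] += image[i + a, j + b] * template[a, b]
    return out

def correlate_fft(image, template):
    """Same result via the FFT: correlation is convolution with the flipped template."""
    H, W = image.shape
    h, w = template.shape
    shape = (H + h - 1, W + w - 1)      # zero-pad so the circular convolution does not wrap
    F = np.fft.rfft2(image, shape) * np.fft.rfft2(template[::-1, ::-1], shape)
    full = np.fft.irfft2(F, shape)
    return full[h - 1:H, w - 1:W]       # keep only the 'valid' region

rng = np.random.default_rng(0)
image = rng.standard_normal((20, 30))
template = image[5:9, 12:18].copy()

slow = correlate_loops(image, template)
fast = correlate_fft(image, template)
print(slow.shape, np.allclose(slow, fast))
```

If the template spans every frequency row of the spectrogram, the two outer loops collapse into a single loop over time offsets, which is the 1D case described above.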

In school I had a project that used edge detection to find certain patterns in audio samples. There were probably better ways to do it than looking at the spectrogram but I didn't pay enough attention in that class to know what they were. I was always curious about what else could be done with image processing algorithms and audio, it's cool to see a thread about it.

This is funny, I was fooling around with something really similar last week. The difference is rather than operating on a STFT spectrum, instead transform the whole signal, re-order the coefficients by folding from low to high in the desired dimension, and invert the transform in that dimension, effectively changing from time to "spatial" domain or vice versa.

I won't post any image to audio examples since they are generally really annoying, but here is a handful of audio samples converted to images (mainly game samples due to size and features):





http://0x09.net/img/bgm3.png (warning 22mb)

It's actually invertible aside from quantization and rounding errors, so there is for example this mp3-compressed Lenna:

http://0x09.net/img/lenna.mp3.png (the colorful fuzz is mainly from quantization/rounding rather than the compression itself)

Or downsampled: http://0x09.net/img/lenna11khz.png

Or upsampled: http://0x09.net/img/lenna96khz.png

I can't imagine this is actually useful for anything but it is pretty neat.

I'm amazed at how the 11 kHz Lenna actually looks more like dithering than anything else. It's really uncanny how it looks like a Floyd-Steinberg dithered Lenna [1]. I suppose that's probably the intent of the dithering, so that you can turn a higher frequency (the dither) into a lower one with aliasing basically.

[1] http://www.malcolmmclean.site11.com/www/BinaryImageProcessin...

you might be interested in MetaSynth (http://www.uisoftware.com/MetaSynth/index.php) & Izotope's Iris (http://www.izotope.com/products/audio/iris/) which basically offer the converse of what you're doing here. fascinating from both a technical & a sound design point of view.

the former's what Aphex Twin used to insert his face into the "Windowlicker" spectrogram: http://en.wikipedia.org/wiki/Windowlicker#Background

edit: his face is actually in "[equation]"; "Windowlicker" has a spiral.

CMU researchers followed this same approach some time ago: http://www.cs.cmu.edu/~yke/musicretrieval/

Yes, this seems pretty much what the author wants to implement. It's also worthwhile to have a look at the paper describing Shazam's original algorithm (http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf).

It's interesting (but not surprising) to see that there are many parallels among popular algorithms in computer vision and computer audio.

Just yesterday I was looking at this http://m.fbi.gov/#http://www.fbi.gov/news/stories/2013/april... as a fun, different challenge to work on and wanted to be able to translate the dot-notation characters into something more easily distinguishable. The pattern matching in python would have helped a lot, I think.

As for the cross-domain use of these algorithms: I realized recently how much overlap there is between what we perceive as 'different' domains. Both audio and image processing tend to deal with detecting and quantifying patterns. It was a Eureka moment for me, but I'm just a simpleton, so I haven't done much with the new insight :/

There has been a fair amount of research in the Music Information Retrieval (MIR) community that borrows from image processing techniques for the likes of audio fingerprinting, onset detection, tempo estimation, chord recognition, etc. Most of the techniques used start by transforming the audio into some time-frequency representation (like an STFT, but also short-time versions of the wavelet transform, constant-q transform, etc) as is done here.

For a specific implementation of computer-vision related techniques used for audio fingerprinting, see:


However, a Google Scholar search should produce a lot more examples.

Very cool. I've been looking into doing some speaker recognition, and I've considered using image processing algorithms as well - they seem like a good way to recognize patterns in a spectrogram.

This is not "Computational Synesthesia" - it's nice but that is much harder.

This is computational synesthesia - http://www.youtube.com/watch?v=yVOD0X4KbYk

I'm currently looking for someone to work with on building a hemispheric projection for viewing these videos (in SF). Don't suppose anyone is interested? Will release source code in conjunction with my first installation. It's not real time yet but should be possible after GPU-ifying.

I would recommend you look at the work of Professor Paul Bourke at the University of Western Australia in Perth. He creates some amazing immersive environments.

http://paulbourke.net/dome/ http://paulbourke.net/exhibition/domeinstall/

I'm not surprised at how well this worked. Humans detect consonant-vowel pairs using f2 transformations, which are computed by specialized cells in the ear/brain. The same such cells exist in bats and barn owls. Cats actually use f2 transformations to communicate as well, and that's why they're actually pretty good at expressing specific desires to humans based on how they meow.

So this is somewhat less synesthesia, and more of an organic, bioinformatic method of processing that data. So very cool.

Many image processing algorithms are taken from the signal processing world, and were probably initially employed to process audio during the construction of the Bell telephone system. There's a lot of mathematical similarity between a visual and an audio signal; the differences are primarily in the human receivers.

So this may well be a case of borrowing audio processing techniques from an image processing library and then using it to process audio.

Ignoring the problems with the author's title, I've found that this article sums up and describes a very robust solution to this sort of problem using Haar Wavelet Decomposition and Locality-Sensitive Hashing.


Doing audio processing on a GPU isn't exactly a new thing: http://www.nvidia.com/content/GTC/documents/1011_GTC09.pdf comes from 2009 and people have been mucking around with this about as long as it's been possible to write custom shaders.

What is a good way to do recommendation of music files? How would you analyse them, and what method would you use to compare them? Is this possible with just the raw data itself, rather than looking at metadata within the files?

Would love to see how this works vs. just finding the max correlation of the 1d sampled stream vs the base data. If that's no good - try fft on the 1d sample and rescale.
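For reference, the first suggestion (max correlation of the raw 1D samples) might look roughly like this in NumPy. This is only a sketch under friendly assumptions: the "recording" is random noise, and the sample is an exact excerpt with a little noise added, so variable names and data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.standard_normal(5000)     # stand-in for the full recording
true_offset = 1234
sample = base[true_offset:true_offset + 500] + 0.2 * rng.standard_normal(500)

# 'valid' mode evaluates the dot product at every offset where the sample fits entirely
corr = np.correlate(base, sample, mode="valid")
found = int(corr.argmax())
print(found, true_offset)
```

This works here because the sample really is a sample-aligned copy of the base data. Whether it holds up on real recordings, with different gain, noise, or encoding, is exactly the comparison being asked for.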

I explored these options as well. In my testing I was able to get a factor of 6 speedup simply by reshaping a 1D array into 2 dimensions.

The FFT on the 1D audio would not work because the FFT of a sub sample would not be immediately recognizable in the FFT of the whole. This is why I took the spectrogram approach, which is many FFTs over time.
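A small NumPy illustration of that point, using a made-up four-tone signal and non-overlapping 256-sample frames (both assumptions for the example, as is the `frames_fft` helper). One FFT over the clip is not directly comparable to one FFT over the whole signal, while the per-frame FFTs of the clip show up as a contiguous block of the whole signal's frames.

```python
import numpy as np

frame = 256
n = np.arange(1024)
# Hypothetical signal: four segments, each a different pure tone
whole = np.concatenate([np.sin(2 * np.pi * f * n / frame) for f in (10, 40, 25, 60)])
clip = whole[1024:3072]              # the two middle segments

def frames_fft(x):
    """Non-overlapping frames -> magnitude FFT per frame, shape (n_frames, bins)."""
    usable = len(x) // frame * frame
    return np.abs(np.fft.rfft(x[:usable].reshape(-1, frame), axis=1))

# One FFT over everything: different lengths and bin spacing, so no direct comparison
print(np.fft.rfft(whole).shape, np.fft.rfft(clip).shape)

# Many short FFTs: the clip's frames reappear as a block of the whole's frames
S_whole, S_clip = frames_fft(whole), frames_fft(clip)
print(np.allclose(S_whole[4:12], S_clip))
```

The clip here starts exactly on a frame boundary, which makes the block match exact. With an arbitrary start the frames only match approximately, which is where a correlation-based search comes in.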

Where does that 6x speedup come from? Is it a different algorithm? How is the 1d process different from the 2d? It might be that the 2d lib is just better optimized.

I'm thinking that real time visualizations of music based on this would be amazing when high. I should go implement it.
