

Computational Synesthesia: Audio Analysis with Image Processing Algorithms - doctoboggan
http://jack.minardi.org/software/computational-synesthesia/

======
fluidcruft
This seems to more accurately be "Audio Analysis with Signal Processing
Algorithms". It shouldn't be surprising that Fourier analysis applies to
audio... up next: discovering that wavelets apply to audio.

~~~
doctoboggan
I used the spectrogram just to get the audio in a 2d representation that was
insensitive to timeshifts. The image processing part was the `match_template`
function from skimage.

That said, normalized correlation (which is what `match_template` uses) has
been applied outside of image processing before. I also tried other image
processing techniques like Harris corner detection and SURF feature detection.
I didn't write them up but you can see the code on github:

[https://github.com/jminardi/audio_fingerprinting](https://github.com/jminardi/audio_fingerprinting)
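
A minimal sketch of that pipeline, using numpy only. This is a stand-in for the idea, not skimage's `match_template` implementation and not the code in the repo above: `spectrogram`, `best_time_offset`, the frame sizes, and the noise test signal are all assumptions made for illustration. The clip is also cut on a frame boundary so the match is exact, which real audio would not guarantee.

```python
import numpy as np

def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one FFT per frame (rows = frequency, cols = time)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.array([signal[s:s + frame_len] for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def best_time_offset(full, clip):
    """Slide `clip` along the time axis of `full` and score each position
    with normalized correlation; return the best column and its score."""
    width = clip.shape[1]
    c = clip - clip.mean()
    scores = []
    for t in range(full.shape[1] - width + 1):
        w = full[:, t:t + width]
        w = w - w.mean()
        denom = np.sqrt((w ** 2).sum() * (c ** 2).sum())
        scores.append((w * c).sum() / denom if denom > 0 else 0.0)
    return int(np.argmax(scores)), float(max(scores))

rng = np.random.default_rng(0)
song = rng.standard_normal(4096)   # stand-in for real audio
clip = song[320:320 + 512]         # starts exactly at frame 10 (320 / hop)

offset, score = best_time_offset(spectrogram(song), spectrogram(clip))
```

Because the clip spans the full frequency range, only the time axis needs searching here; a general 2D template matcher would also slide along frequency.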

------
ijl
Using an image algorithm to process audio reminds me of Jeff Hawkins'
arguments in On Intelligence, specifically in arguing that the brain processes
all types of input in the same way of recognizing a pattern of activation
within a sequence. Very interesting book.

[https://en.wikipedia.org/wiki/On_Intelligence](https://en.wikipedia.org/wiki/On_Intelligence)

~~~
doctoboggan
Yeah, I was thinking about this algorithm because I realized how much my brain
uses the visual system to solve non visual problems. Maybe I am more of a
visual thinker than others but I often visualize abstract concepts naturally.

I find it amazing that the brain can accept and process large amounts of
information if it is just presented in a coherent visual way. Imagine looking
at an image vs a printout of the pixel values. Same data but one is designed
to be processed by humans.

~~~
fluidcruft
Recruiting the visual areas of the brain to process information is actually
not a strong support for the argument that information processing in other
regions of the brain is organized similarly--if it were, processing could be
recruited elsewhere and you would not perceive it as "visual".

~~~
doctoboggan
> Recruiting the visual areas of the brain to process information is actually
> not a strong support for the argument that information processing in other
> regions of the brain is organized similarly

Good point. I guess I was trying to say that non-visual information processing
in the brain is more than just organized similarly to visual processing, in
some cases the brain actually uses the visual system to process non-visual
information.

------
sanskritabelt
I was originally going to call this 'pretty good for amateur hour stuff' but I
clicked around the site a little bit and dude's got an EE, which raises the
bar of what's acceptable, and the mistakes are more damning.

For example, you gotta window your FFTs in the spectrogram, kid.
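
To make the windowing point concrete, here is a small numpy sketch comparing an unwindowed (rectangular) frame with a Hann-windowed one. The frame length, the 10.5-cycle test tone, and the choice of bin 60 as the "far away" bin are all arbitrary assumptions picked to show spectral leakage, not values from the blog post.

```python
import numpy as np

n = 256
t = np.arange(n)
# 10.5 cycles per frame: the tone falls between two FFT bins,
# which is the worst case for leakage.
tone = np.sin(2 * np.pi * 10.5 * t / n)

rect = np.abs(np.fft.rfft(tone))                  # no window
hann = np.abs(np.fft.rfft(tone * np.hanning(n)))  # Hann window

def leakage_db(mag, far_bin=60):
    """Level at a bin far from the tone, relative to the spectrum's peak."""
    return 20 * np.log10(mag[far_bin] / mag.max())

rect_leak = leakage_db(rect)
hann_leak = leakage_db(hann)
print(f"rectangular: {rect_leak:.1f} dB, Hann: {hann_leak:.1f} dB")
```

The windowed frame trades a slightly wider main peak for much lower energy smeared into distant bins, which is usually what you want before comparing spectrogram columns.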

~~~
doctoboggan
Hi sanskrit, thanks for the feedback.

If I was implementing this in production I certainly would apply some sort of
window function. However this blog post was mostly just to show some cool
code, and ease of reading was my first constraint. Notice also how I kept
intermediate arrays around in my functions. This uses more memory but is
easier to read.

In the end this approach doesn't seem to scale to the amount of data Shazam
must be processing so it's all moot anyway.

------
lewisgodowski
As someone who has synesthesia, a background in engineering, and currently
studies music technology, I'm pretty confident saying this has nothing to do
with the former. I'm genuinely confused why OP called this synesthesia.

Synesthesia, by definition, is the _mixing_ of your senses. Letters having
colors, smells having personalities, music tasting like cheeseburgers. NOT
using a spectrogram to display audio.

I can sort of see where OP is coming from, but I think it's a bit farfetched
to call this synesthesia. As fluidcruft said, a more suitable title would
probably be "Audio Analysis with Signal Processing Algorithms" or something
along those lines.

Regardless, it's still a pretty impressive project.

------
dchichkov
That's pretty cool. Just for the fun of it, I also did something along these
lines, trying to make a model of human voice with stacked denoising
autoencoders (SDAs, which were used in the image recognition originally) in
the compressed audio domain (MDCT, basically MP3, but without Huffman
encoding). It worked, but a simple nearest neighbour model was producing a lot
more realistic voice.

Here's what audio data in that compressed domain looks like [you'll need to
zoom in]:
[https://raw.github.com/dchichkov/DeepLearningTutorials/maste...](https://raw.github.com/dchichkov/DeepLearningTutorials/master/code/plots/hhgttg0101007.png)

------
0x09
This is funny, I was fooling around with something really similar last week.
The difference is rather than operating on a STFT spectrum, instead transform
the whole signal, re-order the coefficients by folding from low to high in the
desired dimension, and invert the transform in that dimension, effectively
changing from time to "spatial" domain or vice versa.

I won't post any image to audio examples since they are generally really
annoying, but here is a handful of audio samples converted to images (mainly
game samples due to size and features):

[http://0x09.net/img/sfx1.png](http://0x09.net/img/sfx1.png)

[http://0x09.net/img/sfx2.png](http://0x09.net/img/sfx2.png)

[http://0x09.net/img/bgm1.png](http://0x09.net/img/bgm1.png)

[http://0x09.net/img/bgm2.png](http://0x09.net/img/bgm2.png)

[http://0x09.net/img/bgm3.png](http://0x09.net/img/bgm3.png) (warning 22mb)

It's actually invertible aside from quantization and rounding errors, so there
is for example this mp3-compressed Lenna:

[http://0x09.net/img/lenna.mp3.png](http://0x09.net/img/lenna.mp3.png) (the
colorful fuzz is mainly from quantization/rounding rather than the compression
itself)

Or downsampled:
[http://0x09.net/img/lenna11khz.png](http://0x09.net/img/lenna11khz.png)

Or upsampled:
[http://0x09.net/img/lenna96khz.png](http://0x09.net/img/lenna96khz.png)

I can't imagine this is actually useful for anything but it is pretty neat.

~~~
simcop2387
I'm amazed at how the 11khz lenna actually looks more like dithering than
anything else. It's really uncanny how it looks like a Floyd-Steinberg
dithered lenna [1]. I suppose that's probably the intent of the dithering, so
that you can turn a higher frequency (the dither) into a lower one with
aliasing basically.

[1]
[http://www.malcolmmclean.site11.com/www/BinaryImageProcessin...](http://www.malcolmmclean.site11.com/www/BinaryImageProcessing/lenafloyd.gif)

------
tlarkworthy
Template matching is normally doing the pixelwise correlation between source +
offset and dest (in 2D).

It's a really basic algorithm, even more basic in 1D. So it would be pretty
trivial to just compare the shifted spectrogram profiles for a massive gain
in performance.

In fact, you probably don't even need the "shift" bit of the algorithm because
you will end up comparing one frequency to a different frequency, which does
not make much sense (outside of Doppler shift calcs). So it's a really
long-winded approach to taking the cross correlation of two spectrograms that
introduces a load of unwanted homomorphism.

see also
[http://dsp.stackexchange.com/questions/736/how-do-i-implemen...](http://dsp.stackexchange.com/questions/736/how-do-i-implement-cross-correlation-to-prove-two-audio-files-are-similar)

I would like to add that just because an algorithm is in a vision processing
library, doesn't make it "clever". Basic template matching implementation is 4
nested for loops. Clever template matching is FFT to save one or two of those
loops, but the results are the same. In audio processing you don't need all
those loops because the signal is 1D. So going via a vision processing library
is just introducing pointless loops that do nothing but worsen results and
slow down code.
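
A quick 1D sketch of the "loops vs FFT, same results" point, in numpy. The function names, the random test signal, and the template position are hypothetical and chosen only for illustration; this is unnormalized cross-correlation, not the normalized variant used by `match_template`.

```python
import numpy as np

def xcorr_direct(signal, template):
    """Sliding dot product: two nested loops in 1D (four for a 2D image)."""
    n, m = len(signal), len(template)
    out = np.zeros(n - m + 1)
    for shift in range(n - m + 1):
        for k in range(m):
            out[shift] += signal[shift + k] * template[k]
    return out

def xcorr_fft(signal, template):
    """Same values via the correlation theorem: multiply spectra, then invert."""
    n, m = len(signal), len(template)
    size = n + m - 1  # zero-pad so circular correlation equals linear correlation
    spec = np.fft.rfft(signal, size) * np.conj(np.fft.rfft(template, size))
    return np.fft.irfft(spec, size)[:n - m + 1]

rng = np.random.default_rng(1)
signal = rng.standard_normal(500)
template = signal[123:163]

direct = xcorr_direct(signal, template)
fast = xcorr_fft(signal, template)
agree = np.allclose(direct, fast)  # same numbers, different cost
```

The direct version costs roughly n * m multiplications while the FFT version costs on the order of n log n, so the gap grows with the template length.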

------
wazoox
As a related note, nowadays it's becoming more common to use image processing
to restore audio from old film reels. It's way easier and more efficient to
first remove scratches and enhance contrast of the optical audio track before
going through standard audio tools.

------
heurist
In school I had a project that used edge detection to find certain patterns in
audio samples. There were probably better ways to do it than looking at the
spectrogram but I didn't pay enough attention in that class to know what they
were. I was always curious about what else could be done with image processing
algorithms and audio, it's cool to see a thread about it.

------
batemanesque
you might be interested in MetaSynth
([http://www.uisoftware.com/MetaSynth/index.php](http://www.uisoftware.com/MetaSynth/index.php))
& Izotope's Iris
([http://www.izotope.com/products/audio/iris/](http://www.izotope.com/products/audio/iris/))
which basically offer the converse of what you're doing here. fascinating from
both a technical & a sound design point of view.

the former's what Aphex Twin used to insert his face into the "Windowlicker"
spectrogram:
[http://en.wikipedia.org/wiki/Windowlicker#Background](http://en.wikipedia.org/wiki/Windowlicker#Background)

edit: his face is actually in "[equation]"; "Windowlicker" has a spiral.

------
Hydraulix989
CMU researchers followed this same approach some time ago:
[http://www.cs.cmu.edu/~yke/musicretrieval/](http://www.cs.cmu.edu/~yke/musicretrieval/)

~~~
dimatura
Yes, this seems pretty much what the author wants to implement. It's also
worthwhile to have a look at the paper describing Shazam's original algorithm
([http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf](http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf)).

It's interesting (but not surprising) to see that there are many parallels
among popular algorithms in computer vision and computer audio.

------
csmatt
Just yesterday I was looking at this
[http://m.fbi.gov/#http://www.fbi.gov/news/stories/2013/april...](http://m.fbi.gov/#http://www.fbi.gov/news/stories/2013/april/can-you-crack-a-code/can-you-crack-a-code)
as a fun, different challenge to work
on and wanted to be able to translate the dot-notation characters into
something more easily distinguishable. The pattern matching in python would
have helped a lot, I think.

As for the cross-domain use of these algorithms, I realized recently how
much overlap there is between what we perceive as 'different' domains. Both
audio and image processing tend to deal with detecting and quantifying
patterns. It was a Eureka moment for me, but I'm just a simpleton, so I
haven't done much with the new insight :/

------
istayhomeallday
There has been a fair amount of research in the Music Information Retrieval
(MIR) community that borrows from image processing techniques for the likes of
audio fingerprinting, onset detection, tempo estimation, chord recognition,
etc. Most of the techniques used start by transforming the audio into some
time-frequency representation (like an STFT, but also short-time versions of
the wavelet transform, constant-q transform, etc) as is done here.

For a specific implementation of computer-vision related techniques used for
audio fingerprinting, see:

[http://ecee.colorado.edu/~fmeyer/class/ecen5322/waveprint.pd...](http://ecee.colorado.edu/~fmeyer/class/ecen5322/waveprint.pdf)

However, a Google Scholar search should produce a lot more examples.

------
terramars
This is not "Computational Synesthesia" \- it's nice but that is much harder.

This is computational synesthesia -
[http://www.youtube.com/watch?v=yVOD0X4KbYk](http://www.youtube.com/watch?v=yVOD0X4KbYk)

I'm currently looking for someone to work with on building a hemispheric
projection for viewing these videos (in SF). Don't suppose anyone is
interested? Will release source code in conjunction with my first
installation. It's not real time yet but should be possible after GPU-ifying.

~~~
triggercut
I would recommend you look at the work of Professor Paul Bourke at the
University of Western Australia in Perth. He creates some amazing immersive
environments.

[http://paulbourke.net/dome/](http://paulbourke.net/dome/)
[http://paulbourke.net/exhibition/domeinstall/](http://paulbourke.net/exhibition/domeinstall/)

------
calebm
Very cool. I've been looking into doing some speaker recognition, and I've
considered using image processing algorithms as well - they seem like a good
way to recognize patterns in a spectrogram.

------
languagehacker
I'm not surprised at how well this worked. Humans detect consonant-vowel
pairs using f2 transformations, which are computed by specialized cells in the
ear/brain. The same such cells exist in bats and barn owls. Cats actually use
f2 transformations to communicate as well, and that's why they're actually
pretty good at expressing specific desires to humans based on how they meow.

So this is somewhat less synesthesia, and more of an organic, bioinformatic
method of processing that data. So very cool.

------
zenbowman
Many image processing algorithms are taken from the signal processing world,
and probably initially employed to process audio during the construction of
the Bell telephone system. There's a lot of mathematical similarity between a
visual and audio signal, the differences are primarily in the human receivers.

So this may well be a case of borrowing audio processing techniques from an
image processing library and then using it to process audio.

------
triggercut
Ignoring the problems with the author's title, I've found this article sums up
and describes a very robust solution to this sort of problem using Haar
Wavelet Decomposition and Locality-Sensitive Hashing.

[http://www.codeproject.com/Articles/206507/Duplicates-detect...](http://www.codeproject.com/Articles/206507/Duplicates-detector-via-audio-fingerprinting)

------
anigbrowl
Doing audio processing on a GPU isn't exactly a new thing:
[http://www.nvidia.com/content/GTC/documents/1011_GTC09.pdf](http://www.nvidia.com/content/GTC/documents/1011_GTC09.pdf)
comes from 2009 and people have been mucking around with this about as long as
it's been possible to write custom shaders.

------
tharshan09
What is a good way to do recommendation of music files? How would you analyse
them, and what method would you use to compare them? Is this possible with
just the raw data itself rather than looking at metadata within the files?

------
etrain
Would love to see how this works vs. just finding the max correlation of the
1d sampled stream vs the base data. If that's no good - try fft on the 1d
sample and rescale.

~~~
doctoboggan
I explored these options as well. In my testing I was able to get a factor of 6
speedup simply by reshaping a 1D array into 2 dimensions.

The FFT on the 1D audio would not work because the FFT of a sub sample would
not be immediately recognizable in the FFT of the whole. This is why I took
the spectrogram approach, which is many FFTs over time.
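
A toy illustration of that difference, in numpy. The two-note "song", the 8 kHz sample rate, and the 1000-sample frame are assumptions picked so every tone lands exactly on an FFT bin; real audio is messier.

```python
import numpy as np

rate = 8000
t = np.arange(rate) / rate                  # one second per note
note_a = np.sin(2 * np.pi * 440 * t)
note_b = np.sin(2 * np.pi * 880 * t)
song = np.concatenate([note_a, note_b])     # A for 1 s, then B for 1 s

def peak_hz(x):
    """Frequency of the strongest bin in a single FFT of x."""
    mag = np.abs(np.fft.rfft(x))
    return np.argmax(mag) * rate / len(x)

# One FFT over the whole song contains both notes but says nothing about order.
whole = np.abs(np.fft.rfft(song))
top_two = sorted(np.argsort(whole)[-2:] * rate / len(song))

# One FFT over a clip contains only what is in the clip (here just note B),
# on a different bin grid than the whole-song FFT.
clip_peak = peak_hz(song[rate:])

# Many short FFTs over time keep the order: strongest frequency per frame.
frame = 1000
track = [peak_hz(song[s:s + frame]) for s in range(0, len(song), frame)]
```

The per-frame `track` is what lets a clip be located as a contiguous run of columns, which the single whole-song spectrum cannot offer.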

~~~
etrain
Where does that 6x speedup come from? Is it a different algorithm? How is the
1d process different from the 2d? It might be that the 2d lib is just better
optimized.

------
junpuy
I'm thinking that real time visualizations of music based on this would be
amazing when high. I should go implement it.

