
How Perceptual Hashes Work - brudgers
http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html
======
wickedchicken
The two main problems with this approach are that it is not rotation-invariant
and that it does not work well if the image is damaged or added to. A more robust
system (that, admittedly, will take longer) is to use one of the affine-
invariant feature detection algorithms pioneered by SIFT. SURF is a faster,
open-sourced version of SIFT that has many implementations. Essentially it
scans chunks of the image at different scales and identifies features that
peak even as the chunk around them gets bigger. Once these are identified,
they are described in a way that normalizes them to the same size and
orientation for lookup. Since these features should presumably be scattered throughout the
image, the image can be recognized even if certain features are obscured or
modified. It's certainly not as straightforward as a DCT metric on a
downsampled image, but the nature of widespread image capture, creation and
manipulation usually requires this robustness.
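For contrast, the simpler downsample-and-threshold scheme the article describes (the "average hash"; pHash swaps the mean for a DCT) fits in a few lines. A rough pure-numpy sketch, assuming the input is already a 2-D grayscale float array:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Downsample to hash_size x hash_size, then threshold each block
    against the global mean: one bit per block, 64 bits total."""
    h, w = gray.shape
    # crude box downsample: average over roughly equal blocks
    ys = np.arange(hash_size + 1) * h // hash_size
    xs = np.arange(hash_size + 1) * w // hash_size
    small = np.array([[gray[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                       for j in range(hash_size)]
                      for i in range(hash_size)])
    bits = (small > small.mean()).flatten()
    return sum(int(b) << i for i, b in enumerate(bits))

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

A rescaled copy of an image produces a near-identical hash, while an inverted copy flips every bit, which is exactly the fragility to heavier edits that the comment above is pointing at.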

~~~
wickedchicken
As a followup (and this is rather unrelated to the original post), you can
combine a feature detector with a statistical clustering algorithm to
automatically identify the generic visual properties of objects _in an
unsupervised manner_. One of the first papers attempting this is
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.2610&rep=rep1&type=pdf)
but many have followed it. From what I understand, these still simply check for
the existence of features in an image instead of their spatial relationship
('a picture of a cat has two ears, a tail, and a blob for a body' versus 'a
picture of a cat has two ears positioned on the end of a blob with a tail on
the other end of the blob'). Nevertheless, they represent the current state-
of-the-art in automatic object classification and recognition algorithms.

~~~
sigil
I went to a talk by Yann LeCun [1] a few months back, "Learning Feature
Hierarchies for Vision." The current state of the art in this field is mind-
blowing to an outsider. The final demo was a program that he trained in a
matter of seconds to recognize and distinguish faces of various random
audience members, in real time, from different viewing angles.

[1] <http://yann.lecun.com/>

~~~
_delirium
Is any of that software available?

~~~
sigil
Some of the libraries are [1]. Probably not the whole kit and caboodle.
Interestingly, there was a focus on special-purpose hardware that made
convolutional network learning possible in real time. One of the other demos
was an autonomous driving robot that learned to recognize obstacles in video.
Again, just mind-blowing.

[1] <http://www.cs.nyu.edu/~yann/software/index.html>

------
stevetjoa
If you're interested in this article, then you may be interested in locality
sensitive hashing (LSH), a randomized hash that has been used seemingly
everywhere. I recently used it to speed up music source separation (papers
pending).

The idea is similar to the one mentioned in this article, but more general.
Unlike a cryptographically secure hash where x != y implies that h(x) != h(y)
(collisions aside), LSH says that if x and y are "near", then P(h(x) = h(y))
is "high". This quality is important when doing robust similarity search. For
example, if your image is noisy or rotated or scaled, you hope that you can
still find the clean version in a database.

LSH has been used in many application domains including images, video, music,
text, bioinformatics, and more. LSH is not directly comparable to a feature
extraction algorithm such as SIFT.

[Edited for clarity.]
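The "near inputs collide with high probability" property is easy to see with random-hyperplane (SimHash-style) LSH for cosine similarity; a toy numpy sketch, with all sizes and the perturbation chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(v, planes):
    """One bit per random hyperplane: the sign of the projection.
    Vectors at a small angle agree on most bits with high probability."""
    return tuple((planes @ v) > 0)

dim, n_bits = 32, 16
planes = rng.standard_normal((n_bits, dim))

x = rng.standard_normal(dim)
near = x + 0.01 * rng.standard_normal(dim)  # slightly perturbed copy

def agree(a, b):
    """Count hash bits on which a and b agree."""
    return sum(p == q for p, q in zip(lsh_hash(a, planes), lsh_hash(b, planes)))
```

Identical inputs agree on all bits, and the perturbed copy agrees on almost all of them, so bucketing on the hash finds near neighbors without a linear scan.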

~~~
cemerick
Thank you for that term. LSH looks like a very useful technique.

------
piotrSikora
This reminds me of OpenSSH's fingerprint visualization support ("VisualHostKey
yes"):

    
    
        $ ssh A.B.C.D
        Host key fingerprint is b0:c9:c9:96:fb:fd:ac:a4:ff:70:8f:1b:35:f4:f9:2e
        +--[ECDSA  256]---+
        |                 |
        |                 |
        |      .       .  |
        |     o *     . ..|
        |      O S     o..|
        |     . .     . ..|
        |      .   o o   .|
        |       . + + +E. |
        |        o.++*....|
        +-----------------+
        
        me@A.B.C.D's password:
    

Original article introducing this feature:
[http://www.undeadly.org/cgi?action=article&sid=200806150...](http://www.undeadly.org/cgi?action=article&sid=20080615022750)
(2008/06/26).

~~~
wickedchicken
This is pretty much the opposite of what the article is talking about -- the
article is trying to get a hash from an image in order to compare that image
to another, while you're talking about synthesizing an image from an arbitrary
hash...

~~~
killerswan
Well, one could use a variation of this technique to compare tons of other
things... Music?

~~~
bct
<http://musicbrainz.org/doc/Audio_Fingerprint>

The Picard MP3 tagger uses this kind of thing (and MusicBrainz' database of
tracks and audio fingerprints) to identify files.

------
NickC_dev
There's a publicly available implementation of a perceptual hashing algorithm
called phash at <http://phash.org>. I use some of their C++ code to detect
reposts on an image sharing site I run (<http://lolstack.com>).

~~~
pavel_lishin
Could you add RSS?

~~~
NickC_dev
It's definitely on my summer to-do list. You're the first person to ask.

------
kenjackson
That was one of the best articles I've read in a while.

------
seanalltogether
I've always wondered how services like Shazam work. I'm amazed that they can
do this kind of perceptual hash against ANY 10 second portion of a song. How
do they search against something like that when they don't know the start or
end time of the segment that is being input?

~~~
stevetjoa
I do research in music information retrieval. See the ISMIR 2003 paper below.
In short, it searches for landmarks in the spectrogram, hashes those
landmarks, then compares those hashes against database hashes for temporal
continuity. <http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf>

A seminal paper on audio fingerprinting is the one by Haitsma and Kalker.
<http://ismir2002.ismir.net/proceedings/02-fp04-2.pdf>
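The landmark scheme in the Wang paper can be caricatured in a few lines: pair up nearby spectrogram peaks, use the (f1, f2, Δt) triple as the hash, and vote on the time offset between query and database. A toy sketch with made-up peak lists (a real system would extract the peaks from an actual spectrogram):

```python
from collections import Counter

def landmark_hashes(peaks, fanout=3):
    """Pair each (time, freq) peak with a few later peaks; the
    (f1, f2, dt) triple is the hash, stored with the anchor time."""
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fanout]:
            out.append(((f1, f2, t2 - t1), t1))
    return out

def match_offset(db_peaks, query_peaks):
    """If the query is a snippet of the track, matching hashes line up
    at one consistent time offset; the mode of the offsets finds it."""
    index = {}
    for h, t in landmark_hashes(db_peaks):
        index.setdefault(h, []).append(t)
    offsets = Counter()
    for h, t in landmark_hashes(query_peaks):
        for t_db in index.get(h, []):
            offsets[t_db - t] += 1
    return offsets.most_common(1)[0][0] if offsets else None
```

This is why the snippet's start time doesn't matter: the vote is over relative offsets, not absolute positions.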

~~~
starkdg
There's a gpl implementation of an audio information retrieval approach here:
<http://code.google.com/p/audioscout/>

------
michaeldhopkins
Thank you, I learned something.

Since TinEye pre-computes the hashes, do they use something like Redis to
retrieve information? Redis seems perfect for such quick results, using the
hash as the key and a URL or object of some kind as the value.

~~~
barrkel
It seems to me that what you want is some kind of spatial index - you might
not get an exact match on the hash, but instead get one that's one or two bits
away, and you'll want something better than linear search to find it.
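One standard trick for better-than-linear Hamming search is multi-index hashing: split each 64-bit hash into four 16-bit pieces; by pigeonhole, any two hashes within 3 bits of each other must agree exactly on at least one piece, so exact-match tables on the pieces give a small candidate set to verify. A sketch of the idea (illustrative only, not TinEye's actual scheme):

```python
from collections import defaultdict

CHUNKS = 4  # 64-bit hash split into four 16-bit pieces

def pieces(h):
    """If two hashes differ in <= 3 bits, at least one of the four
    16-bit pieces is identical (pigeonhole principle)."""
    return [(h >> (16 * i)) & 0xFFFF for i in range(CHUNKS)]

class HammingIndex:
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(CHUNKS)]
        self.hashes = {}

    def add(self, key, h):
        self.hashes[key] = h
        for i, p in enumerate(pieces(h)):
            self.tables[i][p].add(key)

    def query(self, h, max_dist=3):
        # gather candidates that share at least one exact piece,
        # then verify the full Hamming distance
        cands = set()
        for i, p in enumerate(pieces(h)):
            cands |= self.tables[i].get(p, set())
        return [k for k in cands
                if bin(self.hashes[k] ^ h).count("1") <= max_dist]
```

The exact-match lookups are constant-time per table, so the expensive bit-counting only runs on the candidate set.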

~~~
cemerick
Doesn't scaling the images down (to 32x32 for the pHash approach) achieve
essentially the same thing? Images that differ only slightly will likely scale
down to the same thumbnail to begin with, and the resulting hash still bears
some relationship to that thumbnail — so you should be able to look at similar
hashes to find similar inputs.

~~~
jbri
Often you get things like images that have had text added to them (e.g. I
often use tineye to find the original source image that someone has added a
caption to), which means that the thumbnails will differ to some degree.

------
tobylane
TinEye must take this so much further: one image I looked up was about half
of one of the pictures returned; I think it was on a page of a book.

From the reddit link further down - "A Fourier transform takes a signal in the
time domain and breaks it down into its frequency components. Simplified, it
takes a CD and produces sheet music.", "To be clear, OP says this is a
matching algorithm - it's not what tineye uses, because matching the signature
from the searched-for image with the database of previous signatures would
probably be a nightmare."

------
andrewflnr
Wow, I had no idea it was so easy. Could this in fact be used to combat
copyright infringement for the likes of The Oatmeal on a large scale, given
someone (Google) with the requisite computing power?

~~~
gsmaverick
From the looks of it, this is actually TinEye's main commercial activity; they
merely expose the search interface to demonstrate their technology.

------
noblethrasher
"With pictures, high frequencies give you detail, while low frequencies show
you structure."

I have a vague idea of what this means but can someone please explain it in a
bit more detail?

~~~
joshu
Ok, I'll take a shot at this.

Let's work in one dimension rather than two dimensions. It's easy enough to
extend later.

You know that any signal is the sum of a (potentially infinite) number of
sine waves. For example, a square wave is the sum of ever higher-frequency
(but smaller-amplitude) sine waves.

The higher frequencies are necessary to get the sharp edges.

If you strip the high frequencies, the sharp edges disappear, leaving only the
larger motions of the lower-frequency (but larger-amplitude) waves.

So the low frequencies are the hill, and the high frequencies are the grass.

Does that make sense?

Edit: Here's an image: <http://cnx.org/content/m0041/latest/fourier4.png>
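A quick numpy rendering of the square-wave example: summing more odd harmonics sharpens the edges, while keeping only the lowest ones leaves just the broad shape.

```python
import numpy as np

t = np.linspace(0, 1, 1000, endpoint=False)

def square_partial(n_harmonics):
    """Partial Fourier series of a square wave: sum of the first
    n odd harmonics, each with amplitude 4/(pi*k)."""
    s = np.zeros_like(t)
    for k in range(1, 2 * n_harmonics, 2):  # odd harmonics 1, 3, 5, ...
        s += (4 / (np.pi * k)) * np.sin(2 * np.pi * k * t)
    return s

sharp = square_partial(50)   # many high frequencies: steep edges
smooth = square_partial(2)   # low frequencies only: a gentle wave
```

The 50-harmonic version sits near the ±1 plateaus with steep transitions; the 2-harmonic version is the "hill" without the "grass."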

~~~
noblethrasher
Thanks, that is a bit more helpful.

~~~
zackattack
another way i like to think of it (someone please correct me if i'm wrong) is
that high-frequency means high-detail (highly frequently needing information
to specify how it looks) whereas low-frequency means low-detail

(is that completely off or is it an analogous transform?)

~~~
gbog
What is misleading when one talks about the Fourier transform for pictures is that
it has nothing to do with the waves emitted by the colored particles and
received by our eyes. It is more about the spatial distribution of
intensities.

Applied to sound, this "frequency view" is much more natural: we hear a
sound, and there is a low and a high part of it. It's because our ears really
do real time frequency analysis, a kind of biological Fast Fourier transform.

From what I remember, doing this transformation is just a matter of taking the
original signal s, measuring the amplitude n of its lowest frequency component
f, subtracting that component, and recursing on the remainder with the next
frequency. The theorem says that in the limit you get two equivalent
representations of the signal: one is the wave itself, s = f(t), and one is its
"spectrum", s1 =

For many purposes, f(freq) is much more convenient than f(t), including
comparisons, frequency shifting, extraction, compression, etc.

It applies equally well to images, but for me the frequency representation of
a picture is not perceptively useful, maybe because our eyes are not Fourier
transforming what we see.

All that is an old story for me (I studied acoustics at IRCAM), so please
correct me if my memory is wrong.
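The equivalence of the two representations is easy to demonstrate numerically with the FFT (rather than the recursive subtraction sketched above, but it's the same decomposition), and it shows why operations like low-pass filtering are trivial in the frequency view:

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.standard_normal(256)       # an arbitrary signal, f(t)

spectrum = np.fft.fft(s)           # the f(freq) representation
back = np.fft.ifft(spectrum).real  # and back again, losslessly

# frequency-domain operations are cheap: zeroing high bins = low-pass
lowpass = spectrum.copy()
lowpass[16:-16] = 0                # keep only the 16 lowest bins each side
smooth = np.fft.ifft(lowpass).real
```

The round trip recovers the signal exactly (up to floating point), and the low-passed version is visibly smoother: consecutive samples change far less.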

~~~
joshu
indeed. it took me forever to wrap my head around fft of an image.

one thing you can do is read about how JPEG works; the DCT is a lot like a
generalized FFT.

------
there
libpuzzle is good for this type of work:
<http://libpuzzle.pureftpd.org/project/libpuzzle>

------
tsewlliw
That is completely brilliant! I want to go out and write an image diff viewer
that uses this on blobs in the image to detect pieces moving around!

~~~
zackattack
I don't understand? :)

~~~
tsewlliw
so you have a before and after image, say a webpage mockup where the logo
moved from top left to top right.

the blob finder finds the interesting pieces of each version, and the
perceptual hash picks which blobs match each other, and the software can say
with reasonable certainty that the top left part of the image was moved to top
right.

don't eat my lunch :)

------
CountHackulus
This is really interesting, but I'm sad that they skip over the step "convert
to grayscale." There are a ton of ways to convert an image to grayscale, each
with their own pros and cons.

How do you weight each channel? Do you convert to HSL and just use L? Do you
instead use Lab? HSV? Do you do a global or local algorithm? So many
questions!

~~~
mdaniel
It was my understanding that (for a lot of the steps) it does not matter what
mechanism you use, so long as you use the same one.

The problem you are addressing would matter if someone were trying to query
TinEye's database without submitting the image to TinEye's servers.

------
krisw
What a fascinating topic; it's too bad that Gazopa (mentioned at the end of
the article) just announced that they're shutting down their CBIR service:

[http://gazopablog.blogspot.com/2011/05/shut-down-notice-from...](http://gazopablog.blogspot.com/2011/05/shut-down-notice-from-gazopa.html)

------
buddydvd
Isn't tineye's signature algorithm based on Fourier transform?

[http://www.reddit.com/r/programming/comments/bvmln/how_does_...](http://www.reddit.com/r/programming/comments/bvmln/how_does_tineye_work/c0os84n)

~~~
icegreentea
As the post notes, that algorithm is great at matching but sucks at
searching. Hashes, on the other hand, are probably better at searching. My
guess is that you use some variation of hashing to get a set of candidate
images and then do more detailed examinations.

------
iandanforth
I have a feeling there is a deep connection between perceptual hashes and
compressed sensing. Could someone more familiar with the latter weigh in?

~~~
stevetjoa
Kinda sorta not really. Compressed sensing and sparse coding show that, under
certain sparsity assumptions, you can perfectly reconstruct your original data
with fewer bits than previously thought. It is a coding principle.

Perceptual hashes (or hashes in general) are used for fast indexing and
retrieval. You cannot recreate the original data from a hash, pretty much by
definition.

So hashes and coding algorithms both provide smaller representations of data.
But hashes are used for indexing and do not provide the original data, or even
an approximation thereof, while compressed sensing can.

~~~
mjb
The key thing here is that compressed sensing is an attempt to throw away
redundant data as early as possible in the process. Any data which is
redundant in the actual data stream, or can be inferred from prior knowledge
about the stream does not need to be measured.

Perceptual hashing is instead an attempt to make the matching problem easier
by throwing away data that is seen as irrelevant. In the case of the described
algorithm, low frequencies are chosen as relevant and high frequencies as
irrelevant. As you point out, this involves losing the ability to recover the
original signal and adds the risk of mismatching in cases where data thought
to be irrelevant to the task is actually relevant.

~~~
iandanforth
If we take the principles from compressed sensing and use a random-lens
approach to subsampling the original image, we can create a fingerprint of the
image which also happens to be able to reconstruct the original.

Both techniques are compressions that rely on the sparse properties of images
to divine which bits are meaningful and which are redundant. It appears to me
that using compressed sensing is just a smarter way of doing it. Maybe a hash
that starts with a random subsample is inherently slower for comparing
millions of images, but I shouldn't think so.

~~~
nuitblanche
I agree. Briefly looking at the description of it, it is some sort of
compressed sensing. The differences from traditional CS are minimal in fact,
but the scheme is in line with some of the work undertaken in manifold signal
processing. The differences are:

- The proposed hash is deterministic. Generally in CS you want to rely on
random projections (yet there are some results for deterministic problems) in
order to get some sort of universality, and by the same token some sort of
robustness.

- Steps 3 and 4 are the most fascinating, because they are clearly one of the
approaches used in manifold signal processing for images. To summarize, in
order for pictures to be close to each other on a manifold, you really want to
defocus them. I'll put something on my blog on the matter. This is the reason
why the hashes of two images next to each other are close in the "hash" or
manifold space.

- For one image, the hash seems to provide 16 measurements (16 bits of the
hash result). That would be OK if the initial picture were at the size and
color of the picture after steps 1 and 2, so in effect that information is
lost. However, in CS you also have "lossy" schemes such as the 1-bit
compressed sensing approach, where you retain only the sign of the measurement
(a little bit like step 6). The reconstructions of these 1-bit pictures are
not the original, but they are close.

(ps: I write a small blog on CS).

------
joshu
See also SIFT

~~~
copper
Isn't SIFT still patented, though?

------
zackattack
I don't understand why compressing an image is going to generate the same 8x8
image each time no matter what aspect ratio it was originally... whether it
has been stretched before.. If you stretch and then recompress a bunch of
times don't you eventually lose the information?

That math is for some reason totally counter-intuitive to me. Could someone do
a proof?

~~~
bryanh
From my understanding, it's not an MD5 hash or equivalent. It's just encoding
the 64 bits to a "hash", thus allowing a fuzzy comparison between hashes.
Someone correct me if I am wrong.

Obviously, this isn't robust enough to find all matches. A simple cropping
would throw it completely off.

------
mdonahoe
"I'll use a picture of my next wife, Alyson Hannigan."

That had me doing a few google searches.

