
Perceptual Image Hashing - specular
http://bertolami.com/index.php?engine=blog&content=posts&detail=perceptual-hashing
======
Jemaclus
This made me think of something, so bear with me for a second. I took Andrew
Ng's Machine Learning class on Coursera, and one of the things he talked about
was how, given a certain data set and a logistic regression approach, the
algorithm might decide to just always predict 1, since 98% of the
time the value is 1.

The examples given here were all predicted to have 95%+ similarity, but it
seems to me that given this dataset, I could be reasonably convinced that it
just assumes all images are reasonably similar. In other words, the author
provides no counter-examples that demonstrate that the algorithm can detect
_dissimilar_ images, such as a photo of a car and a clipart of pizza. I would
expect that pair to have very low similarity, and if the article demonstrated
the algorithm's ability to discern between similar and dissimilar images, I
would be more convinced of perceptual image hashing as a viable technique.

~~~
andrea_s
While your comment applies in a broader sense (we need to test both for cases
in which we expect the algorithm to find similarity and cases in which we
expect to find dissimilarity - lest we embrace confirmation bias), I believe
your reference to machine learning is a bit off.

In this case, we are not looking at model training; we are just evaluating
the outcome of a (static) algorithm. On the other hand, in the context of ML
you should try to include as much information as possible in your training
dataset. This is a concept that comes up in many fields, some of them just
tangentially related to machine learning (e.g. excitation signals for system
identification, speaking both in terms of frequency domain and differential
equations).

tl;dr: there is no "training set" to speak of in the presented article - hence,
concepts such as overfitting and information content (of the training set) do
not quite apply here.

~~~
Jemaclus
I wasn't making a direct comparison between machine learning and this
particular task. The article just made me think of that machine learning
problem, and I would like to see more examples that show a wide range of
results. For instance, there are no examples of false positives -- how often
would that happen in an image processor? Probably a lot, but I wouldn't know
it based on this particular article.

------
bcoates
I've used this technique before for image searching, here's some tips:

The algorithm presented here can be summarized as "Take the low-frequency DCT
components of the luminance channel, quantized down to one bit". Hashing is a
very misleading term for what you get from it: in particular, it doesn't have
anything like the uniform coverage of the output space you would expect from a
hash function, as almost all encountered images exist in a very small range of
hashes. Also, the bits in the hash are not all created equal: assuming you
encode them in a zig-zag pattern, truncation is equivalent to a low-pass
filter, so if you want to make the hash different sizes while matching on the
same image elements you'll want to alter the quantization amount, not the
output DCT size (this is directly equivalent to what JPEG does, which uses a
fixed DCT size but lossily quantizes the information in it to varying degrees
for different compression quality settings). Experiment with different
quantization techniques; you don't have to treat every bit in the output the
same.

This technique is only suitable for a small subset of the image files that you
will encounter in the real world: it's mostly unsuitable for anything but
photographs and photograph-like images (3d renders, paintings, etc.). Many
classes of image have no useful distinguishing information in the low-
frequency luminance space, such as screenshots, pictures of text, line
drawings, many forms of graph and diagram, most "clip art" style drawings,
pictures that include layout elements in the image file (like most of the
meme/Successories/demotivator format pictures you'll see on imageboards). Of
course, an image with a flat or heavily compressed luminance channel will
yield no information at all.

You don't have to do anything special to match flipped images, as you should
be using (iirc) DCT-II, which has the same output for a mirror image. It's
insensitive to scaling and mild translation.

For debugging purposes, note that the transformation performed is lossy but
easily reversible: you can undo all the steps but quantization to generate a
'stereotype' image of a particular hash value and its neighbors. There is no
more information available to the matching algorithm than you can see with
your own eyes with the round-trip conversion, so if the round-tripped image
preserves important distinguishing detail the matching will work and if it
turns the image into a meaningless grey smear you'll get tons of false
positive matches.
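
The round-trip described above can be sketched like so. This assumes, purely for illustration, that the hash is the sign bit of each retained coefficient; `idct2` and the fixed reconstruction magnitude are my own stand-ins:

```python
import numpy as np

def idct2(coeffs):
    # the inverse of an orthonormal DCT-II is just its transpose
    n = coeffs.shape[0]
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m.T @ coeffs @ m

def stereotype(bits, hash_size=8, out_size=32, magnitude=10.0):
    # give every hash bit back a coefficient of fixed magnitude and invert;
    # everything else (including all high frequencies) is lost to quantization
    coeffs = np.zeros((out_size, out_size))
    block = np.where(bits.reshape(hash_size, hash_size), magnitude, -magnitude)
    coeffs[:hash_size, :hash_size] = block
    coeffs[0, 0] = 128.0 * out_size  # arbitrary DC so the image sits mid-grey
    return idct2(coeffs)
```

If the blurry image this produces still shows the distinguishing detail you care about, matching has a chance; if it's a grey smear, it doesn't.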

~~~
vosper
Is there a good technique / library for comparing screenshot-like images?

------
667
Is it different from existing implementations? Am I missing something?

[http://phash.org/](http://phash.org/)

[http://hzqtc.github.io/2013/04/image-duplication-detection.html](http://hzqtc.github.io/2013/04/image-duplication-detection.html)

[http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html](http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html)

~~~
vitovito
A more tactful question might be, "How is it different from existing
implementations?" And then don't ask "Am I missing something?"

Best practice in engineering is always to see what else is out there, so to
assume the author hasn't conducted a review of the field might be read as
rudeness or snark.

Many developers, even when they do compare and contrast their own work as they
develop it, don't write up the results or list their peer projects, so it's
otherwise not an unfair question to ask.

~~~
667
But I actually want to know what I'm missing-- what's interesting about this
"closed source but available for academic purposes" application which appears
to simply be a re-implementation of existing work?

I'm trying to give benefit of the doubt here actually.

~~~
ahknight
It's using a very similar technique to pHash, actually. DCT-based hash and
all. It's not brand new, but perhaps a variation on a theme.

I've actually been working in this domain for the last week or so and DCT-
based hashes are quite accurate, but slow as hell. Average hashes (aHash) or
Delta Hashes (dHash) are _much_ faster and rather good at weeding out large
numbers of images. A mix of approaches is usually a good idea.
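
For reference, a dHash of the kind mentioned above really is only a few lines, which is why it's so much faster than a DCT. A hedged sketch (the crude box sampling stands in for a proper resize):

```python
import numpy as np

def dhash(gray, size=8):
    # difference hash: one bit per horizontal neighbour comparison
    img = np.asarray(gray, dtype=np.float64)
    ys = (np.arange(size) * img.shape[0]) // size
    xs = (np.arange(size + 1) * img.shape[1]) // (size + 1)
    small = img[np.ix_(ys, xs)]                      # size x (size+1) thumbnail
    return (small[:, 1:] > small[:, :-1]).flatten()  # 64 bits for size=8
```

Being cheap, it works well as a first-pass filter before a slower, more accurate DCT-based comparison.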

~~~
opless
Yup, it's been shown that the accuracy of a set of poor classifiers taken
together can be greater than that of a known 'good' classifier.

The implication is that there's no single 'perfect' classifier for a problem;
engineers have used this notion for many years, having multiple systems vote
for fail-safe operation.

------
elzr
Hash functions impose practical mappings between unlike domains. When volatile
(~1-to-1) identifying (crypto), when correlated (~n-to-n) perceiving (AI).

(Tweet summary of the article. I'm becoming increasingly fascinated by hash
functions. I'm finding all important this tension between
abstraction/correlation/perception & groundedness/volatility/identification.)

[https://twitter.com/elzr/status/497893639000190976](https://twitter.com/elzr/status/497893639000190976)

~~~
morsch
That is a remarkable way to put it. But aren't hash functions -- by
definition -- n-to-1? That seems to be what the perceiving function is about: map
multiple sensations into a single correlated perception.

~~~
elzr
Thanks for pointing it out, you're right! I rewrote the summary (the original
is still at the linked tweet). So it's ~1-to-1 (usually, that's what I mean
with practical) when it's volatile. And I'm calling it ~n-to-n (again,
usually, practically) when it's correlated: mapping ~n correlations in the
source domain to ~n correlations in the destination.

To my mind, an ~n-to-1 mapping would actually be a bit more like idealistic
Platonic classifying than Wittgensteinian perception (~"we spin perceptual
threads by twisting n attributes like fiber on fiber. And the strength of the
thread does not reside in the fact that some 1 fiber runs through its whole
length, but in the overlapping of the fibers.").

What do you think? :)

------
bravura
Also look at Semantic Hashing, which uses deep learning to compress an input
(e.g. a document) into a low dimensional bit vector with fast lookup:

[http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pd...](http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf)

The idea is that you could hash one billion objects, and do a very fast lookup
of similar objects.

There is later academic work that refines this approach. For example, here is
work applying it to image search:
[http://www.cs.utexas.edu/~grauman/temp/GraumanFergus_Hashing...](http://www.cs.utexas.edu/~grauman/temp/GraumanFergus_Hashing_chapterdraft.pdf)

You can consider this a variant of Locality-Sensitive Hashing, where the hash
function is specifically induced through a machine learning technique.
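
The "LSH with a learned hash function" framing is easiest to see against the classic random-projection version of LSH, where the hyperplanes are random rather than learned. A minimal sketch:

```python
import numpy as np

def random_projection_hash(vectors, n_bits=16, seed=0):
    # classic LSH: one bit per random hyperplane; vectors at a small
    # angle to each other tend to land on the same side of most planes
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return vectors @ planes > 0
```

Semantic hashing replaces these random planes with a projection learned from data, so that semantically similar inputs collide by design rather than by chance.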

------
daveloyall
Does the term "perceptual hashing" only apply to the domain of images? What
would you call this concept applied to text files? I am familiar with
Levenshtein distance, but that family of algorithms provides a similarity
score or value, whereas I'd like a hash as the output.

~~~
vitovito
What would you be hashing, on a text file? What the text "looks like," e.g.
particular density and repeated patterns of alphanumerics?

Or would you be hashing the content, the language, the words used, the meaning
of the words?

Stylometry is the study of written style. It's used forensically to see if two
texts were written by the same person, to identify anonymous/pseudonymous authors, and
there's "adversarial stylometry," which is intentionally changing a writing
style to make it look like someone else wrote something.

This deck seems to summarize a lot of these points and issues, but offers no
"stylometric hash" solution:
[http://wiki.uni.lu/mine/docs/RDPresentation.ppt](http://wiki.uni.lu/mine/docs/RDPresentation.ppt)

Plagiarism detection is another field that might do these sorts of analyses.
How many metaphors do you have to change, and synonyms do you have to replace,
before it's an original work, or would it always still hash nearby?

Textometry tries to make texts more analyzable. "Wordprints" or stylometric
fingerprints seem to turn up more results than "hash".

~~~
daveloyall
Hm. Well, I want to hash the pattern of bits in the text file, like a
cryptographic hash does (suppose md5 or sha-1 for simplicity's sake).

I'll provide some examples of input and output. These examples happen to
contain no linefeeds.

Suppose:

    Hello, World! --> 65a8e27d8879283831b664bd8b7f0ad4

Then I want something like:

    Hello, Worlds! --> 65a8e27d8879283831b664bd8b7f4ad4

...Rather than what md5 currently provides:

    Hello, Worlds! --> d0478649ad1c15f0e623846c3e26ebeb

Basically, I love hashes and I use them all the time, but for many purposes I
am only _accidentally_ using the cryptographic features and in fact I would
sometimes find it nice if similar inputs had similar outputs.

Said differently: I'd like a hashing algorithm where the Levenshtein distance
between any given two outputs correlates with the Levenshtein distance between
the corresponding inputs.

I can imagine lots of uses for such a tool. ...But I can't imagine how it
could be possible to make one: files (or strings) vary in length, for one
thing, but a good hash does not! Of course, I couldn't imagine md5 before I
saw it in action, either.

~~~
petercooper
Perhaps I'm missing the point, but I came up with a naive and sloppy partial
solution:
[https://gist.github.com/peterc/737d9178f02118f8e315](https://gist.github.com/peterc/737d9178f02118f8e315)
.. there are some weaknesses to this solution but it does perform similarly to
your examples.

~~~
daveloyall
Thanks, Peter. I don't speak Ruby. Uh, line 5 contains a small function
definition, which is applied to each ...slice of the input string, is that
right?

Do you (or anyone else) have time to describe this code in English or
pseudocode?

I'll start.

    Let there be a function called hash which takes a
        string named str and an integer named
        length_of_hash (which defaults to 20).
    Slice the string (str) into length_of_hash equally-sized
        pieces?
    Take the numeric value of each character (of each slice??)
        and do.. something to it, something involving
        modulo 256, unless it's zero.  Save all the
        results.
    Express each numeric result as hexadecimal and append
        all those together.  Return it, probably.

Clearly there are bugs in the translation. :)

~~~
petercooper
In short, convert the input string into an array of its character values,
group those values into sub-arrays of length 'x', add together the aligned
values in each sub-array and modulo each by 256, output the resulting values
in hex pairs joined together in a big string.

Example, with a hash length of 4: "helloworld" => [104, 101, 108, 108, 111,
119, 111, 114, 108, 100] => [[104, 101, 108, 108], [111,119,111,114], [108,
100]] => [67, 64, 219, 222] => "4340dbde"

So if one character is different in the string, only one hex value in the
output will vary too. However, the plan has flaws: changes spaced out at an
interval that matches your hash "length" will all land on the same hex value
in the output, and you run into the pigeonhole principle if you want a hash of
limited size to preserve the Levenshtein distance between the potential
inputs. I suspect there are far smarter solutions :-)
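
In case a non-Ruby version helps, here is my reading of that scheme in Python; it's a translation of the description above, not of the gist line by line:

```python
def locality_hash(s, length=20):
    """Sum aligned byte values per output slot, modulo 256, emit hex pairs.
    A one-character change in the input changes at most one hex pair."""
    data = s.encode("utf-8")
    sums = [0] * length
    for i, byte in enumerate(data):
        sums[i % length] = (sums[i % length] + byte) % 256
    # inputs shorter than `length` fill fewer slots, mirroring the grouping
    return "".join("%02x" % v for v in sums[:min(length, len(data))])
```

With a hash length of 4 this reproduces the "helloworld" => "4340dbde" walkthrough above, and changing a single character changes a single hex pair.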

------
TheLoneWolfling
How would an adversary break this?

My first idea would be to rotate it by something that's halfway between the
steps used. Say, 360/32 degrees. How does that compare?

Also: because it discards the high frequency data, one should be able to
construct something like this:
[http://cvcl.mit.edu/hybrid/CatDogHybrid.jpg](http://cvcl.mit.edu/hybrid/CatDogHybrid.jpg)
- where to us at a close distance it looks like one thing but to this it
looks like something else.

~~~
bcoates
It would be pretty easy to assign any image an arbitrary value. Compare the
image's current (unquantized) DCT values to the desired ones, find the value
that differs from the target but is closest to being quantized as the desired
bit, and apply a filter that just barely flips it. Each time you do this you
reduce the Hamming distance score between the image and the target by 1.

If getting close enough results in unsightly blotches on the image, reduce the
power of the low frequency luminance channel across the board, which will mask
the changes by making the unmodified high frequency components more
noticeable. That looks like what's being done in the catdog image, at least.

You could increase the saturation as well, as this fingerprinting system
ignores color.
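
As a toy illustration of that attack (again assuming, for illustration only, that the hash is simply the sign bit of each retained low-frequency coefficient), one iteration might look like:

```python
import numpy as np

def nudge_toward(coeffs, target_bits, hash_size=8, step=1.0):
    # one attack iteration: among the retained coefficients whose sign bit
    # disagrees with the target hash, flip the one closest to zero (cheapest,
    # i.e. the smallest visible change), reducing the Hamming distance by 1
    block = coeffs[:hash_size, :hash_size]
    disagree = (block > 0) != target_bits
    if not disagree.any():
        return coeffs
    cost = np.where(disagree, np.abs(block), np.inf)
    y, x = np.unravel_index(np.argmin(cost), cost.shape)
    out = coeffs.copy()
    out[y, x] = step if target_bits[y, x] else -step
    return out
```

Repeating this until no disagreements remain assigns the image an arbitrary hash value with a sequence of minimal coefficient tweaks.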

------
petercooper
If this sort of thing fascinates you, here's a related approach:
[http://grail.cs.washington.edu/projects/query/](http://grail.cs.washington.edu/projects/query/)
.. for some reason numerous people were working on image search features like
this (i.e. draw what you want to find) 10 years ago but they don't seem to
have caught on other than with Google Images.

------
alceufc
Very interesting approach.

I have worked with image feature extraction in the past. Although DCT
coefficients have been used as a way to analyze texture features, the idea
of generating the hash (step 5) seems to be new.

I am curious, however, why you are discarding color information. Usually for
reverse image search this kind of information can be quite useful.

~~~
ahknight
Not new. pHash (phash.org) has been doing it for a while now. :)

------
munificent
This seems to work well for modifications that are basically in-place and high
frequency, but how well does it handle things like one image being a crop of
another, or translation, or flipping?

------
ktzar
Hashes are supposed to change a lot with small changes in the data source. I
wouldn't call this hashing, but fingerprinting, and there are plenty of
fingerprinting algorithms for images, some related to steganography.

~~~
ahknight
A cryptographic hash, yes. Not all hashes are cryptographic hashes.

