
A Comparison of Image Hashing Libraries - spolu
http://totems.co/blog/comparison-image-hashing-libraries/
======
spolu
I realize now that I could have given the basic principles behind each of the
two libraries compared here:

libPuzzle: Splits the image into blocks and computes the hash based on the
relationships between the brightness of adjacent blocks.
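
To make the block idea concrete, here is a minimal sketch. It is not libPuzzle's actual signature format: the grid size, the tolerance, the three-level coding (-1/0/+1) and the function names are all assumptions chosen for illustration, and the input is assumed to be a grayscale image already loaded as a list of rows.

```python
def block_means(pixels, grid=4):
    """Average brightness of each cell in a grid x grid partition of the image."""
    h, w = len(pixels), len(pixels[0])
    means = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            ys = range(gy * h // grid, (gy + 1) * h // grid)
            xs = range(gx * w // grid, (gx + 1) * w // grid)
            cell = [pixels[y][x] for y in ys for x in xs]
            row.append(sum(cell) / len(cell))
        means.append(row)
    return means


def neighbour_signature(pixels, grid=4, tolerance=2.0):
    """Code each block's right and lower neighbour as darker (-1),
    similar (0) or brighter (+1). Hypothetical simplification."""
    m = block_means(pixels, grid)
    sig = []
    for y in range(grid):
        for x in range(grid):
            for dy, dx in ((0, 1), (1, 0)):  # right neighbour, lower neighbour
                ny, nx = y + dy, x + dx
                if ny < grid and nx < grid:
                    diff = m[ny][nx] - m[y][x]
                    if abs(diff) <= tolerance:
                        sig.append(0)
                    else:
                        sig.append(1 if diff > 0 else -1)
    return sig
```

Because only differences between neighbouring blocks are recorded, adding a constant to every pixel leaves the signature unchanged in this sketch.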

pHash: Computes the 8x8 DCT
([http://en.wikipedia.org/wiki/Discrete_cosine_transform](http://en.wikipedia.org/wiki/Discrete_cosine_transform))
representation of an image (lowest frequencies of the image). It then sets the
hash by comparing each of these 8×8 values to the mean DCT value (very
resilient to non-structural changes in the image).
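
A rough sketch of that DCT step, under stated assumptions: the image has already been reduced to a small square grayscale grid (32x32 in the usual descriptions), the DCT is left unnormalised, and the first (DC) coefficient is excluded from the mean so that overall brightness does not dominate it. The function names are hypothetical and this is not libpHash's code.

```python
import math


def low_freq_dct(pixels, k=8):
    """k x k lowest-frequency coefficients of an unnormalised 2D DCT-II."""
    n = len(pixels)
    # basis[u][x] is the 1D cosine basis for frequency u at position x.
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(k)]
    # Transform along each row first, keeping only the first k frequencies.
    rows = [[sum(p * b for p, b in zip(row, basis[u])) for u in range(k)]
            for row in pixels]
    # Then transform along the columns.
    return [[sum(rows[y][u] * basis[v][y] for y in range(n)) for u in range(k)]
            for v in range(k)]


def dct_hash(pixels):
    """64-bit hash: one bit per coefficient, set when it exceeds the mean."""
    coeffs = [c for row in low_freq_dct(pixels) for c in row]
    # Assumption: the DC term (coeffs[0]) is left out of the mean.
    mean = sum(coeffs[1:]) / (len(coeffs) - 1)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

Two images are then compared by the Hamming distance between their hashes. A uniform brightness shift only moves the DC coefficient, which is why this kind of hash barely changes under it.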

I updated the post with this information.

------
jo_
I just finished writing about distance hashing functions with a slightly
different angle. I visualized the distances between a bunch of images using
two different techniques, one of which was pHash (discussed in the parent
article). Mine isn't quite as in-depth performance-wise, but it makes for
pretty pictures. Some of my work is here:
[http://www.josephcatrambone.com/?p=619](http://www.josephcatrambone.com/?p=619)

I'm going to upload the SHA distance tonight.

------
0x09
libpHash is actually quite slow for what it does. I spent a fair amount of
time investigating image hashing algorithms a few years ago and at that time I
saw a 10-20x improvement over libpHash just by implementing the similar phash
algorithm described in Neil K's blog.* Puzzle was both slower and
dramatically less accurate on my body of test images. Perceptual hashing can
be surprisingly lightweight -- by the end of the experiment I was really just
benchmarking image loading libraries. If speed is a concern you are probably
better off forgoing these libs and writing the 2-3 dozen lines of code
(really!) it takes to roll your own, or better yet implementing a comparable,
even more lightweight algorithm like dhash.

* [http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)

* The main difference being that libpHash applies a gaussian blur over the image, which can be made redundant by using a decent resampling algorithm.
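
For a sense of how small such a hash can be, here is a sketch of a difference hash along the lines of the dhash mentioned above. It assumes the image has already been converted to grayscale and resized to 9 columns by 8 rows by whatever imaging library is at hand; that step is deliberately left out, and the function names are illustrative.

```python
def dhash(pixels):
    """Difference hash of an 8-row x 9-column grayscale grid.

    Each row yields 8 bits: a bit is 1 when a pixel is brighter than
    its right-hand neighbour. 8 rows x 8 bits = 64-bit integer.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Usage: two synthetic grids with opposite horizontal gradients.
rising = [list(range(9)) for _ in range(8)]
falling = [row[::-1] for row in rising]
distance = hamming(dhash(rising), dhash(falling))  # every bit differs
```

Since only the sign of each neighbouring difference is kept, uniform brightness or contrast changes leave the hash untouched in this sketch.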

------
drewcummins
I wrote this after spending way too much time on this problem earlier this
year--nothing new, but it is a fair chronology of my approach.

[http://download.picturelife.com.s3.amazonaws.com/press-
kits/...](http://download.picturelife.com.s3.amazonaws.com/press-
kits/ImageSimilarityWhirlwind.pdf)

~~~
spolu
Nice write-up! Very interesting.

------
phoboslab
We've been using phash for an image board for a while now and are quite happy
with it. We only use it to detect reposts when someone uploads an image. It
gives false positives fairly often, but that's totally okay for our use
case. We specifically set it up to err on the safe side. Users are only
presented with an "Are you sure your upload is not a duplicate?" message.

Currently we're just doing a `WHERE BIT_COUNT(images.phash ^ inputHash) < 12`
in MySQL over 400k rows, which still works reasonably well (~200ms) given that
it can't use an index for the XOR/BIT_COUNT operation. To my knowledge there's
no way to speed up this query in MySQL, so if we continue to grow we will
probably have to write a small daemon that is able to search hashes more
efficiently.
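
One way such a daemon could index the hashes is a BK-tree over Hamming distance. This is a swapped-in technique for illustration, not something described in the comment above; the class and method names are hypothetical. The `max_dist=11` in the usage mirrors the `< 12` threshold of the SQL query.

```python
def hamming(a, b):
    """Number of differing bits, the same quantity as BIT_COUNT(a ^ b)."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree for integer hashes under Hamming distance."""

    def __init__(self):
        self.root = None  # each node is [hash_value, {distance: child_node}]

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = hamming(value, node[0])
            if d == 0:
                return  # identical hash already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query, max_dist):
        """Return all stored hashes within max_dist bits of query."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = hamming(query, value)
            if d <= max_dist:
                results.append(value)
            # Triangle inequality: a match can only sit under an edge
            # whose label lies within max_dist of d.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return results


# Usage: index a few hashes, then look for near-duplicates of one.
tree = BKTree()
for h in (0x0F0F0F0F0F0F0F0F, 0x0F0F0F0F0F0F0F0E, 0xFFFFFFFF00000000):
    tree.add(h)
matches = tree.search(0x0F0F0F0F0F0F0F0F, max_dist=11)
```

How much this saves over a linear scan depends on the threshold: with a radius as wide as 11 out of 64 bits the pruning may be modest, so it would be worth measuring against a plain in-memory scan before committing to it.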

~~~
jo_
What is your data type for inputHash and images.phash? ByteArray? Character
array? Blob?

~~~
phoboslab
Just a 64 bit integer (BIGINT).

------
albertzeyer
Some relevant and interesting Stack Overflow questions:

[http://stackoverflow.com/questions/4196453/simple-and-
fast-m...](http://stackoverflow.com/questions/4196453/simple-and-fast-method-
to-compare-images-for-similarity) (my own :))
[http://stackoverflow.com/questions/75891/algorithm-for-
findi...](http://stackoverflow.com/questions/75891/algorithm-for-finding-
similar-images)
[http://stackoverflow.com/questions/596262/image-
fingerprint-...](http://stackoverflow.com/questions/596262/image-fingerprint-
to-compare-similarity-of-many-images)

------
quarterwave
Is an image hash a very different beast from the hashing function used in
passwords etc? In the latter we want large sensitivity to small changes, while
in an image hash we want a measure sensitive to similarities.

Naively I'd expect image hashing to be like cross-correlation (non-linear)
while password hashing can be done with shifts and modulo-2 (linear).
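
The contrast can be seen directly with a cryptographic hash. The sketch below uses SHA-256 from Python's standard `hashlib` purely as an example of that class (the helper names are made up): changing a single input character flips roughly half of the 256 output bits, whereas a perceptual hash is built so that similar inputs land a small Hamming distance apart.

```python
import hashlib


def bit_distance(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def sha256_distance(m1: bytes, m2: bytes) -> int:
    """Number of differing bits between the SHA-256 digests of two messages."""
    return bit_distance(hashlib.sha256(m1).digest(),
                        hashlib.sha256(m2).digest())


# Usage: two messages that differ in one character.
d = sha256_distance(b"hello world", b"hello worle")
```

Here `d` carries no information about how similar the two inputs were, which is the property one wants for passwords and the opposite of what an image hash is for.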

------
jimktrains2
Be sure to enable JavaScript or you won't be able to see the gist with the
results. (Obviously including a plain table in the post would be too much
work :-\)

~~~
mattbessey
You also won't be able to use most of the internet in 2014, so this point
seems moot.

~~~
valarauca1
I have run NoScript 100% of the time for more than a year.

The only time I have consistent issues is with "Show HN:" posts and start-up
websites. Most of my day-to-day browsing is hardly affected by this.

------
luminati
The pHash library is GPL licensed. If you are building a closed source
commercial product, you need to purchase a license.

~~~
spolu
The node-phash implementation is MIT-licensed: [https://github.com/aaronm67/node-
phash/blob/master/LICENSE](https://github.com/aaronm67/node-
phash/blob/master/LICENSE)

~~~
spolu
Sorry, node-phash is just a wrapper. The pHash license is indeed GPL v3, which
means, as you say, a commercial license should be obtained for commercial use.

~~~
onli
Aren't they/you using it for a hosted web platform? Then it wouldn't matter,
as long as it is not the AGPL.

~~~
spolu
Correct, I was just commenting on general commercial use!

~~~
mlinksva
No, for closed source distribution. Commercial can be open or closed source.
The GPL does not prohibit commercial use; it prohibits distributing without
sharing the source under the same free terms.

------
GFK_of_xmaspast
Isn't SURF patent-encumbered?

~~~
andersonfreitas
SIFT is patented. SURF is a similar and alternative method to SIFT.

~~~
jboy
SIFT and SURF are both patented in the US.

SIFT:
[http://www.google.com/patents/US6711293](http://www.google.com/patents/US6711293)

SURF:
[http://worldwide.espacenet.com/publicationDetails/biblio?CC=...](http://worldwide.espacenet.com/publicationDetails/biblio?CC=US&NR=2009238460&KC=&FT=E&locale=en_EP)

More info: [http://opencv-users.1802565.n2.nabble.com/SURF-protected-
by-...](http://opencv-users.1802565.n2.nabble.com/SURF-protected-by-patent-
tp3458734p3463927.html)

