Hacker News new | past | comments | ask | show | jobs | submit login
A Comparison of Image Hashing Libraries (totems.co)
82 points by spolu on Oct 27, 2014 | hide | past | web | favorite | 28 comments

I realize now that I could have given the basic principles behind each of the two libraries compared here:

libPuzzle: Splits the image in blocks and compute the hash based on the relationships between the adjacent blocks brightness.

pHash: Computes the 8x8 DCT (http://en.wikipedia.org/wiki/Discrete_cosine_transform) representation of an image (lowest frequencies of the image). It then sets the hash by comparing each of these 8×8 values to the mean DCT value (very resilient to non-structural changes in the image).

I updated the post with these informations

I just finished writing about distance hashing functions with a slightly different angle. I visualized the distances between a bunch of images using two different techniques, one of which was pHash (discussed in the parent article). Mine isn't quite as in-depth performance wise, but it makes for pretty pictures. Some of my work is here: http://www.josephcatrambone.com/?p=619

I'm going to upload the SHA distance tonight.

libpHash is actually quite slow for what it does. I spent a fair amount of time investigating image hashing algorithms a few years ago and at that time I saw 10-20x improvement over libpHash just by implementing the similar phash algorithm described in Neil K's blog.* With Puzzle being both slower and dramatically less accurate on my body of test images. Perceptual hashing can be surprisingly lightweight -- by the end of the experiment I was really just benchmarking image loading libraries. If speed is a concern you are probably better off foregoing these libs and writing the 2-3 dozen lines of code (really!) it takes to roll your own, or better yet implementing a comparable, even more lightweight algorithm like dhash.

* http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...

* The main difference being that libpHash applies a gaussian blur over the image, which can be made redundant by using a decent resampling algorithm.

I wrote this after spending way too much time on this problem earlier this year--nothing new, but is a fair chronology of my approach.


Nice write-up! Very interesting.

We've been using phash for an image board for a while now and are quite happy with it. We only use it to detect reposts when someone uploads an image. It gives some false positives quite often, but that's totally okay for our use case. We specifically set it up to err on the safe side. Users are only presented with a "Are you sure your upload is not a duplicate?" message.

Currently we're just doing a `WHERE BIT_COUNT(images.phash ^ inputHash) < 12` in MySQL over 400k rows, which still works reasonably well (~200ms) given that it can't use an index for the XOR/BIT_COUNT operation. To my knowledge there's no way to speed up this query in MySQL, so if we continue to grow we probably have to write a small daemon that is able to search hashes more efficiently.

Hmmm...400K rows seems a bit small for this, but you might be able to build a (FLANN) based vocabulary tree that turns phash into a more stable signature (that makes use of database indexes). Then your SQL query would be more like: "where (images.phashsig=inputsig)". Phash might need a tweak, to output 64 floats instead of bits. But it would be more robust to the "random bits that flip" problem.

It's similar to what an engine like TinEye does, but instead of using a bunch of SIFT/SURF/etc features to do numerous visual word lookups, you'd just put in (a modified?) phash and get one word out.

What is your data type for inputHash and images.phash? ByteArray? Character array? Blob?

Just a 64 bit integer (BIGINT).

Is an image hash a very different beast from the hashing function used in passwords etc? In the latter we want large sensitivity to small changes, while in an image hash we want a measure sensitive to similarities.

Naively I'd expect image hashing to be like cross-correlation (non-linear) while password hashing can be done with shifts and modulo-2 (linear).


Correct me if I'm wrong, but using DL for such purposes would require much more preparation in sense of effort and computation power. Wouldn't it?

Also that would mean training? Which is rather cumbersome here. Can you elaborate a little more ?

Be sure to enable javascript or you won't be able to see the gist with the results. (Obviously including a plain table in the post would be too much work:-\)

You also won't be able to use most of the internet in 2014, so this point seems moot.

I run no script 100% of the time for more then a year.

Only time I have issues consistent issues is "Show HN:" posts and start-up websites. Most my day to day browsing is hardly affected by this.

This certainly isn't true for the part of the Internet I generally try to use. Most of it works without js (some of it works better with js on, some of it works worse (but "as intended") with js on.

Demanding js for showing simple text+images on the web, is a little like distributing a text email as an .exe-file IMNHO.

I could argue, using a gist makes the results more reusable... + I like the way gists look.

The Phash library is GPL licensed. If you are building a closed source commercial product, you need to purchase a license.

Unless your closed source commercial product is a service, in which case you don't need to do anything, since pHash is GPL not AGPL.

node-phash implementation is MIT-licensed https://github.com/aaronm67/node-phash/blob/master/LICENSE

Sorry node-phash is just a wrapper. pHash license is indeed GPL v3. Which means as you say a commercial license should be obtained for commercial use

Aren't they/you using it for a hosted web platform? Then it wouldn't matter, as long as it is not the AGPL.

Correct, I was just commenting for general commercial purpose use!

No, for closed source distribution. Commercial can be open or closed source. GPL does not prohibit commercial use, it prohibits not sharing source under same free terms.

Isn't SURF patent-encumbered?

SIFT is patented. SURF is a similar and alternative method to SIFT.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact