
Detecting duplicate images with Python - iconfinder
http://blog.iconfinder.com/detecting-duplicate-images-using-python/
======
herrherr
We are currently using perceptual hashes (e.g. phash.org) to do hundreds of
thousands of image comparisons per day.

As mentioned in another comment, you really have to test different hashing
algorithms to find the one that suits your needs best. In general, though, it
is in most cases not necessary to develop an algorithm from scratch :)

For us, the much more challenging part was/is to develop a system that can
find similar images quickly. If you are interested in things like that, have a
look at VP/MVP data structures (e.g.
[http://pnylab.com/pny/papers/vptree/vptree/](http://pnylab.com/pny/papers/vptree/vptree/),
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7492)).
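
To give a flavor of how a VP-tree answers "all hashes within distance r"
queries, here is a toy Python sketch over Hamming distance (the function names
and tuple layout are made up; real implementations add bucketing, iterative
traversal, and so on):

```python
def hamming(a, b):
    # Hamming distance between two integer hashes.
    return bin(a ^ b).count("1")

def build(points):
    # Recursively build a vantage-point tree from a list of hashes.
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return (vp, 0, None, None)
    dists = [hamming(vp, p) for p in rest]
    mu = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(rest, dists) if d < mu]
    outside = [p for p, d in zip(rest, dists) if d >= mu]
    return (vp, mu, build(inside), build(outside))

def search(node, q, radius, out):
    # Collect every stored hash within `radius` of the query `q`.
    if node is None:
        return
    vp, mu, inside, outside = node
    d = hamming(q, vp)
    if d <= radius:
        out.append(vp)
    # Prune: only descend into a half if the query ball can intersect it.
    if d - radius < mu:
        search(inside, q, radius, out)
    if d + radius >= mu:
        search(outside, q, radius, out)
```

The triangle inequality is what makes the pruning valid: a subtree is visited
only if the query ball can possibly intersect it.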

~~~
razius
Yes, you're right. We're not using SQL queries at the moment, as that would be
very inefficient; the query was just an example for a small dataset.

I'm currently researching MVP trees and reading up on VP-trees, BK-trees [1],
GNAT [2] and HEngine [3]. Do you have any advice?

[1] [http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...](http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees)

[2]
[http://www.vldb.org/conf/1995/P574.PDF](http://www.vldb.org/conf/1995/P574.PDF)

[3]
[https://www.cse.msu.edu/~alexliu/publications/HammingQuery/H...](https://www.cse.msu.edu/~alexliu/publications/HammingQuery/HammingQuery_ICDE2011.pdf)

~~~
herrherr
I think you are on the right track there.

The thing is, though, you won't have any difficulty finding papers on those
topics. However, you will probably not have much luck finding concrete,
practical implementations that you could look at.

So it's a long way from reading the papers to having something working.

If you find something, please let me know.

~~~
razius
There's a list of CBIR engines on Wikipedia [1] and among those there are a
few open-source ones. I didn't really have time to check them all, but while
skimming through them imgSeek [2] caught my eye.

[1]
[http://en.wikipedia.org/wiki/List_of_CBIR_engines](http://en.wikipedia.org/wiki/List_of_CBIR_engines)

[2] [http://www.imgseek.net/isk-daemon](http://www.imgseek.net/isk-daemon)

------
acslater00
Here is my question for the OP. It seems like the image-shrinking step,
coupled with the transformation to +/- values, loses so much information that
the hash would suffer from a big false-positive problem. I would have loved to
see some data on this from their own dataset.

To give a concrete example, I noticed recently that a lot of app store icons
within a category look pretty similar. See for example this:

[https://twitter.com/acslater00/status/450127865682489344/pho...](https://twitter.com/acslater00/status/450127865682489344/photo/1)

All of those checkmarks certainly seem, to my naked eye, like the binary diff
transformation would give them nearly identical hashes. It seems like the
rounded corners in 'Finish' would blur out, the texture in 'OmniFocus 2' would
blur out, and the gradient in 'Clear' would look identical to a flat gradient
on the right side of the checkmark.

Anyway, clever algorithm, but I'm curious how it works in practice on small
icons.

~~~
razius
First, as a reply to the false-positive problem:
[https://news.ycombinator.com/item?id=7527982](https://news.ycombinator.com/item?id=7527982)

Ideally, we would like to move to a content-based image retrieval system where
we would be able to search based on certain features derived from the image
itself (color, shape and texture, for example) so we could fine-tune our
results.

Yes, the presented example is a curious case. If we take the first three icons
there and compare them based on shape and color, we can see that their shape
is identical but the background is different. Based on this, should we
consider them different or identical? You can't have too many variations on a
simple shape like a checkmark or a Facebook logo, so which variations should
you allow and which ones would you consider copies of previous work?

------
steeve
Or you can use OpenCV with SIFT/SURF/ORB + KNN/RANSAC and have a very robust
solution. Example [1].

OpenCV has awesome Python bindings (cv2), btw.

[1] [http://stackoverflow.com/questions/2146542/opencv-surf-how-t...](http://stackoverflow.com/questions/2146542/opencv-surf-how-to-generate-a-image-hash-fingerprint-signature-out-of-the)

~~~
amelim
Using feature detectors and descriptors is only half of the solution. If you
really want robust image recognition, you need to use something like the
vocabulary tree developed by Nistér [1][2].

[1][http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/sca...](http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/scalable.pdf)
[2][http://www.cc.gatech.edu/~phlosoft/files/schindler07cvpr2.pd...](http://www.cc.gatech.edu/~phlosoft/files/schindler07cvpr2.pdf)

~~~
steeve
True; however, OpenCV implements a bag of visual words too [1].

[1]
[http://docs.opencv.org/modules/features2d/doc/object_categor...](http://docs.opencv.org/modules/features2d/doc/object_categorization.html)

------
cyanoacry
This is a bad approach for a couple of reasons: 1) The total measurable
space/information in a resized, icon-size greyscale image is pretty small, so
you run into a much higher likelihood of collisions/false positives.

2) It's not too hard to program a Haar wavelet[1] transform[2] (basically
iterative scaled thresholding). This has worked well over at IQDB[3], which
does reverse image lookups on databases of several million images via a
modified Haar wavelet.

You can't beat this algorithm for simplicity, though. Have you guys done any
false positive checks with this algorithm? The saving grace might be that
icons are fairly small and limited in detail/color.

[1]
[http://en.wikipedia.org/wiki/Haar_wavelet](http://en.wikipedia.org/wiki/Haar_wavelet)

[2] [http://stackoverflow.com/questions/1034900/near-duplicate-im...](http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1076507#1076507)

[3] [http://iqdb.org/](http://iqdb.org/)
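
For the curious, one level of a 1-D Haar transform really is just pairwise
averaging and differencing; a toy Python sketch (a real image hash would use
the 2-D variant and keep only the largest-magnitude coefficients):

```python
def haar_1d(signal):
    # One level of the 1-D Haar transform: pairwise averages (low-pass)
    # followed by pairwise differences (high-pass).
    avgs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    diffs = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avgs + diffs

def haar_full(signal):
    # Apply the transform iteratively to the low-pass half until a
    # single overall average remains (length must be a power of two).
    out = list(signal)
    n = len(out)
    while n > 1:
        out[:n] = haar_1d(out[:n])
        n //= 2
    return out
```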

------
rplnt
I feel like this is doing something that has already been done better
(probably), but I enjoyed the article anyway. It nicely summarizes the what,
why and how.

~~~
ivoflipse
Any idea of what might be a better approach?

~~~
razius
There is no universal solution. There are a few algorithms, such as SIFT [1],
SURF [2], Winsurf [3], dhash (the one presented here), ahash [4] and phash
[5], each concentrating on a particular feature of the image and each with its
own strong and weak points.

[1] [http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...](http://en.wikipedia.org/wiki/Scale-invariant_feature_transform)

[2] [http://en.wikipedia.org/wiki/SURF](http://en.wikipedia.org/wiki/SURF)

[3] [http://www-db.deis.unibo.it/research/papers/IWOSS99.ABP.pdf](http://www-db.deis.unibo.it/research/papers/IWOSS99.ABP.pdf)

[4] [http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)

[5] [http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...](http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)

------
SeoxyS
If the hash were really looked up by value, as in the example SQL:

    
    
       SELECT pk, hash, file_path FROM image_hashes
           WHERE hash = '4c8e3366c275650f';
    

then this is screaming to be stored in a Bloom filter.
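
For illustration, a toy Bloom filter sketch in Python (the sizes m and k are
arbitrary here). One caveat: a Bloom filter only answers exact-membership
queries, so it helps with exact hash lookups but not with Hamming-distance
searches:

```python
import hashlib

class BloomFilter:
    # Minimal Bloom filter: k hash functions over an m-bit array,
    # stored here as one big Python integer.
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```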

------
ambirex
Several years ago I tried matching images with histogram comparisons. I
realized it wasn't going to be a perfect match, but it actually had a pretty
high degree of success on my data set (game tiles).

------
feelstupid
I can't work out why the width needs to be 1px bigger. I don't see the
explanation in #3. Also, the array in #3 shows a 9x9 grid of 81 values when
there are only 72 pixels?

~~~
sp332
If you only have 8 samples per row, then you have 7 adjacent pairs. If you use
a bit to represent each difference, you'll only have 7 bits. If you want 8
bits, you need 8 differences, so 9 samples.
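
A one-liner sketch of that counting argument (the comparison direction is an
assumption, mirroring the article's left-vs-right brightness test):

```python
def row_bits(row):
    # N samples per row yield N-1 adjacent pairs, hence N-1 bits:
    # 9 samples -> 8 bits. Each bit says "left pixel darker than right".
    return [1 if row[i] < row[i + 1] else 0 for i in range(len(row) - 1)]
```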

~~~
feelstupid
That makes sense, but then why does the example have a 9 by 9 grid? Shouldn't
it be 9 columns but only 8 rows?

~~~
razius
Yes, you are right, it's a mistake in the text. I'll correct it.

------
szidev
Great write-up! We did something very similar when trying to find duplicate
product images for a consumer review site we were working on. Our
implementation desaturated the image, broke it into a fixed number of tiles
and generated a histogram of each tile's values. Then we put a threshold on
the histogram values and had each value in the tile represent a bit. Combining
the bits gave us a hash to store in the DB. Our hashes were a bit larger, so
images within a certain Hamming distance were flagged, rather than just
looking for exact hash matches. It took quite a bit of tuning for us to get
good results, but it seemed to work pretty well. Do you see many false
positives with such a small processed image size (the 9x8 one, I mean)?
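
Roughly like this, as a hypothetical sketch (the tile count, threshold and bit
order here are made up, not our exact implementation):

```python
def tile_hash(gray, tiles=8):
    # gray: 2-D list of grayscale values. Split into tiles x tiles cells
    # and threshold each cell's mean against the global mean -> one bit
    # per cell, packed into a single integer hash.
    h, w = len(gray), len(gray[0])
    flat = [v for row in gray for v in row]
    global_mean = sum(flat) / len(flat)
    bits = 0
    for ty in range(tiles):
        for tx in range(tiles):
            cell = [gray[y][x]
                    for y in range(ty * h // tiles, (ty + 1) * h // tiles)
                    for x in range(tx * w // tiles, (tx + 1) * w // tiles)]
            mean = sum(cell) / len(cell)
            bits = bits << 1 | (1 if mean >= global_mean else 0)
    return bits

def hamming(a, b):
    # Number of differing bits; pairs under some threshold get flagged.
    return bin(a ^ b).count("1")
```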

~~~
razius
In practice we are using an image size of 17x16, which results in a hash size
of 256 bits, and currently it seems to work pretty well. I ran the algorithm
over the whole dataset (about 330,000+ icons) and I would say that, of all the
duplicate matches, about 1% were false positives.

Also, we will be integrating this into the review process for an iconset,
where we also do a manual quality check, showing possible matches to whatever
is currently uploaded, so skimming over one or two false positives isn't such
a big deal; we were more interested in the speed of the algorithm.
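
For what it's worth, the per-row comparison at 17x16 looks roughly like this
(a sketch over an already-resized grayscale grid; the resizing itself, done
with PIL in the article, is omitted):

```python
def dhash_bits(gray):
    # gray: 2-D list where each row is one pixel wider than the number
    # of bits per row, e.g. 16 rows of 17 values -> a 256-bit hash.
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = bits << 1 | (1 if left < right else 0)
    return bits
```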

~~~
szidev
That's pretty impressive performance given the hash size and speed. Thanks for
sharing!

------
aleh
What about images that are mirror-flipped? Unless it is a symmetrical icon,
that particular algorithm will consider the two highly different.

------
amckenna
An idea: this would require more storage, but would it be possible to hash an
image and take snapshots of the hashing algorithm's state as it processes the
image, say after each block (a hash digests a block, such as 64 bytes, at a
time)? Then simply compare the lists of snapshots between two images and come
up with a statistical threshold for a "similar image"? In the case of the two
cats, the images only differ at the nose, so the first part of the image up to
the nose would produce the same list of snapshots.

You could also hash forwards, backwards and starting at several random
midpoints, to prevent someone from simply changing the first block to throw
off the hashing algorithm.

------
mtct
This is a very good article on image hashing. Thanks!

------
d0ugie
Will the PIL import pick up WebP images as long as libwebp-dev is installed?
I'm a WebP kinda guy and I love fuzzy math achievements like this, especially
when I don't have to switch to Windows[1] to take advantage of it.

[1] [http://www.visipics.info](http://www.visipics.info)

------
amoonki
I wonder if this algorithm would work if someone uploaded a rotated version of
a previously uploaded icon. Since it only compares row-adjacent values, if the
rotation was 90 or 270 degrees I'm not sure it'd detect the duplicate.

------
avaku
Be careful not to get sued like the guy who posted an article about detecting
songs and got sued by Shazam, even though everybody knows about FFT. (I've
been doing something like that for my BSc.)

~~~
Torrents
Any idea if that article is still available? Or do you have any good resources
for implementing FFT? I'm working on a project to compare sound clips for
similarity, but I haven't grasped an effective way to accomplish that yet.

~~~
jmmcd
FFTW

~~~
Torrents
Oh cool...thanks jmmcd.

------
danso
For Rubyists, check out the phashion gem, which is a light wrapper around
pHash.

[https://github.com/westonplatter/phashion](https://github.com/westonplatter/phashion)

------
izzydata
I always wondered how partial duplicates of images were found. This wasn't as
complicated as I was expecting. I assume some similar techniques are used for
reverse image searching?

------
Gravityloss
JPEG already stores the low-resolution information in an easily retrievable
way. You could use that directly, with no need to do the transforms, and it
would be a lot faster.

~~~
ominous_prime
Only if it's progressively encoded (though the decoder can still do fast 1/2,
1/4 and 1/8 reductions). Also, low-resolution data isn't the same as what you
get from a scaling filter like ANTIALIAS (though I don't know exactly what
that does; probably something like Lanczos).

Plus, you still have to handle other formats like PNG, and get the same output
for the same image.

~~~
liuliu
Not really. JPEG encodes in 8x8 blocks, and for the DCT transform at the
beginning, the first component after the transformation is always (a scaled
version of) the mean value of the 64 elements. Therefore, if your resizing
target is more than 8x smaller, you can use the mean value directly, without
doing the expensive inverse DCT and the whole decoding process.
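
The idea in miniature, computed naively on raw pixels here (the point is that
a JPEG decoder can hand you these 8x8 block means essentially for free from
the DC coefficients):

```python
def block_means(gray, block=8):
    # Average each block x block tile of a 2-D grayscale grid -- the
    # same value (up to a constant scale factor) that the DCT's DC
    # coefficient encodes for every 8x8 JPEG block.
    h, w = len(gray), len(gray[0])
    out = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            tile = [gray[y][x]
                    for y in range(by, by + block)
                    for x in range(bx, bx + block)]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out
```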

~~~
ominous_prime
Well, that's what I was hinting at with the fast 1/8 reduction, so yes, you
can speed up the process by "decoding" a smaller version of the JPEG image.

This is just an optimization to add later, though. You might be able to scale
your JPEG down quickly, but you can't skip the rest of the transforms, because
you still need a deterministic fingerprint of the image. And in the article's
use case, the files are so small that you shouldn't be dominated by decoding
time (the test JPEG from the site decodes in 12 ms at full size on my laptop
with libjpeg).

------
kasperset
Although slower than this solution, I like this program:
[http://ssdeep.sourceforge.net/](http://ssdeep.sourceforge.net/)

------
thearn4
I've done a fair amount of work in image processing and computer vision, but
haven't read much into hashing of images. Interesting article.

------
rompetoto
Node implementation ->
[https://github.com/rompetoto/dhash](https://github.com/rompetoto/dhash)

------
johndavidback
This might be a stupid question, but what if you have two icons: one of a cat
with a black nose, and one of the same cat with a gray nose?

~~~
good_guy
I had the same thought.i think they are going to use this to flag similar
icons(and to show recommended images).so, most likely it'll be reviewed by a
human.

Edit:

From the last paragraph. > _we will be using the algorithm in the future on
Iconfinder to warn us if a submitted icon already exists in our collection but
it can also have other practical uses. For example, because images with
similar features have the hamming distance of their hashes close, it can also
be used as a basic recommendation system where the recommended images are
within a certain hashing distance of the current image._

------
puppetmaster3
The title should be: how to detect a duplicate image.

No need to say "in Java" or "in Python"; it can be done in any language.

