
Show HN: Imagededup – Finding duplicate images made easy - datitran
https://github.com/idealo/imagededup
======
datitran
We've just open-sourced our library imagededup, a Python package that
simplifies the task of finding exact and near duplicates in an image
collection.

It includes several hashing algorithms (PHash, DHash, etc.) and convolutional
neural networks, an evaluation framework to judge the quality of
deduplication, easy plotting of duplicates, and a simple API.

We're really excited about this library because finding duplicate images is an
important task in computer vision and machine learning. For example, severe
duplication can strongly bias the evaluation of your ML model (check out the
CIFAR-10 duplicates problem). Please try out our library, star it on GitHub,
and spread the word! We'd love to get feedback.

~~~
mkl
This looks really interesting. Can you give us an idea of the performance?
E.g. roughly how long would it take to process 1 million 1920×1080 JPEGs,
without GPU and with GPU?

What is the scaling like? E.g. what if it was 10 million?

~~~
jonatron
I use Cython (CPU) to brute-force search 400,000 pHashes of images. A search
takes somewhere between 100 and 200 ms.

~~~
tanujjain
I only realized now that the 100-200 ms time you refer to is for a single
search and not for all 400,000 searches. The package already achieves this
brute-force speed. In fact, the package also implements a BK-tree, which,
depending on the distance threshold passed, can drastically reduce the search
time. Moreover, the search through the BK-tree is also parallelized in the
package (each image's hash gets searched through the tree independently after
the tree is constructed). On one of the example datasets, containing 10k
images, with a distance threshold of 10 (for 64-bit hashes), the retrieval
time per image was under 50 ms.
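The BK-tree idea can be sketched in a few lines of pure Python. This is an illustrative toy, not imagededup's implementation: insert each hash keyed by its Hamming distance to the current node, and prune at query time with the triangle inequality.

```python
def hamming(a, b):
    # Hamming distance between two integer hashes:
    # XOR leaves a 1 bit wherever they differ.
    return bin(a ^ b).count("1")

class BKTree:
    """Toy BK-tree for finding hashes within a distance threshold."""

    def __init__(self):
        self.root = None  # node = (hash, {distance: child_node})

    def add(self, h):
        if self.root is None:
            self.root = (h, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]  # descend along the existing edge
            else:
                node[1][d] = (h, {})
                return

    def query(self, h, threshold):
        # Triangle inequality: only children whose edge distance lies
        # in [d - threshold, d + threshold] can contain matches.
        results, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = hamming(h, node[0])
            if d <= threshold:
                results.append((node[0], d))
            for child_d, child in node[1].items():
                if d - threshold <= child_d <= d + threshold:
                    stack.append(child)
        return results
```

With a low threshold, most subtrees are pruned and far fewer than N distance computations are needed per query, which is where the speedup over brute force comes from.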

~~~
jonatron
The 100-200ms time I referred to was indeed a single search. The difference
is, it's on a single core. Cython definitely makes the hamming distance
function faster.
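For reference, the pure-Python version of that inner loop is tiny when hashes are stored as integers; this is exactly the part a Cython or C rewrite speeds up:

```python
def hamming_distance(a: int, b: int) -> int:
    # XOR leaves a 1 bit wherever the two hashes differ;
    # counting those bits gives the Hamming distance.
    return bin(a ^ b).count("1")

def brute_force_search(query: int, hashes, threshold: int):
    """Compare a query hash against every stored hash (O(N) per search)."""
    return [h for h in hashes if hamming_distance(query, h) <= threshold]
```

On Python 3.10+ `(a ^ b).bit_count()` is a faster drop-in for the `bin(...).count("1")` trick.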

------
albertzeyer
This looks nice, and seems to support the most common methods for
fingerprinting/hashing. This comes with some heavy dependencies, though (which
is reasonable):

    
        install_requires=[
            'numpy==1.16.3',
            'Pillow==6.0.0',
            'PyWavelets==1.0.3',
            'scipy==1.2.1',
            'tensorflow==2.0.0',
            'tqdm==4.35.0',
            'scikit-learn==0.21.2',
            'matplotlib==3.1.1',
        ],
    

A while ago, I asked about something like this (or more about the underlying
methods) here on SO:

[https://stackoverflow.com/questions/4196453/simple-and-fast-method-to-compare-images-for-similarity](https://stackoverflow.com/questions/4196453/simple-and-fast-method-to-compare-images-for-similarity)

There are some interesting discussions. (Nowadays, such a question would have
been closed...)

~~~
orf
If this is a library then locking those dependencies down is not great. Does
it really _need_ Pillow 6.0.0, and not Pillow 6.0.1?

As this is being consumed by larger applications that may have dependencies
that conflict with these, the pins should be much more liberal.

[https://github.com/idealo/imagededup/pull/36](https://github.com/idealo/imagededup/pull/36)

~~~
datitran
Yes, you're right, thanks for pointing this out!

------
donatj
For a split second I was excited and terrified because I thought this was a
very similarly named Go project I wrote a while ago. It's nowhere near as
fancy, but it's very fast and has a very similar name.

I don't have the background in imaging these people likely have, but mine
works by breaking an image into an X-by-X map of average colors and comparing
those maps. I wrote it specifically because I needed to find similar images of
different aspect ratios, and at the time I couldn't find anything.

[https://github.com/donatj/imgdedup](https://github.com/donatj/imgdedup)
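That average-color idea can be sketched in a few lines of Python (the linked Go project is the real thing; here grayscale values stand in for colors, and the grid size `n` is an arbitrary choice):

```python
def grid_signature(pixels, w, h, n=4):
    """Average an image (flat row-major list of grayscale pixels)
    down to an n x n grid of mean brightness values."""
    sig = []
    for gy in range(n):
        for gx in range(n):
            # Pixel block covered by this grid cell
            x0, x1 = gx * w // n, (gx + 1) * w // n
            y0, y1 = gy * h // n, (gy + 1) * h // n
            block = [pixels[y * w + x]
                     for y in range(y0, y1) for x in range(x0, x1)]
            sig.append(sum(block) / len(block))
    return sig

def signature_distance(a, b):
    # Mean absolute difference between two signatures; 0 means identical.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Because every image collapses to the same fixed-size grid, images of different resolutions and (roughly) different aspect ratios become directly comparable.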

------
wruza
If anyone is interested in a GUI, XnView MP does this too.
[https://www.journeybytes.com/2018/03/how-to-find-duplicate-photos-for-free-using-xnview.html](https://www.journeybytes.com/2018/03/how-to-find-duplicate-photos-for-free-using-xnview.html)

------
shifto
This came a bit late. I recently decided I had to sort all my photos, which I
usually just dump in one big photos folder. Since I use the camera on my phone
a lot and also get a lot of media through WhatsApp, the collection was getting
a bit big.

I made a script that calculates the hash of every file and, if it finds a
duplicate, moves it to a separate duplicates folder. This worked reasonably
well, but I couldn't stop thinking there should already be more than one
ready-made solution for this.
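A minimal version of such a script might look like this, assuming exact byte-level duplicates only (the folder names are hypothetical):

```python
import hashlib
import shutil
from pathlib import Path

def dedupe(photo_dir: str, dup_dir: str):
    """Move byte-identical files out of photo_dir into dup_dir,
    keeping the first copy seen of each distinct file content."""
    seen = {}  # sha256 digest -> first path with that content
    Path(dup_dir).mkdir(exist_ok=True)
    for path in sorted(Path(photo_dir).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            shutil.move(str(path), str(Path(dup_dir) / path.name))
        else:
            seen[digest] = path
    return seen
```

Note this only catches exact duplicates; a re-encoded or resized copy gets a different hash, which is precisely the gap perceptual hashing fills.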

~~~
ptaffs
Quicker than hashing the whole file, you might want to compare EXIF data
(extracted with exiftool). I have been comparing the image date/time (to the
second) as tagged by the camera, and when I find a duplicate, I keep the one
with the largest image size. I've not worked out how to deal with files
without EXIF tags. I understand Shotwell hashes the thumbnail to find dupes.
The security-camera software Motion has some image comparison to determine
whether the camera image has changed since the last frame; I think it is
visual in nature rather than hash-based, since webcams are "noisy".
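The keep-the-largest rule boils down to a grouping pass. A sketch, assuming the timestamps and sizes have already been pulled out with something like exiftool (the tuple format here is made up for illustration):

```python
def keep_largest(photos):
    """photos: iterable of (path, datetime_str, size_bytes) tuples, where
    datetime_str is the camera's capture time to the second.
    Returns {datetime_str: surviving_path}, largest file winning."""
    best = {}  # datetime_str -> (path, size)
    for path, taken, size in photos:
        if taken not in best or size > best[taken][1]:
            best[taken] = (path, size)
    return {taken: path for taken, (path, size) in best.items()}
```

Anything not in the returned dict's values is a candidate for deletion; files lacking EXIF tags would need a fallback key (file hash, thumbnail hash, etc.).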

------
sekasi
Here's a broad and perhaps a bit naive question on this:

Reddit, Imgur, and any other site that receives significant amounts of images
from significant numbers of users... do they attempt to do this? To dedupe
images and instead create virtual links?

At face value it'd seem like a crazy amount of physical disk space savings,
but maybe the processing overhead is too expensive?

~~~
throwaway_bad
People do deduplicate files to save space, but it's usually based on an exact
byte match using MD5 or SHA-256. Some don't, due to privacy issues:
[https://news.ycombinator.com/item?id=2438181](https://news.ycombinator.com/item?id=2438181)
(e.g., MPAA can upload all their torrented movies and see which ones uploaded
instantly to prove that your system has their copyrighted files)

There's no way to make the UX work out for images that are only _similar_.
Would be pretty wild to upload a picture of myself just to see a picture of my
twin used instead.

But I do wonder if it's possible to deduplicate different resolutions of an
image that only differ in upscaling/downscaling algorithm and compression
level used (thereby solving the jpeg erosion problem:
[https://xkcd.com/1683/](https://xkcd.com/1683/))

~~~
tanujjain
The CNN methods in the package are particularly robust against resolution
differences. In fact, if a simple up/downscale is all that differentiates two
images, then even the hashing algorithms can be expected to do a good job.
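To see why, here is a toy difference hash (dHash) on synthetic grayscale data; it is illustrative only, not imagededup's code. The hash only records whether brightness rises or falls between neighbouring cells of a heavily averaged grid, so a clean 2x downscale leaves it unchanged:

```python
def dhash(pixels, w, h, size=8):
    """Toy dHash: average the image down to (size+1) x size cells, then
    record 1 bit per cell: is it brighter than its right-hand neighbour?"""
    def cell(gx, gy):
        x0, x1 = gx * w // (size + 1), (gx + 1) * w // (size + 1)
        y0, y1 = gy * h // size, (gy + 1) * h // size
        block = [pixels[y * w + x]
                 for y in range(y0, y1) for x in range(x0, x1)]
        return sum(block) / len(block)

    bits = 0
    for gy in range(size):
        for gx in range(size):
            bits = (bits << 1) | (cell(gx, gy) > cell(gx + 1, gy))
    return bits
```

Real-world rescaling adds interpolation and compression noise, so in practice two such hashes differ by a few bits rather than zero, which is why matching uses a Hamming-distance threshold instead of equality.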

------
cosmic_quanta
Nice project!

I wonder how much of it could be adapted to finding duplicate documents, e.g.
homeworks, CVs, etc. Presumably, the hashing would have to be adapted
slightly. But how much?

~~~
herohamp
For documents I wouldn't do anything listed here. I would just compare the
contents on a line-by-line or word-by-word basis.
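With Python's standard library, that line-by-line comparison is essentially difflib. A sketch (real CV/homework dedup would want to normalize whitespace, case, and formatting first):

```python
import difflib

def doc_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how similar two documents are, line by line."""
    matcher = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return matcher.ratio()
```

Pairs scoring above some threshold (say 0.9) would then be flagged as likely duplicates for manual review.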

------
rcarmo
This is really nice. I have a semi-abandoned project that uses PHash to
canvass my photo library and flag duplicates, and I'm quite likely to use this
instead.

Now if only Apple hadn’t repeatedly broken PyObjC over the years...

~~~
wertenshap
I’ve been trying to sort through multiple backups of my photo library for
several years and such a tool will be very useful.

~~~
ggm
This too is my use case. Some photos brought back home from Google Takeout,
some JPEG-recompressed, some renamed and date/time-shifted by jhead. I'd love
to be able to group them, select the highest quality or pixel count, and then
canonically order by date.

G

------
JonathanFly
Could this be used to order a large set of images by similarity?

~~~
jlg23
The API does not seem to support this, but it should be easy to hack in
(return not only a list of duplicates but also the actual distance to the
target image, then sort the results by distance).

~~~
tanujjain
The API already supports returning the Hamming distances/cosine similarities
along with the duplicate file list, which can be used to sort the files.
Please refer to the docs for the 'find_duplicates' function for more.
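Assuming the scores-enabled result is a dict mapping each filename to a list of (duplicate, distance) pairs, as the docs describe (the exact shape may differ, and the filenames below are made up), ranking by similarity is a one-liner:

```python
def rank_duplicates(duplicates):
    """Sort each file's duplicate list by distance, nearest first."""
    return {name: sorted(dups, key=lambda pair: pair[1])
            for name, dups in duplicates.items()}
```

Chaining the nearest neighbours from such a ranking is one simple way to lay out a collection roughly ordered by visual similarity.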

------
Dowwie
Did you evaluate the 'imagehash' [1] library prior to working on this? Any
limitations/concerns? The additional CNN seems to be the main difference
between the two libraries.

[1]
[https://github.com/JohannesBuchner/imagehash](https://github.com/JohannesBuchner/imagehash)

~~~
datitran
Yes, before developing the package, we were also using this great library for
hash generation. There are a bunch of differences compared to imagehash:

1. Added CNNs, as you mentioned.

2. Took care of housekeeping functions like efficient retrieval (using a
BK-tree, also parallelized).

3. Added plotting abilities for visualizing duplicates.

4. Added the possibility to evaluate the deduplication algorithm so that the
user can judge deduplication performance on a custom dataset (with
classification and information retrieval metrics).

5. Allowed changing thresholds to better capture the idea of a 'duplicate'
for specific use cases.

------
dandigangi
Sweet. Just what I needed to remove duplicate memes from my phone.

