

Ask HN: I need to deduplicate terabytes of redundant photo libraries. - Chevalier

Hi all,

As a result of multiple laptops/backup external drives for both me and my wife over the years, I have nearly two terabytes of redundant photo libraries... emerging from about 300-500GB of actual, individual photos.

The libraries have been painstakingly pulled from iPhoto packages and recovered from crashed hard drives. Unfortunately, many of them have different filenames (from recovery software) or different sizes (from thumbnail duplicates). I've finally built a Windows desktop with the room and power to host/sort these pictures, but I don't know where to begin.

Does anyone have a good place to start with these photos? Can a program like Visipics even begin to make a dent? When I tried with an earlier MacBook Pro and an external drive, the "Duplicate Annihilator" app couldn't handle the load... and I wound up stuck with even more duplicates.

I'm just relieved that these problems are, by and large, a legacy of the cloudless past. Once deduplicated, I'll just stick my photos on Dropbox or GDrive and never worry again.
======
pwg
For exact duplicates, you can use something like sha256sum
([http://linux.die.net/man/1/sha256sum](http://linux.die.net/man/1/sha256sum))
to acquire a hash of each file, and then use sort
([http://linux.die.net/man/1/sort](http://linux.die.net/man/1/sort)), cut
([http://linux.die.net/man/1/cut](http://linux.die.net/man/1/cut)) and uniq
([http://linux.die.net/man/1/uniq](http://linux.die.net/man/1/uniq)) to get a
list of duplicated hashes.
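A minimal sketch of those first steps, assuming GNU coreutils on Linux (or WSL/Cygwin on the Windows box); the `photos` directory and file names are placeholders for your real library root (`uniq -d` stands in for the cut/uniq combination, printing only the hashes that repeat):

```shell
# Demo setup: a small tree with two identical files and one unique file.
# (Replace ./photos with your real library root.)
mkdir -p photos/a photos/b
echo "same content" > photos/a/img1.jpg
echo "same content" > photos/b/img1_copy.jpg
echo "different"    > photos/a/img2.jpg

# Step 1: hash every file (sha256sum prints "<hash>  <path>" per line).
find photos -type f -exec sha256sum {} + > hashes.txt

# Step 2: keep only the hash column, sort, and print hashes that
# occur more than once -- these mark duplicated content.
cut -d' ' -f1 hashes.txt | sort | uniq -d > dup-hashes.txt
```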

Once you have a list of duplicated hashes, you can use split
([http://linux.die.net/man/1/split](http://linux.die.net/man/1/split)), paste
([http://linux.die.net/man/1/paste](http://linux.die.net/man/1/paste)), and
egrep ([http://unixhelp.ed.ac.uk/CGI/man-cgi?egrep](http://unixhelp.ed.ac.uk/CGI/man-cgi?egrep))
to reacquire a list of file names containing duplicate content.
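One way to sketch this step: `grep -F -f` can stand in for the split/paste/egrep dance, matching each duplicated hash as a fixed string against the full hash listing. The `hashes.txt`/`dup-hashes.txt` file names are assumptions carried over from the hashing step (tiny stand-in data here so the snippet runs on its own):

```shell
# Assumed inputs from the hashing step: hashes.txt ("<hash>  <path>"
# lines from sha256sum) and dup-hashes.txt (hashes seen more than once).
# Tiny stand-ins so this runs in isolation:
printf 'aaa  photos/a/img1.jpg\naaa  photos/b/img1_copy.jpg\nbbb  photos/a/img2.jpg\n' > hashes.txt
printf 'aaa\n' > dup-hashes.txt

# Step 3: recover the file names whose content is duplicated --
# -F treats each pattern as a fixed string, -f reads patterns from a file.
grep -Ff dup-hashes.txt hashes.txt > dup-files.txt
```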

Then, if you trust the collision resistance of SHA-256, you can just delete
all but one of those files. If you are slightly paranoid, you can first use
cmp ([http://linux.die.net/man/1/cmp](http://linux.die.net/man/1/cmp)) to
compare the files byte-for-byte and remove only those that are exact
duplicates.
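The paranoid double-check might look like this; the two file names are placeholders for a pair that shared a hash (`cmp -s` is silent and exits 0 only when the files are byte-identical):

```shell
# Stand-in pair of hash-equal files (use your real candidates instead).
mkdir -p photos
printf 'same content\n' > photos/img1.jpg
printf 'same content\n' > photos/img1_copy.jpg

# cmp -s: no output, exit status 0 iff the files are byte-for-byte
# identical -- only then is it safe to remove one copy.
if cmp -s photos/img1.jpg photos/img1_copy.jpg; then
    echo "identical: safe to remove one copy"
fi
```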

This would eliminate the exact duplicates, which from the sound of things
might be a good portion of your duplicates. It won't help with near-
duplicates that are the same image but different bytes (e.g., a cropped or
thumbnail-sized version of a larger image).

~~~
Chevalier
WOW. Thanks! I'm coming from a non-technical background, but I'll give it a
shot.

------
lazylizard
Depending on the version of Windows you have, it may be built-in:
[https://en.wikipedia.org/wiki/Single-instance_storage](https://en.wikipedia.org/wiki/Single-instance_storage)

~~~
Chevalier
I had no idea. I'm running Windows 8.1... I'll see if I can find it. Thanks!

