
Rename files to match hash of contents - SimplGy
https://gist.github.com/SimplGy/75bb4fd26a12d4f16da6df1c4e506562
======
sgentle
Fun! This is similar to how git stores files internally. You can do some neat
tricks like this:

    
    
      $ ls
      01.jpg      03.jpg      03_copy.jpg 04.jpg      05.jpg
    
      $ git init
      Initialized empty Git repository in /tmp/test/.git/
    
      $ git hash-object -w *
      82f7d50fc89d2fd47150aff539ea4acf45ec1589
      0080672bc4f248c400d569cce1a2a3d743eb1331
      0080672bc4f248c400d569cce1a2a3d743eb1331
      58db57b10c219b9b71f0223e58a6dc0d51cfe207
      05dcde743807bddaf55ad1231572c1365d4db4af
    
      $ find .git/objects -type f
      .git/objects/00/80672bc4f248c400d569cce1a2a3d743eb1331
      .git/objects/05/dcde743807bddaf55ad1231572c1365d4db4af
      .git/objects/58/db57b10c219b9b71f0223e58a6dc0d51cfe207
      .git/objects/82/f7d50fc89d2fd47150aff539ea4acf45ec1589
    

If you're curious, you can read more about how it works here: [https://git-scm.com/book/en/v1/Git-Internals-Git-Objects](https://git-scm.com/book/en/v1/Git-Internals-Git-Objects)
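
The object id is just SHA-1 over a short header plus the contents, so you can reproduce it by hand. A minimal sketch (assuming bash, where printf's \0 emits a NUL byte, and sha1sum from coreutils):

    # a blob's id is sha1("blob <size-in-bytes>\0" + contents)
    $ printf 'hello\n' | git hash-object --stdin
    $ printf 'blob 6\0hello\n' | sha1sum
    # both print the same 40-character object id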

~~~
cperciva
This is also how FreeBSD Update and Portsnap store files. This technique has
been around for a long time.

~~~
patrickmn
[https://en.wikipedia.org/wiki/Content-addressable_storage](https://en.wikipedia.org/wiki/Content-addressable_storage)

~~~
cperciva
Right. I'm not a fan of the terminology though; I prefer _hash-addressed
storage_ to avoid the potential confusion with associative memory.

------
stirner
[http://mywiki.wooledge.org/BashPitfalls](http://mywiki.wooledge.org/BashPitfalls)
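
The pitfall most relevant to a renaming script like this one is the very first on that page: iterating over ls output word-splits filenames. A minimal sketch of the safe pattern:

    # bad:  for f in $(ls *.jpg)  -- splits "my photo.jpg" into two words
    # good: let the glob expand, and quote every use of the variable
    for f in *.jpg; do
      md5sum -- "$f"
    done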

------
sliken
Be warned that this (by default) only looks at part of the file. Seems like a
poor default.
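
Keying the rename off the entire contents is a small change. A rough sketch (assuming GNU md5sum; the gist's own script may differ in details):

    for f in *.jpg; do
      hash=$(md5sum -- "$f" | cut -d' ' -f1)   # hash of the whole file
      mv -n -- "$f" "$hash.${f##*.}"           # -n: never overwrite an existing file
    done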

------
askvictor
Don't modern filesystems allow you to store metadata like this separately from
the filename or file data?
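
On Linux, extended attributes seem like the obvious candidate. A sketch (assumes the attr tools and a filesystem with xattr support; user.sha256 is just a made-up attribute name):

    hash=$(sha256sum photo.jpg | cut -d' ' -f1)
    setfattr -n user.sha256 -v "$hash" photo.jpg   # attach the hash as metadata
    getfattr -n user.sha256 photo.jpg              # read it back

The catch is that many tools silently drop xattrs when copying or archiving.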

------
m0atz
Nirsoft's 'HashMyFiles' has this functionality built in already, known as
duplicate search mode. It works extremely well.
[http://www.nirsoft.net/utils/hash_my_files.html](http://www.nirsoft.net/utils/hash_my_files.html)

------
zokier
Content-addressable storage is always neat. Does anyone know if using a
truncated md5 like this is somehow more robust than using some non-crypto hash
like SipHash, which already produces 64-bit hashes?
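
Against accidental collisions the two should be equivalent: any well-mixed 64-bit value has the same birthday bound, and truncation throws away whatever adversarial resistance md5 still had anyway. Side by side (assuming the xxhsum tool that ships with xxHash):

    md5sum file.jpg | cut -c1-16   # first 64 bits of the md5
    xxhsum -H1 file.jpg            # XXH64, natively 64 bits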

------
zbuf
duff (duplicate file finder) is another useful tool for this, with flags to
act on duplicates once they're found:

[http://duff.dreda.org/](http://duff.dreda.org/)

------
tucaz
It would be nice to turn this into a program that stores the previous names so
the files can be renamed back after deduplicating.

Very cool!
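
Something like this sketch would do it, logging each rename to a TSV so it can be undone (rename_log.tsv is an arbitrary name; assumes GNU md5sum):

    for f in *.jpg; do
      hash=$(md5sum -- "$f" | cut -d' ' -f1)
      printf '%s\t%s\n' "$hash.jpg" "$f" >> rename_log.tsv   # new name, old name
      mv -n -- "$f" "$hash.jpg"
    done
    # undo the renames later:
    while IFS=$'\t' read -r new old; do mv -n -- "$new" "$old"; done < rename_log.tsv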

~~~
kardos
Why rename them at all? There are already good tools for duplicate detection.
An example is fdupes [1], which is smart enough to rule out non-dupes with
cheaper tricks like size checks and partial hashes, so you can avoid hashing
some of the files in full.

[1] [https://github.com/adrianlopezroche/fdupes](https://github.com/adrianlopezroche/fdupes)
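
Typical usage, with flags straight from the man page (-N keeps the first file of each duplicate set when deleting):

    fdupes -r ~/Pictures     # recursively list sets of duplicates
    fdupes -rdN ~/Pictures   # delete all but the first of each set, without prompting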

Edit: just noticed that it's using md5, which is broken [2], and that it's
using truncated md5 hashes.....!

[2] [https://natmchugh.blogspot.ca/2015/02/create-your-own-md5-collisions.html](https://natmchugh.blogspot.ca/2015/02/create-your-own-md5-collisions.html)

~~~
spangry
Using md5 is only a problem here if someone has actually gained access to your
files and then gone to the trouble of secretly adding new files and
calculating/brute-forcing the correct 'chosen-prefixes' to ensure a clash. It
would be a pretty weird attack to mount, that's for sure.

md5 is fine for deduplicating. It's extremely improbable you'd 'organically'
get an md5 hash clash for two different files.
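
Back-of-the-envelope: the birthday bound for a full 128-bit md5 over n files is about n^2 / 2^129, so even a billion files puts the odds around 10^-21:

    echo 'scale=25; (10^9)^2 / 2^129' | bc -l   # ~1.5 * 10^-21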

~~~
kardos
If you had a copy of the two image files from my second link, this 'dupe
detector' would erroneously flag one as a dupe.

Also, what of truncating the hashes?

I don't get why people try to justify using severely weakened things when
using the non-broken (i.e., secure) version is a /trivial/ drop-in
replacement...

~~~
spangry
I'm not trying to justify anything. I'm just trying to suggest you're
labouring under a misapprehension. And this has nothing to do with security.
I'm guessing you've heard the (good) advice that md5 is not a secure hashing
function for, say, storing passwords, and then promptly joined the 'md5 is bad
for all the things' cargo cult.

So while you're correct about the two images on that blog, the only reason
you'd get a clash is that the author of that blog post spent ~15 hours on an
AWS GPU instance generating the correct prefixes which, when appended to those
files, result in a clash.

So, I guess if you are in the habit of grabbing random files from your hdd,
loading them onto an AWS GPU instance for 15 hours (per file) and generating
hash collisions, then yeah, don't use fdupes.

~~~
kardos
fdupes is not a problem assuming wikipedia's description [1] is correct: "It
first compares file sizes, partial MD5 signatures, full MD5 signatures, and
then performs a byte-by-byte comparison for verification."
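
The size check alone rules out most candidates, since only same-size files can possibly be identical. A rough sketch of that first stage (assumes GNU find):

    # print only files whose byte size matches another file's
    find . -type f -printf '%s %p\n' | sort -n | awk '
      BEGIN { prev = "" }
      $1 == prev { if (!shown) print prevline; print; shown = 1 }
      $1 != prev { prev = $1; prevline = $0; shown = 0 }'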

I was unimpressed by the md5 used in the shell script at the original link,
which is using a truncated md5...

[1]
[https://en.wikipedia.org/wiki/Fdupes](https://en.wikipedia.org/wiki/Fdupes)

~~~
spangry
Ok, fair enough. I would agree with the view that using md5, presumably for
the faster performance, is probably not the best trade-off to be making here.
Unless we're dealing with an NVMe drive (or something more exotic), you're
likely to be IO bound even if using more computationally intensive hashing
functions.

And if you are deduping on really fast storage, you'd get way better
performance (with comparable safety) using something like xxHash64
([https://cyan4973.github.io/xxHash/](https://cyan4973.github.io/xxHash/)).
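
Easy to sanity-check on your own machine (bigfile is any large file; assumes xxhsum is installed, and note that repeat runs will hit the page cache):

    time md5sum bigfile
    time sha256sum bigfile
    time xxhsum bigfile   # on a slow disk all three converge; on NVMe the hash dominates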

------
tscs37
I wrote something similar once, but only for GIFs, and it also fixes file
extensions for a few MIME types.
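
The extension fix can hang off file's MIME detection. A sketch covering just a few types:

    mime=$(file --brief --mime-type "$f")
    case "$mime" in
      image/gif)  ext=gif ;;
      image/jpeg) ext=jpg ;;
      image/png)  ext=png ;;
      *)          ext="${f##*.}" ;;  # unknown type: keep the old extension
    esac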

------
dschiptsov
Wouldn't symbolic links be more appropriate?
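
That is, once a dupe finder pairs up a canonical copy and a dupe, something like:

    ln -sf -- "$canonical" "$dupe"   # replace the dupe with a symlink to the kept copy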

