SHA256 is something like 30 percent slower than MD5. I'd suggest using Adler (wh...

adrian_b · on Jan 4, 2024

On a modern CPU (i.e. 64-bit Arm since 2012, Intel Atom since 2016, AMD Zen since 2017, Intel Core since 2019) SHA-256 is twice faster than MD5.

The difference in speed between the hashing speeds is actually greater, a double speed for a long file is what you get when the execution time includes parts that are identical for the two hashes, i.e. launching md5sum/sha256sum and reading the file.

Older versions of the binary coreutils package may mask this speed difference by having executables compiled only for very old CPUs.

Recent coreutils versions normally use for hashing the OpenSSL library, if found, and OpenSSL uses the hardware instructions where available.

Where a bad sha256sum is installed, "openssl dgst -r -sha256" should work instead.

AlienRobot · on Jan 4, 2024

In Python the Adler library returns a 32 bit checksum. It works pretty well when you're comparing one file to another file. It doesn't work pretty well if you want to, for example, create a quick fingerprint that (tries to) uniquely identify tens of thousands of files.

On StackOverflow I saw someone say that they got hash collisions in MD5 (128 bit) after hashing around 20k files.

When I tried making something similar I figured if I added the size of the file in bytes to the hash that would decrease the number of hash collisions since you would need a permutation of bytes in a set of bytes of same size to generate the same MD5 hash to get a collision. Still feels random and unavoidable in the greater scheme of things, though.

magicalhippo · on Jan 4, 2024

When making a dupe detector some decades ago, I kept the filesize outside of the hash.

No need to compute the hash until there's at least two files with the same size.

gojomo · on Jan 5, 2024

Conventional analysis of non-contrived files would suggest only a 50% chance of an MD5 collision among 2^64 (18 quintillion) files.

So any SO account of "hash collisions in MD5 (128 bit) after hashing around 20k files" may be indicative of some other bug, misreporting, or having a set of files that includes pairs intentionally-contrived to include MD5 collisions. (That's now trivial as long as you're not targeting a specific MD5 value, just intending two files to match, and have some ranges in the files where you can stuff the right result-aligning binary patterns.)

AlienRobot · on Jan 7, 2024

I'm not well-versed in this, but I think you're wrong. A MD5 hash has 2^128 permutations, so it's like the birthday problem, you'll get 50% chance of any two files in a set having the same hash with far fewer than 2^64 files.

Besides, we're talking about probabilities. It's possible for 2 files, one after the other, to have the exact same hash just because of chance. In fact, I think there was a bitcoin bug due to this, something about making the wrong assumption that the hash of the entire blockchain and the hash of the entire blockchain + 1 block would never be the same hash, if I remember correctly.

Hashes are simply not a reliable method of uniquely identifying things. They are a convenient method but if you implement any hash-based system for identification you should also implement a escape hatch for when two different things have the same hash but must be treated differently.

gojomo · on Jan 12, 2024

Nope. 2^64 is the number of tries needed for a 50% chance of a collision between two of the tries. You don't have to stay poorly read on this, and groundlessly allege wrongness:

https://auth0.com/blog/amp/birthday-attacks-collisions-and-p...