

Ask HN: md5 + size = file UID? - alexrodygin

Is an MD5 checksum plus the file's size an absolutely unique identifier for a file across the universe?
======
paulgb
No, for the same reason you can't fit n+1 pigeons in n holes.

<http://en.wikipedia.org/wiki/Pigeonhole_principle>

------
mooism2
No, it isn't.

MD5 gives 16 bytes of output, so consider all possible 17-byte files and their
MD5 checksums. There are 2^136 such files but only 2^128 checksums, so on
average each checksum will be shared by 256 17-byte files.
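
The same counting argument can be checked at a smaller scale. The sketch below is only an illustration under stated assumptions: `tiny_checksum` is a hypothetical 1-byte checksum (the first byte of the MD5 digest), and the "files" are all possible 2-byte inputs, so 65,536 inputs have to share 256 possible outputs.

```python
import hashlib
from collections import Counter

def tiny_checksum(data: bytes) -> int:
    """Hypothetical 1-byte checksum: the first byte of the MD5 digest."""
    return hashlib.md5(data).digest()[0]

# Enumerate every possible 2-byte file (256 * 256 = 65,536 of them)
# and count how many files land on each checksum value.
counts = Counter(
    tiny_checksum(bytes([a, b])) for a in range(256) for b in range(256)
)

total_files = sum(counts.values())
possible_checksums = 256  # a 1-byte checksum has 2^8 possible values
average_sharing = total_files / possible_checksums

print(total_files, average_sharing)  # prints: 65536 256.0
```

Every input here has the same size, so appending the size to the checksum would not separate any of the colliding files.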

If you're worried about MD5 collisions between files, adding the file size
isn't going to do much to help. Better to use SHA1 or some other algorithm in
addition to MD5. E.g. 16 bytes of MD5 + 20 bytes of SHA1 = 36 bytes total
output.
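
A minimal sketch of that combination, assuming Python's standard `hashlib`; the helper name `composite_id` and the sample bytes are made up for illustration. It only makes accidental collisions less likely, it does not make the identifier unique.

```python
import hashlib

def composite_id(data: bytes) -> bytes:
    """Hypothetical helper: 16-byte MD5 digest followed by 20-byte SHA1 digest."""
    return hashlib.md5(data).digest() + hashlib.sha1(data).digest()

file_id = composite_id(b"example file contents")
print(len(file_id))   # 16 + 20 = 36 bytes
print(file_id.hex())  # hex form, convenient for storing as text
```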

------
mfukar
No, because all hashing algorithms suffer from collisions. A perfect
(collision-free) hash function for files would need at least log2(N) bits of
output to tell N possible files apart, which for arbitrary files means about
as many bits as the files themselves.

In practice, you can go with SHA-2, for which no collisions have been found
yet.

------
_0ffh
No matter which hash function you use, you will _not_ get 100% unique
identifiers, because you _can't_! Just do the freaking math, and you'll see
for yourself - it's quite obvious, actually!

~~~
alexrodygin
Yeah, I was confused. I'm a product guy, so math isn't my strongest point. It
looks like combining MD5 + size + SHA1 should be enough to get an almost unique
file id. Thank you all for the replies.

------
cperciva
_Is an MD5 checksum plus the file's size an absolutely unique identifier for a
file across the universe?_

No. Use SHA256.
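
One way to do that, sketched with Python's standard `hashlib`; the function name `sha256_of_file` and the 1 MiB chunk size are assumptions for illustration. The file is read in chunks so that large files do not need to fit in memory. The result is still not unique in the absolute sense, since collisions must exist in principle, but none are known for SHA-256.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is 64 hex characters (32 bytes) and can be used directly as a database key or file name.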

