It really depends. The early bytes probably contain things like image dimensions, which might discriminate a lot - or not at all. They might contain palette-style color information, which might discriminate tons - or not at all. Most things involving hash functions depend on the distribution of inputs relative to the function, though.
I think it would be accurate to say "Determining where they are likely to differ is not necessarily hard, but it could be." If it happens to be easy then you can radically change the scale of the problem.
File length alone is a weak hash, actively maintained by every file system and so available "for free" (or for one stat call). And it can work surprisingly well, as the grandparent alludes to.
EDIT: You do not even need to open(2) files that are not part of "same size clusters", never mind hash their data. This alone can really squash the problem size relative to a more naive approach, even if you did feel the need to hash whole files.
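To make that concrete, here is a minimal Python sketch of the size-clustering pass, assuming you already have a flat list of candidate paths (the helper name and structure are just illustrative):

    import os
    from collections import defaultdict

    def cluster_by_size(paths):
        """Group paths by st_size; only clusters with 2+ members can hold duplicates."""
        clusters = defaultdict(list)
        for p in paths:
            try:
                clusters[os.stat(p).st_size].append(p)
            except OSError:
                pass                           # unreadable/vanished files drop out
        # Singleton sizes are unique by construction: never opened, never hashed.
        return {size: ps for size, ps in clusters.items() if len(ps) > 1}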
EDIT: And in terms of performance, if this data has to actually be read off a drive, then the hash need not go any faster than the drive IO, which is probably much, much slower than most candidate hash functions. Even PCIe 5.0 NVMe drives only do about 14 GB/s, SATA buses about 0.75 GB/s, etc.
To avoid the hash being entirely oblivious to bytes beyond a small block at the start, we could take additional samples of the file at exponentially increasing offsets:
Hash the first kilobyte.
Hash the 16th kilobyte.
Hash the 256th kilobyte.
Hash the 4096th kilobyte (4M).
Hash the 65536th kilobyte (64M).
Hash the 1048576th kilobyte (1G).
Then 16G, 256G, ...
It might be worth including the last kilobyte. (I think I mentioned that.) If the file format has trailing metadata, that can help, especially if there is any sort of checksum in it.
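For concreteness, a rough Python sketch of that sampling scheme (the 1 KiB block size, the 16x offset progression, and folding the file length into the hash are illustrative choices, not anyone's actual implementation):

    import hashlib
    import os

    def sample_hash(path, block=1024):
        """Hash 1 KiB samples at exponentially spaced offsets, plus the last block.

        Offsets: 0, 16K, 256K, 4M, 64M, 1G, ... (each 16x the previous),
        clipped to the file size. A sketch only, not a drop-in for any tool.
        """
        size = os.path.getsize(path)
        h = hashlib.blake2b()
        h.update(size.to_bytes(8, "little"))   # fold the length in as well
        with open(path, "rb") as f:
            offset = 0
            while offset < size:
                f.seek(offset)
                h.update(f.read(block))
                offset = block * 16 if offset == 0 else offset * 16
            if size > block:                   # trailing block, as suggested above
                f.seek(size - block)
                h.update(f.read(block))
        return h.hexdigest()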
Sure. I agree various sampling strategies are a good design, as long as they don't add up to much IO. Making that a "user parameter" to be customizable seems ideal. What you suggest may make a fine default - if adapted to varying file sizes somehow. E.g., that Nim dups program I mentioned elsewhere could just take a sequence of slices instead of only one, maybe just ignoring out-of-bounds slices. { And for size clusters where the size < N, screw it - just hash the whole thing, probably. ;-) }
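Something like the following Python sketch of that idea (the actual dups program is in Nim; the default slice list and small-file cutoff here are made-up values):

    import hashlib

    def slice_hash(path, size, slices=((0, 1024), (16384, 1024), (-1024, 1024)),
                   small_cutoff=65536):
        """Hash user-specified (offset, length) slices; negative offsets count from EOF.

        Files smaller than the cutoff are hashed whole; slices past EOF are skipped.
        """
        h = hashlib.blake2b()
        with open(path, "rb") as f:
            if size < small_cutoff:
                h.update(f.read())             # tiny file: just hash everything
            else:
                for off, length in slices:
                    start = size + off if off < 0 else off
                    if 0 <= start < size:      # ignore out-of-bounds slices
                        f.seek(start)
                        h.update(f.read(length))
        return h.hexdigest()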
>Finding sections of the file where data can be different is not only potentially hard but depends on the type of data. So can't be made generic
I agree, and a cousin comment of mine suggests user-specifiable sampling, which can always fall back to a full sample in the "hard case". Often, once true duplicate candidates are identified, a full compare is desirable so that no hash has to be trusted, and that full compare uses a lot of IO. That also depends on the user.
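A minimal Python sketch of that confirmation pass, assuming a candidate cluster has already been narrowed down by size and sample hash (pairwise and untuned, just to show the shape):

    import filecmp

    def confirm_duplicates(candidates):
        """Byte-compare every pair in a candidate cluster so no hash is trusted.

        `candidates` is a list of paths that already share a size (and sample
        hash). shallow=False forces a full content read of both files, so this
        is the IO-heavy step mentioned above.
        """
        confirmed = []
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    confirmed.append((a, b))
        return confirmed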
>Comparing files based on size is trivial and I presume people will always do it first before attempting to use a hash or even CRC.
It is unclear whether the Meow hash folks did this, since they seemed to motivate hash performance with dedup and were unlikely to have had IO > 10 GB/s in 2018, but it's hard to know for sure.