It really depends. The early bytes probably contain things like image dimensions, which might discriminate a lot - or not at all. They might contain palette-style color information, which might discriminate tons - or not at all. Most things involving hash functions depend on the distribution of inputs relative to the function, though.
I think it would be accurate to say "Determining where they are likely to differ is not necessarily hard, but it could be." If it happens to be easy then you can radically change the scale of the problem.
File length alone is a weak hash, actively maintained by every file system and so available "for free" (or for one stat call). And it can work surprisingly well, as the grandparent alludes to.
EDIT: You do not even need to open(2) files that are not part of "same size clusters", never mind hash their data. This alone can really squash the problem size relative to a more naive approach, even if you did feel the need to hash whole files.
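To make that concrete, here is a minimal Python sketch of the size-clustering pass, assuming you already have a flat list of candidate paths (the helper name and structure are just illustrative):

    import os
    from collections import defaultdict

    def cluster_by_size(paths):
        """Group paths by st_size; only clusters with 2+ members can hold duplicates."""
        clusters = defaultdict(list)
        for p in paths:
            try:
                clusters[os.stat(p).st_size].append(p)
            except OSError:
                pass                           # unreadable/vanished files drop out
        # Singleton sizes are unique by construction: never opened, never hashed.
        return {size: ps for size, ps in clusters.items() if len(ps) > 1}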
EDIT: And in terms of performance, if this data has to actually be read off a drive, then the hash need not go any faster than the drive IO, which is probably much, much slower than most candidate hash functions. Even PCIe 5.0 NVMe drives only do about 14 GB/s, SATA buses about 0.75 GB/s, etc.
To avoid the hash being entirely oblivious to bytes beyond a small block at the start, we could take additional samples of the file at exponentially increasing offsets:
Hash the first kilobyte.
Hash the 16th kilobyte.
Hash the 256th kilobyte.
Hash the 4096th kilobyte (4M).
Hash the 65536th kilobyte (64M).
Hash the 1048576th kilobyte (1G).
Then 16G, 256G, ...
It might be worth including the last kilobyte. (I think I mentioned that.) If the file format has trailing metadata, that can help, especially if there is any sort of checksum in it.
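For concreteness, a rough Python sketch of that sampling scheme (the 1 KiB block size, the 16x offset progression, and folding the file length into the hash are illustrative choices, not anyone's actual implementation):

    import hashlib
    import os

    def sample_hash(path, block=1024):
        """Hash 1 KiB samples at exponentially spaced offsets, plus the last block.

        Offsets: 0, 16K, 256K, 4M, 64M, 1G, ... (each 16x the previous),
        clipped to the file size. A sketch only, not a drop-in for any tool.
        """
        size = os.path.getsize(path)
        h = hashlib.blake2b()
        h.update(size.to_bytes(8, "little"))   # fold the length in as well
        with open(path, "rb") as f:
            offset = 0
            while offset < size:
                f.seek(offset)
                h.update(f.read(block))
                offset = block * 16 if offset == 0 else offset * 16
            if size > block:                   # trailing block, as suggested above
                f.seek(size - block)
                h.update(f.read(block))
        return h.hexdigest()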
Sure. I agree various sampling strategies are a good design, as long as they don't add up to much IO. Making that a "user parameter" to be customizable seems ideal. What you suggest may make a fine default - if adapted to varying file sizes somehow. E.g., that Nim dups program I mentioned elsewhere could just take a sequence of slices instead of only one, maybe just ignoring out-of-bounds slices. { And for size clusters where the size < N, screw it - just hash the whole thing, probably. ;-) }
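Something like the following Python sketch of that idea (the actual dups program is in Nim; the default slice list and small-file cutoff here are made-up values):

    import hashlib

    def slice_hash(path, size, slices=((0, 1024), (16384, 1024), (-1024, 1024)),
                   small_cutoff=65536):
        """Hash user-specified (offset, length) slices; negative offsets count from EOF.

        Files smaller than the cutoff are hashed whole; slices past EOF are skipped.
        """
        h = hashlib.blake2b()
        with open(path, "rb") as f:
            if size < small_cutoff:
                h.update(f.read())             # tiny file: just hash everything
            else:
                for off, length in slices:
                    start = size + off if off < 0 else off
                    if 0 <= start < size:      # ignore out-of-bounds slices
                        f.seek(start)
                        h.update(f.read(length))
        return h.hexdigest()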
>Finding sections of the file where data can be different is not only potentially hard but depends on the type of data. So can't be made generic
I agree, and a cousin comment of mine suggests user-specifiable sampling, which can always fall back to a full sample in the "hard case". Often, once true duplicate candidates are identified, a full compare is desirable so that no hash has to be trusted, and that full compare uses a lot of IO. That also depends on the user.
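A minimal Python sketch of that confirmation pass, assuming a candidate cluster has already been narrowed down by size and sample hash (pairwise and untuned, just to show the shape):

    import filecmp

    def confirm_duplicates(candidates):
        """Byte-compare every pair in a candidate cluster so no hash is trusted.

        `candidates` is a list of paths that already share a size (and sample
        hash). shallow=False forces a full content read of both files, so this
        is the IO-heavy step mentioned above.
        """
        confirmed = []
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    confirmed.append((a, b))
        return confirmed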
>Comparing files based on size is trivial and I presume people will always do it first before attempting to use a hash or even CRC.
It is unclear whether the Meow hash folks did this, since they seemed to motivate hash performance with dedup and were unlikely to have had IO > 10 GB/s in 2018, but it's hard to know for sure.