
Why it's hard to write a dupefinder - SXX
http://rmlint.readthedocs.org/en/latest/cautions.html
======
mixu
I wrote one of these for fun a while back using the following approach:

\- Files are indexed by inode and device, files with the same inode + device
are considered equal. (My main use case for this was to bundle up JS files.)

\- Files are then indexed by size; only files with the same size are compared.

\- During comparison, the files are read at block sizes increasing in powers
of two, starting with 2k. The blocks are hashed and compared, and if they do
not match the comparison is stopped early (often without having to read the
full file). If all the hashes are equal, then the files are considered to be
equal.

\- Hashes are only computed when needed and cached in memory. Since the hash
block size increases in powers of two, only a few dozen hashes are needed even
for large files (reducing memory usage compared to a fixed hash block size).

link: [https://github.com/mixu/file-dedupe](https://github.com/mixu/file-dedupe)
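
A minimal sketch of that doubling-block comparison (the SHA-1 choice and the helper name are mine; the 2k starting block is from the description above, and the digest caching that file-dedupe does is left out):

    import hashlib

    def same_content(path_a, path_b, start_block=2048):
        # Compare two same-size files by hashing blocks whose size doubles
        # each round, stopping at the first mismatch. file-dedupe additionally
        # caches the per-block digests for reuse across comparisons; that
        # caching layer is omitted here.
        block = start_block
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                chunk_a, chunk_b = fa.read(block), fb.read(block)
                if hashlib.sha1(chunk_a).digest() != hashlib.sha1(chunk_b).digest():
                    return False          # mismatch: the rest of the files is never read
                if not chunk_a:           # both files exhausted at the same point
                    return True
                block *= 2                # next round reads twice as much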

~~~
liw
I wrote mine along similar lines, except without using hashing at all. Files
of identical size are compared byte-by-byte instead, until first difference or
end of file. As many files as possible at a time, of course, to avoid having
to read through files multiple times. This avoids any uncertainty about hash
collisions.

To find out how many files of each size you have:

    find ~ -type f -printf '%s\n' | sort | uniq -c | sort -n
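
A rough sketch of that "as many files as possible at a time" comparison (the grouping strategy, names, and block size are my guesses, not liw's code): same-size files are read in lockstep and split into smaller groups as soon as their next blocks differ, so no file is read more than once or past its first difference.

    def identical_groups(paths, block=65536):
        # Assumes the paths were already grouped by size.
        handles = [(p, open(p, "rb")) for p in paths]
        groups = [handles]                # one candidate group to start with
        try:
            results = []
            while groups:
                next_groups = []
                for group in groups:
                    buckets = {}
                    for path, fh in group:
                        buckets.setdefault(fh.read(block), []).append((path, fh))
                    for chunk, members in buckets.items():
                        if len(members) < 2:
                            continue                     # unique block: drop the file
                        if chunk == b"":                 # all members hit EOF together
                            results.append([p for p, _ in members])
                        else:
                            next_groups.append(members)  # still identical so far
                groups = next_groups
            return results
        finally:
            for _, fh in handles:
                fh.close()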

~~~
hvidgaard
A hash is often used for an online algorithm. If you know the hash, you know
there is a potential for dedupe, and you can then do a byte-for-byte
comparison. I suppose you could use size as a prefilter for dedupe. That is if
you do dedupe at the file level. Dedupe at the block level doesn't care about
the content, only the blocks, and it's not unusual to see, for instance, mp3
files with the same mp3 stream but different metadata. You cannot do the
latter without hashing.
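
For illustration, a toy version of the block-level view (the names and the 4 KB block size are assumptions): index every fixed-size block by its hash and keep the digests that occur more than once.

    import hashlib
    from collections import defaultdict

    def shared_blocks(paths, block=4096):
        # Index every fixed-size block by its hash so identical blocks can be
        # found across files that differ as a whole. A real dedupe store would
        # still verify candidates byte-for-byte (or trust a strong hash).
        index = defaultdict(list)         # digest -> [(path, offset), ...]
        for path in paths:
            with open(path, "rb") as fh:
                offset = 0
                while True:
                    chunk = fh.read(block)
                    if not chunk:
                        break
                    index[hashlib.sha256(chunk).digest()].append((path, offset))
                    offset += len(chunk)
        return {d: hits for d, hits in index.items() if len(hits) > 1}

Note that fixed offsets only catch duplicates that stay block-aligned; the mp3-with-different-metadata case generally needs content-defined chunking, since a longer tag at the front shifts every later block.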

------
gwillen
I can think of a slightly more general solution to the problem of not being
sure if A and B are different views on the same file or not, especially if you
plan to hardlink one to the other: Hardlink A to a tempfile, hardlink the
tempfile in place of both A and B, then remove the tempfile. This will give
the right outcome regardless of whether A and B are secretly the same
underlying file (assuming that they are on the same filesystem).
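
A small sketch of that trick (the function and tempfile naming are mine), assuming both paths are regular files on the same filesystem:

    import os
    import uuid

    def merge_into_one_inode(a, b):
        # Make a and b names for the same data, whether or not they already
        # share an inode.
        tmp = a + ".dedupe-" + uuid.uuid4().hex   # extra link protecting a's data
        os.link(a, tmp)
        for path in (a, b):
            os.unlink(path)                       # drop the old name...
            os.link(tmp, path)                    # ...and point it at the shared inode
        os.unlink(tmp)                            # the temporary name is no longer needed

The unlink/link pair leaves a brief window where the name does not exist; a more careful version would link the tempfile to a second name and rename() it over the target instead.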

------
pixelbeat
The approach that
[http://www.pixelbeat.org/fslint](http://www.pixelbeat.org/fslint) takes is to
md5 and sha1 the files to avoid false positives.

The core of the duplicate detection is the shell script here:

[https://github.com/pixelb/fslint/blob/ffcd3b85/fslint/findup...](https://github.com/pixelb/fslint/blob/ffcd3b85/fslint/findup#L162)
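
For illustration only (fslint's real logic is in the shell script linked above), the same double-check in Python, computing both digests in a single pass:

    import hashlib

    def md5_and_sha1(path, block=65536):
        # Two files are treated as duplicates only if size, MD5 and SHA-1 all agree.
        md5, sha1 = hashlib.md5(), hashlib.sha1()
        with open(path, "rb") as fh:
            while chunk := fh.read(block):
                md5.update(chunk)
                sha1.update(chunk)
        return md5.hexdigest(), sha1.hexdigest()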

------
bpchaps
One of the odd things I've seen that separates a "good" admin from a "bad"
admin is their reliance on md5s as a checksum for file differences in anything
done in bulk. It's extremely rare to see a collision, but it definitely
happens (I've seen it once, after a decently sized config change. Nobody
believed me :( ). That's not to say it's not a useful hash, just that it
should be used with caution.

~~~
kevinnk
You saw a full 128 bit collision on real files? They were right not to believe
you...

~~~
ubernostrum
While it's rare, do remember that hash collisions need to be thought of in
birthday-paradox terms, which does increase the chance of observing a
collision significantly. Hitting MD5's birthday point (50% chance of
collision) requires a couple tens of quintillions of hashes, but that's
probably a lot smaller number than people expect, and a not-unreasonable
number to see in a sufficiently large-scale data archiving system.

(meanwhile there's stuff in the wild using things like Java hashCode() to
determine identities, and the birthday point on that is only around 75k IIRC)
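
Both figures follow from the usual birthday approximation; a quick sanity check in Python:

    from math import log, sqrt

    def birthday_point(bits):
        # Number of uniformly random values at which a collision becomes
        # roughly 50% likely: n ~= 1.1774 * sqrt(2**bits).
        return sqrt(2 * log(2)) * 2 ** (bits / 2)

    print(f"{birthday_point(128):.2e}")   # ~2.2e19 for MD5: a couple tens of quintillions
    print(f"{birthday_point(32):.0f}")    # ~77163 for a 32-bit hashCode()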

~~~
hvidgaard
"probably a lot smaller number than people expect"

If you crunch your way through files (and let's assume they look random from
the hash function's point of view) for 10 years, you would have to hash
60,000,000,000 files per second to have a 50% chance of finding a collision.
If the average file size is 1 KB, that is ~56 TB of data every single second.

MD5 is not bad as long as you're not expecting malicious input or gigantic
datasets.
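
A back-of-the-envelope reproduction of those numbers (the unit conventions are mine, so the last digits differ slightly from the figures above):

    # Taking 2**64 hashes as the rough 50% point for a 128-bit hash.
    seconds = 10 * 365.25 * 24 * 3600   # ten years in seconds
    rate = 2 ** 64 / seconds            # ~5.8e10, i.e. roughly 60 billion hashes/second
    print(f"{rate:.2e} files/s, {rate * 1024 / 1e12:.0f} TB/s at 1 KiB per file")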

~~~
ubernostrum
Well, a couple years ago it was reported that YouTube alone receives 300
hours' worth of video uploads every minute. That's without considering other
social-media properties with file-upload capability, or PaaS offerings like
AWS.

So I'd say it's certain that the requisite number of MD5 hashes have been
generated across the aggregate of those systems (assuming enough of them use
MD5 in some fashion), and I'd personally bet money there's at least one real-
world deployment out there that's handled enough all on its own to hit the
birthday point.

~~~
hvidgaard
Humans really cannot fathom the magnitude of the numbers involved here. 2^64
is an astronomically large number - it's mind-bogglingly huge. Really, I will
put money on the claim that YouTube has not seen a collision yet. Let me
explain:

300 hours every minute equals 300 * 60 * 24 * 365.25 = 157,788,000 hours / year.

Let's assume it's all HD at 2 GB an hour; that is ~300 PB / year. Storing all
that video surely uses large-capacity hard drives with a 4k sector size, so
that is the smallest unit it makes sense to perform dedupe on.

That is ~82,726 billion blocks / year.

To have a 50% probability of a collision, we need to hash 2^64 blocks, so we
need ~222,985 years of video before the probability exceeds 50%.

Let's move away from YouTube for a moment and look at IP traffic as a whole.
Cisco estimated total IP traffic at 27,483 PB / month in 2011. The growth rate
is slowing every year: from 2010 to 2011 it was only 34%, down from 65%
between 2005 and 2006. If I assume 30% growth from 2011 onward, we end up with
~291 exabytes / month, or 3,497 exabytes / year, in 2020. This is total IP
traffic and will include a lot of duplicated data (e.g. Netflix streams).

Again, let's assume a 4 KB block; the total estimated IP traffic in 2020 then
amounts to ~961 × 10^15 blocks. We need 2^64 blocks for a 50% probability of
collision; that is more than 19 times the total estimated IP traffic in 2020,
the vast majority of which is duplicated.
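
The YouTube half of that estimate, spelled out (I'm guessing GiB vs GB here, which is why the block count only roughly matches the figure above):

    hours_per_year  = 300 * 60 * 24 * 365.25        # 157,788,000 hours uploaded per year
    bytes_per_year  = hours_per_year * 2 * 2 ** 30  # ~3.4e17 bytes at 2 GiB/hour (~300 PB)
    blocks_per_year = bytes_per_year / 4096         # ~8.3e13, i.e. ~83,000 billion 4 KiB blocks
    years_to_50pct  = 2 ** 64 / blocks_per_year     # ~220,000 years to reach 2^64 blocks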

My claim is that unless you're Google, MD5 is fine. If you are Google, MD5 is
still fine, but you should be preparing your system to accept a hash with a
larger output space.

------
coroutines
A friend wrote this a long time ago when he was a young C++ coder. It's hosted
on another friend's GitHub, but the author is John Brooks, the author of
Ricochet.

[https://github.com/rburchell/fastdup](https://github.com/rburchell/fastdup)

:-)

------
oever
I did some quick checks to investigate how size helps with deduplication.

Size is a fast initial differentiator because scanning for file size only
needs to read metadata, but how well does it differentiate?

    
    
      # all files
      find ~ -type f | wc -l
      1098045
      # files with a unique size
      find ~ -type f -printf '%s\n'|sort|uniq -c|grep ' 1 '|wc -l
      48437
      # most prevalent sizes (prevalence vs size)
      find ~ -type f -printf '%s\n'|sort |uniq -c |sort -nr|head
      36972 0
      12707 24
       9480 22
       8036 39
       5025 12
       4804 26
       3879 6
       3232 76
       2946 38
       2855 20
    

In my home dir, less than 5% of files have a unique size.

------
rurban
We wrote such dupefinders as a fun hackathon contest organized by the Dallas
Perl user group in 2013:

[http://perlmonks.org/?node_id=1067570](http://perlmonks.org/?node_id=1067570)

discussion at [http://mail.pm.org/pipermail/dfw-pm/2013-December/thread.htm...](http://mail.pm.org/pipermail/dfw-pm/2013-December/thread.html)

=> [http://dfw.pm.org/winners.html](http://dfw.pm.org/winners.html)

It is really not easy. I was the only one to use Bloom-filter hashes, which
explains why mine was fastest and smallest, but unfortunately not correct.

------
signa11
one simple option might be to just use sha256 sums on files. if done
incrementally, it's not that expensive computationally, unless files are very
large (a couple of GBs).

one way to mitigate that might be to take a _few_ random, say, 1k chunks of
file content and hash those. as long as you have a map of (offset, hash),
de-duping should be just fine.

the only thing is, i don't really know what an upper bound on 'few samples'
would be. asymptotically, it should approach file-size/chunk-size, but
ideally it should be way less than that.

any ideas on how to determine a minimal but complete set?
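
One possible shape for that sampling prefilter (the sample count, chunk size, and offset scheme are arbitrary choices of mine); a matching fingerprint is only a candidate and still needs a full comparison:

    import hashlib
    import os
    import random

    def sampled_fingerprint(path, samples=8, chunk=1024, seed=0):
        # Hash the file size plus a handful of 1 KB chunks at pseudo-random
        # offsets. A fixed seed means files of equal size are sampled at the
        # same offsets, so their fingerprints are comparable; the samples can
        # still miss a difference elsewhere in the file.
        size = os.path.getsize(path)
        rng = random.Random(seed)
        if size > chunk:
            offsets = sorted(rng.randrange(size - chunk) for _ in range(samples))
        else:
            offsets = [0]
        h = hashlib.sha256(str(size).encode())
        with open(path, "rb") as fh:
            for off in offsets:
                fh.seek(off)
                h.update(fh.read(chunk))
        return h.hexdigest()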

~~~
Someone
The minimal set is the full content of the file.

Consider the case of a log file and its backup made one short log message
earlier.

If you want to gamble on using only a partial hash, I would hash the last few
kb of each file, as that (at least based on this 30-second 'analysis' of the
problem domain) likely would work better in practice. Theoretically, it is as
vulnerable, though.

Edit: if you have a VM image on your disk, and that small append happens to a
log file in the image, that small change can happen anywhere. It also can be
even smaller; someone might run 'touch' on an already existing file.
Theoretically, that could be a one-byte change.

~~~
daurnimator
The idea is to use a two stage process:

1\. take a sha1 hash of the last 4KB of each file.

2\. for any equal hashes, compare the _whole_ file.

With this method you should be able to skip reading many large files in their
entirety.
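
A literal Python rendering of those two steps (a size bucket would normally come before stage 1, as discussed upthread; filecmp does the whole-file check):

    import filecmp
    import hashlib
    from collections import defaultdict

    def find_duplicates(paths):
        # Stage 1: bucket files by a SHA-1 of their last 4 KB.
        buckets = defaultdict(list)
        for path in paths:
            with open(path, "rb") as fh:
                fh.seek(0, 2)                      # learn the file size
                fh.seek(max(fh.tell() - 4096, 0))
                buckets[hashlib.sha1(fh.read()).digest()].append(path)
        # Stage 2: confirm candidates with a full byte-for-byte comparison.
        pairs = []
        for group in buckets.values():
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    if filecmp.cmp(a, b, shallow=False):
                        pairs.append((a, b))
        return pairs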

~~~
Someone
That is _an_ idea, and it would, in typical cases, avoid reading _most_ large
files in their entirety, but it is not what I read in signa11's comment.

