
The rsync algorithm.

There is a theorem that, in general, you cannot remotely compare two files using less network traffic than it would take to simply send one of the files over the wire. But rsync does it. How? By accepting a theoretically possible but vanishingly unlikely chance of getting the wrong answer.

What rsync does is compare hashes of ranges of bytes. If the hashes come out the same, the two ranges are assumed to be identical with no further investigation. It is therefore possible for different files to be judged identical, but it is unlikely that any two hashes rsync has ever generated have accidentally collided when the underlying data differed.

(I do have to say "accidentally" because rsync uses MD5 as a hashing algorithm, and people have deliberately created collisions in MD5.)
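
To make the core trick concrete, here is a minimal Python sketch of hash-based comparison. It is an illustration of the idea, not rsync's actual wire protocol; the block size and function names are made up:

    import hashlib

    BLOCK_SIZE = 4096  # hypothetical; rsync chooses its own block size

    def block_hashes(data: bytes) -> list[bytes]:
        # Hash each fixed-size block; the hash list is far smaller
        # than the data itself.
        return [hashlib.md5(data[i:i + BLOCK_SIZE]).digest()
                for i in range(0, len(data), BLOCK_SIZE)]

    def blocks_to_send(local: bytes, remote_hashes: list[bytes]) -> list[int]:
        # Indices of local blocks whose hash differs from the remote's.
        # Blocks with matching hashes are assumed identical, sight unseen;
        # that assumption is exactly where the collision risk lives.
        local_hashes = block_hashes(local)
        return [i for i, h in enumerate(local_hashes)
                if i >= len(remote_hashes) or h != remote_hashes[i]]

Only the short hash list and the differing blocks need to cross the wire.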




rsync was indeed a dramatic "this should not be possible" feat. But the way it uses rolling hashes is not the simplest possible. One side sends hashes of fixed-size chunks, and the other scans its input for rolling-hash matches, which only works between two sides and requires non-cacheable CPU work on at least one of them.
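
A toy version of that rolling-scan step, using a plain sum-of-bytes checksum in place of rsync's actual Adler-32-style weak checksum (the names and window size here are arbitrary):

    WINDOW = 4096  # window/block size, arbitrary for this sketch

    def rolling_matches(data: bytes, wanted: set[int]) -> list[int]:
        # Slide a window over every byte offset. Updating the checksum
        # costs O(1) per step, but the scan itself must touch every byte,
        # which is the non-cacheable CPU work mentioned above.
        if len(data) < WINDOW:
            return []
        s = sum(data[:WINDOW])
        hits = [0] if s in wanted else []
        for i in range(1, len(data) - WINDOW + 1):
            s += data[i + WINDOW - 1] - data[i - 1]  # byte in, byte out
            if s in wanted:
                hits.append(i)  # weak match; rsync confirms with a strong hash
        return hits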

There is an even simpler idea at the core of many modern backup programs (e.g. bup, attic, borg)! Possibly pioneered in `gzip --rsyncable`, not sure. https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli... Each side slices files into chunks at points where rolling hash % N == 0, giving variable-size chunks averaging N bytes. These cut points usually re-synchronize after insertions and deletions. This allows not just comparing 2 files for common parts, but also deduplicated storage of ANY number of files as lists of chunk hashes, pointing to just one copy of each chunk!
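
A minimal sketch of that content-defined chunking, in the same spirit as bup/borg (which use a buzhash; the polynomial rolling hash and all the parameters below are just illustrative stand-ins):

    import hashlib

    AVG = 4096      # target average chunk size N
    WINDOW = 64     # rolling-hash window, arbitrary here
    BASE, MOD = 257, (1 << 61) - 1

    def chunks(data: bytes) -> list[bytes]:
        # Cut wherever the rolling hash of the last WINDOW bytes is
        # divisible by AVG. Cut points depend only on nearby content,
        # so they line up again after an insertion or deletion.
        out, start, h = [], 0, 0
        top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * top) % MOD
            h = (h * BASE + b) % MOD
            if i + 1 >= WINDOW and h % AVG == 0:
                out.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            out.append(data[start:])
        return out

    def dedup_store(files: dict[str, bytes]):
        # One copy of each unique chunk, keyed by hash; each file is
        # reduced to a list of chunk hashes.
        store, index = {}, {}
        for name, data in files.items():
            ids = []
            for c in chunks(data):
                h = hashlib.sha256(c).digest()
                store[h] = c  # duplicates overwrite with identical bytes
                ids.append(h)
            index[name] = ids
        return store, index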


rsync also checks modification dates. You would have to be extremely unlucky for both the checksums and the modification dates to match while the files were actually different.
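
For what it's worth, rsync's default "quick check" is even more aggressive than that: if size and modification time already match, it skips the file without hashing anything. A rough sketch of such a pre-filter (names hypothetical):

    import os

    def quick_check_equal(a: str, b: str) -> bool:
        # Treat files as unchanged if size and mtime both match,
        # without reading a single byte of content.
        sa, sb = os.stat(a), os.stat(b)
        return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)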



