Hacker News new | past | comments | ask | show | jobs | submit login

rsync was indeed a dramatic "this should not be possible" feat. But the way it uses rolling hashes is not the simplest possible. One side sends hashes of fixed-size chunks, the other scans input for rolling hash matches, which only works for 2 sides and requires non-cachable CPU work on at least one side.

There is an even simpler idea at the core of many modern backup programs (eg. bup, attic, borg)! Possibly pioneered in `gzip --rsyncable`, not sure. https://en.wikipedia.org/wiki/Rolling_hash#Content-based_sli... Each side slices files into chunks at points where rolling hash % N == 0, giving chunks of variable size averaging N. These points usually self-synchronize after insertions and deletions. This allows not just comparing 2 files for common parts, but also deduplicated storage of ANY number of files as lists of chunk hashes, pointing to just one copy of each chunk!




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: