
Rediscovering the Rsync Algorithm - nicolast
http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/
======
mitchty
As cool as the rsync algorithm is, I'd much rather we had the dsync utility
outlined in this USENIX '08 paper.
[http://www.usenix.org/event/usenix08/tech/full_papers/pucha/...](http://www.usenix.org/event/usenix08/tech/full_papers/pucha/pucha.pdf)

An adaptive protocol that dynamically matches the transfer strategy to the
system's load, whether that load is CPU, disk, or network. Does anyone know
what happened to this?

~~~
arethuza
Apologies - downvoted by mistake!

------
rlpb
There is a Better Way. Instead of using fixed-size blocks, use variable-sized
blocks. Decide the block boundaries using the data in the blocks themselves.
This will reduce your search from O(n^2) to O(n).

Tarsnap does this. My project (ddar) does the same.

~~~
willvarfar
I've heard that called 'anchored hashing'. It's recently been used by FreeArc:
[http://encode.ru/threads/43-FreeArc?p=28237&viewfull=1#p...](http://encode.ru/threads/43-FreeArc?p=28237&viewfull=1#post28237)

How does ddar do it?

~~~
rlpb
ddar calculates a 32-bit Rabin fingerprint over a sliding 48-byte window (a
Rabin fingerprint is a type of checksum that allows quick calculation of a
"slide" from just the previous fingerprint, the new byte coming in, and the
old byte going out). The fingerprint is compared against a mask which has n
bits set, where 2^n is the desired target block size. If there's a match, then
the location of the window is taken to be a block boundary. With random data
this should lead to a geometric distribution of block sizes with a mean near
the target block size. Minimum and maximum block sizes are also enforced for
pathological cases, which skews this distribution slightly.

Blocks whose boundaries are determined by this algorithm are hashed (SHA256
currently). The hash is used to key each block in a reference-counted
key/value store.

Then each archive member is just a list of block hashes.
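
To make that concrete, here's a minimal Python sketch of this style of
content-defined chunking. It is not ddar's actual code: a Rabin-Karp-style
modular rolling hash stands in for the true Rabin fingerprint (which works
over GF(2)), and the mask and block-size bounds are illustrative values, not
ddar's real parameters.

    import hashlib

    WINDOW = 48                         # sliding window, as described above
    MASK = (1 << 13) - 1                # 13 bits set -> target block ~ 8 KiB
    MIN_BLOCK, MAX_BLOCK = 2048, 65536  # bounds for pathological inputs
    B, MOD = 257, 1 << 32               # rolling-hash base and modulus
    B_W = pow(B, WINDOW, MOD)           # precomputed B^WINDOW for the "slide"

    def chunk(data):
        """Yield content-defined chunks of `data` (a bytes object)."""
        start, h = 0, 0
        for i in range(len(data)):
            # Slide the window: fold in the new byte, drop the old one.
            h = (h * B + data[i]) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * B_W) % MOD
            size = i + 1 - start
            if (size >= MIN_BLOCK and (h & MASK) == MASK) or size >= MAX_BLOCK:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    # Key each block by its SHA-256 hash; an archive member is then
    # just the ordered list of keys into the block store.
    store, member = {}, []
    with open("example.bin", "rb") as f:    # placeholder input file
        for block in chunk(f.read()):
            key = hashlib.sha256(block).digest()
            store[key] = block              # a real store would reference-count
            member.append(key)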

~~~
huhtenberg
If you were to elaborate on and brush up the description, it would make an
excellent HN submission. Good stuff anyway.

------
sciurus
The rsync algorithm and program are both great, and I use the program a lot to
update directory trees across the network. It's also my default tool for
synchronizing two directories on the same system. The rsync program correctly
optimizes for this case by skipping the rsync algorithm and copying changed
files in full. However, it still uses multiple processes and seemingly still
calculates some hashes, making it slower than it needs to be.

Joey found [0] that running rsync once in dry-run mode to find which files
have changed, copying each of them with cp, then running rsync a second time
to handle things like deletions and file permissions resulted in a major
speedup.

[0] <http://kitenet.net/~joey/blog/entry/local_rsync_accelerator/>
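
A rough Python sketch of that trick, under some assumptions: the rsync flags
(-a, --dry-run, --out-format, --delete) are real, but the output parsing is
simplified and SRC/DST are placeholder paths.

    import os, shutil, subprocess

    SRC, DST = "/data/src/", "/data/dst/"   # placeholder paths

    # Pass 1: dry run that prints just the names rsync would transfer.
    out = subprocess.run(
        ["rsync", "-a", "--dry-run", "--out-format=%n", SRC, DST],
        capture_output=True, text=True, check=True).stdout

    # Copy each changed regular file with a plain copy (no forked
    # receiver process, no block hashing).
    for name in out.splitlines():
        src = os.path.join(SRC, name)
        if os.path.isfile(src):
            dst = os.path.join(DST, name)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

    # Pass 2: let rsync handle deletions, permissions, and anything missed.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)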

------
omh
_Don’t walk the folder and ‘rsync’ each file you encounter_

If I just tell rsync to synchronise two directories, what does it do
internally? I might have assumed it does the more naive option, but in
practice it seems to do a lot of upfront calculation, which suggests it's
doing something more sophisticated.

~~~
pronoiac
Rsync will compare timestamps so that it only transfers changed files.

~~~
paxswill
The default behavior also checks file size in addition to modification time.

~~~
burgerbrain
Is there a purpose to that?

~~~
ominous_prime
Modification time can be altered, or incorrect. File size is just another fast
check that can be done with the same data returned by stat'ing the file.

Rsync can also compare hashes if you don't trust the size and time on the
files.
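
For illustration, that default "quick check" boils down to comparing two
fields from a single stat() call. A hedged Python equivalent, ignoring
rsync's --modify-window tolerance and other options:

    import os

    def quick_check_unchanged(src_path, dst_path):
        """Roughly mimic rsync's default quick check: same size, same mtime."""
        try:
            s, d = os.stat(src_path), os.stat(dst_path)
        except FileNotFoundError:
            return False   # missing on either side -> needs transfer
        return (s.st_size == d.st_size
                and int(s.st_mtime) == int(d.st_mtime))

The hash comparison mentioned above is rsync's --checksum (-c) option, which
replaces this quick check with a full-content checksum of every file.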

------
thibaut_barrere
Side note, but in case it's helpful to someone: if you need to have rsync.exe
on Windows, here's one path:

<https://github.com/thbar/rsync-windows>

------
Ygor
Do you know of any implementations of the rsync algorithm other than the
actual rsync program? And where are they used?

Do you know how and where dropbox uses rsync?

There have been some attempts to port the rsync program to other
languages/platforms [1], but they are usually not in sync with the current
rsync program. I am talking about ports of the program, not new
implementations of the algorithm.

[1] <https://github.com/MatthewSteeples/rsync.net>

~~~
sciurus
There's librsync [0], although it's not compatible with the rsync program.
Duplicity [1] uses it.

[0] <http://librsync.sourceforge.net> [1] <http://duplicity.nongnu.org>

------
ajays
rsync is great. I use the "-H" and "--link-dest" options to make incremental
backups that look like snapshots. I've been doing this for the better part of
a decade; I'd be interested to know if there's A Better Solution(tm) out
there...
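
For reference, the pattern being described looks roughly like the sketch below
(the paths and the "latest" symlink are placeholder conventions, not part of
rsync). --link-dest makes unchanged files hard links into the previous
snapshot, so every dated directory looks like a full backup while only changed
files take space.

    import datetime, os, subprocess

    SRC = "/home/me/"               # placeholder source
    BACKUPS = "/backups"            # placeholder backup root
    dest = os.path.join(BACKUPS, datetime.date.today().isoformat())
    link = os.path.join(BACKUPS, "latest")   # symlink to the last snapshot

    cmd = ["rsync", "-aH", "--delete", SRC, dest]
    if os.path.exists(link):
        # Unchanged files become hard links into the previous snapshot.
        cmd.insert(1, "--link-dest=" + link)
    subprocess.run(cmd, check=True)

    # Atomically repoint "latest" at the snapshot we just made.
    tmp = link + ".new"
    os.symlink(dest, tmp)
    os.replace(tmp, link)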

~~~
andrewcooke
rdiff-backup <http://rdiff-backup.nongnu.org/> may be what you are looking
for?

------
jeet-singh
cool

------
david_a_r_kemp
If someone committed any of that code to a repository I was working on, I'd
hang them up. It's 2012 and people are still using one- and two-letter
variable names.

An interesting article, but I have neither the time nor the inclination to
understand the code, which is the core of it.

~~~
taeric
Of course, this is a wonderful example of how a literate style works. Despite
the short variable names, the code is easy to understand because of the
accompanying text. My (admittedly biased) view is that it is easier to
understand than a similar snippet with "good variable names" but no
accompanying explanation.

That is, if you aren't willing to read this article, then I have my doubts
that you would have worked through any free-standing code.

~~~
phillmv
Yeah, but you should still have meaningful variable names.

Let's be honest here: very few people practice a good literate style.

~~~
taeric
Hmm... I definitely do not want to just toss out meaningful variable names.
I'm torn, though, as I feel that if we had better practices in place to
promote literate styles in general, we'd be in a better place. (That is, I
think the return from a literate style is higher than the return from
meaningful variable names.)

It really just comes down to adding a narrative to what you're doing. I think
the README convention that GitHub has helped push is doing more to get people
using each other's code than any amount of variable naming.

~~~
phillmv
>I think the README convention that GitHub has helped push is doing more to
get people using each other's code than any amount of variable naming.

Well… the README file is an ancient convention that probably stretches all the
way back to the '70s.

IMHO, no, not at all. GitHub is helping people because it lowered the cost of
publishing code by 1) making it easy and 2) putting the code at the forefront
of the project.

In my ideal world, we would all use something like Docco -
<http://jashkenas.github.com/docco/> - but in my day-to-day I count myself
lucky if I find the barest amount of class documentation, let alone a literate
style.

So: people will do what is easy. Everyone should start by having meaningful
variable names.

~~~
taeric
Yes, apologies, I meant to word that better. I know GitHub did not invent the
README; it just seems they are sparking a revival of it. By that I mean you
don't just get a bloody jar file (or whatever) and have to go looking for ways
to use it: take a look at any good shared project out there and there's a text
file outlining how to get going with it.

I think this is possibly my being jaded by Java's documentation conventions.
They can be great for index-level documentation, but they are absolutely
terrible at showing use cases and explaining why things were done the way they
are.

