
How Rsync Works - thedookmaster
http://rsync.samba.org/how-rsync-works.html
======
temp45234
An interesting alternative to rsync is zsync
[http://zsync.moria.org.uk/](http://zsync.moria.org.uk/) . A very brief
summary of differences:

* Instead of the sender generating block checksums on demand for every transfer, they are computed once when the file is "published" and saved in a zsync metadata file

* This zsync metadata file is fetched (simple copy) and the receiver uses it to decide which portions of the file it needs to request. It then requests only those portions.

* Because of this simplification, the protocol can be reduced to work over plain stateless HTTP. Any HTTPD that supports range requests can be a zsync server. Remote zsync files are represented by HTTP URLs.

* Note, this all but removes the CPU requirement of the sender/server.

I've used zsync in some very large systems to efficiently distribute write-few
read-often files with only partial changes to many endpoints. Much more
scalable than rsync due to the lack of CPU cost for the server/sender.
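
The publish-once idea can be sketched roughly like this (a simplified fixed-offset version; real zsync also uses a rolling checksum so matches can be found at any byte offset, and the block size is chosen per file):

```python
import hashlib

BLOCK = 4096  # illustrative; real zsync picks a block size per file

def make_manifest(data: bytes) -> list[str]:
    """Publisher side: run once at publish time and serve the result
    next to the file (this stands in for the .zsync metadata file)."""
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def ranges_to_fetch(local: bytes, manifest: list[str]) -> list[tuple[int, int]]:
    """Client side: compare local blocks against the published manifest
    and return (offset, length) pairs to request as HTTP Range requests.
    A Range past EOF is simply truncated by the server."""
    wanted = []
    for idx, digest in enumerate(manifest):
        off = idx * BLOCK
        if hashlib.md5(local[off:off + BLOCK]).hexdigest() != digest:
            wanted.append((off, BLOCK))
    return wanted
```

The key point is the split: all hashing of the published file happens once, so the server does no per-client CPU work at transfer time.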

I also maintain a fork of zsync which runs using libcurl rather than the
original author's custom http client code. This fork is primarily to support
SSL: [https://github.com/eam/zsync](https://github.com/eam/zsync)

It's a cool project, check it out!

~~~
beagle3
All true, but do note that zsync is, at least for now, a single-file tool. If
you are rsyncing thousands of files over a slow connection (because only a
little has changed), rsync can often do this with just a handful of bytes more
than the actual changes, while zsync needs hundreds of bytes per file just to
see that nothing has changed.

Use zsync to distribute a small number of large files that have small changes.
If you need to rsync hierarchies with lots of files, rsync is still king.

~~~
temp45234
Absolutely true, the zsync client operates on a single file and doesn't
manipulate file metadata. But this is a solvable problem and I have written
wrappers which will deal with file hierarchies approximately as efficiently as
rsync. Here is one I developed to drive a CM system composed of many small
files, most of which are unchanging:
[https://github.com/yahoo/cm3/tree/master/azsync](https://github.com/yahoo/cm3/tree/master/azsync)

The additional process is to generate and send a list of filenames and
metadata attributes (which rsync must do as well) and to invoke zsync per-file
only if an update is necessary. For large trees of files which are largely
unchanged this is very efficient - much more so than fetching a zsync manifest
per-file.

The file path is generally the largest amount of data sent per-file, prior to
sending the zsync manifest. This is similar to rsync.
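
A wrapper along those lines can be sketched as follows (a simplified illustration, not the actual azsync code; a real version would compare mtimes and other metadata rather than hashing every file):

```python
import hashlib
import os

def tree_manifest(root: str) -> dict[str, tuple[int, str]]:
    """Record (size, content hash) per relative path. A real wrapper
    would also track mode, owner and mtime, as rsync does."""
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                data = f.read()
            manifest[rel] = (len(data), hashlib.sha1(data).hexdigest())
    return manifest

def files_needing_update(src: dict, dst: dict) -> list[str]:
    """Only these paths get a per-file zsync run; unchanged files cost
    nothing beyond their manifest entry (mostly the path itself)."""
    return sorted(p for p, meta in src.items() if dst.get(p) != meta)
```

The manifest is the only thing transferred for the whole tree; zsync is then invoked only for the paths this comparison flags.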

------
Ygor
The main problem with the standard rsync utility is the protocol. Check out
the Rsync Protocol section of this document:

"A well-designed communications protocol has a number of characteristics."

<list of characteristics>

"Rsync's protocol has none of these good characteristics."

...

"It unfortunately makes the protocol extremely difficult to document, debug or
extend. Each version of the protocol will have subtle differences on the wire
that can only be anticipated by knowing the exact protocol version."

This is why it is very hard to implement a client program that can communicate
with the standard rsync daemon on a server. You can always use the rsync
program itself to communicate with the server, but that is not always an
option - and even when it is, it can get ugly. On Windows, you need Cygwin or
similar to run rsync.exe, which can complicate the deployment of your desktop
app or shell extension.

An easy rsync client API would be useful if you were building an app that
stores files on an rsync server, because the rsync utility and the rsync
algorithm are great ways to efficiently synchronize files.

~~~
m0th87
What about librsync (as mentioned in the comments here)?

~~~
cshesse
Does librsync speak the rsync protocol or just create deltas?

~~~
adamtj
librsync is a library for building rsync workalikes. It is not compatible with
rsync itself.

librsync and the rdiff binary that wraps it can create a signature from a
destination file, create a patch from a signature and a source file, and can
apply a patch to a destination file. And that's about it. librsync doesn't
concern itself with the networking. That's up to you.

rdiff is a thin wrapper around librsync. librsync can easily do anything rdiff
can do, without having to fork a new process. You might wish the rsync
executable were built this way, but it is not.
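
The three rdiff steps can be illustrated with a toy fixed-block version (librsync's real delta uses a rolling weak checksum plus a strong hash, so it finds matches at any offset; this sketch only matches on block boundaries):

```python
import hashlib

BLOCK = 64  # toy block size; librsync negotiates this per signature

def signature(dst: bytes) -> dict[str, int]:
    """Step 1: checksum the destination (old) file block by block."""
    return {hashlib.md5(dst[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(dst), BLOCK)}

def delta(sig: dict[str, int], src: bytes) -> list:
    """Step 2: describe the source (new) file as copies from the old
    file plus literal bytes for anything the signature doesn't cover."""
    ops = []
    for i in range(0, len(src), BLOCK):
        chunk = src[i:i + BLOCK]
        off = sig.get(hashlib.md5(chunk).hexdigest())
        ops.append(("copy", off, len(chunk)) if off is not None else ("lit", chunk))
    return ops

def patch(dst: bytes, ops: list) -> bytes:
    """Step 3: rebuild the new file from the old file plus the delta."""
    out = b""
    for op in ops:
        out += dst[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return out
```

As with librsync itself, nothing here touches the network: moving the signature one way and the delta the other is entirely up to the caller.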

------
beagle3
I'm almost sure that this description is out of date and describes rsync 2.

rsync 3 does not need to create or transfer the entire file list up front - in
fact, it will start immediately, with no idea how many files are left; it's
not uncommon for it to keep saying "just 1000 more files left" while working
through a million files. You can force it to prescan all files with -m
("--prune-empty-dirs" or something like that) if you insist.

Also, I might be mistaken, but I think rsync3 doesn't even transfer the entire
file list to the other side - it will treat the directory like a file (which
contains file names, attributes, and checksums), and transfer _that_ using
rsync. If nothing changed, this will take a few bytes. If something did, the
entire directory listing is rsynced to the other side, and it is determined
recursively which files and directories actually need to be transferred --
with every directory that doesn't have any changes skipped, just like a file
that doesn't need any changes.

------
Theodores
The 'rolling checksum' part of the implementation is brilliant.

I have often wondered why it is that rsync is so life-saving-ly quick and how
it is that a few small changes to a massive file (e.g. from mysqldump) can be
copied up to a server from the slow end of an ADSL line so quickly. Now I know
about the 'rolling checksum' I can see what is going on.
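
The rolling checksum can be sketched in a few lines (a simplified version of rsync's weak checksum; the window size below is illustrative):

```python
M = 1 << 16  # each half of the weak checksum is kept mod 2^16

def weak_sum(block: bytes) -> tuple[int, int]:
    """Checksum a whole window from scratch (the slow way)."""
    a = sum(block) % M
    b = sum((len(block) - i) * c for i, c in enumerate(block)) % M
    return a, b

def roll(a: int, b: int, out: int, inp: int, blen: int) -> tuple[int, int]:
    """Slide the window one byte in O(1): drop byte `out`, add byte `inp`."""
    a = (a - out + inp) % M
    b = (b - blen * out + a) % M
    return a, b
```

This is why a few small changes in a huge file are cheap to find: the sender can test a match at every byte offset without rehashing each window from scratch, and only windows whose weak sum matches get the expensive strong checksum.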

Note that I work with people who use 'FTP' to copy files, or even worse,
people who find FTP too complicated and have to send me files on a 'Dropbox'
thing so I can download them and upload them for them, notionally with 'FTP'.
(I will use rsync instead, not least for the bandwidth control options).

I have even had micro-managers get me to get FTP to work on the server for
them, despite my protestations about it being insecure (which it really is if
you use a Windows PC and something like Filezilla).

Obviously I only use rsync and scp. Without aforementioned micro-managed
requests I would not even know if FTP was installed on the server side.

My point is that it may be easy for a few folks here to criticise rsync;
however, there are a lot of people, from clients to managers and even talented
programmers, who just don't have a clue about rsync and are stuck in some
stone age of using things like FTP.

~~~
pilif
_> (which it really is if you use a Windows PC and something like Filezilla)._

What does Windows as a client OS have to do with it? FTP is insecure because
it transmits credentials in the clear and because it opens additional ports
for the actual transfer of data. Neither of which are a concern of the client.

~~~
Theodores
Thanks for your point - I have just remembered that I once recovered someone's
FTP password from a TCP/IP stream, and it was fun but not difficult!

However, of the notable attacks I have witnessed recently, Filezilla's
plain-text password file was the attack vector. Get that and away you go!

------
almost
Something that wasn't clear to me right away is that the generator is running
on the remote system (assuming a remote transfer), so in the generator ->
sender -> receiver pipeline, each -> is data going over the network.

------
meltzerj
What are peoples' thoughts on using rsync for production deployment?

~~~
peterwwillis
Rsync is just a file transfer tool with extra options. Deployment involves a
lot more pieces. The file transfer component of your deployment could
certainly use Rsync, assuming you aren't limited to a particular transport
protocol (though rsync does support HTTP proxies!)

Here are some of the neat features of Rsync you can take advantage of for
deployments:

* Fault tolerance: when an error happens at any layer (network, local i/o, remote i/o, etc), Rsync will report it to you. Trapping these errors will give you better insight into the status of your deployments.

* Authentication: the Rsync daemon supports its own authentication schemes.

* Logging: report various logs about the transfer process to syslog, and collect from these logs to learn about the deployment status.

* Fine-grained file access: use a 'filter', 'exclude' or 'include' to specify what files a user can read or write, so complex sets of access can be granted for multiple accounts to use the same set of files (you can also specify specific operations that will always be blocked by the daemon)

* Proper permissions: force the permissions of files being transferred, so your clients don't fuck up and transfer them with mode 0000 perms ("My deploy succeeded, but the files won't load on the server! Wtf?")

* Pre/post hooks: you can specify a command to run before the transfer, and after, making deployment set-up and clean-up a breeze.

* Checksums on file transfers for integrity

* Preserves all kinds of file types, ownership and modes, with tons of options to deal with different kinds of local/remote/relative paths, even if you aren't the super-user (including acls/xattrs)

* Tons of options for when to delete files and when to apply the files on the remote side (before, during or after transfer, depending on your needs)

* Custom user and group mapping
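
For illustration, a hypothetical rsyncd.conf module exercising several of the features above (the module name, paths and user are all made up):

```
# Hypothetical /etc/rsyncd.conf for a "deploy" module.
log file = /var/log/rsyncd.log

[deploy]
    path = /srv/app
    read only = false
    # daemon-level authentication
    auth users = deployer
    secrets file = /etc/rsyncd.secrets
    # force sane permissions regardless of what the client sends
    incoming chmod = D755,F644
    # pre/post hooks for deployment set-up and clean-up
    pre-xfer exec = /usr/local/bin/deploy-prepare
    post-xfer exec = /usr/local/bin/deploy-finish
    # block specific operations at the daemon
    refuse options = delete
```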

~~~
justinmk
> Checksums on file transfers for integrity

Are you sure rsync doesn't already do this?

~~~
peterwwillis
I'm saying, this is a feature of Rsync.

~~~
justinmk
Yeah, I completely mis-read your comment.

------
beedogs
Too bad there's no complementary document, "How Rsync Breaks", because that
one would be quite useful as well. I've had it fail in the most annoying and
arbitrary ways, and that has dissuaded me from using it in any real production
situations.

~~~
huhtenberg
Yep. The main caveat is that file updates are not transactional. If rsync is
stopped (crashed, disconnected) in the middle of updating a file, then what
you get is a corrupted file.

~~~
malone
When rsync needs to update a file it saves the contents to a temporary file
first and then copies it over at the end, which should be an atomic operation
on most filesystems. So you shouldn't end up with half updated files (unless
you use the --inplace switch), but you can end up in situations where half the
files in a directory are updated and half are not, which can be just as bad.

~~~
huhtenberg
Interesting, didn't know about the temp file. It doesn't really make updates
atomic, but it certainly reduces the chances of ending up with a partially
updated file.

~~~
beagle3
No, it DOES make a single file update atomic. What it doesn't do is make
multiple updates atomic.

The way rsync works, it CANNOT end up with a partially updated file! (unless
you use --inplace or --append which implies it - and it's your fault if you
do)

~~~
huhtenberg
Of course it CAN and it DOES NOT. If I flip two bits in a large file - one at
the head and one at the tail - then no matter how clever the _algorithm_ is,
the update cannot be atomic without proper support from the OS, because it
would involve two separate writes into the file.

On Windows there's Transactional NTFS whereby you can bind an open file to a
transaction and then have either all or no changes applied at once. But that's
only Vista+ and I am pretty sure rsync doesn't use it anyhow.

~~~
beagle3
Flip those two bits. What rsync will do on the target system is create a copy
of the file you want (with a name like .tempxasdiohkshlksdf-filename.ext)
which takes most of the data from the local copy, and a few kilobytes of
patches transferred. Then, when this file has been created, closed, its
attributes properly set, and it is an identical copy of the file on the source
system - it will rename ("move") the temporary file into the name that it
should have. This move operation is what makes everything atomic.

It does cost another copy of the file on disk, but it does NOT leave the file
in an inconsistent state. It is either the original file, or the new file - no
in between.

You CAN avoid this behavior by using the "--inplace" switch or the
"--append" switch, which tell rsync to just modify the file in-place.
However, this is NOT atomic, and NOT the default (for that exact reason).
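
The temp-file-plus-rename trick is easy to reproduce; a minimal Python sketch of the same pattern (names are illustrative, not rsync's actual code):

```python
import os
import tempfile

def atomic_replace(path: str, new_data: bytes) -> None:
    """Write a temp file in the same directory, then rename() it over
    the target. rename() within a single POSIX filesystem is atomic:
    readers see either the old file or the new one, never a mix."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(new_data)
            f.flush()
            os.fsync(f.fileno())  # data must be durable before the rename
        os.rename(tmp, path)      # the atomic step
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must be on the same filesystem as the target, which is why it goes in the same directory; rename() across filesystems would fail (and a copy-then-delete fallback would not be atomic).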

~~~
huhtenberg
OK, you win. I didn't realize rsync was solving for network-bound scenarios,
but in retrospect it makes sense.

------
cliveowen
I always found myself looking for a simple way to back up a hierarchy of
folders to an external device and then keep both copies synced; then I heard
about rsync and discovered that it does just that. I've been using it
exclusively for all of my backups - really useful.

EDIT: Also since we're talking about rsync, do you think the following options
are sufficient for syncing a folder hierarchy from the local disk to an
external flash drive?

rsync -aW --delete /source /destination

My main concern is the W option (--whole-file), which skips the usual
delta-transfer step (a step that slows down the already long process of
syncing) but might end up writing a lot of bytes and wearing out the memory
cells of the flash storage.

~~~
perlgeek
And if you ever need a two-way sync (ie files on both hosts might change, and
propagation to the other host is desired), look at unison.

~~~
beagle3
I've given unison a lot of chances over the years, but I keep going back to
rsync with "inbox" and "outbox" directories (if that can be done), or [le]git
push/pull/sync if not.

unison is very slow compared to rsync, and the versions at both ends must
match (which means you'll likely need to compile your own unless all your
machines run the same distro and version).

------
nicolast
You might be interested in [http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/](http://blog.incubaid.com/2012/02/14/rediscovering-the-rsync-algorithm/)

------
joeblau
Thanks! I'm building an app where I might need to implement this type of
syncing paradigm.

~~~
grn
You can take a look at librsync. It might be useful for you.

~~~
joeblau
Excellent, I'm looking into it right now. Thanks for the tip.

------
ape4
Rsync always sends a list of files (and their attributes). But typically most
files haven't changed, so it could just send the files that have changed since
the last sync.

~~~
ars
It doesn't know which files exist on the server.

And the file can change on the server just as easily as the client. So how can
it tell this without sending the complete list?

~~~
ape4
Yes, in the general case. But in the case of a backup that's done daily, the
sender can say: these are all the files that changed since we last did this.

~~~
ars
Rsync is designed for the general case. It's useful for backups, but not
designed for them.

------
UrsaFoot
A typo: s/transfe/transfer/

