
Out-Tridging Tridge by improving rsync - zdw
http://www.anchor.com.au/blog/2013/08/out-tridging-tridge/
======
chetanahuja
OK, this article had me intrigued (and the fix presented to speed up rsync for
large files is good), but then, this gem:

"The thing is, modern networks aren’t like that at all, they’re high bandwidth
and low latency"

No they're not. Unless you mean two machines sitting side by side in a data
center or perhaps within the same metro area connected via _wired_
connections.

Two machines sitting on either coast of US with the best wired connectivity at
either end are at least 70-80 ms away from each other. That's not "low"
latency.

Connected via anything less than a perfect connection on either end, you're
looking at 100+ ms latency.

If one of the ends is on a non-wired internet connection (LTE, WiMAX, etc.),
you're looking at 150+ ms, with a high standard deviation in ping latencies.

The slow pace of latency improvements is as much a fact of life as the slow
pace of battery-life improvements. Perhaps that's because both are constrained
by hard physical limits of nature.

~~~
mcpherrinm
In the article, the author mentions sending "1.2 terabytes in a few hours"
with 10ms latency. This sounds like a gigabit network with a few router hops
in the middle. Maybe 100 Mbit if "a few hours" is interpreted more generously
than I would. So we're talking about connecting two machines, possibly in
different buildings, but likely within a thousand kilometers.
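A quick back-of-envelope check of that figure (taking "a few hours" to mean
three, which is an assumption):

```shell
# 1.2 TB moved in an assumed 3 hours, expressed in Gbit/s
awk 'BEGIN { printf "%.2f Gbit/s\n", 1.2e12 * 8 / (3 * 3600) / 1e9 }'
# prints "0.89 Gbit/s"
```

So roughly a saturated gigabit link, consistent with the guess above.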

It's honestly the use case I have most often, and certainly one worth having
tools support. This person seems to be worried about off-site backups, so I'm
thinking "enterprise", not "cross-country home user".

------
kbenson
Does copying/moving a file over another not trigger copy-on-write in btrfs? If
not, it seems a much simpler (but much less cool and useful for all) solution
would be to patch rsync with an option to allow writing the temp file over the
original when done. While still non-atomic, you'd get the copy-on-write
semantics you need. Unfortunately it will use much more IO. There are ways to
mitigate the extra IO, such as creating a special diff-formatted temp file,
but that's getting out of the territory of "simple".

Also, if the FS is saving copies of the temp file and you don't like that, the
--temp-dir option might help.

Then again, depending on how btrfs treats overwriting a file with a move, if
the temp file had a timestamp in the name (patch probably needed) in some form
before replacing the original, that might be good enough.

~~~
keeperofdakeys
Most Unix file systems don't have the semantics of a "move". You unlink an
inode from a filename, and link another inode (usually the inode for the tmp
file). Then you unlink the original tmp filename. As far as btrfs is
concerned, there is no relation between these inodes, and without copying the
file (like you suggest), you can't improve this.
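As a concrete illustration of that unlink/link dance (filenames invented for
the example), the "move" leaves the target name pointing at a brand-new inode:

```shell
# Write the new version to a temp file, then rename it over the
# original -- the target name now refers to a different inode
echo "old contents" > target
ls -i target                  # note the inode number
echo "new contents" > .target.tmp
mv .target.tmp target         # rename(2): atomic swap of the directory entry
ls -i target                  # a different inode number now
```

Since the filesystem sees two unrelated inodes, there's nothing for btrfs to
share between them.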

------
gmac
Sounds like a worthwhile improvement to rsync, but I wonder why this setup is
preferred to duplicity [1] or rdiff-backup [2], which both also use rsync
(librsync) to perform incremental backups. I've had good experiences with
duplicity in particular.

[1] http://duplicity.nongnu.org/

[2] http://www.nongnu.org/rdiff-backup/

~~~
fhars
rdiff-backup will give you copies of the changed files (so 1.2 TiB for the
database file mentioned in the article) every time you run a backup, for each
old version you keep. It will not transfer that much data over the network
(since it uses the rsync algorithm), but it stores that much on disk.

On the other hand, if you use a filesystem with copy-on-write snapshots and
in-place modification of the changed files, you will only use as much disk
space as there are changed blocks in the file for each version you keep. (Of
course you have no additional redundancy if you keep n older versions, as each
bit of data is only stored once physically. But rdiff-backup also only ever
stores one version of an unchanged file, so you should alternate between
different backup disks anyway.)

~~~
kbuck
This isn't entirely correct; rdiff-backup will give you a full copy of the
latest version of the file as well as a set of binary diffs that can be
applied in sequence to roll it back to an earlier version. rdiff-backup will
actually end up being a little more space efficient for each incremental
change since its diffs don't need to store entire filesystem blocks.

------
dylangs1030
I love rsync. It's also really useful for backing up local files
incrementally. It's an easy CLI utility, but there's a GUI as well, called
grsync.

You can use cron to automate them, which is how I back up my Linux system
nightly. Totally recommend it for server and local backups.
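For example, a nightly crontab entry might look like this (paths and schedule
are just an illustration):

```
# m h dom mon dow  command -- run at 02:30 every night
30 2 * * * rsync -a --delete /home/ /mnt/backup/home/
```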

~~~
koenigdavidmj
One of the useful modes for backups is to take a parameter (--link-dest)
specifying a directory containing a previous backup. It will build hard links
to the previous backup directory for files that did not change.

~~~
extra88
I use rsnapshot to automate this hard-link creation and backup rotation
hourly, daily, weekly, and monthly. You still use cron or something else (I've
used launchd on OS X systems) to schedule the runs, but rsnapshot takes care
of the rest.

------
huwr
Forgive me for not understanding: if btrfs makes copying files cheap, why do
they want to use --inplace?

