

How Tarsnap uses Amazon Web Services - cperciva
http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html

======
natch
I'm just curious, do all the people using these kinds of paid services just
have really tiny hard drives? To back up 1TB of data via this system would
cost around $250 a month in storage and transfer charges, and that's assuming
very good compression is happening. What am I missing? Or maybe I'm just not
rich enough for $3,000 / year?
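The arithmetic behind those figures, sketched out (the per-GB storage price here is an assumption picked to match the commenter's numbers, not a quoted rate):

```python
# Back-of-the-envelope cost for backing up 1 TB.
# STORAGE_PER_GB is an assumed ~2008-era price in $/GB-month,
# chosen to reproduce the commenter's $250/month figure.
STORAGE_PER_GB = 0.25
TB_IN_GB = 1000

monthly = TB_IN_GB * STORAGE_PER_GB
print(f"${monthly:.0f}/month, ${monthly * 12:.0f}/year before transfer charges")
```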

Or maybe you're just doing a subset of your data, in which case I don't see
how snapshotting is such a big win. Nice to have, but not important in the
least if you are just doing a small subset of your data. True, it saves a few
pennies on transfer charges, but you could save that much by doing the upload
to S3 yourself.

My other concern with this is that if the tarsnap server ever goes away,
customers risk losing their data, since the server maintains the mapping of S3
objects to blobs. That's worrying. Assurances are not mechanisms.

~~~
jrockway
> My other concern with this is that if the tarsnap server ever goes away,
> customers risk losing their data, since the server maintains the mapping of
> S3 objects to blobs. That's worrying. Assurances are not mechanisms.

Agreed. I use duplicity instead, which is a Free program that is similar to
tarsnap. It backs up to S3, but uses many fewer PUTs (since archives are
several megabytes).

Anyway, a full backup of my homedir (minus music) costs me about $1.40 a month
to store (with incremental backups every night). A small price to pay knowing
that if my laptop blows up, I can be right back where I started in just a few
hours. (Or if I delete a file accidentally, it is back in seconds.)

<http://duplicity.nongnu.org/>

~~~
rtw
I also won't store my backups in anything but my own S3 buckets (they are
encrypted, so privacy is not the issue). Is duplicity stable, in your opinion?
I'm usually happy to run alpha/beta software, but this is a long-term need. I
am a big rdiff-backup fan, and this looks like a good alternative to my
current strategy.

My current strategy is a little lame but works quite well: I rsync daily to my
home server, and once or twice a week an EC2 instance is fired up with an
Elastic Block Store volume attached, and the home server runs rdiff-backup
against it.

------
mdasen
Tarsnap is really wonderful. I've been using it for about a month now and it's
really simple. From your perspective, you're simply creating tar archives. No
fuss, no muss.

On the tarsnap side, it makes sure not to duplicate storage or bandwidth for
duplicate parts. Anytime you want to get a specific backup back, you just
reference it by name. You can list the available archives. It's all encrypted.
Pricing is based on what you actually use (rather than being rounded up),
which makes it ideal for small jobs, since you can pay fractions of a cent.

It's definitely something to look into.

------
bentoner
Great post! I don't understand the paragraph about the cost of the PUTs and
GETs though. The saving you get by batching writes seems marginal. Can you
give some numbers?

~~~
cperciva
The average block size the tarsnap server sees is about 30 kB (the tarsnap
client tries to produce blocks of 64 kB on average, but then it compresses
them individually before sending them to the server). This means that for
every GB uploaded, there are about 33 thousand blocks.

S3 PUTs cost $0.01 per thousand PUTs, so writing each of the blocks as an
individual S3 object would cost $0.33 / GB for PUTs (plus the normal $0.10 /
GB for bandwidth).
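That works out in a few lines (using the block size and 2008 S3 prices quoted above):

```python
# Cost of one S3 PUT per tarsnap block, at the prices quoted above.
BLOCK_BYTES = 30 * 1000          # average compressed block size (~30 kB)
GB = 10 ** 9
PUT_PRICE = 0.01 / 1000          # $0.01 per thousand PUTs (2008 pricing)

blocks_per_gb = GB / BLOCK_BYTES              # ~33,000 blocks per GB uploaded
put_cost_per_gb = blocks_per_gb * PUT_PRICE
print(f"{blocks_per_gb:.0f} blocks/GB -> ${put_cost_per_gb:.2f}/GB in PUTs")
```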

~~~
jrockway
Just out of curiosity, why such small blocks?

~~~
cperciva
There are several advantages and disadvantages to different block sizes; but
most significantly, larger blocks would make tarsnap less efficient at
identifying duplicate data in the (very common) case where part of a file is
modified. In the end it came down to weighing all the factors and picking a
value which worked well.
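A toy sketch of why content-sensitive block boundaries help with duplicate detection (this is an illustrative rolling-sum chunker, not tarsnap's actual algorithm; the window size, boundary mask, and data are all made up):

```python
import hashlib

WINDOW = 16
MASK = 0xFF  # boundary when rolling sum % 256 == 255 -> ~256-byte chunks

def chunk(data: bytes) -> list[bytes]:
    """Toy content-defined chunking: a rolling sum over the last WINDOW
    bytes decides where chunks end, so boundaries depend on content, not
    position -- an insertion only disturbs the chunks near the edit."""
    out, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= WINDOW:
            rolling -= data[i - WINDOW]
        if rolling % (MASK + 1) == MASK and i + 1 - start >= WINDOW:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

# Deterministic pseudo-random data, then the same data with 100 bytes
# inserted in the middle (simulating an edit to a large file).
base = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(1024))
modified = base[:16000] + b"X" * 100 + base[16000:]

a = {hashlib.sha256(c).hexdigest() for c in chunk(base)}
b = {hashlib.sha256(c).hexdigest() for c in chunk(modified)}
print(f"chunks: {len(a)}, reused after the edit: {len(a & b)}")
```

With larger average chunks, each edit invalidates more stored bytes, which is the trade-off cperciva describes.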

------
sfk
Very interesting. Do you have an estimate of how tarsnap compares to rsync in
terms of bandwidth in the typical case that only a few files have been
modified?

~~~
cperciva
Tarsnap is more efficient than rsync in that case, because rsync has a
significant index overhead (sending a list of files, and sending a list of
blocks for each file) while the tarsnap client works locally to identify new
data and only uploads the new bits.
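One way to picture that index overhead (the file count and per-entry size below are invented purely for illustration):

```python
# Hypothetical tree -- these numbers are illustrative assumptions,
# not measurements of rsync or tarsnap.
num_files = 100_000
bytes_per_list_entry = 100   # name, size, mtime, etc. in a per-file index

index_overhead = num_files * bytes_per_list_entry
print(f"a per-file index costs ~{index_overhead / 1e6:.0f} MB of transfer "
      "even when nothing changed; a client that scans locally uploads "
      "only the new blocks")
```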

------
Harkins
You mention that it's more expensive than JungleDisk, but you don't say why
it's better.

~~~
cperciva
_You mention that it's more expensive than JungleDisk_

No I don't. Tarsnap isn't more expensive than JungleDisk overall -- yes, the
bandwidth and storage costs more, but tarsnap doesn't have per-request costs,
a fixed monthly service charge, or an up-front cost for the software. For some
people, tarsnap will be more expensive, certainly; but for many others tarsnap
will be cheaper.

 _you don't say why it's better_

Not in this blog post, no -- this post was about how tarsnap uses Amazon Web
Services. :-)

Details about why I think tarsnap is an amazingly superior backup system are
at <http://www.daemonology.net/blog/2008-11-10-tarsnap-public-beta.html>

