

Why Tarsnap doesn't use Glacier - cperciva
http://www.daemonology.net/blog/2012-09-04-why-tarsnap-doesnt-use-glacier.html

======
ChuckMcM
While I think this is a completely reasonable decision (not storing tarsnap
data in Glacier), I'm not sure the reasoning holds beyond "For the way we
designed tarsnap, it doesn't make sense."

When you look at how GFS, Bigtable, or Blekko's NoSQL data store are
implemented, there are layers, with the metadata as a pretty separable layer
from the data. In all of these large stores, that separation was put in so
the metadata could live in lower-latency storage than the 'bulk' data, which
makes access fast.

As cperciva relates, tarsnap spends a lot of time de-duplicating data (i.e.
writing only one copy of blocks with identical contents), which is great for
reducing your overall storage footprint and hence your costs. But if the cost
of storage is much, much smaller than the retrieval cost, then your design
methodology would be different.

So I would not be surprised if there was a way to build a functionally
equivalent product to tarsnap that had lower storage costs if the bulk data
was in Glacier and the index data was in S3, or if the index data was designed
such that a document recovery was exactly two retrievals (a catalog, and then
the data). And of course such a system would not de-duplicate, as that would
result in potentially more retrievals.

It seems that if the price ratio between S3 and Glacier was higher than the
de-dupe ratio, then Glacier would 'win' with no de-dupe; otherwise de-dupe
would still win. Thoughts?
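
A rough back-of-the-envelope version of that comparison (the prices and the
de-dupe ratio below are illustrative assumptions, not tarsnap's actual
numbers):

    # Storage cost: de-duplicated data in S3 vs raw (non-deduped) data in Glacier.
    S3_PRICE = 0.125       # $/GB-month, approximate 2012 first-tier S3 price
    GLACIER_PRICE = 0.01   # $/GB-month, 2012 Glacier price

    def monthly_cost(raw_gb, dedupe_ratio):
        s3_deduped = (raw_gb / dedupe_ratio) * S3_PRICE
        glacier_raw = raw_gb * GLACIER_PRICE
        return s3_deduped, glacier_raw

    # 1000 GB of raw backups that de-dupe 5:1: $25/month deduped in S3 vs
    # $10/month raw in Glacier, because the price ratio (12.5) exceeds the
    # de-dupe ratio (5).
    print(monthly_cost(1000, 5.0))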

~~~
Locke1689
I think the most important part is separating the de-dup metadata from the
data. I assume the de-dup is done via either digest or HMAC -- if you keep the
digests (and the blocks of digests, et al.) and pointers to the data location
in Glacier in S3, couldn't you do the de-dup using S3 but actual data storage
using Glacier?
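
Something like this is what I have in mind -- a minimal sketch assuming boto3,
with made-up bucket/vault names, and no claim that this is how tarsnap is
actually built:

    # De-dup index (digests plus Glacier archive IDs) lives in S3; the block
    # data itself lives in Glacier.
    import hashlib, hmac
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    glacier = boto3.client("glacier")
    INDEX_BUCKET = "backup-block-index"   # hypothetical S3 bucket
    VAULT = "backup-blocks"               # hypothetical Glacier vault

    def store_block(key, block):
        digest = hmac.new(key, block, hashlib.sha256).hexdigest()
        try:
            # De-dup check hits only the S3 index -- no Glacier access needed.
            s3.head_object(Bucket=INDEX_BUCKET, Key=digest)
            return digest                  # block already archived
        except ClientError:
            pass
        # New block: archive the data in Glacier, record its location in S3.
        resp = glacier.upload_archive(vaultName=VAULT, accountId="-", body=block)
        s3.put_object(Bucket=INDEX_BUCKET, Key=digest,
                      Body=resp["archiveId"].encode())
        return digest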

~~~
ghshephard
This is exactly what I've been wondering - obviously cperciva is 10x (100x?)
smarter in this area than I ever will be, but I kept thinking, "Just keep the
digest in online storage, and the actual data in (offline?) Glacier. You don't
need to access Glacier during backups to calculate digests, because you
already have the digests online."

I read through his posting carefully, to see if he would capture that
possibility - but didn't really see it there.

~~~
ChuckMcM
Think that through a bit :-). So what counts as a 'retrieval'? Well, in the
tarsnap case it appears to be a block, which as far as Glacier is concerned is
probably a file.

[ NB: huge huge guesses here about how tarsnap works ]

Now consider de-dupe at the block level (vs de-dupe at the file level) with
an object store, combined with a backup. Let's say you have a file that is 5
'blocks' long. You fetch the original 5 blocks (5 fetches), then each block
where deltas reside, to recreate the file you want at the point in time you
want it.

Compare that to the 'stupid' way of doing it, which is to store a full image
of the file system for each time period, where you need only one fetch to get
the file back from the copy of the file system you were interested in.

But this is where the assumptions that make that approach stupid need to be
evaluated. It is a poor choice when the expensive thing is storage, so you
trade compute cycles for storage. That is normally a winning idea because you
only 'spend' the compute cycles when you reconstruct the file you want, but
you pay for storage month after month.

Except that the pricing model of Glacier makes bulk storage cheaper and
algorithmic reconstruction expensive.

Presumably the folks at Amazon are de-duping; after all, they get to charge
per GB, and if they can sell that exact same GB to two people, well, that is a
win!

So back to the question at hand.

Let's say your middleware layer is just like it is today: it figures out just
the deltas in your file system since the last backup and then pushes those.
You've got a file system image plus a delta image, and by applying the delta
to the full image you can get back to the current image. However, since
storage is now less expensive than compute, _at the Amazon instance_ you take
the delta, apply it to the latest full backup, and create a new full backup
which you then store in Glacier.

So can this possibly make sense? That is the question: if you keep full
copies of the latest version of every file, plus pointers to the reconstructed
previous versions in your S3 metadata, can you get a lower net cost for the
service implementation?
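
As a concrete (and entirely hypothetical) sketch of that flow -- the names are
made up and apply_delta() is a placeholder for an rdiff-style patch step:

    # Keep the most recent full backup where the worker can read it cheaply
    # (here, S3), apply each incoming delta server-side, archive the new full
    # image in Glacier, and keep a small pointer to it in S3 metadata.
    import json
    import boto3

    s3 = boto3.client("s3")
    glacier = boto3.client("glacier")
    BUCKET = "backup-staging"        # hypothetical S3 bucket
    VAULT = "backup-full-images"     # hypothetical Glacier vault

    def ingest_delta(delta_bytes, timestamp, apply_delta):
        latest = s3.get_object(Bucket=BUCKET, Key="latest-full")["Body"].read()
        new_full = apply_delta(latest, delta_bytes)

        # Bulk data goes to Glacier: cheap to keep, slow and costly to retrieve.
        resp = glacier.upload_archive(vaultName=VAULT, accountId="-",
                                      body=new_full)

        # Metadata (one small pointer per snapshot) stays in low-latency S3.
        s3.put_object(Bucket=BUCKET, Key="latest-full", Body=new_full)
        s3.put_object(Bucket=BUCKET, Key="snapshots/%s.json" % timestamp,
                      Body=json.dumps({"archive_id": resp["archiveId"]}).encode())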

------
danielweber
I like articles like this.

1. Explains what the new whiz-bang technology is.

2. Explains what Tarsnap is.

3. Provides a technical explanation that lists both the upsides and the
downsides of the new technology.

~~~
cperciva
Thanks! I like blog posts like this too -- they appeal to my didactic
instincts.

------
cperciva
Another reason, which I decided deserved to be written up as a separate blog
post, is the surprising nature of the "data retrieval" component of Glacier's
pricing model: <http://news.ycombinator.com/item?id=4475319>

------
cynicalkane
I'd bet that this use case is pretty common: I have only a small amount of
data (a few hundred MB in my case) that I deem day-to-day critical, but
hundreds of GB that I don't want to lose. My backup solution is effective but
primitive: parchives on two offline hard drives, one of which is at my
parents' house seven hours away. Backing up takes little effort but it's easy
to forget, especially at the "remote site".

It's really annoying that there's lots of easy backup solutions for online
data, but nothing for cheap backup of large amounts of data that can afford to
be offline for a while. Glacier is the perfect solution but I dread having to
figure out whatever I have to do--divide things into a 180-part archive and
download it over a month?--to get data back. The first glaciated backup
solution on the market that I think I can trust will get my dollars almost
immediately.

~~~
j_s
Some people might not be aware that CrashPlan supports backing up between your
own PCs (e.g. to your parents') for free:

<http://www.crashplan.com/consumer/crashplan.html>

------
tedunangst
I was thinking about one-time, single-shot backups, but encrypted with
tarsnap. No dedup. After all, storage is cheap and this is supposed to be
zuper worst-case-scenario stuff, so maybe I don't want to go to the trouble of
reconstructing the backup. I want to know it's there. But my interest is only
theoretical.

~~~
aristidb
If you have a command line tool to access Glacier, I think what you can do is
make an archive (with tar), encrypt it with gpg, and just use the command line
tool to upload the encrypted archive. So you don't need tarsnap for that, but
if you want fancy deduplicating incremental encrypted backup, that's where
tarsnap shines.
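
Something along these lines would do it (the vault name and gpg recipient are
placeholders, and archives over Glacier's single-request size limit would need
the multipart upload API instead):

    # tar -> gpg -> Glacier, driven from Python via subprocess and boto3.
    import subprocess
    import boto3

    subprocess.run(["tar", "-czf", "media.tar.gz", "/home/me/media"], check=True)
    subprocess.run(["gpg", "--encrypt", "--recipient", "backup@example.com",
                    "--output", "media.tar.gz.gpg", "media.tar.gz"], check=True)

    glacier = boto3.client("glacier")
    with open("media.tar.gz.gpg", "rb") as f:
        resp = glacier.upload_archive(vaultName="one-shot-backups",
                                      accountId="-", body=f)

    # Keep the archive ID somewhere safe -- you need it to request a retrieval.
    print(resp["archiveId"])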

~~~
tedunangst
Yes, but I'm already using tarsnap. Running it with "--backup-to-glacier" is
easier than rolling my own. I don't want two passwords, two super duper secret
keys, two online services I have to pay, ....

------
nchuhoai
Does anyone know of products that already incorporate Glacier? Basically, I
am looking for a solution to back up my photo library. It's one big file with
a "Write Once, Read Hopefully Never" access policy. I guess I could go and
push it to Glacier by myself, but I'd rather have a library/product do it for
me.

~~~
ghshephard
Glacier was rolled out on August 21st - I don't think anyone has rolled out
code to support it in a consumer friendly fashion - but drop a line to
<http://www.haystacksoftware.com/arq/> and encourage them to give you the
ability to target Glacier for your Archives, and I bet you'll see something
fairly soon.

------
michaelt
I'm confused about the reason a block can't exist in both S3 and Glacier at
the same time, if the deduplication code decides that block is needed in a new
archive.

Why couldn't you simply have a rule that each file is either S3 or Glacier,
and S3 lists-of-blocks can only reference other S3 blocks, while Glacier lists
of blocks can only reference other Glacier blocks?

In the worst case, where every block was in both archives, this would only
increase costs by 10% if Glacier costs a tenth what S3 costs.
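
A tiny sketch of the invariant I have in mind (the store tags and the upload
callback are placeholders):

    # Each block may live in S3, in Glacier, or (worst case) in both; an
    # archive's block list only references blocks held in its own store.
    blocks = {}   # digest -> set of stores holding a copy of that block

    def ensure_block(digest, store, upload):
        stores = blocks.setdefault(digest, set())
        if store not in stores:
            upload(digest, store)   # may duplicate a block in the other store
            stores.add(store)

    def write_archive(block_digests, store, upload):
        for digest in block_digests:
            ensure_block(digest, store, upload)
        # Valid by construction: every referenced block exists in this store.
        return {"store": store, "blocks": list(block_digests)}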

~~~
cperciva
What happens when you decide that you don't want that new archive to be in S3
any more and tell Tarsnap to migrate it over to Glacier?

~~~
michaelt
To S3 it's like you just deleted the file; to Glacier it's like you just
created the file.

~~~
cperciva
Except that then we're trying to store two different blocks in Glacier with
the same hash ID.

~~~
alexfoo
From that I assume that if a block's hash matches something that's already in
the archives then you retrieve the archive block(s) with the same hash ID in
order to verify that it is exactly the same (byte-for-byte)? And this wouldn't
be possible with Glacier as you can't just retrieve the block from storage to
check there and then.

Do you have any stats on the number of collisions you've seen?

------
veyron
I've got Terabytes of data that need to be archived and possibly recalled
later (trading logs and market data preserved in case of an audit). I'm not at
the scale of Glacier. If TarSnap had a readonly vault (files in said vault
would never change) then it would be able to distinguish the files and split
between the Glacier and S3 offerings.

~~~
ghshephard
Drop a note to <http://www.haystacksoftware.com/arq/> - I'll wager they will
have an option to create "Glaciered Archives" of folders within a few months.
The developer there has already expressed some interest, and suggested that
Arq would lend itself to pointing at a folder and storing it off to a Glacier
archive (with all the caveats around the 4-hour delay, 24 hours of
availability, costs associated with restores, etc.).

~~~
sreitshamer
[I'm the developer] I'm hoping storing the metadata in S3 and the actual file
data in Glacier would work well. I'm still looking into it.

[edited for clarity]

~~~
ghshephard
That's awesome news. Just being able to take a folder on my drive and say,
"Archive these 5 gigabytes of photos for the next 50 years" - knowing it will
cost me around $30 (and, as time goes on, likely less as storage prices drop)
and that I won't have to worry (much) about it - will be a big win. Even if
you don't come up with an easy way of integrating with the normal S3 backups,
I bet a lot of people will love that feature.

------
unreal37
I think the best use-case for Glacier is large files that you would absolutely
hate to lose. The thing that people cry over when their laptops get stolen.

For home use, all the family photos and videos. An archive of your emails
(outlook.pst) because a lot of important data is stored in there. All your
taxes and accounting data from years past. Bulky stuff that takes up space on
your home system, but isn't used daily.

In business, many companies use "Iron Mountain" to archive their paperwork.
Old invoices, reports, things that were important in the past and may someday
be needed. That's what Glacier is for.

Glacier is archiving, not backup. You might want to take advantage of cheap
storage by keeping your backups there, but that's a different issue.

~~~
fusiongyro
I don't get the impression Glacier is right for any home user. The pricing
scheme is just too weird, and there are weird delays and time limits involved.
I believe you have 24 hours to retrieve your data before incurring more fees
when you do a retrieval from Glacier--but if I store a backup of 150 GB of
photos and music and fetch the whole thing, I'm not going to be able to get
that from Amazon to my home computer in one day. Avoiding that problem will
necessitate other expenditures.

In order to avoid the retrieval costs you'll have to limit retrievals to a
small fraction of what you have stored. In my example, if I lose all my photos
and music I'm going to want to restore the whole thing. Ignoring the transfer
problem above, you're looking at paying for transferring 95% of the archive,
142.5 GB. I find Glacier's pricing model so difficult to comprehend I couldn't
even guess at what that would cost, but Colin's math shows that what looks
like it should cost $0.02 (retrieving 2 GB once at $0.01 per GB) winds up
costing $3.60 (peak rate, percentage of archive fetched, etc.), so I wouldn't
hold out a lot of hope for our use case of fetching the entire 150 GB archive.

As you say, this is certainly something businesses that need to archive lots
of stuff may be able to use (if they can navigate the pricing structure); I
just don't see a home user getting anything out of it but frustration and a
confusing bill.

~~~
michaelt
Assume I store 150 gigabytes of family photos. I pay $1.50 a month ($18 a
year) for storage. Assume I've uploaded the files in 150 * 1 gigabyte
archives.

I decide to retrieve it all. The 5% (6 gigabyte) retrieval allowance is
negligible. The data transfer out fee at $0.120 per gigabyte will cost $18.

If I retrieve 150 * 1 gigabyte chunks every hour retrieval will take 1 hour;
the peak hourly retrieval will be 150 gigabytes; the data rate will be 341
Mbps; and the retrieval fee will be 150 * 720 * $0.01 = $1,080

If I retrieve 7 * 1 gigabyte chunks every hour retrieval will take 150/7~=22
hours; the data rate will be 15 Mbps; the peak hourly retrieval will be 7
gigabytes; and the retrieval fee will be 7 * 720 * $0.01 = $50.40

If I retrieve 1 * 1 gigabyte chunks every hour retrieval will take ~7 days;
the data rate will be 2.2 Mbps; the peak hourly retrieval will be 1 gigabyte;
and the retrieval fee will be 1 * 720 * $0.01 = $7.20
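
The formula behind those numbers (Glacier bills your peak hourly retrieval
rate as though it were sustained for the whole 720-hour month, at $0.01 per
gigabyte; the free allowance is ignored here, as above):

    def retrieval_fee(total_gb, gb_per_hour, price_per_gb=0.01, month_hours=720):
        hours_to_restore = total_gb / gb_per_hour
        fee = gb_per_hour * month_hours * price_per_gb
        return hours_to_restore, fee

    for rate in (150, 7, 1):   # gigabytes retrieved per hour
        print(rate, retrieval_fee(150, rate))
    # -> 1 hour / $1,080, ~21 hours / $50.40, 150 hours / $7.20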

If I share an account with 20 other people with the same amount of data
stored, the 5% allowance would be enough for my entire download; I could
retrieve as quickly as I liked without incurring a retrieval fee. I would
still pay the $18 data transfer out fee.

TLDR: Retrieval isn't as cheap as storage, but if you lost all your family
photos, you'd probably be willing to pay it.

~~~
fusiongyro
Thanks for doing the math on this!

------
ocharles
Great write-up; I've been curious about it, but I'm not particularly bothered.
I think what I'm going to end up doing is basically tar ~/media, pipe it
through GPG, and pipe it through some upload-to-glacier script. Maybe automate
that once a month or something, and then use tarsnap for the much
smaller/rapidly changing stuff. As it stands now I have 100 GB in ~/media
that's costing me about $1/day, so it'd be nice to reduce that.

------
RexRollman
Cperciva,

Could a similar system to Tarsnap exist for Windows? I'm not asking you to
implement it, I'm just curious.

~~~
magic_haze
Tarsnap runs without any problems under Cygwin/Windows right now. It is very
easy to set up: the Tarsnap downloads page lists all the dependencies - you
just need to make sure to include them in the Cygwin setup and then run his
makefile. Frankly, his setup instructions are far easier to follow than
certain other GUI-based installers that look shiny but induce far more stress
trying to guess how to opt out of the complimentary crapware (e.g., Skype,
Adobe Reader, uTorrent...).

Also, as he notes on his install page, it _definitely_ is worth checking out
the source code: it's some of the cleanest C source I've ever seen, and very
educational.

~~~
RexRollman
Thank you for the info!!

