
Tarsnap outage post-mortem - cperciva
http://www.daemonology.net/blog/2012-07-04-tarsnap-outage.html
======
gleb
Acunote is a customer of tarsnap and so am I personally. The professionalism
and transparency that you can see in this RFO are a big reason why I use it
and recommend it. The quality of the product/solution is the bigger reason.

What's not mentioned is that throughout the outage the customers were getting
timely emails from Colin letting us know what's going on. We didn't have to go
to a blog, twitter, facebook or some other ungodly place to find out what's
happening. This is how it should be done.

~~~
16s
The source code is very nice too. For people who want to see an example of
simple, high quality C source code, I highly recommend reading Tarsnap. It's
small too, so it's very easy to get all the way through it.

~~~
cperciva
... and if you find anything wrong in the source code, you can win bug
bounties too!

------
16s
My power just came back on today. Amazing levels of damage from the storm here
in Virginia. Some homes are expected to be without power for another week or
more. To add to that, every day the temps are in the mid 90s with 50%
humidity. It's been miserable but a good learning experience. I need to build
an outdoor camp shower and a latrine. Believe it or not, having those would
have been a great luxury.

~~~
ghshephard
The following kits might be of interest to you. Scale to as many people/days
as you are interested in. (Be careful, though, campingsurvival.com can be
_very_ addictive)

<http://campingsurvival.com/porcamtoil.html>,
<http://campingsurvival.com/redodopl.html>,
<http://campingsurvival.com/hottaptrsh6.html>

~~~
16s
Thanks for those links. I have the luggable loo thingy that fits a 5-gallon
bucket and the doodie bags. I just need a place where they can be used outside
privately and during bad weather. I rigged another 5 gallon bucket with a
spigot and a water hose for a quick shower. That worked OK for the most part,
still needs more privacy though.

~~~
sage_joch
I feel like I'm missing something. Why is a portable shower preferable to a
cold shower in the bathroom?

 _edit_ (since I can't reply below) I didn't realize portable showers were
commonly heated, and didn't think about the possibility of gas-powered
heating. That makes sense.

~~~
ghshephard
Admittedly, in 100+ degree weather, a cold shower is pretty darn nice. But in
the morning, when you just want to get clean (at least for me - maybe this is
a personal preference) nothing beats a hot shower + soap + shampoo.

When camping, I usually bring a Solar Shower
(<http://campingsurvival.com/casoshbag.html>) - left out in direct sun for a
couple hours, that thing gets _hot_.

The advantage of the gas powered shower, of course, is that you can use a lot
more water than you can fit in a solar shower bag. The _downside_ of those gas
showers (I have a coleman one) is the zillion warnings telling you not to use
them indoors. Maybe if you had really good ventilation (or could run the water
hose from the outside?) then you could use your internal shower stall + Hot
water from outside?

This is where all the people who have Solar Powered Hot water heaters must be
chuckling - though, in a broad power outage, I don't know if they have
pressurized water in all locations right away. And definitely not always water
that you can trust without boiling.

That's the other thing that almost _no_ people are prepared for - Day 3+
without water for hygiene (A lot of people have their emergency drinking
supplies, but never even think about what they would do if they can't
flush/shower - things get grim pretty quickly)

------
knowtheory
I love reading about Tarsnap. This is the sort of post that exemplifies
someone who has a deep knowledge of his craft, a responsibility to his
customers, and probably most noteworthy, a measured perspective and
appreciation for life outside of just the business and web service he runs.

------
raghus
_For many of us, a datacenter losing power is the only effect we will see from
this storm. For most of the people who were directly affected by the storm,
it's the least of their worries._

Very well put. Good to put things in perspective.

------
underwater
One thing that sticks out for me about the post-mortem is that Tarsnap is
heavily reliant on Colin being able to diagnose and fix problems with the
product. He is a single point of failure for the entire system. If anything
were to happen to him it seems like any data stored in Tarsnap could easily be
lost forever.

~~~
nitrogen
That's probably true of a large number of services people rely upon, but those
services hide the fact by using the royal "we" and oblique references to the
"team" in their announcements and press releases.

~~~
Splines
Maybe, but a well run team should have a "this is what you need to do if I get
hit by a bus" document that is actually the "this is what you need to do when
I get a better job" document.

------
davidbanham
Tarsnap is a _great_ product, but I worry that I'm not paying you enough money
for it. It's super cheap for what we get from it. The very last thing I want
is for you not to be making enough from the business to warrant putting in the
kind of effort and energy that you do, and to then have the service go away.
If the current pricing structure achieves that, then brilliant, but if it
doesn't, please raise your prices and keep kicking goals.

~~~
cperciva
Don't worry, Tarsnap isn't going anywhere.

------
jbellis
What design improvements are you considering to make this go faster, should it
be necessary again?

~~~
cperciva
The first step is simply to merge my temporary-hack fixes (e.g., removing the
I/O rate limiting during recovery operations) into the main Tarsnap server
codebase. I lost about 3 hours to those.

The second step is to change the order of S3 GETs and adjust the
parallelization of those; I think I can easily cut that from the ~20 hours it
took down to ~5 hours.

The third step is to parallelize the third (replaying of log entries on a
machine-specific basis) stage; that should cut from ~10 hours down to ~1 hour
given a sufficiently hefty EC2 instance.

After that it's a question of profiling and experimenting. I'm sure the second
stage can be sped up considerably, but I don't know exactly where the
bottleneck is right now. I know the first stage can be sped up by pre-
emptively "rebundling" the metadata from each S3 object so that I need fewer
S3 GETs, but I'm not sure if that's necessarily the best option.

In the longer term, I'm reworking the entire back-end metadata store, so some
of the above won't be relevant any more.
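
To make the second and third steps concrete, here is a minimal Python sketch of the general idea: issue the object fetches concurrently, then group log entries by machine and replay each machine's log independently. This is not Tarsnap's server code. `FAKE_STORE`, `fetch_object`, the `(machine, sequence, operation)` entry format, and the add/del operations are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for S3: object key -> list of
# (machine_id, sequence_number, operation) log entries.
FAKE_STORE = {
    "obj-0": [("m1", 1, "add a"), ("m2", 1, "add x")],
    "obj-1": [("m1", 2, "del a"), ("m2", 2, "add y")],
    "obj-2": [("m1", 3, "add b")],
}

def fetch_object(key):
    """Pretend to GET one object; a real version would call S3 here."""
    return FAKE_STORE[key]

def fetch_all(keys, workers=8):
    """Issue the GETs concurrently; results come back in key order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_object, keys))

def group_by_machine(objects):
    """Split the log entries into one sequence-ordered list per machine."""
    per_machine = defaultdict(list)
    for entries in objects:
        for machine, seq, op in entries:
            per_machine[machine].append((seq, op))
    for entries in per_machine.values():
        entries.sort()
    return per_machine

def replay(entries):
    """Replay one machine's log in order and return its final state."""
    state = set()
    for _seq, op in entries:
        verb, name = op.split()
        if verb == "add":
            state.add(name)
        else:
            state.discard(name)
    return state

def recover(keys, workers=8):
    """Fetch everything in parallel, then replay each machine in parallel."""
    per_machine = group_by_machine(fetch_all(keys, workers))
    machines = sorted(per_machine)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        states = pool.map(replay, (per_machine[m] for m in machines))
        return dict(zip(machines, states))
```

Per-machine replay parallelizes cleanly only because, in this sketch, one machine's log never touches another machine's state. Threads suit the I/O-bound fetch stage; in CPython a CPU-bound replay stage would more likely need processes than threads.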

~~~
wheels
Will those changes also finally speed up client restores? The extremely slow
restores are one of those things that remain terrifying as a customer.

We've had to consider moving to a different backup system for disaster
recovery. Tarsnap covers the cases of fat-fingering and versioned backups
nicely, but not disaster recovery of large-ish data sets.

~~~
kondro
The simplest mitigation to this is to store your primary systems in a
different location to your backup. This way the likelihood of both your
primaries and your backups becoming unavailable at the same time is
significantly reduced.

~~~
wheels
I don't see how that mitigates the issue at all. The case I'm talking about is
when your primaries totally bite the dust. It's obvious that a backup system
shouldn't be in the same physical location as the primary system.

The issue I'm referring to isn't about lack of availability of tarsnap's
backups, it's that it's very slow to restore backups if you need to do a
complete restore (rather than just grab a few accidentally trashed files).

------
moe
Thank you.

That's a nice post-mortem and reinforces my trust in tarsnap.

~~~
HerraBRE
This post makes me want to do business with them. :-)

------
robryan
Here file system corruption is mentioned and I read the amazon posts on the
issue mentioning that they let customers check EBS disks for potential
corruption in a read only state (I think it was).

Is there some kind of guide to this process? I feel like I am underprepared if
my EC2 instances get hit by a similar failure.

------
rbancroft
Great description and I have to say I really liked the broader context of the
outage. The severity of the storm was something I had been curious about but
hadn't looked up myself yet, and so I appreciated the extra education!

------
jacques_chester
I have a cron job that emails me the results of nightly tarsnap runs. I
noticed two failures, the first and only time tarsnap has ever failed for me.
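
For anyone wanting a similar setup, a minimal sketch of such a wrapper might look like the following. It assumes cron (or whatever calls it) takes care of actually mailing the output. `run_and_report` and the subject wording are made up for illustration, and the backup command itself is left to the caller.

```python
import subprocess

def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (subject, body) for a status email."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "FAILED"
    subject = f"nightly backup {status} (exit {result.returncode})"
    # Include both streams so a failure message is not lost.
    body = result.stdout + result.stderr
    return subject, body
```

Putting the exit status in the subject line makes a failed run stand out without opening the mail.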

And I was OK with it -- I knew Colin would be working to bring it up as soon
as possible. I further knew that the envelope for minimal downtime was
determined by Amazon, who were also working to repair matters as soon as
possible.

Natural disasters happen. Somewhere in the world there is always an
unprecedented disaster going on. Living with it is part of life in a strictly
uncontrollable, unpredictable universe.

------
xentronium
Somewhat offtopic:

> off-site backups (which Amazon S3 counts as, since it's replicated to
> multiple datacenters)

I'm not sure if I read that right, but isn't that like saying RAID counts as
backup since it's replicated multiple times? What happens if data (on S3) is
corrupted as a result of a logical error?

~~~
jimktrains2
The idea is not consistency but availability when it comes to an off-site
backup. (Yes, it _should_ be consistent in-and-of being a back-up, but the
point of off-site is so that you have access to it in the event your main
server area is gone.)

------
baconhigh
I really admire how well written this outage report is, and how transparent
the whole process is. +1

------
da_n
"...after which Amazon wrote in a post-mortem that "We have also completed an
audit of all our back-up power distribution circuits""

Not saying this was unpreventable, but there is a tone of disingenuous
doublethink in the Amazon status updates recently. Fail-proof is failing.

~~~
jacques_chester
My old man has been working in radio and electronics for decades. We discussed
the Amazon outage and when I told him two generators had failed, he smiled
grimly and muttered "Murphy's Law".

No matter how much you test, you simply cannot _know_ how a system will behave
in a critical state until that state is reached.

The other thing too here is availability bias. We see the outages, but we
don't hear about the near-outages. We're not seeing a true baseline for the
occasions where the system behaved resiliently according to its design.

~~~
Robin_Message
"How complex systems fail" by Dr. Richard Cook makes exactly this point – all
complex systems are, by definition, running in a degraded mode, with
catastrophe just around the corner. They are kept up through a series of
gambles – and you never hear about the good ones.

Interestingly, it is written by a doctor about medical environments, but works
for any complex system.

The essay is reproduced in _Web Operations_ by John Allspaw and Jesse Robbins
with a web ops spin on it, and is also available online at
<http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf>

~~~
scott_s
Point 7 is a new idea to me: "Post-accident attribution accident to a ‘root
cause’ is fundamentally wrong."

~~~
jacques_chester
I disagree with that nostrum -- I wrote about it in nauseating detail here:
<http://chester.id.au/2012/04/09/review-drift-into-failure/>

------
illumen
Why put all your eggs in one basket? Seems mighty strange to have everything
hosted with one company, without a backup. Especially for a backup company.

~~~
sgt
That's one way to view it, but realistically the typical Tarsnap client has
_not_ put all his/her eggs in one basket.

------
serverascode
Great work going on at tarsnap. I've known about it for a while; should start
using it.

