Tarsnap outage post-mortem (daemonology.net)
190 points by cperciva on July 4, 2012 | 52 comments

Acunote is a customer of tarsnap and so am I personally. The professionalism and transparency that you can see in this RFO are a big reason why I use it and recommend it. The quality of the product/solution is the bigger reason.

What's not mentioned is that throughout the outage the customers were getting timely emails from Colin letting us know what's going on. We didn't have to go to a blog, twitter, facebook or some other ungodly place to find out what's happening. This is how it should be done.

The source code is very nice too. For people who want to see an example of simple, high quality C source code, I highly recommend reading Tarsnap. It's small too, so it's very easy to get all the way through it.

... and if you find anything wrong in the source code, you can win bug bounties too!

I'd love to use it...but I'm Canadian, so it's a hackish mess of duplicity scripts for me.

Stay tuned.

Good Dr.,

Be certain (if you've got time) to write a worthy blog post on the results of that Canadian tax situation. I'm curious as I (was, am, could have been) in a similar situation.

I didn't get such an email, most likely because I wasn't subscribed to the mailing list; if there was a prompt during sign-up, I must have missed it.

yes, the tarsnap-announce mailing list https://twitter.com/cperciva/status/220665293515653121

My power just came back on today. Amazing levels of damage from the storm here in Virginia. Some homes are expected to be without power for another week or more. To add to that, every day the temps are in the mid 90s with 50% humidity. It's been miserable but a good learning experience. I need to build an outdoor camp shower and a latrine. Believe it or not, having those would have been a great luxury.

The following kits might be of interest to you. Scale to as many people/days as you are interested in. (Be careful, though, campingsurvival.com can be _very_ addictive)

http://campingsurvival.com/porcamtoil.html, http://campingsurvival.com/redodopl.html, http://campingsurvival.com/hottaptrsh6.html

Thanks for those links. I have the luggable loo thingy that fits a 5-gallon bucket and the doodie bags. I just need a place where they can be used outside privately and during bad weather. I rigged another 5 gallon bucket with a spigot and a water hose for a quick shower. That worked OK for the most part, still needs more privacy though.

I feel like I'm missing something. Why is a portable shower preferable to a cold shower in the bathroom?

edit (since I can't reply below) I didn't realize portable showers were commonly heated, and didn't think about the possibility of gas-powered heating. That makes sense.

Admittedly, in 100+ degree weather, a cold shower is pretty darn nice. But in the morning, when you just want to get clean (at least for me - maybe this is a personal preference) nothing beats a hot shower + soap + shampoo.

When camping, I usually bring a Solar Shower (http://campingsurvival.com/casoshbag.html) - left out in direct sun for a couple hours, that thing gets _hot_.

The advantage of the gas powered shower, of course, is that you can use a lot more water than you can fit in a solar shower bag. The _downside_ of those gas showers (I have a Coleman one) is the zillion warnings telling you not to use them indoors. Maybe if you had really good ventilation (or could run the water hose from the outside?) then you could use your internal shower stall + hot water from outside?

This is where all the people who have solar-powered hot water heaters must be chuckling - though, in a broad power outage, I don't know if they have pressurized water in all locations right away. And definitely not always water that you can trust without boiling.

That's the other thing that almost _no_ people are prepared for - Day 3+ without water for hygiene (A lot of people have their emergency drinking supplies, but never even think about what they would do if they can't flush/shower - things get grim pretty quickly)

My neighborhood is still out (upper NW DC), which has been fun.

This is definitely the longest outage I've seen since living in the area (I actually don't think I've seen a power outage last more than a night in 12 years)

I lived in Charlotte, NC when Hurricane Hugo came through. IIRC we didn't have power for over a week, perhaps closer to 10 or 14 days. Our front yard and general neighbourhood looked like a battleground. Trees and limbs were everywhere! It took several days for enough trees to be cleared to be able to get a car out to the main road.

The point of all this is that I feel your pain. At least it was cold during that time, so the fireplace got a lot of use and kept us warm. Hope your power is restored soon!

I love reading about Tarsnap. This is the sort of post that exemplifies someone who has a deep knowledge of his craft, a responsibility to his customers, and probably most noteworthy, a measured perspective and appreciation for life outside of just the business and web service he runs.

For many of us, a datacenter losing power is the only effect we will see from this storm. For most of the people who were directly affected by the storm, it's the least of their worries.

Very well put. Good to put things in perspective.

One thing that sticks out for me about the post-mortem is that Tarsnap is heavily reliant on Colin being able to diagnose and fix problems with the product. He is a single point of failure for the entire system. If anything were to happen to him it seems like any data stored in Tarsnap could easily be lost forever.

That's probably true of a large number of services people rely upon, but those services hide the fact by using the royal "we" and oblique references to the "team" in their announcements and press releases.

Maybe, but a well run team should have a "this is what you need to do if I get hit by a bus" document that is actually the "this is what you need to do when I get a better job" document.

True, but if that bus does one day get him then we'd probably hear about it pretty quick on the grapevine (he didn't squeal, just let out a little whine) and start making alternative arrangements en masse.

Colin needs an apprentice :)

Tarsnap is a _great_ product, but I worry that I'm not paying you enough money for it. It's super cheap for what we get from it. The very last thing I want is for you not to be making enough from the business to warrant putting in the kind of effort and energy that you do, and to then have the service go away. If the current pricing structure achieves that, then brilliant, but if it doesn't, please raise your prices and keep kicking goals.

Don't worry, Tarsnap isn't going anywhere.

What design improvements are you considering to make this go faster, should it be necessary again?

The first step is simply to merge my temporary-hack fixes (e.g., removing the I/O rate limiting during recovery operations) into the main Tarsnap server codebase. I lost about 3 hours to those.

The second step is to change the order of S3 GETs and adjust the parallelization of those; I think I can easily cut that from the ~20 hours it took down to ~5 hours.

The third step is to parallelize the third stage (replaying log entries on a per-machine basis); that should cut it from ~10 hours down to ~1 hour given a sufficiently hefty EC2 instance.

After that it's a question of profiling and experimenting. I'm sure the second stage can be sped up considerably, but I don't know exactly where the bottleneck is right now. I know the first stage can be sped up by pre-emptively "rebundling" the metadata from each S3 object so that I need fewer S3 GETs, but I'm not sure if that's necessarily the best option.

In the longer term, I'm reworking the entire back-end metadata store, so some of the above won't be relevant any more.
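The second step above, issuing the S3 GETs in parallel, can be sketched roughly like this. This is only an illustration of the technique, not Tarsnap's actual code: `fetch_all` and the injected `get_object` callable are hypothetical names, and the worker count is an arbitrary placeholder.

```python
# Sketch of parallelizing S3 GETs with a thread pool. The get_object
# callable is injected so any S3 client can be plugged in; all names
# here are illustrative, not Tarsnap's real implementation.
from concurrent.futures import ThreadPoolExecutor

def fetch_all(keys, get_object, workers=16):
    """Fetch all keys concurrently; returns a {key: data} mapping."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs keys correctly
        results = pool.map(get_object, keys)
        return dict(zip(keys, results))
```

Since S3 GETs are latency-bound rather than CPU-bound, a thread pool like this is usually enough to get a large speedup without any extra machinery.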

Will those changes also finally speed up client restores? The extremely slow restores are one of those things that remain terrifying as a customer.

We've had to consider moving to a different backup system for disaster recovery. Tarsnap covers the cases of fat-fingering and versioned backups nicely, but not disaster recovery of large-ish data sets.

That's a different issue, but related to the long-term back-end reworking. (It's not one back-end, it's several pieces of back-end, some of which are necessary for speeding up extracts and some of which aren't.)

The simplest mitigation for this is to store your primary systems in a different location from your backups. That way the likelihood of both your primaries and your backups becoming unavailable at the same time is significantly reduced.

I don't see how that mitigates the issue at all. The case I'm talking about is when your primaries totally bite the dust. It's obvious that a backup system shouldn't be in the same physical location as the primary system.

The issue I'm referring to isn't about lack of availability of tarsnap's backups, it's that it's very slow to restore backups if you need to do a complete restore (rather than just grab a few accidentally trashed files).

Would it be feasible to store snapshots of the metadata periodically as well, so you only have to replay mutations performed after the last snapshot in an emergency?

Yes, that's also something I'm looking into. But I want to have a decent "worst case scenario" recovery mechanism too.
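The snapshot-plus-replay idea being discussed can be sketched in a few lines. This is a minimal illustration under assumed structures (a dict of metadata and a list of sequenced mutations), not Tarsnap's actual on-disk format:

```python
# Minimal sketch of periodic-snapshot recovery: restore the most
# recent snapshot, then replay only the log entries recorded after
# the snapshot's sequence number. Data shapes here are assumptions.

def recover(snapshot_state, snapshot_seq, log):
    """Rebuild state from a snapshot plus newer log entries.

    `log` is a list of (seq, key, value) mutations in order.
    """
    state = dict(snapshot_state)
    for seq, key, value in log:
        # entries at or before snapshot_seq are already reflected
        # in the snapshot, so skip them
        if seq > snapshot_seq:
            state[key] = value
    return state
```

The recovery time then scales with the log written since the last snapshot rather than with the full history, which is exactly why it helps in an emergency.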

Seems to me that keeping an EC2 instance spun up and idle might become a sensible approach for people who run downtime-sensitive apps with a single AWS region architecture. A lot of recent outages have been exacerbated for many people by the flood of control plane traffic and contention for new instances - at some stage paying for enough idle instances (which might be just one) spread over availability zones to allow you to at least have some chance of surviving an entire datacenter going dark has to be at least worth running the numbers on.

I also predict the control plane contention problem is going to get worse - I'm sure I'm not the only one working on a system to spin up and grab a few extra instances as soon as my monitoring detects the beginnings of a problem. The "right thing" to do seems to be to start spinning up and configuring replacement instances as soon as you suspect a problem - even if it turns out not to be a problem most of the time, when it _is_ a problem at least you're well ahead in the queue of the people who chose to wait 5 or 10 minutes on the assumption that perhaps it's only a temporary network glitch.

(And there's a weird negative incentive for Amazon here. If they fail to address the poor heavy load performance of the control plane/management console/provisioning api, they may end up with more money in their pockets from people keeping spare idle instances or provisioning additional instances speculatively at the first sign of trouble.)
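The "grab instances at the first sign of trouble" approach might look something like the sketch below. The decision logic is kept as a pure function; the provisioning call uses boto3's `run_instances`, and the AMI id, instance type, and target spare count are all placeholder assumptions:

```python
# Sketch of provisioning spares as soon as monitoring fires an alarm.
# instances_to_request is pure policy; provision wraps the EC2 call.

def instances_to_request(alarms, spares_running, spares_wanted=2):
    """Request spares as soon as any alarm fires, up to the target."""
    if not alarms:
        return 0
    return max(0, spares_wanted - spares_running)

def provision(ec2_client, n, ami="ami-12345678", itype="m1.large"):
    """Request n instances; ec2_client is e.g. boto3.client('ec2').
    The AMI id and instance type are illustrative placeholders."""
    if n > 0:
        ec2_client.run_instances(ImageId=ami, InstanceType=itype,
                                 MinCount=n, MaxCount=n)
```

Over-requesting is cheap relative to being stuck in the provisioning queue during an outage, which is the trade-off the comment is pointing at.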

"I'm sure I'm not the only one working on a system to spin up and grab a few extra instances as soon as my monitoring detects the beginnings of a problem"

Evil idea of the day. Instead of actually spending money and buying possibly unneeded services from AWS, a somewhat less ethical person might choose to ddos the provisioning API at the first sign of trouble - then if it turns out that you _do_ need extra provisioning, you'll be first to know when the API becomes available (since you call the ddos off yourself when it suits you)…

It's a classic tragedy of the commons.

The normal response in this situation is to do away with the commons; a floating price for control plane requests, for example.

I don't think that would be very popular, though.

Like it or not, people designing for AWS must now assume that the control plane will simply be unavailable during partial outages.

Thank you.

That's a nice post-mortem and reinforces my trust in tarsnap.

This post makes me want to do business with them. :-)

File system corruption is mentioned here, and I read Amazon's posts on the issue mentioning that they let customers check EBS disks for potential corruption in a read-only state (I think it was).

Is there some kind of guide to this process? I feel like I am underprepared if my EC2 instances get hit by a similar failure.

Great description and I have to say I really liked the broader context of the outage. The severity of the storm was something I had been curious about but hadn't looked up myself yet, and so I appreciated the extra education!

I have a cron job that emails me the results of nightly tarsnap runs. I noticed two failures, the first and only time tarsnap has ever failed for me.

And I was OK with it -- I knew Colin would be working to bring it up as soon as possible. I further knew that the envelope for minimal downtime was determined by Amazon, who were also working to repair matters as soon as possible.

Natural disasters happen. Somewhere in the world there is always an unprecedented disaster going on. Living with it is part of life in a strictly uncontrollable, unpredictable universe.
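A nightly check like the one described above (cron runs tarsnap and mails you the result) can be sketched as follows. The archive naming and mailing are left out or stubbed; the formatting is kept as a pure function, and all names are illustrative:

```python
# Sketch of a cron-driven tarsnap check: run the backup, then build
# a report suitable for emailing. The tarsnap flags shown (-c -f)
# are its standard create-archive invocation; everything else here
# is a placeholder.
import subprocess

def report(archive, returncode, output):
    """Format a one-glance status line plus the command output."""
    status = "OK" if returncode == 0 else "FAILED (exit %d)" % returncode
    return "tarsnap backup %s: %s\n\n%s" % (archive, status, output)

def run_backup(archive, paths):
    """Run tarsnap and return the report text to be emailed."""
    proc = subprocess.run(["tarsnap", "-c", "-f", archive] + paths,
                          capture_output=True, text=True)
    return report(archive, proc.returncode, proc.stdout + proc.stderr)
```

Piping the returned text through `mail` (or smtplib) from cron gives exactly the kind of nightly success/failure signal the commenter relied on.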

Somewhat offtopic:

> off-site backups (which Amazon S3 counts as, since it's replicated to multiple datacenters)

I'm not sure if I read that right, but isn't that like saying RAID counts as backup since it's replicated multiple times? What happens if data (on S3) is corrupted as a result of a logical error?

The idea is not consistency but availability when it comes to an off-site backup. (Yes, it _should_ be consistent, being a backup, but the point of off-site is that you still have access to it in the event your main server area is gone.)

I really admire how well written this outage report is, and how transparent the whole process is. +1

"...after which Amazon wrote in a post-mortem that 'We have also completed an audit of all our back-up power distribution circuits'"

Not saying this was unpreventable, but there is a tone of disingenuous doublethink in the Amazon statuses recently. Fail-proof is failing.

My old man has been working in radio and electronics for decades. We discussed the Amazon outage and when I told him two generators had failed, he smiled grimly and muttered "Murphy's Law".

No matter how much you test, you simply cannot know how a system will behave in a critical state until that state is reached.

The other thing too here is availability bias. We see the outages, but we don't hear about the near-outages. We're not seeing a true baseline for the occasions where the system behaved resiliently according to its design.

"How Complex Systems Fail" by Dr. Richard Cook makes exactly this point – all complex systems are, by definition, running in a degraded mode, with catastrophe just around the corner. They are kept up through a series of gambles – and you never hear about the good ones.

Interestingly, it is written by a doctor about medical environments, but it applies to any complex system.

The essay is reproduced in Web Operations by John Allspaw and Jesse Robbins with a web ops spin on it, and is also available online at http://www.ctlab.org/documents/How%20Complex%20Systems%20Fai...

Yes, I started reading into the literature on failures recently because of that exact essay. It's been a great supplement to my reading on systems thinking.

Point 7 is a new idea to me: "Post-accident attribution accident to a ‘root cause’ is fundamentally wrong."

I disagree with that nostrum -- I wrote about it in nauseating detail here: http://chester.id.au/2012/04/09/review-drift-into-failure/

Why put all your eggs in one basket? Seems mighty strange to have everything hosted with one company, without a backup. Especially for a backup company.

That's one way to view it, but realistically the typical Tarsnap client has not put all his/her eggs in one basket.

Great work going on at tarsnap. I've known about it for a while; should start using it.

