What's not mentioned is that throughout the outage, customers were getting timely emails from Colin letting us know what was going on. We didn't have to go to a blog, Twitter, Facebook, or some other ungodly place to find out what was happening. This is how it should be done.
Be certain (if you've got time) to write a worthy blog post on the results of that Canadian tax situation. I'm curious, as I (was, am, could have been) in a similar situation.
edit (since I can't reply below) I didn't realize portable showers were commonly heated, and didn't think about the possibility of gas-powered heating. That makes sense.
When camping, I usually bring a Solar Shower (http://campingsurvival.com/casoshbag.html) - left out in direct sun for a couple of hours, that thing gets _hot_.
The advantage of the gas-powered shower, of course, is that you can use a lot more water than you can fit in a solar shower bag. The _downside_ of those gas showers (I have a Coleman one) is the zillion warnings telling you not to use them indoors. Maybe if you had really good ventilation (or could run the water hose from the outside?) you could use your internal shower stall + hot water from outside?
This is where all the people who have solar-powered hot water heaters must be chuckling - though, in a broad power outage, I don't know if they'd have pressurized water in all locations right away. And definitely not always water that you can trust without boiling.
That's the other thing that almost _nobody_ is prepared for - day 3+ without water for hygiene. (A lot of people have their emergency drinking supplies, but never even think about what they would do if they can't flush/shower - things get grim pretty quickly.)
This is definitely the longest outage I've seen since living in the area (I actually don't think I've seen a power outage last more than a night in 12 years)
The point of all this is that I feel your pain. At least it was cold during that time, so the fireplace got a lot of use and kept us warm. Hope your power is restored soon!
Very well put. Good to put things in perspective.
The second step is to change the order of S3 GETs and adjust the parallelization of those; I think I can easily cut that from the ~20 hours it took down to ~5 hours.
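(Not Tarsnap's actual code, just a minimal sketch of the sort of parallel-GET approach described above, using boto3 plus a thread pool; the bucket name and worker count are made-up placeholders:)

    # Sketch: fetch many small S3 objects concurrently instead of serially.
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-metadata-bucket"   # hypothetical name

    def fetch(key):
        # One GET per object; returns (key, raw bytes).
        return key, s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

    def fetch_all(keys, workers=32):
        # A thread pool hides per-request latency; ordering the keys
        # sensibly (e.g. grouping by machine) helps the later stages.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(fetch, keys))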
The third step is to parallelize the third stage (replaying log entries on a machine-specific basis); that should cut it from ~10 hours down to ~1 hour given a sufficiently hefty EC2 instance.
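(Again only an illustration of the idea, not the real implementation: if each machine's log entries are independent of every other machine's, a per-machine worker pool does the job. The replay rule and log layout below are hypothetical.)

    # Sketch: replay per-machine log segments in parallel.
    from multiprocessing import Pool

    def replay_machine(item):
        # Entries for one machine must be replayed in order, but machines
        # are independent of each other, so they can run concurrently.
        machine_id, entries = item
        state = {}
        for entry in entries:
            state[entry["key"]] = entry["value"]   # placeholder replay rule
        return machine_id, state

    def replay_all(logs_by_machine, procs=16):
        # One worker per core on a hefty EC2 instance; wall time drops
        # roughly in proportion to the number of workers.
        with Pool(processes=procs) as pool:
            return dict(pool.map(replay_machine, logs_by_machine.items()))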
After that it's a question of profiling and experimenting. I'm sure the second stage can be sped up considerably, but I don't know exactly where the bottleneck is right now. I know the first stage can be sped up by pre-emptively "rebundling" the metadata from each S3 object so that I need fewer S3 GETs, but I'm not sure if that's necessarily the best option.
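(A rough sketch of what that "rebundling" could look like - many small metadata objects combined into one larger object so a restore needs a single GET instead of thousands. The JSON bundle format and names are purely illustrative, not Tarsnap's actual metadata layout.)

    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-metadata-bucket"   # hypothetical name

    def rebundle(keys, bundle_key):
        # Read each small metadata object once, write one combined object;
        # a later restore then does one GET instead of len(keys) GETs.
        parts = []
        for key in keys:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            parts.append({"key": key, "data": body.decode("utf-8")})
        s3.put_object(Bucket=BUCKET, Key=bundle_key,
                      Body=json.dumps(parts).encode("utf-8"))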
In the longer term, I'm reworking the entire back-end metadata store, so some of the above won't be relevant any more.
We've had to consider moving to a different backup system for disaster recovery. Tarsnap covers the cases of fat-fingering and versioned backups nicely, but not disaster recovery of large-ish data sets.
The issue I'm referring to isn't a lack of availability of Tarsnap's backups; it's that restoring is very slow if you need to do a complete restore (rather than just grab a few accidentally trashed files).
I also predict the control plane contention problem is going to get worse - I'm sure I'm not the only one working on a system that spins up and grabs a few extra instances as soon as my monitoring detects the beginnings of a problem. It seems like "the right thing" to do is to start spinning up and configuring replacement instances as soon as you suspect a problem - even if it turns out not to be a problem most of the time, when it _is_ a problem at least you're well ahead in the queue of the people who chose to wait 5 or 10 minutes on the assumption that perhaps it's only a temporary network glitch.
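(Something like the sketch below, say - not any particular monitoring system's API, just an illustration with a placeholder AMI and instance type:)

    import boto3

    ec2 = boto3.client("ec2")

    def provision_standby(problem_suspected, n=2):
        # Get into the control-plane queue as early as possible; the
        # instances can be terminated later if the alarm was a false one.
        if not problem_suspected:
            return []
        resp = ec2.run_instances(ImageId="ami-12345678",    # placeholder AMI
                                 InstanceType="m1.large",   # placeholder type
                                 MinCount=n, MaxCount=n)
        return [i["InstanceId"] for i in resp["Instances"]]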
(And there's a weird negative incentive for Amazon here. If they fail to address the poor heavy load performance of the control plane/management console/provisioning api, they may end up with more money in their pockets from people keeping spare idle instances or provisioning additional instances speculatively at the first sign of trouble.)
Evil idea of the day: instead of actually spending money and buying possibly unneeded services from AWS, a somewhat less ethical person might choose to DDoS the provisioning API at the first sign of trouble - then, if it turns out that you _do_ need extra provisioning, you'll be the first to know when the API becomes available (since you call the DDoS off yourself when it suits you)…
The normal response in this situation is to do away with the commons; a floating price for control plane requests, for example.
I don't think that would be very popular, though.
Like it or not, people designing for AWS must now assume that the control plane will simply be unavailable during partial outages.
That's a nice post-mortem, and it reinforces my trust in Tarsnap.
Is there some kind of guide to this process? I feel like I'd be underprepared if my EC2 instances got hit by a similar failure.
And I was OK with it -- I knew Colin would be working to bring it up as soon as possible. I further knew that the envelope for minimal downtime was determined by Amazon, who were also working to repair matters as soon as possible.
Natural disasters happen. Somewhere in the world there is always an unprecedented disaster going on. Living with it is part of life in a strictly uncontrollable, unpredictable universe.
> off-site backups (which Amazon S3 counts as, since it's replicated to multiple datacenters)
I'm not sure if I read that right, but isn't that like saying RAID counts as a backup since it's replicated multiple times? What happens if data (on S3) is corrupted as a result of a logical error?
Not saying this was unpreventable, but there is a tone of disingenuous doublethink in the Amazon status updates recently. Fail-proof is failing.
No matter how much you test, you simply cannot know how a system will behave in a critical state until that state is reached.
The other thing here is availability bias. We see the outages, but we don't hear about the near-outages. We're not seeing a true baseline of the occasions where the system behaved resiliently, according to its design.
Interestingly, it was written by a doctor about medical environments, but it applies to any complex system.
The essay is reproduced in _Web Operations_ by John Allspaw and Jesse Robbins with a web ops spin on it, and is also available online at http://www.ctlab.org/documents/How%20Complex%20Systems%20Fai...