

My Friday Night With AWS - BenjaminCoe
http://www.benjamincoe.com/post/26236654381/my-friday-night-with-aws
Brain dump of my thoughts about how we handled/recovered from the AWS outage last night.
======
inopinatus
In most contexts, Disaster Recovery is not the same as High Availability is
not the same as Fault Tolerance.

So, in this context, if your devops crew is on the ball, then the first
warning in this article:

 _The only way to ensure close to 100% up time is replicating your entire
infrastructure. Infrastructure costs will more than double ..._

is mercifully untrue in the majority of cases.

Why? Because unless the major component of your infrastructure cost is
storage, or your Recovery Point Objective (RPO) is zero, database log
shipping and bulk data sync to another region isn't all that expensive.
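
To make that concrete, here is a minimal sketch of the kind of bulk sync I
mean, using boto3 with made-up bucket names; a real setup would schedule this
and filter to recent objects rather than copying the whole prefix every run:

    # Sketch: copy database backups / WAL archive segments from the primary
    # region's bucket into a standby-region bucket. These are server-side
    # copies, so the data never passes through the host running this script.
    import boto3

    SRC_BUCKET = "myapp-backups-us-east-1"   # hypothetical
    DST_BUCKET = "myapp-backups-eu-west-1"   # hypothetical
    PREFIX = "wal/"

    src = boto3.client("s3", region_name="us-east-1")
    dst = boto3.client("s3", region_name="eu-west-1")

    def sync_prefix():
        paginator = src.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                dst.copy_object(
                    Bucket=DST_BUCKET,
                    Key=obj["Key"],
                    CopySource={"Bucket": SRC_BUCKET, "Key": obj["Key"]},
                )

    if __name__ == "__main__":
        sync_prefix()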

The author may be assuming that you'd need to have the VMs ready to go at the
standby region. This isn't true, not when you can boot a large application
cluster and promote/upgrade a database replica in minutes. For the majority of
businesses, a realistic Recovery Time Objective (RTO) is on the order of
minutes to hours, so this is fine.

I built this recently: a booking system for an airline. It works as intended;
failover time is under five minutes. What enables this is repeatable
deployment, an outcome of careful tooling. The application itself was
developed by an agile & TDD-centric team, which made for an easily
transplanted app.
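
A rough sketch of what that promote-and-boot step can look like, using boto3
with placeholder identifiers (the real system drives this from the deployment
tooling, and your database layer may not be RDS at all):

    # Sketch of a regional failover: promote the standby region's database
    # replica, then boot the application cluster from a prebaked image.
    # Every identifier below (replica name, AMI, sizes) is a placeholder.
    import boto3

    STANDBY_REGION = "eu-west-1"
    REPLICA_ID = "booking-db-replica"   # hypothetical read replica
    APP_AMI = "ami-00000000"            # hypothetical prebaked app image
    CLUSTER_SIZE = 6

    rds = boto3.client("rds", region_name=STANDBY_REGION)
    ec2 = boto3.client("ec2", region_name=STANDBY_REGION)

    # 1. Promote the replica to a standalone primary and wait for it.
    rds.promote_read_replica(DBInstanceIdentifier=REPLICA_ID)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=REPLICA_ID)

    # 2. Boot the application cluster in the standby region.
    resp = ec2.run_instances(
        ImageId=APP_AMI,
        InstanceType="m3.large",
        MinCount=CLUSTER_SIZE,
        MaxCount=CLUSTER_SIZE,
    )
    instance_ids = [i["InstanceId"] for i in resp["Instances"]]
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # 3. Repoint DNS (e.g. a low-TTL CNAME) at the new cluster; not shown.
    print("Standby region is live:", instance_ids)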

~~~
mechanical_fish
_This isn't true, not when you can boot a large application cluster and
promote/upgrade a database replica in minutes._

This is correct. Unfortunately, in the AWS context those of us who confidently
planned to react to trouble in zone US-EAST-X by launching a new cluster in
US-EAST-Y have often been frustrated by failures in Amazon's control plane.
When one zone goes down hard, instances in other zones generally stay running
- though the anecdotal evidence is not perfectly clear - but the ability to
_spin up_ new instances or drives in other zones often breaks.

What I don't remember seeing is a case where a failure spanned regions. Which
does not mean that can't happen; perhaps a catastrophic latent bug in Amazon's
control software could kick it off. But it is far less likely. So planning to
spin up in a distant region is a workable plan, and then your point is back to
being correct. But spanning regions takes a bit more work and planning.
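
A rough sketch of that "spin up somewhere distant" plan, using boto3 with
placeholder AMI IDs; a real version also needs per-region images, keys, and
security groups:

    # Sketch: when the home region's control plane is misbehaving, walk a
    # list of fallback regions until a launch request actually succeeds.
    import boto3
    from botocore.exceptions import ClientError

    FALLBACK_REGIONS = ["us-west-2", "eu-west-1", "ap-southeast-1"]
    AMI_BY_REGION = {r: "ami-00000000" for r in FALLBACK_REGIONS}  # placeholders

    def launch_anywhere(count=4):
        for region in FALLBACK_REGIONS:
            ec2 = boto3.client("ec2", region_name=region)
            try:
                resp = ec2.run_instances(
                    ImageId=AMI_BY_REGION[region],
                    InstanceType="m3.large",
                    MinCount=count,
                    MaxCount=count,
                )
            except ClientError as err:
                # Control-plane trouble (API errors, throttling, capacity)
                # surfaces here; move on to the next region.
                print(f"{region}: launch failed ({err}), trying next region")
                continue
            return region, [i["InstanceId"] for i in resp["Instances"]]
        raise RuntimeError("could not launch in any fallback region")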

And, needless to say, everyone's RTO is different. When people talk casually
about "100% uptime" I tend to think "30 seconds", in which case redundant
running instances are the solution, but obviously you do have to specify it.
Because if you can live with 5 minutes, or even 30 minutes, as most apps
probably can, your life will be much easier.

~~~
inopinatus
Well, er, good, fortunately my point was always in favour of spanning regions,
not zones, so I refer you back to my opening note about not conflating
disaster recovery (i.e. "we lost the DC, now what?") with high availability
("something within the DC is down, shouldn't affect the application").

Because in that analysis, the AZs have never represented an isolated unit of
availability. They are clearly locally interdependent and/or sharing
infrastructure. Heck, not sharing a continental plate is pretty much my #1
criterion for a DR replica.

NB: choosing to deploy in us-east just says to me "I want to save bucks, I
don't care that it's got by far the worst availability track record".

------
alanh
While I appreciate anyone taking the time to share their thoughts, I also find
it very distracting that nearly every sentence contains some sort of
grammatical, orthographic, or structural error.

Does this make me the grammar police, or do I have a valid complaint?

 _Update._ Putting my time where my mouth is: Next time someone has a real
time crunch (as Coe notes at the end) but wants to publish a helpful post in a
timely manner, contact me with a draft or CMS credentials and I’ll take at
least a quick look. Expect no miracles, but I will catch obvious errors.

I also keep wishing I could send pull requests to bloggers with suggested
edits.

------
nothacker
Redundancy wasn't the problem I saw last night. What I saw, at least with
Heroku, is that when I checked, _the main Heroku site was down and displaying
things like nginx errors_. That to me is unacceptable for an operation such as
theirs. Even if all hell is breaking loose, you don't _only_ keep your status
page up for all to see; you put a pretty damn good message up at whatever the
main page resolves to. I'm not saying they screwed the pooch entirely, as I'm
sure they were busy, but, damn it, even Amazon is going to go down sometimes.
Screw redundancy if you can't even serve a webpage to inspire confidence that
you are working on it. I'm sorry to pick on Heroku specifically; I'd be really
f'n surprised if a lot of you weren't in the same boat. You _need_ to have the
main page served when that happens, even if it's just a static page that
inspires confidence, or a redirect to the blog where you provide updates.

------
BenjaminCoe
The first indicator that it was going to be a long Friday night was our
EC2-hosted Minecraft server tipping over. Nagios alerts followed. This is a
brain dump of some of my thoughts about AWS, and a discussion of how we got
back online quickly.

~~~
benatkin
Great post. I'm considering backing up the latest to another blobstore like
Rackspace Cloud Files as well, so I can tell people that I don't depend on a
single vendor.
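
A minimal sketch of that mirroring, with boto3 on the S3 side and a
placeholder standing in for the second vendor's SDK:

    # Sketch: pull each backup object out of S3 and push it to a second,
    # non-AWS blobstore as well. upload_to_secondary() is a stand-in for
    # whatever SDK the other vendor provides (e.g. Rackspace Cloud Files);
    # it is not a real library call.
    import boto3

    BUCKET = "myapp-backups"   # hypothetical
    s3 = boto3.client("s3")

    def upload_to_secondary(name, data):
        # Placeholder: replace with the second vendor's upload call.
        raise NotImplementedError

    def mirror_backups(prefix="backups/"):
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                upload_to_secondary(obj["Key"], body)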

------
talonx
A sober post with good advice compared to most of the rants about the outage
that are now on HN.

It also says something about our appetite for sensationalism that those rants
have more comments than this article!

