

How to work around Amazon EC2 outages - webmonkeyuk
http://webmonkeyuk.wordpress.com/2011/04/21/how-to-work-around-amazon-ec2-outages/

======
mechanical_fish
There might be good advice in here for an average day, but as a piece of
_timely_ advice targeted to this particular outage it's very, very annoying --
fingernails-on-a-chalkboard annoying, to those of us in the trenches --
because it is full of crap. The premise is faulty, and half of these
strategies are useless today, because:

A) We've seen downtime in multiple zones, not just one. The _big_ problem
seems like it might be localized to one zone, but there are spillover effects.
See below.

B) We have extensive EBS snapshots, thank you very much, and, yes, they span
availability zones. Unfortunately, they can't be reliably reinstantiated as
volumes in _any zone_ in US-East.
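
For the record, "reinstantiating" means turning a snapshot back into a
volume, which is normally a single API call. A rough boto sketch, with a
made-up snapshot ID, size, and zone, of the operation that keeps failing
today:

    # Sketch: turn an EBS snapshot back into a usable volume in a chosen
    # zone. The snapshot ID, size, and zone below are hypothetical.
    import boto

    ec2 = boto.connect_ec2()
    volume = ec2.create_volume(size=100,                # GiB
                               zone='us-east-1a',       # any US-East zone
                               snapshot='snap-12345678')
    print(volume.id)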

If one is going to chase ambulances, one could at least provide some advice
that is actually topical: say, an article on the tradeoffs and design
choices needed to run a service that spans _regions_ or even _companies_,
not just zones.

Sorry to be so testy, but it's not a good day for platitudes.

~~~
webmonkeyuk
No need for apologies.

I must admit that it's very much an idiot's guide aimed at smaller setups or
people less familiar with web operations.

I can assure you there's no copy/paste and I'm surprised at the interest it's
had.

~~~
mechanical_fish
Don't worry; hopefully by tomorrow this disaster will be over and we'll be
back to the state where your advice is really good advice that everyone needs
to hear and I'll be sipping delicious drinks somewhere.

------
cscotta
A few of these options are good in principle, but they are not necessarily
informed by operational experience with the more common failure modes of AWS
at medium-to-large scale (roughly 50+ instances).

The author recommends using EBS volumes to provide for backups and snapshots.
However, Amazon's EBS system is one of the more failure-prone components of
the AWS infrastructure, and lies at the heart of this morning's outage [1].
Any steps you can take to reduce your dependence upon a service that is both
critical to operation and failure-prone will limit the surface of your
vulnerability to such outages. While the snapshotting ability of EBS is nice,
waking up to a buzzing pager to find that half of the EBS volumes in your
cluster have dropped out, hosing each of the striped RAID arrays you've set up
to achieve reasonable IO throughput, is not. Instead, consider using the
ephemeral drives of your EC2 instances, switching to a non-snapshot-based
backup strategy, and replicating data to other instances and AZs to improve
resilience.
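
To make that last suggestion concrete, here is a minimal sketch, assuming
boto, credentials in the environment, and a made-up bucket and file layout,
of shipping a dump from an instance's ephemeral drive to S3 instead of
relying on EBS snapshots:

    # Sketch: push a nightly dump from the ephemeral drive (mounted at /mnt)
    # to S3 rather than snapshotting an EBS volume. Bucket and paths are
    # hypothetical.
    import boto
    from boto.s3.key import Key

    s3 = boto.connect_s3()
    bucket = s3.get_bucket('example-backups')

    key = Key(bucket)
    key.key = 'db/nightly/dump-2011-04-21.sql.gz'
    key.set_contents_from_filename('/mnt/backups/dump.sql.gz')

Replicating to other AZs can then be as simple as having each replica pull
the latest object down on a schedule.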

The author also recommends Elastic Load Balancers to distribute load across
services in multiple availability zones. Load balancing across availability
zones is excellent advice in principle, but it still succumbs to the problem
above when EBS is unavailable: ELB instances are themselves backed by
Amazon's EBS infrastructure. ELBs can be excellent day-to-day and provide
some great monitoring and introspection. However, having a quick Chef script
ready to spin up an Nginx or HAProxy balancer, then flipping DNS over to it,
could save your bacon in an outage that also takes out ELBs, like today's.
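
For a rough idea of that fallback, the sketch below, assuming boto and a
pre-baked AMI with HAProxy already configured (the AMI ID, key pair, and
security group are hypothetical), launches a stand-in balancer outside ELB;
pointing DNS at its public address is then an ordinary record change at your
DNS provider:

    # Sketch: launch a replacement HAProxy balancer from a pre-baked AMI
    # when ELB is unavailable. All identifiers below are hypothetical.
    import boto

    ec2 = boto.connect_ec2()
    reservation = ec2.run_instances(
        'ami-12345678',                  # AMI with HAProxy baked in
        instance_type='m1.large',
        key_name='ops',
        security_groups=['load-balancers'],
        placement='us-east-1d')          # pick a zone that is still healthy

    instance = reservation.instances[0]
    print(instance.id)  # once running, point your DNS record at its public IP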

With each service provider incident, you learn more about your availability,
dependencies, and assumptions, along with what must improve. Proportional
investment following each incident should reduce the impact of subsequent
provider issues. Naming and shaming providers in angry Twitter posts will not
solve your problem, and it most certainly won't solve your users' problem.
Owning your availability by taking concrete steps following each outage to
analyze what went down and why, mitigating your exposure to these factors, and
measuring your progress during the next incident will. It is exciting to see
these investments pay off.

Some of these investments:

– _Painfully_ thorough monitoring of every subsystem of every component of
your infrastructure. When you get paged, it's good to know _exactly_ what's
having issues rather than checking each manually in blind suspicion.

– Threshold-based alerting (a rough sketch follows this list).

– Keeping failover for all systems as automated, quick, and transparent as is
reasonably possible.

– Spreading your systems across multiple availability zones and regions, with
the ideal goal of being able to lose an entire AZ/region without a complete
production outage.

– Team operational reviews and incident analysis that expose the root cause of
an issue, but also spider out across your system's dependencies to
preemptively identify other components which are vulnerable to the same sort
of problem.
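
On the threshold-based alerting bullet above, even something this small
conveys the idea; the endpoint URL, threshold, and addresses are made up,
and a real setup would live in your monitoring system rather than a one-off
script:

    # Sketch: crude threshold-based alert. Page if a health endpoint is slow
    # or unreachable. URL, threshold, and addresses are hypothetical.
    import time
    import smtplib
    import urllib2

    URL = 'http://example.com/health'
    THRESHOLD = 2.0  # seconds

    problem = None
    start = time.time()
    try:
        urllib2.urlopen(URL, timeout=10).read()
        elapsed = time.time() - start
        if elapsed > THRESHOLD:
            problem = 'slow: %.1f s' % elapsed
    except Exception as e:
        problem = 'down: %s' % e

    if problem:
        body = 'Subject: ALERT %s\n\n%s' % (URL, problem)
        smtplib.SMTP('localhost').sendmail('alerts@example.com',
                                           ['oncall@example.com'], body)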

---

[1] See the response from AWS in the first reply here:
https://forums.aws.amazon.com/thread.jspa?messageID=239106&#...

------
reedlaw
What about those who rely on Heroku? So far there is no way to run multi-AZ
deployments there.

------
gubatron
Don't put all your infrastructure on a single cloud provider... spread your
eggs across different baskets.

