
Ask HN: How Would You Architect Around Potential AWS Failures? - byoung2
With the AWS outages over the last few days, I've been wondering how you would set up a system using only AWS services that would be resistant to multi-availability zone outages across multiple services within a geographical region. Assume that the system that we are setting up is an API that will have heavy read/write traffic with a significant number of users, and that the startup running it is on a typical shoestring budget that makes AWS attractive.

What got a lot of startups in trouble was architecting systems that were fault-tolerant across availability zones within the US-East region, but when one availability zone went down, everybody's apps started flooding the other availability zones, causing more problems. A typical setup might have been an Elastic Load Balancer with EC2 instances in a few availability zones (with the ability to create new instances in other availability zones in response to outages), multi-AZ RDS database servers, and S3 backups to multiple AZ's.

What I'm looking for is ideas for taking this setup and expanding it to multiple geographical regions, using only AWS services. Would you have multiple stacks and use Route 53 DNS to route users to different regions? How would you keep databases in sync across regions? Would you use one region as a primary and periodically back up to
======
jrockway
If you're on a shoestring budget, can't you just afford the day of downtime
once every few years? Yeah, it's annoying to be down, but each 9 you add past
99% costs more than the last.

~~~
agazso
For reference, here is how much downtime per year the system incurs at
different availability percentages:

    
    
      90%      36.5 days      ("one nine")
      95%      18.25 days
      98%      7.30 days
      99%      3.65 days      ("two nines")
      99.5%    1.83 days
      99.8%    17.52 hours
      99.9%    8.76 hours     ("three nines")
      99.95%   4.38 hours
      99.99%   52.56 minutes  ("four nines")
      99.999%  5.26 minutes   ("five nines")
      99.9999% 31.5 seconds   ("six nines")
    

[http://en.wikipedia.org/wiki/High_availability#Percentage_ca...](http://en.wikipedia.org/wiki/High_availability#Percentage_calculation)
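
For anyone who wants to sanity-check the table, the math is a one-liner. A quick Python sketch (the percentages below are just the ones from the table):

```python
# Allowed downtime per (non-leap) year for a given availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct):
    """Minutes of downtime per year implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

The 99.99% row, for instance, works out to 52.56 minutes, matching the table.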

~~~
Terretta
It's even easier if you measure per week instead of per year since a week has
~10,000 minutes.

Then the rule of thumb is 10 minutes is 0.1%, 1 minute is 0.01%, and 6 seconds
is 0.001%; or 10 mins = 99.9%, 1 min = 99.99%, and 6 seconds = 99.999%
respectively.

(Note that 6 seconds a week * 52 weeks = 5.2 minutes, same as the reference
table above.)

Programmers don't think in years, but they can think in "any given week". This
rule of thumb puts things in an easy-to-remember perspective.
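
The weekly rule of thumb is just as easy to check; a short sketch (10,000 minutes is the approximation, 10,080 is the exact figure):

```python
# A week has 10,080 minutes, close enough to 10,000 that the rule of
# thumb "0.1% of a week is about 10 minutes" holds within about 1%.
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

def downtime_minutes_per_week(availability_pct):
    """Minutes of downtime per week implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_WEEK

for pct, rule_of_thumb in ((99.9, 10), (99.99, 1), (99.999, 0.1)):
    print(f"{pct}%: {downtime_minutes_per_week(pct):.3f} min/week "
          f"(rule of thumb: {rule_of_thumb} min)")
```

At 99.999%, 0.1008 minutes per week is the ~6 seconds mentioned above.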

------
acangiano
DB2University.com wasn't impacted by this AWS disaster, thanks to a DB2
feature known as High Availability and Disaster Recovery (HADR) which, being
asynchronous, works exceptionally well for long distances. Essentially, the
main server runs on US-East while a failover server runs in a different
region. The exact second DB2 detects an issue with US-East, it switches over
to the standby server running in a different region. All automated, and
without downtime.

~~~
amock
How does it handle the possible inconsistency due to the asynchronous
replication?

~~~
fleitz
Write Ahead Logging.

Async replication doesn't produce inconsistency; it produces uncommitted
transactions. Every database produces those when it goes down, whether it
replicates or not.

When the master comes back up, all it has to do is reverse the transactions
that the slave didn't receive. Voila, consistent database.

~~~
amock
What about inconsistencies with data outside the database? Things like credit
card transactions or other external API calls that were recorded on the master
but not the slave will be inconsistent with your slave's view of the world. Is
there a standard way of dealing with those kinds of things, or is that usually
handled manually?

~~~
fleitz
You use the logs from the slave to commit the second portion of a two-phase
commit, and immediately stop processing new transactions while only the slave
is up.

Your hypothetical API does support two-phase commit, correct? Because if it
doesn't, you have lots of ways of losing data/creating inconsistent data
anyway.
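
One way to sketch the two-phase idea for external side effects: log a "prepared" record (which replicates along with the rest of the database), make the external call in a revocable form, and only finalize after the local commit; on failover, a recovery pass finishes or cancels pending reservations. Everything here (ExternalGateway, reserve/capture) is a hypothetical stand-in for illustration, not a real payment API:

```python
import uuid

class ExternalGateway:
    """Stand-in for an external API that supports reserve/capture semantics."""
    def __init__(self):
        self.reserved, self.captured = set(), set()
    def reserve(self, token):    # phase 1: side effect is still revocable
        self.reserved.add(token)
    def capture(self, token):    # phase 2: side effect becomes final
        self.reserved.discard(token)
        self.captured.add(token)

def charge(db_log, gateway, order_id):
    """Record intent before the external call, finalize after local commit."""
    token = str(uuid.uuid4())
    db_log.append(("prepared", order_id, token))   # replicated to the slave
    gateway.reserve(token)                         # revocable external call
    db_log.append(("committed", order_id, token))  # local commit
    gateway.capture(token)                         # finalize the side effect
    return token

def recover(db_log, gateway):
    """On failover: finish captures for committed entries, cancel the rest."""
    committed = {t for (state, _, t) in db_log if state == "committed"}
    for token in list(gateway.reserved):
        if token in committed:
            gateway.capture(token)
        else:
            gateway.reserved.discard(token)  # cancel the dangling reservation
```

The point is that the slave's log tells recovery which external reservations to finish and which to drop, so the outside world ends up consistent with the surviving replica.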

------
jrussbowman
It depends on the architecture.

OK, first I'd want to make sure I have the requirements right:

\- Only use Amazon services

\- Must keep databases in sync across regions. Periodic is acceptable.

\- Not a lot of budget

OK. My first instinct is to say you're still putting all your eggs in one
basket; I'd request that we evaluate the idea of using a second cloud provider
as a backup.

I would also suggest that Route 53 is a pretty new DNS provider; can we look
at other providers with a proven track record, or at least run a secondary
server elsewhere?

Now if you're going to insist on sticking with Amazon only and being on a
shoestring budget, then we'd set up stacks in multiple regions. I'd hope we
can control the refresh time on Route 53, as I'd want to keep the DNS TTL low.
We'd be paying for more requests, but we'd have the flexibility to roll over
to another region easily.

As for keeping data in sync, or periodic backups, that's really dependent on
the data requirements and what type of storage is involved and I wouldn't want
to get into it.

The main thing I would shoot for, especially on a low budget, is to keep it as
simple as possible. Even if it means rolling over to an instance with 24-hour-
old data, that's better than nothing in most application cases. You don't want
your interim recovery to be complicated. It should be flip a switch (or change
a DNS entry) and there you go. Recovery is going to be a pain; count on it and
expect it. Especially if you're on a shoestring budget, because you're likely
depending on overworked sysadmins who you're keeping too busy to maintain
accurate recovery testing practices. That's the kind of stuff sysadmins love
to do and never get the time to get to, because project work somehow ends up
being more important.
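
The "flip a switch" rollover above can be sketched as a small health-check loop that repoints a low-TTL DNS record at a standby region. This is only an illustration: `set_dns_record` and `check_health` are stubs, and the hostnames and IPs are made up; a real version would call the Route 53 API instead.

```python
# Sketch of DNS-based failover: keep the TTL low, watch the primary,
# and repoint the record at the standby region when the primary fails.
PRIMARY = "203.0.113.10"   # us-east stack (RFC 5737 documentation IP)
STANDBY = "203.0.113.20"   # us-west stack

records = {"api.example.com": PRIMARY}  # stand-in for the hosted zone

def set_dns_record(name, ip, ttl=60):
    """Stub for a DNS record update; a low TTL makes flips take effect fast."""
    records[name] = ip

def check_health(ip, healthy_ips):
    """Stub health probe; a real one would hit an HTTP health endpoint."""
    return ip in healthy_ips

def failover_if_needed(healthy_ips):
    """Flip the record to the standby if the primary looks down."""
    current = records["api.example.com"]
    if not check_health(current, healthy_ips) and check_health(STANDBY, healthy_ips):
        set_dns_record("api.example.com", STANDBY)
    return records["api.example.com"]
```

The loop would run from somewhere outside the affected region; the low TTL is what keeps the rollover fast for clients.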

------
tshtf
I'd like to see some suggestions here, too. Amazon has shattered the idea that
availability zones are truly independent. Both ELB and RDS only work within a
single zone, and this latest incident occurred in "multiple availability
zones", according to Amazon.

I think relying on Route 53 by itself for DNS is equally dangerous; you should
have a co-hosted or at least a backup DNS provider available to you.

Perhaps the best solution involves adding a service external to AWS? Or if you
have to stick with AWS, perhaps a master-slave database sync with us-west-1?

~~~
byoung2
_Both ELB and RDS only work within a single zone, and this latest incident
occurred in "multiple availability zones", according to Amazon_

ELB and RDS do work across multiple zones: a single ELB can span EC2 instances
in multiple zones within a region. So I could have 4 EC2 instances in us-
east-1a, us-east-1b, us-east-1c, and us-east-1d, with an RDS instance in us-
east-1a and a read replica in us-east-1b. What happens when us-east-1a goes
down and every big user in that AZ has failover mechanisms that start moving
instances, EBS volumes, RDS databases, S3 buckets, and who knows what else
from us-east-1a to us-east-1b, c, and d? Those AZs get overloaded, API
endpoints get slammed, and the whole region goes down in flames. It didn't
matter that these availability zones are hosted on separate infrastructure
(from what I gather, the zones are separate datacenters in the same city, with
low-latency connections between them but separate power sources and backbone
connections). Meanwhile, all's quiet on the US-West front.

That's what got me thinking about failover across regions.

------
sespindola
If you're using exclusively AWS, you should have instances in 2 different
regions, at least.

Instead of relying on the elastic load balancer, you should be doing load
balancing with DNS using SRV records.

If you are using SRV records and you are at the beginning of what seems to be
a serious downtime, you can set all the weight of your SRV records to the
instances in the healthy AZ.

In the backend, if you're using SQL, you should use a DB with WAL-based async
replication, like PostgreSQL.

In PostgreSQL 9, streaming replication is integrated into the DB. If using
PostgreSQL 8, you could use third-party tools like Slony-I or pgpool-II.

With NoSQL databases, there seems to be a "last write wins" effect, even in
distributed beasts like Cassandra. So if you are running a NoSQL cluster, you
need to determine which nodes received the most data during the outage and
repair from there.
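
The weight trick behind the SRV suggestion can be sketched as a weighted random pick over targets, where setting an unhealthy region's weight to 0 drains traffic from it. The hostnames and weights below are made up for illustration:

```python
import random

def pick_target(srv_records):
    """Weighted random choice among (host, weight) SRV-style targets."""
    live = [(host, w) for host, w in srv_records if w > 0]
    if not live:
        raise RuntimeError("no healthy targets")
    total = sum(w for _, w in live)
    roll = random.uniform(0, total)
    upto = 0.0
    for host, weight in live:
        upto += weight
        if roll <= upto:
            return host
    return live[-1][0]

records = [("api.us-east-1.example.com", 10),
           ("api.us-west-1.example.com", 10)]

# During an outage, zero out the sick region's weight so all new
# connections land on the healthy one:
records = [(h, 0 if "us-east" in h else w) for h, w in records]
```

In practice the weights would live in the DNS zone itself, so the "switch" is just a record update that clients pick up as their cached answers expire.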

~~~
spitfire
I'm still not sure why people ever used MySQL for anything more than a simple
datastore with an SQL interface.

Since time eternal (the '90s), MySQL has been the fastest way on the net to
lose your data. (I still remember the many data corruption bugs.) Use Postgres
or, if you have the need/money, a DB like DB2.

------
mikebabineau
I wrote a blog post on exactly this topic:
[http://dev.bizo.com/2010/05/improving-global-application.html](http://dev.bizo.com/2010/05/improving-global-application.html)

Basically, you're on the right track. Note Route 53 doesn't support GSLB
(Global Server Load Balancing... e.g., different DNS results for users from
different geographic locations). Akamai and Dynect do, however. More details
in the blog post.

~~~
vdondeti
dnsmadeeasy.com also offers global server load balancing
([http://www.dnsmadeeasy.com/enterprisedns/trafficdirector.html](http://www.dnsmadeeasy.com/enterprisedns/trafficdirector.html)).
They offer amazing products and service; I have used their DNS service for
years. They just introduced global server load balancing and are still testing
it out. They are just as good as, if not better than, any other DNS provider,
but much cheaper.

------
atambo
Check out these new features Amazon is adding to Route 53 (its DNS hosting
service):

<https://forums.aws.amazon.com/thread.jspa?threadID=63893>

It's basically adding the ability to specify round-robin records with
different weights. That should make it easier to have multiple elastic load
balancing groups in different regions.

------
originalgeek
I think it's ironic that in the face of this failure, people are scrambling to
figure out how to give Amazon more money.

------
aquark
I'd like to see Amazon offer some new services to improve working across
multiple regions.

Just being able to copy an EBS volume and AMI to a different region with a
single API call would help a lot of people quickly establish much better
redundancy than they had a week ago.

~~~
Devilboy
That will increase the chances of cross-region failures though.

~~~
robryan
You almost need an availability zone built for failure events, one that
otherwise sees little use (or just spot use). Then you can assume that the
people failing over to it need the same resources their live deployment was
consuming before the failure, so there is no overload when the failure occurs.

~~~
justincormack
That would mean a 25% price hike on all services. Rather a lot for a situation
when the biggest issues seem to be lack of preparedness and planning for
failure by a lot of people.

------
cmsj
If you casually refer to uptime as "nine fives" most people won't notice and
you'll have a much easier time delivering ;)

~~~
Wicher
That reminds me of

<http://ars.userfriendly.org/cartoons/?id=20080310>

and

<http://ars.userfriendly.org/cartoons/?id=20080311>

------
fleitz
"What got a lot of startups in trouble was architecting systems that were
fault-tolerant across availability zones within the US-East region, but when
one availability zone went down, everybody's apps started flooding the other
availability zones"

If most startups implement this solution wouldn't the problem just replicate
itself to regions instead of availability zones?

The problem is mathematical in nature, and the dependent variables are
uncontrolled by consumers of the AWS service. To be certain that you can fail
over and handle the load, you need to run at no more than 50% capacity. EC2
does not reserve that headroom for you, therefore you cannot solve the problem
with certainty using only AWS resources.
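
The 50% figure falls straight out of the arithmetic; a quick sketch of the capacity math:

```python
# With N live regions each running at utilization u, losing one region
# pushes the survivors to u * N / (N - 1). With two regions, surviving
# a failure means running each at no more than 50%.
def post_failure_utilization(n_regions, utilization):
    """Per-region load after one of n_regions fails, assuming even spread."""
    return utilization * n_regions / (n_regions - 1)

# Two regions at 50% each: the survivor lands exactly at 100%.
assert post_failure_utilization(2, 0.50) == 1.00
# Two regions at 70% each: the survivor is overloaded at 140%.
assert post_failure_utilization(2, 0.70) > 1.0
# More regions relax the bound: four regions at 70% fail over to ~93%.
assert post_failure_utilization(4, 0.70) < 1.0
```

This also shows why the region-wide cascade happened: every tenant's failover plan implicitly assumed someone else had left that headroom free.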

~~~
dotBen
Given this is a common problem that all tenants of the system face, I'd like
to see Amazon offer a more holistic approach to fault-related instance
migration.

Essentially, if Amazon offers that functionality rather than leaving it to
individual scripts, it has a better chance of managing the resource as a
whole, rather than everyone fighting for resources with no overall management.

------
xal
Since most of the issues here come down to MySQL and other databases:

Why doesn't MySQL support something like Mongo's replica sets? It seems like a
wonderful solution for these kinds of issues.

------
avstraliitski
I think the rapid adoption of these first-generation cloud services basically
introduces massive inefficiencies and unknowns into any system.

Simply put, no SLA - even with cash penalties attached - really means
anything: you actually have to know the capabilities and specs of your system,
its power, cooling and other physical environmental inputs, plus the
engineered capacities for live failover on all levels before you can calculate
or claim reliability. Anything else is just kidding yourself. A lot of people
kid themselves.

~~~
avstraliitski
Furthermore I would add that there is no 'magic deploy to cloud' button that
is going to work most of the time for most people. As much as the RoR fans
would love to think so :)

./generate code && tweak-slightly && deploy-reliably && lunch-on-profits # not
gonna happen anytime soon for complex systems

