Ask HN: How Would You Architect Around Potential AWS Failures?
88 points by byoung2 on April 23, 2011 | 40 comments
With the AWS outages over the last few days, I've been wondering how you would set up a system using only AWS services that would be resistant to multi-availability zone outages across multiple services within a geographical region. Assume that the system that we are setting up is an API that will have heavy read/write traffic with a significant number of users, and that the startup running it is on a typical shoestring budget that makes AWS attractive.

What got a lot of startups in trouble was architecting systems that were fault-tolerant across availability zones within the US-East region, but when one availability zone went down, everybody's apps started flooding the other availability zones, causing more problems. A typical setup might have been an Elastic Load Balancer with EC2 instances in a few availability zones (with the ability to create new instances in other availability zones in response to outages), multi-AZ RDS database servers, and S3 backups to multiple AZs.

What I'm looking for is ideas for taking this setup and expanding it to multiple geographical regions, using only AWS services. Would you have multiple stacks and use Route 53 DNS to route users to different regions? How would you keep databases in sync across regions? Would you use one region as a primary and periodically back up to a secondary region?




If you're on a shoestring budget, can't you just afford the day of downtime once every few years? Yeah, it's annoying to be down, but each 9 you add past 99% costs more than the last.


For reference, here is how much downtime per year each availability percentage allows:

  90%       36.5 days      ("one nine")
  95%       18.25 days
  98%       7.30 days
  99%       3.65 days      ("two nines")
  99.5%     1.83 days
  99.8%     17.52 hours
  99.9%     8.76 hours     ("three nines")
  99.95%    4.38 hours
  99.99%    52.56 minutes  ("four nines")
  99.999%   5.26 minutes   ("five nines")
  99.9999%  31.5 seconds   ("six nines")
http://en.wikipedia.org/wiki/High_availability#Percentage_ca...


It's even easier if you measure per week instead of per year since a week has ~10,000 minutes.

Then the rule of thumb is 10 minutes is 0.1%, 1 minute is 0.01%, and 6 seconds is 0.001%; or 10 mins = 99.9%, 1 min = 99.99%, and 6 seconds = 99.999% respectively.

(Note that 6 seconds a week * 52 weeks = 5.2 minutes, same as the reference table above.)

Programmers don't think in years, but they can think in "any given week". This rule of thumb puts things in an easy-to-remember perspective.
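
If you want exact figures instead of the rule of thumb, the arithmetic fits in a few lines of Python (a back-of-the-envelope sketch, not tied to any particular monitoring setup):

  # Downtime budget implied by an availability target, per week and per year.
  WEEK_MINUTES = 7 * 24 * 60        # 10,080 -- hence the "~10,000 minutes" shortcut
  YEAR_MINUTES = 365 * 24 * 60

  for target in (0.99, 0.999, 0.9999, 0.99999):
      allowed = 1 - target
      print("%.5f  %7.2f min/week  %8.1f min/year"
            % (target, allowed * WEEK_MINUTES, allowed * YEAR_MINUTES))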


I spent a year as a high-performance computing sysadmin. After careful consideration, we just decided reliability wasn't worth the cost. We bought more hardware instead of a UPS system for our data center. Once or twice a year we'd experience a power blip and lose all our compute nodes (the infrastructure was on UPS). We'd send out an apology email to our users and tell them to resubmit their jobs. The worst case was someone having to restart a 14-day job. Had we bought the UPS system, that 14-day job would have taken almost twice as long to complete, spending a week or so longer in the queue. We embraced the idea of acceptable failures and saw that it had a great impact on our service.


Do your users run 14-day jobs with no checkpointing whatsoever? I'd be afraid of a bug in my code crashing the computation 90% of the way through. A MapReduce setup, for example, seems much more resilient to this kind of thing.


> each 9 you add past 99% costs more than the last.

You're right about that. In fact, each 9 past 99% costs twice as much as the last. But on the other hand, when you have paying customers, they might forgive you one day of downtime. Two days and they'll be annoyed. On the third day they'll come for you with pitchforks in hand. If you can avoid the pitchforks with clever architecture and slightly higher costs, I think it's worth it.


Is it only twice? In my experience, 99.9999% uptime is significantly more than 16x more expensive than 99% uptime.

For comparison, 99.9999% uptime means about 30 seconds of downtime in a year, while 99% uptime is about 3 days of downtime. You can get 99% uptime with a singly-homed, not terribly reliable commodity system. For 99.9999%, you need multiple redundancies in every architectural component with automatic error detection and failover, and have to watch every change to make sure it doesn't introduce the possibility of system instability or cascading failures. Those are qualitatively different approaches to software engineering.


DB2University.com wasn't impacted by this AWS disaster, thanks to a DB2 feature known as High Availability and Disaster Recovery (HADR), which, being asynchronous, works exceptionally well over long distances. Essentially, the main server runs on US-East while a failover server runs in a different region. The exact second DB2 detects an issue with US-East, it switches over to the standby server running in a different region. All automated, and without downtime.


> The exact second DB2 detects an issue with US-East, it switches over to the standby server

How does it do this? This is actually a very difficult problem because the monitoring system has to determine whether the primary site is down or whether its own network is experiencing trouble. And once you perform the failover, the old master might not learn that it has lost its master role, and may continue to serve requests to clients. Systems with automated failover usually use a lock service like Google's Chubby.

For most folks, it's better to have a manual failover script that the oncall engineer can run after diagnosing the issue. Automated failover requires a lot of extra complexity in your systems. There's the real risk of total service failure when the lock service goes down. And there are lots of interesting failure modes in the failover process. For a startup on a tight budget, it's probably not worth it just to change 30 minutes of downtime into 1 minute.
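
To make that concrete, a manual failover "script" can be little more than a checklist with teeth. A rough sketch, where promote_standby() and repoint_dns() are hypothetical placeholders for whatever your stack actually does (touching a trigger file, updating a DNS record, and so on):

  # Hypothetical manual-failover runbook: run by the oncall engineer *after*
  # they have diagnosed the outage, so no lock service or automation is needed.
  import sys

  def promote_standby():
      # Placeholder: e.g. create the standby's trigger file so it starts
      # accepting writes. Adapt to your database's promotion mechanism.
      print("TODO: promote the standby database")

  def repoint_dns():
      # Placeholder: update the service's (short-TTL) DNS record to point
      # at the standby site.
      print("TODO: repoint DNS at the standby site")

  if __name__ == "__main__":
      if "--primary-confirmed-down" not in sys.argv:
          sys.exit("Refusing to fail over: check the primary from an external "
                   "network first, then re-run with --primary-confirmed-down.")
      promote_standby()
      repoint_dns()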


Exactly right. Network partition is a hard problem in automatic failover of a replicated system. You don't want the standby to become master unless it can be sure the primary is absolutely down. It's difficult to unwind the mess if two masters are active and accepting changes.

In high-availability system design, the secondary node literally has to shut down the primary's power (the STONITH technique, "Shoot The Other Node In The Head") to ensure the primary is really down when it stops responding over the network.

Of course, with long-distance cross-datacenter replication, shutting down power remotely is not reliable. In the last HA clusters I built, failover between datacenters was a manual decision. That means there could be a 15-to-30-minute window to do the manual failover, but it's an acceptable risk, since datacenter failure is rare; like this AWS failure, it happens once in a blue moon.


I remember, when I first used HA-Linux in a project, being highly amused to come across the acronym STONITH and discover that it meant "Shoot The Other Node In The Head" :-) http://www.linux-ha.org/wiki/STONITH

All jokes aside, it is indeed a very important concept when dealing with high availability.


How does it handle the possible inconsistency due to the asynchronous replication?


Write Ahead Logging.

Async replication doesn't produce inconsistency; it produces uncommitted transactions. Every database produces those when it goes down, whether it replicates or not.

When the master comes back up, all it has to do is reverse the transactions that the slave didn't receive. Voila, consistent database.


What about inconsistencies with data outside the database? Things like credit card transactions or other external API calls that were recorded on the master but not the slave will be inconsistent with your slave's view of the world. Is there a standard way of dealing with those kinds of things, or is that usually handled manually?


You use the logs from the slave to commit the second phase of a two-phase commit, and immediately stop processing new transactions when you only have the slave up.

Your hypothetical API does support two-phase commit, correct? Because if it doesn't, you already have plenty of ways to lose data or create inconsistent data anyway.
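
For readers who haven't seen the pattern, here is a bare-bones sketch of a two-phase-commit coordinator; the Participant interface is invented for illustration and isn't any particular driver's API:

  # Phase 1: every participant durably prepares and votes; phase 2: commit
  # everywhere, or roll back everywhere if anyone voted no.
  class Participant(object):
      """Hypothetical resource: a database, a wrapper around a payment API, ..."""
      def prepare(self, txn_id):  return True   # record intent durably, then vote
      def commit(self, txn_id):   pass          # finish what prepare() recorded
      def rollback(self, txn_id): pass          # undo the prepared work

  def two_phase_commit(txn_id, participants):
      prepared = []
      for p in participants:
          if not p.prepare(txn_id):
              for q in prepared:                # someone voted no: abort everywhere
                  q.rollback(txn_id)
              return False
          prepared.append(p)
      for p in participants:                    # unanimous yes: commit is mandatory
          p.commit(txn_id)
      return True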


> How does it handle the possible inconsistency due to the asynchronous replication?

That would be a big question for me. I can sync databases easily and instantaneously within an availability zone, and nearly instantaneously across availability zones within a region. But once I have to replicate across regions, I add latency to the mix. If replication from US-East to US-West is asynchronous, how would I reconcile the two? I suppose with a catastrophic failure of the primary database in US-East, I could just write off any data that wasn't replicated before the failure, but that doesn't seem like a solid solution. Would it be better to write a transaction layer into the app, so that data isn't considered committed until it has been written and replicated across multiple regions?


There are all kinds of solutions available to you in that scenario. Two-phase commit, on DBs that support it, would probably go a long way towards enabling a transaction layer like you describe.

Usually, though, discussions of DR should start with determining what kind of RPO (recovery point objective) and RTO (recovery time objective) you're willing to pay for, and then evaluating which of the available solutions will get you there.


It depends on the architecture.

OK, first I'd want to make sure I have the requirements right.

- Only use Amazon services.
- Must keep databases in sync across regions; periodic sync is acceptable.
- Not a lot of budget.

OK. My first instinct is to say you're still putting all your eggs in one basket; I'd ask that we evaluate the idea of using a second cloud provider as a backup.

I would also suggest that Route 53 is a pretty new DNS service; can we look at other providers with a proven track record, or at least run a secondary server with one of them?

Now, if you're going to insist on sticking with Amazon only, on a shoestring budget, then we'd set up servers in multiple regions. I'd hope we can control the refresh time on Route 53, as I'd want to keep the DNS TTL low (see the sketch below). We'd be paying for more requests, but we'd have the flexibility to roll over to another region easily.
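
For what it's worth, flipping a low-TTL record at another region is a single record-set change. A sketch using the boto3 SDK (which postdates this thread); the zone id, hostname, and IP are made up:

  import boto3

  # Repoint api.example.com at the standby region. With a 60-second TTL,
  # clients should follow within about a minute.
  route53 = boto3.client("route53")
  route53.change_resource_record_sets(
      HostedZoneId="Z_EXAMPLE_ZONE",            # hypothetical hosted zone
      ChangeBatch={
          "Comment": "fail over to us-west-1",
          "Changes": [{
              "Action": "UPSERT",
              "ResourceRecordSet": {
                  "Name": "api.example.com.",
                  "Type": "A",
                  "TTL": 60,
                  "ResourceRecords": [{"Value": "203.0.113.10"}],   # standby IP
              },
          }],
      },
  )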

As for keeping data in sync, or periodic backups, that really depends on the data requirements and what type of storage is involved, so I won't get into it here.

The main thing I would shoot for, especially on a low budget, is to keep it as simple as possible, even if that means rolling over to an instance with 24-hour-old data. That's better than nothing in most applications. You don't want your interim recovery to be complicated; it should be flip a switch (or change a DNS entry) and there you go.

Recovery is going to be a pain; count on it and expect it, especially on a shoestring budget, because you're likely depending on overworked sysadmins whom you're keeping too busy to run proper recovery testing. That's the kind of thing sysadmins love to do and never get the time for, because project work somehow ends up being more important.


I'd like to see some suggestions here, too. Amazon has shattered the idea that availability zones are truly independent. Both ELB and RDS only work within a single zone, and this latest incident occurred in "multiple availability zones", according to Amazon.

I think relying on Route 53 by itself for DNS is equally dangerous; you should have a co-hosted or at least a backup DNS provider available to you.

Perhaps the best solution involves adding a service external to AWS? Or if you have to stick with AWS, perhaps a master-slave database sync with us-west-1?


> Both ELB and RDS only work within a single zone, and this latest incident occurred in "multiple availability zones", according to Amazon

ELB and RDS do work across multiple zones: a single ELB can span EC2 instances in multiple zones within a region. So I could have four EC2 instances in us-east-1a, us-east-1b, us-east-1c, and us-east-1d, with an RDS instance in us-east-1a and a read replica in us-east-1b. What happens when us-east-1a goes down and every big user in that AZ has failover mechanisms that start moving instances, EBS volumes, RDS databases, S3 buckets, and who knows what else from us-east-1a to us-east-1b, c, and d? Those AZs get overloaded, API endpoints get slammed, and the whole region goes down in flames. It didn't matter that the availability zones are hosted on separate infrastructure (from what I gather, the zones are separate datacenters in the same city with low-latency connections between them, but separate power sources and backbone connections). Meanwhile, all's quiet on the US-West front.

That's what got me thinking about failover across regions.


If you're using exclusively AWS, you should have instances in 2 different regions, at least.

Instead of relying on the elastic load balancer, you should be doing load balancing with DNS using SRV records.

If you are using SRV records and you're at the start of what looks like serious downtime, you can shift all the weight in your SRV records to the instances in the healthy AZ (sketch below).
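
Per RFC 2782, clients choose among SRV records of the same priority in proportion to their weights, so draining a sick location really is just a weight change. A rough sketch of that client-side selection (hostnames and weights are made up):

  import random

  # (priority, weight, port, target): lower priority wins outright; weight
  # splits traffic within a priority class. Weight 0 keeps a record idle.
  srv_records = [
      (10, 100, 443, "api.us-east-1.example.com."),
      (10,   0, 443, "api.us-west-1.example.com."),   # flip weights to fail over
  ]

  def pick_target(records):
      best = min(r[0] for r in records)
      candidates = [r for r in records if r[0] == best]
      total = sum(r[1] for r in candidates)
      if total == 0:
          return random.choice(candidates)[3]
      point = random.uniform(0, total)
      running = 0
      for priority, weight, port, target in candidates:
          running += weight
          if point <= running:
              return target
      return candidates[-1][3]

  print(pick_target(srv_records))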

On the backend, if you're using SQL, you should use a DB with WAL-based async replication, like PostgreSQL.

PostgreSQL 9 has streaming replication integrated into the DB. If you're using PostgreSQL 8, you could use third-party tools like Slony-I or pgpool-II.
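
For reference, the 9.0 streaming setup is only a handful of settings, roughly the following; host names and paths are placeholders, and the primary's pg_hba.conf also needs a replication entry for the standby:

  # primary: postgresql.conf
  wal_level = hot_standby
  max_wal_senders = 3
  wal_keep_segments = 64        # keep enough WAL for the standby to catch up

  # standby: postgresql.conf
  hot_standby = on              # allow read-only queries on the standby

  # standby: recovery.conf
  standby_mode = 'on'
  primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
  trigger_file = '/var/lib/pgsql/failover.trigger'   # touch this to promote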

With the NoSQL databases there tends to be a "last write wins" effect, even in distributed beasts like Cassandra. So if you are running a NoSQL cluster, you need to determine which nodes received the most data during the outage and repair from there.


I'm still not sure why people ever used MySQL for anything more than a simple datastore with an SQL interface.

Since time immemorial (the '90s), MySQL has been the fastest way on the net to lose your data (I still remember the many data corruption bugs). Use Postgres, or, if you have the need and the money, a DB like DB2.


I wrote a blog post on exactly this topic: http://dev.bizo.com/2010/05/improving-global-application.htm...

Basically, you're on the right track. Note Route 53 doesn't support GSLB (Global Server Load Balancing... e.g., different DNS results for users from different geographic locations). Akamai and Dynect do, however. More details in the blog post.


dnsmadeeasy.com also offers global server load balancing (http://www.dnsmadeeasy.com/enterprisedns/trafficdirector.htm...). They offer amazing products and service, and I have used their DNS for years. They just introduced the global server load balancing feature and are testing it out. They are as good as, if not better than, any other DNS provider, but much cheaper.


Thanks for the link...looks like exactly what I'm looking for!


Check out these new features Amazon is adding to Route 53 (its DNS hosting service):

https://forums.aws.amazon.com/thread.jspa?threadID=63893

It basically adds the ability to specify round-robin and weighted records pointing at elastic load balancing groups. That should make it easier to have multiple elastic load balancing groups in different regions.


I think it's ironic that in the face of this failure, people are scrambling to figure out how to give Amazon more money.


I'd like to see Amazon offer some new services to improve working across multiple regions.

Just being able to copy an EBS volume and AMI to a different region with a single API call would help a lot of people quickly establish much better redundancy than they had a week ago.


Yes, better support for running in multiple regions is clearly needed. There needs to be native support for transferring snapshots between regions (at the user's expense, of course).
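
For concreteness, this is what a cross-region snapshot copy looks like with the modern boto3 SDK, which did not exist when this thread was written; the snapshot id and regions are made up:

  import boto3

  # Run against the destination region; EC2 pulls the snapshot across for you.
  ec2_west = boto3.client("ec2", region_name="us-west-1")
  ec2_west.copy_snapshot(
      SourceRegion="us-east-1",
      SourceSnapshotId="snap-0123456789abcdef0",     # hypothetical snapshot id
      Description="cross-region copy for DR",
  )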


That will increase the chances of cross-region failures though.


You almost need an availability zone built for failure events, one that otherwise sees little use, or just spot use. Then you could assume that the people failing over to it need the same resources their live deployment was consuming before the failure, so there is no overload when the failure occurs.


That would mean a 25% price hike on all services. Rather a lot for a situation where the biggest issue seems to be a lack of preparedness and planning for failure on the part of a lot of people.


If you casually refer to uptime as "nine fives" most people won't notice and you'll have a much easier time delivering ;)



"What got a lot of startups in trouble was architecting systems that were fault-tolerant across availability zones within the US-East region, but when one availability zone went down, everybody's apps started flooding the other availability zones"

If most startups implement this solution wouldn't the problem just replicate itself to regions instead of availability zones?

The problem is mathematical in nature, and the dependent variables are outside the control of consumers of the AWS service. To be certain that you can fail over and handle the load, you need to run at no more than 50% capacity (with two locations, each must have enough headroom to absorb the other's full load). EC2 does not guarantee that headroom, therefore you cannot solve the problem with certainty using only AWS resources.


Given that this is a common problem all tenants of the system face, I'd like to see Amazon offer a more holistic approach to fault-related instance migration.

Essentially, if Amazon offers that functionality, rather than everyone running individual scripts, it has a better chance of managing the resources as a whole instead of everyone fighting for them with no overall management.


> If most startups implement this solution wouldn't the problem just replicate itself to regions instead of availability zones?

I think the problem would be less severe across regions than across availability zones within a region, and here's why. A major benefit of multiple availability zones is the low-latency connection between them. That encourages you to copy instances, EBS volumes, S3 buckets, and RDS data between them, often en masse in the event of a failure. Across regions, you're sending data over a slower connection, so you have to take care of replication on an ongoing basis. So your instances, EBS volumes, S3 backups, and RDS instances and replicas would already be in the secondary region when failure occurs in the primary. I'd compare it to having a vacation home and a spare car in another state when an earthquake hits, instead of looking for a shelter near the disaster area.


Since most of the issues here come down to MySQL and other databases:

Why doesn't MySQL support something like Mongo's replica sets? It seems like a wonderful solution for these kinds of issues.


I think the rapid adoption of these first-generation cloud services basically introduces massive inefficiencies and unknowns into any system.

Simply put, no SLA, even with cash penalties attached, really means anything: you actually have to know the capabilities and specs of your system, its power, cooling, and other physical environmental inputs, plus the engineered capacity for live failover at every level, before you can calculate or claim reliability. Anything else is just kidding yourself. A lot of people kid themselves.


Furthermore I would add that there is no 'magic deploy to cloud' button that is going to work most of the time for most people. As much as the RoR fans would love to think so :)

./generate code && tweak-slightly && deploy-reliably && lunch-on-profits # not gonna happen anytime soon for complex systems



