Ask HN: Best setups to avoid availability outages on AWS
128 points by richardv on June 30, 2012 | 33 comments
Using only AWS services, what do you put in place to help prevent disruptions when a single availability zone goes down?

The simplest would be to set up your instances in multiple AZs and then configure the ELB to round-robin requests, dropping an instance from rotation when its health check fails.

Any other thoughts?




I hate to sound like a simpleton, but for a small operation you're best off putting all your eggs in one basket.

I'm in one of the us-east zones and I haven't had a failure in at least a year. They retired one machine I was using and dealing with that was as simple as starting and stopping -- at a time I chose.

With five zones in US East, the probability of a zone failure affecting a single-zone system is 1 in 5.

If you're a busybody who spreads your system across five zones, the probability of a failure affecting you becomes 1.

You're spending more money, and dealing with a lot more complexity, all to increase the probability that hardware failures will affect you.

Now, you're hoping that a zone-distributed system will be able to recover from failures, but that's tricky to do and it's quite unlikely to work if you haven't tested it. Add the fact that all the other "cool kids" will be trying to recover their systems at the same time, hammering AMZN's control plane until it goes down too.

In the meantime, with probability 4/5 I'm sleeping through the disaster and the first time I hear about it is on hacker news.


This is correct; probability works against you if you spread your instances over multiple zones without also compensating with some very solid failover. If you are not willing to put the effort into truly going fully active/active/active across THREE availability zones, you might as well just stick with one and weather the outages.


I think that's a bit of an exaggeration. My old company (Acquia) kept a lot of customers up through the various outages with mere two-zone redundancy.

You do need to actually engineer for failover and recovery, though. That point is correct: It doesn't necessarily help you to have half of your servers survive an outage if that still kills your service or requires painful amounts of manual intervention. And it really is a lot easier to just shrug, keep your backups up to date, and pray for your zone to come back ASAP. Do your cost-benefit analysis.


What if Amazon DNS goes down? It doesn't matter how many zones you've got... DNS (yes, it is redundant and highly available internally) remains a single point of failure.


Route 53 has had 100% uptime for serving DNS requests since launch. The API hasn't, but that only affects changes to your config. It's never stopped serving resolve requests.


My thought: Have a very nice screen for your mobile app/website that says "We are down for maintenance, please stand by."

Sorry to be fatalist, but it's a hard problem. This last outage was more than just an AZ failure. Region-wide API usage was affected, so operations like static IP reassignment and ELB changes were not taking effect. This means you are hanging out in the wind should there be something unusual that requires manual intervention (as was the case with us).

Route 53 is a good service but I don't know how its control plane works, and it could be that problems in a single region would disable the ability to update DNS records (I would guess that DNS reads are a lot more available than writes). And in any case DNS is not a very good failover mechanism due to upstream caching.

Unless your business model requires higher reliability than Instagram, Netflix, and Pinterest, I'd suggest going multi-AZ, crossing your fingers, and doing everything else right.


I think Route 53's TTL is 60 seconds.


TTL is not always honored -- some caching nameservers clamp it to a minimum, and Android has a bug (just recently fixed) that pegs it to 10 minutes: http://code.google.com/p/android/issues/detail?id=7904


Mindless rambling ahead; I love this topic

Richard,

AWS is an amazing tool and you have a few options here, but the downside is that the more options you use to be highly available (HA), the more expensive AWS gets (as you would imagine).

Your first option is to be HA across a SINGLE region; to do this you make use of the elastic load balancers (ELB) + auto-scaling. You set up auto-scale rules to launch more instances in different availability zones (AZ) either in response to demand or in response to failures (e.g. "always keep at least 3 instances running").

You complement that with an ELB to load-balance incoming requests automatically across those instances in the different AZs. This is all fairly straightforward through the web console (except auto-scaling, which is still done via the CLI for some reason).
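To make that concrete, here is a minimal sketch of the single-region setup using boto3, the current Python SDK (this thread predates it, and the console or CLI works just as well). All names, AMI IDs, and zones here are hypothetical:

```python
import boto3

elb = boto3.client("elb", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Classic ELB spanning two availability zones, with a health check so
# unhealthy instances are pulled out of rotation automatically.
elb.create_load_balancer(
    LoadBalancerName="web-elb",
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
elb.configure_health_check(
    LoadBalancerName="web-elb",
    HealthCheck={"Target": "HTTP:80/health", "Interval": 30, "Timeout": 5,
                 "UnhealthyThreshold": 2, "HealthyThreshold": 2},
)

# Auto-scaling group that always keeps at least 3 instances running,
# spread across the same zones and registered with the ELB.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-lc",
    ImageId="ami-12345678",   # hypothetical AMI
    InstanceType="m1.small",
)
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    LaunchConfigurationName="web-lc",
    MinSize=3, MaxSize=6,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["web-elb"],
    HealthCheckType="ELB",   # replace instances the ELB reports unhealthy
)
```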

If you want to be HA ACROSS regions you can't just use ELBs anymore, you have some added complexity and an additional AWS feature you will likely want to use: Route 53.

Route 53 is Amazon's DNS service, which offers a lot of slick DNS features like removing dead endpoints from rotation, latency-based routing, etc. There are also something like 29 deployments of Route53 (and CloudFront) around the globe, so you'll hopefully never have Route53 become a point of failure for you even if disaster strikes.

In this scenario you would set up the HA configuration for a single region as mentioned above, but you would do it in multiple regions. Put another way: 2+ servers in multiple AZs in each AWS region, then a Route53 DNS configuration set up to point to the ELB in each region fronting those individual pockets of servers.

On top of that you would use Route53 to manage all routing of client requests into your entire domain; you can leverage the new "latency-based routing" (effectively what everyone was asking for when they asked for GeoDNS all those years, but even better) and monitoring capability to ensure you aren't routing anyone to a dead region.
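A sketch of what one region's latency-based record might look like via the API (boto3 again; zone IDs and names are hypothetical, and you would repeat the UPSERT once per region with that region's ELB):

```python
import boto3

route53 = boto3.client("route53")

# One latency-based alias record per region; Route 53 answers each query
# with the lowest-latency region that is still passing health checks.
route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",  # hypothetical hosted zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.example.com.",
            "Type": "A",
            "SetIdentifier": "us-east-1",   # unique per regional record
            "Region": "us-east-1",          # enables latency-based routing
            "AliasTarget": {
                "HostedZoneId": "Z35SXDOTRQ7X7K",  # the ELB's own zone id
                "DNSName": "web-elb-123.us-east-1.elb.amazonaws.com.",
                "EvaluateTargetHealth": True,      # skip a dead region
            },
        },
    }]},
)
```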

SIMPLIFICATION

--------------

Here is what I would recommend given the size of your budget and need to stay up in the AWS cloud, in order of expense:

1. Launch a single instance in a region with acceptable latency that has never had an outage before (e.g. Oregon has never completely gone down but Virginia has -- yes yes I know VA is older, but you understand my point). This solution will be cheaper than multiple instances in any region.

2. Launch multiple instances using the web console, in multiple AZs in US-EAST (cheapest option for multi-instances) and front them with an ELB. You skip any auto-scaling complexity here but you need to keep an eye on your servers. I think ELB fixed the issue where it would effectively route traffic into the void if all the instances in an AZ went down.

OPTIONAL: If you didn't mind spending a few $ more, you could do this strategy in the region that has never gone down for added peace of mind.

3. Launch single instances in multiple REGIONS and front them with Route53. This isn't really a recommended setup, as entire regions will disappear if you lose a single instance, BUT I said I would list possibilities in order of price, so there you go. You could mitigate this by setting up auto-scaling policies to replace any dead instances quickly, on the off chance you wanted to do exactly this but not babysit the web console all day.

4. Launch multiple instances in each region, across multiple AZs fronted by ELBs and then the entire collection fronted by Route53.

NOTE: The real cost comes from the additional instances and not from Route53 or the ELB; so using smaller instances (or reserved instances) to keep costs down might allow you to afford a larger HA setup.

What about my data?

-------------------------

Yes, yes... this is an issue that someone already touched on (data locality below).

You will have to decide on a single region to hold your data; in this case I would recommend using DB services that aren't based on EC2 and have never experienced outages (or rarely) -- this includes S3, SimpleDB and/or DynamoDB. AWS's MySQL offering (RDS) is just custom EC2 instances with MySQL running on them, so any time EC2 goes down, RDS goes down.

The other DB offerings are all custom and except for SimpleDB a long time ago, have never experienced outages that I am aware of.

Making this choice is all about latency and which DB store you are comfortable with (obviously don't choose SimpleDB if everything you do requires MySQL -- then use RDS); you'll want your data as close to your web tier as possible, so if you are spread across all regions you'll just want to pick a region with the smallest latency to MOST of your customers (typically West coast if you have a lot of Asia/Aus customers and East coast if you have a lot of European customers).

Want to Go to 11?

-----------------

If you have the money and desperately want to go to 11 with this regional-scale (which I love to do, so I am sharing this) you can combine services like DynamoDB and SQS to effectively create a globally distributed NoSQL datastore, with behavior along the lines of the following (sketched in code after the list):

1. Write operation comes into a region, immediately write it to the local DynamoDB instance, asynchronously queue the write command in SQS and return to the caller.

2. In 1+ additional EC2 instances running daemons, pull messages from SQS in chunk sizes that make sense and replay them out to the other regions' DynamoDB stores; erase the messages when processed, or, if the processing fails, the next daemon to spin up will replay them.

3. On reads, just hit the local DynamoDB in any region and reply; we trust our reconciliation threads to do the work to keep us all in sync eventually.

NOTE: If you prefer to do read-repairs here you can, but it will increase complexity and inter-region communication which all costs money.
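A rough sketch of steps 1-3 in Python with boto3, assuming hypothetical table/queue names and only two regions; conflict resolution and region resync, the hard parts discussed next, are deliberately omitted:

```python
import json
import boto3

LOCAL, REMOTE = "us-east-1", "us-west-2"
table = boto3.resource("dynamodb", region_name=LOCAL).Table("events")
queue = boto3.resource("sqs", region_name=LOCAL).get_queue_by_name(
    QueueName="replication")

def write(item):
    """Step 1: commit locally, queue the write for other regions, return."""
    table.put_item(Item=item)
    queue.send_message(MessageBody=json.dumps(item))

def replay_daemon():
    """Step 2: drain the queue and replay writes into the remote region.
    Messages are deleted only after a successful remote write, so if a
    daemon dies mid-batch, its messages reappear for the next daemon."""
    remote = boto3.resource("dynamodb", region_name=REMOTE).Table("events")
    while True:
        for msg in queue.receive_messages(MaxNumberOfMessages=10,
                                          WaitTimeSeconds=20):
            remote.put_item(Item=json.loads(msg.body))
            msg.delete()

# Step 3: reads just hit the local table and trust eventual reconciliation.
```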

The challenge with this approach is that you pull a lot of DB concerns up into your code, like conflict resolution, resyncing entire regions after failure, bringing new regions online and ensuring they are synchronized, diffs, etc.

There is a reason AWS doesn't offer a globally-distributed data store: it is a really hard problem to get right once you make it past the 80% use case.

Your data will determine if this is an option or not; some data allows for certain amounts of inconsistency, in which case this strategy is awesome and works great, while other data (e.g. banking data) cannot allow a single wiggle of inconsistency, in which case pulling all this DB logic up into the application is a bad idea. Your failure scenarios become catastrophic (e.g. your conflict-resolution logic is wrong and wipes out the balance from an account, or keeps re-filling the balance on an empty account... something bad, basically).

It is all a trade-off; if you managed your own Cassandra cluster, Cassandra would do all this and much more for you automatically, but then you just put your time into Cassandra administration instead of developing the logic around DynamoDB (or SimpleDB, or MySQL, or whatever); just pick which devil you feel more comfortable with.

I am not aware of a services company that offers cross-region AWS datastore deployments yet; Datastax and Iris Couch will set something like that up for you via a consulting/custom arrangement, but there isn't a dashboard for launching it automatically.

Hope that helped (and didn't bring you to tears of boredom)


The above is why I love to read HN...

However, it seems that at some point here you are better off going back to getting dedicated hosts at a couple different distributed data centers and dealing with the complexity yourself. Surely by the time you "get to 11", you are spending more to be on AWS than doing it yourself would cost?


If anything, I suspect that the opposite is the case. If you're planning to set up servers in data centers on opposite sides of the globe, AWS is great. This is where their abstraction layer pays off. All their regions work the same way, accept the same API calls, are backed by effectively the same hardware. Get your setup working in one region, then copy a few S3 buckets over, change the --region argument to some calls in your script, and suddenly you're up in Singapore.

Plus, Amazon is building tools, like the above-mentioned Route 53, to support just this use case.

Now, those of us who have worked a lot within one AWS region will complain and moan about the horror of trying to work cross-region, but that's not because AWS makes cross-region unusually hard. It's because they make your work within one region unusually easy. We're all spoiled by, e.g., the magic of cloning EBS volumes from zone to zone via snapshots. (Well, okay, it's magical when it's working, anyway. ;)
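For instance, that zone-to-zone EBS cloning is just a couple of API calls (a boto3 sketch; IDs are hypothetical). Snapshots live at the region level, so a snapshot taken in one AZ can seed a volume in any other AZ in the region:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot a volume that lives in us-east-1a...
snap = ec2.create_snapshot(VolumeId="vol-12345678",
                           Description="clone to another AZ")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# ...and materialize a copy of it in us-east-1b.
clone = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                          AvailabilityZone="us-east-1b")
print("new volume:", clone["VolumeId"])
```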


Was a really helpful read actually.

Currently our dependencies on AWS are:

[-] Route 53

[-] Small EC2 Instance

[-] Standard RDS

[-] S3/CF etc...

We are still in total stealth mode and I'm about to set up the actual architecture for our platform next week.

I was thinking about keeping everything in a single region. I was going to have:

[x] Elastic Load Balancer

[x] Declare one zone as our main zone (this can be 1A for example). Have one medium instance in zone-1A

[x] Have my RDS in zone-1A

[x] Set up another instance in zone-2A, and then let the ELB distribute the requests...

The only problem is that, clearly, I'm running both MySQL and an app instance in zone-1A, which is my single point of failure. I will be able to survive an outage on 2A though.

How should I handle the data locality?


You're doing the best you can with one MySQL server.

One thing you must do is keep a MySQL data dump, as recent as you can afford (at least one per day, more often if you can; note that dumps can have nontrivial performance impacts), in an accessible location. (The easy thing to use is S3; though you're still vulnerable to Amazon outages, my empirical observation over my last three years as an AWS devops guy is that S3 rarely goes down, even when EBS or EC2 is freaking out. But the paranoid person has backups outside Amazon as well.)
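A minimal, cron-able sketch of that dump-and-upload, assuming a hypothetical bucket and database name:

```python
import datetime
import subprocess
import boto3

stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
dumpfile = "/tmp/mydb-%s.sql.gz" % stamp

# --single-transaction takes a consistent InnoDB snapshot without locking
# tables for the whole dump, which limits the performance impact.
with open(dumpfile, "wb") as out:
    dump = subprocess.Popen(["mysqldump", "--single-transaction", "mydb"],
                            stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")

# S3 tends to survive EC2/EBS trouble, but the paranoid also copy offsite.
boto3.client("s3").upload_file(dumpfile, "my-backup-bucket",
                               "mysql/" + dumpfile.rsplit("/", 1)[-1])
```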

Then, in an emergency, boot another MySQL server in a different zone (or even region), recover the DB, and go.

There are lots of problems with this plan. One is that it is wicked slow at the best of times: Your downtime will be measured in minutes or hours. The second is that, when AWS is having an outage, it very often becomes difficult to impossible to do anything with its API, especially launching new machines. (My hypothesis is that this is often due to the surge of thousands of people just like you, all trying to launch new machines to handle the outage.) So, again, downtime could be hours. But hours is better than days or decades.

For actual HA you must run two MySQL servers at all times, one in each of two availability zones, one of which has a slightly older copy of the data. To make "slightly older" as short as possible, most folks master-master replicate the data between the servers. But you must not write to both DB servers at the same time, so one machine will still be "in charge" of writes at any given moment, and you'll have to have a scheme for swapping that "in charge of writes" status over to the other DB when the first one fails. (I'd suggest the "human logs in with SSH and runs a script" method to start, on the assumption that you don't need HA on the time scale of thirty seconds rather than thirty minutes.)
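The "human runs a script" step might look something like this (a sketch using PyMySQL; it assumes replication is already master-master and that the read_only flag is what gates writes):

```python
import pymysql

def promote(new_master_host):
    """Swap 'in charge of writes' status over to the surviving server."""
    conn = pymysql.connect(host=new_master_host, user="admin",
                           password="secret", autocommit=True)
    with conn.cursor() as cur:
        # Stop fetching from the dead master, then open up for writes.
        cur.execute("STOP SLAVE IO_THREAD")
        cur.execute("SET GLOBAL read_only = 0")
    conn.close()
    # ...then repoint the application (config push, DNS, or a VIP) at
    # new_master_host, and fence the old master so it can't take writes.
```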

There are several other bits of plumbing involved with replication: Setting up something like Nagios to alert you when replication breaks, learning how to rebuild replication from scratch when one server dies, et cetera. You'll want to check out Percona Toolkit. Or, though I haven't used it yet, you should read up on Percona XtraDB Cluster, which I think does all of the above and comes with the option to buy support from Percona, which has smart folks:

http://www.percona.com/software/percona-xtradb-cluster/

The next stage, if you've got money, really love setting up new tools, and laugh defiantly in the face of latency, is to try master-master replication across AWS regions using a product like Continuent Tungsten: http://continuent.com/downloads/software . But I'm not sure what it costs. Probably more than you want to spend at this point in your product lifecycle.


Thanks a lot for sharing. I can add one more option: write to one master MySQL DB and use slaves in other locations. Reads would be near the web server; if the master colo fails, you can't write. But there are workarounds, like a standby master in another location.


Decide which part of the CAP Theorem http://en.wikipedia.org/wiki/CAP_theorem you want to give up on. Presumably you decided that Availability was not it, so you need to program around lack of consistency and/or partition tolerance. Essentially that means there is no "master database", and you will need to reconcile differing views. This can get quite application specific, and you need to understand your data well.


FWIW I initially went into ELB assuming it would solve a lot of my redundancy problems. And while it has helped a lot (I spread my frontend across 3 zones), I've suffered through a number of ELB failures or disruptions, including this latest one, which is one of the worst. Even with fully functioning servers that I can connect to individually, ELB was intermittently rejecting connections and failed to reregister instances. There's no silver bullet! Just prepare for failure and attempt to handle it gracefully, learning from each one. I suppose you should also think hard before you launch into a greater AWS budget to increase availability. Most of us are tempted to do that after each major incident--which is why Amazon can walk away from these events in a better position than before (until they have a genuine competitor that is).


We run a few decent-sized social games and we have survived all the major AWS region outages in the past year. Here's what we do and what I would suggest.

1. Use Rightscale. You can get away with the free edition, but for $500/month the basic paid edition will allow you access to arrays and all the excellent scripts available on the marketplace.

2. The front end. I would strongly suggest moving away from ELB. We are using it and are about to get rid of it. The main problem is exactly what happened last night: if a whole AZ goes down, the ELB for that zone can get screwed, and DNS was not being updated to remove the bad zone's CNAME. Instead of ELB, we have our own LB solution we are going to roll out that will use Rightscale server arrays and will handle updating the DNS names itself. We also aren't going to use Route53, because we learned last night that its API can go down and you can get stuck with bad DNS records.

3. Application servers. Use at least 3 AZs and spread servers evenly between them. This is easy to do in Rightscale with server arrays. Make sure your voting ratio for scaling isn't 50%, because you might not scale correctly if you lose 2 AZs. Keep the vote at 30% and you will be happy (if one zone votes to grow, let it grow).

4. Database. This is the fun one. We have been using MongoDB with pretty good success. Our multi-shard DB has 3 servers per replica set, distributed equally between AZs. We use 4-drive EBS RAID-0 arrays for storage, which have had problems in the past due to the outages that EBS sometimes has. Our best bet has been a watcher process that will kill the mongod process if there are any problems writing to the drive array (a sketch of such a watcher appears after this list). By doing this, the replica set will automatically fail over to the next server and we won't get stuck with a primary node that can't write back to disk. For backups, we just freeze the writes on the secondaries and do EBS snapshots every 15 minutes. Rightscale has some great EBS tools for managing this for you. If you lose a server, we can deploy a new server in a matter of minutes and it will rebuild the RAID array from the last backup so we have a warm spare.

5. Monitor, monitor, monitor. Rightscale has some great tools for monitoring everything. Use them, and add more monitoring on other infrastructure (e.g. Pingdom).

Doing something like this will cost a lot more than just sticking to a single AZ, but you should be able to survive one, if not two, complete datacenter outages.
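The watcher idea from item 4 could be as simple as this sketch (paths and the kill policy are assumptions, not the actual production code): probe the data directory with a direct write, and kill mongod if the write errors or hangs, so the replica set elects a healthy primary.

```python
import os
import signal
import subprocess
import time

DATA_DIR = "/data/mongodb"   # hypothetical mount point of the RAID-0 array
PID_FILE = "/var/run/mongodb/mongod.pid"   # hypothetical pid file

def disk_writable(path, timeout=10):
    probe = os.path.join(path, ".disk-probe")
    try:
        # A write that hangs is as bad as one that errors, so run the
        # probe in a subprocess we can time out; oflag=direct bypasses
        # the page cache so we actually touch the device.
        subprocess.run(
            ["dd", "if=/dev/zero", "of=%s" % probe, "bs=4096", "count=1",
             "oflag=direct"],
            timeout=timeout, check=True, capture_output=True)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

while True:
    if not disk_writable(DATA_DIR):
        pid = int(open(PID_FILE).read())
        os.kill(pid, signal.SIGKILL)   # failover beats a wedged primary
    time.sleep(30)
```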


> If you lose a server, we can deploy a new server in a matter of minutes and it will rebuild the RAID array from the last backup so we have a warm spare.

Unless you can't actually get a new server provisioned because the now-fragile API (like last night) is under such load from people trying to mitigate their downtime... We use rightscale, and rightscale won't solve this issue. If AWS is being clobbered by people trying to get new boxes up, using a third-party api abstraction service doesn't help.

Also, rightscale themselves were affected last night, throwing invalid alerts about servers being inaccessible when they were actually still operating normally.


That's also the beauty of Rightscale. If you do everything correctly you should be able to provision a new server in another region if need be, or even another cloud (e.g. Rackspace).

Most outages of AWS don't last more than a few hours. The real goal is to make sure your infrastructure can hobble on one leg for those few hours until help arrives and you can cleanup the mess once the outage is over.

When the API goes down it sure isn't fun. Just try to protect yourself the best you can. I had no problems with Rightscale last night other than anything trying to reach EC2.


And also,

If shit really hits the fan, it's not a bad idea to make sure you have an OpenVPN tunnel ready on each server.

This will allow you to get connectivity between old and new instances if you can't update the security groups due to the API being down.


Sending traffic to different zones isn't the challenge; the challenge is deciding where your master data will live. In fact, this has always been one of the biggest challenges of building a fault-tolerant system. If your master data store lives in one zone then you've got latency issues, but if it lives in multiple zones then you need to find a logical way to shard. You could also replicate across zones and then turn off writes if the zone with the master fails. You could even change masters in that case, but there's risk of data loss there.

Anyhow, sorry I don't have a simple answer - I'm not sure a simple answer exists.


There would be some value in failing somewhat gracefully even when your database, AZ, etc. is down. Redirect to a site-specific outage message with updates, workarounds, etc. Serving read-only also works, for some applications.

The nice thing is Cloudflare does a lot of this for you for free.


Multi-AZ in RDS does the replication and automatic failover for you; no sharding necessary.



I am on PHPFog for my front-end with an AWS RDS back-end. I managed to survive this incident without an outage (I am on US East as well), although I did get some horrendous response times from RDS for about an hour there.

PHPFog are on AWS and I pay them to make sure they have the redundancy worked out. If they don't, I would yell at them until I got some money back.

I am considering configuring RDS for Multi-AZ, but need to research it a little more first. From what I can tell you just click a button to turn it on, but there were a lot of people complaining yesterday that the failover didn't work at all when it was supposed to.
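For what it's worth, the button has an API equivalent (a boto3 sketch; the instance identifier is hypothetical). Enabling Multi-AZ can cause a brief I/O hit while the standby is built, so schedule it off-peak:

```python
import boto3

boto3.client("rds").modify_db_instance(
    DBInstanceIdentifier="mydb",
    MultiAZ=True,            # synchronous standby in another AZ
    ApplyImmediately=True,   # or let it wait for the maintenance window
)
```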

I also have a bunch of EC2 VMs that do back-end processing and have a load of cron jobs on them that need to run once every 24 hours. If these go down for a couple of hours then there is no noticeable impact to my customers; they can still log into my service and access their historical data.

I have considered spreading across multiple regions etc but at the end of the day it's just too expensive for the small increase in reliability.


Assuming we want to minimize latency and maximize reliability, we want to create a stack that:

1) Has AutoScaling Groups & Elastic Load Balancers in two regions (and only two availability zones; let's keep front-end instances in the same AZ as your local/region-specific DB)

2) Has Databases in two regions and uses Master Master replication

3) Instances talk to their local DB. If they detect their local DB is down, they fail over to the remote DB (ie, the far region). If they fail over, they notify you. (A sketch of this appears after the list.)

4) DNS does geographic load balancing (pre-ELB). You'll need to use a provider like DynDNS or UltraDNS to give you Geo Load Balancing & Failover. Or, you could pair a monitoring service like CatchPoint with Route53

5) Application caching (Memcache, Redis, etc). Let's not put more load on the DB's than necessary.

That's a good start, at least.
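Item 3 might look like this sketch, assuming MySQL and PyMySQL; hostnames and credentials are hypothetical. Instances prefer the DB in their own region and fall back to the far region, alerting when they do:

```python
import pymysql

LOCAL_DB = "db.us-east-1.internal"    # same region/AZ as this instance
REMOTE_DB = "db.us-west-2.internal"   # the far region

def notify_ops(message):
    print(message)  # placeholder: page via SNS, email, PagerDuty, etc.

def connect():
    for host in (LOCAL_DB, REMOTE_DB):
        try:
            return pymysql.connect(host=host, user="app", password="secret",
                                   database="app", connect_timeout=3)
        except pymysql.err.OperationalError:
            if host == LOCAL_DB:
                notify_ops("local DB %s down; failing over" % host)
    raise RuntimeError("no database reachable in either region")
```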


I haven't tried this (I use single EC2 deployments, some Heroku, also have a Hetzner server) but it is something that I have been thinking of: have the web services that back your web app on a single server, and yes, that will fail on hopefully rare occasions. Host the Javascript+HTML5+CSS front end on S3 with Cloudfront CDN. The home page of your app will almost never go offline and you control what to report to your users if your backend services are offline. Sure you lose core functionality, but you still have static content and a friendly message about temporary lack of services.

Going beyond that at a cost of slow response times when trying to access a downed backend, you could deploy back end web services to two different hosting providers, perhaps running something like CouchDB replicated on each provider. The Javascript on your UI could switch to an alternative back end after a timeout. For "one page" style apps, you could maintain the state information that a backend host is down in the browser.


Start here: http://aws.amazon.com/architecture/

I don't think they show how to do ELB across Regions, or diversity against single-ELB problems (although I haven't seen ELB fail yet). You'd probably have to build this yourself.


Today's status update (http://status.aws.amazon.com/) indicates partial ELB failure, still unresolved:

Jun 30, 12:15 AM PDT [..] Elastic Load Balancers were also impacted by this event. ELBs are still experiencing delays in provisioning load balancers and in making updates to DNS records.

Jun 30, 12:37 AM PDT ELB is currently experiencing delayed provisioning and propagation of changes made in API requests. As a result, when you make a call to the ELB API to register instances, the registration request may take some time to process. As a result, when you use the DescribeInstanceHealth call for your ELB, the state may be inaccurately reflected at that time. To ensure your load balancer is routing traffic properly, it is best to get the IP addresses of the ELB's DNS name (via dig, etc.) then try your request on each IP address. We are working as fast as possible to get provisioning and the API latencies back to normal range.
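The "dig the ELB's name and try each IP" workaround they describe, sketched in Python (the ELB name and path are hypothetical):

```python
import http.client
import socket

ELB_NAME = "web-elb-123.us-east-1.elb.amazonaws.com"  # hypothetical

# Resolve the ELB's DNS name to its current set of IPs...
_, _, ips = socket.gethostbyname_ex(ELB_NAME)

# ...and try the request against each IP directly.
for ip in ips:
    conn = http.client.HTTPConnection(ip, 80, timeout=5)
    try:
        # Keep the Host header so the backend sees the real site name.
        conn.request("GET", "/health", headers={"Host": "www.example.com"})
        print(ip, conn.getresponse().status)
    except OSError as exc:
        print(ip, "failed:", exc)
    finally:
        conn.close()
```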


From what I've heard, you're on the right track. However, I'd want it to not round-robin but go to the nearest working node. I don't use AWS, so I don't know how to configure the ELB, but I would assume that this is possible.


Super low TTL and Refresh combined with replication to a DR provider. High availability placed exclusively on a single provider -- even Amazon (albeit across different AZs) -- is of zero value if all of Amazon itself is offline...


I'm shocked no one has mentioned http://www.cedexis.com/ yet.


Try a multi-infrastructure PaaS like http://appfog.com/




