The most simple would be to set up your instances in a multiple AZ's and then configure the ELB to round robin requests until the health of one of the instances is poor.
Any other thoughts?
I'm in one of the us-east zones and I haven't had a failure in at least a year. They retired one machine I was using and dealing with that was as simple as starting and stopping -- at a time I chose.
With five zones in U.S. East, the probability of a zone failure affecting a single zone systems is 1 in 5.
If you're a busybody who spreads your system across five zones, the probability of a failure affecting you becomes 1.
You're spending more money, and dealing with a lot more complexity, all to increase the probability that hardware failures will affect you.
Now, you're hoping that a zone-distributed system will be able to recover from failures, but that's tricky to do and it's quite unlikely that this will work if you haven't tested it. Add the fact that all the other "cool kids" will be trying to recover their systems at this time and make AMZN's control plane go down.
In the meantime, with probability 4/5 I'm sleeping through the disaster and the first time I hear about it is on hacker news.
You do need to actually engineer for failover and recovery, though. That point is correct: It doesn't necessarily help you to have half of your servers survive an outage if that still kills your service or requires painful amounts of manual intervention. And it really is a lot easier to just shrug, keep your backups up to date, and pray for your zone to come back ASAP. Do your cost-benefit analysis.
Sorry to be fatalist, but it's a hard problem. This last outage was more than just an AZ failure. Region-wide API usage was affected, so operations like static IP reassignment and ELB changes were not taking effect. This means you are hanging out in the wind should there be something unusual that requires manual intervention (as was the case with us).
Route 53 is a good service but I don't know how its control plane works, and it could be that problems in a single region would disable the ability to update DNS records (I would guess that DNS reads are a lot more available than writes). And in any case DNS is not a very good failover mechanism due to upstream caching.
Unless your business model requires higher reliability than Instagram, Netflix, and Pinterest, I'd suggest going multi-AZ, crossing your fingers, and doing everything else right.
AWS is an amazing tool and you have a few options here, but the downside is the more options you use to be highly-available (HA), the more expensive AWS gets (as you would imagine).
Your first option is to be HA across a SINGLE region; to do this you make use of the elastic load balancers (ELB) + auto-scaling. You setup auto-scale rules to launch more instances in different availability zones (AZ) either in response to demand or in response to failures (e.g. "always keep at least 3 instances running").
You compliment that with an ELB to load-balance incoming requests automatically across those instances in the different AZs. This is all fairly straight forward through the web console (except auto-scaling is still done via CLI for some reason)
If you want to be HA ACROSS regions you can't just use ELBs anymore, you have some added complexity and an additional AWS feature you will likely want to use: Route 53.
Route 53 is Amazon's DNS service which offers a lot of slick DNS-features like removing dead points from DNS rotation, latency-based routings, etc. There are also something like 29 deployments of Route53 (and CloudFront) around the globe so you'll hopefully never have Route53 become a point of failure for you even disaster strikes.
In this scenario you would setup the HA configuration for a single-region as mentioned above, but you would do it in multiple regions. Put another way, 2+ servers in multiple AZs in each AWS region. Then a Route53 DNS configuration setup to point to each ELB in each region representing those individual pockets of servers.
Ontop of that you would use Route53 to manage all routing of client requests into your entire domain; you can leverage the new "latency-based routing" (effectively why everyone was asking for GeoDNS for years, but even better) and monitor capability to ensure you aren't routing anyone to a dead region.
Here is what I would recommend given the size of your budget and need to stay up in the AWS cloud, in-order of expense:
1. Launch a single instance in a region with acceptable latency that has never had an outage before (e.g. Oregon has never completely gone down but Virginia has -- yes yes I know VA is older, but you understand my point). This solution will be cheaper than multiple instances in any region.
2. Launch multiple instances using the web console, in multiple AZs in US-EAST (cheapest option for multi-instances) and front them with an ELB. You skip any auto-scaling complexity here but you need to keep an eye on your servers. I think ELB fixed the issue where it would effectively route traffic into the void if all the instances in an AZ went down.
OPTIONAL: If you didn't mind spending a few $ more, you could do this strategy in the region that has never gone down for added piece of mind.
3. Launch single instances in multiple REGIONS and front them with Route53. This isn't really a recommended setup as entire regions will disappear if you lose a single instance, BUT I said I would list possibilities in order of price, so there you go. You could mitigate this by setting up auto-scaling policies to replace any dead instances quickly in the off chance you wanted to do exactly this but not babysit the web console all day.
4. Launch multiple instances in each region, across multiple AZs fronted by ELBs and then the entire collection fronted by Route53.
NOTE: The real cost comes from the additional instances and not from Route53 or the ELB; so if you can use smaller instances to help keep costs down (or reserved instances also) that might allow you to provide a larger HA setup.
What about my data?
Yes, yes... this is an issue that someone already touched on (data locality below).
You will have to decide on a single region to hold your data; in this case I would recommend using DB services that aren't based on EC2 and have never experienced outages (or rarely) -- this includes S3, SimpleDB and/or DynamoDB. AWS's MySQL offering (RDS) are just custom EC2 instances with MySQL running on them, so any time EC2 goes down, RDS goes down.
The other DB offerings are all custom and except for SimpleDB a long time ago, have never experienced outages that I am aware of.
Making this choice is all about latency and which DB store you are comfortable with (obviously don't choose SimpleDB if everything you do requires MySQL -- then use RDS); you'll want your data as close to your web tier as possible, so if you are spread across all regions you'll just want to pick a region with the smallest latency to MOST of your customers (typically West coast if you have a lot of Asia/Aus customers and East coast if you have a lot of European customers).
Want to Go to 11?
If you have the money and desperately want to go to 11 with this regional-scale (which I love to do, so I am sharing this) you can combine services like DynamoDB and SQS to effectively create a globally distributed NoSQL datastore with behavior along the lines of:
1. Write operation comes into a region, immediately write it to the local DynamoDB instance, asynchronously queue the write command in SQS and return to the caller.
2. In 1+ additional EC2 instances running daemons, pull messages from SQS in chunk sizes that make sense and re-play them out to the other regions DynamoDB stores; erase the messages when processed or if the processing fails the next dameon to spin up will replay it.
3. On reads, just hit the local DynamoDB in any region and reply; we trust our reconciliation threads to do the work to keep us all in sync eventually.
NOTE: If you prefer to do read-repairs here you can, but it will increase complexity and inter-region communication which all costs money.
The challenges with this approach is that you pull up a lot of DB concerns into your code like conflict resolution, resync'ing entire regions after failure, bringing new regions online and ensuring they are synchronized, diffs, etc.
There is a reason AWS doesn't offer a globally-distributed data store: it is a really hard problem to get right once you make it past the 80% use case.
Your data will determine if this is an option or not; some data allows for certain amounts of inconsistency in which case this strategy is awesome and works great; while other data (e.g. banking data) cannot allow a single wiggle of inconsistency in which case pulling all this DB logic up into the application is a bad idea. Your failure scenarios become catastrophic (e.g. your conflict-resolution logic is wrong and wipes out the balance from an account; or keeps re-filling the balance on an empty account... something bad basically)
It is all a trade-off though; if you managed your own Cassandra cluster though, Cassandra does all this and much more for you automatically; but then you just put your time into Cassandra administration instead of developing the logic around DynamoDB (or SimpleDB, or MySQL, or whatever); just pick which devil you feel more comfortable with.
I am not aware of a services company yet that offers cross-region AWS datastore deployments yet; Datastax and Iris Couch will setup things for that like you via a consulting/custom arrangement, but there isn't a dashboard for launching something like that automatically.
Hope that helped (and didn't bring you to tears of boredom)
However, it seems that at some point here you are better off going back to getting dedicated hosts at a couple different distributed data centers and dealing with the complexity yourself. Surely by the time you "get to 11", you are spending more to be on AWS than doing it yourself would cost?
Plus, Amazon is building tools, like the above-mentioned Route 53, to support just this use case.
Now, those of us who have worked a lot within one AWS region will complain and moan about the horror of trying to work cross-region, but that's not because AWS makes cross-region unusually hard. It's because they make your work within one region unusually easy. We're all spoiled by, e.g., the magic of cloning EBS volumes from zone to zone via snapshots. (Well, okay, it's magical when it's working, anyway. ;)
Currently our dependencies on AWS is..
[-] Route 53
[-] Small EC2 Instance
[-] Standard RDS
[-] S3/CF etc...
We are still in total stealth mode and I'm about to set up the actual architecture for our platform next week.
I was thinking about keeping everything in a single region.
I was going to have:
[x] Elastic Load Balancer
[x] Declare one zone as our main zone (this can be 1A for example). Have one medium instance in zone-1A
[x] Have my RDS in zone-1A
[x] Set up another instance in zone-2A, and then let the ELB distribute the requests...
The only problem is that clearly, I'm running both mysql and an instance in zone-1A, which is my single point of failure. I will be able to survive an outage on 2A though.
How should I handle the data locality?
One thing you must do is keep a MySQL data dump, as recent as you can afford (at least one per day, more often if you can; note that dumps can have nontrivial performance impacts), in an accessible location. (The easy thing to use is S3; though you're still vulnerable to Amazon outages, my empirical observation over my last three years as an AWS devops guy is that S3 rarely goes down, even when EBS or EC2 is freaking out. But the paranoid person has backups outside Amazon as well.)
Then, in an emergency, boot another MySQL server in a different zone (or even region), recover the DB, and go.
There are lots of problems with this plan. One is that it is wicked slow at the best of times: Your downtime will be measured in minutes or hours. The second is that, when AWS is having an outage, it very often becomes difficult to impossible to do anything with its API, especially launching new machines. (My hypothesis is that this is often due to the surge of thousands of people just like you, all trying to launch new machines to handle the outage.) So, again, downtime could be hours. But hours is better than days or decades.
For actual HA you must run two MySQL servers at all times, one in each of two availability zones, one of which has a slightly older copy of the data. To make "slightly older" as short as possible, most folks master-master replicate the data between the servers. But you must not write to both DB servers at the same time, so one machine will still be "in charge" of writes at any given moment, and you'll have to have a scheme for swapping that "in charge of writes" status over to the other DB when the first one fails. (I'd suggest the "human logs in with SSH and runs a script" method to start, on the assumption that you don't need HA on the time scale of thirty seconds rather than thirty minutes.)
There are several other bits of plumbing involved with replication: Setting up something like Nagios to alert you when replication breaks, learning how to rebuild replication from scratch when one server dies, et cetera. You'll want to check out Percona Toolkit. Or, though I haven't used it yet, you should read up on Percona XtraDB Cluster, which I think does all of the above and comes with the option to buy support from Percona, which has smart folks:
The next stage, if you've got money, really love setting up new tools, and laugh defiantly in the face of latency, is to try master-master replication across AWS regions using a product like Continuent Tungsten: http://continuent.com/downloads/software . But I'm not sure what it costs. Probably more than you want to spend at this point in your product lifecycle.
1. Use Rightscale. You can get away with the free edition, but for $500/month the basic paid edition will allow you access to arrays and all the excellent scripts available on the marketplace.
2. The front end. I would strongly suggest moving away from ELB. We are using it and are about to get rid of it. The main problem is what exactly happened last night. If a whole AZ goes down, the ELB for that zone can get screwed and the DNS was not updating the CNAME to remove the bad zone. Instead of ELB, we have our own LB solution we are going to roll out that will use Rightscale server arrays and will handle the updating of the DNS names itself. We also aren't going to use Route53, because we learned last night that the API for that can go down and you can get stuck with bad DNS records.
3. Application servers. Use at least 3 AZ and have them evenly spaced. This is easy to do in Rightscale with sever arrays. Make sure your voting ration for scaling isn't 50% because you might not scale correctly if you loose 2 AZ. Keep the vote to 30% and you will be happy (if one zone votes to grow, let it grow).
4. Database. This is the fun one. We have been using MongoDB with pretty good success. Our multi-shard DB has 3 servers per replica set and has them distributed equally between AZs. We use 4 EBS drive RAID-0 drives for storage which have had problems in the past due to the outages that EBS sometimes has. Our best bet has been a watcher process that will kill the mongod process if there's any problems writing to the drive array. By doing this, the replica set will automatically failover to the next server and we won't get stuck with a primary node that can't write back to disk. For backups, we just freeze the writes on the secondaries and do EBS snapshots even 15 minutes. Rightscale has some great EBS tools for managing this for you. If you loose a server, we can deploy a new server in a matter of minutes and it will rebuild the RAID array from the last backup so we have a warm spare.
5. Monitor, monitor, monitor. Rightscale has some great tools for monitoring everything. Use them, and use more monitoring on other infrastructure (ie Pingdom)
Doing something like this will cost a lot more that just sticking to a single AZ, but you should be able to survive one, if not two complete datacenter outages.
Unless you can't actually get a new server provisioned because the now-fragile API (like last night) is under such load from people trying to mitigate their downtime... We use rightscale, and rightscale won't solve this issue. If AWS is being clobbered by people trying to get new boxes up, using a third-party api abstraction service doesn't help.
Also, rightscale themselves were affected last night, throwing invalid alerts about servers being inaccessible when they were actually still operating normally.
Most outages of AWS don't last more than a few hours. The real goal is to make sure your infrastructure can hobble on one leg for those few hours until help arrives and you can cleanup the mess once the outage is over.
When the API goes down it sure isn't fun. Just try to project yourself the best you can. I had no problems with Rightscale last night other than anything trying to reach EC2.
If shit really hits the fan, its not a bad idea to make sure you have a OpenVPN tunnel ready on each server.
This will allow to get connectivity between old and new instances if you can't update the security groups due to the API being down.
Anyhow, sorry I don't have a simple answer - I'm not sure a simple answer exists.
The nice thing is Cloudflare does a lot of this for you for free.
PHPFog are on AWS and I pay them to make sure they have the redundancy worked out. If they don't, I would yell at them until I got some money back.
I am considering configuring RDS for Multi A-Z, but need to research it a little more first. From what I can tell you just click a button to turn it on, but there were a lot of people complaining yesterday that the fail-over didn't work at all when it was supposed to.
I also have a bunch of EC2 VMs that do back-end processing and have a load of CRON jobs on them that need to run once every 24 hours. If these go down for a couple of hours then there is no noticeable impact to my customers, they can still log into my service and access their historical data.
I have considered spreading across multiple regions etc but at the end of the day it's just too expensive for the small increase in reliability.
1) Has AutoScaling Groups & Elastic Load Balancers in two regions (and only two availability zones; let's keep front-end instances in the same AZ as your local/region-specific DB)
2) Has Databases in two regions and uses Master Master replication
3) Instances talk to their local DB. If they detect their local DB is down, they failover to the remote DB (ie, the far region). If they failover, they notify you.
4) DNS does geographic load balancing (pre-ELB). You'll need to use a provider like DynDNS or UltraDNS to give you Geo Load Balancing & Failover. Or, you could pair a monitoring service like CatchPoint with Route53
5) Application caching (Memcache, Redis, etc). Let's not put more load on the DB's than necessary.
That's a good start, at least.
I don't think they show how to do ELB across Regions, or diversity against single-ELB problems (although I haven't seen ELB fail yet). You'd probably have to build this yourself.
Jun 30, 12:15 AM PDT [..] Elastic Load Balancers were also impacted by this event. ELBs are still experiencing delays in provisioning load balancers and in making updates to DNS records.
Jun 30, 12:37 AM PDT ELB is currently experiencing delayed provisioning and propagation of changes made in API requests. As a result, when you make a call to the ELB API to register instances, the registration request may take some time to process. As a result, when you use the DescribeInstanceHealth call for your ELB, the state may be inaccurately reflected at that time. To ensure your load balancer is routing traffic properly, it is best to get the IP addresses of the ELB's DNS name (via dig, etc.) then try your request on each IP address. We are working as fast as possible to get provisioning and the API latencies back to normal range.