AWS is an amazing tool and you have a few options here, but the downside is that the more of those options you use to be highly available (HA), the more expensive AWS gets (as you would imagine).
Your first option is to be HA across a SINGLE region; to do this you make use of Elastic Load Balancers (ELB) + auto-scaling. You set up auto-scaling rules to launch more instances in different availability zones (AZs), either in response to demand or in response to failures (e.g. "always keep at least 3 instances running").
You complement that with an ELB to load-balance incoming requests automatically across those instances in the different AZs. This is all fairly straightforward through the web console (except auto-scaling is still done via the CLI, for some reason).
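To make that concrete, here is a minimal sketch of the auto-scaling half using boto3, the Python SDK (the CLI commands express the same thing). The AMI id, instance type, ELB name, and AZs are hypothetical placeholders, not anything from the original setup:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Launch configuration: what each replacement instance should look like.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-12345678",       # hypothetical AMI id
        InstanceType="t2.micro",      # placeholder instance type
    )

    # Auto Scaling group: "always keep at least 3 instances running",
    # spread across AZs and registered with an existing ELB named web-elb.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=3,
        MaxSize=6,
        DesiredCapacity=3,
        AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
        LoadBalancerNames=["web-elb"],   # classic ELB created separately
        HealthCheckType="ELB",           # replace instances the ELB marks unhealthy
        HealthCheckGracePeriod=300,
    )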
If you want to be HA ACROSS regions you can't just use ELBs anymore; there is some added complexity and an additional AWS service you will likely want to use: Route 53.
Route 53 is Amazon's DNS service, which offers a lot of slick features like removing dead endpoints from DNS rotation, latency-based routing, etc. There are also something like 29 deployments of Route 53 (and CloudFront) around the globe, so you'll hopefully never have Route 53 become a point of failure for you even if disaster strikes.
In this scenario you would set up the single-region HA configuration mentioned above, but do it in multiple regions. Put another way: 2+ servers across multiple AZs in each AWS region, with a Route 53 DNS configuration pointing at the ELB in each region that fronts each of those pockets of servers.
On top of that, you would use Route 53 to manage all routing of client requests into your domain; you can leverage the new "latency-based routing" (effectively what everyone was asking GeoDNS for over the years, but even better) and its monitoring capability to ensure you aren't routing anyone to a dead region.
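As a rough sketch of the Route 53 piece in boto3 (hosted zone id, domain, and ELB details below are all hypothetical; each latency-based record needs a SetIdentifier and Region, and EvaluateTargetHealth is what pulls a dead region out of rotation):

    import boto3

    route53 = boto3.client("route53")

    def latency_record(region, elb_dns_name, elb_zone_id):
        # One latency-based alias record per regional ELB; Route 53 answers
        # with the healthy endpoint closest (latency-wise) to the client.
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": region,
                "Region": region,
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,      # the ELB's own hosted zone id
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }

    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",   # hypothetical hosted zone for example.com
        ChangeBatch={"Changes": [
            latency_record("us-east-1", "east-elb-123.us-east-1.elb.amazonaws.com.", "ZELBEAST"),
            latency_record("eu-west-1", "west-elb-456.eu-west-1.elb.amazonaws.com.", "ZELBWEST"),
        ]},
    )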
SIMPLIFICATION
--------------
Here is what I would recommend given the size of your budget and the need to stay up in the AWS cloud, in order of expense:
1. Launch a single instance in a region with acceptable latency that has never had an outage before (e.g. Oregon has never completely gone down but Virginia has -- yes, yes, I know VA is older, but you understand my point). This solution will be cheaper than multiple instances in any region.
2. Launch multiple instances using the web console, in multiple AZs in US-EAST (the cheapest region for running multiple instances), and front them with an ELB. You skip any auto-scaling complexity here, but you need to keep an eye on your servers yourself. I think ELB fixed the issue where it would effectively route traffic into the void if all the instances in an AZ went down.
OPTIONAL: If you didn't mind spending a few $ more, you could apply this strategy in the region that has never gone down, for added peace of mind.
3. Launch single instances in multiple REGIONS and front them with Route 53. This isn't really a recommended setup, as an entire region will drop out of your rotation if you lose a single instance, BUT I said I would list possibilities in order of price, so there you go. You could mitigate this by setting up auto-scaling policies to replace any dead instances quickly, on the off chance you wanted to do exactly this but not babysit the web console all day.
4. Launch multiple instances in each region, across multiple AZs fronted by ELBs and then the entire collection fronted by Route53.
NOTE: The real cost comes from the additional instances and not from Route 53 or the ELB; so if you can use smaller instances (or reserved instances) to keep costs down, that might allow you to afford a larger HA setup.
What about my data?
-------------------------
Yes, yes... this is an issue that someone already touched on (data locality below).
You will have to decide on a single region to hold your data; in this case I would recommend using DB services that aren't based on EC2 and have never (or rarely) experienced outages -- this includes S3, SimpleDB and/or DynamoDB. AWS's MySQL offering (RDS) is just custom EC2 instances with MySQL running on them, so any time EC2 goes down, RDS goes down.
The other DB offerings are all custom and, except for SimpleDB a long time ago, have never experienced outages that I am aware of.
Making this choice is all about latency and which DB store you are comfortable with (obviously don't choose SimpleDB if everything you do requires MySQL -- then use RDS); you'll want your data as close to your web tier as possible, so if you are spread across all regions you'll just want to pick a region with the smallest latency to MOST of your customers (typically West coast if you have a lot of Asia/Aus customers and East coast if you have a lot of European customers).
Want to Go to 11?
-----------------
If you have the money and desperately want to go to 11 with this regional scale-out (which I love to do, so I am sharing this), you can combine services like DynamoDB and SQS to effectively create a globally distributed NoSQL datastore, with behavior along the lines of the following (there's a rough sketch in code below):
1. A write operation comes into a region: immediately write it to the local DynamoDB table, asynchronously queue the write command in SQS, and return to the caller.
2. On 1+ additional EC2 instances running daemons, pull messages from SQS in chunk sizes that make sense and replay them against the other regions' DynamoDB stores; delete each message when it has been processed, or, if processing fails, the next daemon to spin up will replay it.
3. On reads, just hit the local DynamoDB in any region and reply; we trust our reconciliation threads to do the work to keep us all in sync eventually.
NOTE: If you prefer to do read-repairs here you can, but it will increase complexity and inter-region communication which all costs money.
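Here is the rough sketch promised above of steps 1 and 2 in boto3 (table name, queue URL, and regions are hypothetical; the hard parts -- batching, retries, conflict resolution -- are deliberately left out):

    import json
    import boto3

    LOCAL_REGION = "us-east-1"
    REMOTE_REGIONS = ["eu-west-1", "ap-southeast-1"]
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replication"  # hypothetical

    dynamodb = boto3.resource("dynamodb", region_name=LOCAL_REGION)
    sqs = boto3.client("sqs", region_name=LOCAL_REGION)

    def write(item):
        # Step 1: write locally, queue the write for the other regions, return.
        dynamodb.Table("users").put_item(Item=item)
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))

    def replication_daemon():
        # Step 2: replay queued writes against every other region's table and
        # delete each message only after it has been applied everywhere.
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                item = json.loads(msg["Body"])
                for region in REMOTE_REGIONS:
                    boto3.resource("dynamodb", region_name=region) \
                         .Table("users").put_item(Item=item)
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])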
The challenge with this approach is that you pull a lot of DB concerns up into your code: conflict resolution, resyncing entire regions after failure, bringing new regions online and ensuring they are synchronized, diffs, etc.
There is a reason AWS doesn't offer a globally-distributed data store: it is a really hard problem to get right once you make it past the 80% use case.
Your data will determine whether this is an option or not; some data allows for a certain amount of inconsistency, in which case this strategy is awesome and works great, while other data (e.g. banking data) cannot allow a single wiggle of inconsistency, in which case pulling all this DB logic up into the application is a bad idea: your failure scenarios become catastrophic (e.g. your conflict-resolution logic is wrong and wipes out the balance of an account, or keeps refilling the balance of an empty account... something bad, basically).
It is all a trade-off, though: if you managed your own Cassandra cluster, Cassandra would do all this and much more for you automatically, but then you would be putting your time into Cassandra administration instead of developing the logic around DynamoDB (or SimpleDB, or MySQL, or whatever). Just pick whichever devil you feel more comfortable with.
I am not aware of a services company that offers cross-region AWS datastore deployments yet; Datastax and Iris Couch will set up something like that for you via a consulting/custom arrangement, but there isn't a dashboard for launching it automatically.
Hope that helped (and didn't bring you to tears of boredom)
However, it seems that at some point here you are better off going back to getting dedicated hosts at a couple of different data centers and dealing with the complexity yourself. Surely by the time you "get to 11", you are spending more to be on AWS than doing it yourself would cost?
If anything, I suspect that the opposite is the case. If you're planning to set up servers in data centers on opposite sides of the globe, AWS is great. This is where their abstraction layer pays off. All their regions work the same way, accept the same API calls, are backed by effectively the same hardware. Get your setup working in one region, then copy a few S3 buckets over, change the --region argument to some calls in your script, and suddenly you're up in Singapore.
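For instance, here is the same idea in boto3 rather than the CLI (hypothetical AMI id and instance type; AMIs are region-scoped, so they have to be copied over first). The per-region difference really does boil down to the region parameter:

    import boto3

    sg = boto3.client("ec2", region_name="ap-southeast-1")

    # Copy the AMI built back home into the Singapore region.
    copy = sg.copy_image(SourceRegion="us-east-1",
                         SourceImageId="ami-12345678",   # hypothetical
                         Name="web-ami-singapore")
    sg.get_waiter("image_available").wait(ImageIds=[copy["ImageId"]])

    # Same launch call as in the home region, just pointed at Singapore.
    sg.run_instances(ImageId=copy["ImageId"],
                     InstanceType="t2.micro",
                     MinCount=1, MaxCount=1)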
Plus, Amazon is building tools, like the above-mentioned Route 53, to support just this use case.
Now, those of us who have worked a lot within one AWS region will complain and moan about the horror of trying to work cross-region, but that's not because AWS makes cross-region unusually hard. It's because they make your work within one region unusually easy. We're all spoiled by, e.g., the magic of cloning EBS volumes from zone to zone via snapshots. (Well, okay, it's magical when it's working, anyway. ;)
We are still in total stealth mode and I'm about to set up the actual architecture for our platform next week.
I was thinking about keeping everything in a single region.
I was going to have:
[x] Elastic Load Balancer
[x] Declare one zone as our main zone (this can be 1A for example). Have one medium instance in zone-1A
[x] Have my RDS in zone-1A
[x] Set up another instance in zone-2A, and then let the ELB distribute the requests...
The only problem is that, clearly, I'm running both MySQL and an instance in zone-1A, which makes it my single point of failure. I will be able to survive an outage in zone-2A, though.
You're doing the best you can with one MySQL server.
One thing you must do is keep a MySQL data dump, as recent as you can afford (at least one per day, more often if you can; note that dumps can have nontrivial performance impacts), in an accessible location. (The easy thing to use is S3; though you're still vulnerable to Amazon outages, my empirical observation over my last three years as an AWS devops guy is that S3 rarely goes down, even when EBS or EC2 is freaking out. But the paranoid person has backups outside Amazon as well.)
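A minimal sketch of that dump job, assuming boto3 and a mysqldump that can authenticate via ~/.my.cnf (bucket name and paths are hypothetical); run it from cron as often as you can afford:

    import datetime
    import subprocess
    import boto3

    BUCKET = "my-db-backups"   # hypothetical bucket name

    def dump_to_s3():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")
        path = f"/tmp/dump-{stamp}.sql"

        # --single-transaction keeps the dump consistent for InnoDB without
        # locking every table for the duration (still not free, as noted above).
        with open(path, "wb") as out:
            subprocess.run(["mysqldump", "--single-transaction", "--all-databases"],
                           stdout=out, check=True)

        boto3.client("s3").upload_file(path, BUCKET, f"mysql/dump-{stamp}.sql")

    if __name__ == "__main__":
        dump_to_s3()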
Then, in an emergency, boot another MySQL server in a different zone (or even region), recover the DB, and go.
There are lots of problems with this plan. One is that it is wicked slow at the best of times: Your downtime will be measured in minutes or hours. The second is that, when AWS is having an outage, it very often becomes difficult to impossible to do anything with its API, especially launching new machines. (My hypothesis is that this is often due to the surge of thousands of people just like you, all trying to launch new machines to handle the outage.) So, again, downtime could be hours. But hours is better than days or decades.
For actual HA you must run two MySQL servers at all times, one in each of two availability zones, one of which has a slightly older copy of the data. To make "slightly older" as short as possible, most folks master-master replicate the data between the servers. But you must not write to both DB servers at the same time, so one machine will still be "in charge" of writes at any given moment, and you'll have to have a scheme for swapping that "in charge of writes" status over to the other DB when the first one fails. (I'd suggest the "human logs in with SSH and runs a script" method to start, on the assumption that you don't need HA on the time scale of thirty seconds rather than thirty minutes.)
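The "runs a script" part can be as small as this sketch (assuming PyMySQL as the client library and a hypothetical standby host; repointing the application at the new master, via DNS or config, is the other half of the job and is not shown):

    import pymysql

    # Promote the surviving replica: stop applying replication from the dead
    # master and start accepting writes.
    conn = pymysql.connect(host="db-standby.internal",    # hypothetical host
                           user="admin", password="...",  # hypothetical creds
                           autocommit=True)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute("SET GLOBAL read_only = OFF")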
There are several other bits of plumbing involved with replication: setting up something like Nagios to alert you when replication breaks, learning how to rebuild replication from scratch when one server dies, et cetera. You'll want to check out Percona Toolkit. Or, though I haven't used it yet, you should read up on Percona XtraDB Cluster, which I think does all of the above and comes with the option to buy support from Percona, which has smart folks.
The next stage, if you've got money, really love setting up new tools, and laugh defiantly in the face of latency, is to try master-master replication across AWS regions using a product like Continuent Tungsten: http://continuent.com/downloads/software. But I'm not sure what it costs. Probably more than you want to spend at this point in your product lifecycle.
Thanks a lot for sharing. I can add one more option: write to one master MySQL DB and use slaves in other locations. Reads would be close to the web servers; if the master's location fails, you can't write, but there are workarounds like a standby master in another location.