AWS is an amazing tool and you have a few options here, but the downside is that the more of those options you use to be highly available (HA), the more expensive AWS gets (as you would imagine).
Your first option is to be HA across a SINGLE region; to do this you make use of Elastic Load Balancers (ELB) + auto-scaling. You set up auto-scaling rules to launch more instances in different availability zones (AZs), either in response to demand or in response to failures (e.g. "always keep at least 3 instances running").
You complement that with an ELB to load-balance incoming requests automatically across those instances in the different AZs. This is all fairly straightforward through the web console (except auto-scaling is still done via the CLI, for some reason).
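To make that concrete, here is a minimal sketch of the auto-scaling half using boto3, the Python SDK (the CLI commands express the same thing). The AMI id, instance type, ELB name, and AZs are hypothetical placeholders, not anything from the original setup:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Launch configuration: what each replacement instance should look like.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="web-lc",
        ImageId="ami-12345678",       # hypothetical AMI id
        InstanceType="t2.micro",      # placeholder instance type
    )

    # Auto Scaling group: "always keep at least 3 instances running",
    # spread across AZs and registered with an existing ELB named web-elb.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-asg",
        LaunchConfigurationName="web-lc",
        MinSize=3,
        MaxSize=6,
        DesiredCapacity=3,
        AvailabilityZones=["us-east-1a", "us-east-1b", "us-east-1c"],
        LoadBalancerNames=["web-elb"],   # classic ELB created separately
        HealthCheckType="ELB",           # replace instances the ELB marks unhealthy
        HealthCheckGracePeriod=300,
    )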
If you want to be HA ACROSS regions you can't just use ELBs anymore; there is some added complexity and an additional AWS service you will likely want to use: Route 53.
Route 53 is Amazon's DNS service, which offers a lot of slick features like removing dead endpoints from DNS rotation, latency-based routing, etc. There are also something like 29 deployments of Route 53 (and CloudFront) around the globe, so you'll hopefully never have Route 53 become a point of failure for you even if disaster strikes.
In this scenario you would set up the single-region HA configuration mentioned above, but do it in multiple regions. Put another way: 2+ servers across multiple AZs in each AWS region, with a Route 53 DNS configuration pointing at the ELB in each region that fronts each of those pockets of servers.
On top of that, you would use Route 53 to manage all routing of client requests into your domain; you can leverage the new "latency-based routing" (effectively what everyone was asking GeoDNS for over the years, but even better) and its monitoring capability to ensure you aren't routing anyone to a dead region.
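As a rough sketch of the Route 53 piece in boto3 (hosted zone id, domain, and ELB details below are all hypothetical; each latency-based record needs a SetIdentifier and Region, and EvaluateTargetHealth is what pulls a dead region out of rotation):

    import boto3

    route53 = boto3.client("route53")

    def latency_record(region, elb_dns_name, elb_zone_id):
        # One latency-based alias record per regional ELB; Route 53 answers
        # with the healthy endpoint closest (latency-wise) to the client.
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.example.com.",
                "Type": "A",
                "SetIdentifier": region,
                "Region": region,
                "AliasTarget": {
                    "HostedZoneId": elb_zone_id,      # the ELB's own hosted zone id
                    "DNSName": elb_dns_name,
                    "EvaluateTargetHealth": True,
                },
            },
        }

    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",   # hypothetical hosted zone for example.com
        ChangeBatch={"Changes": [
            latency_record("us-east-1", "east-elb-123.us-east-1.elb.amazonaws.com.", "ZELBEAST"),
            latency_record("eu-west-1", "west-elb-456.eu-west-1.elb.amazonaws.com.", "ZELBWEST"),
        ]},
    )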
SIMPLIFICATION
--------------
Here is what I would recommend given the size of your budget and the need to stay up in the AWS cloud, in order of expense:
1. Launch a single instance in a region with acceptable latency that has never had an outage before (e.g. Oregon has never completely gone down but Virginia has -- yes, yes, I know VA is older, but you understand my point). This solution will be cheaper than multiple instances in any region.
2. Launch multiple instances using the web console, in multiple AZs in US-EAST (the cheapest region for running multiple instances), and front them with an ELB. You skip any auto-scaling complexity here, but you need to keep an eye on your servers yourself. I think ELB fixed the issue where it would effectively route traffic into the void if all the instances in an AZ went down.
OPTIONAL: If you didn't mind spending a few $ more, you could apply this strategy in the region that has never gone down, for added peace of mind.
3. Launch single instances in multiple REGIONS and front them with Route 53. This isn't really a recommended setup, as an entire region will drop out of your rotation if you lose a single instance, BUT I said I would list possibilities in order of price, so there you go. You could mitigate this by setting up auto-scaling policies to replace any dead instances quickly, on the off chance you wanted to do exactly this but not babysit the web console all day.
4. Launch multiple instances in each region, across multiple AZs fronted by ELBs and then the entire collection fronted by Route53.
NOTE: The real cost comes from the additional instances and not from Route 53 or the ELB; so if you can use smaller instances (or reserved instances) to keep costs down, that might allow you to afford a larger HA setup.
What about my data?
-------------------------
Yes, yes... this is an issue that someone already touched on (data locality below).
You will have to decide on a single region to hold your data; in this case I would recommend using DB services that aren't based on EC2 and have never (or rarely) experienced outages -- this includes S3, SimpleDB and/or DynamoDB. AWS's MySQL offering (RDS) is just custom EC2 instances with MySQL running on them, so any time EC2 goes down, RDS goes down.
The other DB offerings are all custom and, except for SimpleDB a long time ago, have never experienced outages that I am aware of.
Making this choice is all about latency and which DB store you are comfortable with (obviously don't choose SimpleDB if everything you do requires MySQL -- then use RDS); you'll want your data as close to your web tier as possible, so if you are spread across all regions you'll just want to pick a region with the smallest latency to MOST of your customers (typically West coast if you have a lot of Asia/Aus customers and East coast if you have a lot of European customers).
Want to Go to 11?
-----------------
If you have the money and desperately want to go to 11 with this regional scale-out (which I love to do, so I am sharing this), you can combine services like DynamoDB and SQS to effectively create a globally distributed NoSQL datastore, with behavior along the lines of the following (there's a rough sketch in code below):
1. A write operation comes into a region: immediately write it to the local DynamoDB table, asynchronously queue the write command in SQS, and return to the caller.
2. On 1+ additional EC2 instances running daemons, pull messages from SQS in chunk sizes that make sense and replay them against the other regions' DynamoDB stores; delete each message when it has been processed, or, if processing fails, the next daemon to spin up will replay it.
3. On reads, just hit the local DynamoDB in any region and reply; we trust our reconciliation threads to do the work to keep us all in sync eventually.
NOTE: If you prefer to do read-repairs here you can, but it will increase complexity and inter-region communication which all costs money.
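Here is the rough sketch promised above of steps 1 and 2 in boto3 (table name, queue URL, and regions are hypothetical; the hard parts -- batching, retries, conflict resolution -- are deliberately left out):

    import json
    import boto3

    LOCAL_REGION = "us-east-1"
    REMOTE_REGIONS = ["eu-west-1", "ap-southeast-1"]
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replication"  # hypothetical

    dynamodb = boto3.resource("dynamodb", region_name=LOCAL_REGION)
    sqs = boto3.client("sqs", region_name=LOCAL_REGION)

    def write(item):
        # Step 1: write locally, queue the write for the other regions, return.
        dynamodb.Table("users").put_item(Item=item)
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(item))

    def replication_daemon():
        # Step 2: replay queued writes against every other region's table and
        # delete each message only after it has been applied everywhere.
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                item = json.loads(msg["Body"])
                for region in REMOTE_REGIONS:
                    boto3.resource("dynamodb", region_name=region) \
                         .Table("users").put_item(Item=item)
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg["ReceiptHandle"])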
The challenge with this approach is that you pull a lot of DB concerns up into your code: conflict resolution, resyncing entire regions after failure, bringing new regions online and ensuring they are synchronized, diffs, etc.
There is a reason AWS doesn't offer a globally-distributed data store: it is a really hard problem to get right once you make it past the 80% use case.
Your data will determine whether this is an option or not; some data allows for a certain amount of inconsistency, in which case this strategy is awesome and works great, while other data (e.g. banking data) cannot allow a single wiggle of inconsistency, in which case pulling all this DB logic up into the application is a bad idea: your failure scenarios become catastrophic (e.g. your conflict-resolution logic is wrong and wipes out the balance of an account, or keeps refilling the balance of an empty account... something bad, basically).
It is all a trade-off, though: if you managed your own Cassandra cluster, Cassandra would do all this and much more for you automatically, but then you would be putting your time into Cassandra administration instead of developing the logic around DynamoDB (or SimpleDB, or MySQL, or whatever). Just pick whichever devil you feel more comfortable with.
I am not aware of a services company that offers cross-region AWS datastore deployments yet; Datastax and Iris Couch will set up something like that for you via a consulting/custom arrangement, but there isn't a dashboard for launching it automatically.
Hope that helped (and didn't bring you to tears of boredom)
However, it seems that at some point here you are better off going back to getting dedicated hosts at a couple of different data centers and dealing with the complexity yourself. Surely by the time you "get to 11", you are spending more to be on AWS than doing it yourself would cost?
If anything, I suspect that the opposite is the case. If you're planning to set up servers in data centers on opposite sides of the globe, AWS is great. This is where their abstraction layer pays off. All their regions work the same way, accept the same API calls, are backed by effectively the same hardware. Get your setup working in one region, then copy a few S3 buckets over, change the --region argument to some calls in your script, and suddenly you're up in Singapore.
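For instance, here is the same idea in boto3 rather than the CLI (hypothetical AMI id and instance type; AMIs are region-scoped, so they have to be copied over first). The per-region difference really does boil down to the region parameter:

    import boto3

    sg = boto3.client("ec2", region_name="ap-southeast-1")

    # Copy the AMI built back home into the Singapore region.
    copy = sg.copy_image(SourceRegion="us-east-1",
                         SourceImageId="ami-12345678",   # hypothetical
                         Name="web-ami-singapore")
    sg.get_waiter("image_available").wait(ImageIds=[copy["ImageId"]])

    # Same launch call as in the home region, just pointed at Singapore.
    sg.run_instances(ImageId=copy["ImageId"],
                     InstanceType="t2.micro",
                     MinCount=1, MaxCount=1)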
Plus, Amazon is building tools, like the above-mentioned Route 53, to support just this use case.
Now, those of us who have worked a lot within one AWS region will complain and moan about the horror of trying to work cross-region, but that's not because AWS makes cross-region unusually hard. It's because they make your work within one region unusually easy. We're all spoiled by, e.g., the magic of cloning EBS volumes from zone to zone via snapshots. (Well, okay, it's magical when it's working, anyway. ;)
We are still in total stealth mode and I'm about to set up the actual architecture for our platform next week.
I was thinking about keeping everything in a single region.
I was going to have:
[x] Elastic Load Balancer
[x] Declare one zone as our main zone (this can be 1A for example). Have one medium instance in zone-1A
[x] Have my RDS in zone-1A
[x] Set up another instance in zone-2A, and then let the ELB distribute the requests...
The only problem is that, clearly, I'm running both MySQL and an instance in zone-1A, which makes it my single point of failure. I will be able to survive an outage in zone-2A, though.
You're doing the best you can with one MySQL server.
One thing you must do is keep a MySQL data dump, as recent as you can afford (at least one per day, more often if you can; note that dumps can have nontrivial performance impacts), in an accessible location. (The easy thing to use is S3; though you're still vulnerable to Amazon outages, my empirical observation over my last three years as an AWS devops guy is that S3 rarely goes down, even when EBS or EC2 is freaking out. But the paranoid person has backups outside Amazon as well.)
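A minimal sketch of that dump job, assuming boto3 and a mysqldump that can authenticate via ~/.my.cnf (bucket name and paths are hypothetical); run it from cron as often as you can afford:

    import datetime
    import subprocess
    import boto3

    BUCKET = "my-db-backups"   # hypothetical bucket name

    def dump_to_s3():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")
        path = f"/tmp/dump-{stamp}.sql"

        # --single-transaction keeps the dump consistent for InnoDB without
        # locking every table for the duration (still not free, as noted above).
        with open(path, "wb") as out:
            subprocess.run(["mysqldump", "--single-transaction", "--all-databases"],
                           stdout=out, check=True)

        boto3.client("s3").upload_file(path, BUCKET, f"mysql/dump-{stamp}.sql")

    if __name__ == "__main__":
        dump_to_s3()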
Then, in an emergency, boot another MySQL server in a different zone (or even region), recover the DB, and go.
There are lots of problems with this plan. One is that it is wicked slow at the best of times: Your downtime will be measured in minutes or hours. The second is that, when AWS is having an outage, it very often becomes difficult to impossible to do anything with its API, especially launching new machines. (My hypothesis is that this is often due to the surge of thousands of people just like you, all trying to launch new machines to handle the outage.) So, again, downtime could be hours. But hours is better than days or decades.
For actual HA you must run two MySQL servers at all times, one in each of two availability zones, one of which has a slightly older copy of the data. To make "slightly older" as short as possible, most folks master-master replicate the data between the servers. But you must not write to both DB servers at the same time, so one machine will still be "in charge" of writes at any given moment, and you'll have to have a scheme for swapping that "in charge of writes" status over to the other DB when the first one fails. (I'd suggest the "human logs in with SSH and runs a script" method to start, on the assumption that you don't need HA on the time scale of thirty seconds rather than thirty minutes.)
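The "runs a script" part can be as small as this sketch (assuming PyMySQL as the client library and a hypothetical standby host; repointing the application at the new master, via DNS or config, is the other half of the job and is not shown):

    import pymysql

    # Promote the surviving replica: stop applying replication from the dead
    # master and start accepting writes.
    conn = pymysql.connect(host="db-standby.internal",    # hypothetical host
                           user="admin", password="...",  # hypothetical creds
                           autocommit=True)
    with conn.cursor() as cur:
        cur.execute("STOP SLAVE")
        cur.execute("SET GLOBAL read_only = OFF")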
There are several other bits of plumbing involved with replication: setting up something like Nagios to alert you when replication breaks, learning how to rebuild replication from scratch when one server dies, et cetera. You'll want to check out Percona Toolkit. Or, though I haven't used it yet, you should read up on Percona XtraDB Cluster, which I think does all of the above and comes with the option to buy support from Percona, which has smart folks.
The next stage, if you've got money, really love setting up new tools, and laugh defiantly in the face of latency, is to try master-master replication across AWS regions using a product like Continuent Tungsten: http://continuent.com/downloads/software. But I'm not sure what it costs. Probably more than you want to spend at this point in your product lifecycle.
Thanks a lot for sharing. I can add one more option: write to one master MySQL DB and use slaves in other locations. Reads would be close to the web servers; if the master's location fails, you can't write, but there are workarounds like a standby master in another location.