
Ask HN: Best setups to avoid availability outages on AWS - richardv
Using only AWS services, what do you put in place to help prevent disruptions when a single availability zone goes down?

The simplest approach would be to set up your instances in multiple AZs and then configure the ELB to round-robin requests until the health of one of the instances is poor.

Any other thoughts?
======
PaulHoule
I hate to sound like a simpleton, but for a small operation you're best off
putting all your eggs in one basket.

I'm in one of the us-east zones and I haven't had a failure in at least a
year. They retired one machine I was using and dealing with that was as simple
as starting and stopping -- at a time I chose.

With five zones in U.S. East, the probability of a zone failure affecting a
single-zone system is 1 in 5.

If you're a busybody who spreads your system across five zones, the
probability of a failure affecting you becomes 1.

You're spending more money, and dealing with a lot more complexity, all to
increase the probability that hardware failures will affect you.

Now, you're hoping that a zone-distributed system will be able to recover from
failures, but that's tricky to do and it's quite unlikely that this will work
if you haven't tested it. Add the fact that all the other "cool kids" will be
trying to recover their systems at the same time, which can make AMZN's control
plane go down.

In the meantime, with probability 4/5 I'm sleeping through the disaster and
the first time I hear about it is on hacker news.
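
A quick sketch of the arithmetic in this comment, assuming exactly one zone fails, every zone is equally likely to be the one, and failover (when there is anything to fail over to) succeeds with some fixed probability. The function names and the `p_failover_works` parameter are made up for illustration.

```python
from fractions import Fraction


def p_touched(zones_used: int, zones_total: int) -> Fraction:
    """Chance that the one failing zone is a zone the system runs in."""
    return Fraction(zones_used, zones_total)


def p_outage(zones_used: int, zones_total: int,
             p_failover_works: Fraction) -> Fraction:
    """Chance that the zone failure turns into a user-visible outage."""
    touched = p_touched(zones_used, zones_total)
    if zones_used == 1:
        # A single-zone system has nothing to fail over to.
        return touched
    # A multi-zone system is only down if it is touched AND failover fails.
    return touched * (1 - p_failover_works)


if __name__ == "__main__":
    print("single zone:", p_outage(1, 5, Fraction(0)))
    print("five zones, untested failover:", p_outage(5, 5, Fraction(0)))
    print("five zones, 90% failover:", p_outage(5, 5, Fraction(9, 10)))
```

Under these assumptions a five-zone deployment only comes out ahead of a single zone when failover works more than 4 times out of 5, which is exactly the part that tends to go untested.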

~~~
rbranson
This is correct; probability works against you if you spread your instances
over multiple zones without also compensating with some very solid failover.
If you are not willing to put the effort into truly going fully
active/active/active across THREE availability zones, you might as well just
stick with one and weather the outages.

~~~
mechanical_fish
I think that's a bit of an exaggeration. My old company (Acquia) kept a lot of
customers up through the various outages with mere two-zone redundancy.

You do need to actually engineer for failover and recovery, though. That point
is correct: It doesn't necessarily help you to have half of your servers
survive an outage if that still kills your service or requires painful amounts
of manual intervention. And it really is a lot easier to just shrug, keep your
backups up to date, and pray for your zone to come back ASAP. Do your cost-
benefit analysis.

------
sehugg
My thought: Have a very nice screen for your mobile app/website that says "We
are down for maintenance, please stand by."

Sorry to be fatalist, but it's a hard problem. This last outage was more than
just an AZ failure. Region-wide API usage was affected, so operations like
static IP reassignment and ELB changes were not taking effect. This means you
are hanging out in the wind should there be something unusual that requires
manual intervention (as was the case with us).

Route 53 is a good service but I don't know how its control plane works, and
it could be that problems in a single region would disable the ability to
update DNS records (I would guess that DNS reads are a lot more available than
writes). And in any case DNS is not a very good failover mechanism due to
upstream caching.

Unless your business model requires higher reliability than Instagram,
Netflix, and Pinterest, I'd suggest going multi-AZ, crossing your fingers, and
doing everything else right.

~~~
xxpor
I think Route 53's TTL is 60 seconds.

~~~
sehugg
TTL is not always honored -- some caching nameservers enforce their own minimum,
and Android has a bug (just recently fixed) that pegs it to 10 minutes:
<http://code.google.com/p/android/issues/detail?id=7904>
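
To put rough numbers on that, here is a small sketch that assumes a misbehaving cache simply applies a floor to the record's TTL. The function names and the `detection_s` parameter (how long it takes to notice the failure and change the record) are assumptions for illustration, not anything Route 53 exposes.

```python
def effective_ttl(record_ttl_s: int, resolver_min_ttl_s: int = 0) -> int:
    """TTL a client actually sees when its cache enforces a minimum."""
    return max(record_ttl_s, resolver_min_ttl_s)


def worst_case_failover_s(record_ttl_s: int, resolver_min_ttl_s: int,
                          detection_s: int) -> int:
    """Upper bound on how long a client may keep hitting the dead address:
    time to detect and update DNS, plus one full cached lifetime."""
    return detection_s + effective_ttl(record_ttl_s, resolver_min_ttl_s)


if __name__ == "__main__":
    # 60 second record TTL, cache pegged to 10 minutes, 30 seconds to react.
    print(worst_case_failover_s(60, 600, 30))
```

With a 60 second record and a cache pegged to 10 minutes, the record's own TTL stops mattering and the floor dominates.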

------
rkalla
_Mindless rambling ahead; I love this topic_

Richard,

AWS is an amazing tool and you have a few options here, but the downside is
the more options you use to be highly-available (HA), the more expensive AWS
gets (as you would imagine).

Your first option is to be HA across a SINGLE region; to do this you make use
of the elastic load balancers (ELB) + auto-scaling. You set up auto-scale rules
to launch more instances in different availability zones (AZ) either in
response to demand or in response to failures (e.g. "always keep at least 3
instances running").

You complement that with an ELB to load-balance incoming requests
automatically across those instances in the different AZs. This is all fairly
straightforward through the web console (except auto-scaling is still done
via CLI for some reason).

If you want to be HA ACROSS regions you can't just use ELBs anymore, you have
some added complexity and an additional AWS feature you will likely want to
use: Route 53.

Route 53 is Amazon's DNS service which offers a lot of slick DNS features like
removing dead endpoints from DNS rotation, latency-based routing, etc. There are
also something like 29 deployments of Route53 (and CloudFront) around the
globe so you'll hopefully never have Route53 become a point of failure for you
even if disaster strikes.

In this scenario you would set up the HA configuration for a single region as
mentioned above, but you would do it in multiple regions. Put another way, 2+
servers in multiple AZs in _each_ AWS region. Then a Route53 DNS configuration
set up to point to each ELB in each region representing those individual
pockets of servers.

On top of that you would use Route53 to manage all routing of client requests
into your entire domain; you can leverage the new "latency-based routing"
(effectively why everyone was asking for GeoDNS for years, but even better)
and monitoring capability to ensure you aren't routing anyone to a dead region.
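
As a sketch of the decision being described (not of how Route 53 implements it), latency-based routing combined with health checks boils down to "lowest latency among the regions that are still alive". The function name and the region keys below are illustrative assumptions.

```python
from typing import Mapping, Optional


def pick_region(latency_ms: Mapping[str, float],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the healthy region with the lowest measured latency.

    Regions missing from `healthy` are treated as down. Returns None when
    no region is healthy, so the caller can show a maintenance page.
    """
    candidates = [region for region in latency_ms if healthy.get(region, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda region: latency_ms[region])


if __name__ == "__main__":
    latency = {"us-east-1": 20.0, "us-west-2": 80.0}
    print(pick_region(latency, {"us-east-1": True, "us-west-2": True}))
    print(pick_region(latency, {"us-east-1": False, "us-west-2": True}))
```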

SIMPLIFICATION

\--------------

Here is what I would recommend given the size of your budget and need to stay
up in the AWS cloud, in order of expense:

1\. Launch a single instance in a region with acceptable latency that has
never had an outage before (e.g. Oregon has never completely gone down but
Virginia has -- yes yes I know VA is older, but you understand my point). This
solution will be cheaper than multiple instances in _any_ region.

2\. Launch multiple instances using the web console, in multiple AZs in US-
EAST (cheapest option for multi-instances) and front them with an ELB. You
skip any auto-scaling complexity here but you need to keep an eye on your
servers. I think ELB fixed the issue where it would effectively route traffic
into the void if all the instances in an AZ went down.

OPTIONAL: If you don't mind spending a few $ more, you could do this strategy
in the region that has never gone down for added peace of mind.

3\. Launch single instances in multiple REGIONS and front them with Route53.
This isn't really a recommended setup as entire regions will disappear if you
lose a single instance, BUT I said I would list possibilities in order of
price, so there you go. You could mitigate this by setting up auto-scaling
policies to replace any dead instances quickly in the off chance you wanted to
do exactly this but not babysit the web console all day.

4\. Launch multiple instances in each region, across multiple AZs fronted by
ELBs and then the entire collection fronted by Route53.

NOTE: The real cost comes from the additional instances and not from Route53
or the ELB; so if you can use smaller instances to help keep costs down (or
reserved instances also) that might allow you to provide a larger HA setup.

What about my data?

\-------------------------

Yes, yes... this is an issue that someone already touched on (data locality
below).

You will have to decide on a single region to hold your data; in this case I
would recommend using DB services that aren't based on EC2 and have never
experienced outages (or rarely) -- this includes S3, SimpleDB and/or DynamoDB.
AWS's MySQL offering (RDS) is just custom EC2 instances with MySQL running on
them, so any time EC2 goes down, RDS goes down.

The other DB offerings are all custom and except for SimpleDB a long time ago,
have never experienced outages that I am aware of.

Making this choice is all about latency and which DB store you are comfortable
with (obviously don't choose SimpleDB if everything you do requires MySQL --
then use RDS); you'll want your data as close to your web tier as possible, so
if you are spread across all regions you'll just want to pick a region with
the smallest latency to MOST of your customers (typically West coast if you
have a lot of Asia/Aus customers and East coast if you have a lot of European
customers).

Want to Go to 11?

\-----------------

If you have the money and desperately want to go to 11 with this regional-
scale (which I love to do, so I am sharing this) you can combine services like
DynamoDB and SQS to effectively create a globally distributed NoSQL datastore
with behavior along the lines of:

1\. Write operation comes into a region, immediately write it to the local
DynamoDB instance, asynchronously queue the write command in SQS and return to
the caller.

2\. In 1+ additional EC2 instances running daemons, pull messages from SQS in
chunk sizes that make sense and re-play them out to the other regions' DynamoDB
stores; erase each message once it is processed, and if the processing fails the
next daemon to spin up will replay it.

3\. On reads, just hit the local DynamoDB in any region and reply; we trust
our reconciliation threads to do the work to keep us all in sync _eventually_.

NOTE: If you prefer to do read-repairs here you can, but it will increase
complexity and inter-region communication which all costs money.
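
The three steps above can be sketched with in-memory stand-ins. Everything here is an assumption for illustration: a dict plays the role of each region's DynamoDB table, a deque plays the role of the SQS queue, and the `Region` class and `replay` function are hypothetical names, not AWS APIs.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, Optional, Tuple


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, Any] = {}                 # stand-in for the local table
        self.outbox: Deque[Tuple[str, Any]] = deque()   # stand-in for the queue

    def write(self, key: str, value: Any) -> None:
        """Step 1: write locally, queue the write for other regions, return."""
        self.store[key] = value
        self.outbox.append((key, value))

    def read(self, key: str) -> Optional[Any]:
        """Step 3: serve reads from the local store only."""
        return self.store.get(key)


def replay(source: Region, targets: Iterable[Region], batch_size: int = 10) -> int:
    """Step 2: re-play up to `batch_size` queued writes into the other regions.

    A message is removed only after every target has applied it, so a crash
    part-way through leaves it queued for the next daemon to replay.
    """
    targets = list(targets)
    processed = 0
    while source.outbox and processed < batch_size:
        key, value = source.outbox[0]   # peek without removing
        for target in targets:
            target.store[key] = value
        source.outbox.popleft()         # "erase the message when processed"
        processed += 1
    return processed


if __name__ == "__main__":
    east, west = Region("us-east"), Region("us-west")
    east.write("user:1", "alice")
    print(west.read("user:1"))   # not replicated yet
    replay(east, [west])
    print(west.read("user:1"))   # eventually consistent
```

Note that the replay blindly overwrites whatever the target holds, so two regions writing the same key will silently clobber each other; that is the conflict-resolution problem raised further down.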

The challenge with this approach is that you pull a lot of DB concerns up
into your code, like conflict resolution, resync'ing entire regions after
failure, bringing new regions online and ensuring they are synchronized,
diffs, etc.

There is a reason AWS doesn't offer a globally-distributed data store: it is a
really hard problem to get right once you make it past the 80% use case.

Your data will determine if this is an option or not; some data allows for
certain amounts of inconsistency in which case this strategy is awesome and
works great; while other data (e.g. banking data) cannot allow a single wiggle
of inconsistency in which case pulling all this DB logic up into the
application is a bad idea. Your failure scenarios become catastrophic (e.g.
your conflict-resolution logic is wrong and wipes out the balance from an
account; or keeps re-filling the balance on an empty account... something bad
basically)

It is all a trade-off; if you manage your own Cassandra cluster,
Cassandra does all this and much more for you automatically, but then
you just put your time into Cassandra administration instead of developing the
logic around DynamoDB (or SimpleDB, or MySQL, or whatever); just pick which
devil you feel more comfortable with.

I am not aware of a services company that offers cross-region AWS
datastore deployments yet; Datastax and Iris Couch will set things up like
that for you via a consulting/custom arrangement, but there isn't a dashboard for
launching something like that automatically.

Hope that helped (and didn't bring you to tears of boredom)

~~~
EwanG
The above is why I love to read HN...

However, it seems that at some point here you are better off going back to
getting dedicated hosts at a couple different distributed data centers and
dealing with the complexity yourself. Surely by the time you "get to 11", you
are spending more to be on AWS than doing it yourself would cost?

~~~
mechanical_fish
If anything, I suspect that the opposite is the case. If you're planning to
set up servers in data centers on opposite sides of the globe, AWS is great.
This is where their abstraction layer pays off. All their regions work the
same way, accept the same API calls, are backed by effectively the same
hardware. Get your setup working in one region, then copy a few S3 buckets
over, change the --region argument to some calls in your script, and suddenly
you're up in Singapore.

Plus, Amazon is building tools, like the above-mentioned Route 53, to support
just this use case.

Now, those of us who have worked a lot within one AWS region will complain and
moan about the horror of trying to work cross-region, but that's not because
AWS makes cross-region unusually hard. It's because they make your work within
one region unusually easy. We're all spoiled by, e.g., the magic of cloning
EBS volumes from zone to zone via snapshots. (Well, okay, it's magical when
it's working, anyway. ;)

------
justincormack
Decide which part of the CAP Theorem
<http://en.wikipedia.org/wiki/CAP_theorem> you want to give up on. Presumably
you decided that Availability was not it, so you need to program around lack
of consistency and/or partition tolerance. Essentially that means there is no
"master database", and you will need to reconcile differing views. This can
get quite application specific, and you need to understand your data well.

------
rvagg
FWIW I initially went into ELB assuming it would solve a lot of my redundancy
problems. And while it has helped a lot (I spread my frontend across 3 zones),
I've suffered through a number of ELB failures or disruptions, including this
latest one, which is one of the worst. Even with fully functioning servers
that I can connect to individually, ELB was intermittently rejecting
connections and failed to reregister instances. There's no silver bullet! Just
prepare for failure and attempt to handle it gracefully, learning from each
one. I suppose you should also think hard before you launch into a greater AWS
budget to increase availability. Most of us are tempted to do that after each
major incident--which is why Amazon can walk away from these events in a
better position than before (until they have a genuine competitor, that is).

------
explodingbarrel
We run a few decent-sized social games and we have survived all the major AWS
region outages in the past year. Here's what we do and what I would suggest.

1\. Use Rightscale. You can get away with the free edition, but for $500/month
the basic paid edition will allow you access to arrays and all the excellent
scripts available on the marketplace.

2\. The front end. I would strongly suggest moving away from ELB. We are using
it and are about to get rid of it. The main problem is what exactly happened
last night. If a whole AZ goes down, the ELB for that zone can get screwed and
the DNS was not updating the CNAME to remove the bad zone. Instead of ELB, we
have our own LB solution we are going to roll out that will use Rightscale
server arrays and will handle the updating of the DNS names itself. We also
aren't going to use Route53, because we learned last night that the API for
that can go down and you can get stuck with bad DNS records.

3\. Application servers. Use at least 3 AZs and have them evenly spaced. This
is easy to do in Rightscale with server arrays. Make sure your voting ratio
for scaling isn't 50% because you might not scale correctly if you lose 2 AZs.
Keep the vote to 30% and you will be happy (if one zone votes to grow, let it
grow).

4\. Database. This is the fun one. We have been using MongoDB with pretty good
success. Our multi-shard DB has 3 servers per replica set and has them
distributed equally between AZs. We use 4-drive EBS RAID-0 arrays for storage,
which have had problems in the past due to the outages that EBS sometimes has.
Our best bet has been a watcher process that will kill the mongod process if
there are any problems writing to the drive array. By doing this, the replica
set will automatically fail over to the next server and we won't get stuck with
a primary node that can't write back to disk. For backups, we just freeze the
writes on the secondaries and do EBS snapshots every 15 minutes. Rightscale has
some great EBS tools for managing this for you. If you lose a server, we can
deploy a new server in a matter of minutes and it will rebuild the RAID array
from the last backup so we have a warm spare.

5\. Monitor, monitor, monitor. Rightscale has some great tools for monitoring
everything. Use them, and use more monitoring on other infrastructure (e.g.
Pingdom).

Doing something like this will cost a lot more than just sticking to a single
AZ, but you should be able to survive one, if not two complete datacenter
outages.
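
The voting advice in point 3 can be checked with a toy model. This assumes an array grows once the fraction of servers voting to grow reaches the threshold, and that servers in a dead zone still count toward the total but cannot vote; Rightscale's actual rule may differ, and `should_grow` is a hypothetical name.

```python
def should_grow(votes_to_grow: int, servers_counted: int, threshold: float) -> bool:
    """True when the share of servers voting to grow reaches the threshold."""
    if servers_counted == 0:
        return False
    return votes_to_grow / servers_counted >= threshold


if __name__ == "__main__":
    # 9 servers spread evenly over 3 zones; only one zone's 3 servers vote.
    print(should_grow(3, 9, 0.30))  # one zone is enough at a 30% threshold
    print(should_grow(3, 9, 0.50))  # but not at a 50% threshold
```

In this model, losing two zones leaves at most a third of the counted servers able to vote, so a 50% threshold can never be reached while a 30% one still can.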

~~~
MrMike
> If you lose a server, we can deploy a new server in a matter of minutes and
> it will rebuild the RAID array from the last backup so we have a warm spare.

Unless you can't actually get a new server provisioned because the now-fragile
API (like last night) is under such load from people trying to mitigate their
downtime... We use rightscale, and rightscale won't solve this issue. If AWS
is being clobbered by people trying to get new boxes up, using a third-party
api abstraction service doesn't help.

Also, rightscale themselves were affected last night, throwing invalid alerts
about servers being inaccessible when they were actually still operating
normally.

~~~
explodingbarrel
That's also the beauty of Rightscale. If you do everything correctly you
should be able to provision a new server in another region if need be or even
another cloud (e.g. Rackspace).

Most outages of AWS don't last more than a few hours. The real goal is to make
sure your infrastructure can hobble on one leg for those few hours until help
arrives and you can cleanup the mess once the outage is over.

When the API goes down it sure isn't fun. Just try to protect yourself the
best you can. I had no problems with Rightscale last night other than anything
trying to reach EC2.

~~~
explodingbarrel
And also,

If shit really hits the fan, it's not a bad idea to make sure you have an
OpenVPN tunnel ready on each server.

This will allow you to get connectivity between old and new instances if you
can't update the security groups due to the API being down.

------
aeden
Sending traffic to different zones isn't the challenge; the challenge is
deciding where your master data will live. In fact, this has always been one
of the biggest challenges of building fault-tolerant systems. If your master
data store lives in one zone then you've got latency issues, but if it lives
in multiple zones then you need to find a logical way to shard. You could also
replicate across zones and then turn off writes if the zone with the master
fails. You could even change masters in that case, but there's risk of data
loss there.

Anyhow, sorry I don't have a simple answer - I'm not sure a simple answer
exists.

~~~
gazarsgo
multi-az in RDS does the replication and automatic failover for you, no
sharding necessary.

~~~
mh-
except when it doesn't --
<https://forums.aws.amazon.com/thread.jspa?threadID=98376&tstart=0>

------
alanbyrne
I am on PHPFog for my front-end, with an AWS RDS back-end. I managed to
survive this incident without an outage (I am on U.S. East as well), although I
did get some horrendous response times from RDS for about an hour there.

PHPFog are on AWS and I pay them to make sure they have the redundancy worked
out. If they don't, I would yell at them until I got some money back.

I am considering configuring RDS for Multi-AZ, but need to research it a
little more first. From what I can tell you just click a button to turn it on,
but there were a lot of people complaining yesterday that the fail-over didn't
work at all when it was supposed to.

I also have a bunch of EC2 VMs that do back-end processing and have a load of
CRON jobs on them that need to run once every 24 hours. If these go down for a
couple of hours then there is no noticeable impact to my customers, they can
still log into my service and access their historical data.

I have considered spreading across multiple regions etc but at the end of the
day it's just too expensive for the small increase in reliability.

------
elijahchancey
Assuming we want to minimize latency and maximize reliability, we want to
create a stack that:

1) Has AutoScaling Groups & Elastic Load Balancers in two regions (and only
two availability zones; let's keep front-end instances in the same AZ as your
local/region-specific DB)

2) Has Databases in two regions and uses Master Master replication

3) Instances talk to their local DB. If they detect their local DB is down,
they failover to the remote DB (ie, the far region). If they failover, they
notify you.

4) DNS does geographic load balancing (pre-ELB). You'll need to use a provider
like DynDNS or UltraDNS to give you Geo Load Balancing & Failover. Or, you
could pair a monitoring service like CatchPoint with Route53

5) Application caching (Memcache, Redis, etc). Let's not put more load on the
DB's than necessary.

That's a good start, at least.
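
Point 3 (talk to the local DB, fail over to the far region, and tell someone) can be sketched like this. The `choose_db` function, the `is_up` health probe and the `notify` callback are all hypothetical names you would supply yourself; real code would also need to worry about flapping and about writes landing on both masters.

```python
from typing import Callable, Sequence


def choose_db(endpoints: Sequence[str],
              is_up: Callable[[str], bool],
              notify: Callable[[str], None]) -> str:
    """Return the first reachable endpoint; `endpoints` lists the local DB first.

    Sends a notification whenever anything other than the local DB is chosen.
    """
    for index, endpoint in enumerate(endpoints):
        if is_up(endpoint):
            if index > 0:
                notify(f"failed over from {endpoints[0]} to {endpoint}")
            return endpoint
    notify("no database endpoint is reachable")
    raise ConnectionError("no database endpoint is reachable")


if __name__ == "__main__":
    endpoints = ["db.us-east.example.internal", "db.us-west.example.internal"]
    # Pretend the local (first) database is down.
    chosen = choose_db(endpoints, lambda e: e != endpoints[0], print)
    print("using", chosen)
```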

------
mark_l_watson
I haven't tried this (I use single EC2 deployments, some Heroku, also have a
Hetzner server) but it is something that I have been thinking of: have the web
services that back your web app on a single server, and yes, that will fail
on hopefully rare occasions. Host the Javascript+HTML5+CSS front end on S3
with Cloudfront CDN. The home page of your app will almost never go offline
and you control what to report to your users if your backend services are
offline. Sure you lose core functionality, but you still have static content
and a friendly message about temporary lack of services.

Going beyond that at a cost of slow response times when trying to access a
downed backend, you could deploy back end web services to two different
hosting providers, perhaps running something like CouchDB replicated on each
provider. The Javascript on your UI could switch to an alternative back end
after a timeout. For "one page" style apps, you could maintain the state
information that a backend host is down in the browser.
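
The switch-after-timeout idea would live in browser Javascript in practice; the sketch below only shows the logic, in Python, with made-up names. `call` is whatever function actually talks to a backend and is assumed to raise `OSError` (which includes `TimeoutError`) on failure; `retry_after_s` is an assumed cool-down before a downed host is tried again.

```python
import time
from typing import Any, Callable, Dict, Sequence


class BackendSwitcher:
    def __init__(self, backends: Sequence[str],
                 call: Callable[[str, Any], Any],
                 retry_after_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.backends = list(backends)
        self.call = call
        self.retry_after_s = retry_after_s
        self.clock = clock
        self.down_since: Dict[str, float] = {}   # remembered "host is down" state

    def _usable(self, backend: str) -> bool:
        since = self.down_since.get(backend)
        return since is None or self.clock() - since >= self.retry_after_s

    def request(self, payload: Any) -> Any:
        """Try backends in order, skipping ones recently seen to be down."""
        for backend in self.backends:
            if not self._usable(backend):
                continue
            try:
                result = self.call(backend, payload)
            except OSError:   # TimeoutError and ConnectionError are subclasses
                self.down_since[backend] = self.clock()
                continue
            self.down_since.pop(backend, None)
            return result
        raise ConnectionError("all backends are unavailable")
```

Remembering the down state means only the first request after a failure pays the timeout; later requests go straight to the alternative until the cool-down expires.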

------
rdl
Start here: <http://aws.amazon.com/architecture/>

I don't think they show how to do ELB across Regions, or diversity against
single-ELB problems (although I haven't seen ELB fail yet). You'd probably
have to build this yourself.

~~~
crb
Today's status update (<http://status.aws.amazon.com/>) indicates partial ELB
failure, still unresolved:

 _Jun 30, 12:15 AM PDT [..] Elastic Load Balancers were also impacted by this
event. ELBs are still experiencing delays in provisioning load balancers and
in making updates to DNS records._

 _Jun 30, 12:37 AM PDT ELB is currently experiencing delayed provisioning and
propagation of changes made in API requests. As a result, when you make a call
to the ELB API to register instances, the registration request may take some
time to process. As a result, when you use the DescribeInstanceHealth call for
your ELB, the state may be inaccurately reflected at that time. To ensure your
load balancer is routing traffic properly, it is best to get the IP addresses
of the ELB's DNS name (via dig, etc.) then try your request on each IP
address. We are working as fast as possible to get provisioning and the API
latencies back to normal range._
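
The manual check suggested in that status update (resolve the ELB's DNS name, then try every IP) can be scripted roughly as follows. This is a sketch: a plain TCP connect stands in for "try your request", the hostname in the usage example is a placeholder, and the `resolve`/`probe` parameters exist only so the logic can be exercised without a network.

```python
import socket
from typing import Callable, Dict, List


def resolve_ips(hostname: str, port: int = 80) -> List[str]:
    """Distinct IPv4 addresses behind a DNS name (roughly what dig would list)."""
    infos = socket.getaddrinfo(hostname, port,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def tcp_probe(ip: str, port: int = 80, timeout_s: float = 3.0) -> bool:
    """True if a TCP connection to ip:port opens within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def check_each_ip(hostname: str, port: int = 80,
                  resolve: Callable[[str, int], List[str]] = resolve_ips,
                  probe: Callable[[str, int], bool] = tcp_probe) -> Dict[str, bool]:
    """Map every address behind `hostname` to whether it accepted a connection."""
    return {ip: probe(ip, port) for ip in resolve(hostname, port)}


if __name__ == "__main__":
    # Placeholder name; substitute your load balancer's DNS name.
    for ip, ok in check_each_ip("my-elb.example.com").items():
        print(ip, "ok" if ok else "FAILED")
```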

------
trebor
From what I've heard, you're on the right track. However, I'd want it to not
round-robin but go to the nearest working node. I don't use AWS, so I don't
know how to configure the ELB, but I would assume that this is possible.

------
bfisher9
Super low TTL and Refresh combined with replication to a DR provider. High
Availability placed exclusively on a single provider - even Amazon (albeit
different AZ's) is of zero value if all of Amazon itself is offline...

------
neilwillgettoit
I'm shocked no one has mentioned <http://www.cedexis.com/> yet.

------
cardmagic
Try a multi-infrastructure PaaS like <http://appfog.com/>

