As __lucas has mentioned, this can be achieved with Route 53 Failover, which we'd recommend.
Route 53 Failover is (hopefully) pretty easy to configure. Just mark your ELB as the primary and enable target healthchecks, and add the S3 website bucket as the secondary. We'd also suggest that you use an S3 website bucket hosted in a different region than your ELB. This should take no more than a minute or two of setup in our console.
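For readers who prefer the API to the console: a failover pair like the one described would be expressed as two alias record sets, one PRIMARY and one SECONDARY. Below is a rough sketch of the `ChangeBatch` you would hand to Route 53's `ChangeResourceRecordSets` call (e.g. via boto3's `change_resource_record_sets`); all names and hosted-zone IDs are placeholders.

```python
# Sketch of a Route 53 failover pair: an ELB alias as PRIMARY (with target
# health evaluation enabled) and an S3 website bucket alias as SECONDARY.
# The returned dict is the ChangeBatch you would pass to
# route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=...).

def failover_change_batch(name, elb_dns, elb_zone_id, s3_dns, s3_zone_id):
    """Build a PRIMARY/SECONDARY failover record pair for one DNS name."""
    def record(role, target_dns, target_zone, evaluate_health):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": f"{name}-{role.lower()}",
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "DNSName": target_dns,
                    "HostedZoneId": target_zone,
                    # For the ELB primary, let Route 53 evaluate the
                    # target's own health ("target healthchecks").
                    "EvaluateTargetHealth": evaluate_health,
                },
            },
        }
    return {"Changes": [
        record("PRIMARY", elb_dns, elb_zone_id, True),
        record("SECONDARY", s3_dns, s3_zone_id, False),
    ]}
```

The change batch is built once, up front; after that, Route 53's health checks do all the flipping for you with no further API calls.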
The difference it makes is that Route 53 failover doesn't depend on any "make this change" control plane; the status of every Route 53 health check is constantly being polled from at least 16 locations and then sent (via 3 redundant networking paths) to every Route 53 DNS server. The system is always operating at full load, with a tremendous degree of fault tolerance and redundancy. We hope that makes it very robust against "hard to fix" internet and power problems, and also API outages.
So for an awkward "worst case" example: if there were a large networking outage in an intermediate transit provider, your customers might not be able to reach your ELB, and likewise you might not be able to reach the API to make the necessary changes. Route 53 failover should work anyway, by detecting the reachability problem and flipping to the pre-configured secondary - an action which is triggered at our edge sites.
If you'd rather not use Route 53 as your primary DNS provider that's ok; all of the above can still be achieved by using a dedicated zone on Route 53 just for managing the failover, which you may then CNAME to, just as with ELB. Each zone costs $0.50/month. Of course we'd also like to make this kind of functionality easier to use and built-in, and that's something we're constantly working on.
Route 53 supports DNS TTLs as low as 0 seconds. ELB and S3 endpoints both have 60-second TTLs. My experience with flipping names like www.amazon.com doesn't reflect the 15% figure: I've seen about 97% of web traffic honouring the TTL and flipping quickly, with almost all of the rest following within 5 minutes. We also take CloudFront sites in and out of service for maintenance, and in 5 years I've never seen anything like a 15% straggler effect.
That said, we do see a very small number of stragglers. While resolvers overriding TTLs haven't shown up as a significant problem, buggy clients can be; we come across clients now and then that either never re-resolve (various JVMs and their infinite caches are a common cause), or only re-resolve on failure (which is fine for failover, but not great for traffic management).
I don't know that I have ever seen a straggling DNS update take a week, but I have seen many take 3-4 days. Particularly Mediacom customers. I think this situation has improved some, but I still would not consider DNS failover very reliable.
We use weighted A records for this purpose. We basically have two weighted A records, one pointing to the load balancer and the other one to the static maintenance site on S3. When we want to switch sites, we just flip their weights (they are always 1 and 0), hence enforcing 100% redirection. It's not a perfect system but it works and DNS propagation is usually done within 15 minutes (haven't tried modifying the TTLs). Haven't been able to come up with anything simpler than this.
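The weight-flipping trick above is simple enough to sketch in a few lines. This is a minimal, hypothetical model of the logic (in practice the flipped records would be submitted back to Route 53 via `change_resource_record_sets`):

```python
# Sketch of the "flip the weights" trick described above: two weighted
# records whose weights are always 1 and 0, swapped to force 100% of
# traffic to the other target. SetIdentifier values are placeholders.

def flip_weights(records):
    """Swap the 0/1 weights on a pair of weighted records, in place."""
    assert sorted(r["Weight"] for r in records) == [0, 1]
    for r in records:
        r["Weight"] = 1 - r["Weight"]
    return records

records = [
    {"SetIdentifier": "elb",         "Weight": 1},  # load balancer
    {"SetIdentifier": "maintenance", "Weight": 0},  # static S3 site
]
flip_weights(records)
# "maintenance" now carries weight 1 and receives all new lookups.
```

Because one weight is always 0, there's no period where traffic is split between the two targets; the only lag is DNS propagation.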
Do you guys have any plans to actually show the results of the DNS changes based on health checks? Connecting the health checks to cloudwatch is a great start, but it would be really useful to actually see which of my rules are in effect at any given time.
Almost all of the time, what you see in CloudWatch should be representative of what's being served. The internet stays consistent to at least four nines. Our CloudWatch metrics are actually made up of 16 independent metrics reporters (2 in each of the 8 regions we run in), each reporting a 1 or a 0.
So you can already use the CloudWatch "average" statistic to get some limited visibility. Here's the metric for my own personal micro instance (hosted in us-west-2): http://failfast.info/micrometric.png .
When the average drops to 0.875, what's actually happening is that one region couldn't reach us-west-2. (It turned out to be a transit networking issue somewhere between Singapore and Oregon.) Just to give a sense of how rare partitions are: that's the first and only time I've seen one affect my instance, and I've been using the feature since before we launched publicly.
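The 0.875 figure falls straight out of how the metric is built: 16 reporters each posting a 1 or a 0, with the two reporters in one region unable to reach the endpoint.

```python
# Why the dip reads 0.875: 16 health-check reporters (2 per region across
# 8 regions) each post a 1 (reachable) or 0 (unreachable), and CloudWatch's
# "average" statistic is just their mean. One unreachable region zeroes
# out exactly two reporters: 14/16 = 0.875.
reporters = [1] * 16
reporters[0] = reporters[1] = 0   # the two reporters in the affected region
average = sum(reporters) / len(reporters)
print(average)  # 0.875
```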
But yes, we would like to add visibility into things like where it seemed to fail from and whether it actually impacted any DNS decisions (in this case it didn't, as even our nodes in Singapore were able to tell that my endpoint was healthy, via redundant data from other regions).
Edit to add: the most common reason a health check fails is that the endpoint is simply unhealthy, or totally unreachable. There is absolutely no ambiguity in those situations, and the CloudWatch metrics tell you exactly when the failure and recovery happen. Only crazy internet partitions are hard.
Not directly related but I have one major gripe with ELB--am hopeful you have some inside knowledge and a possible workaround.
I use CloudFlare for DDoS protection, which Amazon doesn't offer. For apex domains, ELB requires Route 53 exclusively, but that conflicts with CloudFlare, which also requires that DNS be hosted there. Since ELB uses CNAME records, I haven't been able to find a way for CloudFlare and ELB to work together (short of polling for ELB IP changes every minute, which doesn't really work out that well).
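For what it's worth, the polling workaround dismissed above would look roughly like this: resolve the ELB's DNS name on a schedule and, when the address set changes, push new A records to CloudFlare. The CloudFlare call is left as a hypothetical stub (it needs credentials and a zone ID); the fragility is exactly what the comment describes, since ELB IPs can rotate faster than your polling interval.

```python
import socket

# Rough shape of the "poll the ELB's IPs" workaround: resolve the ELB's
# DNS name, and when the address set changes, update the apex A records
# at CloudFlare. push_a_records_to_cloudflare is a hypothetical helper.

def resolve_ips(hostname):
    """Return the current set of IPv4 addresses for a hostname."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, 80, socket.AF_INET)}

def changed(previous_ips, current_ips):
    """True when the address set differs and the apex records are stale."""
    return previous_ips != current_ips

# e.g. from a once-a-minute cron job:
#   current = resolve_ips("my-elb.us-east-1.elb.amazonaws.com")
#   if changed(saved, current):
#       push_a_records_to_cloudflare(current)   # hypothetical helper
```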
Or better yet -- make your entire site static to begin with! This is how our site works (firebase.com). Our entire site is static content that's generated at deploy time and hosted on a CDN. Dynamic data is loaded asynchronously as needed. If a server were to go down, at least all of the static pieces (which is most of the site) would be unaffected.
We use Firebase to power the dynamic portions (obviously), but you can use plain old AJAX requests as well.
The age of the dynamically-generated HTML page is coming to an end.
The NoScript man is used to this and his fitted tinfoil hat. If he's more lenient he uses Ghostery to block all 3rd party addons and various cookie blockers to keep out the cruft, but keeps the actual site unperturbed.
In theory, I'm super attracted to this model. My only concern is that it puts a lot of eggs in the Firebase basket. And what happens if I need to scale faster than Firebase can handle? Is there a way for me to run an in-house api compatible version of Firebase on my infrastructure and re-point to that? Can I keep a hot-swappable backup in sync? Or can I pay Firebase to do that for me?
Here's a small, shameless, open-source plug: I work on the "Cactus" static website generator. Check out and fork our work here: github.com/koenbok/Cactus (current version), https://github.com/krallin/Cactus (next version)
Shameless plug for Google Web Toolkit. Since its core design principle is client-side dynamic site generation from permutations (i.e. locales and languages) created at compile time, you can deploy the entire thing as static files to S3. That's what I do :)
The only trick is to leave a foo XHR response in your R53 failover so that a running instance of your app can realize the server is tits up and degrade gracefully. For even better caching, throw in web app manifest files too!
You still need dynamic "pages" for things like geoip redirection, localization, etc. You could in theory load this stuff via AJAX, but that doesn't work for search engine crawlers. I do agree most content can be static, though: you can push JSON fragments onto S3 and have the web page fetch them via AJAX.
We've done this for a lot of data-vis work. Clients have access to a CMS that lets them stage and publish their data; doing so puts JSON files on S3, where we also serve the site. There are some trade-offs (sometimes you miss having that REST API), but you gain a lot too.
A drop-in solution to achieve the same is to use CloudFlare as a CDN for your website. CloudFlare has a configurable "Always Online" mode that is automatically triggered whenever your site goes down, showing the user an offline version of the website together with a warning message.
Obviously, if you're using CloudFlare in the first place, chances are you won't have too many problems with high traffic peaks anyway. But it can still happen, depending on how high the peak is :)
Note that, despite all their indications, for CloudFlare "down" means returning certain specific HTTP error codes, not that the server itself is down (as in completely unresponsive, or not running). I was burnt by this, and they actually changed the wording in their documentation when I pointed it out.
This is a totally awesome idea. You are still in trouble if the load balancer has problems or S3 has problems (unlikely, but not impossible!). It's always smart to have a couple of ways of failing over to something if your main site has problems - for instance, I'm always surprised that more people don't spend time customizing the default rails error/maintenance pages for heroku.
One thing you'll probably want to do is set 5xx HTTP status codes on your static site, and try to keep it from being cached as much as possible. A redirect to a subdomain hosted on S3 makes it more likely something like Google will pick up on it.
App Engine is also a good way to do this. Let's say you have the bulk of your site running on AWS (assume for whatever reason you don't want to use GAE as your primary environment):
* Have a heartbeat task on GAE that polls your server and, if it's running, writes "serverisrunning" to memcache with a short time-to-live.
* (normal operation): redirect initial visitors to your AWS site.
* (server not responding to the heartbeat task, or memcache miss): serve static content from GAE, or a limited-functionality version of your app hosted on GAE (for example, a static site with a signup form).
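The steps above can be sketched in a few lines. On App Engine the cache would be `google.appengine.api.memcache`; here a tiny in-memory stand-in (with expiry ignored) keeps the sketch self-contained, and the key name and TTL are arbitrary choices, not GAE requirements.

```python
# Sketch of the heartbeat-plus-memcache failover pattern described above.

HEARTBEAT_KEY = "serverisrunning"
HEARTBEAT_TTL = 30  # seconds; slightly longer than the heartbeat interval

class FakeCache:
    """Minimal memcache stand-in for demonstration (no real expiry)."""
    def __init__(self):
        self._data = {}
    def set(self, key, value, time=0):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)

def record_heartbeat(cache, primary_is_up):
    """Cron task: refresh the flag only while the AWS site answers polls."""
    if primary_is_up:
        cache.set(HEARTBEAT_KEY, 1, time=HEARTBEAT_TTL)

def choose_response(cache, primary_url):
    """Request handler: redirect while healthy, serve static on a miss."""
    if cache.get(HEARTBEAT_KEY):  # a miss means the heartbeat expired
        return ("redirect", primary_url)
    return ("serve_static", None)
```

Because the flag expires on its own, an AWS outage needs no action at all on the GAE side: the heartbeat simply stops refreshing the key, and new visitors fall through to the static path.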
This type of setup has the added bonus of automatically detecting an outage and responding to it. While App Engine has its own downtime issues, outages are transient. Since they migrated to the high replication data store, I haven't seen anything that lasted more than a few minutes.
S3 works for static websites but in general the latency without Cloudfront in front is not that good.
I'm actually working on something that'll make it incredibly fast to get a static site up and running with a powerful CDN and get form submissions working. Will be up at http://www.bitballoon.com soon.
What is the point of involving S3? Why not just run the emergency site directly on the Apache installation doing the rewrite? Unless your traffic is absolutely massive there shouldn't really be any need for the S3 step.
S3's not completely indestructible, but it is the most reliable platform I've encountered. During the ~1.5 years I was at Pinterest, we used S3 to store and serve all of the cached images on our site, with CDNs in front of course. We only ran into 2 incidents with S3, and only one of those would have affected "normal" customers. (The other was our own fault -- we suddenly started writing hundreds of files per second to a bucket that wasn't ready for this level of traffic.)
Nothing you can do is going to take S3 as a whole down, and it doesn't appear to have EBS dependencies so it generally dodges the bullet when AWS has troubles. I'm not aware of any significant S3 outage in the last five years.
As someone said in the comments, just keep in mind that S3 is meant for storage, not serving. You'll need something like CloudFront for that, although I don't know at what level of traffic it starts saving you money. Maybe from the get-go?
I think the point of their post is that this is their "oh shit we are down" setup: a small static site that S3 will happily serve up to anyone anywhere in the world very quickly. This isn't meant to be a full-whack, highly performant copy of their existing website fronted by CDNs.
You still need to update DNS ASAP, but unless you're dealing with a physical flood or fire, you can often get something to respond on the old IP. So, depending on the type of failure you're dealing with, setting up a static redirect is often viable even at insane levels of traffic, even if you're hosting it off a single underpowered CPU with limited bandwidth.
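The "static redirect on the old IP" idea can be as small as this sketch: a tiny server where every request gets a 302 to the S3-hosted emergency site. The bucket URL is a placeholder. Each response is a few hundred bytes, so even a single weak CPU can sustain a very high request rate.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal "everything redirects to the emergency site" server. The
# EMERGENCY_SITE URL is a placeholder for your S3 website endpoint.

EMERGENCY_SITE = "http://emergency.example.com.s3-website-us-east-1.amazonaws.com"

def redirect_location(path):
    """Map any incoming path onto the same path on the emergency site."""
    return EMERGENCY_SITE + path

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)
        self.send_header("Location", redirect_location(self.path))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectHandler).serve_forever()
```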
Aka: the site has been running off your FiOS connection and a spare CPU, and suddenly traffic spikes 5,000%. What do you do? Serve a single short text-only "sorry, we weren't ready for prime time, please come again" page, or redirect to a nice scalable static site you can manually update with nice pictures of your total failures / requests for donations / whatever?
Because designing v0.1 of your application from the ground up around edge cases is a great way to never release anything. Spending a weekend setting up a static failover, on the other hand, has no long-term downside and lets you put off worrying about a host of those edge cases. It's like buying a UPS for your dev box: it's probably never going to matter, but it's cheap, so feel free.
Actually, it's a great way to simultaneously design a website AND an API for others to use. It's also a great way to separate concerns, and a great way to reduce load on your server. In fact, it's an easy way to have some people code a standard back end with standard authentication, so that other people can build front ends for the web, iPhone, and more. You can use, for example, OAuth to authenticate with the back end from any front-end app.
I don't think we're quite on the same page. Having a public API etc. is wonderful, but let's use a slightly different example. Using some third-party ORM to talk to your database is generally a no-brainer, but v0.1 might not even have a database yet, because persistence is generally not needed for a demo. Why put off such a core feature? Because just changing your objects is less friction, and the goal is to see if anyone is interested: idea validation and nothing else.
Because it's just as easy to code your front end independently and then hook up your back end to it. If the front end is static, it can be completely hosted on a CDN. If your back end is unreachable the front end can just take another code path. There's your 0.1
Hopefully your load balancers are still responding and they'll fail over to an alternate working site. If not, having multiple load balancers in your DNS round-robin config could allow the client to retry another record, which is hit or miss in my experience, though others have said it works reliably. Like someone else said, low TTLs help. Or Anycast, or advertising your ASN from a new router/DC. None of that should be necessary, as Amazon will take care of it for you, but only if you pay for and configure all the right services.