As __lucas has mentioned, this can be achieved with Route 53 Failover, which we'd recommend.
Route 53 Failover is (hopefully) pretty easy to configure. Just mark your ELB as the primary and enable target healthchecks, and add the S3 website bucket as the secondary. We'd also suggest that you use an S3 website bucket hosted in a different region than your ELB. This should take no more than a minute or two of setup in our console.
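For anyone who prefers the API to the console, here's a rough sketch of what the pair of failover records might look like as a Route 53 change batch. The domain, hosted zone IDs, endpoint names and the `failover_change_batch` helper are all placeholders made up for illustration; only the record fields (`SetIdentifier`, `Failover`, `AliasTarget`, `EvaluateTargetHealth`) come from the Route 53 API. Treat it as a starting point under those assumptions, not the recommended configuration.

```python
import json


def failover_change_batch(name, primary_alias, secondary_alias):
    """Build a ChangeBatch dict with a PRIMARY and a SECONDARY alias record.

    Each *_alias argument is a (hosted_zone_id, dns_name) tuple. Every value
    passed in below is a hypothetical placeholder.
    """
    def record(role, alias, evaluate_health):
        zone_id, dns_name = alias
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": zone_id,
                    "DNSName": dns_name,
                    "EvaluateTargetHealth": evaluate_health,
                },
            },
        }

    return {
        "Comment": "ELB primary, S3 website bucket secondary",
        "Changes": [
            # Primary: the ELB, with target health evaluation switched on.
            record("PRIMARY", primary_alias, True),
            # Secondary: the S3 website endpoint in a different region.
            record("SECONDARY", secondary_alias, False),
        ],
    }


batch = failover_change_batch(
    "www.example.com.",
    ("ZELBPLACEHOLDER", "my-elb-123.us-west-2.elb.amazonaws.com."),
    ("ZS3PLACEHOLDER", "s3-website-us-east-1.amazonaws.com."),
)
print(json.dumps(batch, indent=2))
```

The resulting dict has the shape that boto3's `change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)` takes; the call itself is left out so the sketch runs without credentials.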
The difference it makes is that Route 53 failover doesn't depend on any "make this change" control plane; the status of every Route 53 healthcheck is constantly being polled from at least 16 locations and then sent (via 3 redundant networking paths) to every Route 53 DNS server. The system is always operating at full-load, with a tremendous degree of fault-tolerance and redundancy. We hope that makes it very robust against very "hard to fix" internet and power problems, and also API outages.
So for an awkward "worst case" example; if there were a large networking outage in an intermediate transit provider your customers might not be able to reach your ELB, and likewise you may not be able to reach the API to make the changes necessary. Route 53 failover should work anyway by detecting the reachability problem and flipping to the pre-configured secondary - an action which is triggered at our edge sites.
If you'd rather not use Route 53 as your primary DNS provider that's ok; all of the above can still be achieved by using a dedicated zone on Route 53 just for managing the failover, which you may then CNAME to, just as with ELB. Each zone costs $0.50/month. Of course we'd also like to make this kind of functionality easier to use and built-in, and that's something we're constantly working on.
DNS is an awful way to do failover.
Route 53 supports DNS TTLs as low as 0 seconds. ELB and S3 endpoints both have 60 second TTLs. My experience with flipping names like www.amazon.com doesn't reflect the 15% figure. I've seen about 97% of web traffic honouring the TTL and flipping quickly. Within 5 minutes almost all of the rest too. We also take CloudFront sites in and out of service for maintenance, and in 5 years I've never seen anything like a 15% straggler effect.
That said, we do see a very small number of stragglers. While resolvers over-riding TTLs hasn't shown up as a significant problem, buggy clients can be; we come across clients now and then who either don't re-resolve ever (Various JVMs and their infinite caches are a common cause), or only re-resolve on failures (which is fine for failover, but not great for traffic management).
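To make the "don't cache forever" point concrete, here's a minimal sketch of a client-side lookup cache that re-resolves once an entry is older than a TTL, and can be invalidated on a connection failure. The `ReResolvingHost` class is made up for illustration, and the fixed 60 second TTL is an assumption: `socket.getaddrinfo` doesn't expose the record's real TTL, so a real client would need a proper resolver library to honour it exactly.

```python
import socket
import time


class ReResolvingHost:
    """Cache a hostname's addresses, but only for ttl_seconds.

    resolve and clock are injectable so the behaviour can be checked
    without touching the network.
    """

    def __init__(self, hostname, ttl_seconds=60, resolve=None, clock=time.monotonic):
        self.hostname = hostname
        self.ttl_seconds = ttl_seconds
        self._resolve = resolve or self._system_resolve
        self._clock = clock
        self._addresses = None
        self._expires_at = 0.0

    @staticmethod
    def _system_resolve(hostname):
        # Ask the OS resolver; collect the distinct IP addresses it returns.
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def addresses(self):
        # Re-resolve once the cached answer is older than the TTL.
        if self._addresses is None or self._clock() >= self._expires_at:
            self._addresses = self._resolve(self.hostname)
            self._expires_at = self._clock() + self.ttl_seconds
        return self._addresses

    def invalidate(self):
        # Call this on a connection failure to force a fresh lookup.
        self._addresses = None
```

Calling `addresses()` before each new connection, and `invalidate()` after a failed one, covers both the "never re-resolves" and the "only re-resolves on failure" cases.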
If you have a distribution time plot for the 15% figure it'd be interesting to see; https://www.dns-oarc.net/ would be a good venue, https://lists.dns-oarc.net/mailman/listinfo/dns-operations is the open list. Ignoring TTLs for a week is very concerning; it would very likely break many DNSSEC configurations. Is it possible you were dealing with robots?
But I ran across this around the year 2000. And the biggest offender was named AOL. I could well believe that it is very, very different today.
Typically rural ISPs or Alaskan ISPs (they don't want to pay to forward all the requests so they override the TTLs).
This is when we flip Netflix IPs.
So you can already use the CloudWatch "average" statistic to get some limited visibility. Here's the metric for my own personal micro instance (hosted in us-west-2): http://failfast.info/micrometric.png
When the average drops to 0.875, what's actually happening is that one region couldn't reach us-west-2. (It turned out to be a transit networking issue somewhere between Singapore and Oregon). Just to give a sense of how rare partitions are: that's the first and only time I've seen one affect my instance, and I've been using the feature since before we launched publicly.
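To spell out the arithmetic, here's a tiny sketch. It assumes the metric is the mean of per-region results (1 for reachable, 0 for not) and that there are eight checking regions; the eight is inferred from 0.875 with one failing region, not stated anywhere above, and the region labels are made up.

```python
def health_average(region_results):
    """Mean of per-region check results: True counts as 1, False as 0."""
    results = list(region_results)
    return sum(1 for ok in results if ok) / len(results)


# Hypothetical region labels; eight regions is an inference, not a documented number.
results = {f"region-{i}": True for i in range(1, 9)}
results["region-3"] = False  # the one region that couldn't reach the endpoint

print(health_average(results.values()))  # 7 / 8 = 0.875
```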
But yes, we would like to add visibility into things like where it seemed to fail from and whether it actually impacted any DNS decisions (in this case it didn't, as even our nodes in Singapore were able to tell that my endpoint was healthy, via redundant data from other regions).
Edit to add: I should add that the most common reason why a health check fails is that the endpoint is just unhealthy, or totally unreachable. There is absolutely no ambiguity in those situations and the CloudWatch metrics tell you exactly when the failure and recovery happen. Only crazy internet partitions are hard.
I use CloudFlare for DDoS protection which Amazon doesn't offer. For apex domains, ELB requires Route53 exclusively but that conflicts with CloudFlare which also requires DNS be hosted there. Since ELB uses CNAME records I haven't been able to find a way for CloudFlare and ELB to work together (short of polling ELB IP changes every minute which doesn't really work out that well).
This doesn't solve the issue for the apex however, so we have an S3 redirect on the apex.
The key thing this allows us to do is continue to use Route53 with Cloudflare.
We use Firebase to power the dynamic portions (obviously), but you can use plain old AJAX requests as well.
The age of the dynamically-generated HTML page is coming to an end.
The other man says, "Your site doesn't work on my browser with NoScript."
These two men will never like each other.
It's got more tests, and it succeeds at building the Cactus examples like the old one. The only thing that's not stable in it is the "serve" mode.
I'm working with Koen (the original author) to get this into the main repo.
--- If you use it, and run into any issue, get in touch with me: thomas [at] orozco.fr
The only trick is to leave a foo XHR response in your R53 failover so that a running instance of your app can realize the server is tits up and degrade gracefully. For even better caching throw in webapp manifest files too!
Obviously, if you're using CloudFlare in the first place, chances are that you won't have too many problems with high peaks of traffic anyway. But obviously, it can still happen, depending on how high the peak is :)
Also not always 100% compatible with all of your website's features. You may want to only use CloudFlare with your static site and turn off the proxy features for other subdomains. Ymmv.
One thing you'll probably want to do is set 5xx HTTP status codes on your static site, and to try to get things to not cache it as much as possible. A redirect to a subdomain hosted on s3 makes it more likely something like Google will pick up on it.
* Have a heartbeat task on GAE that polls your server, and if it's running, write to "serverisrunning" on memcache with a short time to live.
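As an illustration of the 5xx-plus-no-cache idea, here's a minimal sketch of a fallback page served with a 503 status and headers that discourage caching. It uses Python's built-in `http.server` purely for demonstration; the handler name, page body, port and the 300 second `Retry-After` value are all assumptions, and whatever actually hosts your static site would need its own way of setting the same status and headers.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<h1>Down for maintenance</h1><p>Please try again shortly.</p>"


class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells crawlers and caches that the outage is temporary.
        self.send_response(503)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.send_header("Retry-After", "300")
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=8080):
    # Bind to localhost only; port 0 asks the OS for any free port.
    return HTTPServer(("127.0.0.1", port), MaintenanceHandler)
```

Running `make_server().serve_forever()` and requesting any path should return the page with a 503 status.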
* (normal operation) : redirect initial visitors to your AWS site.
* (server not responding to a heartbeat task on GAE, or memcache miss): serve static content from GAE, or a limited functionality version of your app hosted on GAE (for example, a static site with signup form).
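The three steps above can be sketched roughly as follows. This is not App Engine code: `TtlFlag` is a stand-in for a memcache key with a short expiry, and the function names, URL and fallback file name are hypothetical. It only shows the logic: refresh the flag when the poll succeeds, and fall back as soon as the flag has expired.

```python
import time


class TtlFlag:
    """Stand-in for a memcache key ("serverisrunning") with a short expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._expires_at = None

    def set(self):
        self._expires_at = self._clock() + self.ttl_seconds

    def is_set(self):
        return self._expires_at is not None and self._clock() < self._expires_at


def heartbeat(flag, server_is_up):
    # Run periodically; only refresh the flag when the poll succeeds.
    if server_is_up():
        flag.set()


def route(flag, aws_url="https://app.example.com/"):
    # Normal operation: redirect. Flag expired or never set: serve the fallback.
    if flag.is_set():
        return ("redirect", aws_url)
    return ("static", "fallback.html")
```

The TTL needs to be somewhat longer than the heartbeat interval, otherwise the flag expires between two successful polls.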
This type of setup has the added bonus of automatically detecting an outage and responding to it. While App Engine has its own downtime issues, outages are transient. Since they migrated to the high replication data store, I haven't seen anything that lasted more than a few minutes.
I'm actually working on something that'll make it incredibly fast to get a static site up and running with a powerful CDN and get form submissions working. Will be up at http://www.bitballoon.com soon.
Aka the site has been running off your FIOS connection and a spare CPU, then suddenly the traffic spikes 5,000%. What do you do? Host a single short text-only "sorry, we were not ready for prime time, please come again" page, or redirect to a nice scalable static site you can manually update with nice pictures of your total failures / requests for donations or whatever.
http://www.discourse.org/ is one example of such an approach
I would imagine that Apache (running on something better than a C64) can handle an absolutely ridiculous amount of traffic that it's just redirecting.
EDIT: It also takes like no effort to turn on Akamai CDN option for Cloud Files as well.
Ideally, under normal conditions, your 'active' landing pages should be as fast as your static maintenance page.
This is relatively easy, and could possibly even just use CSS with the understanding that yes, someone could have a bad experience.
What kind of control do you feel you lack when using GH Pages?
Route53 is also an excellent DNS service.
S3 may indeed serve it using Static Website Hosting (and you need to enable it anyway to use CloudFront with it).
Still, S3 is quite slow, and not globally distributed. It's indeed good advice and practice to use CloudFront in front of it.
CloudFront for a static website is not very expensive either.
My company's website runs off of S3 + CloudFront, and S3 alone is slow.
Load time is usually 2x to 3x when accessed via the S3 origin.
There have been reports about S3 returning 50x errors, when (as documented) the client should retry. A browser is not an S3 client, so it won't retry ...
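For comparison, this is roughly what "the client should retry" looks like when the client is a program rather than a browser: a minimal sketch with exponential backoff on 5xx responses. The function name, attempt count and delays are assumptions chosen for illustration, not values from the S3 documentation.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=4, base_delay=0.2,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on 5xx responses with exponential backoff.

    opener and sleep are injectable so the logic can be exercised offline.
    """
    for attempt in range(attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only on 5xx, and give up after the last attempt.
            if err.code < 500 or attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A browser loading a page straight from the bucket does none of this, which is one more argument for putting a CDN in front.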
That said, looking through their linked startups is kind of depressing; the firstest of the first-world problems.
EDIT: Okay, okay, not all of them, as my salesbro points out.
Is this generally the experience of everyone here?