Keep a static "emergency mode" site on S3 (coderwall.com)
212 points by jonthepirate on July 12, 2013 | hide | past | favorite | 82 comments

(I work on Route 53).

As __lucas has mentioned, this can be achieved with Route 53 Failover, which we'd recommend.

Route 53 Failover is (hopefully) pretty easy to configure. Just mark your ELB as the primary and enable target healthchecks, and add the S3 website bucket as the secondary. We'd also suggest that you use an S3 website bucket hosted in a different region than your ELB. This should take no more than a minute or two of setup in our console.
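The setup described above can also be done through the API. Here's a minimal sketch (not an official example) of the change batch such a configuration would send to Route 53 via `change_resource_record_sets`; the domain, DNS names, and hosted zone IDs are hypothetical placeholders.

```python
# Sketch of a Route 53 failover record pair: ELB as PRIMARY with target
# health evaluation enabled, S3 website bucket as SECONDARY.
# All names and zone IDs below are illustrative placeholders.

def failover_change_batch(domain, elb_dns, elb_zone_id, s3_dns, s3_zone_id):
    """Build a change batch for a PRIMARY/SECONDARY failover alias pair."""
    def record(set_id, role, dns_name, zone_id, evaluate_health):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": domain,
                "Type": "A",
                "SetIdentifier": set_id,
                "Failover": role,
                "AliasTarget": {
                    "DNSName": dns_name,
                    "HostedZoneId": zone_id,
                    # EvaluateTargetHealth=True makes the alias follow the
                    # ELB's own health, so no separate health check is needed.
                    "EvaluateTargetHealth": evaluate_health,
                },
            },
        }
    return {
        "Changes": [
            record("primary", "PRIMARY", elb_dns, elb_zone_id, True),
            record("secondary", "SECONDARY", s3_dns, s3_zone_id, False),
        ]
    }

batch = failover_change_batch(
    "www.example.com.",
    "my-elb-123.us-east-1.elb.amazonaws.com.", "Z35SXDOTRQ7X7K",
    "s3-website-us-west-2.amazonaws.com.", "Z3BJ6K6RIION7M",
)
# With boto3: route53.change_resource_record_sets(
#     HostedZoneId="Z...", ChangeBatch=batch)
```

The same structure applies whether the zone is your primary DNS or a dedicated failover-only zone you CNAME into.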

The difference it makes is that Route 53 failover doesn't depend on any "make this change" control plane; the status of every Route 53 healthcheck is constantly being polled from at least 16 locations and then sent (via 3 redundant networking paths) to every Route 53 DNS server. The system is always operating at full load, with a tremendous degree of fault tolerance and redundancy. We hope that makes it very robust against very "hard to fix" internet and power problems, and also API outages.

So for an awkward "worst case" example: if there were a large networking outage in an intermediate transit provider, your customers might not be able to reach your ELB, and likewise you might not be able to reach the API to make the changes necessary. Route 53 failover should work anyway by detecting the reachability problem and flipping to the pre-configured secondary, an action which is triggered at our edge sites.

If you'd rather not use Route 53 as your primary DNS provider that's ok; all of the above can still be achieved by using a dedicated zone on Route 53 just for managing the failover, which you may then CNAME to, just as with ELB. Each zone costs $0.50/month. Of course we'd also like to make this kind of functionality easier to use and built-in, and that's something we're constantly working on.

Except that his way all the traffic instantly switches, and your way you have to wait for DNS propagation, which about 15% of the users on the internet will not pick up for over a week.

DNS is an awful way to do failover.

To your point: it's OK to do both.

Route 53 supports DNS TTLs as low as 0 seconds. ELB and S3 endpoints both have 60-second TTLs. My experience with flipping names like www.amazon.com doesn't reflect the 15% figure. I've seen about 97% of web traffic honouring the TTL and flipping quickly, and within 5 minutes almost all of the rest too. We also take CloudFront sites in and out of service for maintenance, and in 5 years I've never seen anything like a 15% straggler effect.

That said, we do see a very small number of stragglers. While resolvers overriding TTLs haven't shown up as a significant problem, buggy clients can be; we come across clients now and then that either never re-resolve (various JVMs and their infinite caches are a common cause), or only re-resolve on failures (which is fine for failover, but not great for traffic management).

If you have a distribution time plot for the 15% figure it'd be interesting to see; https://www.dns-oarc.net/ would be a good venue, https://lists.dns-oarc.net/mailman/listinfo/dns-operations is the open list. Ignoring TTLs for a week is very concerning; it would very likely break many DNSSEC configurations. Is it possible you were dealing with robots?

I ran across the "DNS broken for a week" case, and it was broken for ISPs that run their own DNS for home customers. They didn't want to keep asking upstream DNS, so they just ignored the TTL.

But I ran across this around the year 2000, and the biggest offender was AOL. I could well believe that it is very, very different today.

When we flip Netflix domains we see about a 15% straggler effect (although to be fair only about 3% take a week, but many take around 24 hours).

How much of that 15% is driven by ISPs versus misbehaving clients (e.g. set-top boxes)?

I don't know that I have ever seen a straggling DNS update take a week, but I have seen many take 3-4 days. Particularly Mediacom customers. I think this situation has improved some, but I still would not consider DNS failover very reliable.

In your experience, are there still major ISPs that don't respect DNS TTL?

Yes, about 15%. :) To be fair most flip within 24 hours, but about 3% or so take longer than a week.

Typically rural or Alaskan ISPs (they don't want to pay to forward all the requests, so they override the TTLs).

This is when we flip Netflix IPs.

We use weighted A records for this purpose. We basically have two weighted A records, one pointing to the load balancer and the other to the static maintenance site on S3. When we want to switch sites, we just flip their weights (they are always 1 and 0), forcing 100% redirection. It's not a perfect system, but it works, and DNS propagation is usually done within 15 minutes (we haven't tried modifying the TTLs). We haven't been able to come up with anything simpler than this.
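The weighted-record flip described above reduces to a tiny bit of logic, sketched here with hypothetical record names (the real flip would go through the Route 53 API on each record set):

```python
# Minimal model of the 1/0 weighted-record flip: two weighted records,
# and failing over just swaps their weights. Names are illustrative.

def flip_weights(records):
    """Swap the weights of a pair of weighted records in place."""
    records[0]["Weight"], records[1]["Weight"] = (
        records[1]["Weight"], records[0]["Weight"])
    return records

records = [
    {"SetIdentifier": "app", "Value": "elb.example.com", "Weight": 1},
    {"SetIdentifier": "maintenance",
     "Value": "s3-maintenance.example.com", "Weight": 0},
]
flip_weights(records)

# The record with weight 1 now receives all resolutions.
active = [r["SetIdentifier"] for r in records if r["Weight"] == 1]
# active == ["maintenance"]
```

Because one weight is always 0, the flip is all-or-nothing rather than a gradual traffic shift.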

Do you guys have any plans to actually show the results of the DNS changes based on health checks? Connecting the health checks to cloudwatch is a great start, but it would be really useful to actually see which of my rules are in effect at any given time.

Almost all of the time, what you see in CloudWatch should be representative of what's being served. The internet stays consistent to at least four nines. Our CloudWatch metrics are actually made up of 16 independent metric reporters (2 per region, across the 8 regions we run in), each reporting a 1 or a 0.

So you can already use the CloudWatch "average" statistic to get some limited visibility. Here's the metric for my own personal micro instance (hosted in us-west-2): http://failfast.info/micrometric.png

When the average drops to 0.875, what's actually happening is that one region couldn't reach us-west-2 (it turned out to be a transit networking issue somewhere between Singapore and Oregon). Just to give a sense of how rare partitions are: that's the first and only time I've seen one affect my instance, and I've been using the feature since before we launched publicly.
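The 0.875 figure falls straight out of how the metric is constructed, as a quick check shows:

```python
# 16 binary reporters (2 per region across 8 regions), averaged.
# If one region loses reachability, its 2 reporters each publish 0.

reporters_per_region = 2
regions = 8
failing_regions = 1

healthy = (regions - failing_regions) * reporters_per_region  # 14 of 16
average = healthy / (regions * reporters_per_region)
# average == 14 / 16 == 0.875
```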

But yes, we would like to add visibility into things like where it seemed to fail from and whether it actually impacted any DNS decisions (in this case it didn't, as even our nodes in Singapore were able to tell that my endpoint was healthy, via redundant data from other regions).

Edit to add: I should add that the most common reason why a health check fails is that the endpoint is just unhealthy, or totally unreachable. There is absolutely no ambiguity in those situations and the CloudWatch metrics tell you exactly when the failure and recovery happen. Only crazy internet partitions are hard.

Not directly related but I have one major gripe with ELB--am hopeful you have some inside knowledge and a possible workaround.

I use CloudFlare for DDoS protection, which Amazon doesn't offer. For apex domains, ELB requires Route 53 exclusively, but that conflicts with CloudFlare, which also requires DNS be hosted there. Since ELB uses CNAME records, I haven't been able to find a way for CloudFlare and ELB to work together (short of polling ELB for IP changes every minute, which doesn't really work out that well).

You can still use Route53 with CloudFlare by using their CNAME setup.

This doesn't solve the issue for the apex, however, so we have an S3 redirect on the apex.

The key thing this allows us to do is continue to use Route53 with Cloudflare.

The current R53 health check does not support SSL endpoints, which is rather critical for our use-cases. Please make it happen, thanks.

Depending on your use case, you might be able to get away with a TCP check on port 443, or a separate, HTTP-accessible deep healthcheck.
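The first workaround is just a matter of the health check's configuration. A sketch of what that config might look like (the IP address is a hypothetical placeholder):

```python
# A TCP check against port 443 as a stand-in for an HTTPS check: it
# verifies the TLS listener accepts connections, not that the app
# behind it is returning valid responses.

tcp_check_config = {
    "IPAddress": "203.0.113.10",  # hypothetical endpoint IP
    "Port": 443,
    "Type": "TCP",                # plain TCP connect, no HTTP request sent
    "RequestInterval": 30,        # seconds between checks from each checker
    "FailureThreshold": 3,        # consecutive failures before "unhealthy"
}
# With boto3: route53.create_health_check(
#     CallerReference="...", HealthCheckConfig=tcp_check_config)
```

The trade-off is coarser detection: a backend that accepts TLS connections but serves 500s would still look healthy, which is where the separate HTTP deep healthcheck comes in.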

Thanks this is a really great idea. I'm going to try this out.

Or better yet -- make your entire site static to begin with! This is how our site works (firebase.com). Our entire site is static content that's generated at deploy time and hosted on a CDN. Dynamic data is loaded asynchronously as needed. If a server were to go down, at least all of the static pieces (which is most of the site) would be unaffected.

We use Firebase to power the dynamic portions (obviously), but you can use plain old AJAX requests as well.

The age of the dynamically-generated HTML page is coming to an end.

One man says, "Make your whole site static! Load content with AJAX instead of generating HTML on the server!"

The other man says, "Your site doesn't work on my browser with NoScript."

These two men will never like each other.

The NoScript man is used to this and his fitted tinfoil hat. If he's more lenient he uses Ghostery to block all 3rd party addons and various cookie blockers to keep out the cruft, but keeps the actual site unperturbed.

True tinfoils don't use proprietary software.

It'll be open source soon. Plus, you can always look at the injected JS yourself - it's not obfuscated.

In theory, I'm super attracted to this model. My only concern is that it puts a lot of eggs in the Firebase basket. And what happens if I need to scale faster than Firebase can handle? Is there a way for me to run an in-house api compatible version of Firebase on my infrastructure and re-point to that? Can I keep a hot-swappable backup in sync? Or can I pay Firebase to do that for me?

Here's a small, shameless, open source plug: I work on the "Cactus" static website generator. Check out and fork our work here: github.com/koenbok/Cactus (current version), https://github.com/krallin/Cactus (next version)

Is the new version stable or would you recommend the old one to a newcomer? I'm using Hyde right now, but I'm not loving it.

I would personally recommend the new one (but take this with a grain of salt: I work on it). I use it daily to build www.scalr.com.

It's got more tests, and it succeeds at building the Cactus examples like the old one. The only thing that's not stable in it is the "serve" mode.

I'm working with Koen (the original author) to get this into the main repo.

--- If you use it, and run into any issue, get in touch with me: thomas [at] orozco.fr

Shameless plug for Google Web Toolkit: since its core design principle is client-side dynamic site generation from permutations (i.e. locales and languages) created at compile time, you can deploy the entire thing as static files to S3. That's what I do :)

The only trick is to leave a stub XHR response in your R53 failover site so that a running instance of your app can realize the server is tits up and degrade gracefully. For even better caching, throw in web app manifest files too!

You still need dynamic "pages" for things like GeoIP redirection or localization. You could in theory load this stuff via AJAX, but that doesn't work for search engine crawlers. I do agree most stuff can be static, though: you can push JSON fragments onto S3 and have the web page fetch them via AJAX.

We've done this for a lot of data-vis work. Clients have access to a CMS which lets them stage and publish their data; doing so puts JSON files on S3, where we also serve the site. There are some trade-offs, sometimes you miss having that REST API, but you gain a lot too.

Route53 can do a lot of that localization redirection for you, though not exactly at the GeoIP level. Other localization redirects can easily be done from JS.

Agreed. Another benefit of having the site mainly static, with dynamic parts loaded over AJAX, is that it becomes much easier to apply efficient HTTP-level caching.

You can do this now automatically with Route53 DNS failover http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns...

Yup, and Route 53 can now even point the zone apex at an S3 bucket!

A drop-in solution to achieve the same is using CloudFlare as CDN for your website. CloudFlare has a configurable "Always Online" mode that is automatically triggered whenever your site is down, that shows the user an offline version of the website, together with a warning message.

Obviously, if you're using CloudFlare in the first place, chances are that you won't have too many problems with high peaks of traffic anyway. But obviously, it can still happen, depending on how high the peak is :)


Note that despite all their indications, for CloudFlare "down" means the origin returning certain specific HTTP error codes, not that the server itself is down (as in completely unresponsive or not running). I was burnt by this, and they actually changed the wording in their documentation when I pointed it out.

Not exactly drop-in: with CloudFlare you need to be careful to configure it really well in the control panel settings.

It's also not always 100% compatible with all of your website's features. You may want to use CloudFlare only for your static site and turn off the proxy features for other subdomains. YMMV.

replying to 23david, not the parent. 23david you appear to be shadow banned. It seems to be recent, might want to get it checked out.

thanks for letting me know. emailing the mods.

glad it's cleared up. Seemed innocent to me :)

This is a totally awesome idea. You are still in trouble if the load balancer has problems or S3 has problems (unlikely, but not impossible!). It's always smart to have a couple of ways of failing over to something if your main site has problems. For instance, I'm always surprised that more people don't spend time customizing the default Rails error/maintenance pages for Heroku.

I previously wrote about making a downtime page:


One thing you'll probably want to do is set 5xx HTTP status codes on your static site, and try to keep it from being cached as much as possible. A redirect to a subdomain hosted on S3 makes it more likely that something like Google will pick up on it.

App Engine is also a good way to do this. Let's say you have the bulk of your site running on AWS (assume for whatever reason you don't want to use GAE as your primary environment):

* Have a heartbeat task on GAE that polls your server and, if it's running, writes to "serverisrunning" in memcache with a short time-to-live.

* (Normal operation): redirect initial visitors to your AWS site.

* (Server not responding to the heartbeat task on GAE, or memcache miss): serve static content from GAE, or a limited-functionality version of your app hosted on GAE (for example, a static site with a signup form).

This type of setup has the added bonus of automatically detecting an outage and responding to it. While App Engine has its own downtime issues, its outages are transient; since they migrated to the high-replication datastore, I haven't seen anything that lasted more than a few minutes.
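The heartbeat scheme above reduces to a small piece of decision logic. In a real GAE app, `cache` would be memcache and the heartbeat task would do an HTTP fetch of the AWS site; here both are stubbed (a plain dict and a boolean) just to show the control flow, so the names are illustrative.

```python
# Heartbeat-driven failover: a periodic task records a short-lived
# "alive" flag; each visitor request checks the flag to choose between
# redirecting to the primary site and serving the static fallback.

HEARTBEAT_KEY = "serverisrunning"

def heartbeat(cache, probe_ok):
    """Periodic task: record an 'alive' flag when the probe succeeds.
    In GAE this would be memcache.set(HEARTBEAT_KEY, True, time=60)."""
    if probe_ok:
        cache[HEARTBEAT_KEY] = True

def route_visitor(cache):
    """Per-request: redirect to AWS if alive, else serve static fallback.
    A memcache miss (expired TTL) is treated the same as a failed probe."""
    if cache.get(HEARTBEAT_KEY):
        return "redirect-to-aws"
    return "serve-static-fallback"

cache = {}
heartbeat(cache, probe_ok=True)
assert route_visitor(cache) == "redirect-to-aws"

cache.clear()  # simulate TTL expiry / missed heartbeat
assert route_visitor(cache) == "serve-static-fallback"
```

The short TTL is what makes the outage detection automatic: if the heartbeat task stops succeeding, the flag simply expires and traffic falls back without any explicit "flip" action.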

S3 works for static websites, but in general the latency without CloudFront in front is not that good.

I'm actually working on something that'll make it incredibly fast to get a static site up and running with a powerful CDN and get form submissions working. Will be up at http://www.bitballoon.com soon.

Yeah, but the big question is: how can you switch the DNS over in time when NONE of your servers can respond fast enough?

You still need to update DNS ASAP, but unless you're dealing with a physical flood/fire, you can often get something to respond on the old IP. So, depending on the type of failure you're dealing with, setting up a static redirect is often viable for insane levels of traffic, even if you're hosting it off of a single underpowered CPU and limited bandwidth.

Say the site has been running off your FiOS connection and a spare CPU when suddenly the traffic spikes 5,000%. What do you do? Serve a single short "sorry, we weren't ready for prime time, please come again" note, or redirect to a nice, scalable static site you can manually update with pictures of your total failures, requests for donations, or whatever.

Why don't you just always have your site static and connect to your back end as necessary? Treat it as a webservice with uptime, etc.

Because designing v0.1 of your application from the ground up around edge cases is a great way to never release anything. Spending a weekend setting up a static failover, on the other hand, has no long-term downside and lets you put off worrying about a host of those edge cases. It's like buying a UPS for your dev box: it's probably never going to matter, but it's cheap, so feel free.

Actually, it's a great way to simultaneously design a website AND an API for others to use. It's also a great way to separate concerns, and to reduce load on your server. In fact, it's an easy way to have some people build a standard back end with standard authentication so that other people can build front ends for the web, iPhone, and more. You can use, for example, OAuth to authenticate with the back end from any front-end app.

http://www.discourse.org/ is one example of such an approach

I don't think we are quite on the same page. Having a public API etc. is wonderful, but let's use a slightly different example: using some third-party ORM to talk to your database is generally a no-brainer, but v0.1 might not even have a database yet, because persistence is not generally needed for a demo. Why put off such a core feature? Because just changing your objects is less friction, and the goal is to see if anyone is interested. Aka idea validation and nothing else.

Because it's just as easy to code your front end independently and then hook your back end up to it. If the front end is static, it can be completely hosted on a CDN. If your back end is unreachable, the front end can just take another code path. There's your 0.1.

Hopefully your load balancers are still responding and they'll failover to an alternate working site. If not, having multiple load balancers in your DNS round robin config could allow the client to retry another record to see if it works, which is hit or miss in my experience, but others have said it works reliably. Like someone else said, low TTLs help. Or Anycast, or advertising your ASN from a new router/DC. None of that should be necessary as Amazon will take care of that for you, but only if you pay for and configure all the right services.

Very low DNS TTLs

Didn't the original article mention setting Apache to redirect the requests to the static site?

I would imagine that Apache (running on something better than a C64) can handle an absolutely ridiculous amount of traffic that it's just redirecting.

Great idea. Putting the static site on Rackspace Cloud Files would also be advisable as an alternative to S3 in the event of an AWS outage.

EDIT: It also takes like no effort to turn on Akamai CDN option for Cloud Files as well.

Disaster Recovery. It's called a Disaster Recovery Site.


Your origin is extremely slow. Perhaps this is an artifact of the HN rush, but it's slow enough that I would be looking for ways to improve home page response time.

Ideally, under normal conditions, your 'active' landing pages should be as fast as your static maintenance page.

Why not just do an origin-pull via CloudFront? No need to build a static site on S3.

You would need to make sure you still build the 'static site generator' on your current site (so login, search, and any other functionality dependent on your app is not exposed).

This is relatively easy, and could possibly even just use CSS with the understanding that yes, someone could have a bad experience.

That doesn't help you if the problem is that your origin is down. Or am I misunderstanding you?

What is the point of involving S3? Why not just run the emergency site directly on the Apache installation doing the rewrite? Unless your traffic is absolutely massive there shouldn't really be any need for the S3 step.

I would pay for someone to take care of this for me. I presently run a GH Pages static blog but would like complete control of the build. I want to upload a .zip somewhere and have things just work.

Mentioned it further down in the thread, but working on a really easy way (http://www.bitballoon.com) to get a static site online and backed by a CDN (and we do support uploading a .zip).

What kind of control do you feel you lack when using GH Pages?

I want to use clojurescript on the page, (which has a compilation step), and I don't want to track the build output in source control.

I remember last year when Kony 2012 blew up for a couple days, they switched their site to static s3 pages to handle the traffic and collect donations. I thought it was pretty clever.

+1. The question to ask yourself is: "Is Amazon's uptime better than mine?" If the answer is "yes", use them.

Route53 is also an excellent DNS service.

Your primary site has a mobile formatting issue (iOS). The maintenance site does not ;)

As someone said in the comments, just keep in mind that S3 is just for storage, not serving. You'll need something like CloudFront for that, although I don't know at which degree of activity it's going to save you money to use it. Maybe from the get-go?

Actually, this comment is correct.

S3 may indeed serve it using Static Website Hosting (and you need to enable it anyway to use CloudFront with it).

Still, S3 is quite slow, and not globally distributed. It's indeed good advice and practice to use CloudFront in front of it.

CloudFront for a static website is not very expensive either.


I run my company's website off of S3 + CloudFront, and S3 alone is slow.

Load time is usually 2x to 3x when accessed via the S3 origin directly.

As has already been mentioned, S3 can indeed serve static websites and has had this capability for at least a year now.

That's completely false. Hosting a static website on S3 is very well documented.

And how reliable is it?

There have been reports of S3 returning 50x errors, in which case (as documented) the client should retry. A browser is not an S3 client, so it won't retry ... https://news.ycombinator.com/item?id=4976893 https://news.ycombinator.com/item?id=4977360 https://news.ycombinator.com/item?id=4981482 https://news.ycombinator.com/item?id=2897959
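The retry behaviour the S3 documentation expects from clients (and which browsers don't implement) is essentially exponential backoff on 5xx responses. A hedged sketch, where `fetch` is a stand-in for any HTTP call returning a status code:

```python
import time

def get_with_retries(fetch, retries=3, base_delay=0.5):
    """Retry transient 5xx responses with exponential backoff.
    Returns the first non-5xx status, or the last 5xx if retries run out."""
    status = fetch()
    for attempt in range(retries):
        if status < 500:
            break
        time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
        status = fetch()
    return status

# Simulated endpoint: one transient 500, then success.
responses = iter([500, 200])
status = get_with_retries(lambda: next(responses), base_delay=0.0)
# status == 200
```

A page served straight from S3 to a browser gets none of this, which is the poster's point: the error simply surfaces to the user.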

yes but the recommended approach is to put CloudFront in front of it, if you care about performance at all

I think the point of their post is that this is their "oh shit, we are down" setup, which consists of a small static site that S3 will happily serve up to anyone anywhere in the world very quickly. This isn't meant to be a full-whack, highly performant copy of their existing website fronted by CDNs.

But that's wrong. You don't need anything else, S3 is for serving, and they deliberately added support and documentation for exactly this use case: hosting static websites.

Very good point: a static site that's up now can be waaaay better than a dynamic one that is slow and user-raging.

That said, looking through their linked startups is kind of depressing; the firstest of the first-world problems.

EDIT: Okay, okay, not all of them, as my salesbro points out.

> As far as I'm concerned, S3 static file website serving is completely indestructible. I only need one bland Apache server to bump requests over to it.

Is this generally the experience of everyone here?

S3's not completely indestructible, but it is the most reliable platform I've encountered. During the ~1.5 years I was at Pinterest, we used S3 to store and serve all of the cached images on our site, with CDNs in front of course. We only ran into 2 incidents with S3, and only one of those would have affected "normal" customers. (The other was our own fault -- we suddenly started writing hundreds of files per second to a bucket that wasn't ready for this level of traffic.)

Nothing you can do is going to take S3 as a whole down, and it doesn't appear to have EBS dependencies so it generally dodges the bullet when AWS has troubles. I'm not aware of any significant S3 outage in the last five years.

One would hope not... with the API they offer, it should be the type of service that is reasonably easy to make pretty much indestructible.

I'm surprised this forum isn't full of the usual "omg, use nginx instead of apache for rerouting".

If the only thing you're doing is rerouting, I'm sure Apache can handle pretty sick traffic as well.
