Hacker News
A rough guide to keeping your website up through catastrophic events (rainforestapp.com)
120 points by fredsters_s on June 30, 2012 | 52 comments

I think one factor people don't consider enough is the tradeoffs you need to make to give your app incredible reliability. This article and others describe a bunch of work you can do to help ensure your application stays up during rare events. However, maybe for your particular product, having 99.5% uptime and offering a ton of features is going to make you more successful than 99.9% uptime would.

When you are a Google or a Twitter or an Amazon, you lose lots of money per minute of downtime, so economically it makes sense to invest in this. However, for an average startup, I don't think a couple hours of downtime per month is actually that big of a deal. Of course you need to ensure your data is always safe, and your machine configuration easy to deploy (via AMIs or something like Puppet), so you can get back up and running after a catastrophe. But at the end of the day, a good "We're currently having technical issues" screen could well be a better investment than shooting for a technical setup that stays up through catastrophic events.

Google incurred that cost early, but the tipping point where engineering for uptime became 'worth it' came pretty late in the game for Twitter.

Interestingly, AWS itself appears to be in the camp that's not heavily focused on uptime either.

A pretty universal truth is that you can afford to be down more than you expect. Even when Amazon's store is down and you can measure the immense cost by the second, a total cost analysis may well show that the outage was 'cheaper' than engineering it away.

I don't know about you guys, but most of the time Twitter doesn't work for me at all; I'd say 50-60% of the time. I mean, the page loads (v-e-r-y slowly) but the tweets don't. I almost always have to click the "reload tweets" button to see tweets. Several times.

I don't think I have ever seen such a clusterfuck of performance on any major site for such a long time.

It has come to the point that I don't even bother clicking twitter links.

Yep, there will always be a tipping point when your downtime becomes worth engineering out - this obviously varies per app. Understanding what's available to you is also important - I've met quite a few people who have no idea about some of the features available on AWS. I'm not sure if this is just a problem with AWS's docs / console, or just knowing what to look for in the first place.

Uptime is something users love regardless of the size of the company whose product they're trying to use. Investing in uptime is always worth it.

Investing in uptime is always worth it.

That simply can't be true. There is always going to be a point where an extra decimal place of reliability is too costly.

There's always a trade-off between the cost of failing and the cost of engineering it out. The problem comes from a lack of understanding about where and how apps and infrastructure fail, and how to avoid it. If you misunderstand the problem, you'll probably misjudge it.

People should absolutely at least be doing some back of the envelope math on this before choosing a strategy.

If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users, and some percentage of those users will become ex-users - you can run a small split test to figure out how many if you don't already have data on that. You'll also lose a direct half day of revenue. This type of thing will happen somewhere between once a year and once every couple of months, as low and high estimates.

Crunch those numbers, and you'll have an order of magnitude estimate of what downtime actually costs you, and what you can actually afford to spend to minimize it. Keep in mind that engineering and ops time costs quite a bit of money, and that you'll be slowing down other feature development by wasting time on HA.

For instance, let's say you're running a game with 1M DAU, and 5M total active users, making $10k per day (not sure if that's reasonable, but let's pretend), and you've figured out that 12h of downtime makes you lose approximately 10% of the users that log in during that period. In that case, 12h of downtime costs you a one-time "fee" of $5k, and also pushes away ~1% of your total users, which will cost you $100 per day as an ongoing "cost".

If we assume this happens exactly once, and that a mitigation strategy would work with 100% effectiveness, then you should be willing to spend up to $100 extra per day to implement that strategy; the $5k up-front loss is not nothing, but we can probably assume it'll get eaten up by engineering time to implement that strategy. If such a strategy would cost significantly more than $100 per day over your current costs, then by pursuing it you're assuming that "oh shit it's all gone to hell!" AWS events are likely to affect you multiple times over the period in question.

I'm not saying these numbers are realistic in any way, or that the method I've shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of it); I'm just saying that whether you pursue a mitigation strategy or not, it's not terribly difficult to ground your decision in numbers. They do tend to be right on the edge of reasonable for a lot of people, so it's worth thinking about them (good) or measuring them (better).
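The worked example above can be sanity-checked with a few lines of Python; all numbers are the hypothetical ones from the example, not real data:

```python
# Back-of-envelope downtime cost, using the hypothetical numbers above.
dau = 1_000_000              # daily active users
total_active = 5_000_000     # total active users
revenue_per_day = 10_000     # dollars
churn_rate = 0.10            # fraction of users hit by the outage who leave

outage_fraction = 12 / 24                          # a 12h outage
users_affected = dau * outage_fraction             # ~500k users log in during it
lost_revenue = revenue_per_day * outage_fraction   # one-time "fee": $5k
churned = users_affected * churn_rate              # 50k users gone for good
ongoing_cost = revenue_per_day * churned / total_active  # dollars per day

print(lost_revenue, churned, ongoing_cost)  # 5000.0 50000.0 100.0
```

Swap in your own numbers and you get the order-of-magnitude estimate the parent describes: the ongoing cost is the ceiling on what a mitigation strategy is worth per day.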

I agree with the first sentence, uptime is a very nice thing that users will notice and appreciate over time.

However, I strongly disagree with the second sentence. Investing in uptime is not always worth it. Taken to its logical extreme, imagine two potential websites. One is incredibly useful but only up 80% of the time. The other is a blank HTML page, but it is the most reliable website in history, with 0 seconds of downtime in the past 10 years. If I surveyed users of both websites, I think people would almost unanimously prefer the useful website that is only up sometimes.

Startups have limited time and resources, and in practice getting 99% uptime is relatively easy, whereas 99.9% uptime is relatively hard. That is a difference of ~7 hours of downtime per month. Yes, it sucks when your website is down, but it also sucks when there are features you can't build because you spent your time, or constrained your technical infrastructure, chasing ultra-high reliability. Obviously this depends on your industry; if you are a payment processor, you had better have super-high uptime or you aren't going to have any customers. But realistically, most companies will not lose that many customers if they are up >99% of the time.
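The arithmetic behind that ~7-hour figure, for a 30-day month (rough numbers, obviously):

```python
# Downtime budget per 30-day month at each uptime level.
hours_per_month = 30 * 24                     # 720 hours
downtime_99 = hours_per_month * (1 - 0.99)    # ~7.2 hours/month
downtime_999 = hours_per_month * (1 - 0.999)  # ~0.72 hours/month
print(round(downtime_99, 1), round(downtime_999, 2))  # 7.2 0.72
```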

There's also risks inherent in a more complicated system.

You can engineer a more complicated system with the goal of avoiding downtime, but this added complexity may end up with unexpected corner-cases and cause a net decrease in uptime, at least in the short term.

It's often better to concentrate on improving mean time to repair (MTTR).

You can't just put nodes in different regions, even with a database like MongoDB. It may work in theory; in practice you'll have all kinds of latency problems.

WAN replication is a hard problem and glossing over it by waving your hands is a disservice to readers.

"Real" solutions are to run a database that is tolerant of partitioning, and have application level code to resolve the inevitable conflicts. Riak, Cassandra and other Dynamo inspired projects offer this. On the other hand you can use a more consistent store and hide the latency with write-through caching (this is how Facebook does it with memcached + MySQL), but now you have application code that deals with managing this cache.

Either way you have to have very specific application code to handle these scenarios, and you may even run a combination of solutions for different types of data you need to store. There is no silver bullet, there is no framework or product that does it for you.
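A toy sketch of the write-through pattern described above, with plain dicts standing in for memcached and the backing database (all names are illustrative; real systems add TTLs, invalidation, failure handling, and conflict resolution):

```python
# Toy write-through cache: every write goes to both the cache and the
# backing store, so reads can usually be served from the fast cache.
cache = {}   # stands in for memcached
store = {}   # stands in for the replicated MySQL tier

def write(key, value):
    store[key] = value   # durable write (slow, possibly cross-region)
    cache[key] = value   # keep the cache coherent with the store

def read(key):
    if key in cache:         # fast path: served from the local cache
        return cache[key]
    value = store.get(key)   # cache miss: fall back to the store
    if value is not None:
        cache[key] = value   # repopulate the cache for next time
    return value
```

The point being illustrated is exactly the parent's: the application code, not the database, owns the job of keeping cache and store coherent.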

Most of the current MongoDB drivers support routing read queries to the lowest latency replica set member - this solves part of the problem.

Choice of database when planning a project is a more fundamental problem: knowing what to use and why. Is the trade-off of Riak / Cassandra worth it over MongoDB or even MySQL? People decide this on a per-project basis, and of course they don't always make the right longer-term choice when starting out.

Running a multi-region Cassandra cluster is ill advised. Cassandra (and Dynamo databases in general) are quite chatty. I think Netflix has implemented some multi-region clusters. Other companies too I'm sure. But it will certainly give you heaps of new challenges (and heaping bandwidth bills.)

Your information is obsolete. Cassandra will only send a single copy of your updates cross-region, which will then be rereplicated within each region if necessary.

That's certainly good news. Appreciate the correction.

I agree wholeheartedly. HA is hard, a single blog post won't even halfway cover it.

HA isn't actually hard, but it does require some forethought and choosing some technologies which aren't necessarily cool. However, I would submit that building HA capabilities into a cat-photo-sharing or microblogging site is sort of overkill; most people don't need it. Just take the hit and move on; people are getting more and more used to sites being down or failing, and they just retry later. As much as I hate to say it, I think that's fairly accurate.

You pose an interesting question that I want to see answered: do users of B2C sites actually care when the site is down? Does it decrease MAU? Your intuition is that it does not, but I'd love to see some data.

Agreed, this is just a rough guide to it :-)

A really rough guide. Rough enough that I thought of each of these points last night at 1 am while extremely groggy and trying to bring back the 7 or so instances we lost in the outage.

I guess it is a good play to get traffic to your site.

Reading all of these post-mortems and guides to keeping your servers up and running, it strikes me how much AWS jargon is in there.

The fact that so many developers have invested so much time into learning Amazon-specific technologies means that developers are left to deal with the problem within that worldview. Going multiple-datacenter means learning two of every technology layer.

You could solve all of these problems using standard non-Amazon unix tools, technologies, and products; however, Amazon has enabled a whole class of development that makes it easier to just work within their system. It's easier to wait for Amazon to figure it out for the general case and trust them than to figure it out and implement it yourself.

There are other risks with being the lone-wolf but for a lot of people, being in the herd has a certain kind of safety, despite the limitations.

Not making a judgement call on it but it is something that I have noticed with these outages.

Agreed. Hopefully OpenStack or similar gets some serious traction and the IaaS players move closer to simply being providers of hardware. The likelihood of that happening seems quite slim while AWS holds such a strong position in the market though. And a significant chunk of AWS's success is due to their innovation in the software layer.

One could draw a lot of parallels between OpenStack/AWS and Android/iOS development. It is in the best interest of an OpenStack provider to differentiate their offering rather than compete on price through the common platform. Just as there are power users who want to configure, build, and fully control their mobile device, there's a huge class of people who just want a working stack from one provider. I consider Rackspace to be relatively far along in implementing the OpenStack vision, and they're still far behind the offerings AWS has. Google is in a similar position with their new cloud initiative.

Like with HTML5 I'd like to see all of the main datacenters implement full stacks and then have OpenStack be a 'codify what works' project. We need 4 or 5 providers who have the depth and breadth of what AWS has.

It seems like you can make a good tradeoff between Chef and AMIs by nightly rebuilding the AMIs off a fully configured system, and then when the machine comes up you run Chef to make up the incremental difference.

Yep! You could also build the AMIs using a build / CI server each time you push, if you wanted. Once you've nailed the process and are used to how long it takes, you'll love it.
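As a sketch of the "baked AMI plus incremental catch-up" idea (the environment name and log path here are made up), the boot-time step could live in the instance's EC2 user-data:

```shell
#!/bin/bash
# EC2 user-data: the AMI was baked from a fully configured system last
# night, so on boot we only apply whatever config drifted since the bake.
chef-client --once --environment production --logfile /var/log/chef-boot.log
```

The nightly bake keeps boot fast; the single Chef run closes the gap between the image and the current desired state.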

There's this really great thing called a CDN that can be used to keep your "web site" up at all times, even if your origin servers are down. It doesn't help your web app, but it's better than looking like you've disappeared from the planet.

The problem with just using a simple CDN (i.e. one without proper ESI support, like CloudFront) is that dynamic content is cached at a fixed TTL per URI, which for 'apps' is likely to be very short. This means your content will expire and then be re-requested from the origin... which would be down.

However for static or near static sites, it's perfect - just make sure your TTL is correct :-)

I guess I should've included the advice of "make your main page static and use a good CDN", because they both help you a lot in the long run.

>This means your content will expire and then be retried from the origin...which would be down.

I don't think ESI is necessarily a requirement to configure your cdn to serve stale content in case of the backend being unreachable.
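For example, a caching proxy in front of the origin can be told to keep serving expired objects when the backend fails. In nginx (assuming a `proxy_cache` zone named `app_cache` and an upstream named `origin_backend` are already configured; both names are illustrative) that's roughly:

```nginx
# Serve a stale cached copy instead of an error page when the origin
# is unreachable, times out, or returns a 5xx.
location / {
    proxy_pass http://origin_backend;
    proxy_cache app_cache;
    proxy_cache_valid 200 1m;    # normal TTL for cacheable responses
    proxy_cache_use_stale error timeout http_500 http_502 http_503;
}
```

Many commercial CDNs expose an equivalent "serve stale on origin error" knob.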

No, but it's a likely requirement of a nontrivial app. You can emulate it on AWS if you're using something like backbone, but only for public content.

That's correct; oftentimes people don't even set cache times on their dynamic content.

I think it's because most web developers don't ever have to understand how this stuff works. It's sometimes easier to just throw hardware at a problem than engineer it better. However, understanding what you can do is key imho.

Isn't the point that dynamic content is changing and you don't want to cache it? What would be a good cache time for HN for instance?

You can use Puppet to pre-bake your AMIs for you, so you can scale very rapidly and still use configuration management to maintain your instances.


This blog post is hilarious.

A region, in AWS-speak, is already multiple supposedly independent data centers (in AWS terms: AZs, or availability zones).

So if an entire region fails, that's four or so data centers which all go down at the same time.

So how many companies on bare metal have four data centers and experience this kind of catastrophic downtimes? Add to that, how many of these companies operate completely in the dark about which data center is actually which?

These blog posts are annoying because it seems like these people have never done anything they suggest themselves.

Yes, the cloud lets you set up a fully configured instance within minutes. But at what expense? Mostly a lack of transparency about the entire stack and what is going on.

Food for thought.

That's just it - they're supposedly separate - but each zone is on the same physical site, albeit separated. Unlikely to break at the same time, but more likely than two truly separate locations. Just Google for AWS multi-zone outages.

Anyway - the article isn't suggesting using four datacentres; it's trying to make people aware of the simple steps that can be taken to avoid failure.

OK, to be more obvious: by comparison, a multi-region setup is maybe over the top, because essentially you are then distributing over something like eight data centers.

Because you cannot stay in a single AZ either – who knows what will go down and when.

It's insane how high this post is on this website. But it goes to show that neither you nor the author of this blog post has ever attempted replication over a WAN; otherwise it would not be called simple.

Do you realize that most of the serious Amazon outages of the past 2 years have had multi-AZ effects? Either due to the root cause, or control plane, or load.


I assumed this was a response to the recent hella-long outage:

"There are three major lessons about IaaS we've learned from this experience:

1) Spreading across multiple availability zones in single region does not provide as much partitioning as we thought. Therefore, we'll be taking a hard look at spreading to multiple regions. We've explored this option many times in the past - not for availability reasons, but for customers wishing to have their infrastructure more physically nearby for latency or legal reasons. We've always chosen to prioritize it below other ways we could spend our time. It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and find some way to propagate region information between the app and the services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region's impact on availability, we'll be considering it a much higher priority.

2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can. Block storage has physical locality that can't easily be transferred. That makes it not a cloud-friendly technology. With this information in hand, we'll be taking a hard look on how to reduce our dependence on EBS.

3) Continuous database backups for all. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we were able to take advantage of this to quickly recover the dedicated databases that were down with EBS problems."

Then I checked the date. It's actually Heroku's response to their super-long April 2011 outage. Yet, it appears the "we should go across Regions" lesson wasn't learned.

Love AWS. We're using Heroku and it's been pretty painful over the last month or so. However, it's super easy. At some point they should have an SLA, as their underlying hosting (aka AWS) provides one, with credits for outages. The number one Heroku feature that would help is being able to specify multiple zones or regions when making new apps.

The failure of a single AZ in a single region does not trigger AWS' SLA.

This article takes a too simplistic view of real world deployments on AWS, and attempts to sum it up with 5 bullet points. Yes, I know the title is a "rough" guide, but why not go into more depth and acknowledge that there's more diversity in terms of deployment models out there? The other option would have been to keep it very high level and not talk about specific tools.

E.g. Use Route 53? Isn't that hosted on Amazon itself? Why create another point of failure?

MongoDB - How many big sites on the cloud use it as their primary database?

The only takeaway for me was the last paragraph - "In conclusion, you probably used a single zone because it’s easy (hey - so do we for now!). There will come a point where the pain of getting shouted at by your boss, client or customers outweighs learning how to get your app setup properly yourself."

Regarding doing un-encrypted cross-datacenter replication with MongoDB, I recommend the author of this blog post read this: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Comput...

SSL support is available - http://docs.mongodb.org/manual/administration/ssl/ - and latency can be an issue. Most drivers support routing read queries to the lowest-latency slave. Again, obviously this is not for everyone; it's always a trade-off. Hence the title: rough guide. The point of this post was to increase awareness.

Am I the only one who feels like the author missed the mark?

The blog post is supposed to be about "keeping your website up through catastrophic events" and the dominant theme seems to be "invest more heavily on Amazon." IMHO, the exact opposite needs to happen.

Sure, I understand that being in multiple regions means you supposedly have very autonomous deployments (including Amazon's API endpoints), but nobody can prove to us that each zone, or even each region, is totally separate.

I'm not saying that Amazon is being dishonest about their engineering - I simply believe that by being 100% reliant on a single vendor you fail to mitigate any systemic risk that is present. That risk can be technical risk or business risk, as engineering at this level isn't strictly a technical profession.

The article mentions using custom origins with CloudFront, and I don't understand why setting up origin.mydomain.com was required. At work, we use mydomain.com directly as the custom origin, and setup was super simple (just tell CloudFront the domain). Is there anything wrong with doing it this way?

No, there's nothing wrong with that. What the author is doing is serving the entire domain (i.e. www.mydomain.com, not just JS+CSS+images) off of CloudFront, so you obviously need a separate host name for the actual server. Hence origin.mydomain.com.

(Note that you need to use a www subdomain for this, or CNAMEs don't work.)

What about enhancing the autoscaling / automated configuration & deployment / AMIs part of the article with virtualization (ESXi, with or without vMotion)? No need for configuration and deployment: just duplicate your VM to have it on the other site, or let vMotion move it for you.

Okay, so before all that, I guess the first crucial step is to have the site/system monitored by some third party such as Monitive or Pingdom. Then you can take action based on information (facts).
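At its core, that kind of third-party check is just an HTTP GET with a timeout, run from a box outside your own infrastructure. A minimal sketch (real services like Pingdom add multiple probe locations, retries, and alerting):

```python
import urllib.request

def is_up(url, timeout=10):
    """Return True if the site answers with a 2xx/3xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        # DNS failure, connection refused, timeout, or a 4xx/5xx response.
        return False
```

The crucial part is that this runs somewhere other than the servers it watches; a monitor that goes down with your site tells you nothing.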


So before anyone else points this out - yes our app is currently hosted in a single zone, and no we do not plan on keeping it this way! (we're currently in early Alpha)

sounds hard

It's not as simple as using Heroku, but it's not that hard. If you don't deploy very often, using baked AMIs is the easiest solution. Automating your build / deployment process to do this is worth the effort in the long run, especially if you want to scale up and down fast.
