When you are a Google or a Twitter or an Amazon, you lose lots of money per minute of downtime, so economically speaking it makes sense for them to invest in this. However for an average startup, I don't think having a couple hours of downtime per month is actually going to be that big of a deal. Of course you need to ensure your data is always safe, and your machine configuration is also easy to deploy (via AMIs or something like puppet) so you have the ability to get back up and running in catastrophic cases, but at the end of the day having a good "We're currently having technical issues" screen could very well be a better investment than shooting for a technical setup that can still be up during catastrophic events.
Interestingly, AWS itself appears to be in the camp that isn't heavily focused on uptime either.
A pretty universal truth is that you can afford to be down more than you expect. Even when Amazon's store is down and you can measure the immense cost by the second, a total cost analysis may well show that the outage was 'cheaper' than engineering it away would have been.
I don't think I have ever seen such a clusterfuck of performance on any major site for such a long time.
It has come to the point that I don't even bother clicking Twitter links.
That simply can't be true. There is always going to be a point where an extra decimal place of reliability is too costly.
If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users, and some percentage of those users will become ex-users - you can run a small split test to figure out how many if you don't already have data on that. You'll also lose a direct half day of revenue. This type of thing will happen somewhere between once a year and once every couple of months, as low and high estimates.
Crunch those numbers, and you'll have an order of magnitude estimate of what downtime actually costs you, and what you can actually afford to spend to minimize it. Keep in mind that engineering and ops time costs quite a bit of money, and that you'll be slowing down other feature development by wasting time on HA.
For instance, let's say you're running a game with 1M DAU, and 5M total active users, making $10k per day (not sure if that's reasonable, but let's pretend), and you've figured out that 12h of downtime makes you lose approximately 10% of the users that log in during that period. In that case, 12h of downtime costs you a one-time "fee" of $5k, and also pushes away ~1% of your total users, which will cost you $100 per day as an ongoing "cost".
If we assume this happens exactly once, and that a mitigation strategy would work with 100% effectiveness, then you should be willing to spend up to $100 extra per day to implement that strategy; the $5k up-front loss is not nothing, but we can probably assume it'll get eaten up by engineering time to implement that strategy. If such a strategy would cost significantly more than $100 per day over your current costs, then by pursuing it you're assuming that "oh shit it's all gone to hell!" AWS events are likely to affect you multiple times over the period in question.
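The back-of-envelope math above can be sketched in a few lines (using the hypothetical numbers from the example; the churn rate and revenue figures are made up, as the comment says):

```python
# Back-of-envelope downtime cost model with the hypothetical numbers
# from the example above: 1M DAU, 5M total active users, $10k/day,
# and 10% of affected users churning after a 12h outage.
dau = 1_000_000
total_users = 5_000_000
revenue_per_day = 10_000.0

outage_hours = 12
churn_rate = 0.10  # assumed fraction of affected users who leave for good

affected = dau * outage_hours / 24           # ~500k users see the outage
lost_users = affected * churn_rate           # ~50k users churn (~1% of total)
revenue_per_user = revenue_per_day / total_users

one_time_loss = revenue_per_day * outage_hours / 24    # direct lost revenue
ongoing_loss_per_day = lost_users * revenue_per_user   # churned users' revenue

print(one_time_loss)         # 5000.0
print(ongoing_loss_per_day)  # 100.0
```

The $100/day figure is the ceiling on what a (perfectly effective) mitigation strategy is worth, per the reasoning above.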
I'm not saying these numbers are realistic in any way, or that the method I've shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of it); I'm just saying that whether you pursue a mitigation strategy or not, it's not terribly difficult to ground your decision in numbers. They do tend to be right on the edge of reasonable for a lot of people, so it's worth thinking about them (good) or (better) measuring them.
However, I strongly disagree with the second sentence. Investing in uptime is not always worth it. Taken to its logical extreme, imagine two potential websites. One of them is incredibly useful but only up 80% of the time. The other one is a blank HTML page, but it is the most reliable website in history, with 0 seconds of downtime in the past 10 years. If I surveyed users of both websites, I think the preference for the useful website that was up only sometimes would be almost unanimous.
Startups have limited time and resources, and in practice getting 99% uptime is relatively easy, whereas 99.9% uptime is relatively hard. That is a difference of ~6.5 hours of downtime per month. Yes, it sucks when your website is down, but it also sucks when there are features you can't develop because you don't have the time, or because your technical infrastructure doesn't allow them, since you're chasing ultra-high reliability. Obviously this depends on your industry - if you are a payment processor you had better have super high uptime or you aren't going to have any customers - but realistically most companies will not lose that many customers if they are up >99% of the time.
You can engineer a more complicated system with the goal of avoiding downtime, but this added complexity may end up with unexpected corner-cases and cause a net decrease in uptime, at least in the short term.
It's often better to concentrate on improving mean time to repair (MTTR).
WAN replication is a hard problem and glossing over it by waving your hands is a disservice to readers.
"Real" solutions are to run a database that is tolerant of partitioning and have application-level code to resolve the inevitable conflicts; Riak, Cassandra and other Dynamo-inspired projects offer this. Alternatively, you can use a more consistent store and hide the latency with write-through caching (this is how Facebook does it with memcached + MySQL), but now you have application code that deals with managing that cache.
Either way you have to have very specific application code to handle these scenarios, and you may even run a combination of solutions for different types of data you need to store. There is no silver bullet, there is no framework or product that does it for you.
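To make the write-through-caching point concrete, here is a minimal sketch of the pattern (the class and backends are hypothetical illustrations, not Facebook's actual implementation; real deployments also have to handle invalidation races between concurrent writers, which this ignores):

```python
# Minimal write-through cache sketch: every write hits the
# authoritative store first, then refreshes the cache, so reads
# served from the cache don't see data the store never accepted.
class WriteThroughCache:
    def __init__(self, store):
        self.store = store   # authoritative (slower, consistent) backend
        self.cache = {}      # fast cache in front of it

    def write(self, key, value):
        self.store[key] = value   # durable write first
        self.cache[key] = value   # then update the cache

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.store[key]   # cache miss: fall through to the store
        self.cache[key] = value   # populate for subsequent reads
        return value

backing = {}  # stand-in for the real database
c = WriteThroughCache(backing)
c.write("user:1", "alice")
print(c.read("user:1"))  # alice, served from cache
```

Even in this toy version you can see the point of the parent comment: the cache-management logic lives in your application code, not in the database.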
Choice of database when planning a project is a more fundamental problem - knowing what to use and why. Is the trade-off of Riak or Cassandra worth it over MongoDB or even MySQL? People decide this on a per-project basis, and of course when starting out they don't always make the right long-term choice.
I guess it is a good play to get traffic to your site.
The fact that so many developers have invested so much time into learning Amazon-specific technologies means that developers are left to deal with the problem within that worldview. Going multi-datacenter means learning two of every technology layer.
You could solve all of these problems using standard non-Amazon Unix tools, technologies, and products, but Amazon has enabled a whole class of development that makes it easier to just work within their system. It's easier to wait for Amazon to figure it out for the general case, and trust them, than to figure it out and implement it yourself.
There are other risks with being the lone-wolf but for a lot of people, being in the herd has a certain kind of safety, despite the limitations.
Not making a judgement call on it but it is something that I have noticed with these outages.
As with HTML5, I'd like to see all of the main datacenter providers implement full stacks and then have OpenStack be a 'codify what works' project. We need 4 or 5 providers with the depth and breadth of what AWS has.
However for static or near static sites, it's perfect - just make sure your TTL is correct :-)
I don't think ESI is necessarily a requirement; you can configure your CDN to serve stale content when the backend is unreachable.
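For example, with nginx as the caching layer in front of the origin, serving stale content on backend failure is a couple of directives (a sketch with hypothetical upstream names and timings; cache sizes and validity would need tuning for a real site):

```nginx
# Serve previously cached responses when the origin is down.
proxy_cache_path /var/cache/nginx keys_zone=static:10m;

server {
    location / {
        proxy_pass http://backend;
        proxy_cache static;
        proxy_cache_valid 200 10m;
        # Fall back to stale cached content on origin errors and timeouts
        proxy_cache_use_stale error timeout http_500 http_502 http_503 http_504;
    }
}
```

Many commercial CDNs expose the equivalent behaviour (often called "serve stale" or "stale-if-error") as a configuration option.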
A region in AWS-speak is already multiple supposedly independent data centers (in AWS terms: AZs, availability zones).
So if an entire region fails, that's four or so data centers which all go down at the same time.
So how many companies on bare metal have four data centers and experience this kind of catastrophic downtimes? Add to that, how many of these companies operate completely in the dark about which data center is actually which?
These blog posts are annoying because it seems like these people have never done anything they suggest themselves.
Yes, the cloud lets you set up a fully configured instance within minutes. But at what expense? Mostly opacity about what the entire stack looks like and what is going on.
Food for thought.
Anyway - the article isn't suggesting to use four datacentres, it's trying to make people aware of the simple steps that can be taken to avoid failure.
Because you cannot stay in a single AZ either – who knows what will go down and when.
It's insane how this post is so high up on this website. But it goes to show that neither you nor the author of this blog post has ever attempted any replication over WAN. Otherwise it would not be called simple.
I assumed this was a response to the recent hella-long outage:
"There are three major lessons about IaaS we've learned from this experience:
1) Spreading across multiple availability zones in single region does not provide as much partitioning as we thought. Therefore, we'll be taking a hard look at spreading to multiple regions. We've explored this option many times in the past - not for availability reasons, but for customers wishing to have their infrastructure more physically nearby for latency or legal reasons. We've always chosen to prioritize it below other ways we could spend our time. It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and find some way to propagate region information between the app and the services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region's impact on availability, we'll be considering it a much higher priority.
2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can. Block storage has physical locality that can't easily be transferred. That makes it not a cloud-friendly technology. With this information in hand, we'll be taking a hard look on how to reduce our dependence on EBS.
3) Continuous database backups for all. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we were able to take advantage of this to quickly recover the dedicated databases that were down with EBS problems."
Then I checked the date. It's actually Heroku's response to their super-long April 2011 outage. Yet, it appears the "we should go across Regions" lesson wasn't learned.
E.g., use Route 53? Isn't that hosted on Amazon itself? Why create another point of failure?
MongoDB - How many big sites on the cloud use it as their primary database?
The only takeaway for me was the last paragraph - "In conclusion, you probably used a single zone because it’s easy (hey - so do we for now!). There will come a point where the pain of getting shouted at by your boss, client or customers outweighs learning how to get your app setup properly yourself."
The blog post is supposed to be about "keeping your website up through catastrophic events," and the dominant theme seems to be "invest more heavily in Amazon." IMHO, the exact opposite needs to happen.
Sure, I understand that being in multiple regions means you supposedly have very autonomous deployments (including Amazon's API endpoints), but nobody can prove to us that each zone, or even each region, is totally separate.
I'm not saying that Amazon is being dishonest about their engineering - I simply believe that by being 100% reliant on a single vendor you fail to mitigate any systemic risk that is present. That risk can be technical risk or business risk, as engineering at this level isn't strictly a technical profession.
(Note that you need to use a www subdomain for this, or CNAMEs don't work.)
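The reason for the www caveat: a CNAME can't coexist with the SOA and NS records that must live at the zone apex (per the DNS specification), so the bare domain needs an A record (or a provider-specific ALIAS/ANAME), while the CDN CNAME goes on the subdomain. A sketch of the zone layout, with hypothetical names:

```
; Apex can't be a CNAME - point it at an IP (or use a provider ALIAS)
example.com.      3600  IN  A      192.0.2.1
; The www subdomain is free to CNAME to the CDN endpoint
www.example.com.  3600  IN  CNAME  cdn-endpoint.example.net.
```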