"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." 
"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." 
For us, we are just now staffing up to the level where we can make the changes necessary to do the same thing.
Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.
When you're disrupting a market, remember that what seems trivial is in fact hugely important when you're fighting huge, well-established competition :)
I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.
This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.
And if they do mean three regions: can the cost of spanning regions be quantified for different companies? The money spent vs. money earned for Netflix may be very different compared to Quora or Reddit. At the same time, the data synchronization needs between regions may also vastly differ for different types of companies and infrastructures, leading to very different costs of keeping a site running across multiple regions.
1. See slides 32-35 of http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011
2. "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util."
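The arithmetic behind that quote checks out; a minimal sketch (taking 60%, the top of the stated autoscale band, as the steady-state utilization):

```python
azs = 3
target_util = 0.60   # top of the quoted 30-60% autoscale band

# Load carried across all three AZs, in units of one AZ's full capacity.
total_load = azs * target_util            # 1.8 capacity-units

# Lose one AZ: the same load must fit on the two survivors.
util_after_loss = total_load / (azs - 1)

print(round(util_after_loss, 2))  # 0.9 -> the quoted "Lose an AZ -> 90% util"
```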
Last 60 minutes comparison data: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-60m.png
time in GMT.
A study we (Cedexis) did in January comparing multiple ec2 zones and other cloud providers: (pdf) http://dl.dropbox.com/u/1898990/76-marty-kagan.pdf
Amazon's EBS SLA is less clear, but they state that they expect an annual failure rate of 0.1-0.5%, compared to commodity hard-drive failure rates of 4%. Hence, if you wanted a higher level of data availability you'd use more than one EBS volume in different regions.
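The gain from keeping a second copy is easy to quantify from the stated failure rates; a rough sketch assuming volume failures are independent (which, as discussed elsewhere in this thread, is optimistic):

```python
ebs_afr = 0.005   # upper end of Amazon's quoted 0.1-0.5% annual failure rate
hdd_afr = 0.04    # commodity hard drive rate, as quoted above

# With two independent copies of the data, you lose it only if both
# volumes fail in the same year (ignoring repair time, so pessimistic).
both_fail = ebs_afr ** 2

print(f"single EBS volume: {ebs_afr:.2%} annual loss risk")
print(f"two EBS copies:    {both_fail:.4%} annual loss risk")
print(f"commodity drive:   {hdd_afr:.0%} annual loss risk")
```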
These outages are affecting North America, and not Europe and Asia Pacific. That's it. Why is this even news? Were you expecting 100% availability?
Amazon's EC2 SLA is extremely clear: a given region has an availability of 99.95%. If you're running a website and you haven't deployed across more than one region then, by definition, your website will have 99.95% availability. If you want a higher level of availability, use more than one region.
let P(region fails) = 0.05% and let's assume (and hope) that the probability of failure of one region is independent of the state of the other regions.
P(two regions fail) = P(one region fails and another region fails) = P(region fails) * P(region fails) = 0.05% * 0.05% = 0.000025%
Making your availability = 100% - 0.000025% = 99.999975%
Ultimately it's more of a business decision whether you want to pay for the extra ~0.05% of availability. I would think (or hope) that most engineers would want it anyway.
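A quick check of the arithmetic above (note that 0.05% × 0.05% is 0.000025%, not 0.0025%, since 0.05% = 0.0005), still assuming region failures are independent:

```python
p_fail = 0.0005                # 0.05% chance a given region is unavailable

p_both = p_fail ** 2           # both regions down at once, by independence
availability = 1 - p_both

print(f"one region:  {1 - p_fail:.4%}")    # 99.9500%
print(f"two regions: {availability:.6%}")  # 99.999975%
```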
The numbers at this size appear insignificant. How would one (say, an engineer) convince "the management" that the extra ~0.05% of availability is worth the investment/expense?
In practice, that's not true, and it's not true enough to ruin the entire rest of your calculations. For Amazon regions to fail independently, they'd have to be actually, factually independent, with no interaction between them. But the reaction to one node going down is precisely to increase the load on the other nodes, as people migrate services, etc. There's fundamentally nothing you can do about the fact that if enough of your capacity goes out, you will experience demand in excess of supply.
If you want true redundancy you will at the very least need to go to another entirely separate service that is not Amazon... and if enough people do that, they'll break the effective independence of that arrangement, too.
(This is a special case of a more general rule, which is that computers are generally so reliable that the ways in which their probabilities deviate from Gaussian or independence tends to dominate your worst-case calculations.)
After today's event, it would certainly be interesting to see how resource consumption changed in other availability zones and at other providers during this outage.
I wonder if that could be measured passively? What I mean is, by monitoring response times of various services that are known to be in specific regions and seeing how that metric changes (as opposed to waiting on a party that has little-to-no economic benefit to release that information.)
Of course, redundancy doesn't set itself up, so there are added costs on top of Amazon.
If expanding to another region costs more than just taking the outage, then it's categorically not a good option. If management still says no in the face of numbers that suggest yes, then that tells you that you're missing a hidden objection, and how you proceed will depend on a lot of factors specific to your situation.
Not complex even factoring in reputational damage?
In businesses where physical goods are sold to customers, "the management" is generally very motivated to avoid stock-out situations in which sales are lost due to lack of inventory (even if it's only a very small percentage), because they are concerned about the potential loss of customer goodwill. It seems that the same applies in this situation.
Of course, that's assuming that 100% of Reddit's problems were due to EBS only and not a combination of EBS, EC2 and their own code.
Which always made sense to me. I pay you for 99.5% uptime this month. If you don't achieve it, then I get a discount, as simple as that. If your availability is below that, I don't pay full price for the billable period and then reconcile at the end of the year.
Any links or general advice on this topic anyone has I'd be pretty interested on finding out if there's a general consensus of it being done differently?
Perhaps the Reddit admins decided that the "up to 0.05%" downtime permitted by the SLA would be acceptable, compared to the extra expense of using more of Amazon's services (and any coding/testing time needed to take advantage of the redundancy, depending on how automatic Amazon's load balancing and/or failover are). By my understanding, the promise isn't 99.95% if you use more than one of their locations; it is 99.95% at any one location. So the fact that Reddit doesn't use more than one location is irrelevant when talking about the one location they do use failing to meet the expectations listed in the SLA.
I'm not saying Reddit's implementation decision is right (I don't have the metrics available to make such a judgement) but it would have been made based partly on that 99.95% figure and how much they trusted Amazon's method of coming to that figure as a reliability they could guarantee. If I had paid money for a service with a 99.95% SLA, unless the SLA had no teeth, I would be expecting some redress at this point (though there is probably no use nagging Amazon about that right now: let them concentrate on fixing the problem and worry about explanations/blame/compo later once things are running again).
A number of tier 1 network providers offer certain customers SLAs that are clearly in place to prove that they invest in redundancy and disaster planning. E.g.: less than 99.99% -> 10% credit; less than 99.90% -> no charges for the circuit in the billing period.
This reflects an understanding that downtime can hurt your business/infrastructure far in excess of the measurable percentage.
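A tiered credit scheme like the one described can be sketched as a simple lookup (the thresholds and credit percentages here are the example ones quoted above; real contracts vary):

```python
def sla_credit(availability: float) -> float:
    """Return the billing credit (as a fraction of the period's charge)
    for a measured availability, using the example tiers above:
    below 99.99% -> 10% credit; below 99.90% -> 100% credit (free)."""
    if availability < 0.9990:
        return 1.00   # circuit is free for the billing period
    if availability < 0.9999:
        return 0.10   # 10% credit
    return 0.0        # SLA met, no credit

print(sla_credit(0.99995))  # 0.0  (SLA met)
print(sla_credit(0.9995))   # 0.1
print(sla_credit(0.998))    # 1.0
```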
> they would have chosen to deploy across more than one region.
It's far too costly to do that. We are deployed across multiple AZs, but this failure hit multiple AZs.
We'll get there one day, but we aren't there yet.
Gah! You can't always account for all the failure modes that Amazon might have.
Yes, multi-region availability on AWS is hideously expensive. However, some organisations value an availability of greater than 99.95% enough to warrant such a multi-region deployment. Clearly reddit, and many, many other AWS users, do not. This isn't a value call on my part; I definitely couldn't afford the inter-region data transfer costs, all I know is that AWS offers you the tools to deploy high availability web services.
Amazon are the ones who should have made backups in multiple regions, and transferred the load on failure.
And Amazon does have all their stuff available in multiple regions. It's up to you to use it though.
If that were the case, you wouldn't be presented with region and availability zone options.
Would it make sense for Amazon to maintain automatic backups (and potentially charge more for them)? I don't know. It might make business sense, it might not. But their service is apparently popular enough even without it.
That's a cache of it. I really wish that the admins at reddit would implement something like this themselves, then link to it when downtime like this happens.
I think the reason this is news is because it is a massive Amazon failure.
The main reason this is news is that it's an Amazon issue, but also because tens of thousands of people who frequent the site regularly are now aimlessly browsing the internet in the hope of finding alternative lulz, and in my case some of us are even getting work done. *shudder*
Frankly, for the size of the site, they do really, really well for the limited resources they have.
Anything coming up for amazon? If not for anything else, for pure entertainment value!
EC2, EBS and RDS are all down on US-east-1.
Edit: Heroku, Foursquare, Quora and Reddit are all experiencing subsequent issues.
If you actually wish to make a useful point about the practicality or otherwise of massively virtualised systems for webapp deployment, please do. It's going to take more than two words though.
Anyway, my bad, I was just trying to make a joke to lighten up the mood. Sorry.
If you actually wish to bitch about a post that we all got the point of, you're going to need more than two paragraphs.
So I guess that is the long way of saying that hopefully it won't happen again.
Doing things the right way with EC2 means using EBS. It's the brake caliper to EC2's rotor. Sure, you could have drum brakes instead, but they're nowhere near as effective since they quickly get heat soaked. (I'm referring to S3.)
One shouldn't trust ephemeral storage: your instance can go down at any time. And write speeds to S3 are not nearly as fast as ephemeral disks or EBS arrays (RAID).
Hate to say it, but If one cannot trust EBS then what the heck are 'we' doing on EC2... EBS quality should be priority one, otherwise we're all building Skyscrapers on foam foundations of candy cane rebar.
1. Sharding data
2. Pulling tables out to other servers from the main DB
3. Pruning excessive data
4. Compressing data
We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There is minimal additional cost, and you can get better performance, stability, and protection (RAID 5, 6).
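As a sketch of that setup (device names and volume count are illustrative; the EBS volumes must be created and attached first, and note that a software RAID of EBS volumes does not protect against a whole-AZ EBS outage like this one):

```shell
# Stripe eight small EBS volumes, attached as /dev/xvdf../dev/xvdm,
# into one RAID 6 array (survives any two volume failures).
mdadm --create /dev/md0 --level=6 --raid-devices=8 \
      /dev/xvd{f,g,h,i,j,k,l,m}

# Record the array so it reassembles on reboot, then format and mount.
mdadm --detail --scan >> /etc/mdadm.conf
mkfs.ext4 /dev/md0
mount /dev/md0 /data
```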
Gluster makes an OSS distributed filesystem that runs across availability zones, our AMI (not OSS) builds multiple RAID arrays on each instance then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.
It's hard to tell for sure since there isn't any load.
I am just a little freaked out right now.
Amazon have stated many times that amazon.com itself runs mostly on the AWS platform, but it's working fine now...
Probably just a coincidence.
Edit: I tweeted their European editor about it and he's posted a story up now.
But yeah, right now we're shutting everything down to try and avoid possible data corruption. Once they restore service, hopefully we'll be able to come back quickly.
I wonder if Rackspace really wants this particular traffic burden. It seems that if Reddit chooses not to pay for the capacity they need, then you get lots of bad press for it... perhaps I'm seeing it wrong.
Rubbish analogy: kinda like if I ran a haulage business and you called out for a wheelbarrow to carry some elephants; then when the barrows broke, we got the bad press, despite you never having paid for the heavy animal transport package... OK, it's all going wrong, you get the idea.
However, in this case, the outage is not because of any issues with our setup, but with Amazon.
So is it a financial constraint with Amazon? Would you be suffering the same sorts of outages regardless of the technology on the backend or does AWS basically suck?
Statements like the one you're quoting are in that context. Let's say you have an unlimited operating budget - you can come up with all kinds of wonderful plans for massive redundancy and zero downtime. But you can't make that happen if you're not allowed to hire any engineers or sysadmins! As far as I'm aware reddit are paying Amazon mucho dinero but still having irredeemable problems with the storage product, EBS. They are stuck on an unreliable service without the manpower to move off.
That's the story, as far as I can piece together from comments here and on reddit.
It's not making money and those looking after reddit don't want to ruin it with a huge money grab - instead taking a soft approach, first just begging for money, then adding in a subscription model (freemium anyone?) and more subtle advertising by way of sponsored reddits (/r/yourCompany'sProduct type stuff).
I understand they've been hit with more staff problems just recently despite having a new [systems?] engineer start with them.
So in your view EBS is the problem regardless of finance? That was the nut I was attempting to crack. TBH I didn't expect someone at reddit to stick their neck out and say "yeah Amazon sucks" but they might have confirmed that the converse was true and they were simply lacking the necessary finance to support the massive userbase they have.
Is this really so, or are Rackspace and co. just "boutique" offerings?
but officially supported (and paid for) by you?
Anyway, thanks for your time.
Good plan though. :p
Right now we may as well be on reddit.
reddit isn't either, but we lost that battle a while ago.
EDIT: I also simply greeted jedberg, and a bunch of people thought that was a good reason to downvote. Do people think there's an imminent influx of redditors, and that they have to dissuade them from becoming HNers? I don't think that's the case.
EDIT: Fuckin' called it.
It looks like EBS will randomly decide to switch to a few bps of performance from time to time. I would use Amazon for my startup, but these issues really make it hard to justify.
I don't work for reddit anymore (as of about a week ago, although I didn't get as much fanfare as raldi did), but I can tell you that they're giving Amazon too much credit here. Amazon's EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year and they've constantly been making promises that they haven't been keeping, passing us to new people (that "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it.
cache: - scroll down for comment - http://webcache.googleusercontent.com/search?q=cache:cfbs-sp...
If you get a large network load to your instance - say, a DDoS attack - you can find you no longer have enough network capacity to talk to your EBS disks.
This is what happened to Bitbucket in 2009: http://blog.bitbucket.org/2009/10/04/on-our-extended-downtim...
If multiple AZs are down, AWS are going to have some serious explaining to do...
Interesting. This is news to me.
I found some more info here (Google cache since alestic.com is presently unreachable): http://webcache.googleusercontent.com/search?q=cache:0jxzyFj...
We're experiencing problems with two of our ELBs, one indicating instance health as out of service, reporting "a transient error occurred". Another, new LB (what we hoped would replace the first problematic LB), reports: "instance registration is still in progress".
A support issue with Amazon indicated that it was related to the ongoing issues and to monitor the Service Health Dashboard. But, as I mentioned before, ELB isn't mentioned at all.
The interesting thing about ELB in a situation like this: I believe it may, in many instances, be better to hobble along with an elevated error rate, if at least some of your ELB hosts are working, than to re-create the entire ELB somewhere else. That's especially true if you're a high-traffic site that may hit scaling issues going from 0 to 60 in milliseconds (YMMV, but we've been spooked enough in the past not to try anything hasty until things get back to normal).
Am I expecting too much from them?
I think we will see big users of AWS focus more on how to create a redundant service using AWS. Or at least I hope we will!
This outage is a lot like having your entire datacenter lose power.
If that is not the case, then having a multi-region setup would be a necessity for any major sites on AWS.
Perhaps there will be a time where to truly be redundant, one would need to use multiple cloud providers. Which would be a _huge_ pain to do now I imagine, with all the provider lock-ins we have.
They are. Which means this is probably a software issue or some other systemic issue.
I think I'll go home and wait it out there. It appears they are making some progress in recovering, but our site is still affected.
Could you use this outage to justify switching to multi-region?
A: When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area. Luckily, we are currently in a hiring round which will increase the technical staff by 200% :) These new programmers will help us address this issue."
Not sure if the costs of data transfer between regions (charged at full internet price) would justify the added reliability/lower latency though.
If we were all having our own rental servers then.... well, many sites that we know wouldn't be around :)
The relationship between the pricing tiers changes fairly drastically, depending on how much you are already spending on Amazon Web Services. Gold, for instance, starts out at 4x the price of Silver support, but by the time you're spending 80K/month on services, it's only a $900 premium (and stays there no matter how much bigger your bill is). At the $150K/month level, it's a 2x jump from Gold to Platinum, which may or may not be a huge jump, considering the extra level of service you get.
I'm guessing jedberg is mostly banging his head in walls and seriously looking at alternative hosting solutions right now.
Periodically, latest a couple of days ago, there's a post / discussion about whether outsourcing core functionality is the right thing to do. There are valid points on both sides of the issue.
For my part, if I'm going to be up in the middle of the night I'd rather be up working on fixing something rather than up fretting and checking status. But either way things get fixed. The real difference comes in the following days and weeks. When core stuff is in the cloud then you can try to get assurances and such, fwiw. When core stuff is in-house then you spend time, energy and money making sure you can sleep at night.
I thought you could cluster your instances across many regions and replicate blah blah blah and change your elastic ip addresses in instances like this?
Is this a case that it's not being utilised or does that system not work?
I appreciate you are busy right now so I'm not expecting a reply any time soon.
I'd still say Amazon is a great place for your startup, just don't use EBS.
I have not used it all in detail yet so I don't know the practicality of this method.
I think I will stick to my co-location costings I am doing for the time being. There is only one person to rely on when it all goes wrong then!
Good luck getting it sorted; I know I wouldn't appreciate being up at 3am dealing with it.
In theory, yes. In practice, those snapshots hurt the volume so much that it is impossible to take one in production.
Do you guys blog this anywhere?