Amazon Web Services are down (amazon.com)
560 points by yuvadam 2255 days ago | 332 comments

Some quotes regarding how Netflix handled this without interruptions:

"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." [1]

"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." [2]

[1] https://twitter.com/adrianco/status/61075904847282177

[2] https://twitter.com/adrianco/status/61076362680745984

"Cheaper than cost of being down." This is very insightful. Many of us look at the cost of multi-zone deployments and cringe, but it's a mathematics exercise: (0.05% * hours in a year) * (cost of being down per hour) = (expected cost of single-zone availability). Now just compare that to 2-3x your single-zone deployment cost. Don't forget that the cost of being down per hour should include lost customers as well.

At their level of income, this is true.

For us, we are just now staffing up to the level where we can make the changes necessary to do the same thing.

I think it's incredible that you guys can run a site at all with the few people you've got. Hope it all gets better again soon.

I would also be shocked if Amazon isn't giving Netflix preferred pricing because it's such a high-profile customer.

Netflix pays standard rates for instances but uses reserved instances to pay less on bulk EC2 deployments

Are you looking to diversify across ebs or set up dedicated hosting?

It's a strange algebra though; doesn't it mean the WORSE Amazon's uptime is, the more money you should give them?

More accurately, the more unstable your infrastructure is, the more you will need to spend to ensure stability.

Spending more on AWS to increase reliability isn't necessarily a benefit to Amazon. The increased costs can make them less competitive.

I'm actually surprised if incurring 50% extra hardware costs really is cheaper than the cost of being down. If Netflix is down for a few hours, then it costs them some goodwill, and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.

Actually, they do, and Netflix proactively refunds customers for downtime. Usually it's pennies on the dollar, but I've had more than one refund for sub-30-minute outages which have prohibited me from using the service.

Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.

When you're disrupting a market, remember that what seems trivial is in fact hugely important when you're fighting huge well-established competition :)

I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and they pull something I or my wife regularly watch. (looking at you, "Paint Your Wagon")

I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.

There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.
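
To make that arithmetic explicit, here is a small Python sketch. It assumes the load is spread evenly and that the surviving zones absorb all of it; the function name is made up for illustration and is not anything Netflix or AWS provides.

```python
def utilization_after_zone_loss(utilization, zones, zones_lost=1):
    """Per-zone utilization once the same total load lands on the surviving zones."""
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("no zones left to carry the load")
    # Total load is utilization * zones; divide it among the survivors.
    return utilization * zones / surviving


# Three zones at the top of the 30-60% band, then one zone is lost.
print(utilization_after_zone_loss(0.60, zones=3))
# The same loss with only two zones would push the survivor past 100%.
print(utilization_after_zone_loss(0.60, zones=2))
```

The second call is the reason three zones matter here: with two zones at 60%, losing one leaves no headroom at all.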

Obviously you haven't been around my wife when she loses the last 5 minutes of a show. SLA or no, services will get cancelled.

I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.

This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.

Also, when there are service interruptions, they send out credits to customers.

Every decision in a business is like this - measure the cost of action A versus the cost of not-A. It's just rare that in this case, those costs are easily quantifiable.

Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.

And if they do mean three regions - can the cost of spanning various regions be quantified for different companies? The money spent vs money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs between regions may also differ vastly for different types of companies and infrastructures, leading to varying costs of maintaining a site across multiple regions.

More comments coming from Adrian Cockcroft:

1. See slides 32-35 of http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011

2. "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util."


Here's the 24h latency data on EC2 east, west, eu, apac: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-24h.png

Last 60 minutes comparison data: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-60m.png

time in GMT.

A study we (Cedexis) did in January comparing multiple ec2 zones and other cloud providers: (pdf) http://dl.dropbox.com/u/1898990/76-marty-kagan.pdf

Pure opinion: That convergence might show that Amazon tried to do a failover on a DC level. Once they figured that wouldn't work or east was down for the count they just let it cycle to the ground under latency.

Yes - it is all business decisions. As someone said already, an instance on AWS can cost up to 7X a machine you own in co-location. Here is how Outbrain manages its multi-datacenter architecture while saving on disaster recovery headroom: http://techblog.outbrain.com/2011/04/lego-bricks-our-data-ce...

Amazon's EC2 SLA is extremely clear - a given region has an availability of 99.95%. If you're running a website and you haven't deployed across more than one region then, by definition, your website will have 99.95% availability. If you want a higher level of availability use more than one region.

Amazon's EBS SLA is less clear, but they state that they expect an annual failure rate of 0.1-0.5%, compared to commodity hard-drive failure rates of 4%. Hence, if you wanted a higher level of data availability you'd use more than one EBS volume in different regions.

These outages are affecting North America, and not Europe and Asia Pacific. That's it. Why is this even news? Were you expecting 100% availability?

    Amazon's EC2 SLA is extremely clear -
    a given region has an availability of 99.95%.
    If you're running a website and you haven't
    deployed across more than one region then,
    by definition, your website will have 99.95%
    availability. If you want a higher level of
    availability use more than one region.
Good point.

let P(region fails) = 0.05% and let's assume (and hope) that the probability of failure of one region is independent of the state of the other regions.

P(two regions fail) = P(one region fails and another region fails) = P(region fails) * P(region fails) = 0.05% * 0.05% = 0.0025%

Making your availability = 100% - 0.0025% = 99.9975%

Ultimately it's more of a business decision if you want to pay for the extra 0.0475% of availability. I would think (or hope) that most engineers would want it anyway.

The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?

"let's assume (and hope) the probability of failure of one region is independent of the state of the other regions."

In practice, that's not true, and it's not true enough to ruin the entire rest of your calculations. For Amazon regions to function independently, they'd have to be actually, factually independent; there is no interaction between them. The reaction to one node going down is never to increase the load on other nodes as people migrate services, etc. There's fundamentally nothing you can do about the fact that if enough of your capacity goes out then you will experience demand in excess of supply.

If you want true redundancy you will at the very least need to go to another entirely separate service that is not Amazon... and if enough people do that, they'll break the effective independence of that arrangement, too.

(This is a special case of a more general rule, which is that computers are generally so reliable that the ways in which their probabilities deviate from Gaussian or independence tend to dominate your worst-case calculations.)

I agree with you 100% that they're not independent, but I don't know enough about the data to model the probabilities of failure and availability in a HN comment :-)

After today's event, it would certainly be interesting to see how resource consumption changed in other availability zones and at other providers during this outage.

I wonder if that could be measured passively? What I mean is, by monitoring response times of various services that are known to be in specific regions and seeing how that metric changes (as opposed to waiting on a party that has little-to-no economic benefit to release that information.)

No, your decimal point is off. 0.05% * 0.05% = 0.0005 * 0.0005 = 0.00000025, or 0.000025%. It works out to an expected downtime of 8 seconds per year, instead of over 4 hours for one location.
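
For anyone who wants to check the corrected figures, a quick Python sketch. It keeps the independence assumption from the parent comment, which others in this thread dispute, so treat the result as a best case. The function name is illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def all_regions_down_probability(p_region_down, regions):
    """Probability that every region is down at once, assuming independent failures."""
    return p_region_down ** regions


p_single = 0.05 / 100  # convert the percentage to a plain fraction first
p_both = all_regions_down_probability(p_single, 2)

print(p_both)                              # fraction of time both regions are down
print(p_both * SECONDS_PER_YEAR)           # expected seconds per year, two regions
print(p_single * SECONDS_PER_YEAR / 3600)  # expected hours per year, one region
```

Converting the percentage to a fraction before multiplying is the step that went wrong in the original calculation.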

Of course, redundancy doesn't set itself up, so there are added costs on top of Amazon.

Thank you for this. I don't know why I tried doing the math without converting the percents to decimals; I should have known better.

Why wouldn't a simple expected value calculation work? You've shown that you can calculate the extra availability that subscribing to another region provides. Simply multiply the cost of an outage by the extra availability provided by an additional region that would have prevented that outage.

If expanding to another region costs more than just taking the outage, then it's categorically not a good option. If management still says no in the face of numbers that suggest yes, then that tells you that you're missing a hidden objection, and how you proceed will depend on a lot of factors specific to your situation.
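
A minimal sketch of that expected value comparison in Python. All the dollar figures are invented for illustration, the function names are hypothetical, and the extra availability is passed in as a fraction of the year (not a percentage).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def expected_outage_cost_avoided(extra_availability, outage_cost_per_hour):
    """Expected yearly outage cost that the extra availability would prevent."""
    return extra_availability * HOURS_PER_YEAR * outage_cost_per_hour


def second_region_pays_off(extra_availability, outage_cost_per_hour, extra_region_cost_per_year):
    """True when the avoided outage cost exceeds what the extra region costs per year."""
    avoided = expected_outage_cost_avoided(extra_availability, outage_cost_per_hour)
    return avoided > extra_region_cost_per_year


# Going from 99.95% to 99.999975% availability, as fractions of the year.
extra = 0.0005 - 0.00000025
print(expected_outage_cost_avoided(extra, outage_cost_per_hour=10_000))
print(second_region_pays_off(extra, 10_000, extra_region_cost_per_year=20_000))
print(second_region_pays_off(extra, 10_000, extra_region_cost_per_year=50_000))
```

The hard input is the hourly outage cost, which has to include things like lost customers and reputation, not only lost sales during the outage.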

I think you're right, that would be the best way of presenting this argument to management. To do so, however, the company would need to calculate its Total Cost of Downtime (which probably isn't very complex for many companies) which is its own subject entirely IMO.

> calculate its Total Cost of Downtime (which probably isn't very complex for many companies)

Not complex even factoring in reputational damage?

> The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?

In businesses where physical goods are sold to customers, "the management" is generally very motivated to avoid stock-out situations in which sales are lost due to lack of inventory (even if it's only a very small percentage.) The reason for this is because they are concerned about the potential loss of customer goodwill. It seems that the same applies in this situation.

Minor nitpick, but the availability should be even better, since 1% * 1% = 0.01% the availability becomes 99.999975% - six nines, anyone?

Reddit's been down for several hours today; I'm sure they are already way lower than 99.95%.

0.05% of one year is 4 hours, 22 minutes and 48 seconds.
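
The same conversion as a small Python helper, in case anyone wants to try other percentages or billing periods. The function name and the `days` parameter are my own; a year is taken as 365 days.

```python
def downtime_budget(downtime_percent, days=365):
    """Return (hours, minutes, seconds) of downtime allowed over the period."""
    total_seconds = round(days * 24 * 3600 * downtime_percent / 100)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


print(downtime_budget(0.05))           # yearly budget for a 99.95% SLA
print(downtime_budget(0.05, days=30))  # the same SLA measured over a 30-day month
```

Measured per 30-day billing period instead of per year, the same percentage allows only about twenty minutes.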

Reddit experienced some issues with Amazon a month ago that resulted in the site being down for almost a day. I'm pretty sure they're way below that percentage.

That's conflating SLAs again, Reddit's long-running problems have been with EBS reliability.

Even so, .5% downtime per year is about 44 hours, and Reddit's definitely had more downtime in the last few months than that.

Of course, that's assuming that 100% of Reddit's problems were due to EBS only and not a combination of EBS, EC2 and their own code.

Is this really standard practice for measuring the SLA? The contracts I've seen for a couple small businesses are generally per billable period.

Which always made sense to me. I pay you for 99.5% uptime this month. If you don't achieve it, then I get a discount, as simple as that. If your availability is below that, I don't pay full price for the billable period and then reconcile at the end of the year.

I'd be pretty interested in any links or general advice anyone has on this topic, and in finding out whether there's a general consensus that it's done differently.

Don't forget to take in to account ALL of Reddit's downtime; there is quite a bit of it.

Which means their entire quota for this year is all gone.

and then some...

If the Reddit web server admins took availability seriously they would have chosen to deploy across more than one region. Do you disagree? Why do you disagree? I'm being honest, no snark involved in my questions.

He wasn't suggesting that all Reddit's problems are due to Amazon services, he was using Reddit's down time today as a data point illustrating that the uptime guarantee claimed for the service has not been kept this year (in fact a whole year's "permitted downtime" as implied by the 99.95% SLA may be eaten on one day). Presumably Amazon will be handing out some refunds and other compensation (assuming the SLA isn't of the toothless "it'll be up at least 99.95% of the time, unless it isn't" variety).

Perhaps the Reddit admins decided that "up to 0.05%" downtime permitted by the SLA would be acceptable, compared to the extra expense of using more of Amazon's services (and any coding/testing time they may have needed to take advantage of the redundancy depending on how automatic the load balancing and/or failover are within Amazon's system) to improve their redundancy. By my understanding the promise isn't 99.95% if you use more than one of our locations, it is 99.95% at any one location, so the fact that Reddit don't make use of more than one location is irrelevant when talking about the one location they do use not meeting the expectations listed in the SLA.

I'm not saying Reddit's implementation decision is right (I don't have the metrics available to make such a judgement) but it would have been made based partly on that 99.95% figure and how much they trusted Amazon's method of coming to that figure as a reliability they could guarantee. If I had paid money for a service with a 99.95% SLA, unless the SLA had no teeth, I would be expecting some redress at this point (though there is probably no use nagging Amazon about that right now: let them concentrate on fixing the problem and worry about explanations/blame/compo later once things are running again).

Very few cloud SLAs seem to have teeth to me. Amazon's SLA gives a service credit equal to 10% of your total bill for the billing period if they blow past the 0.05%. This is a lot better than some cloud providers that will simply prorate the downtime, but pretty crappy in terms of actual business compensation. It's equivalent to a sales discount almost any organization with a sales staff could write without thinking about it - meaning Amazon is still making money on every customer even when they've blown past their SLA - assuming every single customer fills out the forms to apply for the discount. Hint: many won't; see mail-in rebates.

A number of tier 1 network providers offer certain customers SLAs that are clearly in place to prove that they invest in redundancy and disaster planning, e.g.: less than 99.99% --> 10% credit; less than 99.90% --> no charges for the circuit in the billing period.

This reflects an understanding that downtime can hurt your business/infrastructure far in excess of the measurable percentage.
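
To compare the two credit schemes side by side, here is a rough Python sketch. The tier tables only encode the numbers quoted in this comment, not the providers' actual contract terms, and the names are hypothetical.

```python
def sla_credit_fraction(availability_percent, tiers):
    """Largest credit whose threshold the measured availability fell below.

    tiers is a list of (threshold_percent, credit_fraction) pairs; a tier
    applies when availability_percent is strictly below its threshold.
    """
    applicable = [credit for threshold, credit in tiers if availability_percent < threshold]
    return max(applicable, default=0.0)


# Numbers as quoted above; illustrative only.
NETWORK_PROVIDER_TIERS = [(99.99, 0.10), (99.90, 1.00)]
EC2_TIERS = [(99.95, 0.10)]

for measured in (99.995, 99.93, 99.5):
    print(measured,
          sla_credit_fraction(measured, NETWORK_PROVIDER_TIERS),
          sla_credit_fraction(measured, EC2_TIERS))
```

The difference shows up in a bad month: the tiered scheme waives the whole bill, while the flat scheme never goes above 10%.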

> If the Reddit web server admins took availability seriously

We do.

> they would have chosen to deploy across more than one region.

It's far too costly to do that. We are deployed across multiple AZs, but this failure hit multiple AZs.

Why is it more expensive to deploy in zones X,Y in regions A,B than zones M,N in region C? I assume you don't just mean "US West is ~10% more expensive than US East."

It's the combination of the extra cost of having machines in US West plus the cost of keeping the data synchronized between them (which is a lot) plus the added development overhead of making sure that things work cross region.

We'll get there one day, but we aren't there yet.

> but this failure hit multiple AZs.

Gah! You can't always account for all the failure modes that Amazon might have.

This seems to be a prevalent misunderstanding. Amazon's EC2 SLA of 99.95% applies at the scope of a region. A region may contain more than one availability zone. Hence, deploying on multiple availability zones still only affords the 99.95% availability level.

Yes, multi-region availability on AWS is hideously expensive. However, some organisations value an availability of greater than 99.95% enough to warrant such a multi-region deployment. Clearly reddit, and many, many other AWS users, do not. This isn't a value call on my part; I definitely couldn't afford the inter-region data transfer costs, all I know is that AWS offers you the tools to deploy high availability web services.

Why does Reddit really need 99% availability? Is a customer unduly harmed or is the world even worse off if Reddit is down for a couple cumulative days per year? Is it worth the cost? Would you put up with more ads and/or pay for Reddit in order to make sure that it's available 24/7/365?

Probably not as much for the customer as for the company. When sites are unreliable, people end up going to the more reliable competitors as they arise.

I wouldn't think there are a large number of customers deciding "this is too unreliable, I'm leaving" on the basis of a few hours of down time. On the other hand, there might be a large number of people who, upon finding your site down, decide to visit alternatives that are up at the time, and some of those people might decide they like the alternatives better.

If any site can take down time and not lose users, reddit can. And it has.

Not qualified to speak about what Reddit should or should not do about the arrangement with Amazon. I have read several posts, including one by an (ex) Reddit employee saying Amazon is not delivering what they said they would, that much is clear. I really doubt all their downtime is part of the SLA.

The whole point of AWS is to forget about maintaining hardware infrastructure.

Amazon are the ones who should have made backups in multiple regions and transferred the load on failure.

Actually, the whole point of AWS is to have options for using hardware that you don't own. They don't offer any magic "all your stuff in one package, guaranteed to work all the time" service. So yes, you do still need to think about your hardware infrastructure. You just don't have to own it.

And Amazon does have all their stuff available in multiple regions. It's up to you to use it though.

> The whole point of AWS is to forget about maintaining hardware infrastructure.

If that were the case, you wouldn't be presented with region and availability zone options.

Then it would be much more expensive at the bottom tiers, meaning I wouldn't be able to play with it on a whim without thinking about the money. That would suck.

"[S]hould" is the wrong word here. Clearly, they don't maintain such backups. This is clear to anyone using their service. They pay for the service anyway, so apparently it's still worth it to them, even without auto-backups.

Would it make sense for Amazon to maintain automatic backups (and potentially charge more for them)? I don't know. It might make business sense, it might not. But their service is apparently popular enough even without it.

Not sure if I could say Amazon should be doing that - but I'd love it if other value added providers (such as Heroku) could implement this.


That's a cache of it. I really wish that the admins at reddit would implement something like this themselves, then link to it when downtime like this happens.

They do have a read-only mode, don't they? I'm not sure why they don't enable read-only mode when things like this happen. It may be that Amazon's service being down forbids this. I dunno.

They have a read-only mode for "free" which is their akamai cache that the unlogged in users see.

do you have any stats of this app in terms of total data stored, bandwidth required per day, requests per second?

I don't, unfortunately, because I didn't write it :(

Reddit being down is not news.

That hurts. But you're right, we've had a lot of issues.

I think the reason this is news is because it is a massive Amazon failure.

Reddit gets a lot of grief for stability issues but the fact is it is an immensely popular site that a huge number of people have a close affiliation to. A massive percentage of these people spend a significant portion of their day browsing Reddit, interacting with other redditors, etc. and for the site to be down for as long as it has is news, regardless of similar issues occurring in the past.

The main reason this is news is because this is an Amazon issue but also because tens of thousands of people who frequent the site regularly are now aimlessly browsing the internet in the hopes of finding alternative lulz and in my case some of us are even getting work done. shudder

I can't imagine how frustrating the jobs of the Reddit admins must be.

Is admin supposed to be plural? I mean, do they really have multiple system admins now? I ask, only because I know people have been coming and going recently.

Frankly, for the size of the site, they do really, really well for the limited resources they have.

I meant admins in the more general purpose sense of administrators, people who are paid to maintain the system. But yeah I agree, quality:resources ratio is really really high.

It's usually very rewarding. The awesome community is what keeps me doing it.

Awesome, I just figured out that you kept the votes tallied during the 'downtime'. What's interesting is how clearly good and bad submissions were dichotomized when nobody had anything else to vote on.

You made Google App Engine managers cry once: http://www.theregister.co.uk/2009/07/06/dziuba_google_app_en...

Anything coming up for amazon? If not for anything else, for pure entertainment value!

No. Not really. But we kind of miss it anyway.

Note also that 0.1-0.5% refers to irrecoverable data loss, not temporary unavailability.

Current status: bad things are happening in the North Virginia datacenter.

EC2, EBS and RDS are all down on US-east-1.

Edit: Heroku, Foursquare, Quora and Reddit are all experiencing subsequent issues.

Indeed they are. Right before the issues began, I pushed a bad update to one of my Heroku apps, causing it to crash. A minute later I fixed the bug, re-pushed the git repo to Heroku... and nothing. I've been stuck with an error message on my website for hours. Unfortunate timing!

Staging servers are an easy thing on Heroku :)

You have my sincere sympathy.

http://status.heroku.com/ for those on Heroku

Is it just me, or is their status page down?

I know you probably have your answer by now, but this site always helps me out when I have that question:


It's been on-and-off for the past few hours.. It's up right now (for me, anyway).

This morning from 5a - 6a Pacific time I was able to access my Heroku app just fine.

I can access all my heroku apps that have their own DNS. Anything with a .heroku.com subdomain is down for me. Frustrating, knowing that the apps are still running but aren't routable.

Yay, cloud.

Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to make a useful point about the practicality or otherwise of massively virtualised systems for webapp deployment, please do. It's going to take more than two words though.

I guess you haven't seen the Microsoft ads about the cloud? http://www.youtube.com/watch?v=Lel3swo4RMc

Anyway, my bad, I was just trying to make a joke to lighten up the mood. Sorry.

You're right, I hadn't. Fair enough.

I laughed, it's relevant and puts things in perspective if you had seen the ads. So it's not really content free even if it's just two words.

Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to bitch about a post that we all got the point of, you're going to need more than two paragraphs.

Foursquare is up now. Quora not showing the 503 anymore, Reddit still down.

We rely heavily on EBS still, so this is hurting us more than most others. Hopefully they'll have us back up soon.

You guys might have answered this in one of your AMAs/blog posts (or was it raldi who commented?), but what options can reddit resort to should this stuff happen again to this degree of severity?

We're moving away from the EBS product altogether. The hard part is dealing with the master databases. Normally I'd have a master database with a built in raid-10, but I can't do that on EC2, so I have to come up with another option.

So I guess that is the long way of saying that hopefully it won't happen again.

I do not believe you could be effective by moving away from EBS, you know without giving up quite a bit.

Doing things the right way with EC2 means using EBS. It's the brake caliper to the rotor. Sure, you could have drum brakes, but they're nowhere near as effective, as they quickly get heat soaked. I'm referring to S3.

One shouldn't trust ephemeral storage. Your instance can go down at any time. Write speeds to S3 are not nearly as fast as ephemeral or EBS arrays (RAID).

Hate to say it, but if one cannot trust EBS then what the heck are 'we' doing on EC2... EBS quality should be priority one, otherwise we're all building Skyscrapers on foam foundations of candy cane rebar.

I can't say whether much has changed within the last year, but when I worked at FathomDB we had serious issues with EBS. You couldn't trust it. Odd things would happen like disks getting stuck in a reattaching state for days and disks having poor performance.

How do you move away from EBS and still deal with large data?

Not sure what you had in mind by "large", but instance storage goes up to 1.7TB: http://aws.amazon.com/ec2/instance-types/

The reason Reddit uses RAID10 is for performance, not disk size. A single instance storage device is just too slow for the Reddit database.

Many instance types have 2 or 4 virtual disks (presumably on different physical disks).

I imagine they'd consider some combination of the following (sorted by most likely):

1. Sharding data
2. Pulling tables out to other servers from the main DB
3. Pruning excessive data
4. Compressing data

It still has to be stored somewhere though right? If it's EBS you've just made yourself a complicated solution that will eventually fail all over again. No?

If the data is sharded, then the data/server is small enough to fit within the individual server's disk and you no longer need EBS to store it.
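
For readers unfamiliar with the idea, a minimal hash-based sharding sketch in Python. This is a generic illustration, not how Reddit routes its data; the function name and the example key are made up.

```python
import hashlib


def shard_for_key(key, shard_count):
    """Map a record key to a shard index using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # md5 is used only for an even, repeatable spread, not for security.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Every lookup for the same key lands on the same server.
print(shard_for_key("comment:12345", shard_count=8))
```

Plain modulo hashing is the simplest scheme, but changing the shard count moves most keys, which is why schemes like consistent hashing exist.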

NOTE: I work for Gluster.

We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There are minimal additional costs and you can get better performance, stability, and protection (RAID 5, 6).

Gluster makes an OSS distributed filesystem that runs across availability zones, our AMI (not OSS) builds multiple RAID arrays on each instance then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.

Please tell us how you plan on moving 700 EBS volumes to something completely different. It sounds amazing.

Foursquare is up, but Quora is still showing the 503.

I stand corrected. 4sq down again.

Not all EC2 & EBS instances are down. I have several in US-east-1a and 1 is down, while all of the others are working.

Same with us. About 10% of our 700+ volumes are having problems right now.

It's hard to tell for sure since there isn't any load.

We're fine on EC2 -- but everything on RDS seems to be giving us big problems. We started a few backups before we knew it was systemic, and all of them are stuck at 0%. We also tried spinning up new instances -- and they're all stuck in booting.

I think which physical data center "us-east-1a" etc. corresponds to differs from user to user, to load-balance given that people will probably be more likely to use 1a than the other zones.

We had about 45 min of downtime around 4am EST. Our RDS instances, EBS backed and normal instances all returned without problems. We are in Virginia us-east-1a and us-east-1b.

Also Cuorizini (http://cuorizini.heroku.com) is down!

4/21/2011 is "Judgement Day" when Skynet becomes self aware and tries to kill us all. http://terminator.wikia.com/wiki/2011/04/21

I am just a little freaked out right now.

Don't worry. If skynet is in EC2, we'll be fine.

You don't understand, Skynet is using all Amazon resources, hence the outages ;-)

Amazon have stated many times that amazon.com itself runs mostly on AWS platform, but it works fine now ...

AWS platform on a private cloud, it is not the same as AWS platform for us commoners.

I wonder if it's a virus or worm whose activation date was today due to the fact above, found its way into the amazon servers.

Probably just a coincidence.

Do we all regret letting GLaDOS reboot yet?

Sigh, Portal jokes were so much more popular on Reddit. :)

HN really isn't the place for internet memes, jokes about pop culture, and things that are judged trivial / frivolous. Part of what makes the HN community what it is, is a focus on high-quality, reasoned, rational discourse. IOW: HN != Reddit

A couple of hours into the failure, and no sign of coverage on Techcrunch (they're posting "business" stories though). It shows how detached Techcrunch has become from the startup world.

Edit: I tweeted their European editor about it and he's posted a story up now.

Perhaps this isn't really news. These days it's normal.

It's ugly, but true enough. You don't have to like it to acknowledge it. It's just another cloud outage bringing down one or more high profile sites. It's a "dog bites man" story.

This feels the same way as hearing that the whole Internet just got shut down.

I guess this is one Reddit outage that can't be blamed on poor scaling

Thankfully, no. :)

But yeah, right now we're shutting everything down to try and avoid possible data corruption. Once they restore service, hopefully we'll be able to come back quickly.

Hey Jedberg, if you guys aren't already rolling your own, check out fdr's WAL-E tool. It bounces postgres write-ahead logs off S3 and goes great with the new PG9 replication.


Thanks for this. I had designed and partially implemented this exact same thing. Do you know of this running in production anywhere?

Amazon is really not being kind to you guys; I sort of hope you'll find an alternative solution fast!

If I was Rackspace, I'd be at Reddit/Wired's headquarters already.

Didn't Jedberg say that they could reduce the failures by spending more with Amazon?

I wonder if Rackspace really want this particular traffic burden. It seems that if Reddit choose not to pay for the load they need then you get lot's of bad press for it ... perhaps I'm seeing it wrong.

Rubbish analogy: Kinda like if I was doing a haulage business and you called out for a wheelbarrow to carry some elephants, then when the barrows broke we got bad press despite. If you'd paid for a heavy animal transport package ... OK it's all going wrong, you get the idea.

No, I said that we have spent all we can, and at this point we need development.

However, in this case, the outage is not because of any issues with our setup, but with Amazon.

>"we have spent all we can"

So is it a financial constraint with Amazon? Would you be suffering the same sorts of outages regardless of the technology on the backend or does AWS basically suck?

You have the wrong end of the stick, because you're missing the history of the story. Reddit have a weird budget when it comes to staffing costs versus operating costs due to their parent company's policies as a media company - so they have a decent budget but are massively understaffed.

Statements like the one you're quoting are in that context. Let's say you have an unlimited operating budget - you can come up with all kinds of wonderful plans for massive redundancy and zero downtime. But you can't make that happen if you're not allowed to hire any engineers or sysadmins! As far as I'm aware reddit are paying Amazon mucho dinero but still having irredeemable problems with the storage product, EBS. They are stuck on an unreliable service without the manpower to move off.

That's the story, as far as I can piece together from comments here and on reddit.

Ah, you see from what I read on reddit I understood that the staff shortage was simply part of Conde Nast's unwillingness to spend money on reddit and that constant downtime issues were another facet of that same problem.

It's not making money and those looking after reddit don't want to ruin it with a huge money grab - instead taking a soft approach, first just begging for money, then adding in a subscription model (freemium anyone?) and more subtle advertising by way of sponsored reddits (/r/yourCompany'sProduct type stuff).

I understand they've been hit with more staff problems just recently despite having a new [systems?] engineer start with them.

So in your view EBS is the problem regardless of finance? That was the nut I was attempting to crack. TBH I didn't expect someone at reddit to stick their neck out and say "yeah Amazon sucks" but they might have confirmed that the converse was true and they were simply lacking the necessary finance to support the massive userbase they have.

Rackspace (and really all the "popular" US hosters) seem ridiculously expensive compared to hosting prices we have in Germany (see e.g. http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... this is one of the biggest root server hosters in Germany).

Is this really so, or are Rackspace and co. just "boutique" offerings?

Will you guys ever do something like this:


but officially supported (and paid for) by you?

Sup jedberg, I obviously don't have nearly the level of knowledge with the intricacies of reddit, but coming from a strictly "business" standpoint, the amount of downtime reddit receives due to amazon issues is astounding. Perhaps it's time to look for alternatives?

Anyway, thanks for your time.


Good plan though. :p

My original comment was "We're going Madagascar on the servers." Then I remembered I was on HN, not reddit. :)

>Then I remembered I was on HN, not reddit.

Right now we may as well be on reddit.


Don't do that. HN is not the place for memes.

reddit isn't either, but we lost that battle a while ago.

It's refreshing for me to see you say this. God speed soldier.

You don't need to flee the country yet.

Hey man.. some of us would have got it :)

President Madagascar is doin' cloud biz, too?

Reminds me of that scene in Jurassic Park

Good lord. What's with all the DVs??

Ha. HN really on its high horse today.

I know right? I made a hilarious joke a little while ago and jedberg yelled at me and everyone downvoted me. And I'll be very surprised if this comment doesn't get downvoted to hell too.

EDIT: I also simply greeted jedberg, and a bunch of people thought that was a good reason to downvote. Do people think there's an imminent influx of redditors, and that they have to dissuade them from becoming HNers? I don't think that's the case.

EDIT: Fuckin' called it.


I'm here all the time. :)

Apparently most of their problems are caused by bad EBS writes/performance, or at least so they said a few weeks ago after some particularly bad downtime.

It looks like EBS will randomly decide to switch to a few bps of performance from time to time. I would use Amazon for my startup, but these issues really make it hard to justify.

EBS seems to be the main problem here. I'll cite a former reddit employee (first comment on the blog that talked about EBS problems).

I don't work for reddit anymore (as of about a week ago, although I didn't get as much fanfare as raldi did), but I can tell you that they're giving Amazon too much credit here. Amazon's EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year and they've constantly been making promises that they haven't been keeping, passing us to new people (that "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it.

Source: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

cache: - scroll down for comment - http://webcache.googleusercontent.com/search?q=cache:cfbs-sp...

EC2 instances only have one network interface. The public IP address you have pointing to your instance is a DNAT done somewhere further up the chain.

If you get a large network load to your instance - say, a DDoS attack - you can find you no longer have enough network capacity to talk to your EBS disks.

This is what happened to Bitbucket in 2009: http://blog.bitbucket.org/2009/10/04/on-our-extended-downtim...

This doesn't appear to be the issue here, though. valisystem's link mentions that it wasn't an interface issue, EBS is just shit, apparently.

Slightly offtopic, but wasn't that post by an ex-employee as a comment? Not that the technical aspect of it wasn't fantastic, because it was, but I don't think Reddit said anything publicly, did they?

It was, you are right, I misremembered. valisystem's comments above contains the reference.

Yes valisystem linked to it. All good, I wasn't sure if I misremembered.

Looks like troubles in only one availability zone.

That seems to be incorrect. We have problem children in us-east-1b, -1c, and -1d.

AWS randomize the zones per account, so "your" -1b is not necessarily the same as "my" -1b. I'm only seeing problems in my -1c. Are you seeing 3 zones failing all under the same account?

If multiple AZs are down, AWS are going to have some serious explaining to do...

Amazon RDS's most expensive feature is automatic, instant Multi-AZ failover to protect against this kind of situation. It's not working quite like that, which the AWS status page acknowledges. This is a major failure.

"AWS randomize the zones per account"

Interesting. This is news to me.

I found some more info here (Google cache since alestic.com is presently unreachable): http://webcache.googleusercontent.com/search?q=cache:0jxzyFj...

We (reddit) are seeing failures in all zones.

If memory serves, Amazon's reporting of what zones are experiencing problems has been...optimistic...in the past.

AWS have now confirmed that this affects multiple availability zones. From the status page: "..impacting EBS volumes in multiple availability zones in the US-EAST-1 region"

That's not good. The whole point of multiple AZs is for them to not fail at the same time. Suggests some dependencies that should not be there perhaps, or at least some correlation of something, like software upgrades. Looking for a good explanation of this; one AZ going down is not a problem and should not impact anyone who is load balancing.

How does Reddit display the 'offline' page if it's down?

The server is still up, so we can serve it right out of the load balancer.

Are you able to enable the 'read-only-mode' using the same method?

I was under the impression that a great deal of Reddit's issues were linked to Amazon.

Why is ELB not mentioned at all on the Service Health Dashboard?

We're experiencing problems with two of our ELBs, one indicating instance health as out of service, reporting "a transient error occurred". Another, new LB (what we hoped would replace the first problematic LB), reports: "instance registration is still in progress".

A support issue with Amazon indicated that it was related to the ongoing issues and to monitor the Service Health Dashboard. But, as I mentioned before, ELB isn't mentioned at all.

We've got a single non-responsive load balancer IP in one of our primary ELBs (others have been fine for several hours now), so while everything else for us is up & running, we still have transient errors for folks that get shunted through that one system.

The interesting thing about the ELB in a situation like this is that I believe it may, in many instances, be better to hobble along and deal with an elevated error rate if at least some of your ELB hosts are working than to re-create the entire ELB somewhere else, especially if you're a high-traffic site where you may hit scaling issues going from 0 to 60 in milliseconds (OMMV, but we've been spooked enough in the past not to try anything hasty until things get back to normal).

We have an identical load balancer to one that is causing problems so we're lucky enough to reroute traffic through that one instead to get to the same boxes. (The boxes serve two different APIs through two different DNS CNAMEs so we split the ELBs for future and sanity). In this case, it's helped us out. Alternatively, we would've just routed all traffic to our west coast ELBs.

Quote from the AWS support rep: "I can confirm that ELB has been affected by the EBS issue despite the lack of messaging on the AWS Dashboard".

Quora says: "We'd point fingers, but we wouldn't be where we are today without EC2."

Nice way to point fingers while saying you're not.

Yes, that was by far my favorite comment. Well played.

I just launched a site on Heroku yesterday and cranked up the dynos up in anticipation of some "launch" traffic. Now, I can't log in to switch them off. Thanks EC2, you owe me $$$s

Actually, I'd expect Heroku to not charge for when the site was down, as they are clearly not available, it does not sound fair if they charge for it.

Am I expecting too much from them?

My app on heroku is running, it's just that I can't log in to their management console to de-allocate resources that I am paying for by the hour.

If I were you, I'd send them an email requesting this, on your behalf. At the end of the day, it's their responsibility to make the console available. I will be more than unimpressed if they don't see the logic here.

Isn't that a design flaw in Heroku? Shouldn't you be able to log into Heroku and change stuff like that even if the entirety of Amazon's cloud service is down?

Agree. You can delegate work but not responsibility.

> Nothing special-case here: we deploy with git push, just like any other Heroku user. Dogfooding is good for you. http://blog.heroku.com/archives/2009/4/1/fork_our_docs/

That's all well and good, but it's no use for their customers if/when Amazon goes down.

I think this is a good example of how the "cloud" is not a silver bullet to making your site always up. AWS provides a way to keep it up, but it is up to each developer to ensure that they are using AWS in a way to make sure their site can handle problems in one availability zone.

I think we will see more of a focus from big users of AWS on how to create a redundant service using AWS. Or at least I hope we will!

All well and good, but the elephant in the room is that multiple availability zones have failed at the same time. It looks like AWS have a single point of failure they weren't previously aware of.

This outage is affecting all AZ's in the East. So even a multizone setup wouldn't help for this one. Only a multiregion setup.

This outage is a lot like having your entire datacenter lose power.

I thought AZs were supposed to be different physical data centers.

If that is not the case, then having a multi-region setup would be a necessity for any major sites on AWS.

Perhaps there will be a time where to truly be redundant, one would need to use multiple cloud providers. Which would be a _huge_ pain to do now I imagine, with all the provider lock-ins we have.

> I thought AZs were supposed to be different physical data centers.

They are. Which means this is probably a software issue or some other systemic issue.

Well that isn't supposed to happen :(

I think I'll go home and wait it out there, but it appears that they are having some progress in recovering it. But our site is still affected.

If "multi-region" means North America, Europe and Asia Pacific, doing so would also improve world-wide latency (e.g. here in Australia...).

Could you use this outage to justify switching to multi-region?

A blog post last month touched on this: "Q: Why is reddit tied so tightly to the affected availability zone?

A: When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area. Luckily, we are currently in a hiring round which will increase the technical staff by 200% :) These new programmers will help us address this issue."

Not sure if the costs of data transfer between regions (charged at full internet price) would justify the added reliability/lower latency though.

I don't know reddit, but concurrency is very hard with high latency.

NetFlix should sell their chaos monkey as a commercial product.

Don't let AWS hear that, or they'll charge us for their failures by rebranding it as a feature.

Instead of enumerating who's down, I'd be more interested to hear about those that survived the AWS failure. We could learn something from them.

Quora is down, and evidently "They're not pointing fingers at EC2" -- http://news.ycombinator.com/item?id=2470119 -- I was going to post a screen shot, but evidently my Dropbox is down too.

Holy crap. An Amazon rep actually just posted that SkyNet had nothing to do with the outage:


I'm seeing 1 EBS server out of 9 having issues (5 in one availability zone, 4 in another). CPU wait time on the instance is stuck at 100% on all cores since the disk isn't responding. Sounds like others are having much more trouble.

Silver lining: Hopefully I can test my "aws is failing" fallback code. (my GAE based site keeps a state log on S3 for the day when GAE falls in a hole.)

This code should be well tested by now. Amazon is doing you a favour by being rubbish.

AWS/S3 has become the new Windows - great SPOF to go for if you want to attack. This space needs more competition.

Two years ago TechCrunch was publishing an article every time Rackspace went down listing all the hot startups down along with it. AWS is no more a SPOF than any other major hosting provider.


It is more noticeable since so many large sites use it. Kind of like a MS BSOD was a common joke because so many people used MS Windows.

If we were all having our own rental servers then.... well, many sites that we know wouldn't be around :)

http://venuetastic.com/ - feel bad for these guys. They launched yesterday and down today because of AWS. Murphy's law in practice.

Wow. I can only imagine the intense frustration the site owner must be feeling right about now. Makes you really stop and question the whole "cloud" based service. Or at least should make you realize you need fall-backs other than the cloud service itself.

They are scaling in the cloud, at least.

So when big sites use Amazon Web Services for major traffic, do they get a serious customer relationship? Or is it just generic email/web support and a status page?

There are subscriptions for various levels of support, from $50/mo for 12 hour response time to $15k/mo for 15 minute response time.


Interesting pricing. The Platinum seems priced to have no one use it, considering how much of a jump it is over Gold.

I would say, rather, that it is priced to have very specific sorts of customer using it.

The relationship between the pricing tiers changes fairly drastically, depending on how much you are already spending on Amazon Web Services. Gold, for instance, starts out at 4x the price of Silver support, but by the time you're spending 80K/month on services, it's only a $900 premium (and stays there no matter how much bigger your bill is). At the $150K/month level, it's a 2x jump from Gold to Platinum, which may or may not be a huge jump, considering the extra level of service you get.

It's a bit ironic that Amazon WS has become a SPoF for half the internet.

Yes, they are. :(

My four-day weekend is already off to a bad start (UK here).

Mine is worse. I booked Tues-Thurs off. I only have internet in work at the moment. I'm going to miss reddit now and be without internet until I return to work on the 3rd of May. Stupid Sky and their stupid take forever switch overs.

It could be worse/better, you could have an Australian 5-day weekend.

Or a South African 11 day weekend. Easter, and then public holidays on 27 April and 1/2 May are combining this year to provide a massive holiday opportunity.

UK Universities give staff the Tuesday too, giving them a five day weekend. I took two days holiday next week and because of Easter and the Royal Wedding that gives me 11 days off straight.

To clarify, this Monday is ANZAC Day, a commemorative holiday for troops who fought for Australia. Because that's also Easter Monday, the ANZAC Day holiday is moved to Tuesday, despite commemorative services being held on Monday.

Actually the official line is that Easter Monday got moved to Tuesday.

Interesting. Thanks. What did the Catholic Church have to say about that? Is it that Easter Monday is still on Monday, but the holiday is on Tuesday?

Easter is the holiday, and it's on Sunday. I'm pretty sure the Pope doesn't care much what people do the day after Easter (or the day after that).

Ah, thanks. I misread the wikipage, and didn't realise.

Is it a 4-day holiday in Norway too?! Opera Support Forums have been sketchy for hours and if they won't come back until Tuesday, I'm SOL with my Opera problems.


Why do you get a 4 day weekend?

Whichever date Easter Sunday falls on, the Friday before it (Good Friday) and Monday after (Easter Monday) are both public holidays.

Getting the 26,27,28th off from work is just genius for a long break!


Royal Wedding

May Day

Easter, probably.

no reddit at work today!

Believe me, the last thing I want is to be up at 3am working on this. I'd much rather be sleeping and letting you not work.

Are you part of the team resolving it?

"Resolving" probably isn't the right word seeing as this is a purely amazon issue, not much they can do.

I'm guessing jedberg is mostly banging his head in walls and seriously looking at alternative hosting solutions right now.

Actually, I'm on the couch in front of the fireplace, watching old SNL, waiting patiently for Amazon to fix their shit, and figuring out how we can not use EBS anymore.

I'm pretty sure redditors would be more than happy to deal with a day or so extra downtime as you guys switched to a better platform. Just leave a simple page up saying "Dumping Amazon, brb"... doubt you'd get many complaints.

TBH Amazon is so bad at this point that turning off Reddit is as good as trying to keep it running. Of course then you need to deal with the increased suicide rate.

Right on.

Periodically, latest a couple of days ago, there's a post / discussion about whether outsourcing core functionality is the right thing to do. There are valid points on both sides of the issue.

For my part, if I'm going to be up in the middle of the night I'd rather be up working on fixing something rather than up fretting and checking status. But either way things get fixed. The real difference comes in the following days and weeks. When core stuff is in the cloud then you can try to get assurances and such, fwiw. When core stuff is in-house then you spend time, energy and money making sure you can sleep at night.

A couple of years ago you had expressed interest in making a port to App Engine, any interest in doing that still? Want any help? ;)

I think it would take a lot more time than we have to make that work. Our code is open source if you want to give a proof of concept a go. ;)

On the off chance a port to app engine coalesces around this comment, count me in :)

So I was half right? Awesome!

I was looking at EC2 until this!

I thought you could cluster your instances across many regions and replicate blah blah blah and change your elastic ip addresses in instances like this?

Is this a case that it's not being utilised or does that system not work?

I appreciate you are busy right now so I'm not expecting a reply any time soon.

That is the theory, but all of our data is currently locked in the inaccessible EBS system.

I'd still say Amazon is a great place for your startup, just don't use EBS.

I thought you could snapshot drives across regions and bring those EBS drives up under new instances in new regions?

I have not used it all in detail yet so I don't know the practicality of this method.

I think I will stick to my co-location costings I am doing for the time being. There is only one person to rely on when it all goes wrong then!

Good luck getting it sorted, I know I wouldn't appreciate being up at 3am sorting it though

> I thought you could snapshot drives across regions and bring those EBS drives up under new instances in new regions?

In theory, yes. In practice, those snapshots hurt the volume so much that it is impossible to take one in production.

Interesting, your insight has given me a lot to think about.

Do you guys blog this anywhere?

Yeah, usually they are on our blog right after the downtime, or in /r/announcements on reddit.

Or have a fallback, maybe?

Yeah, I'm one of the reddit admins.

Assuming the problem is indeed with EBS, I would say this should be a warning sign to anyone considering going with a PaaS provider, which Amazon is quickly becoming, instead of an IaaS provider like Slicehost or Linode.

The increased complexity of their offering makes it more likely that things will break, leaving you locked in.

I did a 15 minute talk on the subject, which you can check out here: http://iforum.com.ua/video-2011-tech-podsechin

EDIT: here are the slides if you can't bother watching the video http://bit.ly/eqDNei

Every time someone makes the claim that downtime should be a warning sign about going with a PaaS provider (or, indeed, an IaaS provider, or in some cases, people even make this claim about going with someone else's Data Center) - I always respond: "And why do you believe that you would do any better?"

Every environment I've been involved in as an operations professional for the last 15 years has experienced downtime regardless of how much we invested in staging, testing, change control procedures, redundancy and ITIL methodology.

While it is true that there are some environments in the world that experience no, or close to no downtime (The NYSE, google.com, the Space Shuttle) - it's usually not worth the investments those organizations make in ensuring 100% uptime.

I'm more interested in seeing what the monitoring methodology is, what the response protocols are, and what the reported downtime has been over the last year, than I am about getting too concerned about the occasional outage.


The thing to remember is that even if you could do better -- and it is certainly possible -- your customers might not pay you for it. It will cost you, a lot, to build a significantly better system than AWS. Reliability is about redundancy, and redundancy is synonymous with paying for things that sit idle for years at a time, and paying for the hardware and personnel to run disaster drills over and over, and customers hate that. Reliability is also about avoiding excessive complexity and limiting the rate at which you change things, and customers hate that too. If they wanted highly reliable, slow-changing established tech they'd be using land-line phones, not Quora or Reddit.

I agree with your sentiments, but this:

> I always respond: "And why do you believe that you would do any better?"

…is an apples-to-oranges comparison that implies that the thing that Amazon is currently doing is what you would be doing in an Amazon-less scenario. It's not.

Amazon hosts thousands of customers and needs to service all of them in the same infrastructure, which coincidentally is also shared between every customer by virtue of being virtualized. It's a complex structure in which every resource -- CPU, RAM, disk, network -- is shared, which means that even a minor EBS problem could potentially have a butterfly-effect propagating through multiple customers' hosts. There are limitations imposed by the structure; for example, the only way to access large amounts of block storage with local-like performance is through EBS. But EBS has been shown to be high-latency and flaky, and Reddit's chronic problems with EBS are a good example of why one should probably stay away from it entirely. Network latency on EC2 is also pretty horrendous compared to classic non-virtual setups; I have set up HAProxy clusters where the connection setup time to backend hosts would be measured in tens of milliseconds.

Now, if you hosted your own stuff, you would be the only customer, and you would not be sharing resources with anyone else, and you could design your hardware exactly to your specifications (fast local disks with low latency, fast networking with low latency and so on). The difference in complexity is significant. There are tons of challenges and costs involved in hosting your own stuff and maintaining uptime, but the complexity equation is different.

The problem with Amazon is that despite touting an open API, their infrastructure internals and practices are a trade secret, so the likes of Eucalyptus are having to play catchup. In other words, I cannot replicate their infrastructure in my own data center, even if I had the money to pay them. I suspect this is the main reason that Heroku didn't move off Amazon, and not the fact that Amazon was providing them great value for money.

There is definitely value in platforms, but those platforms should be built on completely open standards with no vendor lock in or influence, as is the case with OpenStack for example: http://www.theregister.co.uk/2011/02/10/rackspace_buys_opens...

If you didn't watch my talk, I should point out that I'm working on Akshell (http://www.akshell.com), which is itself a platform provider, so I am in fact agreeing with you.

Compared to PaaS providers, Amazon is easy to migrate off of because they are simply giving you virtual hardware.

The basics of their setup are well known: Xen and Linux. EBS is some sort of block-based network storage. True, we do not know the backend storage setup, but it does not really matter. The Xen VM simply sees a disk device. NetApp, EMC, Dell, HP and many others all have products that offer similar functionality, including snapshots.

The only part that is potentially hard to migrate off is the security groups, and then only if you are using named groups. If you are just using IP based rules, most any firewall would work.

That's where I think we have to give credit to Rackspace for open sourcing all (or much) of their tech behind Rackspace Cloud

Maybe that's true for an individual, but from a broader (societal/economic) view this is a bigger problem, because it affects many more people.

Let's hope they don't lose any of my data.

I don't see EBS as a very PaaS type service. It is a virtualised network storage volume with snapshotting, just what you might expect from an IaaS provider. Clearly it is one of the more complex parts of the Amazon infrastructure though (other providers don't provide anything similar). If you don't use EBS many applications would have to implement a lot of its functionality themselves, and I suspect this would lead to some data loss instead, when drives fail or instances fail leaving drives unavailable.

You can avoid EBS if you use local drives and s3, it is just difficult to run a database like this if it was not built that way.

From EngineYard: "It looks like EBS IO in the us-east-1 region is not working ideally at this point. That means all /data and /db Volumes which use EBS have bad IO performance, which can cause your sites to go down."

They better start writing their explanation now. Multiple AZ's affected?

They better be cutting me a check too.

As an advertiser on Reddit, you guys better be cutting me a check too :)

(although my app is also dead right now so not as if the advertising would do me any good...)

Send me the link to your ad (when we come back up) and I'll comp you a day.

Isn't it paid on impressions?

I don't know how the sidebar ads work, but the featured links at the top of the page are run on a sort of auction system. Everybody that wants a piece pays however much they want (minimum of $20) for the day, then all the ads are totaled up and each ad gets a percentage of pageviews corresponding to its percentage of that day's revenue.

If that wasn't clear (and I'm not sure it was), assume you and I were both the only advertisers on reddit for a day. If I pay $20 and you pay $20, we would both have our ads displayed on 50% of pageviews for that day. If instead I paid $80 and you still only paid $20, then I would get 80% of pageviews to your 20%, regardless of the total pageviews for the day.
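For anyone who wants the arithmetic spelled out, here is a rough sketch of the proportional split described above. It is only an illustration of the rule as I described it, not reddit's actual code; the function name `pageview_shares` and the advertiser names are made up, and the $20 minimum is the figure quoted above.

```python
MIN_BID = 20  # minimum daily spend in dollars, as quoted in the comment above


def pageview_shares(bids):
    """Split pageviews among advertisers in proportion to what each paid.

    `bids` maps a (hypothetical) advertiser name to dollars paid for the day.
    """
    for name, amount in bids.items():
        if amount < MIN_BID:
            raise ValueError(f"{name} is below the ${MIN_BID} minimum")
    total = sum(bids.values())
    return {name: amount / total for name, amount in bids.items()}


# The two cases from the comment above.
equal = pageview_shares({"me": 20, "you": 20})
skewed = pageview_shares({"me": 80, "you": 20})
print(equal)   # each advertiser gets half of the pageviews
print(skewed)  # "me" gets 80%, "you" gets 20%
```

The shares depend only on the ratio of the bids, which is why the total number of pageviews for the day never enters into it.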

Assuming the outage is only for a day, right?

Had our blog go down. Didn't realize it was AWS wide..did a reboot. Now I am in reboot limbo. Put an urgent ticket into Amazon. They just said they are working urgently to fix the issues. Let's see how long this goes.

In case anyone is late to the party and missed the non-green lights on the AWS status dashboard, here is the page as of about 9:30 EDT...


Given that Heroku's parent company (Salesforce) owns a cloud platform, it seems kinda inevitable now that Heroku will perhaps sooner-than-later switch back-ends (or at least use both)

Everyone talks about SLAs but I believe it doesn't consider the fact that the EBS vols are still up (not on fire, and available) and are phantom writing or that the network is queued up the wazoo so writes don't even happen in a timely manner as you'd expect.

I'm not sure that just because they are up, yet unusable, would negate an SLA.

You could have a dedicated server in a datacenter - if the network goes out, your machine is still up and happily waiting to serve requests - but it's still unusable and not actually in service.

So do we get some credit on our AWS accounts? I haven't really read their SLA for EC2.

I did the math on it, with the downtime so far it's almost approaching the magic 99.95% barrier where everyone gets a 10% bill credit. Wouldn't that be something to light a fire under the collective arses of those in charge of keeping EBS stable.
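For reference, the math is roughly the following. This is a back-of-the-envelope sketch that assumes the 99.95% figure is measured over a plain 365-day year; how Amazon actually counts an outage toward that number is a separate question, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (an assumption)


def downtime_budget_hours(uptime_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in the period before uptime drops below uptime_pct."""
    return (1 - uptime_pct / 100) * period_hours


budget = downtime_budget_hours(99.95)
print(f"{budget:.2f} hours of downtime allowed per year")
```

At 99.95% that works out to a bit under four and a half hours a year, so a single outage of this length can use up most of the annual budget on its own.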

> Wouldn't that be something to light a fire under the collective arses of those in charge of keeping EBS stable.

Yeah, the only possible explanation is that Amazon's put a bunch of lazy guys in charge of EBS. How hard a job could it be?

I'm certain they're plenty talented and hard-working, but there's got to be something in their way. Just like the guy said as he was leaving Reddit, management seems to be pointing fingers in different directions as to why this is going on and how it can be resolved. It's sad, though, because I don't want to think of Amazon as a company with a potentially toxic corporate culture. I enjoy my free Prime membership as a student :D

Amazon's corporate culture works pretty well. The problem is that these large-scale multi-tenant services are new technology and very difficult to create and run.

"In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below."


To receive a Service Credit, you must submit a request by sending an e-mail message to aws-sla-request @ amazon.com.

Being unable to get much done here, my co-workers have found other things to do in the office: http://www.youtube.com/watch?v=u1-oGxDHQbI :-P

What did I just watch and why?

All hosting services go down occasionally. If you want to stay up you need to build a fault-tolerant distributed system that spans multiple regions and potentially multiple providers.

Also, Amazon should fix EBS.

Ruh Roh. The service I'm using to acquire accommodation seems to be dependent on AWS. Guess I'm going to be homeless tomorrow if it doesn't get fixed. :X

Note to self. Don't ever build a service reliant on AWS.

If it's us (crashpadder.com) - sorry for the inconvenience. Email us hello [at] crashpadder.com and we'll do all we can when we're back up.

We'd moved to AWS from an even worse host a few months ago, and until this morning had been pretty impressed...

So what percentage of the top 1000 sites are now crippled by this?

It's definitely a limited outage. My three instances seem to have operated all night with no problem. Two of them are EBS instances.

Wish we were able to download our EBS snapshots, which are supposedly hosted on S3. What does everyone else do?

I take the snapshots. I also have servers send backups to each other each night. I also have a nightly cron job run and rotate backups of the most critical databases to an external drive on my home network. A Tonido Plug does that job (Ubuntu on a tiny ARM server in a plug that costs virtually nothing to run).

Now, some of the databases are simply too large or under too much load to take a live backup while the sites are running. Those I run on Amazon RDS with the MultiAZ feature enabled. There should be two copies of the database running at all times, both servers keeping a 3 day binlog for point-in-time backups, and making a nightly snapshot to S3. I have to rely on Amazon for that.

But I still take daily home backups of the most valuable individual tables off those servers, like user registrations and payment records. Even if I can't have off-site backups of the whole database, I'll have off-site copies of the part I'd need most in case of an Amazon-entirely-offline catastrophe.

It seems like availability zone us-east-1c is working; I can launch an EBS-backed instance right now.

Today, April 21st 2011, according to the "Terminator", Skynet was launched... No wonder AWS is down

1:23EST and Reddit is back up. Quora/4SQ still down. My site still down.

Thankfully, my major clients are using the Asia EC2 Instances!

reddit.com is down, but luckily http://radioreddit.com is not.

It's really mostly EBS failures, so the title is overly dramatic. And EBS has been known to have issues.

This is incorrect. EBS is part of the issue, as well as intermittent connectivity failures for EC2 instances.

We're also seeing connectivity issues with our elastic load balancers.

I've got 2 EC2 instances with EBS root devices -- both are unreachable. My RDS instance is responding, but it is very slow -- struggling to make a backup.

Yay for relying on the cloud \o/

Yes! It's like those dumb people that rent out 5 floors of a 100 story building, instead of building their own SkyScraper. Then, horror of horrors, 2 of the 8 lifts break down for 12 hours.

Such idiots for paying rent, even though building SkyScrapers isn't their core business.

Thank you. I'm totally stealing this next time someone uses that "cloud is for idiots" line.

This is an excellent metaphor.

Poor analogy; if the lifts break, just use the stairs.

Note that I've written "relying", as in "depending on the cloud without any alternative backup". Depending on the cloud means depending on something you cannot influence.

Ok, it's a bomb scare and the building's shutdown for 12 hours.

How I was supposed to get ' "relying", as in "depending on the cloud without any alternative backup" ' from what you originally wrote, I don't know.

I'm sorry if you didn't understand "relying" as I meant it. I'm not a native speaker, so my assumption of the meaning of "relying" may have been wrong.

I'll try to elaborate: yes, there are of course certain things that you cannot avoid. That's why I meant you should not "rely" on the cloud being available, in the sense of "depending on the cloud without any alternative backup". Redundancy, if your service is crucial!

If you're a small company, and cannot go to your building for a day, well then I hope you're not that important so no big damage (financial or otherwise) is caused. Just enjoy your free day ;-) But if there's a bomb threat in, for example, a hospital, I surely hope they have some kind of backup plan!

Cloud as an alternative mechanism, or maybe even the primary resource for your site -- great, why not! But please don't just throw everything into "the cloud", trusting that the cloud's operator cannot fail.

You said 6 words in your original post. And now you're trying to suggest I was to get the above 166 words from the original 6?

You raise perfectly valid and interesting points. It's just a shame they're not the parent comment because then they would have invoked some thoughtful discussion.

It's a shame your sarcastic reply derailed the conversation.

Fair call.

The building analogy works. If you're in a highrise, no elevator service is going to impact your business.

The backup analogy works too. If you're an oil company, Goldman Sachs, AT&T or something similar, you have the bucks to pay someone for an empty emergency office complex for critical staff.

If you're anyone else, you're screwed when the elevators break/street floods/bridge falls down. That doesn't mean that buying a building or building a dam is a viable business continuity strategy.

No, presumably if you're in a highrise, you are not running a retail business and your employees can work from home. It's an impact, but not a material one.

A better analogy would be setting up a retail shop on a busy street where every now and then, the police completely close off the street, and when you ask them why they've done it, they tell you that it's none of your concern.

Yet, people still continue to open shops on this street, that is, until the customers start going somewhere else.

What do you do when a building contractor cuts through your datacentre's link? For example, this happened in London a while back.

My point is: things go wrong. Blaming this on the cloud is silly.

It happened to ServInt back in 2004 - the VA Department of Transportation cut the MCI OC-48 and OC-12 fiber strands running to the ServInt datacenter. ServInt handled it amazingly well, learned from it going forward, and if I recall correctly gave a comp'd month to its customers. They followed up by adding additional backbones in addition to the MCI one and have been solid ever since.

That's like saying "yay for relying on a datacenter"

Note that I've written "relying", not "hosting", meaning I'm well aware of the fact that using "the cloud" as an additional layer of capacity can be a good thing.

But relying on the cloud means that you're giving up control.

Power outage in your data center? Buy diesel generators next time! Faulty network card? Put in two redundant ones!

AWS down? Wait and pray!

Go back to the building metaphor. Building datacenters is not our core business, so we outsource it. Much like building a building is not our core business, nor running a telephone system, so we outsource those things too. All business critical, but still outsourced.

Of course it's okay to outsource things; and it's also okay to outsource critical stuff. I'm doing that too.

But, outsourced or not, critical things should be as redundant as needed, and not relying on a single point of failure.

If in a hospital a life-keeping machine depends on electricity, I hope they have diesel generators available even if their outsourced power supply fails.

I think people thought they were outsourcing the job of not having a single point of failure.

Having a critical service supplied by someone does not necessarily mean there's a single point of failure, because that someone might be working to make sure there is no single point of failure.

No, building data centers is not your core business, but your core business relies on them so intimately that it makes sense (at least to some people) to have a little more direct control of your datacenter resources. Or to have a warm standby of some sort.

The best analogy I can make is it's like a telemarketing company. Their core business is not building/owning/maintaining PBXs, it's calling you and trying to sell you crap at dinner time. However, because they cannot deliver their core product reliably without a PBX, they tend to own/maintain their phone switches. They don't order 300 simple phone lines from the local telco, they order a couple of trunks, and manage phones and call routing to fulfill their core business needs.

It's not a perfect analogy, I know, there are cost factors unaddressed, but it's the best I can come up with right now.

I'm personally a fan of cloud services for elasticity, and early stage launching. I'm not a big fan for day to day operation of a site. The model I haven't seen worked up yet is: what does this REALLY cost a site like reddit or 4sq in relation to owning/controlling more of their own infrastructure? It's quite possible the best overall business decision is to take the risk and deal with the unpredictable, inevitable outages and just hope they don't last too long.

At the end of the day, you need to make money. The question becomes: can you afford to be in the datacenter business?

Executing well in operating a facility with significant scale means that you need to have folks working for you who have a clue about operating a datacenter. That's not cheap.

IMO, the real question is whether or not you should be running your core business on a genericized "public cloud" like Amazon or if you should be using a traditional server hosting outfit, a managed service provider like Verizon/AT&T or a colo. There's a value calculation that you have to compute to figure out what's right for you.

I worked for a company that ran call centers and had a significant commerce site (whose growth exploded beyond expectations) back in the early 2000's. Our offices were a converted school and the datacenter was a classroom with a roof A/C unit. We knew it sucked and suffered through downtime and failures, but the company wasn't big enough to do things the "right" way. So we had to do things the "wrong" way, because the alternative was to go out of business. If that were today, we would almost certainly have had 80% of our systems at Amazon/Rackspace facilities.

> Power outage in your data center? Buy diesel generators next time!

> AWS down? Wait and pray!

Wouldn't you be waiting and praying if your datacenter lost power and didn't have diesel generators, too?

Because startups can afford their own data centers?

It isn't about relying on the cloud. It is more about incompetent cloud providers.

Does seem harsh to brand Amazon as incompetent. Things like this probably happen to Google's server farm all the time, but since they're the only customer they can and do reconfigure everything massively to adapt.

Amazon can't do that to your application that runs on a handful of systems.

all is fine on that service page now.

Amazon down - just one more reason to try out BO.LT and their amazing CDN and page sharing services...

Just launched: http://techcrunch.com/2011/04/21/page-sharing-service-bo-lt-...

This is like witnessing your parents having sex while a kid: you sort of knew this is a possibility but it is a devastating blow to your belief system nevertheless.

The number of services I use that depend on Amazon is amazing. They have really become the utility company of the Web.
