Amazon Web Services are down (amazon.com)
560 points by yuvadam 2431 days ago | 332 comments



Some quotes regarding how Netflix handled this without interruptions:

"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." [1]

"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." [2]

[1] https://twitter.com/adrianco/status/61075904847282177

[2] https://twitter.com/adrianco/status/61076362680745984


"Cheaper than cost of being down." This is very insightful. Many of us look at the cost of multi-zone deployments and cringe, but it's a mathematics exercise: (0.05% * hours in a year) * (cost of being down per hour) = (expected cost of single-zone availability). Now just compare that to 2-3x your single-zone deployment cost. Don't forget that the cost of being down per hour should include lost customers as well.

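For anyone who wants to plug in their own numbers, here is a rough Python sketch of that arithmetic. It reads the ".05" as 0.05% (the complement of the 99.95% SLA figure discussed elsewhere in the thread) and assumes, as a simplification, that the multi-zone deployment avoids the outages entirely. The function names and the dollar figures are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def expected_downtime_cost(unavailability, cost_per_hour):
    """Expected yearly cost of outages, given the fraction of time spent down."""
    return unavailability * HOURS_PER_YEAR * cost_per_hour


def multi_zone_pays_off(unavailability, cost_per_hour, single_zone_cost, multiplier):
    """True when the extra yearly spend is below the expected outage cost.

    Simplifying assumption: the multi-zone deployment avoids the outages entirely.
    """
    extra_spend = single_zone_cost * (multiplier - 1)
    return extra_spend < expected_downtime_cost(unavailability, cost_per_hour)


if __name__ == "__main__":
    # Hypothetical figures: 0.05% downtime, $10,000 lost per hour down,
    # $20,000/year for one zone, 2x that for a multi-zone setup.
    print(expected_downtime_cost(0.0005, 10_000))
    print(multi_zone_pays_off(0.0005, 10_000, 20_000, 2))
```

The cost-per-hour input is the hard part: it should fold in lost customers, not just lost transactions.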

At their level of income, this is true.

For us, we are just now staffing up to the level where we can make the changes necessary to do the same thing.


I think it's incredible that you guys can run a site at all with the few people you've got. Hope it all gets better again soon.


I would also be shocked if Amazon isn't giving Netflix preferred pricing because it's such a high-profile customer.


Netflix pays standard rates for instances but uses reserved instances to pay less on bulk EC2 deployments


Are you looking to diversify across ebs or set up dedicated hosting?


It's a strange algebra though; doesn't it mean the WORSE Amazon's uptime is, the more money you should give them?


More accurately, the more unstable your infrastructure is, the more you will need to spend to ensure stability.


Spending more on AWS to increase reliability isn't necessarily a benefit to Amazon. The increased costs can make them less competitive.


I'm actually surprised if incurring 50% extra hardware costs really is cheaper than the cost of being down. If Netflix is down for a few hours, then it costs them some goodwill, and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.


Actually, they do, and Netflix proactively refunds customers for downtime. Usually it's pennies on the dollar, but I've had more than one refund for sub-30-minute outages that prevented me from using the service.

Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.

When you're disrupting a market, remember that what seems trivial is in fact hugely important when you're fighting huge well-established competition :)


I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and they pull something I or my wife regularly watch. (looking at you, "Paint Your Wagon")

I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.


There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.
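The 60% to 90% jump is just the same load spread over fewer zones. A minimal sketch, assuming load is balanced evenly across zones and redistributes evenly when one is lost (the function name is hypothetical):

```python
def utilization_after_zone_loss(utilization, zones, zones_lost=1):
    """Per-zone utilization once the load of the lost zones is redistributed.

    Assumes load is spread evenly across zones before and after the failure.
    """
    remaining = zones - zones_lost
    if remaining <= 0:
        raise ValueError("no zones left to absorb the load")
    return utilization * zones / remaining


if __name__ == "__main__":
    # Three zones at 60% busy, one zone lost: roughly 90% on the other two.
    print(utilization_after_zone_loss(0.60, zones=3))
    # At the low end of the 30-60% range the survivors sit at roughly 45%.
    print(utilization_after_zone_loss(0.30, zones=3))
```

Run the same function with two zones and the survivor would need to absorb double its load, which is why three zones need less spare capacity per zone than two.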


Obviously you haven't been around my wife when she loses the last 5 minutes of a show. SLA or no, services will get cancelled.


I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.

This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.


Also, when there are service interruptions, they send out credits to customers.


Every decision in a business is like this - measure the cost of action A versus the cost of not-A. It's just rare that, as in this case, those costs are easily quantifiable.


Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.

And if they do mean three regions, can the cost of spanning various regions be quantified for different companies? The money spent vs. money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs between regions may also differ vastly for different types of companies and infrastructures, leading to varying costs to maintain a site in multiple regions.


More comments coming from Adrian Cockcroft:

1. See slides 32-35 of http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011

2. "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util."

https://twitter.com/#!/adrianco/status/61089202229624832


Here's the 24h latency data on EC2 east, west, eu, apac: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-24h.png

Last 60 minutes comparison data: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-60m.png

time in GMT.

A study we (Cedexis) did in January comparing multiple ec2 zones and other cloud providers: (pdf) http://dl.dropbox.com/u/1898990/76-marty-kagan.pdf


Pure opinion: That convergence might show that Amazon tried to do a failover on a DC level. Once they figured that wouldn't work or east was down for the count they just let it cycle to the ground under latency.


Yes - it is all business decisions. As someone said already, an instance on AWS can cost up to 7X a machine you own in co-location. Here is how Outbrain manages its multi-datacenter architecture while saving on disaster recovery headroom: http://techblog.outbrain.com/2011/04/lego-bricks-our-data-ce...


Amazon's EC2 SLA is extremely clear - a given region has an availability of 99.95%. If you're running a website and you haven't deployed across more than one region then, by definition, your website will have 99.95% availability. If you want a higher level of availability use more than one region.

Amazon's EBS SLA is less clear, but they state that they expect an annual failure rate of 0.1-0.5%, compared to commodity hard-drive failure rates of 4%. Hence, if you wanted a higher level of data availability you'd use more than one EBS volume in different regions.

These outages are affecting North America, and not Europe and Asia Pacific. That's it. Why is this even news? Were you expecting 100% availability?


    Amazon's EC2 SLA is extremely clear -
    a given region has an availability of 99.95%.
    If you're running a website and you haven't
    deployed across more than one region then,
    by definition, your website will have 99.95%
    availability. If you want a higher level of
    availability use more than one region.
Good point.

let P(region fails) = 0.05% and let's assume (and hope) that the probability of failure of one region is independent of the state of the other regions.

P(two regions fail) = P(one region fails and another region fails) = P(region fails) * P(region fails) = 0.05% * 0.05% = 0.0025%

Making your availability = 100% - 0.0025% = 99.9975%

Ultimately it's more of a business decision if you want to pay for the extra 0.0475% of availability. I would think (or hope) that most engineers would want it anyway.

The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?


"let's assume (and hope) the probability of failure of one region is independent of the state of the other regions."

In practice, that's not true, and it's untrue enough to ruin the entire rest of your calculations. For Amazon regions to fail independently, they'd have to be actually, factually independent: there could be no interaction between them, and the reaction to one node going down could never be to increase the load on other nodes as people migrate services, etc. There's fundamentally nothing you can do about the fact that if enough of your capacity goes out then you will experience demand in excess of supply.

If you want true redundancy you will at the very least need to go to another entirely separate service that is not Amazon... and if enough people do that, they'll break the effective independence of that arrangement, too.

(This is a special case of a more general rule, which is that computers are generally so reliable that the ways in which their probabilities deviate from Gaussian or independence tend to dominate your worst-case calculations.)


I agree with you 100% that they're not independent, but I don't know enough about the data to model the probabilities of failure and availability in a HN comment :-)

After today's event, it would certainly be interesting to see how resource consumption changed in other availability zones and at other providers during this outage.

I wonder if that could be measured passively? What I mean is, by monitoring response times of various services that are known to be in specific regions and seeing how that metric changes (as opposed to waiting on a party that has little-to-no economic benefit to release that information.)


No, your decimal point is off. 0.05% * 0.05% = 0.0005 * 0.0005 = 0.00000025, or 0.000025%. It works out to an expected downtime of 8 seconds per year, instead of over 4 hours for one location.
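Doing the multiplication on fractions rather than percentages avoids the slip. A quick sketch of the corrected arithmetic, still under the (questionable) assumption that regions fail independently; the function names are just for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years


def joint_unavailability(p_region, regions=2):
    """Probability that all regions are down at once, assuming independence."""
    return p_region ** regions


def downtime_seconds_per_year(unavailability):
    """Expected seconds of downtime per year for a given unavailability fraction."""
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    p = 0.0005  # 0.05% written as a fraction, not a percentage
    both_down = joint_unavailability(p)
    print(both_down)                                # about 2.5e-07, i.e. 0.000025%
    print(downtime_seconds_per_year(both_down))     # about 8 seconds
    print(downtime_seconds_per_year(p) / 3600)      # about 4.4 hours for one region
```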

Of course, redundancy doesn't set itself up, so there are added costs on top of Amazon.


Thank you for this. I don't know why I tried doing the math without converting the percents to decimals; I should have known better.


Why wouldn't a simple expected value calculation work? You've shown that you can calculate the extra availability that subscribing to another region provides. Simply multiply the cost of an outage by the extra availability provided by an additional region that would have prevented that outage.

If expanding to another region costs more than just taking the outage, then it's categorically not a good option. If management still says no in the face of numbers that suggest yes, then that tells you that you're missing a hidden objection, and how you proceed will depend on a lot of factors specific to your situation.


I think you're right, that would be the best way of presenting this argument to management. To do so, however, the company would need to calculate its Total Cost of Downtime (which probably isn't very complex for many companies) which is its own subject entirely IMO.


> calculate its Total Cost of Downtime (which probably isn't very complex for many companies)

Not complex even factoring in reputational damage?


> The numbers at this size appear insignificant. How would one (say an engineer) convince "the management" that the extra 0.0475% of availability is worth the investment/expense?

In businesses where physical goods are sold to customers, "the management" is generally very motivated to avoid stock-out situations in which sales are lost due to lack of inventory (even if it's only a very small percentage.) The reason for this is because they are concerned about the potential loss of customer goodwill. It seems that the same applies in this situation.


Minor nitpick, but the availability should be even better, since 1% * 1% = 0.01% the availability becomes 99.999975% - six nines, anyone?


Reddit's been down for several hours today, I'm sure they are already way lower than 99.95%.


0.05% of one year is 4 hours, 22 minutes and 48 seconds.
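That figure is easy to reproduce for other percentages or billing periods. A small sketch, assuming a 365-day year (the function name is made up):

```python
def downtime_budget(unavailability, days=365):
    """Allowed downtime over a period as (hours, minutes, seconds)."""
    total_seconds = round(unavailability * days * 24 * 3600)
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


if __name__ == "__main__":
    print(downtime_budget(0.0005))           # 0.05% of a year
    print(downtime_budget(0.005))            # 0.5% of a year
    print(downtime_budget(0.0005, days=30))  # 0.05% of a 30-day billing period
```

Measured per 30-day period instead of per year, the same 0.05% leaves only about 21 minutes.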


Reddit experienced some issues with Amazon a month ago that resulted in the site being down for almost a day. I'm pretty sure they're way below that percentage.


That's conflating SLAs again, Reddit's long-running problems have been with EBS reliability.


Even so, .5% downtime per year is about 44 hours, and Reddit's definitely had more downtime in the last few months than that.

Of course, that's assuming that 100% of Reddit's problems were due to EBS only and not a combination of EBS, EC2 and their own code.


Is this really standard practice for measuring the SLA? The contracts I've seen for a couple small businesses are generally per billable period.

Which always made sense to me. I pay you for 99.5% uptime this month. If you don't achieve it, then I get a discount, as simple as that. If your availability is below that, I don't pay full price for the billable period and then reconcile at the end of the year.

If anyone has links or general advice on this topic, I'd be interested in finding out whether there's a general consensus that it's done differently.


Don't forget to take in to account ALL of Reddit's downtime; there is quite a bit of it.


Which means their entire quota for this year is all gone.


and then some...


If the Reddit web server admins took availability seriously they would have chosen to deploy across more than one region. Do you disagree? Why do you disagree? I'm being honest, no snark involved in my questions.


He wasn't suggesting that all Reddit's problems are due to Amazon services, he was using Reddit's down time today as a data point illustrating that the uptime guarantee claimed for the service has not been kept this year (in fact a whole year's "permitted downtime" as implied by the 99.95% SLA may be eaten on one day). Presumably Amazon will be handing out some refunds and other compensation (assuming the SLA isn't of the toothless "it'll be up at least 99.95% of the time, unless it isn't" variety).

Perhaps the Reddit admins decided that "up to 0.05%" downtime permitted by the SLA would be acceptable, compared to the extra expense of using more of Amazon's services (and any coding/testing time they may have needed to take advantage of the redundancy depending on how automatic the load balancing and/or failover are within Amazon's system) to improve their redundancy. By my understanding the promise isn't 99.95% if you use more than one of our locations, it is 99.95% at any one location, so the fact that Reddit don't make use of more than one location is irrelevant when talking about the one location they do use not meeting the expectations listed in the SLA.

I'm not saying Reddit's implementation decision is right (I don't have the metrics available to make such a judgement) but it would have been made based partly on that 99.95% figure and how much they trusted Amazon's method of coming to that figure as a reliability they could guarantee. If I had paid money for a service with a 99.95% SLA, unless the SLA had no teeth, I would be expecting some redress at this point (though there is probably no use nagging Amazon about that right now: let them concentrate on fixing the problem and worry about explanations/blame/compo later once things are running again).


Very few cloud SLAs seem to have teeth to me. Amazon's SLA gives service credit equal to 10% of your total bill for the billing period if they blow past the 0.05%. This is a lot better than some cloud providers that will simply prorate the downtime, but pretty crappy in terms of actual business compensation. It's equivalent to a sales discount almost any organization with a sales staff could write without thinking about it - meaning Amazon is still making money on every customer even when they've blown past their SLA - assuming every single customer fills out the forms to apply for the discount. Hint: Many won't; see mail-in rebates.

A number of tier 1 network providers offer certain customers SLAs that are clearly in place to prove that they invest in redundancy and disaster planning. ex: less than 99.99% --> 10% credit. less than 99.90% --> no charges for the circuit in the billing period.

This reflects an understanding that downtime can hurt your business/infrastructure far in excess of the measurable percentage.
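To make the difference concrete, here is a sketch of the two credit schemes as described in this comment. The thresholds are taken only from the figures quoted above, not from any actual contract text, and the function names are hypothetical.

```python
def flat_credit(availability, threshold=0.9995, credit=0.10):
    """Flat scheme as described above: a fixed 10% credit once below 99.95%."""
    return credit if availability < threshold else 0.0


def tiered_credit(availability):
    """Two-tier scheme from the example above.

    Below 99.90%: the whole billing period is free.
    Below 99.99%: 10% credit.
    """
    if availability < 0.9990:
        return 1.0
    if availability < 0.9999:
        return 0.10
    return 0.0


if __name__ == "__main__":
    # Compare the credit fraction each scheme pays at a few availability levels.
    for availability in (0.99995, 0.9995, 0.9990, 0.99):
        print(availability, flat_credit(availability), tiered_credit(availability))
```

Under the flat scheme a very bad month costs the provider the same 10% as a marginal miss; under the tiered one it costs the whole bill.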


> If the Reddit web server admins took availability seriously

We do.

> they would have chosen to deploy across more than one region.

It's far too costly to do that. We are deployed across multiple AZs, but this failure hit multiple AZs.


Why is it more expensive to deploy in zones X,Y in regions A,B than zones M,N in region C? I assume you don't just mean "US West is ~10% more expensive than US East."


It's the combination of the extra cost of having machines in US West plus the cost of keeping the data synchronized between them (which is a lot) plus the added development overhead of making sure that things work cross region.

We'll get there one day, but we aren't there yet.


> but this failure hit multiple AZs.

Gah! You can't always account for all the failure modes that Amazon might have.


This seems to be a prevalent misunderstanding. Amazon's EC2 SLA of 99.95% applies at the scope of a region. A region may contain more than one availability zone. Hence, deploying on multiple availability zones still only affords the 99.95% availability level.

Yes, multi-region availability on AWS is hideously expensive. However, some organisations value an availability of greater than 99.95% enough to warrant such a multi-region deployment. Clearly reddit, and many, many other AWS users, do not. This isn't a value call on my part; I definitely couldn't afford the inter-region data transfer costs, all I know is that AWS offers you the tools to deploy high availability web services.


Why does Reddit really need 99% availability? Is a customer unduly harmed or is the world even worse off if Reddit is down for a couple cumulative days per year? Is it worth the cost? Would you put up with more ads and/or pay for Reddit in order to make sure that it's available 24/7/365?


Probably not as much for the customer as for the company. When sites are unreliable, people end up going to the more reliable competitors as they arise.


I wouldn't think there are a large number of customers deciding "this is too unreliable, I'm leaving" on the basis of a few hours of down time. On the other hand, there might be a large number of people who, upon finding your site down, decide to visit alternatives that are up at the time, and some of those people might decide they like the alternatives better.


If any site can take down time and not lose users, reddit can. And it has.


Not qualified to speak about what Reddit should or should not do about the arrangement with Amazon. I have read several posts, including one by an (ex) Reddit employee saying Amazon is not delivering what they said they would, that much is clear. I really doubt all their downtime is part of the SLA.


The whole point of AWS is to forget about maintaining hardware infrastructure.

Amazon are the ones who should have made backups in multiple regions, and transferred the load on failure.


Actually, the whole point of AWS is to have options for using hardware that you don't own. They don't offer any magic "all your stuff in one package, guaranteed to work all the time" service. So yes, you do still need to think about your hardware infrastructure. You just don't have to own it.

And Amazon does have all their stuff available in multiple regions. It's up to you to use it though.


> The whole point of AWS is to forget about maintaining hardware infrastructure.

If that were the case, you wouldn't be presented with region and availability zone options.


Then it would be much more expensive at the bottom tiers, meaning I wouldn't be able to play with it on a whim without thinking about the money. That would suck.


"[S]hould" is the wrong word here. Clearly, they don't maintain such backups. This is clear to anyone using their service. They pay for the service anyway, so apparently it's still worth it to them, even without auto-backups.

Would it make sense for Amazon to maintain automatic backups (and potentially charge more for them)? I don't know. It might make business sense, it might not. But their service is apparently popular enough even without it.


Not sure if I could say Amazon should be doing that - but I'd love it if other value added providers (such as Heroku) could implement this.


http://cache-scale.appspot.com/c/www.reddit.com/

That's a cache of it. I really wish that the admins at reddit would implement something like this themselves, then link to it when downtime like this happens.


They do have a read-only mode, don't they? I'm not sure why they don't enable read-only mode when things like this happen. It may be that Amazon's service being down forbids this. I dunno.


They have a read-only mode for "free" which is their akamai cache that the unlogged in users see.


do you have any stats of this app in terms of total data stored, bandwidth required per day, requests per second?


I don't, unfortunately, because I didn't write it :(


Reddit being down is not news.


That hurts. But you're right, we've had a lot of issues.

I think the reason this is news is because it is a massive Amazon failure.


Reddit gets a lot of grief for stability issues but the fact is it is an immensely popular site that a huge number of people have a close affiliation to. A massive percentage of these people spend a significant portion of their day browsing Reddit, interacting with other redditors, etc. and for the site to be down for as long as it has is news, regardless of similar issues occurring in the past.

The main reason this is news is because this is an Amazon issue but also because tens of thousands of people who frequent the site regularly are now aimlessly browsing the internet in the hopes of finding alternative lulz and in my case some of us are even getting work done. shudder


I can't imagine how frustrating the jobs of the Reddit admins must be.


Is admin supposed to be plural? I mean, do they really have multiple system admins now? I ask, only because I know people have been coming and going recently.

Frankly, for the size of the site, they do really, really well for the limited resources they have.


I meant admins in the more general purpose sense of administrators, people who are paid to maintain the system. But yeah I agree, quality:resources ratio is really really high.


It's usually very rewarding. The awesome community is what keeps me doing it.


Awesome, I just figured out that you kept the votes tallied during the 'downtime'. What's interesting is how clearly good and bad submissions were dichotomized when nobody had anything else to vote on.


You made Google App Engine managers cry once: http://www.theregister.co.uk/2009/07/06/dziuba_google_app_en...

Anything coming up for amazon? If not for anything else, for pure entertainment value!


No. Not really. But we kind of miss it anyway.


Note also that 0.1-0.5% refers to irrecoverable data loss, not temporary unavailability.


Current status: bad things are happening in the North Virginia datacenter.

EC2, EBS and RDS are all down on US-east-1.

Edit: Heroku, Foursquare, Quora and Reddit are all experiencing subsequent issues.


Indeed they are. Right before the issues began, I pushed a bad update to one of my Heroku apps, causing it to crash. A minute later I fixed the bug, re-pushed the git repo to Heroku... and nothing. I've been stuck with an error message on my website for hours. Unfortunate timing!


Staging servers are an easy thing on Heroku :)


You have my sincere sympathy.


http://status.heroku.com/ for those on Heroku


Is it just me, or is their status page down?


I know you probably have your answer by now, but this site always helps me out when I have that question:

http://www.downforeveryoneorjustme.com


It's been on-and-off for the past few hours.. It's up right now (for me, anyway).


This morning from 5a - 6a Pacific time I was able to access my Heroku app just fine.


I can access all my heroku apps that have their own DNS. Anything with a .heroku.com subdomain is down for me. Frustrating, knowing that the apps are still running but aren't routable.


Yay, cloud.


Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to make a useful point about the practicality or otherwise of massively virtualised systems for webapp deployment, please do. It's going to take more than two words though.


I guess you haven't seen the Microsoft ads about the cloud? http://www.youtube.com/watch?v=Lel3swo4RMc

Anyway, my bad, I was just trying to make a joke to lighten up the mood. Sorry.


You're right, I hadn't. Fair enough.


I laughed, it's relevant and puts things in perspective if you had seen the ads. So it's not really content free even if it's just two words.


Please do not make content-free posts such as this. It adds no value to the conversation and is only noise.

If you actually wish to bitch about a post that we all got the point of, you're going to need more than two paragraphs.


Foursquare is up now. Quora not showing the 503 anymore, Reddit still down.


We rely heavily on EBS still, so this is hurting us more than most others. Hopefully they'll have us back up soon.


You guys might have answered this in one of your AMAs/blog posts (or was it raldi who commented?), but what options can reddit resort to should this stuff happen again to this degree of severity?


We're moving away from the EBS product altogether. The hard part is dealing with the master databases. Normally I'd have a master database with a built in raid-10, but I can't do that on EC2, so I have to come up with another option.

So I guess that is the long way of saying that hopefully it won't happen again.


I do not believe you could be effective by moving away from EBS, you know without giving up quite a bit.

Doing things the right way with EC2 means using EBS. It's the brake caliper to the rotor. Sure you could have drum brakes but they're nowhere near as effective, as they quickly get heat soaked. I'm referring to S3.

One shouldn't trust ephemeral storage. Your instance can go down at any time. Write speeds to S3 are not nearly as fast as ephemeral or EBS arrays (raid).

Hate to say it, but if one cannot trust EBS then what the heck are 'we' doing on EC2... EBS quality should be priority one, otherwise we're all building skyscrapers on foam foundations of candy cane rebar.


I can't say whether much has changed within the last year, but when I worked at FathomDB we had serious issues with EBS. You couldn't trust it. Odd things would happen like disks getting stuck in a reattaching state for days and disks having poor performance.


How do you move away from EBS and still deal with large data?


Not sure what you had in mind by "large", but instance storage goes up to 1.7TB: http://aws.amazon.com/ec2/instance-types/


The reason Reddit uses RAID10 is for performance, not disk size. A single instance storage device is just too slow for the Reddit database.


Many instance types have 2 or 4 virtual disks (presumably on different physical disks).


I imagine they'd consider some combination of the following (sorted by most likely):

1. Sharding data
2. Pulling tables out to other servers from the main DB
3. Pruning excessive data
4. Compressing data


It still has to be stored somewhere though right? If it's EBS you've just made yourself a complicated solution that will eventually fail all over again. No?


If the data is sharded, then the data/server is small enough to fit within the individual server's disk and you no longer need EBS to store it.


NOTE: I work for Gluster.

We have had a lot of success stabilizing EBS by creating mdadm arrays out of lots of smaller EBS volumes. There are minimal additional costs and you can get better performance, stability, and protection (RAID 5, 6).

Gluster makes an OSS distributed filesystem that runs across availability zones, our AMI (not OSS) builds multiple RAID arrays on each instance then spreads the filesystem across instances in multiple AZs. Send me an email if you want to chat.


Please tell us how you plan on moving 700 EBS volumes to something completely different. It sounds amazing.


Foursquare is up, but Quora is still showing the 503.


I stand corrected. 4sq down again.


Not all EC2 & EBS instances are down. I have several in US-east-1a and 1 is down, while all of the others are working.


Same with us. About 10% of our 700+ volumes are having problems right now.

It's hard to tell for sure since there isn't any load.


We're fine on EC2 -- but everything on RDS seems to be giving us big problems. We started a few backups before we knew it was systemic, and all of them are stuck at 0%. We also tried spinning up new instances -- and they're all stuck in booting.


I think which physical data center "us-east-1a" etc. corresponds to differs from user to user, to load-balance given that people will probably be more likely to use 1a than the other zones.


We had about 45 min of downtime around 4am EST. Our RDS instances, EBS backed and normal instances all returned without problems. We are in Virginia us-east-1a and us-east-1b.


Also Cuorizini (http://cuorizini.heroku.com) is down!


4/21/2011 is "Judgement Day" when Skynet becomes self aware and tries to kill us all. http://terminator.wikia.com/wiki/2011/04/21

I am just a little freaked out right now.


Don't worry. If skynet is in EC2, we'll be fine.


You don't understand, Skynet is using all Amazon resources, hence the outages ;-)

Amazon have stated many times that amazon.com itself runs mostly on AWS platform, but it works fine now ...


AWS platform on a private cloud, it is not the same as AWS platform for us commoners.


I wonder if a virus or worm whose activation date was today, because of the fact above, found its way into the Amazon servers.

Probably just a coincidence.


Do we all regret letting GLaDOS reboot yet?


Sigh, Portal jokes were so much more popular on Reddit. :)


HN really isn't the place for internet memes, jokes about pop culture, and things that are judged trivial / frivolous. Part of what makes the HN community what it is, is a focus on high-quality, reasoned, rational discourse. IOW: HN != Reddit


A couple of hours into the failure, and no sign of coverage on Techcrunch (they're posting "business" stories though). It shows how detached Techcrunch has become from the startup world.

Edit: I tweeted their European editor about it and he's posted a story up now.


Perhaps this isn't really news. These days it's normal.


It's ugly, but true enough. You don't have to like it to acknowledge it. It's just another cloud outage bringing down one or more high profile sites. It's a "dog bites man" story.


This feels the same way as hearing that the whole Internet just got shut down.


I guess this is one Reddit outage that can't be blamed on poor scaling


Thankfully, no. :)

But yeah, right now we're shutting everything down to try and avoid possible data corruption. Once they restore service, hopefully we'll be able to come back quickly.


Hey Jedberg, if you guys aren't already rolling your own, check out fdr's WAL-E tool. It bounces postgres write-ahead logs off S3 and goes great with the new PG9 replication.

https://github.com/heroku/WAL-E


Thanks for this. I had designed and partially implemented this exact same thing. Do you know of this running in production anywhere?


Amazon is really not being kind to you guys; I sort of hope you'll find an alternative solution fast!


If I was Rackspace, I'd be at Reddit/Wired's headquarters already.


Didn't Jedberg say that they could reduce the failures by spending more with Amazon?

I wonder if Rackspace really want this particular traffic burden. It seems that if Reddit choose not to pay for the load they need then you get lots of bad press for it ... perhaps I'm seeing it wrong.

Rubbish analogy: Kinda like if I was doing a haulage business and you called out for a wheelbarrow to carry some elephants, then when the barrows broke we got bad press regardless. If you'd paid for a heavy animal transport package ... OK it's all going wrong, you get the idea.


No, I said that we have spent all we can, and at this point we need development.

However, in this case, the outage is not because of any issues with our setup, but with Amazon.


>"we have spent all we can"

So is it a financial constraint with Amazon? Would you be suffering the same sorts of outages regardless of the technology on the backend or does AWS basically suck?


You have the wrong end of the stick, because you're missing the history of the story. Reddit have a weird budget when it comes to staffing costs versus operating costs due to their parent company's policies as a media company - so they have a decent budget but are massively understaffed.

Statements like the one you're quoting are in that context. Let's say you have an unlimited operating budget - you can come up with all kinds of wonderful plans for massive redundancy and zero downtime. But you can't make that happen if you're not allowed to hire any engineers or sysadmins! As far as I'm aware reddit are paying Amazon mucho dinero but still having irredeemable problems with the storage product, EBS. They are stuck on an unreliable service without the manpower to move off.

That's the story, as far as I can piece together from comments here and on reddit.


Ah, you see from what I read on reddit I understood that the staff shortage was simply part of Conde Nast's unwillingness to spend money on reddit and that constant downtime issues were another facet of that same problem.

It's not making money and those looking after reddit don't want to ruin it with a huge money grab - instead taking a soft approach, first just begging for money, then adding in a subscription model (freemium anyone?) and more subtle advertising by way of sponsored reddits (/r/yourCompany'sProduct type stuff).

I understand they've been hit with more staff problems just recently despite having a new [systems?] engineer start with them.

So in your view EBS is the problem regardless of finance? That was the nut I was attempting to crack. TBH I didn't expect someone at reddit to stick their neck out and say "yeah Amazon sucks" but they might have confirmed that the converse was true and they were simply lacking the necessary finance to support the massive userbase they have.


Rackspace (and really all the "popular" US hosters) seem ridiculously expensive compared to hosting prices we have in Germany (see e.g. http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr... this is one of the biggest root server hosters in Germany).

Is this really so, or are Rackspace and co. just "boutique" offerings?


Will you guys ever do something like this:

http://cache-scale.appspot.com/c/www.reddit.com/

but officially supported (and paid for) by you?


Sup jedberg, I obviously don't have nearly the level of knowledge with the intricacies of reddit, but coming from a strictly "business" standpoint, the amount of downtime reddit receives due to amazon issues is astounding. Perhaps it's time to look for alternatives?

Anyway, thanks for your time.


SHUT. DOWN. EVERYTHING.

Good plan though. :p


My original comment was "We're going Madagascar on the servers." Then I remembered I was on HN, not reddit. :)


>Then I remembered I was on HN, not reddit.

Right now we may as well be on reddit.


[deleted]


Don't do that. HN is not the place for memes.

reddit isn't either, but we lost that battle a while ago.


It's refreshing for me to see you say this. God speed soldier.


You don't need to flee the country yet.


Hey man.. some of us would have got it :)


President Madagascar is doin' cloud biz, too?


Reminds me of that scene in Jurassic Park


Good lord. What's with all the DVs??


Ha. HN really on its high horse today.


I know right? I made a hilarious joke a little while ago and jedberg yelled at me and everyone downvoted me. And I'll be very surprised if this comment doesn't get downvoted to hell too.

EDIT: I also simply greeted jedberg, and a bunch of people thought that was a good reason to downvote. Do people think there's an imminent influx of redditors, and that they have to dissuade them from becoming HNers? I don't think that's the case.

EDIT: Fuckin' called it.


[deleted]


I'm here all the time. :)


Apparently most of their problems are caused by bad EBS writes/performance, or at least so they said a few weeks ago after some particularly bad downtime.

It looks like EBS will randomly decide to switch to a few bps of performance from time to time. I would use Amazon for my startup, but these issues really make it hard to justify.


EBS seems to be the main problem here. I'll cite a former reddit employee (first comment on the blog post that talked about EBS problems).

I don't work for reddit anymore (as of about a week ago, although I didn't get as much fanfare as raldi did), but I can tell you that they're giving Amazon too much credit here. Amazon's EBSs are a barrel of laughs in terms of performance and reliability and are a constant (and the single largest) source of failure across reddit. reddit's been in talks with Amazon all the way up to CIOs about ways to fix them for nearly a year and they've constantly been making promises that they haven't been keeping, passing us to new people (that "will finally be able to fix it"), and variously otherwise been desperately trying to keep reddit while not actually earning it.

Source: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_d...

cache: - scroll down for comment - http://webcache.googleusercontent.com/search?q=cache:cfbs-sp...


EC2 instances only have one network interface. The public IP address you have pointing to your instance is a DNAT done somewhere further up the chain.

If you get a large network load to your instance - say, a DDoS attack - you can find you no longer have enough network capacity to talk to your EBS disks.

This is what happened to Bitbucket in 2009: http://blog.bitbucket.org/2009/10/04/on-our-extended-downtim...


This doesn't appear to be the issue here, though. valisystem's link mentions that it wasn't an interface issue, EBS is just shit, apparently.


Slightly offtopic, but wasn't that post by an ex-employee as a comment? Not that the technical aspect of it wasn't fantastic, because it was, but I don't think Reddit said anything publicly, did they?


It was, you are right, I misremembered. valisystem's comment above contains the reference.


Yes valisystem linked to it. All good, I wasn't sure if I misremembered.


Looks like troubles in only one availability zone.


That seems to be incorrect. We have problem children in us-east-1b, -1c, and -1d.


AWS randomize the zones per account, so "your" -1b is not necessarily the same as "my" -1b. I'm only seeing problems in my -1c. Are you seeing 3 zones failing all under the same account?

If multiple AZs are down, AWS are going to have some serious explaining to do...
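A toy sketch of why comparing zone letters across accounts can mislead. The account names, zone labels on the right-hand side, and the mapping itself are all invented for illustration; nothing here reflects how AWS actually stores or exposes the mapping.

```python
# Hypothetical mapping: each account gets its own shuffled labels
# for the same underlying zones. All names below are made up.
ACCOUNT_LABELS = {
    "account-1": {
        "us-east-1a": "zone-A",
        "us-east-1b": "zone-B",
        "us-east-1c": "zone-C",
        "us-east-1d": "zone-D",
    },
    "account-2": {
        "us-east-1a": "zone-C",
        "us-east-1b": "zone-A",
        "us-east-1c": "zone-D",
        "us-east-1d": "zone-B",
    },
}


def same_physical_zone(acct_a, label_a, acct_b, label_b):
    """True if two (account, label) pairs point at the same underlying zone."""
    return ACCOUNT_LABELS[acct_a][label_a] == ACCOUNT_LABELS[acct_b][label_b]
```

Under this made-up mapping, "my" -1b and "your" -1b are different zones, while account-1's -1c and account-2's -1a are the same one, so reports of "problems in -1c" from two accounts may or may not describe the same failure.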


Amazon RDS's most expensive feature is automatic, instant Multi-AZ failover to protect against this kind of situation. It's not working quite like that, which the AWS status page acknowledges. This is a major failure.


"AWS randomize the zones per account"

Interesting. This is news to me.

I found some more info here (Google cache since alestic.com is presently unreachable): http://webcache.googleusercontent.com/search?q=cache:0jxzyFj...


We (reddit) are seeing failures in all zones.


If memory serves, Amazon's reporting of what zones are experiencing problems has been...optimistic...in the past.


AWS have now confirmed that this affects multiple availability zones. From the status page: "..impacting EBS volumes in multiple availability zones in the US-EAST-1 region"


That's not good. The whole point of multiple AZs is for them to not fail at the same time. Suggests some dependencies that should not be there perhaps, or at least some correlation of something, like software upgrades. Looking for a good explanation of this; one AZ going down is not a problem and should not impact anyone who is load balancing.


How does Reddit display the 'offline' page if it's down?


The server is still up, so we can serve it right out of the load balancer.


Are you able to enable the 'read-only-mode' using the same method?


I was under the impression that a great deal of Reddit's issues were linked to Amazon.


Why is ELB not mentioned at all on the Service Health Dashboard?

We're experiencing problems with two of our ELBs, one indicating instance health as out of service, reporting "a transient error occurred". Another, new LB (what we hoped would replace the first problematic LB), reports: "instance registration is still in progress".

A support issue with Amazon indicated that it was related to the ongoing issues and to monitor the Service Health Dashboard. But, as I mentioned before, ELB isn't mentioned at all.


We've got a single non-responsive load balancer IP in one of our primary ELBs (others have been fine for several hours now), so while everything else for us is up & running, we still have transient errors for folks who get shunted through that one system.

The interesting thing about the ELB in a situation like this is that I believe it may, in many instances, be better to hobble along and deal with an elevated error rate if at least some of your ELB hosts are working than to re-create the entire ELB somewhere else, especially if you're a high-traffic site where you may hit scaling issues going from 0 to 60 in milliseconds (OMMV, but we've been spooked enough in the past not to try anything hasty until things get back to normal).


We have an identical load balancer to one that is causing problems so we're lucky enough to reroute traffic through that one instead to get to the same boxes. (The boxes serve two different APIs through two different DNS CNAMEs so we split the ELBs for future and sanity). In this case, it's helped us out. Alternatively, we would've just routed all traffic to our west coast ELBs.


Quote from the AWS support rep: "I can confirm that ELB has been affected by the EBS issue despite the lack of messaging on the AWS Dashboard".


Quora says: "We'd point fingers, but we wouldn't be where we are today without EC2."


Nice way to point fingers while saying you're not.


Yes, that was by far my favorite comment. Well played.


I just launched a site on Heroku yesterday and cranked up the dynos up in anticipation of some "launch" traffic. Now, I can't log in to switch them off. Thanks EC2, you owe me $$$s


Actually, I'd expect Heroku to not charge for when the site was down, as they are clearly not available, it does not sound fair if they charge for it.

Am I expecting too much from them?


My app on heroku is running, it's just that I can't log in to their management console to de-allocate resources that I am paying for by the hour.


If I were you, I'd send them an email requesting this. At the end of the day, it's their responsibility to keep the console available. I will be more than unimpressed if they don't see the logic here.


Isn't that a design flaw in Heroku? Shouldn't you be able to log into Heroku and change stuff like that even if the entirety of Amazon's cloud service is down?


Agree. You can delegate work but not responsibility.


> Nothing special-case here: we deploy with git push, just like any other Heroku user. Dogfooding is good for you. http://blog.heroku.com/archives/2009/4/1/fork_our_docs/


That's all well and good, but it's no use for their customers if/when Amazon goes down.


I think this is a good example of how the "cloud" is not a silver bullet to making your site always up. AWS provides a way to keep it up, but it is up to each developer to ensure that they are using AWS in a way to make sure their site can handle problems in one availability zone.

I think we will see more of a focus from big users of AWS about focusing on how to create a redundant service using AWS. Or at least I hope we will!


All well and good, but the elephant in the room is that multiple availability zones have failed at the same time. It looks like AWS have a single point of failure they weren't previously aware of.


This outage is affecting all AZs in the East. So even a multizone setup wouldn't help for this one. Only a multiregion setup.

This outage is a lot like having your entire datacenter lose power.
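A minimal sketch of that point, assuming a failover routine that walks an ordered list of (region, zone) targets. The function names, the deployment lists, and the simulated health check are all hypothetical, not anyone's real setup.

```python
def first_healthy(deployments, is_healthy):
    """Return the first (region, zone) that passes the health check, or None."""
    for target in deployments:
        if is_healthy(*target):
            return target
    return None


def healthy_during_outage(region, zone):
    """Simulated health check: every zone in us-east-1 is unhealthy."""
    return region != "us-east-1"


# Hypothetical deployments: three zones in one region, versus the same
# three zones plus one target in a second region.
multi_az = [("us-east-1", "a"), ("us-east-1", "b"), ("us-east-1", "c")]
multi_region = multi_az + [("us-west-1", "a")]
```

With these inputs, `first_healthy(multi_az, healthy_during_outage)` finds nothing to fail over to, while the `multi_region` list still yields the us-west-1 target. The sketch ignores the hard part, which is getting the data to the other region in the first place.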


I thought AZs were supposed to be different physical data centers.

If that is not the case, then having a multi-region setup would be a necessity for any major sites on AWS.

Perhaps there will be a time where to truly be redundant, one would need to use multiple cloud providers. Which would be a _huge_ pain to do now I imagine, with all the provider lock-ins we have.


> I thought AZs were supposed to be different physical data centers.

They are. Which means this is probably a software issue or some other systemic issue.


Well that isn't supposed to happen :(

I think I'll go home and wait it out there, but it appears that they are having some progress in recovering it. But our site is still affected.


If "multi-region" means North America, Europe and Asia Pacific, doing so would also improve world-wide latency (e.g. here in Australia...).

Could you use this outage to justify switching to multi-region?


A blog post last month touched on this: "Q: Why is reddit tied so tightly to the affected availability zone?

A: When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area. Luckily, we are currently in a hiring round which will increase the technical staff by 200% :) These new programmers will help us address this issue."

Not sure if the costs of data transfer between regions (charged at full internet price) would justify the added reliability/lower latency though.


I don't know reddit, but concurrency is very hard with high latency.


NetFlix should sell their chaos monkey as a commercial product.


Don't let AWS hear that, or they'll charge us for their failures by rebranding it as a feature.


Instead of enumerating who's down, I'd be more interested to hear about those that survived the AWS failure. We could learn something from them.


Quora is down, and evidently "They're not pointing fingers at EC2" -- http://news.ycombinator.com/item?id=2470119 -- I was going to post a screen shot, but evidently my Dropbox is down too.


Holy crap. An Amazon rep actually just posted that SkyNet had nothing to do with the outage:

https://forums.aws.amazon.com/message.jspa?messageID=238872#...


I'm seeing 1 EBS server out of 9 having issues (5 in one availability zone, 4 in another). CPU wait time on the instance is stuck at 100% on all cores since the disk isn't responding. Sounds like others are having much more trouble.
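For anyone wanting to check the same symptom, here is a rough sketch of computing I/O wait from two samples of a `cpu` line in Linux's /proc/stat. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the commenter did not say which tool they used, and the helper names are made up.

```python
def parse_cpu_line(line):
    """Parse a /proc/stat 'cpu' line into a list of cumulative tick counters."""
    fields = line.split()
    # fields[0] is the label ("cpu", "cpu0", ...); the rest are counters in
    # the order: user nice system idle iowait irq softirq steal ...
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(b, a)]
    # Sum only the first eight counters; guest time is already
    # included in user time, so adding it would double count.
    total = sum(deltas[:8])
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

A value pinned near 100 across all cores while the disk is silent matches the "stuck at 100%" description: the CPUs are idle, just waiting on I/O that never completes.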


Silver lining: Hopefully I can test my "aws is failing" fallback code. (my GAE based site keeps a state log on S3 for the day when GAE falls in a hole.)


This code should be well tested by now. Amazon is doing you a favour by being rubbish.


AWS/S3 has become the new Windows - great SPOF to go for if you want to attack. This space needs more competition.


Two years ago TechCrunch was publishing an article every time Rackspace went down listing all the hot startups down along with it. AWS is no more a SPOF than any other major hosting provider.


[deleted]


It is more noticeable since so many large sites use it. Kind of like a MS BSOD was a common joke because so many people used MS Windows.

If we were all having our own rental servers then.... well, many sites that we know wouldn't be around :)


http://venuetastic.com/ - feel bad for these guys. They launched yesterday and down today because of AWS. Murphy's law in practice.


Wow. I can only imagine the intense frustration the site owner must be feeling right about now. Makes you really stop and question the whole "cloud" based service. Or at least should make you realize you need fall-backs other than the cloud service itself.


They are scaling in the cloud, at least.


So when big sites use Amazon Web Services for major traffic, do they get a serious customer relationship? Or is it just generic email/web support and a status page?


There are subscriptions for various levels of support, from $50/mo for 12 hour response time to $15k/mo for 15 minute response time.

http://aws.amazon.com/premiumsupport/


Interesting pricing. The Platinum seems priced to have no one use it, considering how much of a jump it is over Gold.


I would say, rather, that it is priced to have very specific sorts of customer using it.

The relationship between the pricing tiers changes fairly drastically, depending on how much you are already spending on Amazon Web Services. Gold, for instance, starts out at 4x the price of Silver support, but by the time you're spending 80K/month on services, it's only a $900 premium (and stays there no matter how much bigger your bill is). At the $150K/month level, it's a 2x jump from Gold to Platinum, which may or may not be a huge jump, considering the extra level of service you get.


It's a bit ironic that Amazon WS has become a SPoF for half the internet.


Yes, they are. :(


My four-day weekend is already off to a bad start(UK here).


Mine is worse. I booked Tues-Thurs off. I only have internet in work at the moment. I'm going to miss reddit now and be without internet until I return to work on the 3rd of May. Stupid Sky and their stupid take forever switch overs.


It could be worse/better, you could have an Australian 5-day weekend.


Or a South African 11 day weekend. Easter, and then public holidays on 27 April and 1/2 May are combining this year to provide a massive holiday opportunity.


UK Universities give staff the Tuesday too, giving them a five day weekend. I took two days holiday next week and because of Easter and the Royal Wedding that gives me 11 days off straight.


To clarify, this Monday is ANZAC Day, a commemorative holiday for troops who fought for Australia. Because that's also Easter Monday, the ANZAC Day holiday is moved to Tuesday, despite commemorative services being held on Monday.


Actually the official line is that Easter Monday got moved to Tuesday.


Interesting. Thanks. What did the Catholic Church have to say about that? Is it that Easter Monday is still on Monday, but the holiday is on Tuesday?


Easter is the holiday, and it's on Sunday. I'm pretty sure the Pope doesn't care much what people do the day after Easter (or the day after that).


Ah, thanks. I misread the wikipage, and didn't realise.


Is it a 4-day holiday in Norway too?! Opera Support Forums have been sketchy for hours and if they won't come back until Tuesday, I'm SOL with my Opera problems.


5-day


Why do you get a 4 day weekend?


Whichever date Easter Sunday falls on, the Friday before it (Good Friday) and Monday after (Easter Monday) are both public holidays.


Getting the 26,27,28th off from work is just genius for a long break!


Easter

Royal Wedding

May Day


Easter, probably.


no reddit at work today!


Believe me, the last thing I want is to be up at 3am working on this. I'd much rather be sleeping and letting you not work.


Are you part of the team resolving it?


"Resolving" probably isn't the right word seeing as this is a purely amazon issue, not much they can do.

I'm guessing jedberg is mostly banging his head in walls and seriously looking at alternative hosting solutions right now.


Actually, I'm on the couch in front of the fireplace, watching old SNL, waiting patiently for Amazon to fix their shit, and figuring out how we can not use EBS anymore.


I'm pretty sure redditors would be more than happy to deal with a day or so extra downtime as you guys switched to a better platform. Just leave a simple page up saying "Dumping Amazon, brb"... doubt you'd get many complaints.


TBH Amazon is so bad at this point that turning off Reddit is as good as trying to keep it running. Of course then you need to deal with the increased suicide rate.


Right on.

Periodically, most recently a couple of days ago, there's a post / discussion about whether outsourcing core functionality is the right thing to do. There are valid points on both sides of the issue.

For my part, if I'm going to be up in the middle of the night I'd rather be up working on fixing something rather than up fretting and checking status. But either way things get fixed. The real difference comes in the following days and weeks. When core stuff is in the cloud then you can try to get assurances and such, fwiw. When core stuff is in-house then you spend time, energy and money making sure you can sleep at night.


A couple of years ago you had expressed interest in making a port to App Engine, any interest in doing that still? Want any help? ;)


I think it would take a lot more time than we have to make that work. Our code is open source if you want to give a proof of concept a go. ;)


On the off chance a port to app engine coalesces around this comment, count me in :)


So I was half right? Awesome!


I was looking at EC2 until this!

I thought you could cluster your instances across many regions and replicate blah blah blah and change your elastic ip addresses in instances like this?

Is this a case that it's not being utilised or does that system not work?

I appreciate you are busy right now so I'm not expecting a reply any time soon.


That is the theory, but all of our data is currently locked in the inaccessible EBS system.

I'd still say Amazon is a great place for your startup, just don't use EBS.


I thought you could snapshot drives across regions and bring those EBS drives up under new instances in new regions?

I have not used it all in detail yet so I don't know the practicality of this method.

I think I will stick to my co-location costings I am doing for the time being. There is only one person to rely on when it all goes wrong then!

Good luck getting it sorted. I know I wouldn't appreciate being up at 3am sorting it though.


> I thought you could snapshot drives across regions and bring those EBS drives up under new instances in new regions?

In theory, yes. In practice, those snapshots hurt the volume so much that it is impossible to take one in production.


Interesting, your insight has given me a lot to think about.

Do you guys blog this anywhere?


Yeah, usually they are on our blog right after the downtime, or in /r/announcements on reddit.


Or have a fallback, maybe?


Yeah, I'm one of the reddit admins.
