Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more (venturebeat.com)
130 points by amnigos on June 30, 2012 | 33 comments

From the article - "The outage underscores the vulnerabilities of depending on the public cloud versus using your own data centers."

No it doesn't. It underscores the vulnerabilities of not understanding your hosting and accepting the "no outages" slogans of ANY cloud. A single data centre is always susceptible to outages like this, it doesn't matter who owns it. If any of those sites had owned a single data centre that was hit by storm damage, the impact would be the same. I know this is supposed to be the year of the cloud backlash but even so...

Exactly. Anyone running their own data center would be subject to exactly this kind of outage, and wouldn't have access to the many, many resources that cloud providers make available to ensure successful fail-over.

Yes, I can see the "many, many resources" Amazon is making available to you in this situation. Did they send you a rubber ball that you can squeeze while waiting for your servers to come back up?

They said fail-over. Those "many, many resources" are other regions and zones.

I thought both Heroku & Netflix had fairly robust multi-AZ deployments, so I'm hoping they share any of their learnings from this outage.

Either way, that quote is ridiculous.

Netflix does, but considering the time of day, running out of capacity after losing a zone might have been more of an issue.
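
To put rough numbers on the capacity point: a back-of-the-envelope sketch, assuming traffic is spread evenly across zones and simply shifts to the survivors. The zone counts and utilization figures are illustrative, not Netflix's actual numbers, and the function names are made up for this example.

```python
def load_after_zone_loss(zones: int, utilization: float, lost: int = 1) -> float:
    """Per-zone utilization after `lost` zones fail.

    Assumes (hypothetically) that every zone has equal capacity and that
    the failed zones' traffic is redistributed evenly to the survivors.
    """
    if not 0 < lost < zones:
        raise ValueError("must lose at least one zone and keep at least one")
    return utilization * zones / (zones - lost)


def max_safe_utilization(zones: int, lost: int = 1) -> float:
    """Highest steady-state utilization that still fits after a failure."""
    if not 0 < lost < zones:
        raise ValueError("must lose at least one zone and keep at least one")
    return (zones - lost) / zones


if __name__ == "__main__":
    for peak in (0.5, 0.6, 0.7):
        after = load_after_zone_loss(zones=3, utilization=peak)
        status = "ok" if after <= 1.0 else "over capacity"
        print(f"3 zones at {peak:.0%} -> survivors at {after:.0%} ({status})")
```

With three zones, anything above roughly two-thirds utilization at peak leaves the two survivors over capacity, which is why a multi-AZ deployment can still fall over at a busy hour.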

Definitely a possibility - I was actually watching Netflix when my phone started rattling with all the alerts.

I was able to finish out the episode, so their CDN was working for the actual media, but everything else was dead for me.

Another useless anecdote: A coworker was watching on his xbox, and it apparently cut mid-stream for him.

Heroku doesn't. They are on US-East.

Heroku is single-AZ within US-East, not even multiple-AZ.

Sorry but that is not even remotely true. We are hosted across all availability zones in the US-EAST region.

Ah, I didn't know that. Sorry. (I don't remember where I heard about single-AZ but it seemed consistent with observation)

Is Heroku resilient against single-AZ failure (so only some subset of customers go down, and then it restarts), or is it exposed so that if any AZ goes down, core stuff also goes down? The sites I care about on Heroku seem to go down whenever any US-East badness happens at all, even if it is "limited to a single AZ" per Amazon.

Actually it shows the inability to rely on Amazon. A storm causes a complete disaster? Where are the redundant connections? The independent power supply? Heck, where are the storm drains?

That's sort of what some large storms can do, especially when they come with a large amount of electrical activity. You can easily suffer all kinds of damage to the power system that generators can't deal with.

That's exactly why they offer multiple zones. You can't make a single zone of anything 100% fault-tolerant.

It's funny, the more I think about it, the more I think this is actually a good reason to host on the cloud. From a technical standpoint it's terrifying to see all these big players down at once. But what the average user likely sees is "something is wrong with the internet". So rather than seeing that your site X is down and being angry with you, users are likely to think "well, Instagram is also down, oh and so is Netflix, something big must be broken, I'll check back later", the same way users don't blame you if the power goes out.

Interesting thought. A few users may be sympathetic because the actual problem is somewhat visible, but I'm not sure that is a real benefit. It is of course a negative perception when they see that YouTube is up and then perceive all the down sites as being less technically competent.

It's interesting how the AWS outage didn't take down Amazon.com.

Actually, temporarily it did. At least for me, amazon.com was unreachable for about 5 minutes yesterday when I first read about this. It was around 10pm EST, I believe.

That would be because Amazon.com doesn't use AWS.

While what you say was certainly true when AWS launched, today it is not. Much of Amazon is hosted in an EC2 VPC connected to the traditional prod network. This includes all of the retail website and (today) many of the services that power said website.

Other AWS services are also used frequently and heavily. AWS use is strongly encouraged for any new projects.

No, it doesn't. Even the name servers of Amazon.com belong to UltraDNS and Dynect, instead of their own Route 53.

Route 53 uses UltraDNS.

No, that's wrong.

> "I also wanted to clarify that Route 53 is an Amazon-built and operated service. It is not a re-branding of a third party DNS service. Over time you'll see various parts of Amazon move over to use Route 53."


I've always found the double-standard amusing :)

If you are hosting on a server (as everyone is), it will at some point fail. You have to choose a service that has minimal failure combined with quick resolution times. I think AWS fits this description...

AWS has had far more failures than my servers at any data center ever have. Running in 'the cloud', you're taking all the unavoidable points of failure (power, network, hardware) and adding in a bunch of proprietary ones (all the software that manages EC2, EBS, ELBs, internal routing between them, etc) that have all failed spectacularly at least once already with hours- to days-long resolution times.

Yes, risk still exists, and the risk profile shifts a little, but I find the shift to be for the better. Here's an anecdote:

I run applications on EC2 and RDS. I'm using Oracle. AWS has recently introduced Multi-AZ Oracle, but I haven't enabled it yet. Before it was available, though, I set up a poor-man's procedure that consists of running data exports and dropping them on S3.

Now, when everything went to hell in the east, I lost an RDS instance. I couldn't do point-in-time restore, and I couldn't snapshot (both are still pending since 7 AM or so).

Luckily, I was able to spin up an RDS instance in the west, pull down the latest data from S3, and do an import. I repointed my apps at the new database, and now I'm back up.
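
A minimal sketch of the "pull down the latest data from S3" step described above, assuming the exports are stored under timestamped keys. The key layout (`exports/<timestamp>.dmp`) and the function name are hypothetical, not the commenter's actual scripts, and the code works on a plain list of key names rather than calling any AWS API.

```python
from datetime import datetime
from typing import Iterable, Optional

# Hypothetical naming scheme for export objects, e.g.
# "exports/2012-06-29T23-00.dmp". Not taken from the original comment.
STAMP_FORMAT = "%Y-%m-%dT%H-%M"


def latest_export(
    keys: Iterable[str], prefix: str = "exports/", suffix: str = ".dmp"
) -> Optional[str]:
    """Return the key of the newest export, or None if there is none.

    Keys that do not match the prefix/suffix or whose timestamp cannot
    be parsed are ignored rather than treated as errors.
    """
    best_key: Optional[str] = None
    best_time: Optional[datetime] = None
    for key in keys:
        if not (key.startswith(prefix) and key.endswith(suffix)):
            continue
        stamp = key[len(prefix):-len(suffix)]
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue
        if best_time is None or when > best_time:
            best_key, best_time = key, when
    return best_key


if __name__ == "__main__":
    listing = [
        "exports/2012-06-28T23-00.dmp",
        "exports/2012-06-29T23-00.dmp",
        "exports/notes.txt",
    ]
    print(latest_export(listing))
```

Choosing the dump by parsed timestamp rather than by listing order keeps the restore from silently picking a stale export; the import into the new database and the app repointing would follow as separate steps.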

The process took about 45 minutes. Setting up the backup scripts took about 20 minutes about 2 years ago. Now I'm just sitting on my hands waiting for the AWS ops team to fix everything. This is work I'd normally be scrambling to do myself. I'm quite happy to let those talented folks deal with it. When it's all back up and running, I'll check integrity and consistency, and I might have to restore some interim data, but for now I'm operational.

I'm sure there are worse scenarios, but the major outage last year and the one in the past 24 hours were both quite easily mitigated.

There's something to be said for being part of a giant machine. AWS really is utility computing, so even the small guys get the benefit by virtue of standing next to the big guys.

Instagram is still down! However Netflix & Pinterest seem to be back.

Google, MS, Rackspace etc. ought to give a good look at all the middle layer libraries like boto, and support them to make it a matter of configuration to switch cloud service providers.

This already works well for email providers.

Here's to keeping all the eggs in one basket!

Just use Rackspace already, they hardly ever have an outage.

If Pinterest is down, then there may be a net gain for the internet.
