A Closer Look At The Christmas Eve Outage (netflix.com)
41 points by aaronbrethorst 1754 days ago | 21 comments



Currently in a battle with management about whether to launch on AWS or not. They want to, engineers don't. Engineers are largely driven by raw CPU performance. Management, well, they seem to be thinking "no one ever gets fired for picking..."

Anyways, when you read all the effort Netflix has put into their cloud architecture (1) and the hiccups they (and others) have...I just don't know what hope our small team of 5 has of success. It seems like, to succeed on the cloud, you really need to build your app specifically for it (which we haven't done!)

1 - http://techblog.netflix.com/2010/12/5-lessons-weve-learned-u...


"engineers don't. Engineers are largely driven by raw CPU performance"

There are several reasons not to use AWS. CPU is not one of them, especially given the choice of instance types EC2 offers (though at a price).

But unless you're doing something very, very CPU-intensive (heavy math such as CFD or integer programming), this is irrelevant. My bet is that you aren't.


I know I was vague, but we really are CPU-bound. We do some fairly heavy set and bitwise operations. Our smallest unit of work is about 3x-5x slower on AWS (while costing about 4x more than alternatives). This can't be further parallelized without a rewrite. The problem is compounded by the negative impact this has on concurrency - which can be solved by adding more machines, but that just makes AWS that much more expensive.

A different, less important system does image manipulation. This is also very CPU-sensitive.


Interesting.

Maybe I need to rephrase what I said: AWS instances are very bad at CPU, but the flexibility to add more instances and different instance types can compensate for that.

Yes, maybe you can try Linode or Rackspace to compare their CPU performance.


Really? I was just using AWS to run some load-testing bots against our game servers and found the CPUs, even in High-CPU instances, pitiful for the price.

The worst part about it is that the performance is so variable between instances. Some could comfortably run 1300 bots, while others slowed down so much after 750 that they started dropping bots.

Because of that, it's easier to just run 750 on each, which wastes resources.

My desktop PC can easily run 4000.


Yes, desktop PCs are much faster than an EC2 instance.

Maybe Linode or Rackspace (or another provider) is better for your case.

Still, apparently for your case you can just add more servers according to demand, which is the advantage of EC2.


CPU is a limiting factor for a lot of applications. And EC2's offerings are paltry until you get into the really expensive VMs. But it's not just CPU, it's the overall performance. Engineers want to run on hardware for many more reasons than CPU. And for most apps it's the rational choice. After all, you can build your own data center for the price you pay to rent some virtual machines.

tl;dr: With cloud services you pay thousands a month for _renting software_.


Yes, I agree. You must build not only for "the cloud" (whatever that means) but for your specific "cloud" provider like AWS. Then you're stuck with them. You can (and should) of course build a bridge to such services but there are some managers that just don't believe in bridges or anything rational at all. Bezos is a brilliant CEO, especially now that he's stopped pixel-pushing.

It wasn't always like this, but I'm just here to collect a paycheck. The fact that I'm really good at what I do is incidental and hardly comes into play as it's not usually called upon. Sad but true.


Chaos Monkey fights back :-) I wonder if there is a way for the CM team to do ELB outages. At some point you entertain the idea that Amazon goes offline, but probabilistically, is that similar to a magnitude 9.0 quake and a 30 m tsunami on the same day?


Correct. Probabilistically, I'd say it's the same as long as that earthquake is under the ocean. Yeah, you'll most likely get a 30 m tsunami with a magnitude 9.0 earthquake (I'm no geologist).

So yeah, AWS going down is something _every_ company that runs its services on AWS, including providers like Heroku, should account for in its architecture. It's not a matter of if, it's a matter of when. You _will_ have downtime.

Period.


Why is CPU performance on EC2 so terrible compared to dedicated servers?


"The Netflix Web site remained up throughout the incident, supporting sign up of new customers and streaming to Macs and PCs, although at times with higher latency and a likelihood of needing to retry. Over-all streaming playback via Macs and PCs was only slightly reduced from normal levels."

This is simply false. I tried Netflix streaming on both my Macs and my Roku and neither worked. The site may have been up, but the streaming was down for Macs (and any computers in general I assume), not just TV boxes like Roku.


http://dvd.netflix.com/ is down right now and has been every time I've checked today. "We're Sorry

The Netflix site is temporarily unavailable. Our engineers are working hard to bring the site back up as quickly as possible."

Of course, I'm sure it's not a priority over the streaming portion, for obvious reasons.


It's been up for me for at least a little while now, but I think it was down for me all afternoon (Eastern time zone). This supposedly only impacted "some" of the DVD customers. http://www.bloomberg.com/news/2012-12-31/netflix-says-some-c...


This, and their UI, which was overhauled a while ago into a horrible mess that is almost impenetrable even to seasoned UI/UX developers, are most likely due to a shortage of good engineers willing to go down to Los Gatos every day. I told them: one day a week. But big companies like this never learn until they're on the brink of collapse and it's too late. cough Blockbuster cough


A small number of users unable to stream out of several hundred thousand might reasonably be described as "slightly reduced from normal levels".


True, but false claims in a post-mortem from Netflix make me wonder if I should retain the service. What they should be offering is credit to those affected. Their excuse that people don't watch too many streams on Christmas Eve is lame even by big-company standards and makes no sense. My family was spending time together and we wanted to wind down with a movie. It was close to midnight (EST).


I was still able to use Netflix from my Mac throughout the problem; however, it would take multiple minutes for the stream to connect (it would get stuck at like 7%--or maybe it was actually 0%--while buffering, but with patience it would work).


Yeah, I tried that but it didn't work for me.


I really don't like the excuse-laden and sugar-coated style of this postmortem.


Seriously. Especially after all the Chaos Monkey articles. Their Chaos Monkey apparently doesn't have enough chaos, or is too stupid to hit load balancers? Shit, that's the first thing I'd knock out!



