
Some quotes regarding how Netflix handled this without interruptions:

"Netflix showed some increased latency, internal alarms went off but hasn't had a service outage." [1]

"Netflix is deployed in three zones, sized to lose one and keep going. Cheaper than cost of being down." [2]

[1] https://twitter.com/adrianco/status/61075904847282177

[2] https://twitter.com/adrianco/status/61076362680745984




"Cheaper than cost of being down." This is very insightful. Many of us look at the cost of multi zone deployments and cringe, but its a mathematics exercise. (.05 * hours in a year)*(cost of being down per hour) = (expected cost of single zone availability). Now just compare to 2-3x your single zone deployment cost. Don't forget the cost of being down per hour should include lost customers as well.


At their level of income, this is true.

For us, we are just now staffing up to the level where we can make the changes necessary to do the same thing.


I think it's incredible that you guys can run a site at all with the few people you've got. Hope it all gets better again soon.


I would also be shocked if Amazon isn't giving Netflix preferred pricing because it's such a high-profile customer.


Netflix pays standard rates for instances but uses reserved instances to pay less on bulk EC2 deployments.


Are you looking to diversify across EBS or set up dedicated hosting?


It's a strange algebra though; doesn't it mean the WORSE Amazon's uptime is, the more money you should give them?


More accurately, the more unstable your infrastructure is, the more you will need to spend to ensure stability.


Spending more on AWS to increase reliability isn't necessarily a benefit to Amazon. The increased costs can make them less competitive.


I'd actually be surprised if incurring 50% extra hardware cost really is cheaper than the cost of being down. If Netflix is down for a few hours, it costs them some goodwill and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.


Actually, they do, and Netflix proactively refunds customers for downtime. Usually it's pennies on the dollar, but I've had more than one refund for sub-30-minute outages that prevented me from using the service.

Netflix is very, very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) when they rely on the cable company's network to deliver their service. If the service goes down, the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.

When you're disrupting a market, remember that what seems trivial is in fact hugely important when you're fighting huge, well-established competition :)


I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and they pull something I or my wife regularly watch. (Looking at you, "Paint Your Wagon".)

I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.


There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.
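A quick check of that arithmetic (a sketch, assuming load is balanced evenly across the zones):

    zones = 3
    util_per_zone = 0.60
    total_load = zones * util_per_zone          # 1.8 zones' worth of work
    util_after_loss = total_load / (zones - 1)  # survivors absorb it: 1.8 / 2 = 0.90
    print(util_after_loss)                      # 0.9, i.e. 90% utilization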


Obviously you haven't been around my wife when she loses the last 5 minutes of a show. SLA or no, services will get cancelled.


I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.

This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.
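A rough way to see why the overhead shrinks (a sketch, assuming even load balancing and that you only need to survive the loss of any one zone):

    # capacity needed, relative to a single-zone baseline, to tolerate losing one of n zones
    for n in range(2, 12):
        overprovision = n / (n - 1)  # each surviving zone picks up 1/(n-1) extra load
        print(f"{n} zones -> {overprovision:.2f}x capacity ({(overprovision - 1):.0%} overhead)")

Two zones means 100% overhead, three means 50%, and by eleven zones it's down to 10%.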


Also, when there are service interruptions, they send out credits to customers.


Every decision in a business is like this - measure the cost of action A versus the cost of not-A. It's just rare for those costs to be as easily quantifiable as they are in this case.


Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.

And if they do mean three regions: can the cost of spanning multiple regions be quantified for different companies? The money spent vs. money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs between regions may also vastly differ for different types of companies and infrastructures, leading to very different costs to run a site across multiple regions.


More comments coming from Adrian Cockcroft:

1. See slides 32-35 of http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011

2. "Deploy in three AZ with no extra instances - target autoscale 30-60% util. You have 50% headroom for load spikes. Lose an AZ -> 90% util."

https://twitter.com/#!/adrianco/status/61089202229624832


Here's the 24h latency data on EC2 east, west, eu, apac: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-24h.png

Last 60 minutes comparison data: http://dl.dropbox.com/u/1898990/EC2-multiple-zones-60m.png

Times are in GMT.

A study we (Cedexis) did in January comparing multiple EC2 zones and other cloud providers (pdf): http://dl.dropbox.com/u/1898990/76-marty-kagan.pdf


Pure opinion: that convergence might show that Amazon tried to do a failover at the DC level. Once they figured out that wouldn't work, or that east was down for the count, they just let it cycle to the ground under latency.


Yes - it's all business decisions. As someone already said, an instance on AWS can cost up to 7x a machine you own in co-location. Here is how Outbrain manages its multi-datacenter architecture while saving on disaster recovery headroom: http://techblog.outbrain.com/2011/04/lego-bricks-our-data-ce...



