"Cheaper than cost of being down."
This is very insightful. Many of us look at the cost of multi zone deployments and cringe, but its a mathematics exercise.
(.05 * hours in a year)*(cost of being down per hour) = (expected cost of single zone availability).
Now just compare to 2-3x your single zone deployment cost. Don't forget the cost of being down per hour should include lost customers as well.
I'm actually surprised if incurring 50% extra hardware costs really is cheaper than the cost of being down. If Netflix is down for a few hours, then it costs them some goodwill, and maybe a few new signups, but is the immediate revenue impact really that great? Most of Netflix's revenue comes from monthly subscriptions, and it's not like their customers have an SLA.
Actually, they do. and Netflix proactively refund customers for downtime. Usually it's pennies on the dollar, but i've had more than refund for sub 30 minute outages which have prohibited me from using the service.
Netflix are very very sensitive to this problem because it's much harder for them to sell against their biggest competitor (local cable) since they rely on the cable to deliver their service. If the service goes down, then the cable company can jump in and say, "You'll never lose the signal on our network" -- blatantly untrue, but it doesn't matter.
When you're disrupting a market, remember that what seem trivial is in fact hugely important when you're fighting huge well-established competition :)
I'd imagine that part of this cost is reputation. The only problem I have ever had with Netflix streaming is when an agreement runs out and the pull something I or my wife regularly watch. (looking at you, "Paint Your Wagon")
I have not had a single service issue with them, ever. They do a better job at reliably providing me with TV shows than the cable company does. That seems to be where they're looking to position themselves, and the reputation for always being there is hard to regain if you lose it.
There isn't a 50% extra hardware cost. You spread systems over three zones and run at the normal utilization levels of 30-60%. If you lose a zone while you are at 60% you will spike to 90% for a while, until you can deploy replacement systems in the remaining zones. Traffic spikes mean that you don't want to run more than 60% busy anyway.
I don't think the cost of expanding to other regions/AZs is necessarily linear such that adding a zone would incur 50% more costs. Going from one zone to two would probably look that way (or even one server to two), but when you start going from two to three or even 10 to 11 then the %change-in-cost starts to decrease.
This is even more true if/when you load balance between zones and aren't just using them as hot backups. As another commenter pointed out, Netflix says they have three zones and only need two to operate.
Are they only in three zones, or three regions? Three zones would not have helped them in this particular scenario and they would have still been at risk.
And if they do mean three regions - can that cost of spanning various regions be quantified for different companies. The money spent vs money earned for Netflix may be very different compared to Quora and Reddit. At the same time, the data synchronization needs in between regions may also vastly differ for different type of companies and infrastructures thus leading to varying amount of cost to maintain a site on multiple regions.
Pure opinion: That convergence might show that Amazon tried to do a failover on a DC level. Once they figured that wouldn't work or east was down for the count they just let it cycle to the ground under latency.