
Lessons Netflix Learned from the AWS Storm - justinsb
http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
======
paulsutter
It's not clear they learned the simplest and most important fact: You have to
be able to migrate your production traffic away from whole regions.

~~~
jedberg
Paul, I'm sure your skills are top notch, but even you have to admit that the
problem you solved -- a globally available write-only system -- is a totally
different problem than the one Netflix is solving, which is a read/write
workload.

Also, as someone who was a user of your globally available service, I can tell
you that while it might have been UP all the time, it certainly had no problem
losing data all the time either. Some months there was simply no data for
reddit at all, even though we were sending the service more than a billion
data points.

So we could sit here and sling insults all day, or you can operate under the
assumption I do -- that each of us is a competent engineer who works for a
business that has to make decisions and tradeoffs between cost and
reliability.

~~~
paulsutter
We're responding to a post, by Netflix, explaining their downtime. That post
is missing the single most important fact: that they need to be able to fail
over across regions. The rest of their explanation is just second-order
noise.

Anyone who reads HN can see that the minimum uptime strategy on Amazon is to
fail over across regions. Each time there is a major AWS outage, we hear
about HN readers whose services were affected even though they spanned
availability zones within a single region. But to date, Amazon's regions have
operated independently.
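
To make the failover idea concrete, here's a minimal sketch of
region-priority routing: check a health endpoint per region and send traffic
to the first region that answers. The endpoint URLs and region names are
hypothetical, and a real deployment would more likely do this at the DNS
layer than in application code:

    import urllib.request

    # Hypothetical per-region health endpoints, in priority order.
    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com/health",
        "https://api.us-west-2.example.com/health",
    ]

    def region_is_healthy(url, timeout=2):
        # A region counts as healthy if its health endpoint answers 200
        # within the timeout; any network error counts as unhealthy.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def active_region():
        # Send traffic to the highest-priority healthy region, so a
        # whole-region outage drains traffic instead of taking you down.
        for url in REGION_ENDPOINTS:
            if region_is_healthy(url):
                return url
        raise RuntimeError("no healthy region available")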

That observation does not depend on knowledge of Quantcast (which is,
incidentally, far more than a write-only system), or of the other production
systems I've built in the last 35 years.

(I'll follow up by email about your support questions)

~~~
paulsutter
A little transparency can make life easier. Try this:

"Don't panic. You are using a backup datacenter. Some very recent queue or
account changes may be missing, and some changes you make tonight may be lost.
We are working nonstop to resolve this and appreciate your patience"

When stuck, just change the requirements.

------
eragnew
Dear Netflix, thank you for being transparent and honest about what happened.

------
soup10
Anyone find it really weird that Netflix doesn't run its own datacenters?

If the Netflix business fails, they would have giant, valuable datacenters
left over. Instead, by relying on the cloud, they are "all-in" on serving
movies. The movie and TV studios have giant leverage here: they can easily
make or back a competing service, and users will go where the content is. Is
their strategy for being on the cloud really just "it's easier than doing it
ourselves"?

~~~
whichdan
Maybe not necessarily easier, but it isn't part of their focus. Besides having
collateral (the datacenters), what else do they gain?

It's not like Amazon where they're providing infrastructure to other
companies. I remember someone else on HN pointing out that there aren't many
non-adult video providers the size of Netflix/YouTube/etc that aren't already
rolling their own solutions or served by companies like Brightcove.

~~~
ams6110
By not having datacenters, they don't have capital tied up in buildings, real
estate, or staff and benefits at those data centers; they can be much smaller
personnel-wise, with more of the staff focused on stuff that matters to
customers. Customers don't care where the data center is or who is running
it, as long as their movies come on when they want to watch.

------
edouard1234567
RESILIENCY = REDUNDANCY + INSULATION. Great post. I look forward to hearing
what Heroku is cooking; I hear they are working hard on better handling
similar incidents. Redundancy without insulation is what happened to the
Titanic, and it seems to be the most common mistake when architecting HA
systems.
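
A back-of-the-envelope model makes the slogan concrete: redundancy only
multiplies out the independent failures, while any shared (uninsulated)
failure mode caps availability no matter how many replicas you add. The
probabilities below are made up purely for illustration:

    def availability(n_replicas, p_fail_each, p_fail_shared):
        # System is up when the shared failure mode (one region, one
        # control plane, one hull) doesn't fire AND at least one replica
        # survives its independent failure mode.
        all_replicas_down = p_fail_each ** n_replicas
        return (1 - p_fail_shared) * (1 - all_replicas_down)

    # Insulated: shared risk is tiny, so redundancy pays off.
    print(availability(3, 0.01, 0.001))  # ~0.999
    # Titanic-style: redundant compartments, one shared hull breach.
    print(availability(3, 0.01, 0.05))   # ~0.950 -- shared risk dominates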

------
ctulek
"The service that keeps track of the state of the world has a fail-safe mode
where it will not remove unhealthy instances in the event that a significant
portion appears to fail simultaneously."

You should keep your logic dumb.
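
For reference, the fail-safe they describe can be stated in a few lines. This
is a sketch of the general pattern, not Netflix's code; the 20% threshold and
the names are assumptions:

    def instances_to_evict(instances, is_healthy, max_unhealthy_fraction=0.2):
        # Fail-safe: if a large fraction of instances looks unhealthy at
        # once, it's more likely a network partition (or a monitoring bug)
        # than mass hardware death, so evict nothing and wait.
        unhealthy = [i for i in instances if not is_healthy(i)]
        if len(unhealthy) > max_unhealthy_fraction * len(instances):
            return []
        return unhealthy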

~~~
Domenic_S
That was my thought as well when reading that sentence (actually, I was
thinking "was this overengineered for no good reason?"). However, they go on
to say that there _is_ a purpose -- mitigating "network partition events",
which I can only guess refers to AWS's version of netsplits.

It sounds like there was some technical debt to that implementation, but hey,
I for one am glad they gave us some insight into what happened.

~~~
adrianco
"Technical debt" is a nice way of saying it had bugs. It was mostly a
configuration problem, if it had been setup better we would have had no outage
or a much shorter one. The work to test all our zone level resilience (Chaos
Gorilla) was underway but hadn't got far enough to uncover this bug.

