
Netflix suffers first massive global outage - john58
https://news.alphastreet.com/netflix-suffers-first-massive-global-outage/
======
jack6e
At PyCon in Cleveland this year Amjith Ramanujam did a presentation on "how
Netflix does failovers in 7 minutes flat" [0]. Worth a watch/listen for anyone
interested in what their response may have looked like. Now I'm curious to
read a post-mortem from them and see whether their procedures worked as
expected (it sounds like they were down longer than 7 minutes) or where they
encountered unexpected issues.

[0]
[https://www.youtube.com/watch?v=iQI56-up3Yk](https://www.youtube.com/watch?v=iQI56-up3Yk)

~~~
pofilat
Is 7 minutes good? If you have multimaster redundant systems, you don't need
failover.

~~~
toredash
You write that as if to indicate it is easy to do. It's more than "a multimaster
db" to Netflix, it's CDNs, authentication, logging, security...

------
paulgpetty
It might not seem like this is a big deal. For now that’s probably right; but
Netflix has aspirations to do nightly news and if that’s intended to be live
or near real-time we should be comparing their uptime with broadcast TV. When
was the last time ABC or Fox went off air?

Again, not right now, but at some point this sort of outage becomes critical
and will probably deserve more scrutiny.

I’m impressed by how reliable Netflix is, hopefully this just pushes them
harder.

And maybe even prepares them for a less neutral internet.

~~~
willyt
Doesn't seem to be all that reliable or is it just me? I have it on a fire
stick and it regularly stalls at 20%, the only reliable way to get it going
again is to reboot the whole stick. About once every two weeks the cache gets
corrupted and I have to clear it and login again. Both a total PITA when you
are just trying to squeeze an episode of something in between kids bedtime and
grownup bedtime...

~~~
rootusrootus
That sounds like the difference between Netflix-the-service and Netflix-the-
app. Maybe ditch the Fire stick and get something more reliable?

------
dnate
I hope they release a post mortem about this. Would be very interesting.

~~~
weliketocode
I was just about to reply with the exact same message.

Netflix has spent A LOT of time and money on their infrastructure, expounding
their engineering views, proclaiming best practices, and how they use
redundancy/sharding/insert best practices and major buzzwords in order to
combat the chance of this happening.

I'm a huge fan of Netflix and their approaches to infrastructure, and I'm
almost certain that they'll have some very interesting conclusions from this.

~~~
awalton
Can't test for _everything_, but seeing this is the most significant outage
in ages, I'd say they're doing a damned good job.

I mean, after all, the internet's true natural predator still lurks, its
yellow CAT steel present across all of the nation...

I speak of course, of Backhoes.

(1): [https://www.wired.com/2006/01/the-backhoe-a-real-cyberthreat/](https://www.wired.com/2006/01/the-backhoe-a-real-cyberthreat/)
(2): [https://it.slashdot.org/story/06/01/19/1643215/the-backhoe-the-internets-natural-enemy](https://it.slashdot.org/story/06/01/19/1643215/the-backhoe-the-internets-natural-enemy)
(3): [https://www.urbandictionary.com/define.php?term=Fiber%20Seeking%20Backhoe](https://www.urbandictionary.com/define.php?term=Fiber%20Seeking%20Backhoe)

~~~
baud147258
I heard a story from the construction company (not in the US) where my brother
is working: on a construction site, they had to dig in an area with fiber. So
they called the telco, asking them to disconnect the cables so they could start
digging. But the telco was late and a few days later, fed up with the delay (and
associated costs), the site manager said "start digging, we'll deal with the
fallout". So the backhoe started digging and cut the cables. The telco was
quick to send a guy after that...

------
gwbas1c
Ironically, the article isn't loading for me.

~~~
IMTDb
Same here. Netflix on the other hand is perfectly fine.

~~~
Operyl
The outage happened last night.

------
danschumann
I thought they have a button in their office which shuts down a server, as a
means of testing. Maybe someone brought in their 3 year old and he just kept
hitting it.

~~~
wyldfire
They have a "Chaos Monkey" [1] feature that is intended to bring down
individual nodes. "Exposing engineers to failures more frequently incentivizes
them to build resilient services."

If Chaos Monkey had been responsible for setting off a global outage, I could
imagine business leaders getting cold feet about using a tool like this. In
traditional companies, anyways, they'd never have seen the benefit of it and
after only hearing the costs, they'd probably be livid that a widespread
outage had been caused by something like this.

[1]
[https://github.com/Netflix/chaosmonkey](https://github.com/Netflix/chaosmonkey)
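
To make the idea concrete, here is a toy sketch of "kill one node at random and check the service still has enough healthy replicas". It is not the Chaos Monkey API or its actual behavior; the node names, `pick_victim`, `service_survives`, and the `min_healthy` threshold are all made up for illustration.

```python
import random
from typing import Dict, List


def pick_victim(instances: List[str], rng: random.Random) -> str:
    """Choose one instance at random to terminate (hypothetical helper)."""
    return rng.choice(instances)


def service_survives(replicas: Dict[str, bool], min_healthy: int) -> bool:
    """The service counts as up while enough replicas remain healthy."""
    return sum(replicas.values()) >= min_healthy


# Three made-up nodes, all healthy at the start.
rng = random.Random(0)  # seeded only so the run is repeatable
replicas = {"node-a": True, "node-b": True, "node-c": True}

# Take down a single node, the way a fault-injection tool might.
victim = pick_victim(sorted(replicas), rng)
replicas[victim] = False

# With one node gone, two of three are left, so the threshold of 2 still holds.
still_up = service_survives(replicas, min_healthy=2)
```

Losing individual nodes is the case this kind of testing covers; it says nothing about what a global outage would look like.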

~~~
some_account
It's possible that executives would think that way, but only because they are
not too technical. Every company should have a chaos monkey to make sure
stability is always at the top of the list.

~~~
noir_lord
> Every company should have a chaos monkey.

Big companies do, though they call them interns.

------
LogicX
Google cache:
[https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fnews.alphastreet.com%2Fnetflix-suffers-first-massive-global-outage%2F](https://webcache.googleusercontent.com/search?q=cache:https%3A%2F%2Fnews.alphastreet.com%2Fnetflix-suffers-first-massive-global-outage%2F)

------
pofilat
Instagram, GitLab, Netflix all down at the same time. Coincidence?

~~~
breakingcups
Add StackOverflow to that list.

------
Retroity
Here's an alternative article from CNBC since the linked article is having
trouble loading:
[https://www.cnbc.com/2018/06/12/netflix-down-streaming-service-says-it-fixed-problem-that-caused-outage.html](https://www.cnbc.com/2018/06/12/netflix-down-streaming-service-says-it-fixed-problem-that-caused-outage.html)

------
lyjackal
[https://webcache.googleusercontent.com/search?q=cache:mKL4IqFNOO8J:https://news.alphastreet.com/netflix-suffers-first-massive-global-outage/](https://webcache.googleusercontent.com/search?q=cache:mKL4IqFNOO8J:https://news.alphastreet.com/netflix-suffers-first-massive-global-outage/)

------
a012
HN hugs of death? It seems like the article is also down.

------
filereaper
I'm looking forward to the post-mortem of this (hopefully by someone like Dave
Hahn)

Hopefully it wasn't a Hurricane's Butterfly kind of deal...

~~~
pc86
What do you mean by Hurricane's Butterfly? I'm familiar with the "if a
butterfly flaps its wings" thing but not sure what you're referring to in this
context is all.

~~~
filereaper
Apologies, was referring to bcantrill's tech talk. [1]

Similar idea, you're in a hurricane but it may have been caused by an
insignificant butterfly far away.

[1] [https://www.janestreet.com/tech-talks/hurricanes-butterfly/](https://www.janestreet.com/tech-talks/hurricanes-butterfly/)

------
poundtown
cool to see people totally having to face their addiction. my kids were
flopping around on the floor like fish out of water.

------
john58
Here's the link to AMP page of the article:
[https://news.alphastreet.com/netflix-suffers-first-massive-global-outage/amp/](https://news.alphastreet.com/netflix-suffers-first-massive-global-outage/amp/)

------
arjunvijay
I wonder why they have not given an official statement on why it happened.
Probably just a one-off case.

------
mikece
Did someone at NFLX point Chaos Monkey at something they shouldn't have or was
this an "honest failure?"

------
nmg
Net Neutrality's been dead for less than 24 hours. Tin foil hats aside, is it
realistic to consider a relationship between these two events?

~~~
nvr219
Not realistic at all.

~~~
pritambaral
I think you should expand on why you think so. Here are my reasons:

1\. This is a global outage. Net Neutrality is down in US only.

2\. The pattern of failure is not consistent with American ISPs' past attempts.
They've wanted to make Netflix less attractive compared to their services, not
outright broken. Subtlety is better because outright breakage brings too much
attention.

------
thermodynthrway
I'm not surprised.

I've tried to get Netflix's OSS tools like Spinnaker running and it's a total
nightmare with how many interdependent services need to run. It took me days
to get running and was never reliable. I think they drank a little too much
koolaid.

Microservices + async DB updates = hell. After working on such projects, I
respect the wisdom of Google making all of their data stores immediately
consistent.

A monolith isn't bad at all if it's organized and built to scale horizontally.
Move your consistency concerns back to the database where they belong, not
between services. It's a joke that monoliths are bad when you look at every
popular operating system kernel. These things are multi-million-line binary
blobs in languages that aren't friendly to mistakes and they run EVERYTHING.

~~~
TotempaaltJ
Seeing as this is their first major outage, and Netflix-the-streaming-service
has existed since 2007, I'd say you don't know what you're talking about.

~~~
thermodynthrway
Assuming their internal projects are built similarly to their OSS, I ask you to
try playing with them yourself and see if you come to the same opinion. There
are better ways to critique an arch than just measuring outages.

Netflix has built an amazing system for messing with their servers to try to
find failure points. And they have a bunch of cleanup jobs that run to fix
consistency errors. This is just what I picked up from a past obsession with
their architecture.

Then I worked on a few projects built that way, and realized it was a horrible
nightmare. By having a global immediately consistent datastore you can push
all these concerns back to the database, and your codebase ends up far
smaller.

The failure modes tend to explode in complexity when you're doing RPC across
services with different datastores because you have to deal with distributed
transactions yourself. Every single call needs to be able to unwind itself
across all services.

If you have a call that fans out to 10 other service calls, you need to be
able to unwind any of them in any order. It quickly becomes untenable.

