Netflix suffers first massive global outage (alphastreet.com)
171 points by john58 4 months ago | 100 comments

At PyCon in Cleveland this year Amjith Ramanujam did a presentation on "how Netflix does failovers in 7 minutes flat" [0]. Worth a watch/listen for anyone interested in what their response may have looked like. Now I'm curious to read a post-mortem from them and see whether their procedures worked as expected (it sounds like they were down longer than 7 minutes) or where they encountered unexpected issues.

[0] https://www.youtube.com/watch?v=iQI56-up3Yk

Is 7 minutes good? If you have a multimaster redundant system, you don't need failover.

You write that as if to indicate it is easy to do. It's more than "a multimaster db" to Netflix, it's CDNs, authentication, logging, security...

It might not seem like this is a big deal. For now that’s probably right; but Netflix has aspirations to do nightly news and if that’s intended to be live or near real-time we should be comparing their uptime with broadcast TV. When was the last time ABC or Fox went off air?

Again, not right now, but at some point this sort of outage becomes critical and will probably deserve more scrutiny.

I’m impressed by how reliable Netflix is, hopefully this just pushes them harder.

And maybe even prepares them for a less neutral internet.

Depends on what you're measuring.

Local stations regularly go off air. Sure, some stations outside the range of your antenna might work but effectively they're just down. For national events they will fail over to the local stations and keep broadcasting, but if the content is not there they are just down.

That's like blaming Netflix when your local ISP is down.

Do you have a citation for the nightly news aspirations? Whenever I read about Netflix not wanting to purchase the rights to soccer, it's nearly always with the reason being that Netflix has little to no interest in live streaming:


_"We have so much we want to do in our area, so we're not trying to copy others, whether that's linear cable, there's lots of things we don't do. We don't do [live] news, we don't do [live] sports. But what we do do, we try to do really well."_ - Reed Hastings.

They already have shows that run a bit like that. I believe Michelle Wolf's show is one of them.

The Break is pre-recorded and released weekly with a quick turnaround time (hours maybe), but it's not at all live.

You're right. This might sound weird but it's been so long since I've watched nightly news that I actually forgot that it's live.

joel mchale's show was weekly too. i'm no television expert so i don't know how much more it would take to get a daily show rolling.

How is nightly news critical? Worst case, watch it 20 minutes later instead?

Or you've already read the details of every story on the net hours earlier.

It's not mission critical except for stickiness/adherence.

Arguably it could be used as a guaranteed self-marketing platform (in the intimate/most core sense - you can't watch if you don't have Netflix so people will just think of it as NetflixNews).

Believe it or not, a lot of people still use local tv/news for entertainment or information (of local news).

(Near) realtime would be important for emergencies, as rare as they are.

I'm not questioning that and it doesn't imply that an outage would be a critical event.

Yes it would be embarrassing and it would be an inconvenience, but nothing worse than what happened today.

My dad has directv and they regularly have issues. He wasn't able to get NBA games for weeks, due to some software glitch (the entire channel was unavailable to him, for unknown reasons).

Directv is a distributor, not a producer, so the stakes aren't as high.

That’s a distinction the average person wouldn’t make.

But it has the potential to affect more people. Cable providers can be regional, tv productions national/international.

> When was the last time ABC or Fox went off air?

Broadcast TV and Netflix have very different payment models though. Broadcasters get their money from adverts, so even a couple of minutes of outage could cost them tens of thousands in terms of compensating sponsors for lost ads (depending on when the outage is, as not all ad slots are worth the same). Whereas Netflix is a subscription model where an outage is an inconvenience to their customers but they're not directly losing money (aside from engineers' overtime fees etc). Obviously Netflix could potentially lose customers but that's not going to be at a high rate from a single outage like this.

This distinction can change the dynamic of how you build broadcasting infrastructure. E.g. redundant equipment is typical in any high availability deployment, but broadcasting will not only have redundant physical equipment in different physical locations, but will often also have a second set of redundant hardware in each location purchased from different suppliers and running different software, just in case it is a software / hardware malfunction specific to the product. Whereas in internet streaming services the emphasis is more on standardising software stacks to aid scaling - which makes total sense in terms of cloud services but that does still give you a potential point of failure (eg poorly tested Puppet or Terraform code getting deployed to prod).

Local US stations might not have the same level of redundancy as their national counterparts, but just as the distinction between Netflix and traditional broadcasting is a financial one, equally the ad revenue for local vs national broadcasters would be massively different. Ultimately the more costly it is for your service to be off the air, the more you'd expect to invest into your infrastructure to ensure you don't have any such outages.

> Netflix has aspirations to do nightly news

I hope not, that has echoes of Yahoo. It's bad enough their streaming library is so thin, I'd rather see more content there than create their own new line of content.

> I'd rather see more content there than create their own new line of content.

Their content strategy (and Amazon's) seems to be evolving toward developing their own content library, much like HBO. In theory, it frees Netflix of over-dependence on networks and studios. In practice, we get crappy shows and standup specials.

I disagree wholeheartedly. Sure, there will be low quality shows. This is expected given the sheer amount of content they are currently producing. But there are equal amounts of good to very good shows also getting released at a pace that a normal working person just can’t keep up with.

My watch-list just keeps growing. I don’t know if I’ll ever be able to catch up.

Licensing content is their biggest cost so yes, that’s their (probably correct) strategy. But instead of ‘crappy’ I believe the right term is ‘long tail’. Make something for everyone, since relative production costs are minimal vs buying that content and distribution costs are almost zero for low-viewing content.

I think it’s probably to win over the last bit of TV people who claim “but where will I get my news if I cut the cable?”

I think the last TV people are the ones who claim, "Cable is my only option for internet, and the way it is bundled, I might as well get cable TV too."

Doesn't seem to be all that reliable or is it just me? I have it on a fire stick and it regularly stalls at 20%, the only reliable way to get it going again is to reboot the whole stick. About once every two weeks the cache gets corrupted and I have to clear it and login again. Both a total PITA when you are just trying to squeeze an episode of something in between kids bedtime and grownup bedtime...

That sounds like the difference between Netflix-the-service and Netflix-the-app. Maybe ditch the Fire stick and get something more reliable?

That's just the Fire Stick, or the Netflix Fire Stick app, or your network. That isn't Netflix's infrastructure problem. Netflix is the most reliable of Netflix/Hulu/HBO.

Try a higher-end streaming device that's wired to your router.

I find the Fire Stick generally pretty awful. No problems at all since I switched to Nvidia Shield.

Just you. In my family we have it on 3 TVs as a built-in application (2 of the same, and 1 diff) and on the iPads and pretty much never have any issues.

Uptime on broadcast is different because it's not on demand. The reason people use Netflix is because on demand is better than broadcast, because broadcast has implicit uptime issues. Any time that I cannot watch something on broadcast television is downtime; not only that, but it is unrecoverable downtime. There is no comparison to a service whose downtime functionally compares to the latter's "commercial break."

I hope they release a post mortem about this. Would be very interesting.

I was just about to reply with the exact same message.

Netflix has spent A LOT of time and money on their infrastructure, expounding their engineering views, proclaiming best practices, and how they use redundancy/sharding/insert best practices and major buzzwords in order to combat the chance of this happening.

I'm a huge fan of Netflix and their approaches to infrastructure, and I'm almost certain that they'll have some very interesting conclusions from this.

Can't test for everything, but seeing this is the most significant outage in ages, I'd say they're doing a damned good job.

I mean, after all, the internet's true natural predator still lurks, its yellow CAT steel present across all of the nation...

I speak of course, of Backhoes.

(1): https://www.wired.com/2006/01/the-backhoe-a-real-cyberthreat... (2): https://it.slashdot.org/story/06/01/19/1643215/the-backhoe-t... (3): https://www.urbandictionary.com/define.php?term=Fiber%20Seek...

I heard a story from the construction company (not in the US) where my brother is working: on a construction site, they had to dig in an area with fiber. So they called the telco, asking them to disconnect the cables so they could start digging. But the telco was late and a few days later, fed up by the delay (and associated costs), the site manager said "start digging, we'll deal with the fallout". So the backhoe started digging and cut the cables. The telco were quick to send a guy after that...

I knew a guy who was the head of networking for a university. His biggest rant was that the fine structure for cutting a line made it cheaper to cross their fingers than to wait for someone to mark the exact location of the cables so they could dig.

Also known as backhoe fade.

Nothing is bullet proof. Even resilient systems like Amazon’s S3 are subject to human error, it’s not like Netflix being down is the end of the world.

true. netflix is probably the world’s least critical system at global scale.

consider the increase in child birth rate by this outage. lives will be forever changed :-)

Rjevski 4 months ago [flagged]

Ad networks would be the least critical. Netflix at least provides content some people actually want. I have yet to hear someone say "I want to see ads".

Certainly not, considering how heavily businesses rely on ad networks as a source of revenue (both by selling ad space but even more so by buying ad space and advertising their products)

Do their employees and CEOs actually say "I want to see ads"? Nope.

I fail to see how any employees/CEOs personal ad viewing preference would be relevant here.

Employees, CEOs and businesses in general like generating cash and/or getting paid. If an ad network being down hinders their ability to do so, then that seems pretty critical to me.

Do they want to stay in business and keep their jobs? Yes.

Many people won't be happy if their salary is delayed because an ad network was down and the website didn't generate enough income to pay expenses.

Rjevski 4 months ago [flagged]

They are welcome to jump ship to companies that sell an actual product and don't profit off wasting everyone else's time with their ads.

Sorry but I really have no sympathy for people that happily waste my time by showing me irrelevant content, trying to sell me garbage or flood my mailbox at every opportunity (the only reason my mailbox is clean is because I am allergic to any web form that asks for my email).

Just curious, do you use any google products?

I only use Search, and even then I'm trying to cut down as much as possible by using DuckDuckGo.

I don't have anything against Google per-se, by the way. I'd be happy to pay for their services should they allow me to do so.

But you at least recognize that ad supported businesses have solved some fairly interesting technical challenges like:

- email spam
- search
- fast dns
- secure browsing
- giant linux forks

And that we would be unlikely to have such things in a pay for everything model?

> Ad networks would be the least critical. Netflix at least provides content some people actually want. I have yet to hear someone say "I want to see ads".

That's actually a commonly-heard phrase in the US around Superbowl time.

Well, when the alternative is watching American football...

> watching American football

Are you referring to the 17 minutes per hour of actual gameplay, the 21 minutes of rambling commentary/sports celebrity gossip, or the 22 minutes of ad spots?

After recent years of disappointment, I no longer watch Superbowl ads.

I think there was a massive marketing campaign to get people to pay attention to superbowl ads claiming they are funny.

They've just turned into ads in recent years.

And if the curiosity gets the better of you, numerous people upload them all to YouTube, often bundled together in one long clip (unfortunately too often with meager attempts at witty commentary).

I thought the amazon alexa one this year was really funny: https://www.youtube.com/watch?v=J6-8DQALGt4

True but you know people will act like it is.

The worst outage my company ever had was due to bad cost cutting decisions.

My company ended up having an equipment failure that took out both production servers and our internal support system at the same time. Lots of pale faces that day. What has two thumbs and bureaucratic bullshit that put runbooks and deployment tools on production hardware? This guy.

My bet is whatever caused this problem, someone got a raise last year for doing it.

You make it sound as if good engineering practices somehow should guarantee no outages ever. Best practices are risk minimization, not risk elimination.

Ironically, the article isn't loading for me.

Same here. Netflix on the other hand is perfectly fine.

The outage happened last night.

I thought they have a button in their office which shuts down a server, as a means of testing. Maybe someone brought in their 3 year old and he just kept hitting it.

They have a "Chaos Monkey" [1] feature that is intended to bring down individual nodes. "Exposing engineers to failures more frequently incentivizes them to build resilient services."

If Chaos Monkey had been responsible for setting off a global outage, I could imagine business leaders getting cold feet about using a tool like this. In traditional companies, anyways, they'd never have seen the benefit of it and after only hearing the costs, they'd probably be livid that a widespread outage had been caused by something like this.

[1] https://github.com/Netflix/chaosmonkey
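To make the idea concrete, here is a toy simulation of the principle described above: kill a random node and check that the service still answers. This is not the real Chaos Monkey and does not use its API; `Instance`, `Pool`, and `unleash_monkey` are hypothetical names invented for this sketch, and the three-node pool is an assumption.

```python
import random


class Instance:
    """Hypothetical service node that can be alive or dead."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """Hypothetical load-balanced group of redundant instances."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        # Try each instance in turn; succeed as long as one is alive.
        for inst in self.instances:
            try:
                return inst.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("total outage: no healthy instance")


def unleash_monkey(pool, rng):
    """Kill one randomly chosen live instance; return it, or None if none are left."""
    alive = [inst for inst in pool.instances if inst.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim


rng = random.Random(0)  # seeded so the run is repeatable
pool = Pool(Instance(f"node-{n}") for n in range(3))
victim = unleash_monkey(pool, rng)
response = pool.handle("GET /title/1")  # still served by a surviving node
```

With redundancy in place, losing one node is invisible to the caller. Only when every node is gone does `pool.handle` raise, which is the failure mode the practice is meant to surface early.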

They've actually upgraded to the "Simian Army" now. [0] Which among others includes the "Chaos Gorilla":

> Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone.


Isn't there like a Kong for an entire region as well?

> they'd probably be livid that a widespread outage had been caused by something like this

Widespread outage is exactly what something like chaos monkey is intended to help prevent. Even if this were chaos monkey induced (seems very unlikely), they've gained far more stability by having it than they would have lost in a single outage

That's extremely speculative and impossible to quantify. There are plenty of counter examples where companies have impressive uptimes and don't resort to such extreme testing. Then again, attributing the outage to these chaos services is also extremely speculative at this point.

This seems like basically the same principle as dropout from neural networks.

One could make an Arrested Development episode where George Sr. and Lucille hear about this concept and decide to apply it to their children.

(Edit: the Arrested Development <-> Netflix tie in didn't even occur to me until after writing the comment.)

It's possible that executives would think that way but only because they are not too technical. Every company should have a chaos monkey to make sure stability is at the top of the list always.

> Every company should have a chaos monkey.

Big companies do, though they call them interns.

Every large company building mission critical systems, maybe.

Resilient systems take longer to build, and thus are more costly. It is not always wise to spend that much resources on this for, say, a startup.

After a startup is crushing it, then maybe they can start rewriting any brittle monolithic systems into something more resilient.

Instagram, GitLab, Netflix all down at the same time. Coincidence?

Add StackOverflow to that list.

Here's an alternative article from CNBC since the linked article is having trouble loading: https://www.cnbc.com/2018/06/12/netflix-down-streaming-servi...

HN hugs of death? It seems like the article is also down.

I'm looking forward to the post-mortem of this (hopefully by someone like Dave Hahn)

Hopefully it wasn't a Hurricane's Butterfly kind of deal...

What do you mean by Hurricane's Butterfly? I'm familiar with the "if a butterfly flaps its wings" thing but not sure what you're referring to in this context is all.

Apologies, was referring to bcantrill's tech talk. [1]

Similar idea, you're in a hurricane but it may have been caused by an insignificant butterfly far away.

[1] https://www.janestreet.com/tech-talks/hurricanes-butterfly/

cool to see people totally having to face their addiction. my kids were flopping around on the floor like fish out of water.

Here's the link to AMP page of the article: https://news.alphastreet.com/netflix-suffers-first-massive-g...

I wonder why they have not given an official statement on why it happened. Probably just a one-off case.

Did someone at NFLX point Chaos Monkey at something they shouldn't have or was this an "honest failure?"

Net Neutrality's been dead for less than 24 hours. Tin foil hats aside, is it realistic to consider a relationship between these two events?

Not realistic at all.

I think you should expand on why you think so. Here are my reasons:

1. This is a global outage. Net Neutrality is down in US only.

2. The pattern of failure is not consistent with American ISPs past attempts. They've wanted to make Netflix less attractive compared to their services, not outright broken. Subtlety is better because outright breakage brings too much attention.


I'm not surprised.

I've tried to get Netflix's OSS tools like Spinnaker running and it's a total nightmare with how many interdependent services need to run. It took me days to get running and was never reliable. I think they drank a little too much koolaid.

Microservices + async DB updates = hell. After working on such projects, I respect the wisdom of Google making all of their data stores immediately consistent.

A monolith isn't bad at all if it's organized and built to scale horizontally. Move your consistency concerns back to the database where they belong, not between services. It's a joke that monoliths are bad when you look at every popular operating system kernel. These things are multi million line binary blobs in languages that aren't friendly to mistakes and they run EVERYTHING

> I've tried to get Netflix's OSS tools like Spinnaker running and it's a total nightmare

You may want to check out Armory Spinnaker[0]. It's a commercial version of Spinnaker that takes care of all the hard bits of setting it up.

[0] https://www.armory.io

Disclaimer: I'm an investor in Armory

Hey @thermodynthrway -- CEO of Armory here. You can give our enterprise distribution a try at http://go.Armory.io/install

We'd love to hear how it goes for you!

Bit drastic considering this was their first major global outage...

Seeing as this is their first major outage, and Netflix-the-streaming-service has existed since 2007, I'd say you don't know what you're talking about.

Assuming their internal projects are built similar to OSS, I ask you to try playing with them yourself and see if you come to the same opinion. There's better ways to critique an arch than just measuring outages.

Netflix has built an amazing system for messing with their servers to try to find failure points. And they have a bunch of cleanup jobs that run to fix consistency errors. This is just what I picked up from a past obsession with their architecture.

Then I worked on a few projects built that way, and realized it was a horrible nightmare. By having a global immediately consistent datastore you can push all these concerns back to the database, and your codebase ends up far smaller.

The failure modes tend to explode in complexity when you're doing rpc across services with different datastores because you have to deal with distributed transactions yourself. Every single call needs to be able to unwind itself across all services.

If you have a call that fans out to 10 other service calls, you need to be able to unwind any of them in any order. It quickly becomes untenable
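For readers who haven't hit this, a minimal sketch of the unwinding burden, assuming the simplest case: steps run one after another and each has a known undo. The names `run_with_compensation`, `StepFailed`, `ok`, and `failing` are hypothetical, as are the three service names. A real fan-out runs calls in parallel, and an undo can itself fail, which is where it gets much worse than this.

```python
class StepFailed(Exception):
    """Raised by a step that could not complete (hypothetical)."""


def run_with_compensation(steps):
    """Run (action, undo) pairs in order; on failure, undo completed steps in reverse."""
    done = []
    try:
        for action, undo in steps:
            action()
            done.append(undo)  # only steps that succeeded need unwinding
    except StepFailed:
        for undo in reversed(done):
            undo()
        return False
    return True


log = []


def ok(name):
    # A step that succeeds and records what it did.
    return (lambda: log.append(f"do {name}"), lambda: log.append(f"undo {name}"))


def failing(name):
    # A step that always fails before doing anything.
    def action():
        raise StepFailed(name)
    return (action, lambda: log.append(f"undo {name}"))


succeeded = run_with_compensation(
    [ok("billing"), ok("inventory"), failing("shipping")]
)
```

Here `log` ends up recording billing and inventory being done, then undone in reverse order. With a single immediately consistent datastore, all of this bookkeeping collapses into one transaction rollback.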

For the record, there was a global Netflix outage Christmas 2012. It's what motivated us to move to multiple Amazon regions.
