The arrogant Netflix! They always brag about how technologically superior they are, and yet they can't handle a simple technological challenge! I didn't have a buffering issue, I had an error page, for hours! Yet they kept advertising the boxing match to me! What a joke! If you can't stream it, don't advertise it; at least then you'd save face with people like me who don't care about boxing!
Every organization makes mistakes, and every organization has outages. Netflix is no different. Instead of bashing them for being imperfect, you might want to ask what you can learn from this incident. What would you do if your service received more traffic than expected? How would you test your service so you can be confident it will stay up?
Also, I have never seen any Netflix employees who are arrogant or who think they are superior to other people. What I have seen is that Netflix's engineering organization frequently describes the technical challenges it faces and discusses how it solves them.
I think you’re oversimplifying it. Live event streaming is very different from movie streaming. All those edge cache servers become kinda useless and you start hitting peering bottlenecks.
Edge caches are not useless for live streaming; they're critical. The upstream from those caches has no way of handling each individual user. The stream needs to hit the edge cache once, and end users should be served from there.
A typical streaming architecture uses multi-tiered caches: source -> midtier -> edge.
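As a rough sketch of how pull-through tiers keep load off the source (toy code, not anything Netflix has described; all names here are made up): each tier answers from its local cache and only pulls from the tier above on a miss, so millions of viewers at the edge collapse into a handful of requests at the source.

```go
package main

import (
	"fmt"
	"sync"
)

// Fetcher is one tier: serve a segment from a local cache,
// or pull it from the tier above on a miss.
type Fetcher interface {
	Get(segment string) string
}

// Origin stands in for the source/packager. Every call here is
// expensive, so the tiers below should make these calls rare.
type Origin struct{}

func (Origin) Get(segment string) string {
	fmt.Println("origin fetch for", segment)
	return "bytes-of-" + segment
}

// CacheTier is a pull-through cache in front of an upstream tier.
type CacheTier struct {
	upstream Fetcher
	mu       sync.Mutex
	store    map[string]string
}

func NewCacheTier(upstream Fetcher) *CacheTier {
	return &CacheTier{upstream: upstream, store: map[string]string{}}
}

func (c *CacheTier) Get(segment string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.store[segment]; ok {
		return v // hit: the upstream never sees this request
	}
	v := c.upstream.Get(segment) // miss: pull through from above
	c.store[segment] = v
	return v
}

func main() {
	mid := NewCacheTier(Origin{})
	edge := NewCacheTier(mid)
	// Three viewers asking one edge for the same live segment
	// result in exactly one origin fetch.
	for i := 0; i < 3; i++ {
		edge.Get("seg-42.ts")
	}
}
```

The failure mode being discussed is what happens when this breaks: if the edge tier runs out of capacity, or stops caching, every request falls through to a source that was never sized for it.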
We don't know what happened, but it's possible they ran out of capacity at the edge (or anywhere else).
The BBC had a similar issue with a live stream 5 years ago, where events conspired and a CDN "failed open", which effectively DoSed the entire output across all CDNs:
> Even though widely used, this pattern has some significant drawbacks, the best illustration being the major incident that hit the BBC during the 2018 World Cup quarter-final. Our routing component experienced a temporary wobble which had a knock-on effect and caused the CDN to fail to pull one piece of media content from our packager on time. The CDN increased its request load as part of its retry strategy, making the problem worse, and eventually disabled its internal caches, meaning that instead of collapsing player requests, it started forwarding millions of them directly to our packager. It wasn’t designed to serve several terabits of video data every second and was completely overwhelmed. Although we used more than one CDN, they all connected to the same packager servers, which led to us also being unable to serve the other CDNs. A couple of minutes into extra time, all our streams went down, and angry football fans were cursing the BBC across the country.
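For anyone unfamiliar, the "collapsing player requests" bit is the request-coalescing pattern: concurrent cache misses for the same segment share a single in-flight upstream fetch. Here's a minimal sketch of the pattern using Go's golang.org/x/sync/singleflight package (purely illustrative, obviously not the CDN's actual internals):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchFromPackager simulates one slow, expensive upstream fetch.
func fetchFromPackager(segment string) (interface{}, error) {
	time.Sleep(100 * time.Millisecond)
	fmt.Println("packager fetch for", segment)
	return "bytes-of-" + segment, nil
}

func main() {
	var wg sync.WaitGroup
	// 1000 concurrent viewers miss the cache for the same segment.
	// singleflight collapses the concurrent misses into a single
	// upstream request. Disable this collapsing ("fail open") and
	// the packager sees all 1000 instead.
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			group.Do("seg-42.ts", func() (interface{}, error) {
				return fetchFromPackager("seg-42.ts")
			})
		}()
	}
	wg.Wait()
}
```

In the BBC incident the CDN stopped doing this, so millions of player requests went straight to a packager sized for a few coalesced streams.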
This feels like a bug in the implementation and not really a drawback of the pattern. "Routing component experienced a temporary wobble" also sounds like a bug of sorts.
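The retry amplification is a well-understood implementation hazard too: naive retries multiply load exactly when the upstream is slowest. The standard mitigation is capped exponential backoff with jitter, something like this sketch (numbers are made up, not anyone's production config):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// fetchWithBackoff retries a flaky upstream call with capped
// exponential backoff plus full jitter, so a fleet of clients
// backs off instead of hammering a struggling origin in lockstep.
func fetchWithBackoff(fetch func() error, maxAttempts int) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := fetch()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		// Full jitter: sleep a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt, err, sleep)
		time.Sleep(sleep)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return nil // unreachable: the loop always returns above
}

func main() {
	calls := 0
	err := fetchWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("packager overloaded")
		}
		return nil
	}, 5)
	fmt.Println("result:", err)
}
```

A retry policy that ramps the request rate up on failure, like the one in the postmortem, does the exact opposite.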
I worked in this space. All these potential failure modes and how they're mitigated are things we paid a fair amount of attention to.