
How Netflix evacuates regions in less than 10 minutes - aaronblohowiak
https://medium.com/netflix-techblog/project-nimble-region-evacuation-reimagined-d0d0568254d4
======
Nexxxeh
Excuse me for potentially asking a question that's daft in the context of TFA,
but relevant purely to the headline...

The disaster warnings that interrupt TV broadcasts, like tsunami evacuation
orders, does Netflix display them?

I've not seen anything about it, and my Google-fu is struggling because they
have a show called The Warning, and a lot of disaster movies.

Many of my friends don't even have a TV for anything other than consoles. Only
one of my friends doesn't have Netflix.

If the Netflix app displayed "TSUNAMI WARNING - If you are in _insert place
name_ , head to high ground immediately", they're far more likely to see it.

~~~
toomuchtodo
If you use Netflix through something like Xfinity's over the top system, it
should go to a live EAS alert. If you're using it through a browser, it will
not. If you feel this deserves attention, I'd suggest contacting your
legislator to introduce legislation to require it (as Netflix and other
streamers aren't going to implement it without regulatory requirements).

You'd probably want said legislation to cover any over the top device
(Chromecast, Fire stick, Roku).

~~~
dragonwriter
> If you use Netflix through something like Xfinity's over the top system, it
> should go to a live EAS alert. If you're using it through a browser, it will
> not

If you are using it through a smartphone, it won't, but the phone will, so
there's that.

------
mrbrowning
> We also considered simply abandoning autoscaling altogether and pinning to a
> calculated value, but this would hide performance regressions in the code by
> absorbing them into a potentially enormous buffer intended for regional
> evacuation absorption.

I'm probably missing something here, but why wouldn't performance regressions
still be detectable via utilization metrics? I understand the difficulty of
determining resource allocation a priori, but I'm not sure how this relates.

~~~
aaronblohowiak
What you propose would catch many changes by looking at things like CPU and
latency, but the space of potential hidden resource constraints is vast enough
that without empirical verification, we cannot have high confidence that
performance regressions have not occurred (e.g., lock contention can be no
problem at all until it is a huge problem...)

~~~
mrbrowning
That’s a good point. Out of curiosity, how do those constraints get surfaced
with hosts running in ASGs?

~~~
aaronblohowiak
As the new version gets deployed and starts to take more traffic, it doesn't
keep up and is scaled up accordingly, which establishes a new performance
curve for the failover prediction system.
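
To make that concrete, here is a rough sketch (not our actual code) of how
observations from an autoscaled group could be turned into a per-instance
throughput estimate and then into a failover instance count. The `Sample`
type, the function names, and all the numbers are made up for illustration,
and it assumes the autoscaler keeps each instance near its target utilization.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation of an autoscaled service (hypothetical shape)."""
    requests_per_second: float
    instances: int


def rps_per_instance(samples):
    """Estimate sustainable requests/sec per instance as the median ratio.

    Assumption: the autoscaler holds instances near target utilization, so
    observed rps / instances approximates what one instance can carry.
    """
    ratios = sorted(
        s.requests_per_second / s.instances for s in samples if s.instances > 0
    )
    if not ratios:
        raise ValueError("no usable samples")
    mid = len(ratios) // 2
    if len(ratios) % 2:
        return ratios[mid]
    return (ratios[mid - 1] + ratios[mid]) / 2


def instances_needed(expected_rps, per_instance_rps):
    """Instances required to serve expected_rps, rounded up."""
    return math.ceil(expected_rps / per_instance_rps)


# Illustrative numbers only: the new build needs twice the instances for the
# same traffic, so the autoscaler's behavior exposes the regression.
old_build = [Sample(1000, 10), Sample(2000, 20), Sample(1500, 15)]
new_build = [Sample(1000, 20), Sample(2000, 40), Sample(1500, 30)]

old_need = instances_needed(5000, rps_per_instance(old_build))
new_need = instances_needed(5000, rps_per_instance(new_build))
```

With a pinned fleet size, both builds would run on the same instance count and
the difference would only show up if you happened to watch the right metric.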

------
camtarn
They mention that the solution remained cost neutral. Given that they're now
running a huge number of additional instances, I'm curious as to how that
could be the case. I wonder if they e.g. put a dollar figure on downtime,
based on expected lost subscriptions?

~~~
aaronblohowiak
We had already reserved the failover capacity to ensure it was available
during failover, so running more instances is just taking advantage of that.

~~~
toomuchtodo
Are those reserved instances used for preemptible tasks (transcoding,
analytics processing) when not experiencing a business continuity event? Or is
it just considered a sunk cost?

~~~
aaronblohowiak
The dark capacity is dynamically scaled throughout the day to handle
anticipated failover needs. Depending on the region and time of day, the live,
dark (and total) usage increases and decreases. When not being used for live
or dark capacity, the resources are utilized for other tasks.
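
As a simplified illustration of that split (not our real allocation logic), you
can think of a region's reserved pool as being divided three ways at any point
in time. The function name and parameters below are hypothetical, and
`failover_need` is assumed to be the total instance count the region would
need if it had to absorb evacuated traffic right now.

```python
def split_reserved_capacity(reserved, live, failover_need):
    """Return (dark, spare) instance counts for one region at one moment.

    dark:  instances kept ready beyond live usage to absorb a failover
    spare: reserved instances left over for other tasks
    """
    if live > reserved:
        raise ValueError("live usage exceeds the reserved pool")
    # Dark capacity covers the gap between live usage and the failover need,
    # but can never exceed what is left in the reserved pool.
    dark = min(max(0, failover_need - live), reserved - live)
    spare = reserved - live - dark
    return dark, spare


# Illustrative numbers only.
off_peak = split_reserved_capacity(reserved=1000, live=400, failover_need=700)
peak = split_reserved_capacity(reserved=1000, live=600, failover_need=1200)
```

In the off-peak case there is room for other work; in the peak case the whole
pool goes to live plus dark.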

------
aaronblohowiak
Latest update from our team. AMAA. Also, we're hiring:
[https://jobs.netflix.com/jobs/866321](https://jobs.netflix.com/jobs/866321)

