
Netflix Blog: Tips for High Availability - yarapavan
https://medium.com/@NetflixTechBlog/tips-for-high-availability-be0472f2599c
======
tyingq
Interesting that they chose to rename blue/green deployments to red/black. Is
there some Netflix-specific reason for that? Like their logo colors, maybe?
Red/black more commonly refers to insecure/secure
([https://en.m.wikipedia.org/wiki/Red/black_concept](https://en.m.wikipedia.org/wiki/Red/black_concept))

Not trying to be a pedant, just curious.

~~~
banachtarski
Red-black is used for all sorts of things (see red-black trees, roulette, a
checkers board, etc.). This deployment style has been labeled red-black
elsewhere, too.

~~~
tyingq
Okay, fair enough. We are, though, headed into an era of overloaded terms,
which drives confusion and unnecessary arguments. Language is funny and
fickle.

~~~
bovermyer
I'm going to start calling them "sarcoline-coquelicot" deployments now.

------
jsiepkes
The article makes it clear Netflix uses Spinnaker a lot for these things.
Anyone willing to comment on their Spinnaker usage? Especially interested in
smaller deployments.

~~~
robotmay
Takes a while to get it going but it does seem very good once it's set up. I
had trouble trying to find any guidance on ways of running DB migrations
during pipelines, so I'd be very curious as to how other people solve that (as
I'm tempted to try it out again).

~~~
hurricaneSlider
You could trigger a DB migration using a Jenkins step. Alternatively, you could
have a webhook that triggers a migration as a step in the pipeline. A third
option would be to run the migration inside the container.
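
A minimal sketch of the "migration as a pipeline step" idea, in plain Python
rather than Spinnaker or Jenkins configuration. The names `run_pipeline` and
`Step` are hypothetical and not part of either tool; the only point is that the
deploy step runs only if the migration step reports success.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
# Both names are illustrative, not a Spinnaker or Jenkins API.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            # A failed migration (or any failed step) blocks everything after it.
            break
        completed.append(name)
    return completed
```

With steps ordered as `[("migrate", ...), ("deploy", ...)]`, a failing migration
means the deploy action is never called.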

------
ghaff
Another Netflix piece on failover:
[https://opensource.com/article/18/4/how-netflix-does-failove...](https://opensource.com/article/18/4/how-netflix-does-failovers-7-minutes-flat)

~~~
amjith
Thanks for the plug. I'm the author of the failover piece. Happy to answer
questions.

~~~
tyingq
_"When we failover US-East, we send traffic from the Eastern U.S. to the EU
and traffic from South America to US-West."_

Interesting that you shift US-East to the EU. Any more color on why that
direction, versus say, US-West? Like maybe less common components in AWS? At
face value, it seems like an odd choice.

~~~
fred256
See slide 5 of this presentation:
[https://www.slideshare.net/mobile/InfoQ/chaos-kong-endowing-...](https://www.slideshare.net/mobile/InfoQ/chaos-kong-endowing-netflix-with-antifragility)

~~~
ihsw2
That "this is what success feels like" slide is mind-boggling. All Netflix
clients on all platforms on all devices simultaneously react.

I can only imagine what debugging misbehaving clients looks like; it's got to
be north of 1,000 hardware/software profiles.

------
majestik
This post should be called “Deploying while maintaining HA” or something.

I thought of some questions:

How is the dev/test env kept up to date for integration/canary tests? When a
new app/service is pushed to production does its build/image (AMI?) become the
new dev/test env base image for everyone?

Do engineers decide which metrics are tracked by Kayenta? Does ops? How does
this come together?

What about post deployment service monitoring / alerting? When is the dev team
off the pager hook?

Do you assume that a successful deployment in us-west means it will be
successful in us-east or is there an [integration] test done per region?

Just curious.

------
saagarjha
The blog post had a graph in the middle of what I assume was some sort of load
metric. It’s surprising how regular it was across the week and on the weekend.

~~~
fred256
It appears to be an SPS (stream-starts per second) graph.

[https://medium.com/netflix-techblog/sps-the-pulse-of-netflix...](https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a)

~~~
diab0lic
Hey, one of the authors of the SPS blog post here, as well as the engineer
responsible for real-time alerting on SPS.

It is an SPS graph (as indicated by the title). Spinnaker displays the graph
on that page to give engineers a visualization for what SPS looks like during
their deployment windows. If your service is critical for streaming you'll
have a preference for deploying during lower traffic hours to minimize
potential impact. Fortunately, as the post mentions, it is very regular and has
different peaks in different AWS regions, which allows regional deployments to
be staggered.
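
As a rough illustration of staggering by regional traffic, here is a sketch
that picks the lowest-traffic hour for each region and orders the regions by
that hour. The function names and the shape of the input (one list of hourly
SPS samples per region) are assumptions made for the example; this is not how
Spinnaker itself schedules deployments.

```python
from typing import Dict, List, Tuple


def lowest_traffic_hour(hourly_sps: List[float]) -> int:
    """Return the index of the hour with the lowest SPS value.

    Ties resolve to the earliest hour. Raises ValueError on an empty list.
    """
    return min(range(len(hourly_sps)), key=lambda hour: hourly_sps[hour])


def stagger_deployments(
    traffic_by_region: Dict[str, List[float]],
) -> List[Tuple[int, str]]:
    """Return (hour, region) pairs sorted by each region's traffic trough."""
    windows = [
        (lowest_traffic_hour(samples), region)
        for region, samples in traffic_by_region.items()
    ]
    return sorted(windows)
```

Because the troughs fall at different hours in different regions, the sorted
list gives a natural order for rolling a change out one region at a time.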

------
belltown98121
I work at one of the big tech companies. Some of these practices are ingrained
in me - even for services that do not promise top-tier availability. I just
realized I take much of this for granted even though it may not be common
knowledge.

