
How Canary Deployments Work in Kubernetes, Istio and Linkerd - sickeythecat
https://glasnostic.com/blog/how-canary-deployments-work-1-kubernetes-istio-linkerd
======
GauntletWizard
The article talks about a bunch of features around canary routing that
Kubernetes does not have. All of these features are misfeatures. They are
ingrained because it used to be easiest to deploy that way - you deployed to
your Australian data center and so all your Australian users got the new
version, and that was a good enough sample. Kubernetes makes deploying a
canary easy, so deploy a canary everywhere.
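
To make the "deploy a canary everywhere" pattern concrete: in plain Kubernetes a
canary is just a second, smaller Deployment whose pods carry the label the
Service selects on. Here is a minimal sketch using the official Python client,
with illustrative names and images, assuming a Service that selects only on
app=web:

    # Two Deployments behind one Service: 9 stable replicas + 1 canary
    # replica gives a ~10% canary, because the Service balances across
    # all pods labeled app=web.
    from kubernetes import client, config

    def deployment(name, track, replicas, image):
        labels = {"app": "web", "track": track}
        return client.V1Deployment(
            api_version="apps/v1",
            kind="Deployment",
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1DeploymentSpec(
                replicas=replicas,
                selector=client.V1LabelSelector(match_labels=labels),
                template=client.V1PodTemplateSpec(
                    metadata=client.V1ObjectMeta(labels=labels),
                    spec=client.V1PodSpec(containers=[
                        client.V1Container(name="web", image=image)
                    ]),
                ),
            ),
        )

    config.load_kube_config()
    apps = client.AppsV1Api()
    apps.create_namespaced_deployment("default", deployment("web-stable", "stable", 9, "web:1.0"))
    apps.create_namespaced_deployment("default", deployment("web-canary", "canary", 1, "web:1.1"))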

Likewise, it complains that "10% canary is hard" because you need to scale the
deployments in sync. This is a problem with your monitoring layer: your alerts,
triggers, and thresholds should all be based on proportion of traffic, not
pinned to a deployment size you then have to keep in sync. Your graphs should
scale to the number of canary instances. Every canary metric should be either
an absolute value or a proportion of traffic.
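
A minimal sketch of what a proportion-based canary check can look like; the
counters, tolerance, and numbers below are illustrative assumptions, not from
the article:

    # Compare the canary's error *rate* to the baseline's, so the check
    # does not care how many canary instances are running.

    def error_rate(errors, requests):
        return errors / requests if requests else 0.0

    def canary_unhealthy(canary_errors, canary_requests,
                         baseline_errors, baseline_requests,
                         tolerance=0.01):
        """Alert when the canary's error rate exceeds the baseline's
        by more than `tolerance`, regardless of traffic share."""
        return (error_rate(canary_errors, canary_requests)
                > error_rate(baseline_errors, baseline_requests) + tolerance)

    # One canary pod serving 1,000 requests is judged on the same scale
    # as the fleet serving 49,000: 3.0% vs 0.6% -> unhealthy.
    print(canary_unhealthy(30, 1_000, 300, 49_000))  # True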

In short: your canary should look as close to normal traffic as possible.
There should be no appreciable difference between the traffic going to your
canaries and the traffic going to the rest of your production deployment.
Your canary is a production deployment, and any deviance is either intentional
or a sign of a defect.

~~~
shereadsthenews
In practice it is easier to have same-sized experiment and control deployments
than it is to normalize all the stats on a per-instance or per-request basis.
I get more signal out of having identical populations than I have ever been
able to achieve trying to normalize for known differences.
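
For illustration, with equal-sized populations the comparison can be as blunt
as a two-proportion z-test over raw counts, with no normalization step at all;
the figures here are made up:

    # Z-score for the difference in error proportions between two
    # equally sized populations of n requests each.
    from math import sqrt

    def two_proportion_z(errors_a, errors_b, n):
        p_a, p_b = errors_a / n, errors_b / n
        pooled = (errors_a + errors_b) / (2 * n)
        se = sqrt(pooled * (1 - pooled) * (2 / n))
        return (p_a - p_b) / se

    # Experiment vs. control, 100k requests each; |z| > ~2 suggests a
    # real difference rather than noise. Here z is about 3.5.
    print(two_proportion_z(660, 540, 100_000))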

------
jbergstroem
For me, the missing piece of canary deployments was how to handle gradual
traffic changes ("confidence building", if you will). I now handle this with
Flagger (https://docs.flagger.app/) and Istio.
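
For a sense of what Flagger automates, here is a minimal sketch of that
stepwise confidence-building loop, with hypothetical set_canary_weight() and
canary_healthy() hooks standing in for the mesh's routing API and a real
metrics query:

    import time

    def set_canary_weight(weight):
        # Hypothetical hook: with Istio this would update the weights on
        # a VirtualService route; here it just records the split.
        print(f"canary={weight}% stable={100 - weight}%")

    def canary_healthy():
        # Hypothetical hook: query your metrics backend and compare the
        # canary's error rate and latency against the baseline.
        return True

    def progressive_rollout(step=10, max_weight=50, interval_s=60):
        """Raise the canary's traffic share step by step, rolling back
        on the first failed health check."""
        weight = 0
        while weight < max_weight:
            weight += step
            set_canary_weight(weight)
            time.sleep(interval_s)    # let metrics accumulate at this weight
            if not canary_healthy():
                set_canary_weight(0)  # roll back: all traffic to stable
                return False
        return True                   # confidence built: safe to promote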

------
labrabbit
Glasnostic CEO here. These are all valid points and good, straightforward ways
of doing things in a perfectly engineered world, where they work until they
don't.

Problem is, our world is rarely perfectly engineered. Our perfectly engineered
application becomes a service to other applications. Some other team deploys
something that affects our dependencies. An external partner hammers our API.
Our managed service provider has an issue. Noisy neighbors cause shock waves;
gray failures compound. These things happen regardless of whether your code is
correct.

Unless your service architecture is small, "perfectly engineered" is an anti-
pattern: it is too expensive to track down and code against every such event,
whether we run on Kubernetes/Istio or elsewhere. Operational challenges always
require operators.

