Hacker News
How Canary Deployments Work in Kubernetes, Istio and Linkerd (glasnostic.com)
45 points by sickeythecat on June 5, 2019 | 5 comments


The article talks about a bunch of features around canary routing that Kubernetes does not have. All of these features are misfeatures. They are ingrained because it used to be easiest to deploy that way - you deployed to your Australian data center and so all your Australian users got the new version, and that was a good enough sample. Kubernetes makes deploying a canary easy, so deploy a canary everywhere.
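
To make that concrete, here is roughly what "a canary everywhere" can look like in plain Kubernetes (the names, images and the 9:1 replica split below are made up for illustration): two Deployments share one Service selector, so the canary picks up a proportional slice of traffic in every cluster it runs in.

  apiVersion: v1
  kind: Service
  metadata:
    name: myapp
  spec:
    selector:
      app: myapp            # matches both stable and canary pods
    ports:
    - port: 80
      targetPort: 8080
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: myapp-stable
  spec:
    replicas: 9             # ~90% of pods run the current version
    selector:
      matchLabels: {app: myapp, track: stable}
    template:
      metadata:
        labels: {app: myapp, track: stable}
      spec:
        containers:
        - name: myapp
          image: myapp:1.0.0
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: myapp-canary
  spec:
    replicas: 1             # ~10% of pods run the candidate version
    selector:
      matchLabels: {app: myapp, track: canary}
    template:
      metadata:
        labels: {app: myapp, track: canary}
      spec:
        containers:
        - name: myapp
          image: myapp:1.1.0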

Likewise, it complains that "10% canary is hard" because you need to scale the deployments in sync. That is a problem with your monitoring layer: your alerts, triggers and thresholds should be based on the proportion of traffic, rather than the deployments being scaled to keep the raw numbers comparable. Your graphs should scale with the number of canary instances. Every canary metric should be either an absolute value or a proportion of traffic.
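
As a sketch of what "proportion of traffic" means in practice, here is a hypothetical Prometheus alert rule (the http_requests_total metric and the track label are assumptions about your setup) that compares error ratios rather than raw counts, so it keeps working no matter how many canary replicas exist:

  groups:
  - name: canary
    rules:
    - alert: CanaryErrorRatioHigh
      # Compare error proportions, not absolute error counts, so the
      # rule is independent of how many canary instances are running.
      expr: |
        (
          sum(rate(http_requests_total{track="canary", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{track="canary"}[5m]))
        )
        > 2 * (
          sum(rate(http_requests_total{track="stable", code=~"5.."}[5m]))
          / sum(rate(http_requests_total{track="stable"}[5m]))
        )
      for: 10m
      labels:
        severity: page
      annotations:
        summary: Canary error ratio is more than twice the stable error ratio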

In short: the traffic hitting your canary should look as close to normal traffic as possible. There should be no appreciable difference between the traffic going to your canaries and the traffic going to the rest of your production deployment. Your canary is a production deployment, and any deviation should either be intentional or be treated as a sign of a defect.


In practice it is easier to have same-sized experiment and control deployments than it is to normalize all the stats on a per-instance or per-request basis. I get more signal out of having identical populations than I have ever been able to achieve trying to normalize for known differences.
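
Concretely, that can be as simple as running a small baseline copy of the current version sized identically to the canary and comparing those two populations directly, rather than canary vs. the whole fleet (the names, sizes and images below are purely illustrative):

  # Control group: same image as production, but sized exactly like
  # the canary, so its metrics are directly comparable.
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: myapp-baseline
  spec:
    replicas: 3
    selector:
      matchLabels: {app: myapp, track: baseline}
    template:
      metadata:
        labels: {app: myapp, track: baseline}
      spec:
        containers:
        - name: myapp
          image: myapp:1.0.0   # current version
  ---
  # Experiment group: identical population size, new version.
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: myapp-canary
  spec:
    replicas: 3
    selector:
      matchLabels: {app: myapp, track: canary}
    template:
      metadata:
        labels: {app: myapp, track: canary}
      spec:
        containers:
        - name: myapp
          image: myapp:1.1.0   # candidate version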


Some data collection tools let you attach metadata to the metrics, so aggregate alerts can use all of the data, while things like canaries can be filtered on that metadata.
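
For example, in a Prometheus-style setup you can copy a pod label onto every scraped metric (assuming the pods carry a track label such as stable/canary), so queries can either aggregate across all tracks or filter down to just the canary:

  scrape_configs:
  - job_name: myapp
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    # Copy the pod's "track" label onto every metric scraped from it,
    # so aggregate alerts use everything and canary alerts can filter
    # on track="canary".
    - source_labels: [__meta_kubernetes_pod_label_track]
      target_label: track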

If the canary is generating too much traffic to a service, we need to know about that, and that will be a proportional alert. But at the same time we also need to know when the overall request rate to all servers combined drops precipitously, which may mean we are having gross routing issues. In the rare event that someone is rolling out a huge improvement, like better cache infrastructure, you disable the alert and possibly add a new one.
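
A rough sketch of those two alerts as Prometheus rules (metric and label names are assumptions again, and the 20% and 50% thresholds are arbitrary):

  groups:
  - name: traffic
    rules:
    # Proportional: the canary is taking more than its share of traffic.
    - alert: CanaryTrafficShareHigh
      expr: |
        sum(rate(http_requests_total{track="canary"}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.2
      for: 10m
    # Absolute: total traffic across all tracks has dropped sharply,
    # which may point at gross routing issues rather than a bad canary.
    - alert: OverallTrafficDrop
      expr: |
        sum(rate(http_requests_total[5m]))
        < 0.5 * sum(rate(http_requests_total[5m] offset 1h))
      for: 10m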


For me, the missing piece of canary deployments was how to handle gradual traffic changes ("confidence building", if you will). I now handle this with flagger (https://docs.flagger.app/) and Istio.
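
For anyone curious, a Flagger Canary resource looks roughly like this (abbreviated, with made-up names and thresholds): Flagger watches the target Deployment, shifts Istio traffic weight towards the canary in steps, and rolls back if the metric checks keep failing.

  apiVersion: flagger.app/v1beta1
  kind: Canary
  metadata:
    name: myapp
  spec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: myapp
    service:
      port: 80
      targetPort: 8080
    analysis:
      interval: 1m          # evaluate metrics and shift weight every minute
      threshold: 5          # roll back after 5 failed checks
      maxWeight: 50         # stop shifting once the canary gets 50% of traffic
      stepWeight: 10        # increase canary weight by 10 points per step
      metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99           # success rate must stay above 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500          # latency must stay below 500ms
        interval: 1m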


Glasnostic CEO here. These are all valid points and good, straightforward ways of doing things in a perfectly engineered world, where they work until they don't.

Problem is, our world is rarely perfectly engineered. Our perfectly engineered application becomes a service to other applications. Some other team deploys something that affects our dependencies. An external partner hammers our API. Our managed service provider has an issue. Noisy neighbors cause shock waves, gray failures compound. These things happen irrespective of whether your code is correct or not.

Unless your service architecture is small, "perfectly engineered" is an anti-pattern because it is too expensive to track down and code against such events, no matter whether we run in Kubernetes/Istio or elsewhere. Operational challenges always require operators.



