
Canary Releases in Practice - drdrey
http://dreynaud.fail/canaries-in-practice/
======
cortesoft
My experience might be a little different due to the nature of the environment
I work in, but I strongly disagree with the feature flag segment.

I run the DevOps team for a large CDN, and am the lead developer of our canary
system. We deploy a wide variety of software to a number of different systems,
and can have anywhere from 20-50 simultaneous canaries deploying code to
30,000+ servers. With that many servers and canaries going, it can be easy to
run into conflicts when you want to deploy a new version of a piece of
software while another canary is still deploying the previous version.
Canaries often need to go slowly enough to get full day-night cycles in a
number of global regions (traffic patterns are not the same everywhere).

The way we get around this is through the heavy use of feature flags, and the
idea of having all feature flags enabled by default is the opposite of our
strategy.

My old boss used a 'runway' analogy for canaries: you have to wait for the
previous release to finish 'taking off' (be deployed) before you can release
the next version. So you need to deploy quickly so you don't 'hog the runway'.

So it seems we have two conflicting goals: we want to let canaries release
slowly to see full traffic cycles, but we need to release quickly to get off
the runway.

We solve this problem by making the default behavior for any new feature
disabled when releasing the code. This lets us deploy quickly: once you
verify in your canary that your new code is not being executed (i.e. your
feature flag is working), you can finish the rollout quite quickly. Your code
should be a no-op with the feature flag disabled.
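As a minimal sketch of that default-off pattern (the flag name, config dict, and handler functions here are hypothetical, not the commenter's actual system):

```python
# Hypothetical sketch: new features ship with their flag disabled, so the
# code deploy itself is a no-op and the code canary can move quickly.
FEATURE_FLAGS = {
    "new_cache_strategy": False,  # default off at release time
}

def flag_enabled(name: str) -> bool:
    """Look up a feature flag; unknown flags default to disabled."""
    return FEATURE_FLAGS.get(name, False)

def legacy_path(request: str) -> str:
    return f"legacy:{request}"

def new_path(request: str) -> str:
    return f"new:{request}"

def handle_request(request: str) -> str:
    # The new branch is only exercised later, when the slow canary
    # flips the flag on for a growing slice of traffic.
    if flag_enabled("new_cache_strategy"):
        return new_path(request)
    return legacy_path(request)
```

The second, slower canary is then just a config change (flipping the flag), so it doesn't block anyone else's code deploys.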

Once your code is released, THEN you can start the canary for enabling your
feature flag. Feature flags (and in fact, all configuration choices) are
first-class citizens in our canary system; you can canary anything you want, and you
get all the support of a slow controlled release, with metrics and A-B
comparisons.

Since the slower canary is simply enabling a feature, other people can keep
releasing new versions of code without interfering with your feature flag
canary. Now, you can have confounding issues with multiple people running
canaries on the same systems, but our tooling lets you disambiguate which
canary is causing any issues we find.

~~~
echlebek
Thanks so much for this comment. Do you have a single person tracking the
state of all the canaries, or are developers responsible for their own release
cycles, including tracking any canaries that might interact with their own?

~~~
cortesoft
Developers are responsible for their own releases. We have a lot of tooling
around our canaries that helps keep track of everything. A lot of stuff
happens via chat; we have a chatbot that creates a channel for every canary,
and invites everyone that has any code in the release (and interested parties
that want to know whenever code is released to a certain platform). It tells
the room whenever there are conflicts, and posts status updates about the
canary as it progresses. You can also control the canary, advancing or
reverting it via commands in the channel.

Whoever is advancing the canary is responsible for checking the A-B graphs for
any problems, and verifying that everything is working before advancing. The
system will tell you if it sees issues, but it is up to the person who wrote
the code to make sure things are good.

------
windowsworkstoo
Stickiness is the key and the trickiest part, especially if you use multiple
host names pointing at the same codebase to get around HTTP 1.1 limitations on
concurrent connections per host.

For example, you roll out a new canary on your main domain and sticky a
percentage of traffic to it. Fine, requests come in, but you forgot: you have
a cdn.example.com subdomain which serves your styles/JS etc., but this is not
stickied.

The result is you either serve old content to the canaries, which isn't great
(it would fall under client errors in the blind spot, I suppose), or your
global CDN caches old content for your new cdnkey/cache buster (because that
key came from the initial canary request)... so now you turn on the canary and
_everyone_ is getting old styles/JS from the CDN. Boooooo!

~~~
_betty_
Add a hash or version to your CSS/image URLs. Store them all on the same CDN.

~~~
Klathmon
Or better yet, have the hash be the filename.

~~~
seanwilson
Is putting the hash/version in a query string generally recommended? Some CDNs
have issues with it.

~~~
Klathmon
No, I'm saying don't put the hash or version in the query string, put it in
the filename.

Name your static files with the hex-encoded SHA-1 hash of their contents, and
never worry about cache expiration again.
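A minimal sketch of that naming scheme (the helper name and extension are illustrative):

```python
import hashlib

def content_addressed_name(contents: bytes, ext: str = ".js") -> str:
    """Name a static asset after the hex-encoded SHA-1 of its contents.

    Identical contents always yield the same filename, so cached copies
    never go stale and a changed file gets a brand-new URL automatically.
    """
    return hashlib.sha1(contents).hexdigest() + ext
```

Because the name is derived purely from the bytes, a rollback reproduces the old filename and a rebuild of unchanged files leaves their URLs (and CDN cache entries) untouched.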

~~~
seanwilson
I get it, I meant: are there any advantages to this approach over using a
query string? I guess one advantage of changing the filename is that you can
easily find all the places in your code that refer to the filename without the
hash. If you forget the query string with the query-string approach, the code
will still look like it's working, which is worse.

~~~
Klathmon
It's just allowed us to completely sidestep the whole "cache" and "version
number" set of issues.

Our build system generates the filenames, so that's not as much of a problem
for me. What is a problem is cache dates that are set incorrectly, or set
correctly but that I sometimes need to override. And versioning, which can
easily "lie", whether intentionally or unintentionally.

With a "content addressable web" style, you don't need to worry. If a filename
is named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` then you know that it will
always be named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` if it has those
contents.

If you break a build and need to roll back, you will get the same filename,
and you can ensure builds are fully reproducible. You also won't run into
problems if you forget to bump a version one time, or roll back to an older
version but hotfix a bug without setting the version number correctly, or even
the "my cachebuster RNG gave me the same random number 2 times in a row and
caused a few users to error out" case that I actually hit once.

And of course it has the advantage of _forcing_ you to use it, so like you
said you can't include the file without it, or try to manually patch it in
"just this once" (which always leads to more and just causes problems).

------
callumjones
AWS needs to help drive this practice toward greater adoption: it should be a
basic feature of their LBs (and not some magic Lambda/Route53 trick).

Hopefully they’ll make it easier with their Kubernetes integration.

------
wahnfrieden
Anyone able to share details on how they achieved “sticky canaries” on AWS or
otherwise? It seems to require pretty custom infrastructure; I haven’t yet
seen an out-of-the-box solution for it.

~~~
cortesoft
An easy way to do it is to have some piece of code that decides whether a user
should be in the canary or not, and then sets a variable in their session
recording the choice. That variable determines which URL they request, and
therefore which version they use.
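As a rough sketch of that idea (the URLs, canary fraction, and session shape are made up for illustration):

```python
import random

# Hypothetical values; a real setup would pull these from config.
CANARY_FRACTION = 0.05
CANARY_URL = "https://canary.example.com"
STABLE_URL = "https://www.example.com"

def assign_backend(session: dict) -> str:
    """Decide canary membership once per session and record ('stick') it.

    The first request rolls the dice; every later request in the same
    session reuses the stored decision, so a user never flaps between
    the canary and stable versions mid-session.
    """
    if "in_canary" not in session:
        session["in_canary"] = random.random() < CANARY_FRACTION
    return CANARY_URL if session["in_canary"] else STABLE_URL
```

The same trick works with a cookie instead of server-side session state, as long as whatever stores the decision outlives a single request.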

