Canary Releases in Practice (dreynaud.fail)
72 points by drdrey on Jan 21, 2018 | 19 comments

My experience might be a little different due to the nature of the environment I work in, but I strongly disagree with the feature flag segment.

I run the DevOps team for a large CDN, and am the lead developer of our canary system. We deploy a wide variety of software to a number of different systems, and can have anywhere from 20-50 simultaneous canaries deploying code to 30,000+ servers. With that many servers and canaries going, it can be easy to run into conflicts when you want to deploy a new version of a piece of software while another canary is still deploying the previous version. Canaries often need to go slowly enough to get full day-night cycles in a number of global regions (traffic patterns are not the same everywhere).

The way we get around this is through the heavy use of feature flags, and the idea of having all feature flags enabled by default is the opposite of our strategy.

My old boss used a 'runway' analogy for canaries: you have to wait for the previous release to finish 'taking off' (be deployed) before you can release the next version. So you need to deploy quickly so you don't 'hog the runway'.

So it seems we have two disparate goals; we want to let canaries release slowly to see full traffic cycles, but need to release quickly to get off the runway.

We solve this problem by having the default behavior for any new feature be disabled when releasing the code. This allows us to deploy quickly; once you verify in your canary that your new code is not being executed (i.e. your feature flag is working), you can deploy quite quickly. Your code should be a no-op with the feature flag disabled.
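
As an illustrative sketch only (the names `FLAGS`, `is_enabled`, and `handle_request` are hypothetical and not taken from the system described here), a default-off flag check might look like this. An unset flag reads as disabled, so freshly deployed code stays a no-op until someone turns the flag on.

```python
# Hypothetical flag store: any flag that is not listed is treated as disabled.
FLAGS: dict[str, bool] = {}


def is_enabled(name: str) -> bool:
    """Return the flag's value, defaulting to False for unknown flags."""
    return FLAGS.get(name, False)


def old_path(request: str) -> str:
    # Stand-in for the existing behavior.
    return f"old:{request}"


def new_path(request: str) -> str:
    # Stand-in for the new code that ships dark.
    return f"new:{request}"


def handle_request(request: str) -> str:
    # The new path runs only when the flag is explicitly turned on,
    # so deploying this code with the flag off changes nothing.
    if is_enabled("new_feature"):
        return new_path(request)
    return old_path(request)
```

With an empty `FLAGS`, `handle_request` always takes the old path, which is the property you would verify in the fast code canary.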

Once your code is released, THEN you can start the canary for enabling your feature flag. Feature flags (and in fact, all configuration choices) are first class citizens in our canary system; you can canary anything you want, and you get all the support of a slow controlled release, with metrics and A-B comparisons.
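
The comment does not say how hosts are picked for a flag canary, so the following is only one possible approach, assuming hash-based bucketing: each (flag, host) pair maps to a stable bucket, and the flag is on for hosts whose bucket falls below the current rollout percentage. The function names and the `edge-N` host names are made up for illustration.

```python
import hashlib


def bucket(host: str, flag: str) -> int:
    """Map a (flag, host) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{host}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def flag_enabled_on(host: str, flag: str, percent: int) -> bool:
    """True if this host falls inside the current rollout percentage."""
    return bucket(host, flag) < percent


hosts = [f"edge-{i}" for i in range(200)]
# Hosts enabled at 10% remain enabled when the canary advances to 50%.
enabled_at_10 = {h for h in hosts if flag_enabled_on(h, "new_feature", 10)}
enabled_at_50 = {h for h in hosts if flag_enabled_on(h, "new_feature", 50)}
```

Because the bucket depends only on the flag and host names, advancing the percentage only ever adds hosts, and reverting removes them in the reverse order.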

Since the slower canary is simply enabling a feature, other people can keep releasing new versions of code without interfering with your feature flag canary. Now, you can have confounding issues with multiple people making canaries on the same systems, but our tooling allows you to disambiguate which canary is causing issues we find.

Thanks so much for this comment. Do you have a single person tracking the state of all the canaries, or are developers responsible for their own release cycles, including tracking any canaries that might interact with their own?

Developers are responsible for their own releases. We have a lot of tooling around our canaries that helps keep track of everything. A lot of stuff happens via chat; we have a chatbot that creates a channel for every canary, and invites everyone that has any code in the release (and interested parties that want to know whenever code is released to a certain platform). It tells the room whenever there are conflicts, and will say information about the canary as it progresses. You can also control the canary, advancing or reverting via commands in the channel.

Whoever is advancing the canary is responsible for checking the A-B graphs for any problems, and verifying that everything is working before advancing. The system will tell you if it sees issues, but it is up to the person who wrote the code to make sure things are good.

Stickiness is the key and the trickiest part, especially if you use multiple host names pointing to the same codebase to get around HTTP 1.1 limitations on multiple connections.

For example, you roll out a new canary on your main domain and sticky a percentage of traffic to it. Fine, requests come in, but you forgot: you have a cdn.example.com subdomain which serves your styles/js etc., and it is not stickied.

The result is you either serve old content to the canaries, which isn't great (it would fall under client errors in the blind spot, I suppose), or your global CDN caches old content under your new cdnkey/cache buster (because that key came from the initial canary request), so now you turn on the canary and everyone is getting old styles/js from the CDN. Boooooo!
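
A toy model of that second failure mode, assuming a CDN that caches by full URL and an origin that ignores the `?v=` query string (both are assumptions for illustration; `ToyCDN` and the CSS bodies are made up):

```python
OLD_CSS = "body { color: black; }"
NEW_CSS = "body { color: red; }"


def old_origin(url: str) -> str:
    # The old release ignores the ?v= query string and serves what it has.
    return OLD_CSS


def new_origin(url: str) -> str:
    return NEW_CSS


class ToyCDN:
    def __init__(self, origin):
        self.origin = origin
        self.cache: dict[str, str] = {}

    def get(self, url: str) -> str:
        # On a miss, fetch from the origin and cache under the full URL.
        if url not in self.cache:
            self.cache[url] = self.origin(url)
        return self.cache[url]


# The CDN origin is not stickied, so the miss lands on the old release.
cdn = ToyCDN(old_origin)
# A canary page references the new cache-buster key.
first = cdn.get("/style.css?v=2")
# The rollout completes and the origin now runs the new release...
cdn.origin = new_origin
# ...but the CDN keeps serving the stale body it cached under the new key.
after = cdn.get("/style.css?v=2")
```

In this model both `first` and `after` are the old stylesheet, even though the URL carries the new version key.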

(I work for Google, opinions are my own).

GCP offers stickiness if you use their HTTPS load balancer with a feature called session affinity[0]. And the CDN solution GCP provides also uses the HTTPS load balancer, so that sort of problem shouldn't happen (at least for your example, but it would be easy to architect it in a way that session affinity wouldn't fix the problem you described).

It doesn't look like k8s has as nice of features around session affinity, it seems only to support client IP affinity. (I'm not as familiar with k8s, so feel free to correct me)

There are ways out there to support this; it may just depend on your infrastructure.

[0] https://cloud.google.com/compute/docs/load-balancing/http/#s...

Add a hash or version to your CSS/image URLs. Store them all on the same CDN.

Or better yet, have the hash be the filename.

Is putting the hash/version in a query string generally recommended? Some CDNs have issues with it.

I prefer the file name because it fails obviously: use the wrong value and you get a 404 whereas many implementations around the query string meant you’d get a different version than expected.

I’ve seen that cause a lot of confusion because people would look and think that it was working correctly until they tested what the backend was actually serving. I like the approach of adding a hash to the file name (e.g. foo.<SHA>.css) so that cannot happen but related files are grouped together.

Any decent environment should make that simple: your code references foo.css and it’s automatically replaced with the expanded value.
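
A minimal sketch of that replacement step, assuming the build emits a manifest from logical names to hashed file names. The manifest contents, the `asset_url` helper, and the hash fragments below are made up for illustration; cdn.example.com is reused from the example earlier in the thread.

```python
# Hypothetical build manifest: logical asset name -> hashed file name.
MANIFEST = {
    "foo.css": "foo.3f2a9c1d.css",
    "app.js": "app.b81e07aa.js",
}


def asset_url(name: str, base: str = "https://cdn.example.com/") -> str:
    """Return the CDN URL for a logical asset name such as 'foo.css'."""
    # A missing entry raises KeyError at render time rather than
    # silently emitting an unhashed URL.
    return base + MANIFEST[name]


link_tag = f'<link rel="stylesheet" href="{asset_url("foo.css")}">'
```

Templates only ever mention `foo.css`; the hashed name changes with each build without anyone editing the template.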

No, I'm saying don't put the hash or version in the query string, put it in the filename.

Have your static files named a hex encoded sha1 hash of the contents, then never worry about cache expiration again.
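
A rough sketch of the naming step, using Python's standard `hashlib` (the `hashed_name` helper is just an illustrative name, and keeping the original suffix is an assumption):

```python
import hashlib


def hashed_name(contents: bytes, suffix: str) -> str:
    """Return '<hex SHA-1 of contents><suffix>', e.g. '<40 hex chars>.js'."""
    return hashlib.sha1(contents).hexdigest() + suffix


# Identical contents always produce the identical file name.
name_a = hashed_name(b"console.log('hi');", ".js")
name_b = hashed_name(b"console.log('hi');", ".js")
# Any change to the contents produces a different file name.
name_c = hashed_name(b"console.log('bye');", ".js")
```

Since the name is a pure function of the contents, a given URL can be cached indefinitely without ever serving the wrong bytes.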

I get it, I meant are there any advantages to this approach over using a query string? I guess one advantage of changing the filename is you can easily find all the places in your code that are referring to the filename without the hash. If you forget the query string with the query string approach, the code will still look like it's working, which is worse.

It's just allowed us to completely sidestep the whole "cache" and "version number" issues.

Our build system will generate the filenames, so that's not as much of a problem for me, but what is a problem is cache dates that are set incorrectly, or set correctly but that I sometimes need to override. And versioning, which can easily "lie", whether intentionally or unintentionally.

With a "content addressable web" style, you don't need to worry. If a filename is named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` then you know that it will always be named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` if it has those contents.

If you break a build and need to roll back, you will get the same filename, you can ensure builds are fully reproducible, and you won't run into problems if you forget to bump a version one time, or you roll back to an older version but hotfix a bug and don't set the version number correctly, or even the "my cachebuster RNG gave me the same random number 2 times in a row and it caused a few users to error out" that I actually hit once.

And of course it has the advantage of forcing you to use it, so like you said you can't include the file without it, or try to manually patch it in "just this once" (which always leads to more and just causes problems).

I recommend this to everyone regardless of whether they're using canary releases or not. Sidesteps a bunch of issues regarding static asset caching.

Yes, and that's the problem. Think about the request flow during a canary deploy: a request hits the canary home page HTML, which references a CDN subdomain with a new version string, but that CDN subdomain goes to the old, non-canary version.

I'm not sure I understand. Let's write it out:

User 1 goes to homepage.com and gets served the canary html.

User 1's browser sees it needs the js file included on the homepage, identified by its hash $hash, and so it requests the file $hash from cdn.homepage.com/$hash.

User 1 gets $hash and everything loads fine.

User 2 goes to homepage.com and does not get the canary, so their browser requests $old_hash, and gets the old version of the file from cdn.homepage.com/$old_hash, and everything is fine?

Unless you are talking about rolling out a new version of the cdn server along with the main website I don't see what the issue is here?
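
The walk-through above can be modeled with a toy content-addressed store. This assumes both builds upload their assets to the shared CDN before any HTML referencing them is served; the `publish` helper and the file contents are made up.

```python
import hashlib

# Toy CDN: every published build adds files and nothing is ever overwritten.
cdn: dict[str, bytes] = {}


def publish(contents: bytes) -> str:
    """Store contents under their SHA-1 hex digest and return that key."""
    key = hashlib.sha1(contents).hexdigest()
    cdn[key] = contents
    return key


old_hash = publish(b"console.log('stable');")
new_hash = publish(b"console.log('canary');")

# User 1 got the canary HTML, which references new_hash.
user1_js = cdn.get(new_hash)
# User 2 got the stable HTML, which references old_hash.
user2_js = cdn.get(old_hash)
# A hash that was never published is simply absent (a 404 in practice).
missing = cdn.get("0" * 40)
```

Both versions coexist under different keys, so neither user can be handed the other's file. The scheme only breaks if the canary HTML goes out before its assets are published.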

Static content, including js files: like everybody says in this thread, you should use cache busting (add hash of file to filename) regardless of canary/blue-green pattern.

Back end services: use a cookie or a header. Once a user is selected for the canary, their device gets a special value in a cookie or header, and your service router sends their requests to the right set of servers.
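
A rough sketch of the routing half, assuming a hypothetical `X-Canary` header and made-up pool names (a real router would also normalize header case):

```python
STABLE_POOL = ["stable-1", "stable-2"]
CANARY_POOL = ["canary-1"]


def pick_pool(headers: dict[str, str]) -> list[str]:
    """Route on the 'X-Canary' header; anything other than '1' goes to stable."""
    if headers.get("X-Canary") == "1":
        return CANARY_POOL
    return STABLE_POOL
```

Defaulting to the stable pool means a missing or malformed marker can never put a user on the canary by accident.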

AWS needs to help drive this practice towards a greater adoption: it should be a basic feature of their LBs (and not some magic Lambda/Route53 trick).

Hopefully they’ll make it easier with their Kubernetes integration.

Anyone able to share details on how they achieved “sticky canaries” on AWS or otherwise? It seems to require pretty custom infrastructure, haven’t yet seen an out of the box solution for it.

An easy way to do it is have some piece of code that chooses if a user should be in the canary or not, and then sets a variable in their session recording the choice. That variable is used to determine which url they request, and therefore which version they use.
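
Something like the following, as a sketch under assumptions: the session is modeled as a plain dict, the key `"canary"` is arbitrary, and the two hostnames are placeholders.

```python
import random


def assign_canary(session: dict, percent: float, rng: random.Random) -> bool:
    """Decide once per session whether the user is in the canary, then reuse it."""
    if "canary" not in session:
        session["canary"] = rng.random() * 100 < percent
    return session["canary"]


def base_url(session: dict) -> str:
    """Pick the host to request from, based on the recorded choice."""
    if session.get("canary"):
        return "https://canary.example.com"
    return "https://www.example.com"
```

The random draw happens only on the first call for a session; every later request reads the stored value, which is what makes the canary sticky.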
