I run the DevOps team for a large CDN, and am the lead developer of our canary system. We deploy a wide variety of software to a number of different systems, and can have anywhere from 20-50 simultaneous canaries deploying code to 30,000+ servers. With that many servers and canaries going, it can be easy to run into conflicts when you want to deploy a new version of a piece of software while another canary is still deploying the previous version. Canaries often need to go slowly enough to get full day-night cycles in a number of global regions (traffic patterns are not the same everywhere).
The way we get around this is through the heavy use of feature flags, and the idea of having all feature flags enabled by default is the opposite of our strategy.
My old boss used a 'runway' analogy for canaries: you have to wait for the previous release to finish 'taking off' (be fully deployed) before you can release the next version. So you need to deploy quickly so you don't 'hog the runway'.
So it seems we have two competing goals: we want to let canaries release slowly to see full traffic cycles, but we need to release quickly to get off the runway.
We solve this problem by having the default behavior for any new feature be disabled when releasing the code. This allows us to deploy quickly; once you verify in your canary that your new code is not being executed (i.e. your feature flag is working), you can deploy quite quickly. Your code should be a no-op with the feature flag disabled.
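The disabled-by-default pattern can be sketched in a few lines. Everything here is illustrative — `FLAGS`, `serve_legacy`, and `serve_with_new_cache` are made-up names, not anything from their actual system:

```python
# Minimal sketch of "new code ships disabled": the new path only runs
# when its flag is flipped on, so deploying the binary is a no-op.
FLAGS = {"new_cache_layer": False}  # every new feature starts disabled

def is_enabled(name: str) -> bool:
    """Look up a feature flag, defaulting to off for unknown flags."""
    return FLAGS.get(name, False)

def handle_request(request: dict) -> str:
    if is_enabled("new_cache_layer"):
        return serve_with_new_cache(request)  # new code path
    return serve_legacy(request)              # existing behavior

def serve_with_new_cache(request: dict) -> str:
    return "new:" + request["path"]

def serve_legacy(request: dict) -> str:
    return "legacy:" + request["path"]
```

Deploying this code changes nothing for users; flipping `new_cache_layer` to `True` later is the separate, slow canary.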
Once your code is released, THEN you can start the canary for enabling your feature flag. Feature flags (and in fact, all configuration choices) are first class citizens in our canary system; you can canary anything you want, and you get all the support of a slow controlled release, with metrics and A-B comparisons.
Since the slower canary is simply enabling a feature, other people can keep releasing new versions of code without interfering with your feature flag canary. Now, you can have confounding issues with multiple people running canaries on the same systems, but our tooling allows you to disambiguate which canary is causing the issues we find.
Whoever is advancing the canary is responsible for checking the A-B graphs for any problems, and verifying that everything is working before advancing. The system will tell you if it sees issues, but it is up to the person who wrote the code to make sure things are good.
For example, you roll out a new canary on your main domain and sticky a percentage of traffic to it. Fine, requests come in, but you forgot: you have a cdn.example.com subdomain which serves your styles/JS etc., and that traffic is not stickied.
The result is you either serve old content to the canary users, which isn't great (it would fall under client errors in the blind spot, I suppose), or your global CDN caches old content under your new cdnkey/cache buster (because that key came from the initial canary request)... so now you turn the canary on and everyone is getting old styles/JS from the CDN. Boooooo!
GCP offers stickiness if you use their HTTPS load balancer, via a feature called session affinity. And GCP's CDN solution also uses the HTTPS load balancer, so that sort of problem shouldn't happen (at least for your example; it would be easy to architect things in a way that session affinity wouldn't fix the problem you described).
It doesn't look like k8s has as nice a set of features around session affinity; it seems to support only client-IP affinity. (I'm not as familiar with k8s, so feel free to correct me.)
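For what it's worth, that's roughly all a plain k8s Service exposes: `sessionAffinity: ClientIP` plus an optional timeout. A sketch of that manifest, written as a Python dict; the name and selector values are made up:

```python
# The session-affinity knob on a plain Kubernetes Service: only
# ClientIP affinity is available (no cookie/header-based stickiness).
service_manifest = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "web-canary"},  # illustrative name
    "spec": {
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 8080}],
        "sessionAffinity": "ClientIP",  # pin each client IP to one backend pod
        "sessionAffinityConfig": {
            "clientIP": {"timeoutSeconds": 10800}  # the default (3 hours)
        },
    },
}
```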
There are ways out there to support this; it may just depend on your infrastructure.
I’ve seen that cause a lot of confusion because people would look and think that it was working correctly until they tested what the backend was actually serving. I like the approach of adding a hash to the file name (e.g. foo.<SHA>.css) so that cannot happen but related files are grouped together.
Any decent environment should make that simple: your code references foo.css and it’s automatically replaced with the expanded value.
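A minimal sketch of that build step, assuming a hypothetical manifest that maps logical names to their hashed forms (the entries and regex here are invented for illustration):

```python
import re

# Build-time reference rewriting: the manifest (logical name ->
# content-hashed name) would be produced by the build itself.
MANIFEST = {"foo.css": "foo.3f2a9c1.css", "app.js": "app.9d8e7b2.js"}

def rewrite_refs(html: str) -> str:
    """Replace every known asset reference with its hashed filename."""
    def swap(match: re.Match) -> str:
        name = match.group(0)
        return MANIFEST.get(name, name)  # leave unknown references alone
    return re.sub(r"[\w.-]+\.(?:css|js)", swap, html)
```

Because the replacement happens automatically, code keeps referencing `foo.css` while browsers only ever see the hashed name.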
Have your static files named a hex encoded sha1 hash of the contents, then never worry about cache expiration again.
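A quick sketch of what that naming looks like, using Python's `hashlib`; the helper name is made up:

```python
import hashlib
from pathlib import Path

def hashed_name(contents: bytes, original_name: str) -> str:
    """Name a static file by the hex SHA-1 of its contents, keeping the suffix.

    The same bytes always produce the same name, so a cached copy can
    never be stale: any change to the file changes its URL.
    """
    digest = hashlib.sha1(contents).hexdigest()  # 40 hex characters
    return digest + Path(original_name).suffix
```

With this scheme the files themselves can be cached forever (immutable), and only the page that references them needs a short cache lifetime.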
Our build system generates the filenames, so that's not as much of a problem for me. What is a problem is cache dates that are set incorrectly, or set correctly but occasionally need to be overridden, and versioning, which can easily "lie", whether intentionally or unintentionally.
With a "content addressable web" style, you don't need to worry. If a filename is named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` then you know that it will always be named `bde1ca6a5d7cefc8108c75fdaad29ed6.js` if it has those contents.
If you break a build and need to roll back, you get the same filename back, you can make builds fully reproducible, and you won't run into problems if you forget to bump a version one time, or if you roll back to an older version, hotfix a bug, and don't set the version number correctly. It even avoids the "my cachebuster RNG gave me the same random number 2 times in a row and it caused a few users to error out" failure that I actually hit once.
And of course it has the advantage of forcing you to use it, so like you said you can't include the file without it, or try to manually patch it in "just this once" (which always leads to more and just causes problems).
User 1 goes to homepage.com and gets served the canary html.
User 1 sees they need the js file included on the homepage, identified by its hash $hash, and so requests the file from cdn.homepage.com/$hash.
User 1 gets $hash and everything loads fine.
User 2 goes to homepage.com and does not get the canary, so their browser requests $old_hash, and gets the old version of the file from cdn.homepage.com/$old_hash, and everything is fine?
Unless you are talking about rolling out a new version of the CDN server along with the main website, I don't see what the issue is here.
Back end services: use a cookie or a header. Once a user is selected for canarying, their requests carry a special value in a header, and your service router sends those requests to the right set of servers.
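A toy sketch of that routing decision; the header name `X-Canary` and the backend pools are invented for illustration:

```python
# Header-based canary routing: once a user is selected, every request
# they send carries the header, so all their traffic lands on the
# canary pool regardless of which front end terminates the connection.
CANARY_HEADER = "X-Canary"

BACKENDS = {
    "canary": ["10.0.1.10", "10.0.1.11"],               # canary build
    "stable": ["10.0.0.10", "10.0.0.11", "10.0.0.12"],  # current release
}

def pick_pool(headers: dict) -> str:
    """Route to the canary pool only when the selection header is set."""
    return "canary" if headers.get(CANARY_HEADER) == "1" else "stable"
```

Unlike client-IP affinity, this survives NAT and changing client addresses, since the selection travels with the request itself.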
Hopefully they’ll make it easier with their Kubernetes integration.