
How release canaries can save your bacon - mooreds
https://cloudplatform.googleblog.com/2017/03/how-release-canaries-can-save-your-bacon-CRE-life-lessons.html
======
luctor_ad_astra
A quick rollback relieves the developer's contextual burden and makes it easy
to take calculated risks. This works better in some environments than others,
of course; you wouldn't want it when testing a medical support system, for
example.
LinkedIn goes a bit further and has continuous release monitoring:
[https://engineering.linkedin.com/blog/2015/11/monitoring-the-pulse-of-linkedin](https://engineering.linkedin.com/blog/2015/11/monitoring-the-pulse-of-linkedin)

------
siliconc0w
I'm less enthused about 'rollbacks' being considered 'normal'. They signify
something didn't go quite right with your unit/integration/QA process. IMO
there should be at least a 'mini-postmortem' to understand why the issue was
missed, even if it's in an intentional blind spot (i.e., you made an explicit
decision that it wasn't worth the engineering resources to get the testing
fidelity needed to catch the issue earlier). It's almost always better to
catch issues earlier, even if you have super neat tooling that makes it easy
to roll back.

~~~
nurettin
You are a CTO sitting atop very expensive hardware and software. Would you
start removing deployment and runtime safety guards (such as a consumer-facing
staging environment) because you want to "discipline coders and devops"?

~~~
d4mi3n
A post-mortem should never be about placing blame on individuals; it should be
about identifying flaws in a system or a process.

There are places where post-mortems can turn into blame games, but in my
experience such things are counter-productive to actually solving problems.
Luckily, there are plenty of engineering organizations that do not make this
mistake! :)

~~~
qqg3
The easiest way to avoid that is to have a well-structured post-mortem
process, and to post-mortem everything: successful and unsuccessful releases.

~~~
ojilles
We need to go from postmortem to postpartum!

------
jwatte
Yes, a Canary lets you limit the damage if some bug sneaks past testing. We've
done it for over 10 years, with staged rollouts and automated crash statistics
and such.

The drawback is that prod needs to be tolerant of multiple versions, which is
usually a fine practice in itself anyway!

~~~
skookum
That's not a drawback so much as it's a fact of life. In any large-scale
(read: distributed) system trying to provide a high degree of availability,
rolling upgrades are the only way code goes out and individual components need
to deal with interacting with newer/older dependencies. You can constrain the
matrix by only allowing current version plus one back running in production,
or forcing deployment orders, and so on, but in modern systems (read: ones
where you can't just say "We're taking everything down for 3 hours on Sunday
to upgrade.") you can't escape non-atomic upgrades.
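
A minimal sketch of what that "current version plus one back" tolerance can
look like at the message-handling level (the envelope format and field names
here are made up):

    # Hypothetical message envelope carrying the sender's schema version.
    # Components accept the current version plus one back, so a rolling
    # upgrade never strands an older peer mid-rollout.
    SUPPORTED_VERSIONS = {2, 3}  # v3 is current, v2 is one back

    def handle_message(msg: dict) -> dict:
        version = msg.get("schema_version", 1)
        if version not in SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported schema version: {version}")
        # Fields introduced in v3 must have defaults so v2 senders still work.
        priority = msg.get("priority", "normal")
        return {"status": "ok", "priority": priority}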

~~~
londons_explore
Another approach is to start up a full copy of the new system with all new
versions, then change the load balancers to direct all traffic away from the
old system and to the new one. Then decommission the old system.

With dedicated hardware, you need twice as much hardware. With cloud, you only
pay double for the ~10 minutes of the rollout, which is usually very cheap.
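
For the cutover itself, a rough sketch using boto3's ELBv2 API; the listener
and target group ARNs are placeholders, and health checks before the flip are
omitted:

    import boto3

    elb = boto3.client("elbv2")

    # Placeholder ARNs for the listener and the new ("green") fleet.
    LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example"
    GREEN_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/green"

    def cut_over_to_green():
        # Point all listener traffic at the new target group; the old fleet
        # stops receiving requests and can then be decommissioned.
        elb.modify_listener(
            ListenerArn=LISTENER_ARN,
            DefaultActions=[
                {"Type": "forward", "TargetGroupArn": GREEN_TARGET_GROUP_ARN}
            ],
        )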

~~~
skookum
That approach only works if your system is stateless, a caveat which excludes
virtually all large-scale systems.

------
alpb
If you are interested in canary deployments, check out Spinnaker by Netflix:
[http://www.spinnaker.io/](http://www.spinnaker.io/) There's a good talk about
it here with stories from Waze and Google:
[https://www.youtube.com/watch?v=05EZx3MBHSY](https://www.youtube.com/watch?v=05EZx3MBHSY)

------
retreatguru
What are the best practices regarding rollbacks when the database is affected?
I would think a large amount of overhead would be required.

~~~
ademarre
They touched on this in their SRE post last week:
[https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html](https://cloudplatform.googleblog.com/2017/03/reliable-releases-and-rollbacks-CRE-life-lessons.html)

~~~
ams6110
From that link: _At Google, our philosophy is that “rollbacks are normal.”
When an error is found or reasonably suspected in a new release, the releasing
team rolls back first and investigates the problem second._

I like that -- reminds me of aviation, where a go-around is normal. If your
approach to landing isn't stabilized (you're too high, too low, too fast, too
slow, etc.), don't try to save it. Go around and try again.

~~~
londons_explore
At our place, we roll back every few weeks just to test the system, even if
nothing appears abnormal.

Next we plan to automate the process: 1 in 10 rollouts will actually be a
rollout, a rollback, and another rollout, checking system health at each step.
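
A sketch of that drill; deploy(), rollback(), and healthy() are hypothetical
stand-ins for whatever your deployment and monitoring tooling provides:

    import random

    def deploy(version: str) -> None:
        """Placeholder: trigger your deployment tooling for `version`."""
        print(f"deploying {version}")

    def rollback(version: str) -> None:
        """Placeholder: roll back to the previous known-good `version`."""
        print(f"rolling back to {version}")

    def healthy() -> bool:
        """Placeholder: check error rates, latency, etc. in monitoring."""
        return True

    def release(new_version: str, old_version: str) -> None:
        deploy(new_version)
        assert healthy(), "rollout looks unhealthy"
        # ~1 in 10 releases also exercises the rollback path end to end.
        if random.random() < 0.1:
            rollback(old_version)
            assert healthy(), "rollback path looks unhealthy"
            deploy(new_version)
            assert healthy(), "re-rollout looks unhealthy"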

~~~
hectormalot
I was just about to suggest this. Similar to how 'chaos monkey' kills my
processes once every few days, developers would also look differently at
rollback procedures if I guaranteed that one in 10 rollouts would be rolled
back at random.

------
Pfhreak
Interesting. I've heard of this practice as 'one box' or 'one pod'. And canary
used to mean 'tests that run continuously against your production stack.'

I wonder which is more prevalent.

~~~
obstinate
Probably depends on your workplace, but at Google canary has meant a subset of
production running at a newer version at least since '07.

------
pgrote
I wonder how the botched Google Drive release from a few weeks ago worked
under this scenario?

------
wahnfrieden
Anyone find a good way to do this with AWS Lambda / API Gateway?

~~~
mooreds
Well, you could definitely front two versions of your application lambda
function with a traffic splitter lambda function that would send 99% of
traffic to the production alias and 1% to the canary (or whatever number you
wanted). See how to call one lambda from another:
[http://stackoverflow.com/questions/31714788/can-an-aws-lambda-function-call-another](http://stackoverflow.com/questions/31714788/can-an-aws-lambda-function-call-another)
and aliases:
[http://docs.aws.amazon.com/lambda/latest/dg/versioning-aliases.html](http://docs.aws.amazon.com/lambda/latest/dg/versioning-aliases.html)

This post might also be of interest:
[https://blog.jayway.com/2016/09/07/continuous-deployment-aws-lambda-behind-api-gateway/](https://blog.jayway.com/2016/09/07/continuous-deployment-aws-lambda-behind-api-gateway/)

Note, I have never done this; this is just how I'd approach it.
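
A minimal sketch of that splitter, assuming boto3; the function name, alias
names, and the 1% share are placeholders:

    import json
    import random
    import boto3

    lambda_client = boto3.client("lambda")

    FUNCTION_NAME = "my-app-function"  # hypothetical backend function
    CANARY_FRACTION = 0.01             # send ~1% of traffic to the canary alias

    def handler(event, context):
        # Pick the alias for this request, then invoke the real function.
        alias = "canary" if random.random() < CANARY_FRACTION else "prod"
        response = lambda_client.invoke(
            FunctionName=FUNCTION_NAME,
            Qualifier=alias,
            InvocationType="RequestResponse",
            Payload=json.dumps(event),
        )
        return json.loads(response["Payload"].read())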

------
Touche
> One solution is to version your JavaScript files (first release in a /v1/
> directory, second in a /v2/ etc.). Then the rollout simply consists of
> changing the resource links in your root pages to reference the new (or old)
> versions.

I wouldn't take this advice, as it's bad for caching. A change to one
JavaScript file will then break the cache for everything.

~~~
mattnewton
Maybe I misunderstand, but don't you want to invalidate the cache when a new
version comes along? Isn't the risk of version skew worse?

~~~
Touche
If one.js changes but two.js doesn't, then two.js should come from the cache.
Only one.js should be fetched from network. Sticking all assets in an
/assets/v2 folder invalidates _everything_.

~~~
gefh
If one.js and two.js are really separate components, they should each get a
version. If they're closely coupled, they should be compiled together into one
unit, to take advantage of deduplication, inlining, dead code elimination,
fewer requests, better compression, etc etc.

~~~
Touche
Versioning assets separately by sticking them in new subfolders is just re-
inventing ETags in a bad way. Just use ETags.

The point of the folder versioning scheme the article proposes is to make
rollbacks easier. You can easily roll back an /assets/v2 folder to /assets/v1
by updating your server template, but if you have a dozen separate version
folders (with different latest version numbers) for each resource then it's no
longer easy to roll them all back.
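
For what it's worth, a toy sketch of the ETag approach (framework-agnostic;
the file paths and header plumbing are simplified):

    import hashlib

    def serve_asset(path, if_none_match=None):
        # Compute a strong ETag from the file contents.
        with open(path, "rb") as f:
            body = f.read()
        etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
        if if_none_match == etag:
            # Unchanged file: client keeps its cached copy.
            return 304, {"ETag": etag}, b""
        # Changed file: only this asset is re-downloaded.
        return 200, {"ETag": etag}, body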

------
kazinator
> _any reliable software release is being able to roll back if something goes
> wrong; we discussed how we do this at Google ..._

How we do this at Google? Okay, tell me how to roll the crap Android 6
upgrade back to 5 on this Samsung tablet I have here.

~~~
kasey_junk
Clearly everyone should consider their delivery costs and error acceptability
when determining their development & release process.

Continuous automated deployments might not be a great fit for satellite
control software, but that doesn't mean SaaS apps should switch to the same
process as satellite software teams.

------
packetized
I am mostly amazed that this is the first time that I've read in depth about
John Scott Haldane, who is the father of noted evolutionary biologist J.B.S.
Haldane. Super interesting.

------
mrj
As a high traffic customer of Google, I've been this person far too many
times.

> [...] if it breaks, real users get affected, so canarying should be the
> first step in your deployment process, as opposed to the last step in
> testing.

It's a fine pattern and all, but not an excuse to throw stuff at prod and see
what happens.

~~~
draw_down
What's a high traffic customer of Google?

~~~
mrj
Host a high-traffic site on Google's infrastructure. Since you can see the
version number of the platform, it's obvious when they're rolling out changes.
This has caused many partial outages until the change was (I assume)
automatically rolled back.

It's a little hard to take this advice from Google after being the victim of
so many bad rollouts. Because we use a lot of services, we are far more likely
to have problems. We seem to always be the canary.

That's not fun.

~~~
jpatokal
But without canaries, you would have complete outages instead of partial ones.

~~~
mrj
Sure, but I'm pointing out that this strategy relies on real customers
encountering an error. I caution people not to forget that this is a failure
for those of us trying to ensure reliable websites.

