Great article. The checklist bit speaks to me in particular. I once worked on a team that was responsible for maintaining a medium-sized legacy application. After one-too-many failed deploys, we instituted a pre-deployment checklist. The checklist consisted of things like:
* Have all the tests run and passed? Is CI green?
* In addition to the tests passing, has a third party (not the devs themselves) clicked around and kicked the tires on any new features?
* What is the rollback plan? Can we roll back cleanly?
* What is in this set of changes? Should all of these changes go out?
These checks seem stupid, but the simple act of taking a breath and slowly going down the list really seemed to make a big difference. After instituting the list we had almost no failed deployments.
I don't know if this is an approach that generalizes. I tend to think that for clean, well-maintained and well-tested web applications, a continuous deployment approach is safer and faster in the long run. But when you're working on legacy code, a checklist works wonders.
And software has an even bigger advantage (making the lack of checklist use inexcusable): the whole process can be encoded in software.
You can restrict deploys to only CI-checked revisions, you can mandate some sort of functional testing (with the tester stamping the revision), you can check that the rollback procedure works and works cleanly, and you can put the change list in front of somebody and ask that its items be validated.
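To make that concrete, here is a minimal sketch of the checklist encoded as a deploy gate. Everything in it is hypothetical (the `Revision` fields, the dry-run flag, the stamp); a real setup would query CI, the QA tracker, and so on rather than carry booleans around.

```python
# Hypothetical deploy gate: each checklist item becomes a machine-checked flag.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Revision:
    sha: str
    ci_green: bool = False           # have all the tests run and passed?
    qa_stamp: Optional[str] = None   # who (not the author) kicked the tires?
    rollback_ok: bool = False        # did a rollback dry run succeed?
    changes_ack: bool = False        # has someone validated the change list?

def deploy_gate(rev: Revision) -> list:
    """Return the unmet checklist items; an empty list means deploy."""
    problems = []
    if not rev.ci_green:
        problems.append("CI is not green for this revision")
    if rev.qa_stamp is None:
        problems.append("no third party has stamped the revision")
    if not rev.rollback_ok:
        problems.append("rollback dry run has not succeeded")
    if not rev.changes_ack:
        problems.append("change list has not been validated")
    return problems

rev = Revision(sha="abc123", ci_green=True)
blockers = deploy_gate(rev)
if blockers:
    print("deploy blocked:", *blockers, sep="\n  - ")
else:
    print("deploy allowed")
```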
The field seems to be moving very slightly forward, but it's not always easy to get people to relinquish useless control to automated systems. I'm currently trying to get an integration bot[0] set up at my company, and the pushback surprised me. But fundamentally you can get something similar for deploys: ask for a deploy and it kicks off the checklist automatically, mailing people and updating statuses on dashboards as needed.
And then you can even start collecting stats and charting time to deploy or amounts of rollbacks.
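Once every deploy flows through the bot, the stats come almost for free. A sketch of what that might look like, with made-up record fields:

```python
# Hypothetical deploy metrics: time to deploy and rollback rate.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class DeployRecord:
    requested_at: datetime   # when the deploy was asked for
    finished_at: datetime    # when the checklist completed and it shipped
    rolled_back: bool        # did this deploy end up rolled back?

def summarize(records: list) -> dict:
    durations = [(r.finished_at - r.requested_at).total_seconds() for r in records]
    return {
        "deploys": len(records),
        "mean_time_to_deploy_s": mean(durations) if durations else 0.0,
        "rollback_rate": (sum(r.rolled_back for r in records) / len(records))
                         if records else 0.0,
    }

now = datetime.now()
history = [
    DeployRecord(now - timedelta(minutes=42), now - timedelta(minutes=30), False),
    DeployRecord(now - timedelta(minutes=20), now - timedelta(minutes=5), True),
]
print(summarize(history))
```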
[0] by which I mean something similar to the Rust project's "bors": when changes are proposed, the bot requires validation by a "core developer" (who may be able to self-validate; it's only software, so the validation process for a change is flexible), runs the linter, runs the tests, sends the changes to functional testing for validation[1], and merges the changes to "mainline". Humans get involved only where they make sense and cannot be automated away: proposing and reviewing (technically and functionally) the changes. A rough sketch of the gating loop follows the footnotes.
[1] optional, and requires a significant functional validation/testing team, but definitely possible
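Here is that rough sketch. The step names (run_linter, run_tests, merge_to_mainline, and so on) are placeholders standing in for real tooling, not bors's actual internals:

```python
# Sketch of a bors-style gating loop: humans approve, the bot merges.
from dataclasses import dataclass

@dataclass
class Change:
    id: int
    approved_by_core_dev: bool = False   # human review happened first

# Placeholder steps; a real bot would shell out to the linter, the test
# suite, and the functional-testing queue here.
def run_linter(change: Change) -> bool: return True
def run_tests(change: Change) -> bool: return True
def functional_validation(change: Change) -> bool: return True  # optional, see [1]
def merge_to_mainline(change: Change) -> None: print(f"merged #{change.id}")

def gate(change: Change) -> bool:
    if not change.approved_by_core_dev:
        return False                     # no review, no pipeline
    for step in (run_linter, run_tests, functional_validation):
        if not step(change):
            return False                 # any failure bounces the change
    merge_to_mainline(change)            # only the bot ever merges
    return True

gate(Change(id=1234, approved_by_core_dev=True))
```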