Do you roll forward? Flip DNS back to the old deployment? Click the button in heroku that takes you back to the previous version?
For the database, during a migration we didn't synchronize code with one version of the db. Database structure was modified to add new fields or tables, and the data was migrated, all while the site was online. The code expected to find either version of the db, usually signaled by a flag for the shard. If the changes were too difficult to do online, relatively small shards of the db were put on maintenance. Errors were normally caught after a migration of one shard, so switching it back wasn't too painful. Database operations for the migrations were a mix of automatic updates to the list of table fields, and data-moving queries written by hand in the migration scripts—the db structure didn't really look like your regular ORM anyway.
This approach served us quite well, for about five years that I was there—with a lot of visitors and data, dozens of servers and multiple deployments a day.
This is the way to go. Have your root web directory be a symlink, e.g. /var/www/app -> /code_[git_hash]/. You can whip through a thousand VMs in less than a second with this method: connect, change the symlink. The other options (pushing out a new code branch, reverting with git, launching new VMs with reverted images, rsync'ing with overwriting) are slower and more dangerous with prod.
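A minimal sketch of the symlink switch described above, assuming a POSIX filesystem. The function name and the directory names are made up for illustration. It builds a temporary link and renames it over the live one, so readers never see a missing link.

```python
import os
import tempfile


def switch_release(link_path: str, release_dir: str) -> None:
    """Repoint the symlink at link_path to release_dir in one atomic step."""
    tmp_link = link_path + ".tmp"
    # Remove a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # On POSIX this is rename(2), which replaces the old link atomically.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    # Hypothetical layout: two release directories and one live link.
    root = tempfile.mkdtemp()
    old = os.path.join(root, "code_abc123")
    new = os.path.join(root, "code_def456")
    os.mkdir(old)
    os.mkdir(new)
    live = os.path.join(root, "app")
    switch_release(live, new)   # deploy
    switch_release(live, old)   # roll back
    print(os.readlink(live))
```

Rolling back is the same call with the previous release directory, which is why it is as fast as deploying.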
"For the database, during a migration"
There is no such thing as a database migration on prod. There is just adding columns. Code should work with new columns added at any point. Altering a column or dropping a column is extremely dangerous.
I forgot to mention a rather useful quality of this scheme when there are a whole lot of visitors: you can upload the code, switch the link on just a portion of the servers, gawk at the error logs and put the link back if you don't like what you see.
One downside - depending on your setup - is you may not have an easy way to hit the hosts directly/deterministically via any UIs in case you wanted to do any manual verification/debugging yourself.
Strange terms. Isn't this just a canary?
For me, currently, "canary" means a set of basic automated integration tests that are continually running in production with alarms that feed into a master aggregate "switch". Whether the dedicated canary accounts end up hitting a one-box prod host or a real prod host in the end isn't a factor.
The important thing is we incrementally expose our latest code commit to prod hosts via one-boxing to reduce the customer exposure if an acute problem somehow gets past the previous code deploy stages/tests.
this sounds roughly like synthetic monitoring:
synthetic in the sense of synthetic traffic, since it isn't traffic from genuine users.
Can you go into more detail about what is meant by this:
> alarms that feed into a master aggregate "switch"
what is the master aggregate "switch" ? what does it do?
Yup - I think that lines up.
> what is the master aggregate "switch" ? what does it do?
We have a hierarchy of aggregate monitors (or "switches") that watch any number of either specific metrics or other sub-aggregate monitors.
In the case of production deployments, we watch a specific rollback aggregate monitor for either a fixed amount of time or a fixed amount of customer traffic; it will auto-trigger a rollback if it goes into alarm (aka switches on).
We also have a master aggregate monitor that will switch on if any sub-monitors get switched on for any reason. We typically watch this master aggregate alarm to auto-disable any promotions in our code pipeline.
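To make the hierarchy concrete, here is a rough sketch of how such aggregate monitors could be modelled. All class, function, and metric names are hypothetical, not our actual tooling. An aggregate is in alarm when any of its children is.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Metric:
    """A leaf monitor; `alarmed` stands in for a real threshold check."""
    name: str
    alarmed: bool = False

    def in_alarm(self) -> bool:
        return self.alarmed


@dataclass
class AggregateMonitor:
    """Switches on if any child metric or sub-aggregate is in alarm."""
    name: str
    children: List[Union["Metric", "AggregateMonitor"]] = field(default_factory=list)

    def in_alarm(self) -> bool:
        return any(child.in_alarm() for child in self.children)


def pipeline_actions(rollback: AggregateMonitor, master: AggregateMonitor) -> List[str]:
    """Decide what the deploy pipeline should do for the current alarm state."""
    actions = []
    if rollback.in_alarm():
        actions.append("rollback")
    if master.in_alarm():
        actions.append("disable_promotions")
    return actions


if __name__ == "__main__":
    errors = Metric("error_rate")
    disk = Metric("disk_usage", alarmed=True)
    rollback = AggregateMonitor("rollback", [errors])
    master = AggregateMonitor("master", [rollback, disk])
    print(pipeline_actions(rollback, master))
```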
I've never heard of this practice described as a canary before (shrug)
If you can't alter a column, how do you prevent your database slowly rotting in terms of its design integrity?
We have a DB schema that we not so affectionately refer to as the Standard Oil Octopus because of this methodology applied over ~20yr.
I agree with you in the general case but eventually hard cuts have to be made or you will perpetuate the existence of all sorts of legacy spaghetti (not necessarily in the DB, but in all the other things that use the DB). Like everything else there's a balance to be struck.
Or could probably just use overlayfs or something like that.
Note that, while containers can give you COW, they shouldn't really be necessary for the files of the app. And with php-fpm, changing the app dir is faster than restarting containers: it's actually done via changing the Nginx config.
We also used to do a symlink method, but a proper kubernetes ci/cd setup is so much better.
Not sure what you mean by ‘on the same machine,’ though. It's not like we flip server roles between releases. Multiple backend servers all had those several versions of the backend code, and the machines serving the frontend stuff had corresponding versions of that.
Generally, going back to a known clean state should be easier, safer and relatively quick (DNS flip is fast, redeploy of old code is fast if your automation works well).
In some cases changes to your data may make rolling back cause even more problems. I've seen that happen and we were stuck doing a rapid hot fix for a bug, which was ugly. We did a lot more review to ensure we avoided breaking roll back. So I'd advise code review and developer education of that risk.
Assuming you've got a CI in place, making the migrations a separate, testable commit will let you do this easily. We did this at my last company with a small GitHub bot and a CODEOWNERS file.
This process lets you validate and rollback/recover if needed.
In general what I have found is that database changes are unique to your situation. When an app is small it's fine for them to automatically run with the code commits. The system I deal with now has some very large tables, and running database migrations often requires planning. A column addition might be added weeks before the code is written to use the new column. It's just the nature of dealing with large tables.
Working solo typically also means smaller, so there is a lot more leeway. I would do whatever works for you, and realize it's a good thing if you ever get large enough to need to address other problems.
My deployments have an optional target on the migration for rollbacks and I have never had any problems.
Depends on the tech you use ofc. But for .NET, Entity Framework makes it easy.
Your comment is only used as an excuse for imperfect deployments and creates too many additional problems.
You should really do it like this, so the code decides how the connected database should look.
Otherwise the human overhead of executing the correct scripts will cause a problem sooner or later (e.g. if a separate team/person handles deployments).
It's also hard to know whether someone else has configured everything correctly (at my current job, they put all the scripts in a separate branch. One application connects to another DB and ofc the DB deployments were not configured yet).
That's just another way of saying "you need to make sure you can roll forwards and backwards".
The question was how do you do it. The answer should include the phrase "we test it".
If it does, we take the app down first (with a maintenance message) so that errors won't bubble up to the users and APIs.
If we need to roll back a deploy that has migrations, we just roll back the migration. Every migration requires a "down" step that reverses itself. Our Go apps use Goose, but there are many other solutions.
For database migrations, we (1) design them so they can be applied without breaking the existing app and (2) make a "pre-launch" commit that adds those migrations to the codebase but doesn't have any code that uses them yet.
To deploy, we merge the "pre-launch" into "prod" and deploy, and since the app doesn't use the new db changes yet, it will happily continue working fine. Then, at our leisure we can manually run the migration (either through the framework built-in migration tools or manually through the db shell). Then, we can merge the full "launch" branch into "prod" and deploy again, which will push the code that starts using the db changes.
To roll back, we move "prod" back to "pre-launch" and deploy, which moves the app back to the state where the code isn't using the db changes, but the changes are still expected to be in place. Then, we manually roll back the migrations using the reverse of whatever migration method you used originally, which is fine since nothing in the codebase in the "pre-launch" commit is using the db changes. Then, we move "prod" back to whatever commit we need to roll back to and deploy again.
It takes a bit of planning and forethought, but it means no downtime and you have all the time you need to manually apply and roll-back db changes that can take a while (adding indexes to huge tables, etc.).
In the first two cases, we revert to the commit where untested or failed code was introduced to the master. This basically never happens. In the third case, you need to do some debugging and try to figure out why it's broken and either fix it or revert it. Basically just look at the git history. If the code has dependencies then you might need to do code triage and produce a hotfix. If it's a relatively isolated piece, then just revert and fix it at a better time.
We use semver and gitlab's tags so we know just by the versioning if the code that is broken is important or not and if we can roll back.
math.pow(2, 4) vs math.pow(4, 2)
Properly input-space-partitioned tests have something like 90 to 95% accuracy if properly written.
The issues always come from people not wanting to add sufficient tests.
Continuous Integration is mostly about regression, so the new tests have less value than running old tests.
We tend to fix issues in production (bugs) by mandating a unit test with the failing issue to be written as part of the fix.
Properly, if your code has an ‘if’, you need two tests, for the two branches. Same with every place the outcome may diverge. With this approach, it's basically impossible to botch the code unless something slips your mind while writing both the code and the tests. Otherwise, it's pretty much ‘deploy and go home.’
if (x % 3 == 0)
if (x % 5 == 0)
Seriously though, the criterion you mentioned is only one of many of increasing strictness (see e.g. https://en.wikipedia.org/wiki/Code_coverage#Basic_coverage_c...), namely branch coverage. Having branch coverage still says very little - the interaction between different branches can be trivially wrong. And desiring full path coverage immediately leads to combinatorial explosion (and the halting problem, once loops are involved).
> unless something slips your mind while writing both the code and the tests.
That is true for any choice of coverage metric and target.
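A small illustration of the branch-versus-path point, built on the two `if` conditions quoted above. The function and its bug are invented for the example. Two tests are enough to hit both outcomes of both branches, yet the path where both conditions hold is never exercised.

```python
def label(x: int) -> str:
    """Intended: 'Fizz' for multiples of 3, 'Buzz' for 5, 'FizzBuzz' for both."""
    out = ""
    if x % 3 == 0:
        out = "Fizz"
    if x % 5 == 0:
        out = "Buzz"  # bug: overwrites instead of appending
    return out or str(x)


# These two cases give full branch coverage:
#   x=3 -> first `if` taken, second not taken
#   x=5 -> first `if` not taken, second taken
assert label(3) == "Fizz"
assert label(5) == "Buzz"

# The untested path (both taken) is where the bug lives.
print(label(15))
```

Full path coverage would need four cases here, and the count doubles with every additional independent branch.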
Well, if you don't see other points where the input→output mapping can diverge (and rely on the client's requirements for that?), then no, you didn't nail it.
You normally cannot prove general correctness with unit tests. You can try to probe interesting points in the input space, and different coverage metrics encourage different levels of rigour and effort with this, leading to different fractions of bugs caught, but you'll never have a guarantee to catch all bugs (short of formal proof or exhaustive input coverage).
And if the same unit has multiple branches, you need double the tests to cover all paths.
And that doesn't guarantee correctness, and it's only unit tests.
But the 50% was a number I vaguely recall from Code Complete.
Yeah, I should've probably clarified this, seeing as we're on the topic of correctness and pedantry.
Personally I'm doing this in TDD style, before the code is written. And then while it's written, too. And for code that didn't have tests—the approach really helps catch bugs previously unknown.
I don't think a unit test will be able to distinguish 16 from 16.
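To spell out the 16-versus-16 point: a sketch with a hypothetical swapped-argument bug, where the test input happens to be symmetric and so cannot tell the two versions apart.

```python
import math


def power(base: float, exp: float) -> float:
    return math.pow(base, exp)


def power_buggy(base: float, exp: float) -> float:
    # Arguments swapped by mistake.
    return math.pow(exp, base)


# Both versions pass this test, because 2**4 == 4**2.
assert power(2, 4) == 16.0
assert power_buggy(2, 4) == 16.0

# A less symmetric input exposes the bug: 2**3 is 8, 3**2 is 9.
assert power(2, 3) == 8.0
assert power_buggy(2, 3) == 9.0
```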
if build #20 is bad, build #19 deployed with hash XYZ was last known good, click 'rebuild' or 'replay' and it'll deploy #19 again.
Nowadays - flip a toggle in the admin. Deployments and releases are separated.
Made a major blunder? In kubernetes world we do "helm rollback". Takes seconds. This allows for a super fast pipeline and a team of 6 devs pushes out like 50 deployments a day.
Pre-kubernetes it would be AWS pipeline that would startup servers with old commits. We'd catch most of the stuff in blue/green phase, though. Same team, maybe 10 deployments a day but I think this was still pretty good for a monolith.
Pre-aws we used deployment tools like capistrano. Most of the tools in this category keep multiple releases on the servers and a symlink to the live one. If you make a mistake, run a command to delete the symlink, ln -s the old release, restart the web server. Even though this is the fastest rollback of the bunch, the ecosystem was still young and we'd do 0-2 releases a day.
Ps. Releases are about building new versions of code packages. Deployments about pushing them out to environments.
Would you mind explaining this a little further? How does the separation allow you to flip a switch in the admin?
Uploading new code to a server where the new-code is behind a disabled feature flag means the program hasn't changed. The feature flag can be enabled when it's suitable. (This could even be 'rolled out' to only a subset of users; e.g. testers, internal, 10% of users, etc.)
deploy = put the code on the server, release = enable the feature flags
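A rough sketch of what such a flag check might look like, including the "only a subset of users" case. The flag names, config shape, and helper names are all assumptions for illustration, not any particular flag service.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flags: dict, flag: str, user_id: str) -> bool:
    """Deployed code asks this at runtime; releasing means editing `flags`."""
    cfg = flags.get(flag)
    if cfg is None or not cfg.get("enabled", False):
        return False  # code is deployed, feature is not released
    if user_id in cfg.get("allow_users", ()):
        return True   # e.g. testers or internal users
    return bucket(user_id, flag) < cfg.get("percent", 0)


if __name__ == "__main__":
    # Hypothetical admin-controlled config: released to testers plus 10% of users.
    flags = {"new_checkout": {"enabled": True, "allow_users": ["tester"], "percent": 10}}
    print(is_enabled(flags, "new_checkout", "tester"))
```

Rolling back a release is then a config change (set `enabled` to false) rather than a new deployment.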
My last job had an extra layer of security. As a .net house all new deployments were sent to Azure in a zip file. We backed those up and maintained FTP access to the Azure app service. If a deployment went really wrong and we couldn't wait the 10-20 mins for the CI pipeline to process a revert, we'd just switch off the CI process and FTP upload the contents of the last good version.
Of course, if there were database migrations to deal with then all hell could break loose. Reverting a DB migration in production is easier said than done especially if a new table or column has already started being filled with live data.
To be fair though, most of the problems I encountered were usually as the result of penny pinching by management who didn't want to invest in proper deployment infrastructure.
It's more often been the case for us that issues are caused by mistaken configuration / infrastructure updates. We do a lot of IAC (Chef, Cloudformation), so with those, it's usually a straight git revert and then a normal release.
"Production"? Does that mean something that goes to the customers? Very few of our customers keep up with releases so it's generally not a big deal. We can have a release version sitting around for weeks before any customer actually installs it; some customers are happy with a five year old version and occasional custom patches.
I bet it's a bigger problem for those for whom the product is effectively a running website, but those of us operating a different software deployment model have a different set of problems.
We use Cloud 66 Skycap for deployment which gives us a version controlled repository for our Kubernetes configuration files as well as takes care of image tags for each release.
Which.. isn't as bad as I thought it would be (so far).
If it’s not urgent we’d just revert with a PR though and let the regular deploy process handle it.
The frontend we deploy with Heroku, so we roll back with the rollback button or the Heroku CLI. Unfortunately we don’t have something set up where the frontend checks if it’s on the correct version or not, so people will get the bad code until they refresh.
Same for us, but we use nixpkgs directly over CentOS. Nix is perfect for rollback. It can be done on an entire cluster in seconds.
For the DB, We use schemaless DBs with Devs that care about forward and backward compatibility.
Code rollbacks are simple as heck - I just keep the previous Docker container(s) up for a potential rollback target, and/or have a symlink cutover strategy for the webservers. I use GitLab CI/CD for the majority of what I do so the SCM is not on the server, it's deployed as artifacts (either a clean tested container and/or .tar.gz). If I need to rollback it's a manual operation for the code but I want to keep it that way because I am a strong believer in not automating edge-cases which is what running rollbacks through your CI/CD pipeline is.
Also for code I've been known to even cut a hot image of the running server just in case something goes _really_ sideways. Never had to use it though, and I will only go this far if I'm making actual changes to the CI/CD pipeline (usually).
The biggest concern for me is database changes. You may think I'm nuts but I have been burnt _sooooo_ bad on this (we were all young and dumb at one time right?)... I have multiple points of "oh %$&%" solutions. The first is good migrations - yeah yeah yell at me if you wish... I run things like Laravel for my API's and their migration rollbacks can take care of simple things. TEST YOUR ROLLBACK MIGRATIONS! The second solution is that I cut an actual readslave for each and every update of the application and then segregate it so that I have a "snapshot" that is at-most 1-2 hours out of date.
Having redundancy for your redundancy is my motto... and although my deployments take 1-3 hours for big changes (cutting hot images of a running server, building/isolating an independent DB slave, shuffling containers, etc.) I've never had a major "lights out" issue that's lasted more than 1hr.
I imagine that much larger operations likely do feature flags or a rolling release so that problems can be isolated to a small subset of production before going wide. But still the same principle, redeploy with different code.
but smaller, routine deploys with unexpected failures could be just as dangerous.
Setup environment with previous version of production code (which does not have issue) and then using load balancer switch the traffic to this new environment
The App is docker, so we have a tag called app-production, and app-production-1(up to 5) which are all the previous production versions. If anything goes wrong, we can flip over to the last known good version.
We are multi-region, so we don't update all at once.
The dataset is a bit harder. Because it's > 100 gigs, and for speed purposes, it lives on EFS (it's lots of 4 meg files, and we might need to pull in 60 or so files at once; access time is rubbish using S3). Manually syncing it takes a couple of hours.
To get round this, we have a copy on write system, with "dataset-prod" and "dataset-prod-1" up to 6. Changing the symlink of the top level directory is minimal.
We were able to do a manual rollback for each deployment from the GitLab UI.
Disclaimer: I work at GitLab now, but my old agency was also using GitLab and their CI/CD offering for client projects for a couple years while I was there.
At that agency they have even open sourced their GitLab CI configs :)
I can't say enough great things about it - solutions like Jenkins and Travis CI just feel antiquated and clunky anymore. I always thought it wouldn't really be worth it to run CI/CD on my personal projects due to the complexity inherent in setting these solutions up until I saw the light... I had a coherent "one-click" deploy setup from scratch within an hour with GitLab.
What GitLab gets right is having TONS of enterprise-quality solutions available to you in _one place_... for 100% free as their community offering is AMAZING. That's insanely valuable to me as a startup engineer who doesn't have the time to run 4-5 disparate solutions that are difficult to integrate in a secure/simple way.
Having one solution for the above list has been a "game changer" to me because I've got one monolith piece of software to keep updated/manage vs. stringing a whole bunch of solutions together - and I say "monolith" in a 100% positive context =)
Then there's just the speed issue... I did DevOps at a huge mega-corp not long ago and the expectation of "major things to get done" was 3-4 things a week. Now that I'm doing my own startup my expectation on my self is 3-4 major things _in a day_. GitLab is the only tooling that I can imagine keeping up with me with near-zero BS, and because of that I'm a _huge_ brand advocate for them! (Not directly affiliated, just a passionate user!)
All-in-all I understand people have different tools and that's totally cool, but I did a lot to test out different CI/CD tooling and GitLab was amazingly simple, secure, and quick to setup.
Check out for a solid feature comparison: https://about.gitlab.com/devops-tools/travis-ci-vs-gitlab.ht...
It's just doing another deployment. It doesn't matter what version you are deploying.
That's the whole point.
My teams go into their CI/CD platform and just cherry pick which build they want to release.
I work for an enterprise. We automated the change control process and integrated it with our CD dashboard. So a quick peek at the dashboard will tell us.
Though depending on the criticality of the app, we may only retain a previous release for a specific time period.
There are two identical prod servers/cloud configurations/datacenters: blue and green. Each new version is deployed alternately to the blue and green areas: if version N is on blue, version N-1 is on green, and vice versa. If some critical issue happens, rolling back is just switching the front router/balancer to the other area, which can be done instantly.
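A toy model of that alternation, assuming the router only needs to know which area is live. The class and method names are made up for the sketch, and a real setup would be repointing a load balancer rather than flipping a field.

```python
class BlueGreenRouter:
    """Tracks which area is live and which version each area holds."""

    def __init__(self) -> None:
        self.versions = {"blue": None, "green": None}
        self.live = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> str:
        """Install a new version on the idle area; live traffic is untouched."""
        target = self.idle
        self.versions[target] = version
        return target

    def switch(self) -> str:
        """Flip traffic to the other area. Used for both release and rollback."""
        self.live = self.idle
        return self.live

    def live_version(self):
        return self.versions[self.live]


if __name__ == "__main__":
    router = BlueGreenRouter()
    router.deploy("v1")
    router.switch()
    router.deploy("v2")
    router.switch()             # v2 is live, v1 still sits on the other area
    router.switch()             # rollback: v1 is live again
    print(router.live_version())
```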
Any mechanism for rollbacks that isn't tested continuously is likely to fail during incident response. It's a huge anti-pattern to have 'dark' processes only used during incident response -- same thinking behind why you should also be continually testing your backups, continuously killing servers to verify recovery, etc.
It takes Pulumi about 15 minutes to create our kubernetes cluster with all pods and monitoring in place.
After a successful rollout we can pulumi down one of the clusters and reduce costs (we're on Azure).
whenever we need to roll back something we just use the corresponding Github feature to revert a merge, and that is automatically shoved into production using GH hooks and stuff.
Again, we have a rather easy and ancient deploy system, and it just works.
We do several updates a week if needed. We try to avoid late Friday afternoon merges, but with a couple alerts here and there (Mostly, New Relic) we have a good coverage to find out about problems.
For other issues we press the rollback button in the Heroku dashboard.
Heroku has its problems: buildpacks, reliability, cost, etc, but the dashboard deploy setup is pretty nice.
Since the question of database migrations came up: We take care to break up backwards incompatible changes into multiple smaller ones.
For example, instead of introducing a new NOT NULL column, we first introduce it as NULLable, wait until we are confident that we don't want to roll back to a software version that leaves the column empty, and only then change it to NOT NULL.
It requires more manual tracking than I would like, but so far, it seems to work quite well.
When compared to our Fastly deploys which are global in seconds, it leaves me wanting a faster solution.
* Deploy new code in new VMs.
* Route some prod traffic to the new nodes.
* Watch the nodes misbehave somehow.
* Route 100% of the prod traffic back to old nodes (which nobody tore down).
In the case of a normal deployment, 100% of prod traffic would eventually be directed to the new nodes. After a few hours of everything running smoothly, the old nodes would be spun down.
This does take more planning and gradual deployment, but saves the day when it matters.
1) Issues that cause a complete failure to start containers will fail healthchecks and are auto rolled back in our new CI/CD flow.
2) Issues that are more subtle are manually rolled back one hash at a time until the issue goes away (then we create a revert branch from the diff between HEAD and WORKING).
1. Developer creates a PR. To be mergeable, it must pass code review, be based on master, and be up-to-date with master (GitHub recently made this really easy by adding a one-click button to resync master into the PR).
2. Each commit runs a build system that installs dependencies, runs tests, and ZIPs the final code to an S3 bucket.
3. Once the developer is ready to deploy, and the PR passes the above checks, they type "/deploy" as a GitHub comment.
4. A Lambda function performs validation and then updates our dev Lambda functions with the ZIP file from S3. Once complete, it leaves a comment on the PR with a link to the dev site to review.
5. The developer can now comment "/approve" or "/reject". Reject reverts the last Lambda deploy in dev. Approve moves the code to stage.
6. The above steps repeat for stage --> prod.
7. Once the code is in prod, the developer must approve or reject. If rejected, the Lambdas are reverted all the way back through dev. If approved, the PR is merged by the bot (we have some additional automation here, such as monitoring CloudWatch metrics for API stability, end-to-end tests, etc).
TL;DR - Don't merge PRs until the code is in production and reviewed. If a rollback is needed afterwards, create a rollback (roll-forward) PR and repeat.
I’m trying to find this feature but my google-fu is failing me. Can you link to an announcement or doc page for this?
I'm not sure if there are specific settings required for this to work. For example, we have the master branch protected and require the status checks to pass and the PR to be up to date before it can be merged.
we roll forward and thus far never ran into the situation that that wasn't possible in a reasonable amount of time.
nevertheless i've wondered more than once what would happen if we run into such a situation and there's a substantial database migration in the process (i.e. with table drops).
curious to learn what the different strategies are on that point: do you put your table contents in the down migration, do you revert to the last backup, etc.
If things look stable after whatever time you deem necessary, you can write a second migration to actually drop them.
If you run into issues, your down migration simply undoes the rename.
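A sketch of the rename-first approach using SQLite so it runs anywhere. The table names and the up/down function names are invented for illustration. The first migration only renames, so its down step can restore the table with its data intact.

```python
import sqlite3


def up(conn: sqlite3.Connection) -> None:
    """First migration: hide the table instead of dropping it."""
    conn.execute("ALTER TABLE legacy_orders RENAME TO legacy_orders_deprecated")


def down(conn: sqlite3.Connection) -> None:
    """Rollback: undo the rename; no data was lost."""
    conn.execute("ALTER TABLE legacy_orders_deprecated RENAME TO legacy_orders")


def drop_for_real(conn: sqlite3.Connection) -> None:
    """Second migration, run only once things look stable."""
    conn.execute("DROP TABLE legacy_orders_deprecated")


def table_names(conn: sqlite3.Connection) -> set:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE legacy_orders (id INTEGER PRIMARY KEY, note TEXT)")
    conn.execute("INSERT INTO legacy_orders (note) VALUES ('keep me')")
    up(conn)
    down(conn)
    print(conn.execute("SELECT note FROM legacy_orders").fetchall())
```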
or build a new VM and send it to them.
Not everything is web-based
* A version control system (i.e. git) that has a methodology for controlling what is tested and then released (i.e. feature releases). If you want the ability to revert a feature, you need to use your version control to group (i.e. squash) code into features so they can be easily reverted. Look up the GIT Branching Model. It's a good place to start when thinking about organizing your versioning to control releases.
* You should be able to deploy from any point in your version control. Make sure your deployment system is able to deploy from a hash, tag or branch. This gives you the option of "reverting" by deploying from a previously known good position. I would highly suggest automating deployment to generate timestamp tags into the repo for deployment so you can see the history of deployments.
* Try to make your deployments idempotent and/or separate your state changes so they can be independently controlled. If you have migrations, make sure they can withstand being deployed again, ie. "DROP TABLE IF EXISTS" then "CREATE TABLE", so redeploying doesn't blow up. If you need to roll back, you can rollback as much as you need to the point you want to deploy. A trait of a well designed system is it needs few state changes to add new features and/or those state changes can be easily controlled.
* Have a staging system(s). You should be able to deploy to a staging system to verify the behavior of a deployment. It should be able to replicate production in every way except data content. Ideally, you should also build this from scratch every time so that you can guarantee that if production dies a hard death you can completely reproduce it. A great system will also do this for production: bring it up for final testing, and then switch over to it once tested.
Notice the trend here is to break up the dependencies between how, what, and where code is deployed so that you have many ways to respond to issues. Maybe the issue is small enough to just make a fix in the future. Maybe it is creating an emergency patch, testing it on a new production deployment and then switching over. Maybe it is so bad you want to immediately deploy a previous version and get things running again. All of these abilities depend on building your system such that you have these choices.
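For the idempotent-migration bullet, here is one possible shape, sketched with SQLite. Note this is a non-destructive variant (create-if-missing plus a guarded column add) rather than the DROP-then-CREATE pair mentioned above, and the table and column names are invented.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Safe to run any number of times; each step checks before acting."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)"
    )
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    if "display_name" not in columns:
        conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    migrate(conn)
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    migrate(conn)  # redeploying does not blow up or lose the row
    print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Unlike dropping and recreating, this keeps existing rows across redeploys, which matters once the table holds production data.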
kubectl rollout undo deployment/$DEPLOYMENT
If you can pinpoint a specific commit that is causing the issue. Revert that commit and go through your standard release process.
Similar to "clicking the button in heroku"
Now we should only march forward with small, on demand releases, this way we will know exactly where the issue is and will be able to fix it forward quickly.
Rollbacks were a strategy with monthly (or even quarterly [insane huh?]), giant, stinky release dumps, knowing there is no way we could quickly identify and deploy the fix. aka let's throw production 3 months back and take another 2 months to figure out where the issue that happened during the last release is.
And to finally answer your question: we never roll back. We always march forward.
There are always going to be failure modes that require extensive time to diagnose and debug, even with small changes being made. Additionally, you want that diagnostic phase to happen without time pressure. If you do not have a sane rollback mechanism to use in those scenarios, you are doing a disservice to your users and your team.
Your users suffer, because the outage or breakage will last as long as it takes for you to address the underlying issue directly, instead of just rolling back to restore service. They will be forced to hear frustrating things like "we're working on it", since you don't know what's wrong yet, when instead you could have just rolled back before most users even noticed there was a problem.
And, more importantly, your team will suffer greatly, because they will be forced to work under pressure when an incident like this arises. And, worse, they will also 'learn' that accidentally pushing breaking changes to production results in an extremely unpleasant and toxic situation for everyone, leading to systemic fear-of-deploys and undermining a blameless culture.
So you should have a rollback mechanism that is solid, tested, and easy to use for scenarios where a non-trivial regression or outage arises in production, even if you are doing continuous delivery of small patches.
But again, I also saw the other comment about how "Dogmatic" my approach is. I wouldn't say it's dogmatic; idealistic, yes. But not dogmatic. There is a place and time for anything and rollback can STILL be useful when you don't trust the system nor the code base (as I pointed out in my other comment, rollbacks are useful with legacy systems and systems that you have to maintain that were built by outsourced teams).
Well, rollbacks are also the first thing you think of when you join a large company as a director of engineering to support systems you never touched before.
In my experience, a healthy incident response process has a fork in the decision tree at the very top: do we roll back, or do we attempt to fix live? And in the latter case, we time box how long we're willing to spend, and defer to rolling back for all but the most trivial, obvious fixes. Even if you don't use rollback often, having that top level fork is a release valve for all of the toxic implications I mentioned in the scenario where you do actually need it.
Even if you have several dozen incidents where you didn't need it, a black swan event will eventually show up -- and that event will be the one that has the lasting impact on your company's public perception and the morale of your team.
If I need to do a roll back it means that I don't trust the system nor the code base. I will do the roll back but after that there will be a very productive retro about how we can do better to avoid rollbacks in the future (aka what did we learn).
But again, as I said, there is a place and time for everything! And there are many variables! Even how you structure your teams affects deployments, engineering culture, engineering team types (cross-functional, generalized; specialized etc), if the team that makes a decision about the roll back is not the team that introduced the bug.
My approach is not dogmatic (have your standardized roll backs if those work best for your company, release cycles, teams) it's idealistic (that's what I aim for, personally)
When we roll back on my team, it's uncommon but when it happens it's considered a success if it was made through a systematic decision-making process. Making a sane decision in the interest of our users to restore service quickly is always a win. I can assure you, it does not compromise your ability to do continuous delivery or small changes by having and occasionally using a rollback mechanism. If you are fearful of the idea that having such a mechanism and plan in place somehow will lead to people questioning your principles in a way you cannot defend, then that is a separate problem, since the two things you mention that are incompatible are in fact compatible and highly defensible.
It is not a legacy from "waterfall" or any of the other things you mention, because your claim can be refuted through a single counter example, and I've worked on 3 separate projects where such counter examples exist: we had a rollback method, it was used once in a while, and we shipped changes to production multiple times a day using continuous delivery. At no point on these projects did the ability or use of roll back lead to some kind of hard-to-explain loss in delivery velocity. On the contrary, I suspect if that mechanism did not exist, several failures that were easy to get back to green would have turned into a toxic hellhole, and my team mates would have been much more fearful around shipping, which is the high order bit when it comes to velocity and embracing continuous delivery of small changes.
I also suspect that we're going to agree to disagree here. There are so many nuances, it's impossible to properly communicate most of those without writing a chapter of a book.
Appreciate your points though. Great food for thought right there.
In a blue/green world, failing back prod to the cluster with the known-good code should be reflexive. After that maybe you can be lax about whether to roll back or fix forward on the unhealthy cluster (or maybe not, if you only have n=2 clusters).
For example, always marching forward means that any time an issue is coming up, certain resources must be allocated in tackling the issue. That can't always be the case.
Smaller and more frequent releases are preferable and most of the time a single line change will fix the issue, but other times rolling back may be the best option.
Also note, that roll back was a valid strategy back in the day and still can be useful tool in your garage of tools. It can be useful when, for example, dealing with complex legacy systems that were created decades ago, or complex systems developed by outsourced development teams. You'll know when roll back is useful when you see it.
If the business is losing $BIG_BUCKS per minute of downtime, it is most definitely a step forward.
* Find the code that is affected
* Write a test
* Have it go through ci/cd
* Deploy to prod.
Or is there a different way of deploying a big priority bugfix to production than normal deploys?
Also worth noting: a priority bug fix is not really about pipelines, it's more about the ability to dynamically reallocate resources. Depending on the complexity of the affected area, we should be able to allocate as many devs as is useful to fixing it (similar to "Fast Lane" in Kanban).