
Ask HN: How do you roll back production? - mooreds
Once you have pushed code to production and realized there is an issue, how do you roll back?

Do you roll forward? Flip DNS back to the old deployment? Click the button in Heroku that takes you back to the previous version?
======
aasasd
A place I worked at had a symlink pointing to the app directory, and a new
version went into a new dir. This allowed us to do atomic deployments: code
wasn't replaced while it was being run. A rollback, consequently, meant pointing
that symlink back at the older version.
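
Roughly, the swap looks like this (a minimal sketch; paths are invented, and the
app server just serves whatever /srv/app/current points at):

    # deploy: unpack the new release next to the old ones, then swap the symlink
    ln -s /srv/app/releases/2019-08-01 /srv/app/current.new
    mv -T /srv/app/current.new /srv/app/current   # rename is atomic, so running code is never half-replaced

    # rollback: the same swap, pointed at the previous release directory
    ln -s /srv/app/releases/2019-07-28 /srv/app/current.new
    mv -T /srv/app/current.new /srv/app/current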

For the database, we didn't tie the code to a single version of the schema
during a migration. The database structure was modified to add new fields or
tables, and the data was migrated, all while the site was online. The code
expected to find either version of the db, usually signaled by a flag for the
shard. If the changes were too difficult to do online, relatively small shards
of the db were put into maintenance. Errors were normally caught after
migrating one shard, so switching it back wasn't too painful. Database
operations for the migrations were a mix of automatic updates to the list of
table fields and data-moving queries written by hand in the migration
scripts; the db structure didn't really look like your regular ORM anyway.

This approach served us quite well for the roughly five years I was there, with
a lot of visitors and data, dozens of servers, and multiple deployments a day.

~~~
ransom1538
"A place I worked at had a symlink pointing to the app directory"

This is the way to go. Have your root web directory be a symlink, e.g.
/var/www/app -> /code_[git_hash]/. You can whip through a thousand VMs in less
than a second with this method: connect, change the symlink. Other options,
like pushing out a new code branch, reverting with git, launching new VMs from
reverted images, or rsyncing with overwriting, are all slower and more
dangerous in prod.
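
A rough sketch of that fan-out (the host list and GIT_HASH are placeholders):

    # flip every web host to the already-staged release directory, in parallel
    for host in $(cat web-hosts.txt); do
        ssh "$host" "ln -sfn /code_${GIT_HASH} /var/www/app" &
    done
    wait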

"For the database, during a migration"

There is no such thing as a database migration on prod. There is just adding
columns. Code should work with new columns added at any point. Altering a
column or dropping a column is extremely dangerous.
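
In practice that means the only schema change shipped with a release is
something purely additive (table and column names invented):

    mysql mydb -e "ALTER TABLE orders ADD COLUMN shipped_at DATETIME NULL"

Old code never reads the new column, new code starts using it, and rolling the
code back just leaves a harmless unused column behind.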

~~~
dsfyu404ed
>There is no such thing as a database migration on prod. There is just adding
columns.

We have a DB schema that we not-so-affectionately refer to as the Standard Oil
Octopus because of this methodology applied over ~20 years.

I agree with you in the general case but eventually hard cuts have to be made
or you will perpetuate the existence of all sorts of legacy spaghetti (not
necessarily in the DB, but in all the other things that use the DB). Like
everything else there's a balance to be struck.

~~~
ransom1538
Yeah! I wasn't trying to suggest keeping columns that go unused for years. I
was trying to suggest that during a release cycle you don't alter or drop
columns.

------
robbya
Specifically, I tell Jenkins to deploy the commit hash that was last known
good. Jenkins just deploys, and doesn't really know that it's a "roll back."

Generally, going back to a known clean state should be easier, safer and
relatively quick (DNS flip is fast, redeploy of old code is fast if your
automation works well).

In some cases changes to your data may mean rolling back causes even more
problems. I've seen that happen, and we were stuck doing a rapid hot fix for a
bug, which was ugly. Afterwards we did a lot more review to ensure we avoided
breaking rollback. So I'd advise code review and developer education around
that risk.

~~~
mooreds
How do you find the commit hash that is the last known good? Looking through
Jenkins release logs, asking someone, something else?

~~~
httpsterio
In our case it's quite simple. All commits should be tested, so if we need to
roll back it means either that tests failed and the code was pushed
nonetheless, tests were missing, or tests didn't catch the issue.

In the first two cases, we revert to the commit where the untested or failing
code was introduced to master. This basically never happens. In the third case,
you need to do some debugging and try to figure out why it's broken and either
fix it or revert it. Basically just look at the git history. If the code has
dependencies then you might need to do code triage and produce a hotfix. If
it's a relatively isolated piece, then just revert and fix it at a better
time.

We use semver and GitLab's tags, so we know just from the versioning whether
the broken code is important and whether we can roll back.

~~~
StreamBright
What about logical errors?

math.pow(2, 3) vs math.pow(3, 2)

~~~
ISL
A test can test for those: "does the code give mathematically correct
results?"

~~~
Scarblac
Even the best tests only catch like 50% of the bugs though.

~~~
aasasd
Bahahaha, your tests really aren't ‘best’.

Properly, if your code has an ‘if’, you need two tests, for the two branches.
Same with every place the outcome may diverge. With this approach, it's
basically impossible to botch the code unless something slips your mind while
writing both the code and the tests. Otherwise, it's pretty much ‘deploy and
go home.’

~~~
MauranKilom
> Properly, if your code has an ‘if’, you need two tests, for the two
> branches. Same with every place the outcome may diverge. With this approach,
> it's basically impossible to botch the code
    
    
        if (x % 3 == 0)
          println "Fizz"
        if (x % 5 == 0)
          println "Buzz"
    

There, solved it! And it works fine in all cases you told me to test (e.g. 3,
4, and 5), which means it's impossible I botched anything! Surely I nailed
this interview?

Seriously though, the criterion you mentioned is only one of many of
increasing strictness (see e.g.
[https://en.wikipedia.org/wiki/Code_coverage#Basic_coverage_c...](https://en.wikipedia.org/wiki/Code_coverage#Basic_coverage_criteria)),
namely branch coverage. Having branch coverage still says very little - the
interaction between different branches can be trivially wrong. And desiring
full path coverage immediately leads to combinatorial explosion (and the
halting problem, once loops are involved).

> unless something slips your mind while writing both the code and the tests.

That is true for any choice of coverage metric and target.

~~~
aasasd
> _all cases you told me to test (e.g. 3, 4, and 5)! Surely I nailed this
> interview_

Well, if you don't see other points where the input→output mapping can diverge
(and rely on the client's requirements for that?), then no, you didn't nail
it.

~~~
MauranKilom
The point is that you claim to have "the" (best?) way to do proper unit tests,
but it still fails in basic cases like some combination of paths through if
statements being buggy (even if you properly covered each one in isolation).

You normally cannot prove general correctness with unit tests. You can try to
probe interesting points in the input space, and different coverage metrics
encourage different levels of rigour and effort with this, leading to
different fractions of bugs caught, but you'll never have a guarantee to catch
_all_ bugs (short of formal proof or exhaustive input coverage).

------
karka91
Assuming this is about web dev.

Nowadays - flip a toggle in the admin. Deployments and releases are separated.

Made a major blunder? In the Kubernetes world we do "helm rollback". It takes
seconds. This allows for a super fast pipeline, and a team of 6 devs pushes out
like 50 deployments a day.
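
For reference, the rollback itself is roughly (release name invented):

    helm history my-api       # list revisions, pick the last good one
    helm rollback my-api 41   # roll the release back to that revision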

Pre-Kubernetes it would be an AWS pipeline that would start up servers with old
commits. We'd catch most of the stuff in the blue/green phase, though. Same
team, maybe 10 deployments a day, but I think this was still pretty good for a
monolith.

Pre-AWS we used deployment tools like Capistrano. Most of the tools in this
category keep multiple releases on the servers and a symlink to the live one.
If you make a mistake: run a command to delete the symlink, ln -s the old
release, restart the web server. Even though this is the fastest rollback of
the bunch, the ecosystem was still young and we'd do 0-2 releases a day.

~~~
Shalle135
Why not canary releases? You can load balance, for example, 1% of the traffic
to the new deployment and see if you experience any issues. If you do, you just
change the load balancer to use the known-good pods.

~~~
pojzon
How do you take care of DB updates when using canary deployments? For example,
those which are not backwards compatible?

P.S. Releases are about building new versions of code packages; deployments are
about pushing them out to environments.

~~~
karka91
This depends a lot on the databases used and the flexibility of the code that's
accessing the data. One method is to deploy in multiple stages and to use
views. In the pre-deploy stage you create a view with the schema/data needed
for the release. In the deploy stage you roll out the canary code. In the
post-deploy stage you remove either the old data/schema on success or the new
data/schema on failure. It's quite an overhead to implement and maintain this
process.
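
A very rough sketch of those stages, with invented table and column names (the
real process obviously depends on the schema):

    # pre-deploy: add the new column, then expose the shape the canary code expects via a view
    psql mydb -c 'ALTER TABLE users ADD COLUMN display_name text'
    psql mydb -c 'CREATE VIEW users_v2 AS
                    SELECT id, email, COALESCE(display_name, username) AS display_name
                    FROM users'

    # deploy: roll out the canary pods that read users_v2
    # post-deploy: on success drop the old shape; on failure drop users_v2 and the new column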

------
DoubleGlazing
My almost universal experience has been to simply do a Git revert and let the
CI pipeline do its thing. Pros: it's simple. Cons: it's slow, especially in an
emergency.

My last job had an extra layer of security. As a .NET house, all new
deployments were sent to Azure in a zip file. We backed those up and
maintained FTP access to the Azure app service. If a deployment went really
wrong and we couldn't wait the 10-20 mins for the CI pipeline to process a
revert, we'd just switch off the CI process and FTP upload the contents of the
last known good version.

Of course, if there were database migrations to deal with then all hell could
break loose. Reverting a DB migration in production is easier said than done
especially if a new table or column has already started being filled with live
data.

To be fair though, most of the problems I encountered were the result of
penny-pinching by management who didn't want to invest in proper deployment
infrastructure.

------
drubenstein
Depends on the issue. If there's a code/logic bug that doesn't affect the data
schema, we use Elastic Beanstalk versions to go back to the previous version
(usually this is a rolling deploy backwards), and then clean up the data
manually if necessary. Otherwise, we roll forward (fix the bug, clean up the
data, etc.).

It's more often been the case for us that issues are caused by mistaken
configuration / infrastructure updates. We do a lot of IaC (Chef,
CloudFormation), so with those it's usually a straight git revert and then a
normal release.

------
EliRivers
Tell people we need to roll back, clone the repo to my hard drive, open up
git, undo the commit that merged the bad code in, push it. All done.

"Production"? Does that mean something that goes to the customers? Very few of
our customers keep up with releases so it's generally not a big deal. We can
have a release version sitting around for weeks before any customer actually
installs it; some customers are happy with a five year old version and
occasional custom patches.

I bet it's a bigger problem for those for whom the product is effectively a
running website, but those of us operating a different software deployment
model have a different set of problems.

~~~
sqldba
You’re being downvoted but it’s a totally reasonable reply for normal vendor
software.

~~~
EliRivers
Something I remind myself of regularly is that the entire software world I
have ever experienced is actually a tiny, tiny piece of the real software
world.

------
ksajadi
Since we run on Kubernetes, rolling back the code is a matter of redeploying
the older images. Rolling back database changes is more challenging, but
usually we have “down” scripts as well as up scripts for all DB changes,
allowing us to roll database changes back too.

We use Cloud 66 Skycap for deployment, which gives us a version-controlled
repository for our Kubernetes configuration files and takes care of image tags
for each release.

~~~
protonimitate
Similar setup here... except our CI doesn't handle "down" scripts, so bad db
migrations are roll-forward only.

Which... isn't as bad as I thought it would be (so far).

------
MaxGabriel
For our backend, we deploy it as a Nix package on NixOS, so we can atomically
roll back the deployed code, as well as any dependencies like system libraries.
Right now this requires SSHing into each of our two backend servers and
running a command.
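
A sketch of what that per-host command can look like on NixOS (host names are
placeholders; the actual setup may differ):

    # switch each host back to its previous NixOS system generation
    ssh backend-1 'sudo nixos-rebuild switch --rollback'
    ssh backend-2 'sudo nixos-rebuild switch --rollback'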

If it’s not urgent we’d just revert with a PR though and let the regular
deploy process handle it.

The frontend we deploy with Heroku, so we roll back with the rollback button or
the Heroku CLI. Unfortunately we don’t have something set up where the frontend
checks whether it’s on the correct version, so people will get the bad code
until they refresh.

~~~
adev_
> For our backend, we deploy it as a nix package on NixOS, so we can
> atomically rollback the deployed code, as well as any dependencies like
> system libraries

Same for us, but we use nixpkgs directly on top of CentOS. Nix is perfect for
rollback. It can be done on an entire cluster in seconds.

For the DB, we use schemaless DBs with devs that care about forward and
backward compatibility.

------
folkhack
I have multiple strategies because I've got one foot in Docker and one foot in
the "old-school" realm of simple web servers.

Code rollbacks are simple as heck: I just keep the previous Docker
container(s) up as a potential rollback target, and/or have a symlink cutover
strategy for the web servers. I use GitLab CI/CD for the majority of what I do,
so the SCM is not on the server; the code is deployed as artifacts (either a
clean tested container and/or a .tar.gz). If I need to roll back it's a manual
operation for the code, but I want to keep it that way because I am a strong
believer in not automating edge cases, which is what running rollbacks through
your CI/CD pipeline is.

Also for code I've been known to even cut a hot image of the running server
just in case something goes _really_ sideways. Never had to use it though, and
I will only go this far if I'm making actual changes to the CI/CD pipeline
(usually).

The biggest concern for me is database changes. You may think I'm nuts, but I
have been burnt _sooooo_ bad on this (we were all young and dumb at one time,
right?)... I have multiple levels of "oh %$&%" solutions. The first is good
migrations - yeah, yeah, yell at me if you wish... I run things like Laravel
for my APIs, and their migration rollbacks can take care of simple things. TEST
YOUR ROLLBACK MIGRATIONS! The second solution is that I cut an actual read
slave for each and every update of the application and then segregate it so
that I have a "snapshot" that is at most 1-2 hours out of date.
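
As a sketch, "test your rollback migrations" with Laravel mostly means
exercising both directions on a staging database before the release:

    php artisan migrate                      # apply the new migrations
    php artisan migrate:rollback --step=1    # make sure the down() path really works
    php artisan migrate                      # and that re-applying is clean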

Have redundancy for your redundancy is my motto... and although my deployments
take 1-3 hours for big changes (cutting hot images of a running server,
building/isolating an independent DB slave, shuffling containers, etc.), I've
never had a major "lights out" issue that's lasted more than an hour.

------
dxhdr
Push different code to production, either the last-known-good commit, or new
code with the issue fixed.

I imagine that much larger operations likely do feature flags or a rolling
release so that problems can be isolated to a small subset of production
before going wide. But still the same principle, redeploy with different code.

~~~
thih9
We devs are usually prepared for failures during larger operations;

but smaller, routine deploys with unexpected failures could be just as
dangerous.

------
technological
You can do a rolling deployment.

Set up an environment with the previous version of the production code (which
does not have the issue) and then use the load balancer to switch the traffic
to this new environment.

~~~
wwweston
Once automated, this feels like the simplest approach, and it lets you
temporarily suspend the continuous integration between the source repo and the
application if that turns out to be useful in managing the response to an
issue.

------
KaiserPro
We have two things that might need to be rolled back: the app/API and the
dataset.

The app is Docker, so we have a tag called app-production, and
app-production-1 (up to 5), which are all the previous production versions. If
anything goes wrong, we can flip over to the last known good version.

We are multi-region, so we don't update all at once.

The dataset is a bit harder. Because it's > 100 GB, and for speed purposes it
lives on EFS (it's lots of 4 MB files, and we might need to pull in 60 or so
files at once; access time is rubbish using S3), manually syncing it takes a
couple of hours.

To get round this, we have a copy-on-write system, with "dataset-prod" and
"dataset-prod-1" up to 6. Changing the symlink of the top-level directory is a
minimal operation.

------
jekrb
At the agency I used to work for, we used GitLab CI/CD.

We were able to do a manual rollback for each deployment from the GitLab UI.

[https://docs.gitlab.com/ee/ci/environments.html#retrying-and-rolling-back](https://docs.gitlab.com/ee/ci/environments.html#retrying-and-rolling-back)

Disclaimer: I work at GitLab now, but my old agency was also using GitLab and
their CI/CD offering for client projects for a couple years while I was there.

At that agency they have even open sourced their GitLab CI configs :)
[https://gitlab.com/digitalsurgeons/gitlab-ci-configs](https://gitlab.com/digitalsurgeons/gitlab-ci-configs)

~~~
folkhack
GitLab CI/CD is amazing. I run a private instance on a VPS with a single
worker (probably the simplest setup you can imagine) and it's astoundingly
powerful.

I can't say enough great things about it - solutions like Jenkins and Travis
CI just feel antiquated and clunky now. I always thought it wouldn't really be
worth it to run CI/CD on my personal projects, due to the complexity inherent
in setting these solutions up, until I saw the light... I had a coherent
"one-click" deploy setup from scratch within an hour with GitLab.

~~~
gingerlime
I'm curious about the difference. Jenkins, OK, sure. But in what way are
Travis/CircleCI/Semaphore vastly different from GitLab CI/CD? Honest question.

~~~
folkhack
Right on. GitLab has everything right there for me, from SCM, to its own
Docker container registry, to static site hosting capabilities (think
generated documentation), issues management, etc. I've tested the setup of
many of them and GitLab was the quickest/easiest/most configurable.

What GitLab gets right is having TONS of enterprise-quality solutions
available to you in _one place_... for 100% free as their community offering
is AMAZING. That's insanely valuable to me as a startup engineer who doesn't
have the time to run 4-5 disparate solutions that are difficult to integrate
in a secure/simple way.

Having one solution for the above list has been a "game changer" for me because
I've got one monolithic piece of software to keep updated/managed vs. stringing
a whole bunch of solutions together - and I say "monolith" in a 100% positive
context =)

Then there's just the speed issue... I did DevOps at a huge mega-corp not long
ago, and the expectation for "major things to get done" was 3-4 things a week.
Now that I'm doing my own startup, my expectation of myself is 3-4 major things
_in a day_. GitLab is the only tooling that I can imagine keeping up with me
with near-zero BS, and because of that I'm a _huge_ brand advocate for them!
(Not directly affiliated, just a passionate user!)

All-in-all I understand people have different tools and that's totally cool,
but I did a lot to test out different CI/CD tooling and GitLab was amazingly
simple, secure, and quick to setup.

Check out [https://about.gitlab.com/devops-tools/travis-ci-vs-gitlab.html](https://about.gitlab.com/devops-tools/travis-ci-vs-gitlab.html)
for a solid feature comparison.

------
ryanthedev
I mean it's not rolling back or rolling forward.

It's just doing another deployment. It doesn't matter what version you are
deploying.

That's the whole point.

My teams go into their CI/CD platform and just cherry pick which build they
want to release.

~~~
mooreds
How do they know which release to cherry pick?

~~~
ryanthedev
We usually tag the release with something.

I work for an enterprise. We automated the change control process and
integrated it with our CD dashboard, so a quick peek at the dashboard will
tell us.

Though depending on the criticality of the app, we may only retain a previous
release for a specific time period.

------
atemerev
Blue-green deployment is the only way to fly:
[https://martinfowler.com/bliki/BlueGreenDeployment.html](https://martinfowler.com/bliki/BlueGreenDeployment.html)

There are two identical prod servers/cloud configurations/datacenters: blue
and green. Each new version is deployed alternately to the blue and green
areas: if version N is on blue, version N-1 is on green, and vice versa. If
some critical issue happens, rolling back is just switching the front
router/balancer to the other area, which can be done instantly.
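
A rough sketch of the flip, assuming an nginx front end where the live upstream
config is just a symlink (file names invented):

    # blue.conf and green.conf define the same upstream name, each pointing at one cluster
    ln -sfn /etc/nginx/upstreams/green.conf /etc/nginx/upstreams/live.conf
    nginx -t && nginx -s reload
    # rollback: point live.conf back at blue.conf and reload again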

~~~
gfodor
Yup, can't beat blue/green. A big part of the reason I recommend it is because
unlike most other rollback schemes, the actual mechanism for rolling back is
the same as for a normal release (a load balancer flip), so it's continually
exercised and validated. To roll back, you literally do a deploy, but you just
skip the step where you alter the bits on the dark cluster. This is highly
unlikely to fail, since the only thing that has to not go wrong is the part
where the software update is skipped, which basically boils down to a
conditional.

Any mechanism for rollbacks that isn't tested continuously is likely to fail
during incident response. It's a huge anti-pattern to have 'dark' processes
only used during incident response -- same thinking behind why you should also
be continually testing your backups, continuously killing servers to verify
recovery, etc.

~~~
Angostura
They you just have to worry about rolling back the changed database schema
that you have rolled out to both green and bluie variants - I presume. That
bit feels tricky

~~~
gfodor
Yep, that's a fair point. In scenarios where you have made non-backwards
compatible database changes, this approach cannot be used. Typically the dark
cluster is accessible for pre-flight testing, so you can run tests there to
determine if the service is still operating as expected despite the database
being one migration step ahead. In your response runbook, you should have a
contingency for the scenarios where that happens. (For many projects, the % of
changes that have this characteristic are small enough that it doesn't move
the needle on the risk profile, but Murphy's law can mess with you :))

------
ericol
We have a rather simple app that we manage with GitHub. When a PR is merged
into our main repo's master branch it gets automatically deployed into
production.

Whenever we need to roll back something we just use the corresponding GitHub
feature to revert a merge, and that is automatically shoved into production
using GH hooks and stuff.

Again, we have a rather easy and ancient deploy system, and it just works.

We do several updates a week if needed. We try to avoid late Friday afternoon
merges, but with a couple of alerts here and there (mostly New Relic) we have
good coverage to find out about problems.

------
helloguillecl
Before implementing CI with containers I used to deploy using Capistrano. One
thing I loved about this setup was that if I needed to roll back, I would just
run a command which would change a symlink to point at the previous deploy and
restart. All usually done in a couple of seconds.

------
m00dy
I roll back by deploying a previously tagged Docker image.

------
emptysea
Ideally we'd have the problematic code behind a feature flag and we'd turn the
flag off.

For other issues we press the rollback button in the Heroku dashboard.

Heroku has its problems: buildpacks, reliability, cost, etc., but the dashboard
deploy setup is pretty nice.

------
perlgeek
We use [https://gocd.io/](https://gocd.io/) for our build + deployment
pipelines. A rollback is just re-running the deployment stage of the last
known-good version.

Since the question of database migrations came up: We take care to break up
backwards incompatible changes into multiple smaller ones.

For example, instead of introducing a new NOT NULL column, we first introduce
it as NULLable, wait until we are confident that we won't want to roll back to
a software version that leaves the column empty, and only then change it to
NOT NULL.
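
Concretely, with made-up names, that looks something like:

    # release N: purely additive, still safe to roll the code back past this point
    psql mydb -c 'ALTER TABLE accounts ADD COLUMN locale text'

    # a later release, once we no longer want to roll back to code that leaves it empty
    psql mydb -c "UPDATE accounts SET locale = 'en' WHERE locale IS NULL"
    psql mydb -c 'ALTER TABLE accounts ALTER COLUMN locale SET NOT NULL'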

It requires more manual tracking than I would like, but so far, it seems to
work quite well.

------
insulfrable
Flip DNS back, keep the old stack around for a few days. The only case where
that doesn't just work is DB schema updates that no longer work with the
previous version of prod, but this is true for any rollback.

------
Cofike
We use Elastic Beanstalk, so we just deploy whatever application version we'd
like. Honestly not the biggest fan of that strategy, because there is at least
a 5-minute period while the new instances are provisioned and health-checked
that you just need to wait for.

When compared to our Fastly deploys which are global in seconds, it leaves me
wanting a faster solution.

------
nine_k
100% of my rollbacks were like this:

* Deploy new code in new VMs.

* Route some prod traffic to the new nodes.

* Watch the nodes misbehave somehow.

* Route 100% of the prod traffic back to old nodes (which nobody tore down).

Rollback complete.

In the case of a normal deployment, 100% of prod traffic would _eventually_ be
directed to the new nodes. After a few hours of everything running smoothly,
the old nodes would be spun down.

~~~
mooreds
So all of your new code was completely backward compatible with your old code
(in terms of state)? That is, the database didn't care which version was
interacting with it?

~~~
nine_k
Yes, it was backwards-compatible for at least one DB migration step. That is,
you can roll forward / back the database schema without breaking other code,
or roll forward / back the code without affecting the DB.

This does take more planning and gradual deployment, but saves the day when it
matters.

------
lelabo_42
We started using Docker and Kubernetes not long ago. Every deployment in
production must have a release number as a tag. If one element of our
environment needs a rollback, I redeploy an old image on Kubernetes. It's very
fast, only a few seconds to roll back, and you can do rolling updates to avoid
downtime.

------
WrtCdEvrydy
Depends on the issues.

1) Issues that cause a complete failure to start containers will fail
healthchecks and are auto rolled back in our new CI/CD flow.

2) Issues that are more subtle are manually rolled back one hash at a time
until the issue goes away (then we create a revert branch from the diff between
HEAD and WORKING).

------
cddotdotslash
We actually just implemented something like this. Our entire environment is
AWS CodeBuild, CodePipeline, and Lambda-based, but the process would be
similar for more traditional environments:

1. Developer creates a PR. To be mergeable, it must pass code review, be
based on master, and be up-to-date with master (GitHub recently made this
really easy by adding a one-click button to resync master into the PR).

2. Each commit runs a build system that installs dependencies, runs tests,
and ZIPs the final code to an S3 bucket.

3. Once the developer is ready to deploy, and the PR passes the above checks,
they type "/deploy" as a GitHub comment.

4. A Lambda function performs validation and then updates our dev Lambda
functions with the ZIP file from S3. Once complete, it leaves a comment on the
PR with a link to the dev site to review.

5. The developer can now comment "/approve" or "/reject". Reject reverts the
last Lambda deploy in dev. Approve moves the code to stage.

6. The above steps repeat for stage --> prod.

7. Once the code is in prod, the developer must approve or reject. If
rejected, the Lambdas are reverted all the way back through dev. If approved,
the PR is merged by the bot (we have some additional automation here, such as
monitoring CloudWatch metrics for API stability, end-to-end tests, etc.).

TL;DR - Don't merge PRs until the code is in production and reviewed. If a
rollback is needed afterwards, create a rollback (roll-forward) PR and repeat.

~~~
jacobkg
“GitHub recently made this really easy by adding a one-click button to resync
master into the PR”

I’m trying to find this feature but my google-fu is failing me. Can you link
to an announcement or doc page for this?

~~~
cddotdotslash
I can't find any announcements about it, but you can try it out: open a PR
based on master, then push a separate commit (or merge a different PR) to
master and return to your original PR. If you scroll down, there will be a box
saying the base branch is out of date and a button asking to resync with
master. If you click it, they will merge the master branch into your PR.

I'm not sure if there are specific settings required for this to work. For
example, we have the master branch protected and require the status checks to
pass and the PR to be up to date before it can be merged.

------
sbmthakur
At my workplace, we use GitLab. We pick the old (stable) job ID and ask DevOps
to deploy it.

------
yellow_lead
For our Kubernetes apps, every merge to master creates a tag in GitHub. If
there's an issue that doesn't result in a failed health check (those would be
rolled back automatically), we can roll back by passing the older tag into a
Jenkins job.

------
bdcravens
Our app is in ECS, our database in RDS. I'd roll back to a prior task
definition, and if absolutely necessary do a point-in-time restore in RDS. (I
tend to leave the database alone unless absolutely necessary, but our schema is
pretty mature.)

------
savrajsingh
Google App Engine, just set previous deploy to live version (one click)

------
markbnj
We revert the bad commit and redeploy. We do this for our workloads on vms as
well as those on kubernetes, but it is both easier and faster for the
kubernetes workloads.

------
wickedOne
Interesting question.

We roll forward, and thus far we have never run into a situation where that
wasn't possible in a reasonable amount of time.

Nevertheless I've wondered more than once what would happen if we ran into
such a situation with a substantial database migration in the process (i.e.
with table drops).

Curious to learn what the different strategies are on that point: do you put
your table contents in the down migration, do you revert to the last backup,
etc.?

~~~
ht85
Generally a good strategy for dropping columns or tables is to rename them
instead (e.g. `table_deprecated`).

If things look stable after whatever time you deem necessary, you can write a
second migration to actually drop them.

If you run into issues, your down migration simply undoes the rename.
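
As a sketch with an invented table name, the pair of migrations is just:

    # "drop" migration: nothing is lost, and nothing should reference the _deprecated name
    psql mydb -c 'ALTER TABLE reports RENAME TO reports_deprecated'

    # its down migration (the rollback) is the opposite rename
    psql mydb -c 'ALTER TABLE reports_deprecated RENAME TO reports'

    # a later migration, once things look stable, does the real DROP TABLE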

------
sunasra
We use the Chef framework. Each deployment has a specific tag. If something
goes wrong, we revert to the most stable Chef tag and redeploy to all the
tiers.

------
sodosopa
Depends. For base code moves we restore from the previous release. For larger
things like websites we use blue-green deployments.

------
mister_hn
I simply build a new install package and send it to the customer,

or build a new VM and send it to them.

Not everything is web-based.

------
cdumler
You are asking a very generic question without really stating what environment
you are using. The ability to "roll back" is really a statement about how you
have defined your deployments. A good environment really should have a few
things:

* A version control system (i.e. git) that has a methodology for controlling what is tested and then released (i.e. feature releases). If you want the ability to revert a feature, you need to use your version control to group (i.e. squash) code into features so they can be easily reverted. Look up the Git branching model [1]. It's a good place to start when thinking about organizing your versioning to control releases.

* You should be able to deploy from any point in your version control. Make sure your deployment system is able to deploy from a hash, tag or branch. This gives you the option of "reverting" by deploying from a previously known good position. I would highly suggest having automated deployments generate timestamp tags in the repo so you can see the history of deployments.

* Try to make your deployments idempotent and/or separate your state changes so they can be independently controlled. If you have migrations, make sure they can withstand being deployed again, i.e. "DROP TABLE IF EXISTS" then "CREATE TABLE", so redeploying doesn't blow up (see the sketch after this list). If you need to roll back, you can roll back as far as you need to the point you want to deploy. A trait of a well-designed system is that it needs few state changes to add new features and/or those state changes can be easily controlled.

* Have a staging system (or systems). You should be able to deploy to a staging system to verify the behavior of a deployment. It should replicate production in every way except data content. Ideally, you should also build it from scratch every time, so that you can guarantee that if production dies a hard death you can completely reproduce it. A great system will also do this for production: bring the new environment up for final testing, and then switch over to it once tested.
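
A small sketch of the idempotent-migration point from the third bullet, with an
invented table (the guard means re-running the same deploy doesn't blow up):

    psql mydb -c 'DROP TABLE IF EXISTS session_tokens'
    psql mydb -c 'CREATE TABLE session_tokens (
                      id         bigserial PRIMARY KEY,
                      user_id    bigint NOT NULL,
                      token      text NOT NULL,
                      created_at timestamptz NOT NULL DEFAULT now()
                  )'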

Notice the trend here is to break up the dependencies between how, what, and
where code is deployed so that you have many ways to respond to issues. Maybe
the issue is small enough to just fix in a future release. Maybe it calls for
an emergency patch, tested on a new production deployment that you then switch
over to. Maybe it is so bad you want to immediately deploy a previous version
and get things running again. All of these abilities depend on building your
system such that you have these choices.

[1] [https://nvie.com/posts/a-successful-git-branching-model/](https://nvie.com/posts/a-successful-git-branching-model/)

------
nodesocket
With Kubernetes and deployments simply:

kubectl rollout undo deployment/$DEPLOYMENT
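
And if the previous revision isn't the one you want, the history and a specific
revision work too (the revision number is just an example):

    kubectl rollout history deployment/$DEPLOYMENT
    kubectl rollout undo deployment/$DEPLOYMENT --to-revision=2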

------
SkyLinx
I deploy to Kubernetes with Helm, so I just do a Helm rollback.

------
rvdmei
Rolling back is usually a bad practice and can get quite challenging if not
impossible in distributed environments.

If you can pinpoint a specific commit that is causing the issue, revert that
commit and go through your standard release process.

~~~
mooreds
Would love to know more about how you roll forward in that case. Or do you
canary so thoroughly that bugs never get to prod?

~~~
rvdmei
Bugs do get to prod, that’s just reality. If you keep changes small and
release often you won’t see many big bugs in production. We typically keep
master in an “always deployable” state. If a bug gets found in prod, we either
fix it or create a new commit that reverts the commit that caused the problem.
If it’s a faulty migration, reverting the commit usually won’t work and we
have to fix the bug. For QA we use a mix of automated tests and manual tests.

------
elliotec
Click the button, revert master, then fast follow with a fix.

------
billconan
Each of my deployments is a zip file. I just redeploy with an older zip.

~~~
antoineMoPa
How do you manage data migration?

~~~
billconan
My database is relatively stable. I'm also using MongoDB; it has some
flexibility.

------
crb002
Liquibase

------
faissaloo
I don't rollback unless it's for work, I just take the time to fix it, the
site can stay down for as long as it needs to.

------
rolltiide
Roll back the master branch and deploy that again.

Similar to "clicking the button in Heroku".

------
mnm1
Build the previous working version. Deploy to Elastic Beanstalk. Roll back any
migrations. Fix the issue and redeploy at leisure.

------
rinchik
A rollback (step back) is an anti-pattern inherited from waterfall.

Now we should only march forward with small, on-demand releases; this way we
will know exactly where the issue is and will be able to fix it forward
quickly.

Rollbacks were a strategy for monthly (or even quarterly [insane, huh?]),
giant, stinky release dumps, where there was no way we could quickly identify
and deploy the fix. Aka let's throw production 3 months back and take another
2 months to figure out where the issue introduced during the last release is.

And to finally answer your question: we never roll back. We always march
forward.

~~~
gfodor
Shorter releases can help with reducing the difficulty of immediately
addressing problems, but it's an error to equate the reduction of risk with the
elimination of risk.

There are always going to be failure modes that require extensive time to
diagnose and debug, even with small changes being made. Additionally, you want
that diagnostic phase to happen without time pressure. If you do not have a
sane rollback mechanism to use in those scenarios, you are doing a disservice
to your users and your team.

Your users suffer, because the outage or breakage will last as long as it
takes for you to address the underlying issue directly, instead of just
rolling back to restore service. They will be forced to hear frustrating
things like "we're working on it", since you don't know what's wrong yet, when
instead you could have just rolled back before most users even noticed there
was a problem.

And, more importantly, your team will suffer greatly, because they will be
forced to work under pressure when an incident like this arises. And, worse,
they will also 'learn' that accidentally pushing breaking changes to
production results in an extremely unpleasant and toxic situation for
everyone, leading to systemic fear-of-deploys and undermining a blameless
culture.

So you should have a rollback mechanism that is solid, tested, and easy to use
for scenarios where a non-trivial regression or outage arises in production,
even if you are doing continuous delivery of small patches.

~~~
rinchik
Well, it's highly unlikely that such an error will arise. If it does, and you
know that you actually need to roll back, then most likely something else is
wrong.

But again, I also saw the other comment about how "dogmatic" my approach is. I
wouldn't say it's dogmatic; idealistic, yes. But not dogmatic. There is a place
and time for anything, and a rollback can STILL be useful when you don't trust
the system or the code base (as I pointed out in my other comment, rollbacks
are useful with legacy systems and with systems you have to maintain that were
built by outsourced teams).

Rollbacks are also the first thing you think of when you join a large company
as a director of engineering to support systems you have never touched before.

~~~
gfodor
You didn't really address my points. It's hard to quantify just how "highly
unlikely" a failure is, but it's your job as a systems designer to build
systems that are robust under a wide variety of unlikely failure scenarios.
Not having rollbacks results in a system that is extremely problematic in
those unlikely scenarios where a quick fix is not immediately available.
Not to mention, rolling forward under such a regime has its own unique risk:
since the "fix" was made under pressure, with no alternative way to restore
service quickly, it's often the case that simple human errors get introduced
when rolling forward. I've seen it a million times.

In my experience, a healthy incident response process has a fork in the
decision tree at the very top: do we roll back, or do we attempt to fix live?
And in the latter case, we time box how long we're willing to spend, and defer
to rolling back for all but the most trivial, obvious fixes. Even if you don't
use rollback often, having that top level fork is a release valve for all of
the toxic implications I mentioned in the scenario where you _do_ actually
need it.

Even if you have several dozen incidents where you didn't need it, a black
swan event will eventually show up -- and that event will be the one that has
the lasting impact on your company's public perception and the morale of your
team.

~~~
rinchik
I'm strictly against rollbacks and I'm strictly for everything continuous.

If I need to do a rollback, it means that I don't trust the system or the code
base. I will do the rollback, but after that there will be a very productive
retro about how we can do better to avoid rollbacks in the future (aka what
did we learn).

But again, as I said, there is a place and time for everything! And there are
many variables! Even how you structure your teams affects deployments:
engineering culture, engineering team types (cross-functional, generalized,
specialized, etc.), whether the team that makes the decision about the rollback
is the team that introduced the bug.

My approach is not dogmatic (have your standardized rollbacks if those work
best for your company, release cycles, and teams); it's idealistic (that's what
I aim for, personally).

~~~
gfodor
I suspect we're going to agree to disagree here, but I highly advise you to
reconsider the idea of framing a rollback as an unforced failure to your team.
The last dynamic you want in a retrospective is one where not only did an
unexpected failure happen (a bug pushed to production), but then the team
collectively 'let you down' by pulling the rollback lever instead of thinking
and working harder on fixing the issue live. In such a scenario you're forcing
people to feel they need to "cry uncle" when they can't solve the problem
quickly, and putting them in the middle of a conflict of interest between
making a well-tested, reviewed change that is sure to fix the problem, and
rolling the dice on a quick fix in the hope it'll reduce the total outage.
That's not the recipe for a positive, blameless culture.

When we roll back on my team, it's uncommon but when it happens it's
considered a success if it was made through a systematic decision-making
process. Making a sane decision in the interest of our users to restore
service quickly is always a win. I can assure you, it does not compromise your
ability to do continuous delivery or small changes by having and occasionally
using a rollback mechanism. If you are fearful of the idea that having such a
mechanism and plan in place somehow will lead to people questioning your
principles in a way you cannot defend, then that is a separate problem, since
the two things you mention that are incompatible are in fact compatible and
highly defensible.

It is not a legacy from "waterfall" or any of the other things you mention,
because your claim can be refuted through a single counter example, and I've
worked on 3 separate projects where such counter examples exist: we had a
rollback method, it was used once in a while, and we shipped changes to
production multiple times a day using continuous delivery. At no point on
these projects did the ability or use of roll back lead to some kind of hard-
to-explain loss in delivery velocity. On the contrary, I suspect if that
mechanism did not exist, several failures that were easy to get back to green
would have turned into a toxic hellhole, and my team mates would have been
much more fearful around shipping, which is the high order bit when it comes
to velocity and embracing continuous delivery of small changes.

~~~
rinchik
"toxic hellhole", "blame culture" \- I don't think we need those dark,
marginal extremes to make a point

I also suspect that we're going to agree to disagree here. There are so many
nuances, it's impossible to properly communicate most of those without writing
a chapter of a book.

Appreciate your points though. Great food for thought right there.

