
You Can’t Have a Rollback Button - kiyanwang
https://blog.skyliner.io/you-cant-have-a-rollback-button-83e914f420d9#.ns2uqasoj
======
craigds
Article seems a bit over the top. Many of us can and do have a rollback
button.

Sure, rolling back isn't trivial. If the code has side effects (e.g. database,
disk or cache state), you need to account for those side effects when rolling
back.

The example given in the article could be easily fixed by adding a
cache.clear() to the rollback procedure (assuming cache performance isn't
considered panic-critical)
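
A minimal sketch of that idea, assuming a toy in-process deployer and a plain dict standing in for the cache. `Deployer`, `deploy`, and `rollback` are hypothetical names for illustration, not taken from the article or from any real deployment tool:

```python
class Deployer:
    """Toy deployer: remembers deployed versions and owns a cache handle."""

    def __init__(self, initial_version, cache):
        self.history = [initial_version]
        self.cache = cache  # any object with a .clear() method; a dict here

    @property
    def active(self):
        return self.history[-1]

    def deploy(self, version):
        self.history.append(version)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        # The bad release may have written entries the old code cannot trust,
        # so the rollback procedure drops the whole cache.
        self.cache.clear()
        return self.active


cache = {"user:1": "rendered by v1"}
deployer = Deployer("v1", cache)
deployer.deploy("v2")
cache["user:2"] = "corrupted by v2"
restored = deployer.rollback()
```

The cost is a cold cache after every rollback, which is the trade-off mentioned above.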

For database state, use a migration system and make sure each change you make
can be reversed (and make sure that's tested!)

> If developers incorrectly believe that their mistakes can be quickly
> reversed, they will tend to take more foolish risks. It might be hard to
> talk them out of it.

I doubt it. Either encourage rapid iteration and deployment, or encourage a
more stable, well-tested production environment. Do that via feedback, code
reviews, post-mortems, internal docs. The presence or absence of a rollback
button is not going to be a major contributor to this.

~~~
sudhirj
> For database state, use a migration system and make sure each change you
> make can be reversed (and make sure that's tested!)

Isn't this debunked clearly in the article? If you add a new feature / column
and users begin using it (adding data, making transactions) rolling back is a
catastrophic data loss. Sure, the code and database schema will go back to a
consistent state, but the data never can.

That's what the article is pointing out - that deployed applications + data are a
state machine without cycles - i.e., even if you 'rollback', it's akin to
adding a revert commit on top of your existing state. The arrow of state, as
with time, only moves forward.

~~~
kelnos
That's why you don't do it that way.

Adding a new column to a database table (for example) is a single distinct
deployment. You do that one without touching your application code. After the
column has been added, you deploy the code that uses it. If the new column is
a replacement for an existing column, your new code should continue to write
to the existing column as well in the same way it always has. If you have to
roll back, you roll back the application code change, but the database change
remains, so there's no problem. Meanwhile, the new code that was rolled back
was still writing to the existing column, so there's no data missing that the
old code expects to be there.

After you are comfortable with the change, you can update your code to stop
using the obsolete column, test that change, and then later drop the column
from the database table.
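
A rough sketch of that sequence, using an in-memory SQLite database. The table, the column names, and the `save_user_*` helpers are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Deployment 1: schema only. The column is nullable, so old code is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def save_user_old(conn, name):
    # Old application code: knows nothing about display_name.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def save_user_new(conn, name):
    # New application code: fills the new column and keeps writing the old one.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, name.title()),
    )


def read_names_old(conn):
    # Old application code only ever reads the existing column.
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


save_user_old(conn, "alice")   # before the code deploy
save_user_new(conn, "bob")     # deployment 2: new code is live
save_user_old(conn, "carol")   # code rolled back; the schema change stays
names = read_names_old(conn)
```

Because the new code kept writing `name`, the rolled-back code still sees every row it expects.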

If you've added a new column or database table for a new type of data, and
have to roll back the application code that uses it, you don't lose any data.
The customer is of course unable to use the new feature and new data post-
rollback, but that's okay: you can roll out the new feature again when you've
figured out what the problem is.

Yes, I'm sure we can all come up with examples where this approach won't work,
but _they are not the norm_. They are rare. If they are not, then _you are
doing infrastructure wrong_.

~~~
brightball
Exactly. Perfect.

------
skookum
The author of the blog starts with the claim that you can't have a rollback
button because applications aren't self contained and then at the bottom of
the article proceeds to outline parts of how rollback "buttons" in distributed
systems are enabled. A rollback button is rarely a "button" and is instead a
procedure that was reasoned through in advance of deployment.

Every CM at every large cloud vendor/distributed infrastructure shop has a
field asking the operator to detail the rollback procedure which is usually a
set of steps that include both turning off new code and fixing up state.
Moreover, virtually all large scale distributed systems need to be able to run
in mixed mode with old versions and new versions co-existing in different
parts of the fleet for hours or days at a time. It's simply not possible to
flip from one version to the next in any way that even loosely approximates
atomicity at the fleet scale of Google, Amazon, Facebook, etc. Any competent
distributed systems engineer will think through the upgrade implications...
"How does Vnext RPC to/from Vcurrent?", "How will Vcurrent handle changes to
central database state made by Vnext?", "How will Vcurrent handle local on-
disk state if I need to flip back to it from Vnext on some set of nodes?", and
so on. Often such upgrades entail multiple deployments where backward
compatibility code is pushed ahead of RPC/state changing code.
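
One way to picture the "Vnext RPC to/from Vcurrent" question is a tolerant reader on both sides. This is only a sketch with JSON payloads; the handler names and the `region` field are invented for the example:

```python
import json


def v_current_handle(raw):
    """Old reader: takes the fields it knows and ignores the rest."""
    msg = json.loads(raw)
    return {"user_id": msg["user_id"]}


def v_next_handle(raw):
    """New reader: treats the new field as optional and supplies a default."""
    msg = json.loads(raw)
    return {"user_id": msg["user_id"], "region": msg.get("region", "unknown")}


from_v_current = json.dumps({"user_id": 7})
from_v_next = json.dumps({"user_id": 7, "region": "eu"})

# Mixed fleet: each version has to accept what the other one sends.
results = [
    v_current_handle(from_v_next),
    v_next_handle(from_v_current),
]
```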

Does all that mean there aren't occasional mistakes rendering a planned
rollback impossible and the only path forward is to soldier through to
V(original + 2)? Of course not. But the author's idea that you should
generally plan to soldier through to newer code if things go wrong during
deployment as a standard operating procedure betrays a lack of experience and
sends a dangerous message. When deployments go off the rails is exactly the
time when follow-on mistakes are made - at that point you've got a tense
conference call going, the managers/PMs/VPs are asking for ETA for the
restored system and the only acceptable reply is "very soon", the engineers
are trying to diagnose quickly and may overlook some symptoms of the botched
deployment, etc. Even the best engineer is not going to do a great job
thinking through all the implications of the remediation they are attempting
at that point and the code they're now about to push has almost certainly cut
a bunch of corners in QA. That's not a situation I willingly put myself in.

~~~
tarmstrong
> But the author's idea that you should generally plan to soldier through to
> newer code if things go wrong during deployment as a standard operating
> procedure betrays a lack of experience and sends a dangerous message.

The author says "reverting smaller diffs as a roll-forward is more verifiable"
near the end of the article. I agree the title makes this a bit confusing, but
I don't think he's arguing that the only way to recover is to write a patch
under pressure.

~~~
kelnos
Reverting a smaller diff _is_ writing a patch under pressure. Reverting a
section of your new code is just as difficult as authoring a new fix. You need
to be just as sure that the partial revert will not interact poorly with other
parts of the new code that you are _not_ reverting.

~~~
tarmstrong
> You need to be just as sure that the partial revert will not interact poorly
> with other parts of the new code that you are not reverting.

By partial revert, I'm imagining that three people have changesets (A, B, C,
in that order) that have been deployed. You notice that A broke and you make
A' to revert it. I think the author is arguing that it is easier to review A'
to see if it is a safe change than it is to verify that A', B', and C' (the
full revert) are safe to revert.

In other words, even if you don't use version control to record that you
reverted A, B, and C, you still effectively do that by reverting in full. You
just know that the combination of A', B', and C' _was_ safe when it was
deployed.

Is that what you're imagining or are we talking about different things? (I
don't have strong opinions about this, I just want to make sure I understand
your perspective (: )

------
morgante
The title is a bit clickbaity; a better one would be: rolling back changes
isn't as straightforward as just changing your git hash.

In addition to the strategies mentioned, having backward-and-forward
migrations for all database changes is essential. You need to have a plausible
path for how to restore state to where it was _before_ the broken change.

If you've done the necessary engineering work, this can ultimately be packaged
into a "rollback" button which does things like ramping down dark deploys.

------
csense
The blurb about the company at the end of this article made me wince:

> Skyliner is an AWS platform for continuous delivery. We’re trying to build a
> straight jacket that you can wear to stop hitting yourself in the face.

I'm usually one of those guys with an instinctive distaste for marketing, but
it seems like this is one startup that badly needs some professional help in
the way they talk to potential customers...

~~~
Terr_
I imagine it polls very well with developers who somewhat-distrust the
flailings of "the business guys", but if the phrase gets past that layer...

~~~
debaserab2
Why would you ever buy a strait jacket in real life, though? The metaphor
doesn't even make sense.

~~~
skyrw
Well I guess if you're the type of person that can't stop hitting themselves
in the face you may be in the market for one. Limited market, though.

------
davidhariri
This article should really be one sentence: "Pure codebase rollbacks in
production are impossible if schema changes are made". Pretty obvious,
frankly.

------
kelnos
Article is just flat-out wrong. Sure, rollback buttons _can_ cause more
problems than they solve, but depending on the problem at hand, they can also
decrease customer downtime _greatly_ if you roll back in a situation where
it's safe and possible to do so.

In my experience, assuming you're doing a decent job of separating your
application code and application state, and are avoiding backward-incompatible
changes to state storage, rollbacks are almost always safe, and almost always
the right first step when a deployment causes customer-visible errors.

I don't find their trivial example of rollback failure compelling. Sure, you
can always find an instance when a particular tool does more harm than good,
or just doesn't do good. A single example, or even many examples of such, need
not form a basis for discrediting that tool.

Rollbacks are a tool in your operations toolbox. Sometimes you should reach
for them, and sometimes you should reach for another tool. Claiming you "can't
have a rollback button" is counterproductive and needlessly discards a tool
that can help your customers when your process fails and you put them in a
tight spot.

------
daddykotex
I'd like to know how you handle feature flags in a full-featured web
application.

In my experience, it's relatively easy and clean to work with `if(feature)
this else that` in a simple backend application (a service that pulls from a
queue and maintains state, for example).

The problem arises when you start working with more complex applications.
Specifically, applications with a complex UI where you have to check the state of
the flag at multiple different places in the code. You end up adding branches
all over the code, branches that will be removed once the feature is fully
deployed and known to be correct.

It adds a lot of overhead, and the code is less readable.

There must be a smoother way to implement feature flags.

~~~
dsr_
Think of it as being exactly the same problem as "different customers have
paid for different features". It's just that only the internal tester customer
has paid for this new feature, so far.

How do you handle that? You need a consistent permissions framework.
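
As a sketch of what "flags as permissions" might look like, assuming a single gate object that every call site consults. `FeatureGate`, `grant`, `allows`, and the feature name are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class FeatureGate:
    everyone: set = field(default_factory=set)      # features enabled for all
    by_customer: dict = field(default_factory=dict)  # customer id -> features

    def grant(self, customer_id, feature):
        self.by_customer.setdefault(customer_id, set()).add(feature)

    def allows(self, customer_id, feature):
        granted = self.by_customer.get(customer_id, set())
        return feature in self.everyone or feature in granted


gate = FeatureGate()
gate.grant("internal-tester", "new-checkout")


def checkout_page(customer_id):
    # Same check the app would use for a paid feature.
    if gate.allows(customer_id, "new-checkout"):
        return "new checkout"
    return "old checkout"
```

Turning the feature on for everybody is then `gate.everyone.add("new-checkout")`, and the call sites don't change.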

~~~
daddykotex
The thing is that, in my experience, features behind flags are not always a
per-customer artifact.

Or maybe they should be and our implementation of feature flags should always
take the current customer into consideration...

------
kaishiro
It's somewhat unfair, but one of the best things about our current build
system is that we have a very exact rollback button. So, first off, we're
dealing with static sites, but there is still a dynamic data model since we
utilize cloud based CMSs. However, since we only ever save the built output,
we can rollback to any version of the site going all the way back to launch.
I've never been able to do that before moving to this system - pretty neat.

~~~
laktek
Neat! Is it a custom-built solution or are you using a product?

We are building similar capability for Pragma
([https://pragma.build](https://pragma.build)), if anyone else is interested
in having a similar workflow.

~~~
kaishiro
Yeah, good question. We built out our own system on AWS w/ Lambda and S3
buckets, but lately we've been utilizing Netlify as well. Netlify has a pretty
slick feature where each deploy is automatically available via a subdomain =
the short git SHA - super clever.

------
zitterbewegung
Doesn't CQRS style design attempt to promise a "rollback button" or method of
rolling back? In the example you would see the event emitted by the faulty
program and then you would reconstruct the object based on replaying the log
right before the invalid function was executed.

~~~
kevsim
Yep, any sort of event sourcing approach will theoretically give you the time
machine this guy is looking for (albeit with the data from the "bad" release
being entirely thrown away). However in practice I think it's non-trivial to
bake this into your deployments.

In the process of rolling out a CQRS-based system now though so maybe we'll
give it a go when time allows.
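
For illustration, a tiny sketch of the replay idea, assuming the log is a list of `(sequence, account, delta)` tuples. The log format and the `replay` helper are invented here, not taken from any CQRS framework:

```python
def replay(events, upto=None):
    """Rebuild account balances by folding over the event log.

    If `upto` is given, events with a higher sequence number are skipped,
    which is the "time machine" part.
    """
    balances = {}
    for seq, account, delta in events:
        if upto is not None and seq > upto:
            break
        balances[account] = balances.get(account, 0) + delta
    return balances


log = [
    (1, "alice", 100),
    (2, "bob", 50),
    (3, "alice", -1000),  # emitted by the faulty release
    (4, "bob", 25),       # valid, but recorded after the bad deploy
]

current = replay(log)
before_bad_release = replay(log, upto=2)
```

Note that event 4 is thrown away along with event 3, which is the data loss caveat mentioned above.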

------
joshuamorton
I really want to see some comments from SREs (or similar) at major companies
on this. It seems like a lot have a rollback first, ask questions later
policy, and I'd love to hear about how they mitigate or flat out avoid the
problems described in this article, or if I'm just mistaken.

~~~
peterwwillis
SREs only have a supporting role to play in change rollback. They maintain the
tools/services used by SWEs, DBAs, etc to follow the rollback procedure
documented in Change Management.

Rollback is often a suitable first response because the rollback is not a
button, it is a procedure, and it can be practiced in a test environment and
on a small subset of production without significant customer impact.

Like the other commenter suggested, gradual rollout is a great way to spot
bugs or customer feedback. You can accomplish this by building deployment
tools that allow for targeted deployment, and by maintaining services that can
be targeted by release. For example, you can use your Configuration Management
engine to snapshot a DB, deploy schema changes to it and then your code to the
app servers using only that DB. If there's a problem you can roll back the
code, the data, the schema, whatever changed at that release version, and
reload all the services. For things that affect customer assets you'll need
extra procedures to work around potential failures there, but that's not
complicated (usually).
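
A small sketch of one way to pick the targeted subset, assuming hosts are assigned to stable buckets by hashing their names. The host names and the `hosts_for_release` helper are hypothetical, and real tooling would likely target by role or by the DB a server uses instead:

```python
import hashlib


def bucket(host):
    """Map a host name to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def hosts_for_release(hosts, percent):
    """Hosts that should run the new release at this rollout percentage."""
    return [h for h in hosts if bucket(h) < percent]


fleet = [f"app-{n:02d}" for n in range(20)]
canary = hosts_for_release(fleet, 10)
half = hosts_for_release(fleet, 50)
```

Since buckets are stable, raising the percentage only adds hosts, and lowering it back is the rollback for the ones that got the new release.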

------
lobster_johnson
You can, but you need to design and plan for it. _State_ depended on by code
is usually the biggest problem; you need to design your code and rollout
process to work with both old and new state.

For example, if you add a column to a database table, then the old code
should be able to work seamlessly both with and without this column. In
practice, this would mean not making the column a required one, unless you can
invent some kind of neutral default value.
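
A quick sketch of the difference, using in-memory SQLite. The `orders` table and `currency` column are made up, and the tables are created directly in their "after the change" shape to keep the example short:

```python
import sqlite3


def old_code_insert(conn):
    # Old code: has never heard of the currency column.
    conn.execute("INSERT INTO orders (item) VALUES (?)", ("book",))


# Required column with no default: the old code's insert is rejected.
strict = sqlite3.connect(":memory:")
strict.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, currency TEXT NOT NULL)"
)
try:
    old_code_insert(strict)
    old_code_failed = False
except sqlite3.IntegrityError:
    old_code_failed = True

# Required column with a neutral default: the old code keeps working.
lenient = sqlite3.connect(":memory:")
lenient.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, "
    "currency TEXT NOT NULL DEFAULT 'USD')"
)
old_code_insert(lenient)
row = lenient.execute("SELECT item, currency FROM orders").fetchone()
```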

------
asher_
A cache seems like a particularly odd example to choose for this. If you are
changing the way a cache functions, then presumably you either have fairly
short expiration times (problem will fix itself) or you would have some form
of cache invalidation as part of the deployment process.

Additionally, it would have been nice to see some mention of patterns that
solve this issue more completely, like CQRS, where state is disposable.

------
NTripleOne
Stopped reading at "The internet is a big truck", that's a flat-out lie,
everyone knows that the internet is a series of tubes.

------
joshdotsmith
Does anyone have resources on how one should design their changes to run side-
by-side? I have not been at large companies and don't have the advantage of
institutional knowledge to help here. Books, articles, and practical examples
would be fantastic.

And since it's often hard to generalize, I work today with Elixir and
Postgres. Anything specific around this stack would be exceptional.

~~~
lukejduncan
I don't know if this helps, but reading up on backwards and forwards
compatibility can get you in the right direction. The same concepts apply, but
the details may vary. For API interfaces read up on or think through best
practices around overloading old fields, adding new fields, removing old
fields, making them optional. I remember Acro had a "schema evolution" doc
that talks through some of these concepts and they may apply to things you
care about. For databases, I've seen updates staged as pre-commit,
post-commit, and rollback patches. Pre-commit modifies the database in a
backwards compatible way -- for example adding new fields and mutating old
fields into them if necessary. Code would be pushed live and could read the
"old" or "new" way. After all the code is rolled out and there's high
confidence, the post-commit would clean up things left behind only for
backwards compatibility reasons. Rollback only operated on pre-commit;
post-commit may be destructive.
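
A rough sketch of that three-patch staging against in-memory SQLite. The table names and the `pre_commit` / `rollback` / `post_commit` functions are illustrative assumptions, and whole tables are used instead of columns to keep the SQL simple:

```python
import sqlite3


def pre_commit(conn):
    # Backwards compatible: adds a new table and fills it from the old one.
    conn.execute("CREATE TABLE contacts_new (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute(
        "INSERT INTO contacts_new (id, email) "
        "SELECT id, lower(email) FROM contacts_old"
    )


def rollback(conn):
    # Undoes pre_commit only; the old table was never touched.
    conn.execute("DROP TABLE contacts_new")


def post_commit(conn):
    # Destructive cleanup, run once no deployed code reads the old table.
    conn.execute("DROP TABLE contacts_old")


def tables(conn):
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts_old (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO contacts_old (email) VALUES ('Ann@Example.com')")

pre_commit(conn)
after_pre = tables(conn)        # both tables exist; old and new code can run
rollback(conn)
after_rollback = tables(conn)   # back to the starting schema
pre_commit(conn)
post_commit(conn)
after_post = tables(conn)       # old table gone; no rollback patch from here
migrated = conn.execute("SELECT email FROM contacts_new").fetchone()[0]
```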

~~~
lukejduncan
Acro == Avro

~~~
jlgaddis
click "edit"

~~~
lukejduncan
sorry, was on mobile

------
jlebrech
if you have a load balanced system you could always have the "roll-back"
button prevent users upgrading their profile to the newer version of the app.

but anyone already upgraded would be left to deal with the bugs till they are
fixed and there's a further upgrade scheduled.

------
oconnor663
Realistically, when you run a website big enough that the people pushing the
code to servers aren't the people who wrote it, you're going to have a
rollback button. The ops people can't wait to figure out what commit caused
the issue, much less wait for a fix. They're going to roll back the servers.

Note that a rollback could be useful even in the example in this post.
Machines that have hit the bad codepath are corrupted, but machines that
haven't hit it yet still have good data. Sometimes you need to stop the
bleeding even if there's more work to be done for a fix.

> If developers incorrectly believe that their mistakes can be quickly
> reversed, they will tend to take more foolish risks.

That may be true, but by the same token, if developers believe their code
can't be rolled back, they're writing code that's broken in a different way.

~~~
jwatte
We have millions of active users, and a large engineering organization, and our
engineers push their own code. Every hour, every day.

We've internalized writing code that deploys well both going forwards and
going back. 99% of all changes don't even need to worry about that, because
they are UI/rules/bug changes, not base data schema.

We actually don't roll back database schemata; schemata are applied before
code that relies on the changes can be written, and thus must be backwards
compatible! If you think you can't do it that way, think harder.

It's not cheap to build all the necessary infrastructure and keep training all
new engineers in the necessary skills, but it can absolutely be done and pay
off handsomely. We never run stale code, and we never diverge far from master.

