
A push gone wrong in the name of what, exactly? - janvdberg
https://rachelbythebay.com/w/2019/10/25/enabler/
======
mikenew
Too many details are missing here to draw any sort of conclusion. Yeah, of course
bugs aren't supposed to make it to production. But they do. Is it easier to
roll back than it is to roll forward? Usually it's not. How far along are
things before you recognize what's happening? How critical is the service?

You can't draw life lessons from some vague description of "something" that
happened to "someone". Continuing the rollout could very well have been the
best option. Or not. Who knows.

~~~
klodolph
> Is it easier to roll back than it is to roll forward? Usually it's not.

If downtime has consequences then you have every incentive to make rollbacks
happen at the touch of a button. From personal experience I know that plenty
of teams simply are not at that point for many reasons.

If rollbacks are “hard” in any way for you, then you should understand the
consequences of difficult rollbacks—they incentivize you to push forward even
when the new release is kind of broken.

For example, let’s say that your team doesn’t test rollbacks and doesn’t have
procedures in place. You know, in the corner of your mind, that a version
rollout could potentially take your service completely down for days while
people hunt down the bug or figure out how to do rollbacks on the fly. This is
a predictable scenario to end up in, so at some point I recommend addressing
it, or justifying why you don’t.

> Continuing the rollout could very well have been the best option. Or not.
> Who knows.

Rollback should be the default response in the face of uncertainty when you
might have a bad rollout. Ideally. It should be the safe, one-button-press
reset back to the way things were before.

Again, I know that’s often not the case. Schema changes are the biggest
sticking point, usually. I often see schema changes in two phases—first you
update the schema, and then later you roll out functionality which uses the
new schema, in a separate rollout. This is tricky, but most updates are not
schema changes.
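
To make the two-phase idea concrete, here’s a minimal sketch with hypothetical
table and column names, using sqlite as a stand-in for the real database:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    # Phase 1: roll out the schema change by itself. The new column is
    # nullable, so version A code that never mentions it keeps working.
    db.execute("ALTER TABLE users ADD COLUMN email TEXT DEFAULT NULL")

    # Version A code path, unchanged, still valid against the new schema:
    db.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

    # Phase 2, a separate later rollout: version B starts using the column.
    db.execute("INSERT INTO users (name, email) VALUES (?, ?)",
               ("bob", "bob@example.com"))

    # Rolling B back to A at any point is safe, because A simply ignores
    # the extra column.
    print(db.execute("SELECT id, name FROM users").fetchall())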

Teams generally learn these things as object lessons (as in “we had downtime,
wrote a postmortem, and it won’t happen again”) but that’s not the way things
need to be. As the team evolves and the product matures, the release tooling
gets better. You read articles like this and you can make the tooling better
without the object lesson. No one-size-fits-all solution, obviously, but…
painless, one-button rollbacks are one-size-fits-most for sure!

Someday, someone writes an integration test that sets up a new and old version
and makes them talk to each other, checking that they both work correctly.

Another day, somebody writes an integration test that does a rollout from the
previous version, followed by a rollback.

And someone else adds more monitoring signals, so that if service failures rise
above some threshold the release automatically rolls back (a minimal sketch of
that check is below).

These are all evolutionary steps. Yes, they don’t apply to all services.
But—most of the services I run are somehow making money for the company, or
people depend on them for critical pieces of functionality related to the
company’s bottom line.
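
For what it’s worth, the monitoring-driven rollback doesn’t have to be fancy.
Here’s a minimal sketch of just the decision, with made-up metric values and
thresholds; the actual rollback command is whatever your deploy tooling
provides:

    import subprocess

    ERROR_RATE_THRESHOLD = 0.05   # roll back if more than 5% of requests fail
    MIN_REQUESTS = 100            # don't react to tiny samples

    def should_roll_back(errors: int, requests: int) -> bool:
        """Decide whether the canary's error rate crosses the threshold."""
        if requests < MIN_REQUESTS:
            return False
        return errors / requests > ERROR_RATE_THRESHOLD

    def roll_back(previous_version: str) -> None:
        # Placeholder: substitute your own deploy tooling's rollback command.
        subprocess.run(["echo", "rolling back to", previous_version], check=True)

    # Hypothetical numbers scraped from your monitoring system:
    if should_roll_back(errors=37, requests=500):
        roll_back("v1.4.2")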

~~~
bloody-crow
> If downtime has consequences then you have every incentive to make rollbacks
> happen at the touch of a button.

Not all rollbacks are the same. Even in a theoretical scenario where you can
roll back the code to a previous version with a push of a button, it's not
always possible due to many factors.

Sometimes new code creates and persists state that the old code can't handle,
and rolling back would just make the old version blow up because of it.
Sometimes clients can't talk to the old code anymore after they've talked to
the new code and started expecting the new protocol.
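
As a trivial illustration of the first case, with a hypothetical record format
(nothing from the article):

    import json

    # Version B persists a richer record.
    record_written_by_b = json.dumps(
        {"schema": 2, "user": {"name": "alice", "email": "alice@example.com"}})

    def read_record_version_a(raw: str) -> str:
        # Version A assumes the flat schema-1 layout and nothing else.
        data = json.loads(raw)
        return data["name"]          # KeyError on schema-2 records

    try:
        read_record_version_a(record_written_by_b)
    except KeyError:
        print("rolled-back version A blows up on state written by B")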

The point is, there are a lot of factors that could make one or another
remediation strategy more or less viable, and sometimes rolling forward is the
best way to proceed.

The article tries to advocate that rolling back is ALWAYS the correct thing to
do, which is demonstrably not true in certain circumstances.

~~~
gtyras2mrs
Or sometimes rolling back has political consequences (rolling back could be
construed as failure and an exec somewhere up the chain gets pissed off).

~~~
delusional
I get what you're saying, but by the time you are playing political games with
your software you've already lost the technical battle.

------
Aeolun
I’m not sure I agree with this viewpoint. Assuming nothing critical lived on
the service, and the worst side effect of the deploy was the client failing
once and reconnecting, I’m not sure there’s anything wrong with just
restarting the world.

Obviously if you know in advance you should not even start, but if you notice
after you are halfway done? Meh, chances are that a rollback would take as
long as finishing the push, and either way the service disruption continues
until it’s finished.

~~~
michaelt
When I hear "It's one of those things that talks to other instances of itself"
I think of a redundant, high-availability database.

Assuming every machine sends data to all its peers regularly, if I have a
cluster of 10 A machines and upgrade 1 machine to B, and the A machines crash,
that means I lose 90% of my capacity - likely making the sole survivor get
overwhelmed too.

"Outages are OK, clients can always reconnect or try again later" isn't the
usual standard HA databases are operated to :D

~~~
Aeolun
At that point the A machines are dead though. There is no point in restarting
them as A when you can also restart them as B, so I assume they don’t all die
in one go.

But I guess we just have too little information to make a really informed
decision about what you should do.

~~~
klodolph
> There is no point in restarting them as A when you can also restart them as
> B, so I assume they don’t all die in one go.

A is “baked in”, so it should be the default.

There are simply too many classes of bugs which take time to manifest. Memory
leaks, cache staleness, etc. So, for very good reasons, you consider version A
preferable to version B.

------
joatmon-snoo
ITT: people who don't do incremental rollouts or run reliable software stacks.

Sure, there are scenarios where you can justify rolling forward when you're
spraying packets of death everywhere. None of those scenarios are the default
- the default is roll back. (Imagine if there was some latent bug in the new
release corrupting data in transit that you didn't notice until 100%, and
you're not sure where it is, just that it's happening - now you literally
can't roll back, because your current release will hose the previous one. At
this point all you can do is go ¯\\_(ツ)_/¯.)

Also, the title of the post screams pretty loudly to me that the people making
the call to push the backwards-incompatible release didn't seriously think
through the implications of their decision.

------
zaroth
Working in a much less formal environment, I still religiously abide by a
backward compatible concept whereby new code can require new conditions (e.g.
schema changes) but those new conditions must be invisible/transparent to old
code.

It’s generally very risky to put your state of the world into a place where the
old code can’t even look at it, because rollouts and fallbacks become much
harder to orchestrate. Mostly, it gives you the property that you only need to
worry about writing environmental _upgrades_ and never _downgrades_, even if
you sometimes fall back to older builds.

However the story in TFA is a bit different. What we have is an
incompatibility between two versions running side-by-side. Nothing wrong with
Version A or Version B per se, but just that you cannot roll out B
progressively. Obviously the only fix is to immediately get your environment
homogenized in A or B, and maybe there is actually a valid decision / risk
analysis to go one way or another. For example, if you _don’t_ respect the
backward compatibility edict outlined above, it could be tricky to downgrade
your B instances back to A depending on what state changes in the data B may
have already made.

But the real problem is actually one level deeper. The fact that B got to the
point of being rolled out with a catastrophic incompatibility with A means
that DEV is fundamentally not testing for the ability to do a progressive
rollout before they actually start doing one. The new B code was never
actually tested in a heterogeneous environment with A, meaning these rollouts
must have been happening on blind faith alone.
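
A minimal sketch of what a heterogeneous-environment check could look like; the
handlers below are hypothetical in-process stand-ins for the A and B builds,
which a real test would start using the project’s own harness:

    # Hypothetical stand-ins for the two builds; in a real test these would
    # be the actual server binaries.
    def handle_v1(message: dict) -> dict:
        # Version A: only knows about the "name" field.
        return {"status": "ok", "echo": message["name"]}

    def handle_v2(message: dict) -> dict:
        # Version B: adds an optional "email" field but must still emit
        # messages version A can parse.
        return {"status": "ok", "echo": message["name"],
                "email": message.get("email")}

    def test_old_accepts_new_style_message():
        # A v2 peer sends extra fields; v1 must not choke on them.
        msg = {"name": "alice", "email": "alice@example.com"}
        assert handle_v1(msg)["status"] == "ok"

    def test_new_accepts_old_style_message():
        msg = {"name": "bob"}
        assert handle_v2(msg)["status"] == "ok"

    if __name__ == "__main__":
        test_old_accepts_new_style_message()
        test_new_accepts_old_style_message()
        print("mixed-version compatibility checks passed")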

------
th-th-throwaway
People are saying there isn't enough info but I think the author definitely
made the right analysis here.

The purpose of partial rollouts is to observe if there are bugs. The unusual
case here is that version B's bug is forgetting backward compatibility,
causing it to rapidly take down version A too. This means you can't _simply_
roll back B as usual. After you roll back, you still need to fix the broken A
instances. It's a lot of work, but it would be the right thing to do.

Instead, they went all in on version B to avoid the bug they _just_
introduced. This is completely reckless. You're skipping your usual process so
you never get a chance to watch for other bugs in B. You should actually
expect B to have even more bugs in it given you know it already has one major
production breaking bug.

Going all in on a version that you're not confident in just to fix the one bug
you know about is stupid.

~~~
ryanbrunner
With the info given, it's entirely possible that the bug was actually in
version A. Perhaps sending it a payload that is perfectly acceptable per the
API spec caused A to crash.

In that world, "fixing" B could involve sending invalid or unintended data to
work around the problem in A, or patching A before rolling out B (and once
you're at the point of rolling out A.2, you may as well just roll out B).

------
post_below
There aren't enough details... I feel like the author needed to at least make
a token effort to support the level of conviction she has that the developers
in question are so terribly wrong.

It sounds like A was buggy (it dies when it receives a particular sort of
request that it doesn't understand).

Maybe the devs knew about the bug in A? Maybe they decided it was better to
make version B work as intended rather than try to throw in a one time hack to
be backwards compatible with a broken version that was going to be a thing of
the past in... How long does deployment take? Is it even long enough to
matter? Maybe the most efficient solution was to allow some of the instances
to die during deployment? What kind of application is it? Possibly a few
failed requests while you wait for B to fully deploy isn't a big deal. Maybe
the client auto-retries quickly enough that the dropped connection is
negligible. Maybe there isn't a client at all.

The possibilities are pretty much endless given what we know.

~~~
journalctl
Or perhaps it’s just a short opinion blog post and shouldn’t be overanalyzed.

~~~
Dylan16807
If that couple of sentences of analysis (not counting the guesses about
context) is too much, then maybe it's not a great post to submit to HN.

------
xbhdhdhd
Knowing the code changes well and making the decision to push on or revert
back can't be second-guessed by armchair observers.

And of course it shouldn't happen in the first place, but hey, shit does
happen. No one has ever managed to figure that out, and testing isn't a silver
bullet.

~~~
klodolph
It absolutely can be second-guessed, because this is a very common scenario
and there is a consensus about the default way to deal with it—roll back
first, ask questions later. It’s easier to fix bugs while production services
are not on fire.

If you are seeing failures and want to roll forward, you should be able to
clearly articulate why this is better than rolling back, and what makes this
particular scenario different.

Otherwise, I’m going to be on the side of the armchair observers calling for
rollbacks.

~~~
hhas01
“roll back first, ask questions later”

Better yet: build code and processes appropriate for long-lived massively
distributed systems that will be incrementally upgraded over time. If the
system is architected right, it will never get into a state where a rollback
recovery becomes necessary. This is why we have Content Negotiation. This is
why we have Erlang. This challenge has been around for _decades_, and there is a
huge body of expert knowledge and tools upon which to draw when implementing
such systems, so any such complete catastrophic _basic_ failures now are
entirely down to PEBKAC, and remedied by a swift clue-by-four with a pink slip
nailed to the end.

There is a very simple principle underpinning distributed communication:
servers should _never_ make assumptions about who their clients are and what
they need. Talk to the client, find out what format(s) it’s willing/able to
accept, and serve it the best match. A client should _never_ need to know, nor
care, if it’s talking to a version A server or a version B server: if the
client says “I only understand version A data” then it’s the server’s job to
serve up data in that exact format, not to gripe and whine about how old and
out of date the client is, push it version B data instead, and then blame the
client for choking on it.
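
A minimal sketch of that principle, with a made-up message format standing in
for whatever the real protocol uses to negotiate (an Accept header, a handshake
field, and so on):

    def make_response_v1(user):
        return {"name": user["name"]}

    def make_response_v2(user):
        return {"name": user["name"], "email": user["email"]}

    FORMATS = {"v1": make_response_v1, "v2": make_response_v2}

    def serve(request, user):
        # The client declares what it can accept; the server picks the best
        # match it can produce, instead of assuming everyone speaks v2.
        accepted = request.get("accept", ["v1"])
        for version in ("v2", "v1"):          # newest first
            if version in accepted:
                return version, FORMATS[version](user)
        raise ValueError("no mutually supported format")

    user = {"name": "alice", "email": "alice@example.com"}
    print(serve({"accept": ["v1"]}, user))        # old client gets v1 data
    print(serve({"accept": ["v1", "v2"]}, user))  # new client gets v2 data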

Indolent developers who approach IPC the same as local messaging and then
blame everything but themselves when it barfs all over the place are the
absolute bane of this industry, and this shit is entirely on them. And shame
on the equally inept management culture that continues to let such incompetent
amateurs get away with it.

~~~
klodolph
You’re asking too much.

There will be bad rollouts. I know of no set of practices which prevent bad
rollouts. You talk about “indolent web developers”, well, that’s not
productive and pointing fingers doesn’t make your software work. Your software
will, in spite of your best practices, in spite of hiring the best people, in
spite of experience, sometimes fall over.

Yes, it will sometimes segfault.

~~~
hhas01
My software shits itself all the time. What matters is that it does so
_safely_. And when it doesn’t, I can tell you why, because I know what corners
have been cut and why, and I’m not afraid to accept and acknowledge my
responsibilities in such fuckups.

And yeah, I count on the fingers of one hand the number of web developers I’ve
dealt with over the last decade who I’d be willing to cross the road to piss
on were they on fire, and still have fingers to spare. They’re just the worst
of the worst.

There was NO excuse for the failure described in the article. There was NO
excuse for the described response to that failure. Yet such base incompetence
and gross irresponsibility is not only systemic but entrenched, rationalized,
and embraced in this industry. With responses like yours, it’s not hard to
tell why. Buncha _Children_.

~~~
klodolph
> There was NO excuse for the failure described in the article.

In this case, right. In general, stuff happens. There’s a tradeoff between
reliability and effort. The correct reliability target is not 100%, because
you can’t get 100% anyway, and as you approach 100% reliability the cost
increases without bound.

I’m not sure what the rest of your comment is about besides taking a big shit
on web developers and talking about how awful they are.

There is a precious small percentage of developers who are really good at
making reliable systems and they have the burden / responsibility of spreading
their knowledge. They work with the other actual developers you hire, those
beautiful imperfect developers who cut corners, test in production, and don’t
write tests.

You make changes to your culture and your practices. You build monitoring and
rollout automation. You increase test coverage.

If you just call people children you’re going to be there, on the sidelines,
watching other people build real products. You don’t teach people by making
fun of them.

------
JayMickey
This article fails to mention the consequences of the rush to push version B
rather than rolling back to resolve the issue. As far as I can tell, the result
met the needs of the business?

~~~
yuliyp
If it did, that's pure luck. It could have also had another bug. Rolling back
keeps your service up. Pushing forward faster maybe gets your service back up,
or maybe breaks it some other way.

When dealing with a complex system, roll back first, then figure out why it
broke and how to fix it for the next attempt.

~~~
JayMickey
I should clarify - I don't disagree that rolling back and fixing the bug (or
in a continuous delivery world - fixing forward as a priority) is the right
thing to do. I'm just saying the article leaves out a lot of details and fails
to articulate the reasons why the road taken was bad.

------
jakobegger
From experience with an app that doesn't auto-update, I know that the only
thing that gets people to upgrade is a critical bug that affects them.

People generally keep using old versions of software for years. 90% of users
don't care about new features or performance improvements or bug fixes that
don't directly affect them.

So I understand what they did. If they had instead introduced a version C with
a workaround to avoid crashing A, they would never have gotten rid of A. No
matter how hard they pushed C, some people would just continue to use A.

But if A starts crashing, and there's an update available that fixes the
issue, that's a really good motivator to get people to upgrade.

Sure, it'll suck for everyone for a short while, but it'll soon be over once
everyone has upgraded.

And, as a dev, I also know that crashes are not the most serious kinds of
bugs. They are usually easy to diagnose, and people realise that something is
wrong right away -- much better than a bug that silently corrupts data, or a
bug that causes subtly incorrect behavior.

~~~
sokoloff
If your goal is to get rid of A by any means necessary, make the business
decision to turn off A at some point.

Not having the backbone to make that decision, allowing the call to be forced
at some arbitrary time by a bug, and then concluding the bug was a good thing
(or at least had a silver lining) because it allowed the org to move away from
A suggests a possible leadership decisiveness problem to me.

~~~
user5994461
There is no organization of more than 500 employees that can take decisions
like that.

Developers of one service simply don't have the power to force other developers
in another part of the organization to do anything. That's assuming they're
even aware of the existence of that service and who is supposed to own it.

~~~
8note
counterexample:

the team I'm working on has successfully turned off ~10 systems with/without
replacements, and got all of our clients within a much larger than 500 person
org to migrate fully.

~~~
user5994461
I am not saying it's impossible. I'm doing that with hundreds of systems
covering 5000+ developers/clients.

Fact is, it's a nightmare to handle. It just takes one client who can't or
won't upgrade and it's blocked forever.

------
MrQuincle
If this were our situation, it would be a firmware update for our smart
outlets in the field.

You definitely want to make sure that updates are backwards compatible. We use
a mesh, and once there was an incompatibility in the mesh protocol. You want to
make sure this will never cause trouble. We decided to use another channel
altogether.

Our next update will update the bootloader and radio stack as well. Scary!
This is an update that uses multiple intermediate bootloaders so we fall back
to a working system at each step. We definitely do not want to have bricked
devices out there.
[https://github.com/naveenspace7/bluenet-bootloader/blob/intermediate-bl-2/intermediate/docs/flowchart.pdf](https://github.com/naveenspace7/bluenet-bootloader/blob/intermediate-bl-2/intermediate/docs/flowchart.pdf)

Nevertheless, we almost missed a bug. Some of our older devices were programmed
with a binary generated by objcopy without the --gap-fill 0xFF flag. It led to
a hard fault in the initialization of the new file system. Brrr.
[https://github.com/crownstone/bluenet/issues/73](https://github.com/crownstone/bluenet/issues/73)

------
tus88
> When you're already in a hole, QUIT DIGGING.

Except this time they did seem to dig their way out of the hole :D

------
scrame
I generally like Rachel's posts, but I don't know what the hell she's talking
about here.

~~~
pm
The updated version of the program, when pushed to servers, crashed the older
version of the program on other servers which had yet to be updated. Rather
than rolling back to diagnose and understand the cause, they charged on with
updating all the servers.

~~~
axaxs
That part I understand. What wasn't elaborated on is what problems that caused
and why it was a bad idea.

------
glitchc
It’s not clear whether their strategy was a success or a failure. If it
succeeded, the old bromide of “don’t rush pushes” doesn’t actually apply. I
mean, it worked, didn’t it? The reality of production software is that it’s
always pushed in somewhat of a rush, to meet a myriad of internal and external
deadlines.

I would counter the “don’t rush” bromide with “anything worth doing is worth
doing adequately.”

------
fouc
It sounds like the team doubled down on the deployment without actually
informing their managers or anyone at all.

------
keeperofdakeys
An interesting case of rollback vs. roll-forward from Bitcoin.

[https://github.com/bitcoin/bips/blob/master/bip-0050.mediawi...](https://github.com/bitcoin/bips/blob/master/bip-0050.mediawiki)

In that case they decided to roll back.

------
omgtehlion
I myself was in such a situation, though after a heated argument we decided to
revert to the previous version (A) and not push forward.

Of course, a well-thought-out release policy helps immensely (how to test, when
to release, staged rollouts, etc.).

------
splittingTimes
It seems odd to me that they do not version their exchange protocols, file
formats, APIs, etc., and then check whether both participants have compatible
versions when interacting.
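
Even a check as small as the sketch below, run at connection time, would turn
the crash into a clean and observable refusal. The version numbers and field
names are hypothetical:

    PROTOCOL_VERSION = 2
    MIN_SUPPORTED_PEER_VERSION = 1

    def handshake(peer_version: int) -> bool:
        """Refuse to talk to peers we know we can't interoperate with."""
        return peer_version >= MIN_SUPPORTED_PEER_VERSION

    def on_connect(peer_version: int) -> str:
        if not handshake(peer_version):
            # Fail loudly and cleanly instead of sending frames the peer
            # will misparse and crash on.
            return (f"rejected: peer speaks v{peer_version}, "
                    f"need >= v{MIN_SUPPORTED_PEER_VERSION}")
        return f"ok: negotiated with peer v{peer_version}"

    print(on_connect(1))
    print(on_connect(0))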

------
demarq
Usually you want a release to be all the way deployed or all the way rolled
back, not sitting in some half state.

That's because it gets really hard to reason about the state of mutable
resources like databases and storage that are being written to by two
different versions of the application simultaneously.

In this situation, pushing on with the deploy or rolling it all back are both
valid options.

However, everyone knows rollbacks are always more difficult than deploys.

------
janpot
I'd say this is a good resiliency test. If anything, they should do this on
purpose from time to time.

------
jpswade
Mistakes will happen, especially when your mantra is release early and release
often.

~~~
user5994461
Seems like a successful deployment as per "move fast and break things"

------
z3t4
I would probably call that a feature...

~~~
Thorrez
The purpose of slow rollouts is to catch bugs before they affect everyone. We
already have evidence this is buggy software, so slow rollouts are even more
important.

And even if there aren't any other bugs, this scenario must have caused a lot
of stress for the people in charge of the rollout.

------
peignoir
Death by push

------
tbyehl
It always amuses me how the HN collective tears apart whatever rachelbythebay
writes. Maybe they should change their online identity to rossbythebay.

The tale they've shared should be obvious: When a deploy is causing problems,
stop. And here are some generic techniques that could have been used to avoid
ending up in a situation where one might choose to plow ahead instead of
rolling back.

How is this controversial to anyone? This is Formalized Change Process 101.

~~~
soneca
I think that any post that is technical and related to software* goes through
some process of being torn apart collectively by HN.

I have the opposite impression, that HN collectively appreciates Rachel's posts
very much, judging by the sheer number of her blog posts that are upvoted to
the front page recurrently. Not many personal blogs reach that number of
upvotes on HN.

*Maybe even all kinds of posts, but the ones that are technical and about software are just the sweet spot, because HN is full of software engineers.

------
jmmcd
Without some juicy details of customers being screwed over, or name-and-
shaming evil cackling developers, this story is a bit mom-and-apple-pie.

------
frei
> What could have happened instead? Many things.

Yes, yes, many things, certainly. Obviously, fix-forward is very bad because
of these many things.

As a highly compensated software engineer, of course _I_ know what these
things are, but, just checking, what would _you_ say they are?

~~~
tomjakubowski
You misread that "many things", which refers to hypothetical actions the team
could have taken (and which the post immediately describes), not hypothetical
outcomes of their actual actions.

------
phendrenad2
Since rachelbythebay didn't tell us what the result of these broken sessions
was, I assume it wasn't critical. Or at least that's the attitude of the HN
crowd complaining about a lack of details. Developers only pay attention to
these kinds of stories when the tale involves someone getting fired or money
being lost.

------
jonny383
I guess Rachel had a bad day at work? Rachel's writing is normally top-notch,
but this is whiny, vague, and frankly annoying.

