
Releasing software to the fleet far too quickly broke stuff - guiambros
http://rachelbythebay.com/w/2020/04/23/rel/
======
hpoe
I work at a very large IT org (1000+ people), and the only way I have ever
seen anything change substantially and persistently was when the change that
was introduced made it far easier to do the right thing. Ultimately, if you
want people to do the right thing you can't force them into it; you have to
make them want it.

It seems to me there are lots of opinions on how to change an organization or
introduce improvements, but ultimately, if you aren't making people's lives
easier and making them want the change you are trying to bring about, you'll
find it almost impossible to succeed. Human inertia is just too great.

~~~
Cthulhu_
The only major changes I've seen in companies' IT organizations were where
the existing IT department had declined or been let go over time and any
development was in slumber mode. Then an excited manager comes around and
erects a new IT organization from nothing (usually leaning heavily on
consultants and freelancers), having sold a Digital Transformation to the
company. The new IT / developer department spends millions a year in conjuring
up a new server stack (cloud native) and a new software architecture
(microservices woo), often building on top of existing ancient software
(probably running on a mainframe three further abstraction levels down), with
the intent of eventually replacing them. Eventually. Maybe.

Usually those things got bogged down in the end because after 1-2 years the
first wave of developers got bored and moved on to other opportunities, and
the architecture they made was poorly abstracted microservices communicating
via undocumented and unstructured REST APIs, with not nearly enough
end-to-end tests or traceability to reliably make changes or add features.

[/rant]

~~~
DerArzt
That got a little too real for me; I started thinking about work.

------
hackeryogi
This is quite a representative case of the relationship between infra teams
(or PaaS or SRE, depending on where you work) and application teams.

Being part of the platform team, I've been in the exact same situation before
at an ex-employer. We assumed that _infra problems_ were _everyone's
problems_ - how could anyone who cares about their application _not_ adopt
the latest-and-greatest platform (or tool) we released? A lot of time was
spent on figuring out the 'best approach' to getting everyone on board
quickly - we tried the carrot and the stick. We failed miserably at both.

The trouble is, as someone else mentioned here - we didn't really go and _talk
to the customer_. Internal customers are customers too - and 'product'
thinking should be equally applied to all internal platforms. The general rule
applies - talk to the customer and understand what they're going through. They
have their own sets of problems and priorities.

(a) Is the latest way actually _saving them a lot of time_?

(b) Can we take another 2 weeks and make it even easier for them to come on
board?

(c) If they're really busy and/or lack the expertise, can we carve out some
time to give enough training sessions?

Yet, this never happens and we invariably end up blaming the application teams
for _not doing this seemingly simple_ change. It may seem downright stupid
from the platform engineer's PoV, but we are partly to blame.

This also leads me to think that the cloud wars will be won by the
<cloud-provider> that does this job better - understanding customer needs on
a day-to-day basis and building platforms for them.

------
erulabs
Really good post - "making it hard to do the wrong thing" is more or less my
entire design ethos when it comes to software, particularly developer tools.

Somewhat on-topic, instead of re-inventing the wheel on this one, I've
recently been using Flagger ([https://flagger.app/](https://flagger.app/)) and
have been exceptionally happy with it. Automated rollback on error rates,
metrics or integration tests. It's so nice to finally be able to stop writing
this code at every company I work for!

~~~
sebcat
> "making it hard to do the wrong thing" is more or less my entire design
> ethos when it comes to software, particularly developer tools.

See also poka-yoke:
[https://en.m.wikipedia.org/wiki/Poka-yoke](https://en.m.wikipedia.org/wiki/Poka-yoke)

~~~
Someone
I found it strange that that page doesn’t mention the example I immediately
thought of after reading the first sentence: large dangerous machinery such as
presses or lasers that require two hands to operate to prevent the case where
the operator, in a hurry, still has one hand in the danger area when he
presses the “do it” button with the other.

------
paxys
Large-scale deploy systems for a distributed environment are ridiculously hard
to build. We had to set one up quickly and on a budget, so in lieu of fancy
engineering we set up three rules:

\- Any change, whether it's a one-line config update or an entirely new
backend, can only go out to a maximum of x% of hosts at a time

\- Hosts are taken fully out of rotation before getting an update

\- Every change has to be backwards compatible, and there can be no side-
effects of random rollbacks, sometimes even across several versions

The first two are easy to enforce programmatically, and the third one relies
on the team. While the trade-off is that it's harder to make relatively simple
changes, it has served us well so far.
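
For illustration, the first two rules can be enforced with a thin wrapper
around whatever deploy tooling is already in place. A minimal sketch, not
the actual system described above; `lb` and `deployer` are hypothetical
interfaces:

    MAX_BATCH_PCT = 5  # rule 1: never update more than x% of hosts at a time

    def rolling_deploy(hosts, version, lb, deployer):
        batch_size = max(1, len(hosts) * MAX_BATCH_PCT // 100)
        for start in range(0, len(hosts), batch_size):
            batch = hosts[start:start + batch_size]
            for host in batch:
                lb.remove_from_rotation(host)  # rule 2: drain before updating
                deployer.install(host, version)
                if not deployer.health_check(host):
                    raise RuntimeError(f"{host} failed health check; halting rollout")
                lb.add_to_rotation(host)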

~~~
seanwilson
> Every change has to be backwards compatible, and there can be no side-
> effects of random rollbacks, sometimes even across several versions

How does that rule work out in practice? If you're doing staged rollouts of
new builds because you think there's a decent chance those new builds have
bugs in them, shouldn't you realistically expect that rollbacks could have
unexpected side-effects?

~~~
tantalor
> expect that rollbacks could have unexpected

Expect the unexpected? Sure, anything can happen.

But paxys is specifically referring to expected (known) side-effects of
rollbacks. For example if Foo and Bar are on versions Foo.3 and Bar.3, and we
know rolling back to Bar.2 breaks Foo.3, then you broke backwards
compatibility. Generally all combinations of possible rollbacks must continue
to work, e.g., {Foo.2, Foo.3} x {Bar.2, Bar.3}. The number of rollbacks is
limited to 1 or 2 versions, e.g., Foo.1 and Bar.1 cannot be used anymore.
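
For illustration, the matrix of combinations that has to keep working can be
enumerated mechanically; a small Python sketch using the hypothetical service
and version names from the example above:

    from itertools import product

    # Rollbacks are limited to the last two releases of each service.
    SUPPORTED = {
        "Foo": ["Foo.2", "Foo.3"],
        "Bar": ["Bar.2", "Bar.3"],
    }

    # Every combination in the cross product must stay mutually compatible;
    # a real pipeline would run integration tests against each one.
    for combo in product(*SUPPORTED.values()):
        print(combo)  # ('Foo.2', 'Bar.2'), ('Foo.2', 'Bar.3'), ...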

------
aahortwwy
It's funny that the big tech companies have such high hiring bars and yet when
you get inside there are large teams of people actively working to protect the
company from the stupidity of their colleagues.

~~~
hinkley
I’m mostly trying to protect myself from my own stupidity.

That other people fall prey to the same issues in many scenarios just makes
the pill go down easier for them.

~~~
aahortwwy
When I say stupidity I don't mean mistakes.

From the article:

> At that point, it was decided that something easier had to be done. The
> barrier to entry for safe and sane rollouts had to be lowered even more to
> bring more teams on board.

Why does something have to be easy for a team of professional software
engineers to do it correctly? Make things easier to improve productivity,
sure, but in my opinion awkward/difficult/complex tooling isn't a license to
cut corners.

~~~
hinkley
If you meet an asshole in the morning, you met an asshole. If you meet
assholes all day long, you're the asshole. I can think of a few current or
former coworkers who talk about how everybody is an idiot and they are the
biggest idiots I know.

Insisting people use a tool that is capable of doing dangerous things
correctly, every time, is white-knuckling. Yelling at them about screwing it
up is running your organization by shame. When you do that, eventually only
the people with very high pain tolerance ever get promoted, or even heard. Do
you really want to work on a team run by masochists/sadists?

Similar to "make the change easy, then make the easy change": if you want to
have a dialog with people about doing better, the easiest way to do that is
to give them a reliable tool (literal or figurative), and then start
insisting they use it when it becomes obvious they haven't. The people who
insist on doing
things the hardest way possible can still do it as long as nobody else feels
the consequences. But typically what they're going to learn is that where
self-reflection is concerned, they're bad at math and they have way more
frequent problems and far more collateral damage than they like to tell
themselves.

~~~
aahortwwy
There are a few strawmen in this post.

1\. I never said everybody is stupid. The vast majority of people in Big Tech
are great, but there ARE stupid people there too. I resent the implication of
your first paragraph.

2\. Who said anything about yelling or other toxic behaviors? There are
options other than "engineer a foolproof system" and "institute a sado-
masochistic management strategy".

3\. I never advocated for doing things the hardest way possible, I advocated
for doing things correctly (to the best of your ability) even when it is
difficult to do so.

The example given in the article is a team choosing to do a global rollout
when the option existed to do a staged rollout. There are myriad technical
things you can do to make it easier and faster to do a staged rollout, harder
to do a global rollout, reduce the risk of a global rollout, etc. etc. ... but
what about instilling a healthy respect for production in your colleagues?
That's been conspicuously absent from most postmortems I've sat in on.

------
csours
> "This was one of those times when we needed to build a system that made it
> stupidly easy to do the right thing, so that it would actually be MORE work
> to do the wrong thing"

This is basically the core of UX. It's kind of amazing to me that UX is so
poorly represented in management tools.

------
greendave
The joke about building idiot-proof software comes to mind, but it's
surprising to me how many systems really don't do a good job of making the
right thing easy and the wrong thing hard. Dangerous defaults have consequences!

Of course sometimes there's no substitute for learning things the hard way.
Where practical, I've found having the responsible team on the hook for
cleaning up their mess tends to work wonders in terms of getting people to be
appropriately careful. That's not to say mistakes can't still happen, but it
cuts them down a lot.

------
primitivesuave
Really interesting post. Can anyone comment on how serverless deployments help
and/or exacerbate the deployment problem?

To give a theoretical starting point - Acme corporation has a serverless API
deployed at api.acme.com powering website acme.com. The staging system is
deployed at staging.api.acme.com, and the corresponding web app (e.g.
staging.acme.com) targets that API. Staging gets deployed to production via a
pull request that triggers CI/CD, which has no ability to do an X% rollout
(i.e. the underlying serverless function is updated).

Developer John Smith is writing a feature and merging to his personal branch
"jsmith" \- he can test his branch with jsmith.staging.api.acme.com and a
corresponding development web app. When he's finished the feature, he does a
PR to the staging branch, which is packaged into a PR to live.

You may be able to see I'm trying to get input on our own release system :)
It's worked quite well so far, but I'm still paranoid about not having a
gradual roll-out - in my experience at bigger companies, there was always
some A/B testing in place so bad deploys could be reverted quickly. If the
sysop gurus of HN gathered on this thread can pass along any
comments/critique, it would be greatly appreciated.

~~~
twic
To do a release, create a new serverless function with the new code.
Initially, no traffic routes to it, all traffic goes to the old function.
Change things so 1% of traffic goes to the new function, and see if it looks
okay. If it does, make that 2%, then 3%, or 4%, or 10%, etc. Once it's 100%,
delete the old serverless function.

The provider hosting your serverless functions might not give you that much
control over routing. In that case, you would need to write another serverless
function which sits in front of your serverless functions and does the
routing. If you ever need to update that, well, then you have to cross your
fingers.
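
On AWS Lambda specifically, that kind of weighted shift can be done with
alias routing config instead of a hand-rolled router. A minimal boto3
sketch; the function name, alias, and version numbers are made up:

    import boto3

    lam = boto3.client("lambda")

    # Alias "live" keeps serving version "41", while 1% of invocations
    # are routed to the newly published version "42".
    lam.update_alias(
        FunctionName="api-handler",  # hypothetical function name
        Name="live",
        FunctionVersion="41",
        RoutingConfig={"AdditionalVersionWeights": {"42": 0.01}},
    )

    # Once "42" looks healthy at higher percentages, point the alias at it
    # and drop the extra weight.
    lam.update_alias(
        FunctionName="api-handler",
        Name="live",
        FunctionVersion="42",
        RoutingConfig={"AdditionalVersionWeights": {}},
    )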

~~~
primitivesuave
This seems like a great strategy, thank you for your input! It can indeed be
accomplished by AWS Global Accelerator along with application load balancers
targeting Lambda functions to route a percentage of traffic. I am not sure
about the overhead of having a Lambda function to route requests, I’ll have to
look into the latency that adds (especially for cold starts).

~~~
ultrafez
Using a lambda to route requests to other lambdas would result in you
effectively doubling your Lambda cost, as your routing lambda would sit idling
while waiting for the target lambda to return a result. If you're using API
Gateway, it has a canary deployment option which can route percentages of
customer traffic to different Stages, which gives you the gradual rollout
capability that you wanted.
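
For reference, a minimal sketch of what that canary option looks like
through boto3 (the API id and stage name are placeholders):

    import boto3

    apigw = boto3.client("apigateway")

    # Deploy new code as a canary on the "prod" stage: 5% of requests hit
    # the canary deployment, the other 95% stay on the current one.
    apigw.create_deployment(
        restApiId="abc123xyz",  # placeholder REST API id
        stageName="prod",
        canarySettings={"percentTraffic": 5.0, "useStageCache": False},
    )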

------
diebeforei485
This lesson, but in a UX context - Apple generally does a good job of making
it hard for users to do the wrong thing (for their definition of wrong).

------
hinkley
Sounds like a setup for immutable servers, and I think I'm okay with that.

If new copies of your service don't come up cleanly, you do not power down the
old ones. If old servers can't talk to the new service, you get alarm bells
going off while most of the old copies are still up and running. You've also
partially serialized the deployment so you get some peak shaving.
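
A minimal sketch of that ordering, with hypothetical `provisioner` and `lb`
interfaces standing in for whatever launches instances and fronts traffic:

    def blue_green_swap(provisioner, lb, old_instances, image):
        # Launch the replacement fleet alongside the old one.
        new_instances = [provisioner.launch(image) for _ in old_instances]

        if not all(provisioner.health_check(i) for i in new_instances):
            # New copies didn't come up cleanly: raise the alarm and keep
            # the old fleet serving traffic.
            provisioner.terminate(new_instances)
            raise RuntimeError("new instances unhealthy; old fleet kept running")

        # Only after the new copies are verified do the old ones go away.
        lb.add_to_rotation(new_instances)
        lb.remove_from_rotation(old_instances)
        provisioner.terminate(old_instances)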

I wonder sometimes if the problem is that despite our protestations (and
things like naming servers by unpronounceable names), we still think of
servers running, instead of instances running. So the fact that the 20 copies
of my service keep bouncing from machine to machine is at the very least a
logistics issue, and possibly unsettling.

------
user5994461
Been there, done that.

The next rant will be: Massive outage because of a bug that was fixed last
month. But the fix only shipped to 50% of the hosts, because the deployment
system doesn't deploy everywhere by default.

------
yjftsjthsd-h
On the one hand, there are decent technical tools to deal with this kind of
thing; I'm directly familiar with Ansible's support for controlled rollouts,
most container management layers happily support blue/green deploys, etc. On
the other hand -

> Some people obviously couldn't be bothered to step their release along over
> the course of a couple of days, and we had to find some way to protect the
> production environment from them.

That's easy: HR. By all means start by being nice, but if people refuse to be
careful then they can burn down someone else's prod.

~~~
paxys
HR is exactly the worst approach to deal with organizational and process
problems like this. What do you do? Fire the entire dev team involved with the
decision? Fire their managers who failed to set the right expectations? Fire
the product team for setting deadlines that were too aggressive? Fire the CEO
because hey, ultimately they are the one responsible for everything?

Even if you do all that, you're not going to go back in time and fix things
for your customers, and you're definitely not going to prevent the replacement
developers from making the exact same mistakes.

It takes really good leadership to build an engineering culture where good
practices are valued and people learn from mistakes, which is why filling
those positions is so hard and expensive. A company whose solution for every
outage is to fire people isn't going to last very long.

~~~
AnimalMuppet
GP said:

>> That's easy: HR. By all means start by being nice, but if people refuse to
be careful then they can burn down someone else's prod.

If people _refuse_ to be careful. If they make a mistake once, teach them. If
they make a mistake again, that happens. But if they make a mistake
_regularly_ , warn them, then warn them sternly, then fire them.

~~~
compiler-guy
Maybe so, but then you still have the problem, because in a big organization,
some newbie is always present and going to make that first mistake. So you
still need to make it hard for people to make mistakes.

~~~
perl4ever
That's simple; you just make sure nobody ever leaves and then you don't have
to hire new people...

~~~
adrianN
I worked at a company with that strategy. It worked reasonably well until
they realized that 80% of the team would retire in the next few years.

~~~
csours
Yeah, and you get these weird inbred tools and insane levels of
Not-Invented-Here.

------
jedieaston
It seems like the system for sending out packages should've been designed so
that, unless a package is being sent out as a critical security patch (and
signed off by a security manager or something in the log), it is deployed to
5% of the fleet per day (unless someone opts in to get it early for some
reason, on a package-by-package basis). In a big fleet, anything else would
still be chaos if someone clicked the wrong button.

