
Limiting breakage with a software deployment checklist - plinkplonk
https://blog.gojekengineering.com/limiting-software-infant-mortality-rate-decoding-gojek-deployment-checklist-1c6cc3e28df
======
pnevares
The blog post refers to "RCA" three times and "RSA" once, and after one
read-through it doesn't seem to define either acronym.

Also this?

> And once you fail a build, then every team member in your team has to do
> deployments and go through the deployment checklist.

Sounds like there's a piece of context missing from the section before it. You
have to do the checklist to deploy, and if you fail once, then every member of
your team has to do the checklist as well?

~~~
zacherates
"RCA" likely means "Root Cause Analysis". I'm not sure how to interpret RSA
other than as a typo for RCA.

------
iandanforth
This is a good place to start but you also need to commit to automating these
steps. Copying and pasting URLs is a waste of time, rollbacks should be
automatic, contributor lists should be auto-generated etc. Since it takes time
to automate each new thing a checklist is still a good idea, but you have to
recognize the danger of engineers subverting/rejecting the process if you let
the list grow much at all.
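
As a minimal sketch of the "rollbacks should be automatic" point: the `deploy`, `health_check`, and `rollback` callables below are hypothetical placeholders for whatever your pipeline really does. None of them come from the blog post.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[], None],
    health_check: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Run deploy, then roll back automatically if it raises or the health check fails.

    Returns True if the new release stays live, False if it was rolled back.
    """
    try:
        deploy()
        healthy = health_check()
    except Exception:
        # A crash during deploy or during the check counts as unhealthy.
        healthy = False
    if not healthy:
        rollback()
    return healthy


if __name__ == "__main__":
    log = []
    ok = deploy_with_rollback(
        deploy=lambda: log.append("deploy"),
        health_check=lambda: False,  # simulate a release that fails its check
        rollback=lambda: log.append("rollback"),
    )
    print(ok, log)
```

Once something like this wraps the deploy step, "verify the rollback plan" can come off the manual checklist.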

~~~
sitharus
Agreed. At my job we've integrated all the tools. Developers have to put the
ticket number in the commit manually, but then the tooling will attach the
commit to the ticket automatically. There's also a bot that checks what you're
committing and adds checklists for riskier things.
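
Roughly, the "attach the commit to the ticket" step only needs to pull ticket IDs out of the commit message. A sketch, assuming a made-up ID format (uppercase project key, dash, number, e.g. "OPS-123"); the real format and the call into the tracker depend on your tooling:

```python
import re
from typing import List

# Assumed ticket format: uppercase project key, a dash, and a number (e.g. "OPS-123").
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_ticket_ids(commit_message: str) -> List[str]:
    """Return the distinct ticket IDs mentioned in a commit message, in order of appearance."""
    seen: List[str] = []
    for ticket_id in TICKET_PATTERN.findall(commit_message):
        if ticket_id not in seen:
            seen.append(ticket_id)
    return seen


if __name__ == "__main__":
    message = "OPS-123: fix rollback script (see also WEB-7, OPS-123)"
    print(extract_ticket_ids(message))
```

A commit hook or CI job could reject commits where this returns an empty list, which is what makes the manual part enforceable.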

Then the ticket is picked up on merge to master which creates a change list.
Everything gets deployed to QA where it's tested. QAs go over the changes and
approve it, then developers push to production. Afterwards a random developer
from a different team is picked to do a post-deploy review which spreads
knowledge and sometimes picks up other changes.

If something can be automated it should be :)

------
Annatar
"We don’t restrict a deployment trigger to specific people. As soon as you are
done, go ahead."

So they have no change management process in place and are basically hacking
on it 'till it works. Very professional.

Does not look like they ever heard of the capability maturity model, either.

~~~
mnd999
One person’s professional is another’s bureaucratic. There’s no one approach
right for every team in every situation. It sounds like they’ve found something
that works for them for now and that’s great.

~~~
Annatar
I disagree. The United States Department of Defense, the Advanced Research
Projects Agency and the Software Engineering Institute at Carnegie Mellon
University disagree as well.

I'm very much inclined to defend their methodology and position on this since
that is their core area of expertise, and I've experienced it working on a
very large scale (tens of thousands of servers) rather than a bunch of hacking
efforts at some company on the Internet.

[https://www.amazon.com/Capability-Maturity-Model-Guidelines-...](https://www.amazon.com/Capability-Maturity-Model-Guidelines-Improving/dp/0201546647/)

[https://www.amazon.com/Managing-Software-Process-Watts-Humph...](https://www.amazon.com/Managing-Software-Process-Watts-Humphrey/dp/0201180952/)

------
protomyth
On the subject of database indexes, know when you should deploy the indexes.
This comes up when adding tables that get populated during deploy. Sometimes,
depending on your database, it's a really bad move to add the index to the
empty table. It's an interesting problem because it might be time critical if
you are doing enterprise deploys where you are taking down the whole system
for the duration. Probably a less common circumstance these days. Also,
getting rid of temporary code needed only for the conversion is a super good
thing to remember.
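
A sketch of the "populate first, index afterwards" ordering, using SQLite purely as a stand-in. The `accounts` table and the index name are invented for the example, and whether this ordering actually wins depends on your database and data volume, so measure before relying on it:

```python
import sqlite3


def load_then_index(conn, rows):
    """Create the table, bulk-load it, and only then build the index."""
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT)")
    # Bulk load first, so the inserts do not have to maintain the index row by row.
    conn.executemany("INSERT INTO accounts (id, email) VALUES (?, ?)", rows)
    # Build the index once, over the fully populated table.
    conn.execute("CREATE INDEX idx_accounts_email ON accounts (email)")
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_then_index(conn, [(i, f"user{i}@example.com") for i in range(1000)])
    print(conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0])
```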

------
brockers
I know dev shops don't like processes and forms, but that "checklist" is
exactly that. It simply shows that processes and procedures are useful tools
as long as their use is kept limited.

~~~
Cthulhu_
It's a means of formalizing the deployment before automating it. Which makes
sense: to automate something, you first need to understand what you're trying
to automate.

------
runlevel1
We do something similar for changes that are risky, complicated, or manual:

[https://sendgrid.com/blog/change-management-keep-it-simple-s...](https://sendgrid.com/blog/change-management-keep-it-simple-stupid/)

Doing it formally for every deployment seems like it would kill productivity.

~~~
spc476
Where I work, our programs are first installed in QA, then staging and finally
production. For each step, there's a web form we fill out, listing the release
number, tracking system ID, what program, any config changes, any database
changes, any special instructions and what testing has been done. Once
submitted, the OPs team pulls the proper program from the build server,
updates the config and pushes the stuff automatically (I know they use
something like chef or ansible, but I'm not in that department (which is on
the other side of the country) so I'm not sure of the exact details).
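
To give an idea of what such a form carries, here is a rough sketch of it as a record. The field names mirror the list above; which fields count as required is an assumption for the example, not how the real form behaves:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeploymentRequest:
    release_number: str
    tracking_id: str
    program: str
    testing_done: str
    config_changes: List[str] = field(default_factory=list)
    database_changes: List[str] = field(default_factory=list)
    special_instructions: str = ""

    def missing_fields(self) -> List[str]:
        """Names of the (assumed) required fields that were left blank."""
        required = ("release_number", "tracking_id", "program", "testing_done")
        return [name for name in required if not getattr(self, name).strip()]


if __name__ == "__main__":
    request = DeploymentRequest(
        release_number="1.4.2",
        tracking_id="TICKET-1",  # hypothetical tracking system ID
        program="billing",
        testing_done="",
    )
    print(request.missing_fields())
```

The point of keeping it structured is that the same record can be passed along unchanged at each step, from QA to staging to production.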

For the final push into production, the developers have to be online (2:00 am
Eastern), along with QA and OPs. QA or the developers can abort the deployment
[1] for any reason, and rolling back is trivial. So far in my seven years at
The Company, I've had to abort a production deployment once (yes, I noticed an
issue and aborted the deployment---it was totally my call).

[1] Our customers are the various Monopolistic Phone Companies. We have scary
SLAs. We get approval for deployment from them. Downtime costs us Real Money.
I don't get to deploy stuff all that often (stuff that doesn't talk directly
with the customers is easier to deploy---unfortunately, most of what I work on
talks directly with the customers).

------
trollopTheJope
what an unfortunate metaphor

~~~
dang
Since you're not the only commenter who complained, we've taken that bit out
of the title above.

