
Moving Fast and Securing Things - Chris911
https://slack.engineering/moving-fast-and-securing-things-540e6c5ae58a
======
wpietri
One of the things I think about when analyzing organizational behavior is
where something falls on the supportive vs controlling spectrum. It's really
impressive how much they're on the supportive end here.

When organizations scale up, and especially when they're dealing with risks,
it's easy for them to shift toward the controlling end of things. This is
especially true when internally people can score points by assigning or
shifting blame.

Controlling and blaming are terrible for creative work, though. And they're
also terrible for increasing safety beyond a certain pretty low level. (For
those interested, I strongly recommend Sidney Dekker's "Field Guide to
Understanding Human Error" [1], a great book on how to investigate airplane
accidents, and how blame-focused approaches deeply harm real safety efforts.)
So it's great to see Slack finding a way to scale up without losing something
that has allowed them to make such a lovely product.

[1] [https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648257](https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648257)

~~~
dvtrn
Having recently escaped from a "control and blame" environment, I can say this is
also horrible for releases: left unchecked, more and more energy is expended
doubling down on architecting for perfection in fault tolerance. Risk aversion
goes through the roof and cripples decision making, and before you know it your
entire team of developers has become full-time maintenance coders; you stop
innovating, spend cycles creating imaginary problems for yourself, and begin
slowly sinking.

We had a guy who more or less appointed himself manager when the previous
engineering manager decided he couldn't deal with the environment anymore. His
insistence on controlling everything resulted in a conscious decision to
destroy the engineering wiki and knowledge base and to force everyone to funnel
through him, making himself the single source of truth. Once his mind was made
up on something, he would berate other engineers, developers, and team members
to get what he wanted. Features stopped being developed, and things began to
fail chronically. Because senior leadership weren't tech people, they all
deferred to him, and once they decided to officially make him engineering
manager (for no reason other than that he had been on the team the longest,
because people were beginning to wise up and quit the company), all but 2 of
the 12-person engineering department quit because no one wanted to work for
him.

Imagine my schadenfreude after leaving that environment to find out they were
forced to close after years of failing to innovate, resulting in the market
catching up and passing them. Never in my adult life have I seen a company
inflict so many wounds on itself and then be shocked when competitors start
plucking customers off like grapes.

~~~
wpietri
For those for whom this excellent description has resonance, I strongly
recommend the book, "Why Does He Do That? Inside the Minds of Angry and
Controlling Men". [1] It's nominally written about domestic abuse, but its
descriptions of abuser psychology and its taxonomy of abuser behaviors have
been really helpful to me in a work context.

[1] [https://www.amazon.com/Why-Does-He-That-Controlling-ebook/dp/B000Q9J0RO](https://www.amazon.com/Why-Does-He-That-Controlling-ebook/dp/B000Q9J0RO)

------
vasilakisfil
I am in favor of checklists for certain critical tasks, even if they are
repetitive and/or boring. I think checklists are underrated.

~~~
ggregoire
I am in favor of (check)lists for everything. I am actually surprised by how
few developers take notes and make lists. It's one of the most important parts
of my workflow.

~~~
icebraining
I've been convinced that checklists are great (by theory and by practice), yet
I still write way fewer than I should.

I strongly dislike repetitive mental work, and writing a checklist is
essentially resigning myself to the fact that such work will be necessary.
Until I write it down, I can still convince myself I'll be able to automate the
process.

~~~
MaulingMonkey
I use checklists to turn mental work ("uhh, let me think, what all do I need
to do...") into straightforward physical work ("check, next step..."). I also
use checklists as a first step towards automation and a stopgap until
automation is complete, since I live in a constant state of infinite backlog.

If I run through the checklist a couple of times and it seems to:

        [ ] Cover everything
        [ ] Not require complicated decision making or value judgments
        [ ] Have few edge cases in need of handling
        [ ] Not require automation-opaque tooling
        [ ] Not change more frequently than I execute the checklist

Then I know I have a prime candidate for automation, and already have great
documentation of exactly what to automate.
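
A minimal sketch of that "checklist as stopgap" idea (the step names and the
manual/automated split below are hypothetical):

    # Each step is either manual (prompt a human) or automated (a function).
    # Steps migrate from manual to automated one at a time, so the
    # checklist doubles as an automation roadmap.
    def build_artifacts():
        print("building...")  # placeholder for the real build step

    STEPS = [
        ("Bump version number", None),                 # still manual
        ("Build release artifacts", build_artifacts),  # automated
        ("Update the changelog", None),                # still manual
    ]

    for name, action in STEPS:
        if action is not None:
            print(f"[auto]   {name}")
            action()
        else:
            input(f"[manual] {name} (press Enter when done)")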

~~~
Cyphase
Awesome, a meta-checklist.

------
ejcx
I love this! If you're part of a security team and you're not automating your
processes and procedures, your team is going to drown. You must automate.

It seems like a simple checklist app, but having a non-Jira process that takes
only a few minutes is so valuable, and "security reviews" and "threat models"
as part of your SDLC take insane amounts of time and honestly aren't super
helpful.

~~~
insensible
What I think is brilliant here is that that sort of work can take place
separately and feed back into these checklists when problems or deficiencies
are found. Basically, the security checklists are a deliverable that can be
iterated on independently, while teams still benefit from the existing version.

------
maccard
> At the start of 2015, Slack had 100 employees. Today, we’re over 800 people!

That's a lot of people...

~~~
corrigible
Seems like their client memory usage scales with their headcount

------
punnerud
I like the additional question if you are using C/C++: "We confirm that we
really, really need to use a non-memory-safe language." PHP/Python/C/C++ get
Medium Risk directly; Low Risk covers
WebApp/API/MessageServer/iOS/Android/Electron/WindowsPhone.
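
A rough sketch of the tiering that implies (assumed logic reconstructed from
this comment, not goSDL's actual code):

    # Assumed mapping: the listed languages bump the baseline to Medium;
    # otherwise the listed component types start at Low.
    MEDIUM_RISK_LANGUAGES = {"PHP", "Python", "C", "C++"}
    LOW_RISK_COMPONENTS = {"WebApp", "API", "MessageServer", "iOS",
                           "Android", "Electron", "WindowsPhone"}

    def baseline_risk(language: str, component: str) -> str:
        if language in MEDIUM_RISK_LANGUAGES:
            return "Medium"
        if component in LOW_RISK_COMPONENTS:
            return "Low"
        return "Medium"  # assumed default for anything unlisted

    print(baseline_risk("C++", "MessageServer"))  # -> Medium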

------
spydum
So glad they finally published this; I saw the OWASP AppSec talk and was
eagerly awaiting it.

However, I would caution: I think this model works because Slack has a
self-described "culture of developer trust". I tend to think they hire bright
engineers and ensure they are equipped to do the right thing. I believe the
vast majority of organizations are NOT ready for this. I dearly want them to
be, but the simple fact is there are too many mediocre developers, and they
can't be trusted without guardrails (and some straight up need babysitters).

------
JepZ
And I thought 'security' itself was friction ;-)

No, seriously: I was wondering, does that tool have a CLI interface? It might
make it more accessible for some devs.

------
mbid
A security app written in PHP. Nice touch.

------
hhaidar
The company I work for has been offering an enterprise-level service like this
for about 8 years now:
[https://www.securitycompass.com/sdelements/](https://www.securitycompass.com/sdelements/)

------
mikekey
Well written and timely for me. I would like to see this work with something
other than Jira though :/

~~~
coldacid
There are instructions for working with Trello in the repo's README, but so far
it seems to be just that or JIRA Enterprise (not Cloud).

------
jrochkind1
this is really cool.

------
boffinism
> The process of deploying code to production is very simple, and takes about
> ten minutes total. This results in a life cycle in which we deploy code to
> production approximately 100 times per day.

What? They spend 1000 minutes out of every 1440 deploying to production? The
deployment process is occurring over 16 hours out of every 24? Am I the only
one who is nonplussed by this?

EDIT: Ok I get it, I get it. I guess I always worked in much smaller companies
where CD meant deploying about 10 times a day tops. TIL big companies are big.

~~~
wgerard
Not a Slack employee, but I worked at a company with similar CD views:

(Likely) various groups of people are deploying to production throughout the
day. Out of those 100 deploys, an individual is probably only involved in 1 or
2 a day. As soon as you're ready to deploy your code, you queue up and see it
all the way through to production along with probably a few other people doing
the same thing.

The actual "change the servers over to the new production code" process is
usually instantaneous or extremely quick, the 10 minutes is mostly spent
testing/building/etc.
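
A toy model of why 100 deploys/day at ~10 minutes each doesn't mean 1000
minutes of serialized deploying (the numbers below are made up for
illustration):

    # Assumed split of the ~10 minutes: most is build/test, which can
    # overlap across queued deploys; only the cutover is serialized.
    deploys_per_day = 100
    cutover_min = 1      # assumed serialized portion per deploy
    build_test_min = 9   # assumed overlappable portion

    fully_sequential = deploys_per_day * (cutover_min + build_test_min)
    serialized_only = deploys_per_day * cutover_min
    print(fully_sequential, "min/day if strictly back to back")  # 1000
    print(serialized_only, "min/day of serialized cutover")      # 100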

People (including myself) enjoy this because you can push very small
incremental changes to production, which significantly reduces the chance of
confounding errors or major issues.

Note that this would be a Sisyphean task if your company doesn't have great
logging/metrics reporting/testing/etc.

~~~
toomuchtodo
It’s usually a Sisyphean task. Everyone wants to look and act cutting edge
(“but Netflix!”), but nobody wants to make the necessary investments in the
tooling, org structure, and management ability/support required to sustain that
sort of deployment cadence. (If your org focuses on who broke something instead
of on the process, and management doesn’t want to change that culture, all hope
is already lost. That’s based on experience in a large enterprise; YMMV.)

There are some legitimate needs for continuous deployment; the rest of it is
cargo culting.

~~~
wgerard
> There are some legitimate needs for continuous deployment; the rest of it is
> cargo culting.

Maybe, but I wouldn't go that far. Small companies already often do CD,
because there's rarely a rigid deploy schedule. It's a practice people
understand and feel the benefits of immediately. If you ask someone who moved
from a small startup to a huge company what their biggest complaints are, I
bet "longer/stricter deploy process" comes up 8/10 times.

When I think of cargo cult programming I think more of TDD or Agile: Practices
that people aren't familiar with and often implement without understanding the
benefits or reasoning.

~~~
toomuchtodo
For every developer who complains about the longer/stricter deploy process,
I'd offer up for consideration deployments that went out through the CD
pipeline where production data was mangled with no rollback possible. As with
everything, it's about determining your appetite for risk.

~~~
wgerard
Hmm, I don't see how that changes with longer/stricter deploy processes,
unless you have some of the tooling that makes CD possible in the first place
(automated checks, etc.).

I've certainly worked in places with very long and strict deploy processes
that managed to mangle production data frequently. Even worse, because the
deploy process was so strict and long, the bad code stayed on production for
much longer than 10 minutes (the deploy time mentioned in the article).

There's some vague notion out there that long deploy process == safe, but
there's very little evidence to suggest that's the case. If anything, it seems
much more dangerous because larger changesets are going out all at once.

~~~
toomuchtodo
It goes back to my original comment above: if you have the proper tooling
(tests that must pass before a deploy gets a green light, blue/green deploys,
canaries, automated datastore snapshots/point-in-time recovery, granular
control of the deployment process), I think continuous deployment provides a
great deal of value above what you've invested into the process. But that
investment is critical if you've bought into CD. Otherwise, it's
"deploy and pray".

~~~
wgerard
Sure, and I guess my point is: if you haven't invested in those things,
waterfall-esque deploy processes are just as bad, and perhaps even worse,
because there's more chance for confounding changes to cause a nasty error.

The only reason waterfall-esque deploy processes work without those things is
that companies often waste tons of people-hours testing things out in the
staging environment (which requires time, obviously).

~~~
vlovich123
The thing you're missing is that you're amortizing the cost. Yeah, it's
typically prohibitive to run manual testing on every CL. However, if you have
any manual testing you need to run, then at some point you have to batch the
changes & test them together anyway. Automated tests don't necessarily solve
this problem either: 1) some automated tests are time-consuming & so require
batching CLs too, 2) it's impossible to predict whether you'll catch all
issues via automated testing, & 3) there are always things that are easier to
test manually.

When it comes to data integrity, I would think you need a structured mechanism
(at least for larger teams that have a high cost of failure) for rolling back
any given CL: tracking writes, making sure there's a plan in place to recover
from any given CL (e.g. that nuking the data doesn't break things), being able
to undo the bad writes, or just reverting to a snapshot. Without being careful
here, CD-style development feels like lighting up a cigarette beside an O2
tank. Now for web development this is fine, since it's not touching any
databases directly. More generally, it feels like a trickier thing to attempt
everywhere across the industry.
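
One concrete shape the snapshot option could take (a sketch; the
snapshot/restore calls are hypothetical stand-ins for a real datastore API):

    import time

    def take_snapshot(tag: str) -> None:
        pass  # stand-in for a datastore snapshot API

    def restore_snapshot(tag: str) -> None:
        pass  # stand-in for point-in-time restore

    def deploy_cl(cl_id: str, apply_change, verify) -> None:
        tag = f"pre-{cl_id}-{int(time.time())}"
        take_snapshot(tag)  # always have a rollback point per CL
        apply_change()
        if not verify():
            # caveat: writes that land after the snapshot are lost, so
            # this only works if the window is short or the writes are
            # replayable, which is exactly the integrity concern above
            restore_snapshot(tag)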

~~~
wgerard
> However, if you have any manual testing you need to run, then at some point
> you have to batch the changes & test them out together anyway.

Wait, why is that? Manual testing should be reserved for workflows that can't
be automatically tested (or at least, aren't yet).

I'm not sure I see why doing any amount of manual testing would necessitate
manually testing everything.

> Some automated tests can be time-consuming & so require batching of CLs to
> run too

I'm not sure I see why this is a problem, and CD certainly doesn't require
that only one changeset go live at a time.

> it's impossible to predict if you are going to catch all issues via
> automated testing

This is also true of manual testing.

> there's always things it's easier to test for manually.

I'd go further and say it's almost always easier to test manually, but the
cost of an automated test is amortized, and you come out ahead at some point.
That point usually comes sooner than you think.

> I would think you need a structured mechanism...

This paragraph is entirely true of traditional deploys with long cadences as
well. The need (or lack thereof) for very formal and structured mechanisms for
rolling back deploys doesn't really have much to do with the frequency that
you deploy.

> Now for web development this is fine since it's not touching any databases
> directly.

Maybe we're speaking about different things here, but the trope about web
development is that it's basically a thin CRUD wrapper around a database, so
I'm not sure this is true.

~~~
vlovich123
> Wait, why is that? Manual testing should be reserved for workflows that
> can't be automatically tested (or at least, aren't yet). I'm not sure I see
> why doing any amount of manual testing would necessitate manually testing
> everything.

I never said you need to manually test everything. This is about continuous
deployment, where a push to master is typically the last step anyone takes
before the system deploys it live shortly thereafter. In the general case,
however, how do you know, in an automated fashion, whether a given CL needs
manual testing? If you have any manual testing, then you can't just
continuously deploy.

> This is also true of manual testing.

I never opined that there should be only one or the other exclusively, so I
don't know why you're building this strawman & arguing it throughout. A mix of
automated & manual testing is typically going to be more cost-effective for a
given quality bar (or vice versa: for a given cost you're going to have higher
quality), because (good) manual testing involves humans who can spot problems
that weren't considered in automation (you then obviously improve the
automation if you can) or things automation can't give you feedback on (e.g.
UX issues like colors not being friendly to color-blind users).

> The need (or lack thereof) for very formal and structured mechanisms for
> rolling back deploys doesn't really have much to do with the frequency that
> you deploy.

That just isn't true. If you're thorough with your automated & manual testing,
you can establish a much greater degree of confidence that things won't go
catastrophically wrong; you deploy a few times a year & you're done. Now _of
course_ you should always do continuous delivery, so that to the best of your
ability you maintain, in an automated fashion, an extremely high quality bar
for tip of tree at all times & are always able to kick off a release. Whether
that translates into also deploying tip of tree frequently is a different
question. To make the thesis of my post clear: I was saying continuous
deployment is not generally applicable to every domain (continuous delivery
is). If you want an example, consider a FW release for some IoT device. If you
deployed FW updates all the time, you'd be putting yourself in a risky
scenario where a bug bricks your units (e.g. the OTA protocol breaks) & causes
a giant monetary cost to your business (RMAs, potential lawsuits, etc.). By
having a formal manual release process where you perform manual validation to
catch any bugs/oversights in your automation, you're paying some extra cost as
insurance against a bad release.

> Maybe we're speaking about different things here, but the trope about web
> development is that it's basically a thin CRUD wrapper around a database, so
> I'm not sure this is true.

The frontend code itself doesn't talk to the DB directly (& if it does, you're
just asking for huge security problems). The middle/backend code takes
frontend requests, validates permissions, sanitizes DB queries, talks to other
microservices, etc. Sometimes there are tightly coupled dependencies, but I
think that's rarer if you structure things correctly. Even FB, which can be
seen as the prototypical example of moving in this direction, no longer pushes
everything live; things get pushed weekly, likely to give their QA teams time
to manually validate the week's changes, do staged rollouts across the
population to catch issues, etc.

In general, I think as you scale up, continuous deployment degrades to
continuous delivery because the risks in the system are higher: more users,
more revenue, more employees & more SW complexity mean the cost of a problem
goes up, as does the probability of a catastrophic problem occurring. When I
worked at a startup, continuous deployment was fine. When I've worked at big
companies, I've always pushed for continuous delivery, but continuous
deployment would just be the wrong choice & irresponsible to our customers.

