When organizations scale up, and especially when they're dealing with risks, it's easy for them to shift toward the controlling end of things. This is especially true when internally people can score points by assigning or shifting blame.
Controlling and blaming are terrible for creative work, though. And they're also terrible for increasing safety beyond a certain, pretty low level. (For those interested, I strongly recommend Sidney Dekker's "Field Guide to Understanding Human Error", a great book on how to investigate airplane accidents and on how blame-focused approaches deeply harm real safety efforts.) So it's great to see Slack finding a way to scale up without losing something that has allowed them to make such a lovely product.
We had a guy who more or less appointed himself manager when the previous engineering manager decided he couldn't deal with the environment anymore. His insistence on controlling everything led to a conscious decision to destroy the engineering wiki and knowledge base and force everyone to funnel through him, making himself the single source of truth. Once his mind was made up on something, he would berate other engineers and team members to get what he wanted. Features stopped being developed and things began to fail chronically, but because senior leadership weren't tech people, they all deferred to him. Once they decided to officially make him engineering manager (for no reason other than that he had been on the team the longest, because people were beginning to wise up and quit the company), all but 2 of the 12 people in the engineering department quit because no one wanted to work for him.
Imagine my schadenfreude after leaving that environment to find out they were forced to close after years of failing to innovate, resulting in the market catching up and passing them. Never in my adult life have I seen a company inflict so many wounds on itself and then be shocked when competitors start plucking customers off like grapes.
The reviewer rejected my Slack add-on twice, but was really nice about it, gave specific reasons, encouraged me to fix it and reapply, etc.
A very pleasant experience compared to some of the other systems where it feels like you're begging to be capriciously rejected.
I strongly dislike repetitive mental work, and writing a checklist is essentially resigning myself to the fact that such work will be necessary. Until I write it, I can still convince myself I'll be able to automate the process.
If I run through the checklist a couple of times and it seems to:
[ ] Cover everything
[ ] Not require complicated decision making or value judgments
[ ] Have few edge cases in need of handling
[ ] Not require automation-opaque tooling
[ ] Not change more frequently than I execute the checklist
then I'll automate it.
At least that's what I keep telling myself. I also write far fewer checklists than I should.
Similar boring processes in a tech business can be checklisted to increase efficiency, and the checklist itself can be iterated on. Everything from designing a new feature, to troubleshooting an error, to addressing a customer support ticket, to getting access to a new resource should use checklists.
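To make that concrete, here's a sketch (all names are hypothetical, not from any particular tool) of a checklist as a tiny bit of Python that mixes automated checks with human confirmation, so the checklist itself can be iterated on like code:

```python
# Minimal checklist runner (hypothetical example).
# Each step is a (description, check) pair; check is optional automation,
# and steps without one must be confirmed by a human.

RELEASE_CHECKLIST = [
    ("All CI tests green", None),
    ("Changelog entry added", None),
    ("Feature flag defaults to off", None),
]

def run_checklist(steps, confirm):
    """Walk the steps; `confirm` is a callable that asks a human.
    Returns the descriptions of any unconfirmed steps."""
    missed = []
    for description, check in steps:
        ok = check() if check else confirm(description)
        if not ok:
            missed.append(description)
    return missed

# Example: a "human" who confirms everything except the changelog step.
missed = run_checklist(RELEASE_CHECKLIST,
                       confirm=lambda d: "Changelog" not in d)
print(missed)  # ['Changelog entry added']
```

As steps get reliable enough, their `None` gets replaced with an automated check, which is exactly the iterate-toward-automation path described above.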
If I had to switch to some other note taking platform it would probably break my flow enough that I wouldn't do it.
People too often get used to the routine and end up skipping bits from checklists, or even outright missing stuff. It's strange, given that the whole idea of having something to actually check is supposed to mitigate exactly that.
Generally I do my utmost to avoid having checklists, so that they're reserved for things where they can't be avoided (e.g. where automation makes no sense, or potentially makes things worse).
This depends on the person and their approach.
If I understand each checklist step as a means of avoiding future hassle, rework, or outright danger of some sort; if I buy into the idea that I actually need to reference the checklist (perhaps even going so far as to print out a copy and physically check things off); and if I'm properly conservative (no "I think I did that...", only "I know I did that" or "I double-checked and yes, I did that"), then checklists stay very effective.
In other words, it's not enough to have a checklist; you need checklist discipline, the kind of discipline that can only come from buy-in to the concept. The same checklist approached with two different mindsets can range from a damn-near error-free way to accomplish something to a piece of scrap paper that would be worth more blank.
This does sometimes make checklist handoff problematic.
And your discipline might depend on the checklist. I'm way more disciplined about a release checklist than I am about a weekend chores checklist - and this is perfectly fine.
It seems like just a simple checklist app, but having a non-Jira process that takes only a few minutes is so valuable. Meanwhile, "security reviews" and "threat models" as part of your SDLC take insane amounts of time and honestly aren't super helpful.
That's a lot of people...
However - I would want to caution: I think this model works because Slack has a self-described "culture of developer trust". I tend to think they hire bright engineers and ensure they are equipped to do the right thing. I believe the vast majority of organizations are NOT ready for this. I direly want them to be, but the simple fact is there are too many mediocre developers, and they can't be trusted without guardrails (and some straight up need babysitters).
No seriously, I was wondering: does that tool have a CLI? It might make it more accessible for some devs.
What? They spend 1000 minutes out of every 1440 deploying to production? The deployment process is occurring over 16 hours out of every 24? Am I the only one who is nonplussed by this?
EDIT: Ok I get it, I get it. I guess I always worked in much smaller companies where CD meant deploying about 10 times a day tops. TIL big companies are big.
A culture of continuous deployment is often hard to fathom for people who've never worked at a company with one. Everything, down to what you write and how you write it, is influenced by being able to deploy it and see its effects almost instantly.
These aren't huge, sweeping changes being deployed. They're small pieces of larger feature sets. It's more like: deploy a conditional statement with logging and confirm from the logs that it works; next, deploy the view you're testing behind a feature flag, toggled so that you and the PM can see it, and confirm it's properly called from the earlier conditional; when things look good, deploy some controller code that handles form requests from the view; and so on.
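A rough sketch of what those first two steps might look like in Python (the flag store, user names, and helper here are all invented for illustration, not anyone's real system):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("deploys")

# Hypothetical flag store; in a real setup this would be a flag service.
FLAGS = {"new_checkout_view": False}

def flag_enabled(name, user):
    # Step 2: flag is off globally, but toggled on for the dev and the PM.
    return FLAGS.get(name, False) or user in {"dev_alice", "pm_bob"}

def checkout_view(user):
    # Step 1: ship just the conditional plus logging, confirm via the logs.
    if flag_enabled("new_checkout_view", user):
        log.info("new checkout view served to %s", user)
        return "new_view"
    return "old_view"

print(checkout_view("pm_bob"))   # new_view (allow-listed user)
print(checkout_view("someone"))  # old_view (flag still off globally)
```

Each of those deploys is tiny on its own, which is what keeps the blast radius of any one of them small.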
You deploy small changes piecemeal and so spread out the risk over a larger period of time. It makes identifying issues with a new piece of code almost trivial. Needing to debug 30 lines of code is so much less harrowing than needing to look over 900.
(Likely) various groups of people are deploying to production throughout the day. Out of those 100 deploys, an individual is probably only involved in 1 or 2 a day. As soon as you're ready to deploy your code, you queue up and see it all the way through to production along with probably a few other people doing the same thing.
The actual "change the servers over to the new production code" process is usually instantaneous or extremely quick, the 10 minutes is mostly spent testing/building/etc.
People (including myself) enjoy this because you can push very small incremental changes to production, which significantly reduces the chance of confounding errors or major issues.
Note that this would be a Sisyphean task if your company doesn't have great logging/metrics reporting/testing/etc.
I was excited for the move to a large corporation where there would be amazing room for growth and learning.
I have to say that almost a year into my work on this project, I was absolutely stunned how inept this company was at coordinating a technology project.
Something a small team could accomplish in a matter of months was taking hundreds of developers and hundreds more in supporting/operational roles years to accomplish. My guess is the developers on this project would gladly trade places with Sisyphus.
There are some legitimate needs for continuous deployment, the rest of it is cargo culting.
The first time I switched from CI to full CD was circa 2011. I loved it because the mental bucket "later" went away. Except to the extent something was declared as a formal experiment in our A/B tests, code was either live or it wasn't. We were doing pair programming and committing every few hours, so aside from the little scratch-paper to-do list before we committed, there was no "later" for us. It made it real clear what our "good enough to ship" standards were. There was less room for bullshit. The resulting code was tighter, cleaner.
It also forced us to work much more closely as a team. We couldn't leave product questions for some end-of-iteration review. We had more mini-reviews with the product manager, and also improved our product judgment. Everybody trusted each other more. Partly because we had to, and partly because close collaboration is how you build trust.
It also shifted incentives further upstream. Suddenly there were no more big releases. No matter how big your vision, you had to learn to think of it in bite-sized pieces. It became less about answers, and more about questions. Not "Users need X!" but "What's the smallest thing we can do to see if users benefit from X?" Being able to make a lot of small bets made it easier to explore.
The Lean and Kanban folks talk a lot about "minimum WIP", where WIP is work in process. My CD experiences have definitely convinced me that they're right. Smaller pieces deployed more frequently require a fair bit of discipline, but there are such huge gains in effectiveness and organizational sanity that I'll always try to work that way if I can.
Make a change to a thing, which might take a few minutes or a few hours, get it reviewed and merged and it'll be in production a few minutes later.
Once you get past 500 engineers, even just completing a single task a week means 100 things to deploy every day: either you batch them together somehow or you work on the tooling to just get them to production without any fuss.
Maybe, but I wouldn't go that far. Small companies already often do CD, because there's rarely a rigid deploy schedule. It's a practice people understand and feel the benefits of immediately. If you ask someone who moved from a small startup to a huge company what their biggest complaints are, I bet "longer/stricter deploy process" comes up 8/10 times.
When I think of cargo cult programming I think more of TDD or Agile: Practices that people aren't familiar with and often implement without understanding the benefits or reasoning.
I've certainly worked in places with very long and strict deploy processes that managed to mangle production data frequently. Even worse, because the deploy process was so strict and long, the bad code managed to stay on production for much longer than 10 minutes (the deploy time mentioned in the article).
There's some vague notion out there that long deploy process == safe, but there's very little evidence to suggest that's the case. If anything, it seems much more dangerous because larger changesets are going out all at once.
The only reason waterfall-esque deploy processes work without those things is because companies often waste tons of people-hours on testing things out in the staging environment (which requires time, obviously).
When it comes to data integrity, I would think you need a structured mechanism for rolling back any given CL (at least for larger teams with a high cost of failure): tracking writes, having a recovery plan for any given CL (e.g. making sure nuking the data doesn't break things), being able to undo the bad writes, or just reverting to a snapshot. Without being careful here, CD-style development feels like lighting up a cigarette beside an O2 tank. Now for web development this is fine, since it's not touching any databases directly. More generally, it feels like a trickier thing to attempt everywhere across the industry.
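One possible shape for the "track writes so you can undo them" option, purely as a sketch with invented names: record the value each write replaced, so a bad CL's writes can be reverted in reverse order.

```python
# Sketch of write tracking for rollback (hypothetical, not a real library).
# Each write records the value it replaced, so the writes from a bad
# changelist can be undone newest-first.

class UndoLogStore:
    _MISSING = object()  # sentinel: key did not exist before the write

    def __init__(self):
        self.data = {}
        self.undo_log = []  # list of (key, previous value or _MISSING)

    def write(self, key, value):
        prev = self.data.get(key, self._MISSING)
        self.undo_log.append((key, prev))
        self.data[key] = value

    def rollback(self, n_writes):
        """Undo the last n writes, newest first."""
        for _ in range(n_writes):
            key, prev = self.undo_log.pop()
            if prev is self._MISSING:
                del self.data[key]
            else:
                self.data[key] = prev

store = UndoLogStore()
store.write("plan", "free")
store.write("plan", "pro")   # a bad deploy writes this...
store.write("trial", True)   # ...and this
store.rollback(2)            # revert the bad changelist's writes
print(store.data)  # {'plan': 'free'}
```

A real system would need this to be durable and concurrent-safe, of course; the point is just that "undo the bad writes" implies keeping enough history to undo them.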
Wait, why is that? Manual testing should be reserved for workflows that can't be automatically tested (or at least, aren't yet).
I'm not sure I see why doing any amount of manual testing would necessitate manually testing everything.
> Some automated tests can be time-consuming & so require batching of CLs to run too
I'm not sure I see why this is a problem, and CD certainly doesn't require that only one changeset go live at a time.
> it's impossible to predict if you are going to catch all issues via automated testing
This is also true of manual testing.
> there's always things it's easier to test for manually.
I'd go further and say it's almost always easier to test manually, but the cost of an automated test is amortized, and you come out ahead at some point. That point usually arrives sooner than you think.
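The amortization argument can be made concrete with a back-of-the-envelope break-even calculation (all the minute figures below are assumptions for illustration, not measurements):

```python
# Break-even point for automating a manual test (all numbers assumed).
write_cost = 120   # minutes to write the automated test
auto_run = 0.5     # minutes per automated run
manual_run = 10    # minutes per manual run

# Automation pays off once write_cost + n * auto_run < n * manual_run,
# i.e. after n = write_cost / (manual_run - auto_run) runs.
n = write_cost / (manual_run - auto_run)
print(round(n, 1))  # 12.6
```

With these (made-up) numbers, a test that runs a few times per week pays for itself within a month or two.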
> I would think you need a structured mechanism...
This paragraph is entirely true of traditional deploys with long cadences as well. The need (or lack thereof) for very formal and structured mechanisms for rolling back deploys doesn't really have much to do with the frequency that you deploy.
> Now for web development this is fine since it's not touching any databases directly.
Maybe we're speaking about different things here, but the trope about web development is that it's basically a thin CRUD wrapper around a database, so I'm not sure this is true.
I never said you need to manually test everything. This is about continuous deployment where typically a push to master is the last step anyone does before the system at some point deploys it live shortly thereafter. In the general case, however, how do you know in an automatic fashion if a given CL may or may not need manual testing? If you have any manual testing then you can't just continuously deploy.
> This is also true of manual testing.
I never opined that there should be only one or the other exclusively, so I don't know why you're building this strawman and arguing it throughout. A mix of automated and manual testing is typically more cost-effective for a given quality bar (or, vice versa, gives higher quality for a given cost), because (good) manual testing involves humans who can spot problems that weren't considered in the automation (which you then improve if you can) or that automation can't give you feedback on (e.g. UX issues like colors not being friendly to color-blind users).
> The need (or lack thereof) for very formal and structured mechanisms for rolling back deploys doesn't really have much to do with the frequency that you deploy.
That just isn't true. If you're thorough with your automatic & manual testing you may establish a much greater degree of confidence that things won't go catastrophically wrong. You deploy a few times a year & you're done. Now of course you should always do continuous delivery so that to the best of your ability you ensure in an automated fashion an extremely high quality bar for tip of tree at all times so that you're able to kick off a release. Whether that translates into also deploying tip of tree frequently is a different question. Just to make clear what the thesis of my post is, I was saying continuous deployment is not something that's generally applicable to every domain (continuous delivery is). If you want an example, consider a FW release for some IoT device. If you deployed FW updates all the time you're putting yourself in a risky scenario where a bug bricks your units (e.g. OTA protocol breaks) & causes a giant monetary cost to your business (RMA, potential lawsuits, etc). By having a formal manual release process where you perform manual validation to catch any bugs/oversights in your automation you're paying some extra cost as insurance against a bad release.
> Maybe we're speaking about different things here, but the trope about web development is that it's basically a thin CRUD wrapper around a database, so I'm not sure this is true.
The frontend code itself doesn't talk to the DB directly (and if it does, you're just asking for huge security problems). The middle/backend code takes front-end requests, validates permissions, sanitizes DB queries, talks to other micro-services, etc. Sometimes there are tightly coupled dependencies, but I think that's rarer if you structure things correctly. Even FB, which can be seen as the prototypical example of moving in this direction, no longer pushes everything live. Things get pushed weekly, likely to give their QA teams time to manually validate changes for the week, do staged rollouts across their population to catch issues, etc.
In general I think as you scale up, continuous deployment degrades into continuous delivery because the risks in the system are higher; more users, more revenue, more employees, and more SW complexity mean the cost of a problem goes up, as does the probability of a catastrophic problem occurring. When I worked at a startup, continuous deployment was fine. When I worked at big companies, I always pushed for continuous delivery; continuous deployment would just be the wrong choice and irresponsible to our customers.
Instead imagine each deploy as an edge, and imagine them to be near instantaneous. With respect to an instance, and a user, and the observable effects of a deploy, this paints a more accurate picture. 100 times a day means one deploy every 14.4 minutes.
How many times a day do you think Amazon deploys changes? Or Facebook? Or Google?
So having more pushes per day isn't necessarily the metric to maximize. Quality of code changes for each push is important, and this is where automated testing can be very valuable. The goal is for automated testing to be a "gatekeeper of bad code".
But even this system isn't perfect, and it's possible to deploy things that pass tests but still have show-stopping bugs. Or for the code to cause your tests to misbehave - I'm seeing this now with Tape.js on Travis, where Tape sees my S3 init calls as extra tests. Then my build fails because, of the 2 tests specified, 3 failed and another 4 passed.