
Deploys at Slack - michaeldeng18
https://slack.engineering/deploys-at-slack-cd0d28c61701
======
hn_throwaway_99
Interested in how they handle DB updates/migrations (I don't know what Slack
uses as its data storage backend).

IMO those DB migrations are the most difficult and risk-fraught part, because
you need to ensure that the different server versions running mid-deploy can
all work with whatever state your DB is in at the moment.

~~~
derekperkins
Mostly MySQL that is moving to Vitess (transparently sharded MySQL). I believe
they use gh-ost for migrations.

------
brycethornton
It's always nice to see how other teams do it. Nothing too groundbreaking here
but that's a good thing.

I did notice the screenshot of "Checkpoint", their deployment tracking UI. Are
there solid open source or SaaS tools doing something similar? I've seen
various companies build similar tools, but most deployment processes seem
consistent enough that a third-party tool could be useful for most teams.

~~~
thinkingkong
I've built that tool 2-3 times now. The issue is really the deploy function
and what controls it. It's always a one-off, or so tightly integrated into the
hosting environment that reaching in with a SaaS product is somewhat
difficult. That being said, the new lowest-common-denominator standards like
K8s make it way easier. If anyone is interested in using a tool, just leave a
comment and I'll reach out.

~~~
sciurus
Please provide a way for people to reach you without commenting here.

~~~
thinkingkong
Just ping here for now. hello@hover.sh

------
nathankunicki
Fun to read, but there's a lack of detail here that I'd like to see. For
example, this talks purely about code changes. However, sometimes a code
change requires a database schema change (as mentioned above), different APIs
to be used, etc. In the percentage-based rollout where multiple versions are
in use at once, how are these differences handled?

~~~
navaati
For database schema changes, here is the standard practice:

- You have version 1 of the software, supporting schema A.

- You deploy a version 2 supporting both schema A and the new schema B. Both
versions coexist until the deployment is complete and all version 1 instances
are stopped. During all this time the database is still on schema A; this is
fine because your instances, both version 1 and 2, support schema A.

- Now you do the schema upgrade. This is fine because your instances, now all
running version 2, support schema B.

- At last, if you wish, you can now deploy a version 3, dropping support for
schema A.
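
Roughly, the version-2 code ends up tolerating both schemas at once. A minimal
sketch (table, column, and helper names are all made up for illustration):

    import sqlite3

    def schema_version(db: sqlite3.Connection) -> int:
        # Hypothetical: read the current migration level from the DB.
        return db.execute("SELECT version FROM schema_info").fetchone()[0]

    def save_user(db: sqlite3.Connection, first: str, last: str) -> None:
        if schema_version(db) >= 2:
            # Schema B: split name columns.
            db.execute(
                "INSERT INTO users (first_name, last_name) VALUES (?, ?)",
                (first, last),
            )
        else:
            # Schema A: single name column, still live mid-deploy.
            db.execute("INSERT INTO users (name) VALUES (?)", (f"{first} {last}",))
        db.commit()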

~~~
daigoba66
We do it the other way (and I've always seen it done this way): the database
change is compatible with both the current code and the new code. So deploy
the database change, then deploy the code change. This usually allows you to
roll back code changes.
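
For example, adding a column under this approach might be sequenced like the
sketch below (hypothetical names, MySQL-flavored DDL held in a Python
migration list):

    # Step 1: ship the expansion before any code change. Nullable, so
    # the current code's INSERTs that omit it keep working.
    EXPAND = "ALTER TABLE users ADD COLUMN last_name VARCHAR(255) NULL"

    # Step 2 happens outside the database: deploy the new code that
    # reads and writes last_name. Rolling that code back is safe
    # because the column just sits there unused.

    # Step 3: only after the new code is fully rolled out and old rows
    # are backfilled, tighten the constraint.
    CONTRACT = "ALTER TABLE users MODIFY last_name VARCHAR(255) NOT NULL"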

~~~
sciurus
This is generally harder to pull off though unless you do things like force
all DB access to go through stored procedures.

And then you're really still pursuing the same strategy described above,
except for your stored procedures instead of your app code.

------
RussianCow
> Even strategies like parallel rsyncs had their limits.

They don't really go into detail as to what limitations they hit by pushing
code to servers instead of pulling. Does anyone have any ideas as to what
those might be? I can't think of any bottlenecks that wouldn't apply in both
directions, and pushing is much simpler in my experience, but I've also never
been involved with deployments at this scale.

~~~
rbtying
I can't speak for Slack, but it's not unreasonable to believe that a single
machine's available output bandwidth (~10-40Gbps) can be saturated during a
deploy of a ~GB artifact to hundreds of machines. Pushing the package to S3
and fetching it back down lets the bandwidth be spread over more machines and
over different network paths (e.g. in other data centers).
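
Back-of-the-envelope, with purely illustrative numbers:

    artifact_gb = 1    # size of one build artifact
    hosts = 500        # fleet size
    nic_gbps = 10      # the deploy host's NIC

    total_gbits = artifact_gb * 8 * hosts     # 4,000 Gbit to move
    push_seconds = total_gbits / nic_gbps     # ~400s serialized through one box
    print(f"single-host push: ~{push_seconds:.0f}s")

Pulling from S3 spreads those 4 Tbit across S3's fleet and many network paths,
so the per-host downloads run in parallel instead of queuing behind one NIC.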

~~~
zerd
We do it similarly, except we push an image to a Docker registry (backed by
multi-region S3); then you can use e.g. Ansible to pull it to 5, 10, 25, 100%
of your machines. It "feels" like push, except that you're staging the
artifact somewhere. And when booting a new host, it'll fetch the image from
the same place.
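
The staged percentages can be made deterministic with a stable hash, so each
stage is a superset of the last. A sketch (hostnames and sizes invented):

    import hashlib

    def host_bucket(hostname: str) -> float:
        """Map a hostname to a stable point in [0, 100)."""
        digest = hashlib.sha256(hostname.encode()).digest()
        return int.from_bytes(digest[:8], "big") % 10000 / 100

    def hosts_for_stage(hosts: list[str], percent: float) -> list[str]:
        # Stable buckets mean the 5% group is contained in the 10%
        # group, the 10% in the 25%, and so on.
        return [h for h in hosts if host_bucket(h) < percent]

    fleet = [f"web-{i:03d}" for i in range(200)]
    for stage in (5, 10, 25, 100):
        print(f"{stage}% stage: {len(hosts_for_stage(fleet, stage))} hosts")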

------
Saaster
I'm surprised at the 12 deployments per day, if that's truly to production.
There are bugfixes etc., but feature-wise Slack has been... let's say slow.
Not Twitter slow, but still slow in making any user-visible changes.

~~~
onion2k
Far too many people on HN seem to think the public-facing code we see is all
that the engineering team in a large company works on. There's _so much_ more
to running a large SaaS business. If Slack is like all the other SaaS
companies I've encountered, they'll have dozens of internal apps for sales,
comms, analytics, engineering, etc. that people outside of the business never
see[1]. Those all need developing and all need deploying.

[1] They might buy in solutions for some business functions like accounting,
HR and support, but they'll still have tons of homegrown stuff. Every tech
company does.

------
darkwater
I wonder why they didn't at some point evaluate an immutable-infrastructure
approach, leveraging tools like Spinnaker to manage deploys. They surely have
the muscle and numbers to use it and even contribute to it actively, no? I
know that deploying your software is usually something tied closely to a
specific engineering team, but I really like the immutable approach, and I was
wondering why a company the size of Slack, born and grown in the "right" time,
did not consider it.

~~~
truetuna
I had similar thoughts when I read their article. Their atomic deploy problem
would have completely disappeared had they gone with an immutable approach.

------
daenz
I'm kind of surprised they don't have a branch-based staging. Every place I've
worked at has evolved toward needing the ability to spin up an isolated
staging environment based on specific tags or branches.

~~~
sophiebits
It’s become more common to eschew long-lived release branches for SaaS
applications. For example: [https://engineering.fb.com/web/rapid-release-at-
massive-scal...](https://engineering.fb.com/web/rapid-release-at-massive-
scale/)

------
roadbeats
It's cool to see how big organizations set up their deployments, but it feels
like there aren't enough resources about how one should set up a deployment
system for a new startup at the very beginning.

The setup I currently use is custom bash scripts setting up EC2 instances.
Each instance installs a copy of the git repo(s) and runs a script to pull
updates from the production/staging branches, compile a new build, replace the
binaries & frontend assets, restart the service, and send a Slack message with
the list of changes just deployed.

It works well enough for a startup with 2 engineers. However, I'd like to know
what could be better. What could save me the time of maintaining my own
deployment system in the AWS world, without investing days of resources in
K8s?

~~~
gtsteve
You don't have to do a big-bang style Google thing. You can just invest in
some continuous improvement over the next few years:

Iteration 0: What you have now.

Iteration 1: A build server builds your artifact, and your EC2 instances
download the artifact from the build server.

Iteration 2: The build server builds the artifact and builds a container and
pushes it to ECR. Your EC2 instances now pull the image into Docker and start
it.

Iteration 3: You use ECS for basic container orchestration. Your build server
instructs your ECS instances to download the image and run them, with blue-
green deployments linked to your load balancer.

Iteration 4: You set up K8s and your build server instructs it to deploy.

I followed a similar trajectory, and I'm at iteration 3 right now, on the
verge of moving to K8s.

It's your call on how long the timespan is here, and commercial pressures will
drive it. It could be 6 months, it could be 3 years.
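
Iteration 1 can be as small as a script like the one below running on each
instance (the bucket, paths, and service name are placeholders for your own
setup):

    import subprocess

    ARTIFACT = "s3://my-build-bucket/releases/app-latest.tar.gz"

    def deploy() -> None:
        # Fetch the artifact the build server uploaded, unpack, restart.
        subprocess.run(["aws", "s3", "cp", ARTIFACT, "/tmp/app.tar.gz"], check=True)
        subprocess.run(["tar", "-xzf", "/tmp/app.tar.gz", "-C", "/opt/app"], check=True)
        subprocess.run(["systemctl", "restart", "myapp"], check=True)

    if __name__ == "__main__":
        deploy()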

~~~
roadbeats
Thanks a lot for the answer.

------
capableweb
No mention of feature toggles whatsoever. I guess that's why it took them a
long time to fix the thing with the new WYSIWYG editor, where after 2 weeks or
so they offered a toggle for people to change back.

Does anyone know their reasoning behind not employing feature toggles? I would
feel very slowed down if I didn't have the guarantee and confidence that I
could quickly roll back in the event of errors.

~~~
derision
They had an undocumented feature toggle for that since day 1. A JavaScript
snippet posted in a thread here reverted it to the old functionality. So they
are using them, just not always surfacing them.

------
wrkronmiller
Nice write-up! It would be interesting, however, to get more details on what
types of errors were caught in dogfooding, which made it to production, what
kind of hotfixes have had to be made in the past, etc...

It's nice to know what Slack does to mitigate bugs in releases, but it would
also be useful to know what kinds of bugs each step catches and what bugs
still slip through.

------
mleonhard
How do they choose which shards are included in the first 10% canary group?

This is a tricky problem. It's tempting to include only small (less valuable)
accounts in the first group, but some bugs only occur with large accounts, so
you need some of those in the first 10%.

Many bugs affect only a small portion of customers, and those portions fall
into many categories. A canary becomes more effective when it includes members
from each category: account type, number of users, client type
(web/iOS/Android/macOS/Windows/Linux), client version, web browser type and
version, IPv4/IPv6, VPN, TLS MITM proxy, language, timezone, payment currency,
country, tax region, mobile service provider, etc.
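
One way to get that coverage is stratified sampling: take at least one shard
per category before topping up to the target fraction. A purely illustrative
sketch:

    import random

    def pick_canary(shards: list[dict], fraction: float = 0.10) -> list[dict]:
        by_category: dict[str, list[dict]] = {}
        for s in shards:
            by_category.setdefault(s["category"], []).append(s)
        # One representative per category first...
        canary = [random.choice(group) for group in by_category.values()]
        # ...then top up to the target fraction at random.
        remaining = [s for s in shards if s not in canary]
        extra = max(0, int(len(shards) * fraction) - len(canary))
        return canary + random.sample(remaining, min(extra, len(remaining)))

    shards = [{"id": i, "category": random.choice(["small", "large", "enterprise"])}
              for i in range(100)]
    print(len(pick_canary(shards)), "shards in canary")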

------
tcgv
Interestingly, last year I wrote a blog post on this subject, and it seems
pretty in line with Slack's approach :)

Regarding deployment monitoring, besides "error monitoring" I would also add
"health monitoring" as valuable for early detection of deployment issues:

> In this line of monitoring we are interested in assuring that our
> application is performing as expected. First we define a set of system and
> business metrics that adequately represents the application behaviors. Then
> we start tracking these metrics, triggering an alert whenever one of them
> falls outside of its expected operational range. [1]

[1] [https://thomasvilhena.com/2019/08/a-successful-deployment-
mo...](https://thomasvilhena.com/2019/08/a-successful-deployment-model)
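
A minimal sketch of the idea, with the metrics and ranges invented for
illustration:

    EXPECTED_RANGES = {
        "p99_latency_ms":  (0, 800),
        "error_rate_pct":  (0, 1.0),
        "signups_per_min": (5, 500),  # a business metric, not just a system one
    }

    def check_health(snapshot: dict[str, float]) -> list[str]:
        alerts = []
        for metric, (lo, hi) in EXPECTED_RANGES.items():
            value = snapshot.get(metric)
            if value is None or not (lo <= value <= hi):
                alerts.append(f"ALERT: {metric}={value} outside [{lo}, {hi}]")
        return alerts

    print(check_health({"p99_latency_ms": 950, "error_rate_pct": 0.2,
                        "signups_per_min": 42}))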

------
Silhouette
It's interesting that atomic deploys weren't in from the start. That was one
of the few deployment practices we really insisted on from day one at my own
businesses, if only because the uncertainty you get from trying to trace
problems where your system isn't in any known state makes it all but
impossible to work systematically.

A related challenge where we've never really found a good solution is how to
handle deploying updates atomically when both code and data model are
changing. That is, we need to migrate both our application software and our
database schema in some co-ordinated way.

In practice, this usually ends up being done in multiple stages. During some
intermediate part of the process we are actively maintaining both the old and
new database structure and running both versions of the relevant code. At some
point there is a bulk conversion of existing DB data from the old format to
the new one. Then, hopefully, at the end we switch to reading only the new
version, retire the old code, and if necessary remove the old DB contents that
are no longer in use. Even then we probably still want to keep an
implementation of our previous data API available that reverse-engineers data
from the new format, just in case we have to wind back the application code
due to some other problem.

I got tired just _writing_ that, and it feels similarly dirty actually
deploying it. How is everyone else handling this? Has anyone found a
satisfactory way to migrate code and data forwards, and if necessary
backwards, without timing or data loss issues? Controlled deployments of
application code seem to be largely a solved problem with modern tools and a
bit of common sense, but the database side of things doesn't seem to be nearly
as clean, at least not with any of the strategies I've encountered so far.

[Edit: I see that while I was writing this, someone else has already raised a
similar point elsewhere in the discussion and a few people have replied, but
unfortunately only along the lines I mentioned here as well. This does not
make me optimistic about finding a cleaner strategy, but further comments are
still welcome.]

~~~
codenesium
I've never seen it solved. You either write and test migration scripts to roll
it back, or you restore from a backup. I don't know what you do if you add a
new column that's populated in the new version and then roll back. I guess
this is a good argument for rolling out as small a piece as you can and hoping
you don't find out it's busted a week later.

~~~
Silhouette
_I don't know what you do if you add a new column that's populated in the new
version and then roll back._

I think that one depends on what you are rolling back and whether you have
your application code somewhat isolated from your underlying database via a
well-defined API.

Assuming that you will at some point need to populate your new column for all
your pre-existing records in some well-defined way, you can handle rolling
back the application code _as long as_ you have a version of the database API
that still provides the interface the older application code requires. You
might no longer be updating your new column with new data at that time, but
the data you did get is still there, and when you later want to move your
application code forward again you can populate the new column for any extra
records that have been added to your database in the meantime just as you did
on the initial migration.

Given the practicalities of a multi-step migration involving both application
and database schema, you might already have the necessary extra code in your
database API to support running old application code against the new database
schema, and even to fill in any missing data for that extra column according
to the same rules you used when migrating the older, pre-transition data, and
to ensure any new constraints are satisfied. This way, you can wind back your
application code without damaging your new database.

If for some reason the database schema itself needs to be rolled back, and you
can't just fake it at the API level, things become a lot more difficult as you
have potential data loss issues to contend with. Likewise if it's possible
that the old application code would not maintain any new database records in a
way that satisfies all required constraints and you can't handle that at the
API. Fortunately, this doesn't seem to happen very often in practice.
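
To make that concrete, the kind of API shim being described might look like
the sketch below (all names hypothetical, SQLite-style placeholders):

    class UserStoreV2:
        """Data-access layer that keeps the old v1 read interface alive
        on the new schema, so app code can be rolled back without
        touching the database."""

        def __init__(self, db):
            self.db = db

        def get_full_name(self, user_id: int) -> str:
            # The v1 interface, reconstructed from the v2 columns.
            row = self.db.execute(
                "SELECT first_name, last_name FROM users WHERE id = ?",
                (user_id,),
            ).fetchone()
            return f"{row[0]} {row[1]}"

        def backfill_missing(self) -> None:
            # On roll-forward, populate the new column for rows written
            # while v1 code was running, using the original migration's rule.
            self.db.execute(
                "UPDATE users SET last_name = '' WHERE last_name IS NULL"
            )
            self.db.commit()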

------
rickspencer3
Seems very relevant to many existing SaaS services today:

1. They are not doing CD, but they do deploy frequently.

2. They are not using K8s, or even immutable infrastructure, so far as I can
tell.

3. They have a lot of people involved in maintaining their deployment system.

4. Speaking as a user, I do not recall many significant outages, so on the
surface it seems that they have sufficient reliability.

Speaking as a heavy user of Kubernetes, evolving from an existing VM-based
application to something like what Slack is doing seems like it might be more
sensible than a "move everything to microservices and Kubernetes"
modernization strategy.

------
aledalgrande
A few questions I have left unanswered:

- Does the deploy commander create the hotfixes, or do the engineers who
authored the commits?

- It seems that the deployment is fully automated, but engineers still have to
be available in case of problems; does that impact productivity?

- "Once we are confident that core functionality is unchanged": is there a
particular metric to assert that?

- How long does a deployment take currently?

- Switching directories doesn't seem like a fully atomic operation yet; isn't
there a delay from loading the files, and wouldn't that generate 502s from the
service? Maybe it's better to create new instances with the new files and then
change the router to use those (blue-green)?

~~~
datasage
With PHP (what Slack was using at one point for some of the services; I think
everything uses Hack now, which may still maintain a similar model), switching
directories can be mostly atomic.

PHP-FPM with opcaching doesn't need to access files once all the opcodes are
cached (turn off file modification checks in production). When you move the
directory, you then restart the service.

Unless a request hits a file that is rarely used and not cached, you should
not receive any errors moving the directories.

~~~
aledalgrande
My point is that if there is any downtime for the switch, for example
restarting a service, it's not atomic. A small percentage of failed requests
can still be high in absolute terms for a company like Slack, so why not use a
paradigm [1] where you have an atomic switch? And also instant rollback.

[1]
[https://www.martinfowler.com/bliki/BlueGreenDeployment.html](https://www.martinfowler.com/bliki/BlueGreenDeployment.html)

~~~
lukevp
Nginx can hot-reload its config while running to point at a different
directory, or perhaps they're updating a symlink?

~~~
aledalgrande
Possible!

------
sandGorgon
I constantly wonder if all of this UI would be better expressed as a Slack
chat room (instead of a whole new UI).

Flowdock thought of this a long time ago:
[http://blog.flowdock.com/2014/11/11/chatops-devops-with-
hubo...](http://blog.flowdock.com/2014/11/11/chatops-devops-with-hubot/)

GitHub's Hubot is of course a modern interpretation of it, but I wonder why
chatops doesn't have the mindshare that gitops has.

Slack's deployment is human-driven. It's a natural fit for a chatops-style
model.

~~~
thanksforfish
> an engineer is designated as the deploy commander in charge of rolling out
> the new build to production.

When I last did ops we pushed automation and alerting hard, so the idea of
someone being formally assigned to a deployment is interesting. It sounds like
they have a ton of manual or semi-scripted steps. At some point, removing the
dedicated deploy commander and relying on alerting becomes helpful, although
exactly where that point is can be debated.

~~~
sandGorgon
I think the notion of a commander is a very interesting people-ops strategy.
It keeps a little element of subjectivity in things like when you kick off a
build, how long you run the integration/release process, etc.

You lead with automation, but the introduction of human subjectivity is a
low-overhead way to retain flexibility.

~~~
antpls
More likely, they have a few bugs here and there in the deployment tools that
require human supervision and intervention, and they don't have the resources
right now to fix them and make the process more reliable.

There is no need for flexibility in a repetitive process, unless there are
bugged edge cases.

------
circular_logic
Their deployment UI looks nice, but it feels like they reinvented the wheel
here in order to keep their in-place upgrade method over something such as
immutable infrastructure using pre-existing deployment systems.

I wonder if this was ruled out for some reason, or whether, for a large
company with people dedicated to deployment, it doesn't matter. As one
example, since they are on AWS, autoscaling groups with prebuilt AMIs could
have been used to roll new machines instead of copying files to the servers.

------
stopachka
This is very similar to the process FB had for years, with some caveats (prod
deploys once a week, handled by a central team).

I think this kind of process can serve a company well into the thousands of
engineers.

Great work.

------
MuffinFlavored
Do they use Kubernetes at Slack?

~~~
moondev
Doesn't seem like it, based on:

> Instead of pushing the new build to our servers using a sync script, each
> server pulls the build concurrently when signaled by a Consul key change.
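
A pull triggered that way might be sketched like this, using the python-consul
client (the key name and install handler are made up):

    import consul  # pip install python-consul

    def fetch_and_install(build_id: str) -> None:
        print(f"pulling build {build_id}")  # placeholder for the real pull

    def watch_deploys() -> None:
        c = consul.Consul()
        index = None
        while True:
            # Long-poll: blocks until the key's ModifyIndex changes.
            index, data = c.kv.get("deploy/current-build", index=index, wait="5m")
            if data is not None:
                fetch_and_install(data["Value"].decode())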

~~~
MuffinFlavored
Does that mean they are not even using containers?

~~~
Saaster
Plain EC2, backend in PHP.

~~~
echelon
> Plain EC2, backend in PHP.

That's slightly horrific. Weirdware NIH deploy system, no containers, PHP.

~~~
icedchai
If I were using PHP, I wouldn't use containers either. Just sync the latest
code over, change a symlink to the new build, done.
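
That symlink swap can be made genuinely atomic with rename(2), which is
essentially what Capistrano (mentioned below) automates. A sketch:

    import os

    def activate_release(release_dir: str,
                         current_link: str = "/srv/app/current") -> None:
        # Build the new link beside the old one, then rename over it.
        # rename(2) replaces the old symlink in a single step, so
        # requests never observe a missing or half-written link.
        tmp_link = current_link + ".tmp"
        os.symlink(release_dir, tmp_link)
        os.replace(tmp_link, current_link)  # atomic on POSIX

    # Rollback is the same call pointed at the previous release:
    # activate_release("/srv/app/releases/20200417120000")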

~~~
1f97
Capistrano is perfect for that. We use it for all our deployment needs and it
has been wonderful!

~~~
mcpeepants
Second this. I haven't needed Cap in a while for the work I'm doing, and I
don't often see it mentioned (perhaps because I'm not looking), but it's a
fantastic tool for managing atomic deployments.

------
freepor
Can anyone explain why they do 12 deploys a day? Are engineers pushing to
production as a way of iteratively testing a feature?

~~~
garethrowlands
They're not deploying untested software, if that's what you're asking. They
most likely simply deploy each change when it is ready, rather than building
up work-in-progress and deploying many changes at the same time. It's a lot
safer to change one thing at a time, see
[https://www.goodreads.com/en/book/show/35747076-accelerate](https://www.goodreads.com/en/book/show/35747076-accelerate).
Releasing changes as soon as they are ready can also enable them to gather
feedback faster - in this sense they would be iteratively 'testing' the
product.

------
mleonhard
12 deploys in an 8-hour window leaves only 40 min per deploy. Do they really
perform all of those steps in 40 minutes, or do they have multiple deployments
going at once (pipelining)?

~~~
GordonS
I was thinking the same, especially since they mention manual testing.

------
servercobra
How are deploy commanders chosen? On my (very small currently) team, the
person who is on-call is also our deploy commander, but it seems like you
might need something else for a larger team.

------
shusson
What happens after hot-fixing the release branch? Does the release branch get
merged back into master?

~~~
aledalgrande
GitHub, for example, only merges to master once a change is 100% deployed and
working. I like their workflow better.

------
riquito
HHVM, didn't expect this (PHP yes, HHVM not really).

------
ryanmarsh
Link 404s.

------
7ewis
This link has now been reposted 6 times in the past two weeks:

[https://news.ycombinator.com/item?id=22816645](https://news.ycombinator.com/item?id=22816645)

[https://news.ycombinator.com/item?id=22729766](https://news.ycombinator.com/item?id=22729766)

[https://news.ycombinator.com/item?id=22801191](https://news.ycombinator.com/item?id=22801191)

[https://news.ycombinator.com/item?id=22784712](https://news.ycombinator.com/item?id=22784712)

[https://news.ycombinator.com/item?id=22720028](https://news.ycombinator.com/item?id=22720028)

[https://news.ycombinator.com/item?id=22806810](https://news.ycombinator.com/item?id=22806810)

~~~
dang
That's an indicator of interest. I actually emailed one of the submitters to
repost the article for that reason. (Yes, we're thinking about software to
detect cases like this.)

On HN, a submission doesn't count as a dupe unless it has had significant
attention. This is in the FAQ:
[https://news.ycombinator.com/newsfaq.html](https://news.ycombinator.com/newsfaq.html).

~~~
7ewis
Fair enough. The only reason I noticed is that I was actually going to post
this link yesterday, but did a search to make sure I wasn't reposting.

Plus, they didn't get much traction anyway, so I wrongly assumed there wasn't
interest.

I know for the future now!

------
walrus01
<sarcasm>

How nice of them to volunteer 2% of their paid customer base as "canary"
without them specifically opting in to it, or perhaps even being aware.

</sarcasm>

Or perhaps they do it exclusively with the free service tier, which is much
more understandable.

~~~
friend-monoid
Seems reasonable to me? Better to deploy gradually in case the deploy is bad,
right?

~~~
toomuchtodo
If the users are aware and consent to being beta testers, versus what’s
already likely stable (caveat being when you’re rapidly pushing out a hotfix
because your last deploy broke something).

~~~
copecopecope
At some point a new build needs to roll out to production. There's always
going to be some risk that something goes wrong, so it's better to test with
2% of the population initially rather than 100%. By then, the build has
already gone through integration tests and dogfooding, so if something goes
wrong in the canary phase, it's generally due to some production environment
configuration issue.

~~~
toomuchtodo
Not disagreeing, simply stating users should be aware and get a say (an option
would be fine to opt in to early release access), especially if they’re a
paying customer.

~~~
copecopecope
I hear where you're coming from, but in my experience the canary phase usually
lasts less than an hour. And the traffic is usually split randomly, so the
same 2% of users aren't at elevated risk for every deployment. I don't know
how Slack does it, though.

