Deploys at Slack (slack.engineering)
360 points by michaeldeng18 on April 8, 2020 | 136 comments

Interested in how they handle DB updates/migrations (I don't know what Slack uses for data storage backend).

IMO those DB migrations are the most difficult/fraught with risk because you need to ensure that the different versions of the servers that are running as they are deploying can work with whatever state your DB is in at the moment.

Mostly MySQL that is moving to Vitess (transparently sharded MySQL). I believe they use gh-ost for migrations.

It’s almost a running joke that if a big or well known company blogs about their deploys, they won’t go into detail about databases.

It's always nice to see how other teams do it. Nothing too groundbreaking here but that's a good thing.

I did notice the screenshot of "Checkpoint", their deployment tracking UI. Are there solid open source or SaaS tools doing something similar? I've seen various companies build similar tools, but most deployment processes are consistent enough that a 3rd-party tool could be useful for most teams.

I've built that tool 2-3 times now. The issue is really the deploy function and what controls it. It's always a one-off, or so tightly integrated into the hosting environment, that reaching in with a SaaS product is somewhat difficult. That being said, the new lowest-common-denominator standards like K8s make it way easier. If anyone is interested in using a tool just leave a comment and I'll reach out.

Please provide a way for people to reach you without commenting here.

Just ping here for now. hello@hover.sh

Interested, especially in K8s based

For Kubernetes, there's this: https://github.com/lensapp/lens

This is super cool. I wonder if there's anything like this as a vscode plugin

Someone mentioned this in another thread: https://github.com/GoogleCloudPlatform/cloud-code-vscode


Sleuth is a SaaS deployment tracker that pulls deployments from source repositories, feature flags, and other sources, in addition to pushes via curl. You can see Sleuth used to, well, track Sleuth at https://app.sleuth.io/sleuth

[Disclaimer: am a Sleuth co-founder]

I can also recommend Sleuth. We use it at our company and the integration is very good. Their team is constantly working on new features, integrations and better UI.

Hi Don :)

Is it possible to view the page you linked without creating an account? It redirects me to your landing page.

Sorry about that, the live demo is at https://app.sleuth.io/sleuth/sleuth

> most deployment processes are consistent enough

Definitely disagree with this. I have never worked at two places with a similar enough deploy process that would benefit from a generic tool.

Sure, I see your point. I'd just like to see a pattern that works for most that could gain some traction. At the end of the day we're all trying to do the same thing (deploy high quality software), just in different ways. Deployment strategy shouldn't need to be a main competency of most teams.

We (Gumroad) open sourced ours: http://github.com/gumroad/wilfred

Here's what it looks like: https://twitter.com/shl/status/1128039742308737024/photo/2

I've never seen anything that could even remotely give us what we wanted. We ultimately decided to roll our own devops management platform in-house which was 100% focused on our specific needs. We are now on generation 4 of this system. We just rewrote our principal devops management interface using Blazor w/ Bootstrap4. The capabilities of the management system relative to each environment are fairly absolute - Build/Deploy/Tracing/Configuration/Reporting/etc. is all baked in. We can go from merging a PR in GitHub to a client system being updated with a fresh build of master in exactly 5 button clicks with our new system.

The central service behind the UI is a pure .NET Core solution which is responsible for executing the actual builds. The entire process is self-contained within the codebase itself. The contract enforcement you get when the application you are building and tracking is part of the same type system as the application building and tracking it is very powerful.

I'm curious what a Jenkins + Octopus system is missing that your system provides. Most companies would have a hard time justifying the expense to build a bespoke system just for devops.

Jenkins/octo, as tools, have their place but are just parts of the tooling you need when things go business critical or when teams scale up.

Most companies that run business critical services would be spending wisely putting effort down in building or customizing dev tooling and automations.

GitLab's pipelines and issues/merges UI are similar and open source.

Spinnaker does this - https://www.spinnaker.io/concepts/

This is part of what we're doing with Reliza Hub - https://relizahub.com (note, we're in a very early stage).

Apart from tracking deployments, we're really focused on tracking bills of materials and communication between Business and Tech teams.

I don't know if this will tick all of the boxes you need because it is primarily IAC, and is for k8s only afaik: https://www.pulumi.com

I think ArgoCD is close.

Fun to read, but there's a lack of detail here that I'd like to see. For example, this talks purely about code changes. However, sometimes a code change requires a database schema change (as mentioned above), different APIs to be used, etc. In the percentage based rollout where multiple versions are in use at once, how are these differences handled?

For database schema changes, here is the standard practice:

- You have version 1 of the software, supporting schema A.

- You deploy a version 2 supporting both schema A and new schema B. Both versions coexist until the deployment is complete and all version 1 instances are stopped. During all this time the database is still on schema A; this is fine because your instances, both version 1 and 2, support schema A.

- Now you do the schema upgrade. This is fine because your instances, now all running version 2, support schema B.

- At last, if you wish, you can now deploy a version 3, dropping the support for schema A.
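
To make the "version 2 supports both schemas" step concrete, here is a minimal sketch using SQLite. Everything in it is an assumption for illustration: the `users` table, the `name` column (schema A), the `display_name` column (schema B), and the helper names are hypothetical and not taken from Slack's setup.

```python
import sqlite3


def columns(conn, table):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def read_name_v2(conn, user_id):
    """Version 2 reader: works on schema A (name) and schema B (display_name)."""
    col = "display_name" if "display_name" in columns(conn, "users") else "name"
    row = conn.execute(
        f"SELECT {col} FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def migrate_a_to_b(conn):
    # Hypothetical schema upgrade, run only once every instance is on version 2.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = name")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'ada')")
    before = read_name_v2(conn, 1)   # database still on schema A
    migrate_a_to_b(conn)
    after = read_name_v2(conn, 1)    # database now on schema B
    return before, after
```

The same version 2 code returns the same value before and after the migration, which is what lets the schema upgrade happen independently of the code deploy. A version 3 would drop the fallback to `name`.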

We do it the other way (and I’ve always seen it done this way): database change is compatible with current code and new code. So deploy the database change, then deploy the code change. It usually allows you to rollback code changes.

This is generally harder to pull off though unless you do things like force all DB access to go through stored procedures.

And then you're really still pursuing the same strategy described above, except for your stored procedures instead of your app code.

My company uses HBase currently for things on premise and we're moving to a mix of psql and BigTable in GCP. This is how we do things except all of our "schemas" are defined by the client so we just have to make sure that serialization/deserialization works correctly. With psql we might have to figure out a migration strategy, but for now we'll just be using it to store raw bytes.

Easy: don't do that.

Always make your code compatible with the old and new schema. Migrate the database separately. Then after the migration, remove the code that supports the old schema.

I think every DB change should be done like you suggest. An example I worked on recently:

- migrate DB and create new field

- deploy code for writing into such field (not read yet), in parallel with old field

- backfill data migration for older records

- deploy code with feature flag to read new field in workflows, but still write to both fields

- switch read feature flag on

- make sure everything works for a few weeks

- switch write feature flag to only use new field
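
The steps above can be sketched roughly as follows. This is an in-memory illustration only, assuming two hypothetical flags (`read_new`, `write_old`) and a record with an `old_field` and a `new_field`; a real system would keep the flags in a feature-flag service and the fields in the database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flags:
    read_new: bool = False   # flipped on at the "switch read feature flag" step
    write_old: bool = True   # flipped off at the final step


@dataclass
class Record:
    old_field: Optional[str] = None
    new_field: Optional[str] = None


def write(record, value, flags):
    # Always write the new field; keep writing the old one until told otherwise.
    record.new_field = value
    if flags.write_old:
        record.old_field = value


def backfill(records):
    # One-off data migration for records created before the new field existed.
    for record in records:
        if record.new_field is None:
            record.new_field = record.old_field


def read(record, flags):
    return record.new_field if flags.read_new else record.old_field
```

Because both fields stay populated until the last step, turning `read_new` back off is a safe rollback at any point before `write_old` is disabled.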

I'm more curious about how DB rollbacks occur in situations where a PR changes DB and is then reverted.

It would be a good practice to first make a DB change alone, which is compatible with both old and new code, so you don't need rollbacks. Then separately deploy a code change.

Edit: also suggested by Martin Fowler https://www.martinfowler.com/bliki/BlueGreenDeployment.html

> Even strategies like parallel rsyncs had their limits.

They don't really go into detail as to what limitations they hit by pushing code to servers instead of pulling. Does anyone have any ideas as to what those might be? I can't think of any bottlenecks that wouldn't apply in both directions, and pushing is much simpler in my experience, but I've also never been involved with deployments at this scale.

I can't speak for Slack, but it's not unreasonable to believe that a single machine's available output bandwidth (~10-40Gbps) can be saturated during a deploy of ~GB to hundreds of machines. Pushing the package to S3 and fetching it back down lets the bandwidth get spread over more machines and over different network paths (e.g. in other data centers)
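
A back-of-the-envelope version of that argument, as a sketch: the 1 GB artifact and 500 machines below are made-up figures chosen to match the "~GB to hundreds of machines" description, and the calculation ignores protocol overhead, so it is a lower bound rather than a measurement.

```python
def push_seconds(artifact_gb, machines, uplink_gbps):
    """Lower bound on the time for one sender to push an artifact to every machine."""
    total_gigabits = artifact_gb * 8 * machines  # 8 bits per byte
    return total_gigabits / uplink_gbps


# Hypothetical fleet: 1 GB build, 500 servers.
slow = push_seconds(1, 500, 10)   # single 10 Gbps uplink
fast = push_seconds(1, 500, 40)   # single 40 Gbps uplink
```

With these assumed numbers the single sender needs several minutes at 10 Gbps just to move the bytes, whereas pulling from a distributed store lets each server's download proceed in parallel.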

We do it similarly except we push an image to a docker registry (backed by multi-region S3), then you can use e.g. ansible to pull it to 5, 10, 25, 100% of your machines. It "feels" like push though, except that you're staging the artifact somewhere. But when booting a new host it'll fetch it from the same place.

Considering they are not bringing machines out of rotation or draining connections in the example given with the errors, I assume that more than 10 machines produces too many errors or takes too long to have two versions of the code deployed, and wherever they pull from is not scalable. All those problems can be easily solved though.

I'm surprised at the 12 deployments per day, if that's truly to production. There's bugfixes etc., but feature wise Slack has been... let's say slow. Not Twitter slow, but still slow, in making any user visible changes.

Far too many people on HN seem to think the public facing code that we see is all that the engineering team in a large company works on. There's so much more to running a large SaaS business. If Slack is like all the other SaaS companies I've encountered they'll have dozens of internal apps for sales, comms, analytics, engineering, etc that they work on that people outside of the business never see[1]. Those all need developing and all need deploying.

[1] They might buy in solutions for some business functions like accounting, HR and support, but they'll still have tons of homegrown stuff. Every tech company does.

Lots of places do a lot of deploys but hide significant new features behind A/B testing and feature flags. So the two things are disconnected from each other.

User visible changes are dependent on the product development process rather than the rate of deploys. Whether you deploy 12 times a day or once a month, it's not like code is getting written any faster.

I wonder why they didn't evaluate at some point using an immutable infrastructure approach leveraging tools like Spinnaker to manage the deploy? They sure have the muscle and numbers to use it and even contribute to it actively, no? I mean, I know that deploying your software is usually something pretty tied to a specific engineering team but I really like the immutable approach and I was wondering why a company the size of Slack, born and grown in the "right" time, did not consider it.

I had similar thoughts when I read their article. Their atomic deploy problem would have completely disappeared had they gone with an immutable approach.

I'm kind of surprised they don't have a branch-based staging. Every place I've worked at has evolved in the direction of needing the ability to spin up an isolated staging environment that was based on specific tags or branches.

It’s become more common to eschew long-lived release branches for SaaS applications. For example: https://engineering.fb.com/web/rapid-release-at-massive-scal...

It's cool to see how big organizations have deployment setups, while it feels like there are not enough resources about how one should set up a deployment system for a new startup just in the beginning.

The setup I currently use is custom bash scripts setting up EC2 instances. Each instance installs a copy of the git repo(s), and runs a script to pull updates from production/staging branches, compiles a new build, replaces the binaries & frontend assets, then restarts the service, and sends a slack message with list of changes just deployed.

It works well enough for a startup with 2 engineers. However, I'd like to know what could be better? What could save my time from maintaining my own deployment system in AWS world, without investing days of resources to K8s?

You don't have to do a big-bang style Google thing. You can just invest in some continuous improvement over the next few years:

Iteration 0: What you have now.

Iteration 1: A build server builds your artifact, and your EC2 instances download the artifact from the build server.

Iteration 2: The build server builds the artifact and builds a container and pushes it to ECR. Your EC2 instances now pull the image into Docker and start it.

Iteration 3: You use ECS for basic container orchestration. Your build server instructs your ECS instances to download the image and run them, with blue-green deployments linked to your load balancer.

Iteration 4: You set up K8s and your build server instructs it to deploy.

I went in a similar trajectory, and I'm at iteration 3 right now, on the verge of moving to K8s.

It's your call on how long the timespan is here, and commercial pressures will drive it. It could be 6 months, it could be 3 years.

Thanks a lot for the answer.

For me, it feels a bit "wrong" to be building on each production server.

Firstly, production servers are usually "hardened", and only have installed what they need to run, reducing the attack surface as much as possible.

Secondly, for proprietary code, I don't want it on production servers.

But most importantly, I want a single, consistent set of build artifacts that can be deployed across the server/container fleet.

You can do this with CI/CD tools, such as Azure DevOps (my personal favourite), Github Actions, CircleCI, Jenkins and Appveyor.

The way it works is you set up an automated build pipeline, so when you push new code, it's built once, centrally, and the build output is made available as "build artifacts". In another pipeline stage, you can then push out the artifacts to your servers using various means (rsync, FTP, build agent, whatever), or publish them somewhere (S3, Docker Registry, whatever) where your servers can pull them from. You can have more advanced workflows, but that's the basic version.

Automate compilation on a buildserver and run tests on that, and if everything is ok, use the artifacts to push to your servers. This way you can guarantee that the code is tested and all running versions are from the same build environment.

Thanks for the answer.

If you make your application stateless and have it in a container then there are many managed services out there that can do this for you. For example, in AWS there is fargate and EKS.

No mention of feature toggles whatsoever. I guess that's why it took them a long time to fix the thing with the new WYSIWYG editor, where after 2 weeks or something, they offered a toggle for people to change back.

Does anyone know their reasoning behind not employing feature toggles? I would feel very slowed down if I didn't have the guarantee and confidence I could quickly roll back in the event of errors.

They had an undocumented feature toggle for that since day 1. A JavaScript snippet was posted on a thread here that reverted it to the old functionality. So they are using them but not always surfacing them.

Who said they don't use feature toggles? That is a separate concern from deployment. As far as I can tell you got mad about a feature in their UI and decided that implies something about their infrastructure with no actual evidence.

Nice write-up! It would be interesting, however, to get more details on what types of errors were caught in dogfooding, which made it to production, what kind of hotfixes have had to be made in the past, etc...

It's nice to know what Slack does to mitigate bugs in releases, but it would also be useful to know what kinds of bugs each step catches and what bugs still slip through.

How do they choose which shards are included in the first 10% canary group?

This is a tricky problem. It's tempting to include only small (less valuable) accounts in the first group. But some bugs only occur with large accounts, so you need some of those in the first 10%.

Many bugs affect only a small portion of customers. There are many categories. A canary becomes more effective when it includes members from each category. Example: account type, number of users, client type (web/ios/android/macos/windows/linux), client version, web browser type and version, ipv4/ipv6, vpn, TLS MITM proxy, language, timezone, payment currency, country, tax region, mobile service provider, etc.
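
One way to read "members from each category" is stratified sampling: group shards by category and take a slice of every group, never fewer than one. The sketch below assumes hypothetical shard records with `id`, `size`, and `client` keys; it is not a description of how Slack actually picks its canary shards.

```python
import random
from collections import defaultdict


def pick_canary(shards, fraction=0.10, seed=0):
    """Pick roughly `fraction` of shards, with at least one from every category."""
    rng = random.Random(seed)  # seeded so a given rollout is reproducible
    groups = defaultdict(list)
    for shard in shards:
        groups[(shard["size"], shard["client"])].append(shard["id"])

    chosen = []
    for members in groups.values():
        # Small categories would round down to zero, so force a minimum of one.
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return chosen
```

Adding more keys to the grouping tuple (client version, country, and so on) covers more categories, at the cost of the canary growing past the target fraction when there are many small groups.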

Interesting, last year I wrote a blog post on this subject and it seems pretty in line with Slack's approach :)

In regards to deployment monitoring, besides "error monitoring", I would also add "Health Monitoring" as valuable for early detection of deployment issues:

> In this line of monitoring we are interested in assuring that our application is performing as expected. First we define a set of system and business metrics that adequately represents the application behaviors. Then we start tracking these metrics, triggering an alert whenever one of them falls outside of its expected operational range. [1]

[1] https://thomasvilhena.com/2019/08/a-successful-deployment-mo...
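
As a small sketch of the "expected operational range" idea in the quote above: the metric names and ranges here are invented for illustration, and a real setup would pull values from a metrics system rather than a dict.

```python
def out_of_range(metrics, expected):
    """Return the names of metrics that fall outside their (low, high) range."""
    alerts = []
    for name, value in metrics.items():
        low, high = expected[name]
        if not (low <= value <= high):
            alerts.append(name)
    return alerts


# Hypothetical system and business metrics with their expected ranges.
EXPECTED = {
    "error_rate": (0.0, 0.01),
    "messages_per_min": (5, 500),
}
```

Running `out_of_range` on a fresh sample after each deploy step and halting the rollout on a non-empty result is one simple way to wire this into a percentage-based release.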

It's interesting that atomic deploys weren't in from the start. That was one of the few deployment practices we really insisted on from day one at my own businesses, if only because the uncertainty you get from trying to trace problems where your system isn't in any known state makes it all but impossible to work systematically.

A related challenge where we've never really found a good solution is how to handle deploying updates atomically when both code and data model are changing. That is, we need to migrate both our application software and our database schema in some co-ordinated way.

In practice, this usually ends up being done in multiple stages, where during some intermediate part of the process we are actively maintaining both the old and new database structure and running both versions of relevant code, at some point in the process there will be a bulk conversion of existing DB data that was only in the old format to the new one, and then hopefully at the end we switch to reading only the new version, retire the old code, and if necessary remove the old DB contents that are no longer in use. Even then we probably still want to keep an implementation of our previous data API available that is reverse engineering data from the new format, just in case we have to wind back the application code due to some other problem.

I got tired just writing that, and it feels similarly dirty actually deploying it. How is everyone else handling this? Has anyone found a satisfactory way to migrate code and data forwards, and if necessary backwards, without timing or data loss issues? Controlled deployments of application code seem to be largely a solved problem with modern tools and a bit of common sense, but the database side of things doesn't seem to be nearly as clean, at least not with any of the strategies I've encountered so far.

[Edit: I see that while I was writing this, someone else has already raised a similar point elsewhere in the discussion and a few people have replied, but unfortunately only along the lines I mentioned here as well. This does not make me optimistic about finding a cleaner strategy, but further comments are still welcome.]

I've never seen it solved. You either write and test migration scripts to roll it back or you restore from a backup. Idk what you do if you add a new column that's populated in the new version and you rollback. I guess this would be a good place to roll out as small a piece as you can and hope you don't find out it's busted a week later.

> Idk what you do if you add a new column that's populated in the new version and you rollback.

I think that one depends on what you are rolling back and whether you have your application code somewhat isolated from your underlying database via a well-defined API.

Assuming that you will at some point need to populate your new column for all your pre-existing records in some well-defined way, you can handle rolling back the application code as long as you have a version of the database API that still provides the interface the older application code requires. You might no longer be updating your new column with new data at that time, but the data you did get is still there, and when you later want to move your application code forward again you can populate the new column for any extra records that have been added to your database in the meantime just as you did on the initial migration.

Given the practicalities of a multi-step migration involving both application and database schema, you might already have the necessary extra code in your database API to support running old application code against the new database schema, and even to fill in any missing data for that extra column according to the same rules you used for migrating older data from before transition and ensure any new constraints are satisfied. So this way, you can wind back your application code but not damage your new database.

If for some reason the database schema itself needs to be rolled back, and you can't just fake it at the API level, things become a lot more difficult as you have potential data loss issues to contend with. Likewise if it's possible that the old application code would not maintain any new database records in a way that satisfies all required constraints and you can't handle that at the API. Fortunately, this doesn't seem to happen very often in practice.

Seems very relevant to many existing SaaS services today:

1. They are not doing CD, but they do deploy frequently.

2. They are not using K8s, or even immutable infrastructure, so far as I can tell.

3. They have a lot of people involved in maintaining their deployments system.

4. Speaking as a user, I do not recall many significant outages, so on the surface, it seems that they have sufficient reliability.

Speaking as a heavy user of Kubernetes, evolving from an existing VM-based application to something like what Slack is doing seems like it might be more sensible than a "move everything to microservices and Kubernetes" modernization strategy.

A few questions I have left unanswered:

- does the deploy commander create the hotfixes or the engineers who authored the commits?

- it seems that the deployment is fully automated, but engineers still have to be available in case of problems, does that impact productivity?

- "Once we are confident that core functionality is unchanged", is there a particular metric to assert that?

- how long does deployment take currently?

- switching directories doesn't seem like a fully atomic operation yet, isn't there a delay from loading the files and wouldn't that generate 502s from the service? Maybe it's better to create new instances with the new files and then change the router to use those (blue-green)?

With PHP (which Slack was using at one point for some of the services; I think everything uses Hack now, which may still maintain a similar model), switching directories can be mostly atomic.

PHP-FPM with opcaching doesn't need to access files once all the opcodes are cached (turn off file modification checks in production). When you move the directory, you will restart the service.

Unless a request hits a file that is rarely used and not cached, you should not receive any errors when moving the directories.

My point is that if there is any downtime for the switch, for example restarting a service, it's not atomic. A small percentage of failed requests can still be high in absolute terms for a company like Slack, so why not use a paradigm [1] where you have an atomic switch? And also instant rollback.

[1] https://www.martinfowler.com/bliki/BlueGreenDeployment.html

Nginx can hot reload a config file while running that’s pointed at a different directory, or perhaps they’re updating a symlink?
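
If it is the symlink approach, a common way to make the switch atomic on POSIX systems is to create the new link under a temporary name and rename it over the live one. The sketch below assumes a hypothetical `current` link and release directories; whether Slack does exactly this is a guess.

```python
import os


def switch_release(current_link, new_release_dir):
    """Point `current_link` at `new_release_dir` without a window where it is missing."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted switch
    os.symlink(new_release_dir, tmp_link)
    # rename(2) replaces the old link in a single step on POSIX, so readers
    # see either the old target or the new one, never a missing path.
    os.replace(tmp_link, current_link)
```

Rolling back is the same call with the previous release directory. Requests already in flight may still have resolved paths through the old target, which is where opcode caching and service restarts come back into the picture.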

I constantly wonder if all of this UI is better expressed as a slack chat room (instead of a whole new UI)

Flowdock thought of this a long time back - http://blog.flowdock.com/2014/11/11/chatops-devops-with-hubo...

GitHub's Hubot is of course a modern interpretation of it, but I wonder why chatops doesn't have the mindshare that gitops has.

Slack's deployment is human driven. It's a natural fit for a chatops style model.

> an engineer is designated as the deploy commander in charge of rolling out the new build to production.

When I last did ops we pushed the automation and alerting hard, so the idea of someone being formally assigned to a deployment is interesting. This sounds like they have a ton of manual or semi scripted steps. At some point, removing the dedicated deployment commander and relying on alerting is helpful, although preference of where that point is can be debated.

i think the notion of a commander is a very interesting people-ops strategy. it keeps the little element of subjectivity in things like - when do you kick off a build, how long do you run the integration/release process, etc

You do lead with automation, but the introduction of human subjectivity is a low-overhead way to still have flexibility.

More likely, they have a few bugs here and there in the deployment tools that require human supervision and intervention, and they don't have the resources right now to fix them and to make it more reliable.

There is no need for flexibility in a repetitive process, unless there are bugged edge cases

Their deployment UI looks nice but this feels like they made their own wheel here in order to keep their in-place upgrade method over something such as immutable infrastructure using pre-existing deployment systems.

I wonder if this was ruled out for some reason or perhaps for a large company with people dedicated to deploying this doesn't matter. One example: as they are on AWS, autoscaling groups with prebuilt AMIs could have been used to roll new machines instead of copying files to the server.

This is very similar to the process fb had for years. With some caveats (prod deploys once a week, handled by a central team)

I think this kind of process can last a company well into the thousands of engineers.

Great work

Do they use Kubernetes at Slack?

Doesn't seem like it based on

> Instead of pushing the new build to our servers using a sync script, each server pulls the build concurrently when signaled by a Consul key change.

does that mean they are not even using containers?

Plain EC2, backend in PHP.

> Plain EC2, backend in PHP.

That's slightly horrific. Weirdware NIH deploy system, no containers, PHP.

> That's slightly horrific.

I'd argue that the contemporary infatuation with mastery of complex toolchains as being the only possible solution to modern technical problems is far more horrific.

Smart businesses focus on simple, effective solutions and avoid hiring engineers who obsess with rewriting everything using the latest over-hyped technology.

Why so? They just haven't jumped on the meme-tech train, at least wrt to this setup.

Running an actual process on an actual server has been around since time immemorial, as has doing the "atomic" deploy thing (which I'm guessing is just updating a symlink from cold to hot).

The approach is refreshingly sane.

You would be surprised at how 99.9% of the companies work. Including a lot of departments inside Google, Amazon and Netflix.

Maybe you were being sarcastic and I fell for it since you stressed it a bit too much.

Look at almost all successful companies and they're pretty "boring" under the hood: GitHub and Stripe are Ruby on Rails, Facebook is PHP, Google is Java (still almost all new code written in 2020)... it gets the job done. Yes, they do some optimizations (HHVM etc.), but nobody is considering rewriting FB, Stripe or Google services in the language du jour.

The complexity of the infra and deployments is always relative to the size of the company, and no two companies are alike there. Small or big, it's all bespoke. Even if a few pieces are shared as open source projects, there's a veritable iceberg of complexity in the form of in-house knowledge and tooling in each of the companies. There is nothing even close to a standard deployment system in either greenfield startups or FAANGs today.

If I was using PHP, I wouldn't use containers either. Just sync the latest code over, change a sym link to the new build, done.

That seems a little bit simplistic for today's workflow, as there's a chance you'll need to restart php-fpm anyway, discard/refresh some cache (Doctrine metadata ...), maybe update your composer / vendor directory and its autoloading files, maybe run db migrations and more.

capistrano is perfect for that. we use it for all our deployment needs and it has been wonderful!

second this. I haven't needed cap in a while for the work I'm doing, and I don't often see it mentioned (perhaps because I'm not looking) but it's a fantastic tool for managing atomic deployment.

And yet...it just works. And scales!

The only horrific thing is that you think that.

From the article it seems their deployment relies on moving application binaries to an installation directory within the VM instead of running container images.

Can anyone explain why they do 12 deploys a day? Are engineers pushing to production as a way of iteratively testing a feature?

They're not deploying untested software, if that's what you're asking. They most likely simply deploy each change when it is ready, rather than building up work-in-progress and deploying many changes at the same time. It's a lot safer to change one thing at a time, see https://www.goodreads.com/en/book/show/35747076-accelerate. Releasing changes as soon as they are ready can also enable them to gather feedback faster - in this sense they would be iteratively 'testing' the product.

Multiple teams working on different features rolling their local commits into release branches - and rather than feature related deploys they do time-boxed ones by the looks of it so that they have dedicated support on hand to spot issues and rollback immediately (guessing from what I can see based on the screenshot).

Easy to get that many deploys out the door if you have a managed process like this - fast iteration, lots of different feature bumps and tweaks, different locales, updating even 1/2 links or words in a hardcoded page...

12 deploys in an 8-hour window is only 40 min per deploy. Do they really perform all of those steps in 40 mins, or do they have multiple deployments going at once (pipelining)?

I was thinking the same, especially since they mention manual testing.

How are deploy commanders chosen? On my (very small currently) team, the person who is on-call is also our deploy commander, but it seems like you might need something else for a larger team.

What happens after hot-fixing the release branch? Does the release branch get merged back into master?

Github only merges to master when it's 100% deployed and working for example. I like their workflow better.

Usually you fix master first then cherrypick the fix to the release branch.

HHVM, didn't expect this (PHP yes, HHVM not really).

Link 404’s

That's an indicator of interest. I actually emailed one of the submitters to repost the article for that reason. (Yes, we're thinking about software to detect cases like this.)

On HN, a submission doesn't count as a dupe unless it has had significant attention. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.

Fair enough, only reason I noticed is because I was actually going to post this link yesterday but did a search to make sure I wasn't reposting.

Plus they didn't get much traction anyway, so I wrongly assumed there wasn't interest.

Know for the future now!

So? Guidelines don't explicitly say this behavior is unallowed. https://news.ycombinator.com/newsguidelines.html

Sometimes posts that deserve to be on the front page don't make it. Seems fine to repost periodically as long as you aren't spamming many times per day.

None of them garnered any traction/comments, in case others were looking for that.

And yet it has only reached the front page once. Every article you see on top has been posted multiple times in order to get there. That is how online voting/aggregation systems work.

> Every article you see on top has been posted multiple times in order to get there.

This isn't true at all, for the record.


How nice of them to volunteer 2% of their paid customer base as "canary" without them specifically opting in to it, or perhaps even being aware.


Or perhaps they do it exclusively with the free service tier, which is much more understandable.

Anecdotally I usually see slack changes in my free tier channels a good week before paid tier ones so it wouldn't surprise me.

2% chance of being in the canary × 20% chance it breaks × 15 minutes expected to roll back.

Expect 3.6 seconds of outage per user per release.
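The estimate above works out as follows. This is only a back-of-the-envelope sketch: the 2%, 20%, and 15-minute inputs are the commenter's guesses, not numbers published by Slack.

```python
# All three inputs are guesses taken from the comment, not measured values.
P_CANARY = 0.02        # chance a given user is in the canary group
P_BREAK = 0.20         # chance a given release is bad
ROLLBACK_MINUTES = 15  # time until a bad release is rolled back

def expected_outage_seconds(p_canary, p_break, rollback_minutes):
    """Expected outage per user per release, in seconds."""
    return p_canary * p_break * rollback_minutes * 60

print(round(expected_outage_seconds(P_CANARY, P_BREAK, ROLLBACK_MINUTES), 1))
```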

It’s fine.

What I’d like you to get behind is disabling Windows Update. THAT thing is a menace.

Seems reasonable to me? Better to deploy gradually in case the deploy is bad, right?

Tangential, but why do companies continually misuse verbs as nouns?

Nothing is gained by saying 'deploys' instead of 'deployments' but instead confusion can be introduced.

See also 'what is the ask' and 'minimum spend'.

The gain is 1 syllable

If the users are aware and consent to being beta testers, versus what’s already likely stable (caveat being when you’re rapidly pushing out a hotfix because your last deploy broke something).

At some point a new build needs to roll out to production. There's always going to be some risk that something goes wrong, so better to test with 2% of the population initially rather than 100%. By then, the build has already gone through integration tests/dog-fooding, so if something goes wrong in the canary phase, it's generally due to some production environment configuration issue.

Not disagreeing, simply stating users should be aware and get a say (an option would be fine to opt in to early release access), especially if they’re a paying customer.

I hear where you're coming from, but from my experience, the canary phase usually lasts less than an hour. And the traffic is usually split randomly, so the same 2% of users aren't at elevated risk for every deployment. I don't know how Slack does it, though.
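One common way to get that kind of per-deployment random split, sketched under assumptions: hash the user ID together with a deploy ID so the canary group reshuffles on every release. The function name, the ID formats, and the hashing scheme are all hypothetical; nothing here is taken from Slack's tooling.

```python
import hashlib

def in_canary(user_id: str, deploy_id: str, percent: float = 2.0) -> bool:
    """Return True if this user falls in the canary group for this deploy.

    Hypothetical scheme: mixing deploy_id into the hash means a different
    ~2% of users is picked for each release.
    """
    digest = hashlib.sha256(f"{deploy_id}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # 2.0% -> buckets 0..199

users = [f"user-{i}" for i in range(20_000)]
canary_a = {u for u in users if in_canary(u, "deploy-a")}
canary_b = {u for u in users if in_canary(u, "deploy-b")}
print(len(canary_a), len(canary_b), len(canary_a & canary_b))
```

The assignment is deterministic for a given deploy, so a user does not flip between builds mid-rollout, but the overlap between two deploys' canary groups should be small.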

They aren't beta testers. They are still getting the real production build, just in the first step of a phased rollout. Beta, pre-release etc. have a very different meaning.

Link doesn’t work for me right now so I haven’t read the article, but usually beta testing precedes canary deploys. Maybe this is different.

If it’s canary, you don’t trust it fully, no? Tests can pass and you still end up munging data or the user experience.

Do you ever "fully" trust any software before it's been released to its final environment and run under load? I don't.

The idea isn't that you release less-tested software because you have the canary as a safety net. The idea is that you put in place all of the other practices you would anyway to minimise the likelihood of bugs and mistakes, and then you add a canary rollout as one extra layer of protection to mitigate the damage of anything you missed.

I would look at it as 98% of the users getting an even more reliable experience than they would otherwise (per release; everyone benefits over time), rather than 2% being given a worse experience. The alternative is just that everyone is in the "canary" release and everyone has to immediately use the release you "don't fully trust".
