IMO those DB migrations are the most difficult/fraught with risk because you need to ensure that the different versions of the servers that are running as they are deploying can work with whatever state your DB is in at the moment.
I did notice the screenshot of "Checkpoint", their deployment tracking UI. Are there solid open source or SaaS tools doing something similar? I've seen various companies build similar tools, but most deployment processes seem consistent enough that a 3rd-party tool could be useful for most teams.
[Disclaimer: am a Sleuth co-founder]
Hi Don :)
Definitely disagree with this. I have never worked at two places with deploy processes similar enough to benefit from a generic tool.
Here's what it looks like: https://twitter.com/shl/status/1128039742308737024/photo/2
The central service behind the UI is a pure .NET Core solution that is responsible for executing the actual builds. The entire process is self-contained within the codebase itself. The contract enforcement you get is very powerful when the application you are building and tracking is part of the same type system as the application building and tracking it.
Most companies that run business-critical services would be spending wisely by putting effort into building or customizing dev tooling and automation.
Apart from tracking deployments, we're really focused on tracking bills of materials and communication between Business and Tech teams.
And then you're really still pursuing the same strategy described above, except for your stored procedures instead of your app code.
Always make your code compatible with the old and new schema. Migrate the database separately. Then after the migration, remove the code that supports the old schema.
- migrate DB and create new field
- deploy code that writes to the new field (but doesn't read it yet), in parallel with the old field (see the sketch at the end of this comment)
- backfill data migration for older records
- deploy code with feature flag to read new field in workflows, but still write to both fields
- switch read feature flag on
- make sure everything works for a few weeks
- switch write feature flag to only use new field
Edit: also suggested by Martin Fowler https://www.martinfowler.com/bliki/BlueGreenDeployment.html
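A rough sketch of what steps like these can look like in application code. The connection handle, flag store, and column names below are hypothetical, not anything Slack-specific:

    # Hypothetical illustration of the dual-write / flagged-read steps above.
    # `db` is a DB-API connection (e.g. psycopg2); flags and columns are made up.

    FLAGS = {"read_new_email": False, "write_old_email": True}

    def save_email(db, user_id, email):
        cur = db.cursor()
        # Always write the new column once it exists (step 2), and keep
        # writing the old one until the write flag is flipped (last step).
        cur.execute("UPDATE users SET email_v2 = %s WHERE id = %s", (email, user_id))
        if FLAGS["write_old_email"]:
            cur.execute("UPDATE users SET email = %s WHERE id = %s", (email, user_id))
        db.commit()

    def load_email(db, user_id):
        cur = db.cursor()
        # Reads go through their own flag so the switch can be rolled back
        # independently of the schema change.
        column = "email_v2" if FLAGS["read_new_email"] else "email"
        cur.execute(f"SELECT {column} FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()[0]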
They don't really go into detail as to what limitations they hit by pushing code to servers instead of pulling. Does anyone have any ideas as to what those might be? I can't think of any bottlenecks that wouldn't apply in both directions, and pushing is much simpler in my experience, but I've also never been involved with deployments at this scale.
 They might buy in solutions for some business functions like accounting, HR and support, but they'll still have tons of homegrown stuff. Every tech company does.
The setup I currently use is custom bash scripts setting up EC2 instances. Each instance installs a copy of the git repo(s) and runs a script that pulls updates from the production/staging branches, compiles a new build, replaces the binaries & frontend assets, restarts the service, and sends a Slack message with the list of changes just deployed.
It works well enough for a startup with 2 engineers. However, I'd like to know what could be better. What could save me the time of maintaining my own deployment system in the AWS world, without investing days of resources in K8s?
Iteration 0: What you have now.
Iteration 1: A build server builds your artifact, and your EC2 instances download the artifact from the build server.
Iteration 2: The build server builds the artifact and builds a container and pushes it to ECR. Your EC2 instances now pull the image into Docker and start it.
Iteration 3: You use ECS for basic container orchestration. Your build server instructs ECS to pull the image and run it, with blue-green deployments linked to your load balancer (sketch below).
Iteration 4: You set up K8s and your build server instructs it to deploy.
I followed a similar trajectory, and I'm at iteration 3 right now, on the verge of moving to K8s.
It's your call on how long the timespan is here, and commercial pressures will drive it. It could be 6 months, it could be 3 years.
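For iteration 3, the "build server instructs ECS" step can be as small as registering a new task definition revision that points at the new image and updating the service to it. A minimal boto3 sketch — the cluster, service, family, and image names are placeholders for your own:

    import boto3

    ecs = boto3.client("ecs")

    def deploy(image_uri: str) -> None:
        # Register a new revision of the task definition pointing at the new image.
        task_def = ecs.register_task_definition(
            family="myapp",
            containerDefinitions=[{
                "name": "myapp",
                "image": image_uri,  # e.g. an ECR image tagged with the git sha
                "memory": 512,
                "portMappings": [{"containerPort": 8080}],
            }],
        )["taskDefinition"]["taskDefinitionArn"]

        # Point the service at the new revision; ECS rolls tasks behind the
        # load balancer according to the service's deployment configuration.
        ecs.update_service(cluster="my-cluster", service="myapp", taskDefinition=task_def)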
Firstly, production servers are usually "hardened", and only have installed what they need to run, reducing the attack surface as much as possible.
Secondly, for proprietary code, I don't want it on production servers.
But most importantly, I want a single, consistent set of build artifacts that can be deployed across the server/container fleet.
You can do this with CI/CD tools, such as Azure DevOps (my personal favourite), Github Actions, CircleCI, Jenkins and Appveyor.
The way it works is you set up an automated build pipeline, so when you push new code, it's built once, centrally, and the build output is made available as "build artifacts". In another pipeline stage, you can then push out the artifacts to your servers using various means (rsync, FTP, build agent, whatever), or publish them somewhere (S3, Docker Registry, whatever) where your servers can pull them from. You can have more advanced workflows, but that's the basic version.
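A minimal sketch of the "build once, publish, pull" flow, assuming the artifact goes to S3 (bucket, key, and app names are made up). Recording a checksum at build time is what guarantees every server runs the same bits:

    import hashlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-build-artifacts"  # hypothetical bucket

    def publish(path: str, version: str) -> str:
        # Run on the build server: upload one artifact keyed by version.
        sha = hashlib.sha256(open(path, "rb").read()).hexdigest()
        key = f"myapp/{version}/app.tar.gz"
        s3.upload_file(path, BUCKET, key, ExtraArgs={"Metadata": {"sha256": sha}})
        return key

    def fetch(key: str, dest: str) -> None:
        # Run on each server: pull exactly the artifact the build produced
        # and verify it against the checksum recorded at build time.
        s3.download_file(BUCKET, key, dest)
        expected = s3.head_object(Bucket=BUCKET, Key=key)["Metadata"]["sha256"]
        assert hashlib.sha256(open(dest, "rb").read()).hexdigest() == expected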
Does anyone know their reasoning behind not employing feature toggles? I would feel very slowed down if I didn't have the guarantee and confidence that I could quickly roll back in the event of errors.
It's nice to know what Slack does to mitigate bugs in releases, but it would also be useful to know what kinds of bugs each step catches and what bugs still slip through.
This is a tricky problem. It's tempting to include only small (less valuable) accounts in the first group. But some bugs only occur with large accounts, so you need some of those in the first 10%.
Many bugs affect only a small portion of customers. There are many categories. A canary becomes more effective when it includes members from each category. Example: account type, number of users, client type (web/ios/android/macos/windows/linux), client version, web browser type and version, ipv4/ipv6, vpn, TLS MITM proxy, language, timezone, payment currency, country, tax region, mobile service provider, etc.
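A toy sketch of that idea: guarantee at least one canary account per observed category value, then top up to the target fraction at random (attribute names are illustrative):

    import random
    from collections import defaultdict

    def pick_canary(accounts, fraction=0.02, categories=("plan", "client", "country")):
        # Toy stratified canary pick: at least one account per observed
        # category value, then fill the rest of the target at random.
        chosen = set()
        for cat in categories:
            by_value = defaultdict(list)
            for account in accounts:
                by_value[account[cat]].append(account["id"])
            for ids in by_value.values():
                chosen.add(random.choice(ids))
        target = max(len(chosen), int(len(accounts) * fraction))
        remaining = [a["id"] for a in accounts if a["id"] not in chosen]
        extra = min(target - len(chosen), len(remaining))
        chosen.update(random.sample(remaining, extra))
        return chosen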
In regards to deployment monitoring, besides "error monitoring", I would also add "Health Monitoring" as valuable for early detection of deployment issues:
> In this line of monitoring we are interested in assuring that our application is performing as expected. First we define a set of system and business metrics that adequately represents the application behaviors. Then we start tracking these metrics, triggering an alert whenever one of them falls outside of its expected operational range. 
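In its simplest form that's a table of metrics with expected operational ranges and an alert whenever a reading falls outside them. A toy sketch with made-up metric names and thresholds:

    # Toy health-monitoring check: metric names and ranges are made up.
    EXPECTED_RANGES = {
        "checkout_success_rate": (0.97, 1.0),  # business metric
        "p95_latency_ms":        (0, 800),     # system metric
        "signups_per_minute":    (5, 500),     # catches "no errors, but nothing works"
    }

    def check_health(readings: dict) -> list:
        # Return an alert for any metric outside its expected operational range.
        alerts = []
        for metric, (low, high) in EXPECTED_RANGES.items():
            value = readings.get(metric)
            if value is None or not (low <= value <= high):
                alerts.append(f"ALERT: {metric}={value} outside [{low}, {high}]")
        return alerts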
A related challenge where we've never really found a good solution is how to handle deploying updates atomically when both code and data model are changing. That is, we need to migrate both our application software and our database schema in some co-ordinated way.
In practice, this usually ends up being done in multiple stages. During some intermediate part of the process we are actively maintaining both the old and new database structures and running both versions of the relevant code; at some point there is a bulk conversion of the existing DB data that was only in the old format to the new one; and then, hopefully, at the end we switch to reading only the new version, retire the old code, and if necessary remove the old DB contents that are no longer in use. Even then we probably still want to keep an implementation of our previous data API available that reverse-engineers data from the new format, just in case we have to wind back the application code due to some other problem.
I got tired just writing that, and it feels similarly dirty actually deploying it. How is everyone else handling this? Has anyone found a satisfactory way to migrate code and data forwards, and if necessary backwards, without timing or data loss issues? Controlled deployments of application code seem to be largely a solved problem with modern tools and a bit of common sense, but the database side of things doesn't seem to be nearly as clean, at least not with any of the strategies I've encountered so far.
[Edit: I see that while I was writing this, someone else has already raised a similar point elsewhere in the discussion and a few people have replied, but unfortunately only along the lines I mentioned here as well. This does not make me optimistic about finding a cleaner strategy, but further comments are still welcome.]
I think that one depends on what you are rolling back and whether you have your application code somewhat isolated from your underlying database via a well-defined API.
Assuming that you will at some point need to populate your new column for all your pre-existing records in some well-defined way, you can handle rolling back the application code as long as you have a version of the database API that still provides the interface the older application code requires. You might no longer be updating your new column with new data at that time, but the data you did get is still there, and when you later want to move your application code forward again you can populate the new column for any extra records that have been added to your database in the meantime just as you did on the initial migration.
Given the practicalities of a multi-step migration involving both application and database schema, you might already have the necessary extra code in your database API to support running old application code against the new database schema, and even to fill in any missing data for that extra column according to the same rules you used for migrating older data from before the transition, ensuring any new constraints are satisfied. This way, you can wind back your application code without damaging your new database.
If for some reason the database schema itself needs to be rolled back, and you can't just fake it at the API level, things become a lot more difficult as you have potential data loss issues to contend with. Likewise if it's possible that the old application code would not maintain any new database records in a way that satisfies all required constraints and you can't handle that at the API. Fortunately, this doesn't seem to happen very often in practice.
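One way to make the "fake it at the API level" idea concrete is a small compatibility shim: the old data-access functions keep their signatures but are implemented against the new schema, deriving the new column the same way the bulk migration did so constraints stay satisfied while old application code is still running. A hypothetical sketch with invented column names:

    # Hypothetical compatibility shim: old application code keeps calling
    # save_address(street, city), but the table now also stores a combined
    # `full_address` column with a NOT NULL constraint.

    def save_address(db, user_id, street, city):
        cur = db.cursor()
        # Derive the new column the same way the bulk migration did, so data
        # written via the old interface still satisfies the new constraints.
        full_address = f"{street}, {city}"
        cur.execute(
            "UPDATE users SET street = %s, city = %s, full_address = %s WHERE id = %s",
            (street, city, full_address, user_id),
        )
        db.commit()

    def load_address(db, user_id):
        cur = db.cursor()
        # Old interface, served from the new schema.
        cur.execute("SELECT street, city FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()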
Speaking as a heavy user of Kubernetes, evolving from an existing VM-based application to something like what Slack is doing seems like it might be more sensible than a "move everything to microservices and Kubernetes" modernization strategy.
- does the deploy commander create the hotfixes or the engineers who authored the commits?
- it seems that the deployment is fully automated, but engineers still have to be available in case of problems, does that impact productivity?
- "Once we are confident that core functionality is unchanged", is there a particular metric to assert that?
- how long does deployment take currently?
- switching directories doesn't seem like a fully atomic operation yet; isn't there a delay from loading the files, and wouldn't that generate 502s from the service? Maybe it's better to create new instances with the new files and then change the router to use those (blue-green)?
PHP-FPM with opcaching doesn't need to access files once all the opcodes are cached (turn off file modification checks in production). When you move the directory, you will restart the service.
Unless a request hits a file that is rarely used and not cached, you should not receive any errors when moving the directories.
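The usual trick for making the switch (near-)atomic is to repoint a `current` symlink at the new release directory with a rename, then reload PHP-FPM so fresh workers pick up the new code. A sketch, assuming release directories named by build id and a systemd unit called php-fpm:

    import os
    import subprocess

    def activate(release_dir: str, current_link: str = "/srv/app/current") -> None:
        # Atomically repoint the `current` symlink at the new release directory,
        # then reload PHP-FPM so new workers pick up the new code.
        tmp_link = current_link + ".tmp"
        if os.path.lexists(tmp_link):
            os.remove(tmp_link)
        os.symlink(release_dir, tmp_link)
        # rename() over an existing symlink is atomic on POSIX filesystems:
        # requests see either the old tree or the new one, never a partial copy.
        os.rename(tmp_link, current_link)
        subprocess.run(["systemctl", "reload", "php-fpm"], check=True)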
Flowdock thought of this a long time back - http://blog.flowdock.com/2014/11/11/chatops-devops-with-hubo...
Github Hubot is of course a modern interpretation of it... but I wonder why chatops doesn't have the mindshare that gitops has.
Slack's deployment is human driven. It's a natural fit for a chatops style model.
When I last did ops we pushed the automation and alerting hard, so the idea of someone being formally assigned to a deployment is interesting. This sounds like they have a ton of manual or semi-scripted steps. At some point, removing the dedicated deployment commander and relying on alerting is helpful, although where exactly that point lies can be debated.
You do lead with automation, but the introduction of human subjectivity is a low-overhead way to still have flexibility.
There is no need for flexibility in a repetitive process, unless there are bugged edge cases
I wonder if this was ruled out for some reason, or perhaps it doesn't matter for a large company with people dedicated to deploying. One example: since they are on AWS, autoscaling groups with prebuilt AMIs could have been used to roll new machines instead of copying files to the servers.
I think this kind of process can last a company well into the thousands of engineers.
> Instead of pushing the new build to our servers using a sync script, each server pulls the build concurrently when signaled by a Consul key change.
That's slightly horrific. Weirdware NIH deploy system, no containers, PHP.
I'd argue that the contemporary infatuation with mastery of complex toolchains as being the only possible solution to modern technical problems is far more horrific.
Smart businesses focus on simple, effective solutions and avoid hiring engineers who obsess with rewriting everything using the latest over-hyped technology.
Running an actual process on an actual server has been around since time immemorial, as has doing the "atomic" deploy thing (which I'm guessing is just updating a symlink from cold to hot).
The approach is refreshingly sane.
You would be surprised at how 99.9% of the companies work. Including a lot of departments inside Google, Amazon and Netflix.
Maybe you were being sarcastic and I fell for it since you stressed it a bit too much.
The complexity of the infra and deployments are always relative to the size of the company, and no two companies are alike there. Small or big, it's all bespoke. Even if a few pieces are shared as open source projects, there's a veritable iceberg of complexity in the form of inhouse knowledge and tooling in each of the companies, there is nothing even close to a standard deployment system in either green field startups or FAANGs today.
Easy to get that many deploys out the door if you have a managed process like this - fast iteration, lots of different feature bumps and tweaks, different locales, updating even 1 or 2 links or words in a hardcoded page...
On HN, a submission doesn't count as a dupe unless it has had significant attention. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.
Plus they didn't get much traction anyway, so I wrongly assumed there wasn't interest.
Know for the future now!
Sometimes posts that deserve to be on the front page don't make it. Seems fine to repost periodically as long as you aren't spamming many times per day.
This isn't true at all, for the record.
How nice of them to volunteer 2% of their paid customer base as "canary" without them specifically opting in to it, or perhaps even being aware.
Or perhaps they do it exclusively with the free service tier, which is much more understandable.
Expect 3.6 seconds of outage per user per release.
What I’d like you to get behind is disabling Windows Update. THAT thing is a menace.
Nothing is gained by saying 'deploys' instead of 'deployments'; it only introduces confusion.
See also 'what is the ask' and 'minimum spend'.
The idea isn't that you release less-tested software because you have the canary as a safety net. The idea is that you put in place all of the other practices you would anyway to minimise the likelihood of bugs and mistakes, and then you add a canary rollout as one extra layer of protection to mitigate the damage of anything you missed.
I would look at it as 98% of the users getting an even more reliable experience than they would otherwise (per release; everyone benefits over time), rather than 2% being given a worse experience. The alternative is just that everyone is in the "canary" release and everyone has to immediately use the release you "don't fully trust".