Close. The scenario we want to minimize is faulty code on the main branch. As your team grows and the number of commits goes up, it becomes a game of chance. Sooner or later something will get through. The more new teammates you have, the more often that will happen.
This is an inescapable cost of growth. The cost of promoting people to management. The cost of starting new projects. Occasionally you can avoid it as a cost of turnover, but you will have turnover at some point.
What matters most is how long the code is "broken" (including false positives) before it is identified, mitigated, and fully corrected. The amount of work you can do to keep these numbers relatively stable in the face of change is profound.
If you insist on no errors on master ever you will kill throughput. You will create situations where the only failures are big ones, which is neck deep in the philosophy that CI rejects: that problems are to be avoided instead of embraced and conquered.
There are a large number of automated tools which will help you prevent merging code that could break master: https://github.com/chdsbd/kodiak#prior-art--alternatives. The basic approach is to make a new branch from master, apply one or more commits on top of that branch, run the tests, and if the tests pass, merge those commits (with fast-forward) back onto master. This makes it very difficult to get broken commits onto master, since they have to pass the tests first. It is possible if you have a flaky test suite, but in my experience it happens very rarely, and is usually very easy to fix if something creeps in. In my experience, these tools speed up throughput rather than slow it down, especially when you account for the disruption that merging broken code to master can cause.
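The flow above can be sketched as a toy model (not any specific tool's implementation; branch state is just a list of commits, and `run_tests` stands in for the real suite):

```python
# Toy model of the "test a candidate branch, then fast-forward" flow:
# build a candidate from master plus the new commits, run the tests,
# and only advance master if the tests pass.

def merge_if_green(master, commits, run_tests):
    """Apply commits on a candidate built from master; merge only if green."""
    candidate = master + commits      # fresh branch from master + the commits
    if run_tests(candidate):
        master[:] = candidate         # fast-forward: master now includes them
        return True
    return False                      # broken commits never reach master

# Usage: a "test suite" that rejects any branch containing a bad commit.
master = ["a", "b"]
ok = merge_if_green(master, ["c"], lambda branch: "bad" not in branch)
rejected = merge_if_green(master, ["bad"], lambda branch: "bad" not in branch)
```

The invariant is the point: master only ever moves to a state that has already passed the tests.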
https://graydon2.dreamwidth.org/1597.html has a good discussion of this:
The Not Rocket Science Rule Of Software Engineering:
automatically maintain a repository of code that always passes all the tests
In GitLab we made merge trains https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipeli... to solve this problem automatically.
With merge trains, a merge request with a passing feature branch is placed in a queue, and tests are run against the combination of that branch and all the branches queued before it. Since tests pass 95%+ of the time once the feature branch itself passes, this can speed up the rate of merges into master by 10x or more.
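A rough sketch of the train idea (a toy model of the queueing logic, not GitLab's implementation; in the real feature a failure also triggers re-running the pipelines of the requests behind it, which is omitted here):

```python
# Toy merge train: each queued change is tested against master plus every
# change ahead of it in the train, so passing changes can land in order
# without each one waiting for a full serial test run on master.

def run_merge_train(master, queue, run_tests):
    """Merge queued changes in order; each is tested on top of all prior ones."""
    train = list(master)
    merged = []
    for change in queue:
        candidate = train + [change]  # master + earlier train members + this MR
        if run_tests(candidate):
            train = candidate         # stays in the train
            merged.append(change)
        # a failing change is dropped; in GitLab, later changes are retested
        # without it (not modeled in this sketch)
    master[:] = train
    return merged

base = ["m0"]
merged = run_merge_train(base, ["x", "bad", "y"], lambda b: "bad" not in b)
```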
Unless you solve this engineering problem with tooling. At Uber, the full-blown CI mobile test suite takes over 30 minutes to run on a development machine (linting, unit test, UI tests - most of this time being the long-running UI tests, specific to native mobile). So we only do incremental runs locally, and have a submit queue, which parallelises this work and merges only changes that don’t break, into master. And we have one repository that hundreds of engineers work on.
It’s not an easy problem and the solution is also rather complex, but it keeps master at green - with the trade-off of having needed to build and maintain this system. See it discussed on HN a while ago: https://news.ycombinator.com/item?id=19692820
Let's just say that in my company it also takes 30m to run tests and 4h to run them on the merge pipeline with FATs and CORE tests. It's way too long and severely cripples productivity.
1. Dependencies in incompatible Merge Requests that need to be accounted for, see https://docs.gitlab.com/ee/user/project/merge_requests/merge... on how to do that.
2. Most merge requests can merge in previous changes; for that you can use merge trains as detailed in my other comment https://news.ycombinator.com/item?id=21679515
If those get too expensive to run or you cannot speed them up, you have to do what Chromium does: run them post-commit, then bisect and revert any changes that break the tests. If things are truly broken, you close the tree for a bit while you get the break reverted or fixed.
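The bisect step is just a binary search for the first breaking commit. A minimal sketch, assuming results are monotone (once the tree is broken it stays broken until the culprit is reverted):

```python
# Binary search for the first commit after which the tests fail, given a
# range where the first state is known green and the last is known red.
# is_broken(c) stands in for "check out commit c and run the tests".

def first_bad_commit(commits, is_broken):
    lo, hi = 0, len(commits) - 1      # invariant: commits[hi] is broken
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(commits[mid]):
            hi = mid                  # the break is at mid or earlier
        else:
            lo = mid + 1              # everything up to mid is green
    return commits[lo]                # the commit to revert

# Usage: ten commits, the tree broke at commit 6.
commits = list(range(10))
culprit = first_bad_commit(commits, lambda c: c >= 6)
```

This is the same O(log n) number of test runs that `git bisect` needs, which is why post-commit triage stays tractable even with long test suites.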
Also, the system that lands changes tests them optimistically in parallel, assuming they will all succeed, so landing a change still only takes about 30 minutes, for example.
There are exceptions. Sometimes there is a management problem: management was told some things could not be done in parallel because the problem couldn't be mitigated in the architecture, and they failed to apply project management practices to ensure the developers worked serially.
Sometimes there is a team problem: the 10 developers have been placed on the same team to work on the same thing, and despite all that they still failed to coordinate among themselves to ensure that the changes happened in order.
The whole process assumes that multiple changes in the queue don’t depend on each other, if they did, it should all be in the same changeset.
If you want to find problems you didn't think of, formal proofs are the only thing I have heard of. However, formal proofs only work if you can think of the right constraints (I forget the correct term), which isn't easy.
Note that the two are not substitutes for each other. While there is overlap, there are classes of errors that one alone will not catch. For most projects, though, it is more cost effective to live with some bugs for a long time than to spend enough money on either of the above to find them ahead of time. Different projects have different needs (games vs medical devices...)
Nobody seems to talk about this and I don't know why. It would remove integration complexity and speed up testing. We do the same thing for CD and nobody seems to have a problem with it...
1) Two changes which don't affect intersecting parts of the repo are landed separately. Similar to having infinite separate repos.
2) Only the tests that your code affects are run.
This is all possible because Bazel lets you look at a commit and determine with certainty which tests need to run and which targets are affected.
For me I prefer the pros and cons of multi-repo. However sometimes I wish I could do the large cross project refactoring that a mono-repo would make easy.
> If you insist on no errors on master ever you will kill throughput.
Not sure why you believe this. It hasn't been my experience; just the opposite, in fact. By using CI in conjunction with a process that prevents errors on master, everything goes more smoothly, because people don't get stalled by the broken master.
You should strive to do that but you shouldn't be surprised that despite all effort mistakes still happen from time to time.
Not saying mistakes can't happen, but the person I was replying to didn't seem to be aware of this tooling.
The healthy mentality is to realize mistakes will happen. This creates a healthier culture when things do break.
However, you should take every step to ensure it doesn't happen. You should act as though you want to prevent all faults from hitting your master branch.
I don't agree that CI is a team problem and CD is an engineering problem. If you are following infrastructure as code principles it is everyone's problem because if you don't add how your new feature should be deployed it will break the CI and CD pipelines and you won't be able to merge it.
As for CI/CD differences: how many commits can actually affect code and infrastructure? I think this is part of the engineering problem at the end of the day.
We recently moved from Jenkins to CircleCI, and while the PR experience has improved dramatically (no queueing, faster builds), the _release_ process is far worse.
The reason seems to be that CircleCI just treats CD as CI. In reality doing CD requires high correctness, care, and nuance.
For example... with CircleCI there's no way to ensure that you release your code in the correct order other than to manually wait to merge your code until the previous code has gone out. That's not _continuous_. This is a very basic requirement.
So perhaps they are not the CD service they pitch themselves as? That would mean deploys are manually triggered, then? Nope, there is no way to manually trigger a build.
I wish this was an isolated example, but I've yet to see a CI/CD service that makes it easy to build fast, correct deployments. Jenkins is correct but not fast or easy, Circle is fast but not correct, and most others I've used are none of these at all.
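For what it's worth, the ordering requirement above is conceptually simple; a toy sketch of a deploy queue that releases merges strictly in order and stops the line on a failure (`deploy` stands in for the real release step):

```python
# Release commits strictly in merge order, and never start a deploy until
# the previous one has succeeded. On failure, everything behind it waits.

def deploy_in_order(commits, deploy):
    released = []
    for commit in commits:        # merge order == deploy order
        if not deploy(commit):
            break                 # stop the line; later commits wait
        released.append(commit)
    return released

released = deploy_in_order(["m1", "m2", "boom", "m4"],
                           lambda c: c != "boom")
```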
Automated builds are the smallest part of CI. Necessary, but drastically insufficient. If that's all you're doing you've missed the forest for the trees.
It is completely reasonable to utilise one without the other; not all projects are giant multi-author efforts trying to wrangle commits.
For instance, if you have lots of small projects being worked on independently in parallel, with no more than one or two authors on a repo at a time, CI is not going to be worth the investment... but CD still has its uses.
How am I doing it wrong?
Who has a 3-7m CI build here?
On our newer services there is a CircleCI pipeline which parallelises work and takes ~1-2 minutes on a branch, and maybe an extra minute at most on master - where it automatically deploys to production if the build is green.
If you make the choice to prioritise this from the start, it isn’t all that difficult.
You can test your software faster in CI using Docker compose. Something along these lines: https://fire.ci/blog/api-end-to-end-testing-with-docker/
It doesn't have to spin up any DB or things like that though.
Reading a bit closer, I see the author describes CI as a sanity check, "ensur[ing] the bare minimum" and doesn't consider deploying on every commit. Maybe 3-7m is more realistic then.
However, I'm slightly surprised by this definition of CI. According to Fowler, "Continuous Delivery is a software development discipline where you build software in such a way that the software can be released to production at any time. ... The key test is that a business sponsor could request that the current development version of the software can be deployed into production at a moment's notice." So having CI gates on the development version that are weaker than the release tests would not seem to be continuous delivery according to his definition.
We're currently releasing on every commit and our CI build (which implements continuous delivery) takes about 15m.
Which is nonsense, since CI is the practice of merging to master frequently, in a state that can be released if need be.
We won't understand it unless we distinguish the practice from the supporting tools that help us do it safely:
in this case, the practice is frequent merge to shared trunk, and the supporting tools are as many automated checks before and after that merge as can be done quickly.
A "CI build" of a branch is a tool to help you do CI, but unless you merge that branch when it's green, you're not _doing_ CI.
Misunderstanding this and doing "CI on a branch" means that you are mistaking the tool for the practice, and not doing the practice: by delaying integration, you will be accomplishing the opposite of CI.
In my case it wasn't a need to spin up infrastructure so much as just pulling a few container images and starting them. The longest CI builds were when loading and indexing test data from a database (container) into ElasticSearch, etc., but overall, moving images around and starting containers to build and test some Ruby / Python usually took around 1-3 minutes or thereabouts.
But I like to touch common headers - say to document logging macros, and printf-format annotations to catch bugs, or maybe to optimize their codegen - and those logging macros are in our PCHes for good reason - so that still means rebuilding everything frequently. Which ties up a lot of build time (30-60 minutes per config per platform, and I typically have a lot of configs and a lot of platforms!)
This is one of my pet peeves, people using the term CI referring to the tooling. For me this alone invalidates anything they have to say about the subject.