
Successfully Merging the Work of 1000 Developers - rom16384
https://engineering.shopify.com/blogs/engineering/successfully-merging-work-1000-developers
======
DilumAluthge
This is exactly the kind of workflow that Bors ([https://github.com/bors-ng/bors-ng](https://github.com/bors-ng/bors-ng)) automates.

In addition to Bors, there are a number of apps and services that automate
this kind of workflow. Here is an incomplete list:
[https://forum.bors.tech/t/other-apps-that-implement-the-nrsrose/51](https://forum.bors.tech/t/other-apps-that-implement-the-nrsrose/51)

Edited to add: Graydon Hoare (creator of Rust) called this the Not Rocket
Science Rule Of Software Engineering (NRSROSE): "automatically maintain a
repository of code that always passes all the tests" -
[https://graydon2.dreamwidth.org/1597.html](https://graydon2.dreamwidth.org/1597.html)
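
The whole rule fits in a few lines; here is a minimal sketch (assuming a
`make test` entry point and push access, and ignoring the batching, retries,
and permission checks a real bot like Bors handles):

    import subprocess

    def sh(*cmd):
        subprocess.run(cmd, check=True)

    def land(pr_branch, test_cmd=("make", "test")):
        # Recreate the exact commit master would become, then test it.
        sh("git", "fetch", "origin")
        sh("git", "checkout", "-B", "trial", "origin/master")
        sh("git", "merge", "--no-ff", pr_branch)
        # Only a fully green merge result is allowed to become master.
        if subprocess.run(test_cmd).returncode == 0:
            sh("git", "push", "origin", "trial:master")
        else:
            print(f"{pr_branch} rejected: red against current master")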

Disclaimer: I have contributed code to Bors.

~~~
omeze
Bors, when introduced at my workplace, was summarily disabled after two
separate incidents: one where it was improperly applying commit messages
(which we patched internally) and one where it was improperly applying
patches (at which point we disabled it and explored other solutions).
Combined with the merge latency overhead, I would be extremely hesitant to
advocate running it at scale.

~~~
chrisseaton
> Bors when introduced at my workplace was summarily disabled after 2 separate
> incidents where it was improperly applying commit messages

Seems a bit extreme to discard something after finding two bugs. Could you
have fixed them instead?

~~~
omeze
We did fix one bug, but after the second incident we decided that using a
tool that silently corrupts your repository is not worth the marginal benefit
it provides over more naive alternatives. Most companies run post-merge
checks, so in the event of a bad merge the offending commit can be isolated
and reverted, or they have a cron running at a regular cadence to isolate the
issue to a set of commits. Having your repo faithfully reflect developers'
patches is really table stakes for a tool like this anyway.

------
kiallmacinnes
The OpenStack project faced a similar problem a few years back; they produced
Zuul[1] to solve it. I can't compare it to what Shopify produced, but Zuul is
absolutely worth a look when it comes to large-scale, high-throughput,
must-always-be-green CI.

The linked page explains the speculative execution aspect, used to ensure
every change is tested before merge against the true state of master at the
time of merge, despite that state being different from what it was when the
CI run started ;)

[1]: [https://zuul-ci.org/docs/zuul/user/gating.html](https://zuul-ci.org/docs/zuul/user/gating.html)
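
Very roughly, gating works like this (my own sketch of the idea, not Zuul's
actual code; `merge` and `build` are stand-in callables): each queued change
is tested on top of every change ahead of it, so a green result predicts the
real post-merge state, and when a change fails, the ones behind it are
retested without it.

    def gate(queue, master, merge, build):
        # Each change is built on master plus everything ahead of it in
        # the queue -- the speculative future state of master.
        while queue:
            base, results = master, []
            for change in queue:
                base = merge(base, change)
                results.append(build(base))
            while results and results[0]:
                master = merge(master, queue.pop(0))  # green prefix lands
                results.pop(0)
            if results:
                queue.pop(0)  # evict the first failure; retest the rest
        return master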

------
eejjjj82
This is how Google manages these changes:

[https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext](https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext)

This was written three years ago and the scale has only grown, but at the
time the numbers quoted in the article were 25,000 authors/day with 16,000
commits, plus another 24,000 commits/day from automated sources.

------
chdsbd
A bit of shameless self-promotion: I built a more basic merge bot for GitHub
that efficiently updates and merges PRs, because we were wasting a ton of
time keeping branches updated at work.

[https://github.com/chdsbd/kodiak](https://github.com/chdsbd/kodiak)

~~~
Eiriksmal
We started using Kodiak after the Auto Rebase bot was discontinued a few weeks
ago. Other than confusing me by doing a normal merge when I thought it had
been configured to do squash-merges, it works great. Thanks for releasing it!

------
coldcode
I find it difficult to imagine why you would need 1000 developers on a single
project in most cases. The cost of managing so many people and work streams
would seem to overwhelm whatever output you could get out of them.

~~~
irfansharif
The core Shopify codebase is a monolithic one, hence the merging
infrastructure described in the post. It's not a "single project" by any
useful definition of the phrase.

~~~
baroffoos
I'm interested in how what looks to me like a fairly standard web store could
have such an enormous number of developers working on it. Is there some kind
of insane complexity behind web stores that I'm not seeing here?

~~~
jacktli
It’s about providing a platform for enabling global commerce. Our engineering
blog has other posts that go into technical depth about a lot of the
challenges we face, and our solutions. For context, here is some info about
Black Friday/Cyber Monday for us last year:
[https://engineering.shopify.com/blogs/engineering/preparing-shopify-for-black-friday-cyber-monday](https://engineering.shopify.com/blogs/engineering/preparing-shopify-for-black-friday-cyber-monday)

~~~
baroffoos
I understand it's a hugely popular service and there is a lot that goes into
something so big, but there are bigger websites that run on much, much
smaller teams of developers. I just don't understand how there is even enough
surface area on the app that 1000 people could be working on it at the same
time. Does that number count people working on ops or building other tooling
not part of the main app?

~~~
winrid
Think of it this way: 100 different projects, each with different business
goals, etc. Make some of the teams platform teams for good measure.

------
username90
If I read this article right, if anyone breaks any part of the build, it
breaks for everyone? That doesn't sound very scalable. Shouldn't the main
goal be to break up your continuous integration steps so that a person at one
end of the company can continue working even if a person at the other end
broke their build?

That way you can also add tags for flaky tests etc, to make your builds more
reliable.

Edit: I didn't understand before what people had against monorepos, but I
guess if you tie continuous integration builds to each repo then having a
monorepo becomes a huge pain point. Are there any open source tools to fix
this?

~~~
jacktli
There is an explanation of how we handle this case where I talk about the
failure-tolerance threshold. I go deeper into this in my GitHub Universe
talk, where I also discuss an alternative (but costlier) solution of running
parallel branches, but unfortunately that talk is not posted yet.

------
dkoston
It would be useful for this article to cover what content is acceptable in a
merge request. For example: can these all go straight to the queue because
they use feature flags? Are commits a "single piece of work"? Etc.

Not to sound like a downer, but this is really an article about fixing a
broken process, because not running CI on branches before merging to master
goes against best practices. I would have loved to actually hear about their
work process, as this whole article could be summed up as "not running CI on
branches before merging their commits to master is a great way to ruin
master".

~~~
latortuga
Well, the problem is that master is a bottleneck. Running CI on every branch
before merging to master just won't work at the scale they are dealing with.
With 1000 developers, the rate of incoming PRs makes it impossible to know
what master will be by the time a PR is ready to merge (i.e. when the branch
has a green build). It's also wasteful to build each branch against current
master, because what is "current" now will no longer be by the time the
branch is ready to merge.
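
A back-of-envelope calculation makes the bottleneck concrete (illustrative
numbers, not Shopify's):

    ci_minutes = 30                                # one full CI run on current master
    serial_merges_per_day = 24 * 60 / ci_minutes   # = 48 if merges are serialized
    prs_per_day = 400                              # plausible for ~1000 developers
    # 400 PRs/day against 48 serial slots/day: the queue grows without
    # bound, so merges have to be batched or tested speculatively.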

Perhaps this problem is what microservices are meant to solve. When you can't
coherently integrate code fast enough, attack the bottleneck (master) by
splitting it (multiple services).

~~~
username90
> Trying to build CI on every branch before merging to master just won't work
> with the scale they are dealing with.

Google does it with 50 times the developer count.

> At 1000 developers, the rate of PRs coming in makes it impossible to
> determine what current master will be when the PR is ready to merge (i.e.
> when the branch has a green build).

True, it is impossible to catch all errors this way, but you can catch almost
every error by building and testing against current master and then merging
into master 20 minutes later when the build is done. I have seen maybe one
build breakage a year introduced this way in projects I've worked on, so it
isn't a big deal.

~~~
randomidiot666
You have no idea how Google solved it. Basically everyone with a monorepo
(except Google) implements it as a cargo-cult best practice, mindlessly
copying Google without understanding how Google actually does it.

------
dantiberian
This sounds quite similar to [https://bors.tech](https://bors.tech). If the
authors are here, did you see this, and can you compare and contrast it with
what you built?
[https://graydon2.dreamwidth.org/1597.html](https://graydon2.dreamwidth.org/1597.html)
also has a good overview of the problem and the original bors.

~~~
jacktli
Yes, we have seen this before! The main difference is that throughput is
extremely important for us, and we would not get that with Bors. Also,
compatibility of multiple simultaneously merging PRs is the case we are
optimizing for, vs. compatibility with current master.

~~~
yoden
If you don't mind me asking, how long does a CI run take for you? How do you
manage running CI with so many merges?

Our CI takes ~8 hours of machine/VM time, which is about 35 minutes of wall
time with our current testing cluster (including non-distributed parts like
building). We skip certain long tests during the day, which brings wall time
down to ~13 minutes. But we also test 2-3 branches with decent churn, so even
if we're only doing post-merge CI based on the current state of master, we're
still picking up 5+ new commits per run fairly often.

I want to get to a world where CI is run before and after each merge with
master, on every commit (or push/pull), but it seems like it would take so
much more resources and infrastructure than we currently have.

------
aledalgrande
Very surprised they hadn't locked merging to master until recently.

I do like the emergency exception. The problem then becomes having a solid
policy on when to use it.

Not really sure about the flakiness test (the 25% figure is totally
arbitrary), as you could have a test that fails only at a certain hour of the
day, and you would be furiously rerunning the test suite for nothing because
it would always fail, wasting resources. It would be much better to at least
single out the failing tests and rerun only those, dropping the ones that
pass, until max tries are reached or the queue is empty.
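
Something like this sketch, with a hypothetical `run` callable that executes
one test and reports whether it passed:

    def rerun_failures(tests, run, max_tries=3):
        # Rerun only the tests that failed, dropping ones that pass,
        # instead of rerunning the entire suite on every flake.
        remaining = list(tests)
        for _ in range(max_tries):
            remaining = [t for t in remaining if not run(t)]
            if not remaining:
                return True      # every test eventually passed
        return False             # persistent failures: likely real breakage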

------
kl4m
Is this the problem that GitLab merge trains aim to solve?

[https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipelines_for_merged_results/merge_trains/](https://docs.gitlab.com/ee/ci/merge_request_pipelines/pipelines_for_merged_results/merge_trains/)

------
jefurii
This seems like a complicated way of shoehorning the Linux kernel mailing list
workflow into GitHub.

------
dustym
Anyone know what build times are for the Shopify monolith?

~~~
dudzik
I'm working on the test-infra team at Shopify. Over the last couple of weeks
we got CI down to around 22m at the 50th percentile and 33m at the 95th
percentile. At the moment we get most of our speedup from heavily
parallelizing our build steps, but we've hit a ceiling with that and are
working on a project to selectively run tests.
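
The usual shape of selective test running is a mapping from changed paths to
test targets, something like this simplified sketch (made-up paths, not our
actual setup):

    import fnmatch
    import subprocess

    SUITES = {  # illustrative ownership map
        "test/checkout": ["app/checkout/*", "lib/payments/*"],
        "test/admin": ["app/admin/*"],
    }

    def changed_files():
        out = subprocess.run(
            ["git", "diff", "--name-only", "origin/master...HEAD"],
            capture_output=True, text=True, check=True)
        return out.stdout.splitlines()

    def suites_to_run():
        # Run a suite only if the diff touches a path it watches.
        files = changed_files()
        return [suite for suite, patterns in SUITES.items()
                if any(fnmatch.fnmatch(f, p)
                       for f in files for p in patterns)]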

------
pojzon
Did I read it correctly that in the first iteration they merged "NOT TESTED"
code into master?

Who would ever think that's a good idea?

~~~
wvanbergen
The branches being merged were tested in the first version too. However,
different branches can conflict with each other and break the master build.
This was happening often enough for us to want to prevent it.

The simple fix would be to rebase your branch (or merge in master). However,
with the volume of changes being merged, by the time the CI result comes in
for your rebased branch, the tip of master has already changed, making the CI
result obsolete. So we went for the queue approach instead.
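
As a deliberately simplified sketch of the queue idea (hypothetical `merge`,
`run_ci`, and `reject` callables, nothing like the production code): a batch
of queued PRs is tested as one unit, and bisected on failure to isolate the
offending PR.

    def land_batch(batch, master, merge, run_ci, reject):
        # Test the whole batch merged together; bisect on failure to
        # isolate offending PRs instead of testing one branch at a time.
        candidate = master
        for pr in batch:
            candidate = merge(candidate, pr)
        if run_ci(candidate):
            return candidate               # the whole batch lands at once
        if len(batch) == 1:
            reject(batch[0])               # isolated failure gets kicked out
            return master
        mid = len(batch) // 2              # split and retry each half
        master = land_batch(batch[:mid], master, merge, run_ci, reject)
        return land_batch(batch[mid:], master, merge, run_ci, reject)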

~~~
pojzon
Don't know if I'm responding to the author, but either way the end solution
was to merge into a semi-master and run CI before merging to master.

So overall you came to the conclusion that what you explained, even though it
can fail, is pretty much the best way to do it (just a custom rebase).

Overall, DevOps principles are proven to simply work; you just have to follow
them.

