
Speeding up a Git monorepo - illuminated
https://dropbox.tech/application/speeding-up-a-git-monorepo-at-dropbox-with--200-lines-of-code
======
benreesman
Monorepo/multirepo (and monolith/microservice which often tags along) seems
like a false dichotomy.

There are costs to having big repositories (e.g. TFA and needing to do scaling
work), and there are costs to having lots of repositories involved in
producing a unified working result (dependency resolution is NP-complete in
many of its useful formulations). Big players have the muscle to optimize
Mercurial and Git, so they get to do super slick trunk-only monorepo
development at engineer-commit scale, but still often have auxiliary
repositories that take e.g. machine generated commits. Smaller players
probably aren’t hitting scaling limits on these tools. But every situation is
different and if one approaches it as an engineering problem you can usually
do something very workable.

Likewise with monolith/microservice: there is a happy medium where you
introduce a network boundary for an engineering reason (maybe one part of my
computation needs a lot of CPU but a different part needs a lot of RAM, so
they run on different SKUs/instance types). My big giant web app that dates to
the founding of the company? Probably don’t want to rewrite that so let’s spin
services out of it incrementally when I need to write something in C++ or use
a shitload of RAM or whatever. That’s bread and butter systems engineering.

But this “pick a side” mentality where it’s like one giant ball of PHP in one
giant Subversion repository or every team has their own little service in
their own little repo and I burn 40% of my cycles parsing JSON isn’t a set
cover: you’re allowed to choose a happy medium.

Just do things for valid technical reasons and don’t have Conway’s law go
apeshit on your architecture by shattering it into a zillion pieces. The human
factor stuff can be addressed with engineering rigor and consensus. “This is
too slow to be in Python now” is a good reason to make a service. “The iOS
team shares no code with the web team and bisects will be faster/easier” is a
good reason to make a repo.

“I want to have my own coding style and/or use some language no one else knows
and/or learn k8s and/or not deal with that team I don’t like” are not
engineering reasons to type ‘git init’ or make a network call.

~~~
afarrell
> The human factor stuff can be addressed with engineering rigor and
> consensus.

How do you add engineering rigor and consensus? Let's say as an Individual
Contributor.

~~~
cjfd
By working at a place where people listen to you and are not (too) crazy.

~~~
Cthulhu_
Just you? If it's a team effort, places like Dropbox have hundreds, if not
thousands of engineers, all with an Opinion.

~~~
cjfd
Sometimes I get the feeling that they should indeed just listen to me. When
one needs an argument to convince a colleague that one really should not be
using == to compare two floats because of rounding errors, or when working in
a place currently suffering from blatantly ignoring
[https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/](https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/)
and experiencing the predictable consequences afterwards, I quite often get
the feeling that in many places where I could work I am actually the only
adult in the room.

To be a bit more practical, though, yes one should listen to others as well
because in some cases they might actually have ideas that are good. Also,
morally, it kind of is the golden rule that if one expects to be listened to
one needs to listen oneself as well. In cases of large places with lots of
people I would say that there should be some form of code ownership and hence
people and/or smallish teams can decide what to do with the code that they
own. One of the most important things that contributes to code quality is not
too many changes of code stewardship.

------
klodolph
I’m still hoping some standard toolset emerges for dealing with large
monorepos sometime in the near future. At the moment it’s clear that a number
of companies are rolling their own solutions, which follow one of two
patterns:

- Persistent process which watches workspace changes, or

- Workspace in a virtual filesystem.

The other common factor seems to be trunk-based development with all commits
rebased into a linear history. I’m not super hopeful that we’ll see an open-
source solution in this space for a while, though—any company with a code base
large enough to really need these solutions is also large enough to throw a
few engineers at VCS, especially given that they’d already have engineers
supporting VCS from the operations side of things.

~~~
ublaze
Author of the post here.

I think Git upstream is trying to simplify configuration. They have a config
option called `feature.manyFiles` which enables most of the features we
enabled for our developers
([https://git-scm.com/docs/git-config#Documentation/git-config.txt-featuremanyFiles](https://git-scm.com/docs/git-config#Documentation/git-config.txt-featuremanyFiles)).

We wanted to use this instead of deploying a wrapper, but it turns out that
some of Git's features like fsmonitor do not interact well with repositories
with submodules (there were Git crashes). And we have some developers that
work on repositories with submodules. So we needed something more flexible,
like enabling these features only on particular repositories.
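For a repo that doesn't hit the submodule problem, the per-repository opt-in is just a local config flag. A rough sketch (the repo path is hypothetical, and exactly which settings `feature.manyFiles` expands to varies by Git version):

```shell
# Opt a single repository in without touching global config
# (writes to .git/config, so submodule-heavy repos stay untouched).
cd ~/src/big-monorepo
git config feature.manyFiles true

# feature.manyFiles roughly implies these individual settings:
git config core.untrackedCache true
git config index.version 4
```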

~~~
dundarious
I’m quite confused as to what you actually shipped to your developers to
increase performance. This config option is a great place to start, but it
would be great if that were clearer, so that others could follow suit.

~~~
ublaze
We shipped a wrapper that tweaks git configs and logs metrics, and a custom
fsmonitor hook that was _slightly_ faster than the stock one. We also ensured
watchman was installed on developers' laptops.

And we made a few changes to Git to fix bugs (for example, `git stash` wasn't
using fsmonitor data, so it was slow).

~~~
dundarious
It would be great to have a listing of those config tweaks, even if with
caveats attached, such as “causes issues with submodules”.

I don’t want to seem demanding, but it’s such a tantalizing article without
this info :)

~~~
ublaze
Sure.

core.fsmonitor is set to our custom fsmonitor (this causes issues with
submodules, at least on 2.24)

core.untrackedCache is set to true

We use index version 4

And a slight hack: our wrapper sets GIT_FORCE_UNTRACKED_CACHE=1. This forces
`git status` to write the untracked cache if it notices a difference. I was
too lazy to add a patch to configure that.
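Spelled out as commands, the tweaks above look roughly like this (the hook path is a placeholder for our custom binary):

```shell
# Point Git's fsmonitor hook at a fast watchman-backed binary
# (path is a placeholder; known to misbehave with submodules on 2.24).
git config core.fsmonitor /usr/local/bin/custom-fsmonitor

# Cache the list of untracked files between `git status` runs.
git config core.untrackedCache true

# Index format v4 prefix-compresses paths, shrinking the index file.
git update-index --index-version 4

# Force `git status` to persist untracked-cache updates it computes.
export GIT_FORCE_UNTRACKED_CACHE=1
```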

~~~
dundarious
Wonderful, thank you!

------
shoo
I reckon the discussion of the history that led Dropbox to switch to a
monorepo is more interesting than the git speedups.

The last couple of places I've worked at are large non-tech companies; both
orgs internally used on-prem GitLab/GitHub/Bitbucket. These tools make it
much easier for teams (or individuals) to create as many new repos as they
want without coordinating with anyone else -- for better or worse.

I suspect what happens quite often these days is that people create many repos
without consciously thinking about if that's a good idea or not -- because it
is familiar and because there are relatively high quality tools/products to
let you make more repos.

The small part of the org I currently work in probably has O(200) employees
and O(200) git repos.

The last system I worked on at my previous company had a single git repo
containing all parts of a line-of-business application (db, API server,
frontend, backend servers for batch jobs), but then there were about 40 other
git repos containing deployment scripts etc. used to deploy just this one
system. It made it bloody hard to figure out exactly what version of what
script or library was actually used to deploy. (To be fair, a lot of this was
a consequence of using ansible modules, which expect each module to be in its
own git repo, and of having a couple of people hack together a lot of ansible
modules in a short amount of time without review.)

------
jgavris
Great post!

I wrote a (fast?) fsmonitor hook in Rust...benchmarked against the reference
Perl implementation it's quite a bit faster. On a repo of 130k files, my
monitor is able to `git status` in 18 milliseconds.

[https://github.com/jgavris/rs-git-fsmonitor](https://github.com/jgavris/rs-git-fsmonitor)

~~~
ublaze
Which operating system are you using? That's impressive. More importantly, our
hook doesn't support the new query version so we might want to switch.

~~~
jgavris
I use macOS, but some folks have contributed a Linux package / installer. And
yeah, I added v2 of the hook recently, which is even faster!

------
thinxer
For anyone interested in why monorepos work, I'd recommend the book Software
Engineering at Google: Lessons Learned from Programming Over Time. It details
the reasons for the One Version Rule and for Version Control over Dependency
Management.

~~~
secondcoming
Doesn't Google use Perforce though, which (last time I used it) forces a
monorepo approach? git doesn't have equivalents to branchspecs and
clientspecs.

~~~
thinxer
It is because Google wanted a monorepo that Google chose to use Perforce (and
later Piper). It is not that Google uses Perforce and is thus limited to a
monorepo.

The core value behind monorepo (and monorepo-like approaches) explained in the
book is that dependency management is harder than version control.

------
nickcw
It is interesting that the speed of lstat on macOS is the driver behind this
problem. According to the article it is 10x slower than Linux.

I wonder if anyone has tried attacking that end of the problem? Faster lstat
on macOS would benefit all applications not just git.
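One quick way to feel that difference is to time lstat over a tree directly; a minimal Python sketch (point it at whatever checkout you care about):

```python
import os
import time

def time_lstats(root):
    """Walk `root`, lstat every entry, and return (count, seconds per call)."""
    paths = [os.path.join(parent, name)
             for parent, dirs, files in os.walk(root)
             for name in dirs + files]
    start = time.perf_counter()
    for path in paths:
        os.lstat(path)
    elapsed = time.perf_counter() - start
    return len(paths), elapsed / max(len(paths), 1)

if __name__ == "__main__":
    count, per_call = time_lstats(".")
    print(f"{count} paths, {per_call * 1e6:.1f} us per lstat")
```

Running this on the same large checkout on Linux and macOS makes the per-call gap visible directly, independent of git.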

~~~
ublaze
[https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/](https://gregoryszorc.com/blog/2018/10/29/global-kernel-locks-in-apfs/)
is an interesting write-up about the problem.

------
silverlake
The fact that so many are using monorepos points to a weakness in revision
control and dependency management tools. This is a big gap where someone will
invent Git’s replacement. What features will it need to replace Git?

~~~
j88439h84
> The fact that so many are using monorepos points to a weakness in revision
> control and dependency management tools.

People seem to like monorepos, what problem do you have in mind?

~~~
danenania
Access control is a big one.

------
swiley
Is there a reason people prefer a monorepo to submodules?

I worked at one place that kept everything in a Mercurial monorepo and it was
a real pain keeping branches in sync.

~~~
saxonww
I introduced a submodule at my last job to hold common build code used across
multiple repositories. I knew going in that some of our developers didn't
really understand git, and were not interested in rectifying that. I won't say
the submodule was a disaster, but I definitely paid for it with time spent
sitting with people and helping them fix messes. I would do it again but only
if I knew I could count on the team to make more than a token effort to
understand what a submodule was and how it worked. I would not even consider
adding multiple submodules unless I felt like everyone knew exactly how to use
them.

Another potential source of issues: the submodule remote URI is checked in as
part of the .gitmodules file. If your CI system uses a different URI than your
developers, you have to work around that. If you change where your source is
hosted and you want to check out an old version, you have to figure that out
too.

That said, I think the most common reason is the same reason some people
prefer monorepos in the first place: they perceive the monorepo as simpler,
and adding submodules is not simpler. They want to know about and manage 1
repository clone. They want to commit, push, review, and build out of 1
repository. They want to search for stuff in 1 directory tree. They'd also
like to do that with 1 tool, ideally 'git'. Nothing really offers that except
monorepos.

~~~
mkesper
You can use relative paths for submodule urls.
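For example, if the superproject and the shared repo live on the same host, a .gitmodules entry like this (names hypothetical) resolves the URL relative to the superproject's own remote, so developers and CI can sit on different hosts:

```ini
[submodule "common-build"]
	path = common-build
	url = ../common-build.git
```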

------
duncans
Title should be "Speeding up a Git monorepo at Dropbox with <200 lines of
code" - looks like some sanitisation regex got over-zealous here.

~~~
illuminated
Yes, I have posted that exact title but haven't noticed the cut until now. I
hope the mods will be able to fix this.

~~~
fastball
Honestly, the HN obsession with modifying titles is excessive. I understand
not wanting incendiary titles that encourage flamewars, but the rest of the
policies around title editing / sanitizing don't seem that useful and
actually hinder coherence in many cases (like this one).

------
kccqzy
It's a shame that Dropbox abandoned Mercurial for Git. With both Facebook and
Google contributing to better support for monorepos in Mercurial, Mercurial
seems like a better choice for big monorepos.

~~~
mehrdada
Huh? I don’t know about Facebook, but Google for sure does not use Mercurial
as its monorepo backend (nor does it use Git). There are Git- and
Mercurial-like clients to interface with the in-house backend, which is a
Perforce-like thing. Neither Git nor Mercurial would be fun to use at Google
scale. Dropbox has a much smaller monorepo, hence they can clone the whole
Git thing onto developer machines. I assume doing the same with Mercurial is
impractical, as no one has patience for something that slow.

~~~
kccqzy
I should clarify my comment by saying monorepo-related Mercurial improvements
benefit not just those monorepos backed by Mercurial, but also where Mercurial
is "just" a front end to a different system. I mean just think about it, what
does a front end in this case really mean? When you run a command like `hg
log` how much of the original Mercurial code are you running? Do you design an
entirely different system that happens to share the same command-line syntax
as the original Mercurial, or do you emulate the .hg folder format and run the
original Mercurial code, or somewhere in between? Thinking about this problem
would shed more light on why Google's work on Mercurial benefits everyone else
with a big monorepo even though Google is "just" using Mercurial as a front
end.

And both Facebook and Google have contributed to Mercurial on this front,
though admittedly Facebook did more work than Google did. I know the tree
manifest feature
([https://www.mercurial-scm.org/wiki/TreeManifestPlan](https://www.mercurial-scm.org/wiki/TreeManifestPlan))
was done by Google and upstreamed, and it benefits every repo with millions
of files. (Just clone the hg repository and search for commits with a
google.com author email and see what kind of commits they are.)

~~~
mehrdada
Sure, but still, in Google's/Facebook's case there’s a centralized backend
that does the day-to-day operations you invoke from your laptop. In Dropbox's
case, it is just a local git repo that you push to a server only when you
want to land a change; just like whatever one does with GitHub, but quite a
big one, so the use case is quite different from F/G, which is what I was
getting at. In principle maybe you could do it, but that’d require quite a
big investment in effectively rebuilding Mercurial.

------
cryptonector
Did Microsoft's enhancements to Git, particularly the Bloom filter
optimizations to git log/blame, make it into mainline?

~~~
WorldMaker
It sounds like a lot of them have, but most still need to be configured. The
commit-graph [1] is the biggest internal part of those optimizations and I
believe core.commitGraph is still defaulted to off (and probably is more
overhead than necessary for small to medium repositories).

[1] [https://git-scm.com/docs/commit-graph](https://git-scm.com/docs/commit-graph)

~~~
stolee
Thanks for the link to the documentation. That is updated with every major Git
version, and can be used to track what features are present. Also, the release
notes can be helpful.

In particular, the recent Git v2.27.0 release does include an implementation
of Bloom filters with speedups for `git log` and `git blame`. You need to
manually run the command to make it work. The version I prefer is this:

> git commit-graph write --changed-paths --reachable

After that first write (which writes filters for every reachable commit) you
can do a smaller write by adding `--split` to write incrementally. [2]

[2] [https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/](https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/)

By writing these filters, you will speed up most `git log` and `git blame`
calls. There is an improvement coming in the next version that includes
speedups for `git log -L`.

_Caveat:_ The biggest reason these improvements have not been widely
advertised is that the user experience has not been completely smoothed out.
In particular, you can only write the changed-path Bloom filters using the
command(s) above. If a commit-graph is written during GC (due to the
`gc.writeCommitGraph` config setting) then the filters will disappear.
Similarly for `fetch.writeCommitGraph`. We plan to have these resolved in
time for v2.28.0, along with more performance improvements.
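As a concrete sequence, the initial write plus the incremental follow-up look like:

```shell
# One-time: compute changed-path Bloom filters for every reachable commit.
git commit-graph write --changed-paths --reachable

# Later: cheaper incremental writes layered on top of the base file.
git commit-graph write --changed-paths --reachable --split
```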

(Full disclosure: I am a contributor to Git, Scalar, and VFS for Git, which
are referenced by the article.)

~~~
ublaze
Thanks for your work and your team's work on Git!

------
alkonaut
For huge repos building on build servers also becomes a problem. A build
machine can usually do a shallow clone, but it would be better to filter to
smaller pieces.

Even more importantly, when you have multiple builds using the same repo, CI
systems usually set up a local copy of the repository per build. So if you
have a compile+unit-test build and another build for slower integration tests
running the same code, the machine might end up having two git /objects
directories somewhere containing the same data. If you have a 100 GB
repository and maybe 100 different builds against it, this quickly becomes
unmanageable.

What I'd want is the ability to use a common objects directory for a machine,
where common objects would be deduplicated. I don't know if this is achievable
with GVFS or even by linking the directories?
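Git does actually have a mechanism close to this: one shared object store per machine, borrowed via alternates. A sketch, with the cache path and URL as placeholders:

```shell
# Maintain one shared bare mirror per build machine.
git clone --mirror https://example.com/big-monorepo.git /srv/git-cache/big-monorepo.git

# Each CI workspace borrows objects from the cache instead of re-downloading.
git clone --reference /srv/git-cache/big-monorepo.git \
    https://example.com/big-monorepo.git workspace-1

# Under the hood this just records the cache's object directory here:
cat workspace-1/.git/objects/info/alternates
```

The caveat is that the workspace now hard-depends on the cache repo (pruning the cache can corrupt borrowers), which is what `--dissociate` exists to avoid.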

------
matt-attack
Can someone explain how git can be slow for these folks while the Linux kernel
continues to use git? Do these performance issues plague Linus as well? Or is
the Linux kernel smaller than the code in TFA?

~~~
alkonaut
The kernel is tiny. The whole repo is like 1.5 GB. That's nothing. We are
just 20 devs, not a massive company, and I'm on a repo with 100k commits and
50 GB of history.

~~~
saagarjha
That’s a massive repository…I’ve worked in code with many hundreds of
thousands of commits from many, many people and they’re a couple gigabytes at
most. Are you storing assets in your tree?

~~~
alkonaut
Yes, a lot of it is non-text content (not so much by file count, but by size
probably 80% or more). Nothing unnecessary, but resources required to build
and test each revision (not e.g. documentation). It's a document-based,
graphical app (CAD), so it is not practical to store non-text assets
separately from the source tree. Each branch has different versions of
various test inputs/outputs, drawings, image resources etc. We use git LFS
obviously, otherwise this whole setup would be impossible. In all, it's
actually a decent experience. We couldn't migrate off Subversion until LFS
was stable, but now that it is, Git is actually quite a good VCS for the
"Subversion use case" (large, central, heavy in binaries).
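For anyone setting up something similar, the LFS side boils down to routing the binary patterns through the lfs filter in .gitattributes; `git lfs track "*.dwg"` generates lines like these (patterns here are illustrative, not our actual list):

```gitattributes
*.dwg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
test-data/** filter=lfs diff=lfs merge=lfs -text
```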

------
perryizgr8
Those times look too long for the number of files. Here's my time:

    
    
      $ time git status
      On branch master
      Your branch and 'origin/master' have diverged,
      and have 1 and 115 different commits each, respectively.
        (use "git pull" to merge the remote branch into yours)
      
      real    0m0.517s
      user    0m0.240s
      sys     0m0.472s
      $ git ls-files | wc -l
      130685
    

This is on Ubuntu Linux, no special additions done to git. I wonder why the
author's experience is so different.

------
saxonww
I started using this a few months ago to solve the very important problem of
my git-status-decorated bash prompt taking too long to display on macOS. I'm
very happy with the result, but there are a couple of situations where it
seems to get stuck and I have to go kill processes: after I've created a lot
of untracked files and then deleted them; and if I've moved back and forth
between revisions with thousands of combined file changes.

Still highly recommended for large (or just old) repos.

~~~
saagarjha
I learned very quickly to run my bash prompt in a timeout wrapper after
cloning WebKit once and having an absolutely miserable time doing anything
while inside the repository.
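A minimal version of that wrapper for a bash prompt might look like this (the 200 ms budget is arbitrary; on macOS, `timeout` comes from GNU coreutils, installed as `gtimeout`):

```shell
# Show the current branch in the prompt, but give Git a hard deadline
# so a huge repo can never freeze the shell.
__git_prompt() {
  local branch
  branch=$(timeout 0.2 git symbolic-ref --quiet --short HEAD 2>/dev/null) || return 0
  printf ' (%s)' "$branch"
}
PS1='\w$(__git_prompt)\$ '
```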

------
the8472
They mention stat syscalls as the limiting factor in large repos. I wonder
whether the ceiling of "too large" could be raised by batching/streaming stat
calls via io_uring. It wouldn't help Mac users, but at least on Linux it
could improve the out-of-the-box experience for large repos.

------
monocul4r
There is currently no efficient way in git to clone just a single
subdirectory. This is very inconvenient when dealing with very large
monorepos: even with depth 1, you still have to grab the entire tree.

~~~
the8472
[https://github.blog/2020-01-13-highlights-from-git-2-25/#partial-clones-and-git-sparse-checkout](https://github.blog/2020-01-13-highlights-from-git-2-25/#partial-clones-and-git-sparse-checkout)
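Concretely, on Git 2.25+ the combination linked above approximates a single-subdirectory clone: a blobless partial clone plus cone-mode sparse checkout. A sketch with a placeholder URL and path:

```shell
# Fetch commits and trees, but defer file contents; check out only
# top-level files to start with.
git clone --filter=blob:none --sparse https://example.com/monorepo.git
cd monorepo

# Restrict the working tree to one subdirectory; the blobs under it
# are fetched on demand at this point.
git sparse-checkout set services/web
```

History and tree objects still cover the whole repo, so it's not a true subdirectory clone, but the working tree and blob downloads shrink to the subtree.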

------
revskill
I couldn't use a monorepo if I have a service to deploy to Heroku, though. Or
do you know if I can deploy a monorepo to Heroku?

------
mkesper
Easy fix: disallow use of macOS. I really don't get why developers use an OS
that has no upstream package management for developer tools and offers no way
of running performant VMs.

~~~
Hackbraten
I do like macOS’s UX and the fact that OS updates have been literally painless
for the 16 years I’ve been using macOS as my daily driver. At the same time,
it being a Unix under the hood, it gives me POSIX compatibility all over the
place. Regarding developer tools, Homebrew has done a good enough job so far.
YMMV.

I see why people dislike macOS but banning it at the workplace would be over
the top.

