
Bring your monorepo down to size with sparse-checkout - Amorymeltzer
https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
======
divbzero
This reminds me of VFS for Git, Microsoft’s solution for scaling Git for the
Windows code base. [1] [2] [3]

[1]: [https://github.com/microsoft/VFSForGit](https://github.com/microsoft/VFSForGit)

[2]: [https://news.ycombinator.com/item?id=14411126](https://news.ycombinator.com/item?id=14411126)

[3]: [https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...](https://devblogs.microsoft.com/bharry/the-largest-git-repo-on-the-planet/)

~~~
wikibob
I'm surprised that VFS for Git isn't yet available on GitHub. Surely they are
working on adding it? Anyone have an inside scoop?

~~~
kyrra
Two reasons: VFS for Git is a fork of normal Git, and there is no Linux
client.

Also remember that there are many Git clients that work with normal Git repos,
like libgit2 and others. I doubt you'll see widespread support for it unless
MS can upstream it into the main Git implementation, and maybe some of the
primary libraries.

This is one nice argument for Mercurial, where there is only a single
implementation, so adding big new changes can be easier.

------
ComputerGuru
I have always done --depth=1 for projects I am not a core developer of, but
ran into an issue with it being seemingly impossible to do the same with
submodules.

Golang should have figured this out from day 1 before shipping with a release
system built around cloning a repo in its entirety: history, recursive
submodules, and all.
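For what it's worth, the flags for this do exist in reasonably recent Git (`--shallow-submodules` landed in 2.9, though support has historically been uneven, which is presumably the issue above). A throwaway local sketch, with made-up repo names:

```shell
# Demo: a shallow clone whose submodules are also cloned shallow.
set -e
tmp=$(mktemp -d) && cd "$tmp"

# a library repo that will become a submodule
git init -q lib
echo lib > lib/README
git -C lib add README
git -C lib -c user.email=a@b -c user.name=a commit -qm lib-v1

# a superproject embedding it
git init -q app && cd app
git -c protocol.file.allow=always submodule --quiet add "file://$tmp/lib" lib
git -c user.email=a@b -c user.name=a commit -qm app-v1
cd "$tmp"

# depth-limited clone of the superproject *and* its submodules
git -c protocol.file.allow=always clone -q --depth=1 \
    --recurse-submodules --shallow-submodules "file://$tmp/app" shallow-app

test -f shallow-app/.git/shallow     # superproject history is truncated
test -f shallow-app/lib/README       # submodule content is present
```

(The `protocol.file.allow` overrides are only needed because this demo uses local `file://` URLs; real remotes don't need them.)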

~~~
est31
depth=1 is an old feature, but it only limits the clone in the history
dimension. You still have to store the entire state of the tree as of the
last commit. This feature is about cloning parts of the tree.
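The two dimensions are easy to see side by side in a throwaway repo (names invented; the sparse-checkout command needs Git 2.25+):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# toy "monorepo" with two top-level projects and two commits
git init -q mono && cd mono
mkdir web worker
echo v1 > web/main.go && echo v1 > worker/main.go
git add . && git -c user.email=a@b -c user.name=a commit -qm one
echo v2 > web/main.go
git -c user.email=a@b -c user.name=a commit -qam two
cd "$tmp"

# history dimension: --depth=1 keeps one commit but the whole tree
git clone -q --depth=1 "file://$tmp/mono" shallow
test "$(git -C shallow rev-list --count HEAD)" -eq 1
test -d shallow/worker                   # full tree is still there

# tree dimension: sparse-checkout keeps all history, only chosen paths
git clone -q --no-checkout "file://$tmp/mono" sparse
git -C sparse sparse-checkout init --cone
git -C sparse sparse-checkout set web
git -C sparse checkout -q "$(git -C sparse symbolic-ref --short HEAD)"
test "$(git -C sparse rev-list --count HEAD)" -eq 2
test -f sparse/web/main.go
test ! -e sparse/worker                  # worker/ was never materialized
```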

~~~
ComputerGuru
My point is that even that basic functionality a) did not extend to all of
git's core functionality, and b) went unused by major players in the industry
self-proclaimedly responsible for "optimizing" the internet.

------
peterwwillis
I wish somebody'd write a book on monorepos. I've run into only a handful of
their problems when trying to manage production pipelines using just a dozen
services, so I'm sure there's tons more (like the purpose behind this
command). Nobody mentions the massive investment in time, technical expertise,
compute resource, and money required to run large monorepos in production.

Also, would emulating this command with a repo of submodules not work?

~~~
gdxhyrd
What problems did you encounter with just a few services?

Monorepos should be straightforward unless you are managing the code of >1k
engineers.

~~~
gen220
We’ve run into some nontrivial but totally solvable issues at about 100-200
engineers.

IME, most consternation comes from people adopting a mono repo without
adopting a build/dependency graph tool (like Bazel, Buck, or Pants).

An additional source of strain is from people abusing the repo (checking in
large binaries, third party dependencies, etc).

A third is when people try to do branch-based feature development, instead of
the “correct” practice of only deploying master (or weekly cuts of master).

I think even a simple list of these sort of “gotchas” would be valuable for
the aspirational mono repo company.

My impression is that a lot of teams hit these early and painful roadblocks,
and imagine that they’ll never go away (they do!!).

~~~
Hackbraten
Checking in third-party dependencies is not always abuse. It can be a useful
habit for certain kinds of reproducible builds. The Buck documentation even
endorses keeping your dependencies in your monorepo along with your own
sources.

~~~
gen220
I understand the reasoning, and agree that it’s not always abuse. At first
blush it’s a good idea, but I’d maintain that it’s one of the things that
balloons your repo size quite quickly. Plus, one has to draw a line somewhere
on what to include (a Python interpreter? A Go version? awk and grep?), and
third party vs in-house is a fairly robust one imo.

We host a private mirror for third party dependencies, so that “pip
install”/“go get” fail on our CI system if the dependency isn’t hosted by us.
This gives us reproducible builds, while allowing us to hold 3rd party
libraries to a higher standard of entry than source code. For certain
libraries we pin version numbers in our build system, but in general it allows
us to update dependencies transparently. It also keeps our source repo size
small for developers, and allows for conflicting versions (e.g., Kafka X.Y
and X.Z) without cluttering the repo with duplicates.

It’s definitely a smaller gotcha than the others I listed, maybe to the point
where it’s not a gotcha, but I stand by it :)

~~~
peterwwillis
If you can do that with 3rd party dependencies, can't you do that with _all_
the code?

This is what confuses me about monorepos. Their design requires an array of
confusing processes and complex software to make the process of merging,
testing, and releasing code manageable at scale (and "scale" can even be 6
developers working on 2 separate features each across 10 services, in one
repo).

But it turns out that you can also develop individual components, version
their releases, link their dependencies, and still have a usable system.
That's literally how all Linux distros have worked for decades, and how most
other language-specific packaging systems work. None of which requires a
monorepo.

So what I'd like to know is: of the 3 actual reasons I've heard companies
give for why they need a monorepo, is it _impossible_ to do these things with
multirepo? If it is indeed "hard" to do, is it "so hard" that it justifies all
the complexity inherent to the monorepo? Or is it really just a meme? And are
these things even necessary at all, if other systems seem to get away without
them?

~~~
gen220
These are great questions!! :)

> Can you treat all code like 3rd party dependencies?

Yes, but there are trade-offs. Discoverability, enforcing hard deadlines on
global changes, style consistency, etc.

> Is it impossible to do these things with multi-repo?

No, but there are trade-offs to consider.

> If it's hard, is it "so hard" that it justifies the complexity?

Hitting the nail on the head; there are trade-offs :)

> Are these things necessary, if other systems get away without it?

There are many stable equilibria; the open source ecosystem evolved one
solution and large companies evolved another, because they have been subject
to very different constraints. The organization of open source projects is
extremely different from the organization of 100+ engineer companies, even if
the contributor headcounts are similar.

For me, the semantic distinction between monorepos and multirepos is the
same as the distinction between internal and 3rd party dependencies. Does your
team want to treat other teams as a 3rd party dependency? The correct answer
depends on company culture, etc. It's a set of tradeoffs, including
transparency over privacy, consistency over freedom, collaboration over
compartmentalization.

With monorepos, you can gain a little privacy, freedom, and
compartmentalization by being clever, but get the rest for cheap; vice versa
for multirepos. It's trading one set of problems for another. I'd challenge
the base assumption that multirepos are "simpler", they're just more tolerant
of chaos, in a way that's very valuable for the open source community.

I hope we've not been talking past each other; I really like the ideas you're
raising! :)

~~~
peterwwillis
I don't think we're talking past each other, and thank you for your responses.

> Does your team want to treat other teams as a 3rd party dependency?

From what I recall, 'true' microservices are supposed to operate totally
independent from each other, so one team's microservice really is a 3rd party
dependency of another team's (if one depends on the other). OTOH, monolithic
services would require much tighter integration between teams. But there's
also architecture like SOA that sort of sits in the middle.

To my mind, if the repo structure mimics the communication and workflow of the
people writing the code, it feels like the tradeoffs might fit better. But I'd
need to make a matrix of all the things (repos, architectures, SDLCs,
tradeoffs, etc) and see some white papers to actually know. If someone feels
like writing _that_ book, I'd read it!

------
alanfranz
Monorepo + sparse-checkout looks a bit like a distributed subversion!

~~~
gdxhyrd
Not really, because commits don't go across the entire repo in SVN, which is
what makes monorepos so powerful.

~~~
krupan
What do you mean? When you commit to svn the whole repository goes up in
version number.

~~~
gdxhyrd
You are right, I was thinking of CVS.

In any case, with SVN you usually do not want to give write perms to everyone
across the whole tree, so you end up with effectively partitioned spaces, or
you make several repos instead, or you put another layer on top. With Git,
anyone can easily develop global commits.

------
alexhutcheson
sparse-checkout, partial-clone, and shallow seem like decent building blocks
to make working with very large repos tractable in git. At the same time, the
features and their interaction are pretty complicated, so I believe we'll need
good "porcelain" abstractions over these building blocks to make the workflow
reasonable for average users.
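For the curious, the three building blocks already compose on the command line today. A local throwaway sketch (the `services/*` layout is invented; the `uploadpack.*` settings are only needed because this uses `file://` URLs rather than a real server):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# toy repo; allow local clones to request partial-clone filters
git init -q mono && cd mono
mkdir -p services/web services/worker
echo web > services/web/main.go && echo worker > services/worker/main.go
git add . && git -c user.email=a@b -c user.name=a commit -qm init
git config uploadpack.allowFilter true
git config uploadpack.allowAnySHA1InWant true
cd "$tmp"

# all three at once: partial clone (no blobs up front), shallow history,
# and a sparse checkout of only the top level
git clone -q --filter=blob:none --depth=1 --sparse "file://$tmp/mono" combo

# widening the sparse cone lazily fetches just the blobs it needs
git -C combo sparse-checkout set services/web
test -f combo/services/web/main.go
test ! -e combo/services/worker
```

The fact that it takes this many flags and server-side settings to get there is arguably the point about needing better porcelain.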

~~~
allover
What is your bar for 'average users'?

If we're talking project-wide repos rather than entire-org repos, I'd wager
the vast majority of projects can use monorepos without special git tooling,
and will retain huge productivity benefits vs app/package-per-repo
organisation.

~~~
alexhutcheson
Honestly most orgs (with "most" weighted by org, not by headcount) could
handle entire-org repos without using any of these features. It's still worth
simplifying the workflow and training experience for projects and orgs that
grow beyond that, though.

------
mark_l_watson
Partial checkout efficiency improvements make mono repos more compelling for
large projects and organizations.

As an individual, I have switched to a mono repo for all of my Common Lisp
code and with some adjustments to my Quicklisp configuration I am very happy
with my setup.

I am a programming language junkie, and I have it on my low-priority todo list
to switch to a mono repo for Haskell, Racket, and the Hy language (a Lisp with
Clojure-like syntax that sits on top of Python).

I worked as a contractor at Google in 2013 and I absolutely loved their mono
repo and web based development environment. I really miss that.

------
krupan
How does a sparse checkout not defeat the purpose of a monorepo? I thought
monorepos existed so it was easy to make changes that affect the whole
codebase and to test those changes. If you only checkout a portion of the
files, how are you going to test against the whole repo?

EDIT: my overall concern is that it looks like people are reinventing
ClearCase. Please speak to an older developer who worked at an HP/IBM-type
company in the late '90s/early 2000s before you do that. Please!

~~~
marcell
Continuous integration tools still check out and test the whole repo. Google
has used this approach for over a decade.

~~~
andrewg
This would be impractical for really large monorepos like the ones Google and
Microsoft have. They have virtual file system layers on top (MS open sourced
theirs) to prevent checking out the whole repo.

In fact, it’s not just useful for the CI/CD pipeline - any developers making
significant changes to base libraries or core infrastructure should be able to
use the VFS in combination with a system like Bazel to run all (or a
significant sample of) affected tests across the company.

------
gravypod
I can't wait until there's tooling that takes advantage of this. Tying sparse
checkout into Gradle or Bazel would make this a lot easier.

------
MayeulC
Interesting. I've been wanting something like that for submodules. Can the two
features be combined?

For instance, if you need a single file/directory from another project in your
repository.
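They can at least be combined by hand: a submodule's worktree is a full repository, so you can enable sparse-checkout inside it after cloning. Nothing wires this up automatically on clone as far as I know, and the paths here are invented:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# a library with two directories; suppose we only ever want include/
git init -q lib
mkdir lib/include lib/src
echo h > lib/include/lib.h && echo c > lib/src/lib.c
git -C lib add . && git -C lib -c user.email=a@b -c user.name=a commit -qm lib

# superproject embedding it as a submodule (file:// needs the override)
git init -q app && cd app
git -c protocol.file.allow=always submodule --quiet add "file://$tmp/lib" lib
git -c user.email=a@b -c user.name=a commit -qm app

# the submodule is a full repo, so sparse-checkout works in its worktree
git -C lib sparse-checkout init --cone
git -C lib sparse-checkout set include
test -f lib/include/lib.h
test ! -e lib/src
```

Non-cone patterns could in principle narrow this further, down to a single file.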

~~~
krupan
I thought one of the big benefits of monorepos was that you didn't mess with
submodules anymore?

~~~
potatochup
It still might be needed for external dependencies. The code an organization
writes might be in one repo, but if you want to bring in some other library,
like libssl (assuming there is no better package manager for your language),
submodules are often used.

------
tantalor
I have to tell it which directories I want? That seems like work the tool
could do. Also, the granularity should be at the file level, not directory.

~~~
stolee
The sparse-checkout patterns match at the file level, so you can always use
that (without “cone mode”) if you want. It becomes difficult to match an exact
file list as people add files to projects: you require every other user to
update their patterns to match the newly-added file.
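A quick illustration of file-level (non-cone) matching, with made-up file names (the explicit `--no-cone` flag needs Git 2.35+):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
mkdir src docs
echo r > README.md && echo d > docs/guide.md
echo c > src/main.c && echo h > src/main.h
git add . && git -c user.email=a@b -c user.name=a commit -qm init

# non-cone patterns are .gitignore-style and can name individual files;
# a newly added file stays unmatched until users update their patterns
git sparse-checkout set --no-cone '/README.md' '/src/*.c'
test -f README.md
test -f src/main.c
test ! -e src/main.h    # not matched, so not materialized
test ! -e docs
```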

~~~
WorldMaker
Just keep in mind, as the article points out, that matching without "cone
mode" is potentially a lot slower, and that's why cone mode exists.

