
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - bwag
https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
======
hobls
I feel terrible for anyone who sees this and thinks, “ah! I should move to a
monorepo!” I’ve seen it several times, and the thing they all seem to overlook
is that Google has THOUSANDS of hours of effort put into the tooling for their
monorepo. Slapping lots of projects into a single git repo without investing
in tooling will not be a pleasant experience.

~~~
adamrt
We moved to a monorepo about 2 years ago and it has been nothing but success
for us.

We have quite a few projects but only 4 major applications. Maybe it helps
that a few of our projects intertwine a bit, so making spanning changes in
separate repositories was a pain: separate PRs, etc. Now changes are more atomic.
Our entire infrastructure can be brought up in development with a single
docker-compose file and all development apps are communicating with each
other. I don't think we've had any issues that I can recall.

We are a reasonably small team though, so maybe that is part of it.

~~~
Tloewald
My previous job had a monorepo for the website and backend (but not the mobile
apps) which was insane to work with (as a coder I had a dedicated 128-core box
to work on; some engineers had more than one, and less intense engineers shared
one), and a substantial amount of my time was spent just finding code. I guess
most engineers just end up working in some nook, so constantly searching the
code becomes less of an issue (it never did for me), but the code/debug
cycle was dreadful.

I should add that a huge amount was invested in tooling. We had an in-house
IDE with debug tools that could step through serverside code. We had a highly
optimized code search tool. We had modified a major version control system so
it could handle our codebase. (Indeed we picked our version control system
because we needed to fork it and the other major version control system was
less amenable to our PRs.)

At my current job we have a microservice architecture and lots of small, focused
repos. Each repo is self-documented. Anyone can check out and build anything.
We don’t need obscene dev servers. We have not hugely invested in tools or
workflow.

Client apps unavoidably have larger repos than the services do.

Based on my personal experience, I think monorepos are nuts.

~~~
Too
How is this

    
    
       company
         /ProjectA
           /.git
         /ProjectB
           /.git
    

easier to browse than this?

    
    
       company
         /.git
         /ProjectA
         /ProjectB
    

You still need to find the Project A repo if you don't use monorepos. And even
if you do use a monorepo, everything doesn't have to be one monolithic build
bogging down your IDE; you can still have microservices with the code for each
hosted in the same repo. You seem to conflate monorepos with lots of other
things.

~~~
flukus
In the former, each project is a self-contained unit, and if I'm working on
project B I can forget project A even exists, which is lovely because I've got
enough to deal with on B as it is. Each project can be branched individually,
the log of project B is not polluted with commits to project A, and I can
rebase without getting a bunch of commits I don't care about.

The latter forces me to be aware of the entire universe in that repo.

~~~
Too
In the latter you can also forget project A exists. git log works in
subdirectories out of the box, and if you didn't touch any files in project A,
a rebase will be trivial, without any merge conflicts. Branching is also free
(unless you are still in the SVN Stone Age that requires a copy of each file),
so it doesn't matter if you branch the whole monorepo or just a single
project.
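
For example, scoping day-to-day git commands to one project is just a matter
of paths (a minimal sketch, assuming the company/ProjectA, company/ProjectB
layout from upthread):

    cd company/ProjectB
    git log -- .               # history of ProjectB only
    git rebase origin/master   # trivial if you only touched ProjectB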

~~~
wcoenen
> unless you are still in the SVN stoneage that requires a copy of each file

Maybe you are thinking of CVS? In SVN, creating branches has always been cheap
in both space and time.

------
jgibson
Is it just me, or are a lot of people here conflating source control
management and dependency management? The two don't have to be combined. For
example, if you have Python Project X that depends on Python Project Y, you
can either A) have them in different SCM repos, with a requirements.txt link
to a server that hosts the wheel artifact, B) have them in the same repo and
refer to each other from source, or C) have them in the same repository, but
still have Project X list its dependency on Project Y in a requirements.txt
file at a particular version. With the last option, you get the benefits of
mono-repo tooling (easier search, versioning, etc.) but you can control your
own dependencies if you want.
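
For instance, option C might look like this (a hypothetical sketch; the names
and versions are made up):

    # company-repo/project_x/requirements.txt
    # Project Y's source lives in this same repo, but Project X still pins
    # the published wheel at a specific version
    project-y==1.4.2

Mono-repo tooling sees all the source, while Project X picks up Project Y's
changes only when it bumps the pin.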

edit: I do have one question though: does Google's internal tool handle
permissions on a granular basis?

~~~
justicezyx
Single repo is one design that coherently addresses source control management
and dependency management.

The key is to let the repo be a single comprehensive source of data for
building arbitrary artifacts.

~~~
adrianN
A single repo makes it a bit tricky to use some library in version A for
project X and version B for project Y.

~~~
bunderbunder
I think that's actually a good thing. Allowing different projects to use
different versions of a 3rd-party package may be convenient for developers in
the short term, but it creates bigger problems in the long term.

~~~
adrianN
It depends on the industry. In some places changing a dependency, no matter
how trivial the change, entails _a lot_ of work. Think, for example, of
embedded systems, where deploying is a lot harder than pushing a Docker image
somewhere. It is often far cheaper to analyze whether the fixed bug can even
be triggered, and to avoid upgrading unless necessary.

~~~
bunderbunder
In those situations, why not go ahead and keep the code up-to-date and
consistent, and simply not deploy when you don't need to?

~~~
adrianN
Because that costs money now that could be spent on something that actually
produces a profit.

------
senozhatsky
Well, it's not so uncommon. The OpenBSD and NetBSD repos, for instance, are
sort of monolithic. And, believe it or not, there are some advantages. Let's
take a look at the OpenBSD 5.5 [0] release notes:

> OpenBSD is year 2038 ready and will run well

> beyond Tue Jan 19 03:14:07 2038 UTC

OpenBSD 5.5 was released on May 1, 2014, while Linux is still "not quite there
yet" y2038-wise. y2038 is a very complex issue, even though it may look
simple - time_t and clock_t should become 64-bit. This requires changes on
both the kernel side -- new syscall interfaces [stat()], new structure
layouts [struct stat], new sizeof()-s, etc. -- and the user space side. This,
basically, means ABI breakage: newer kernels will not be able to run older
user space binaries. So how did OpenBSD handle that? The reason the y2038
problem looked so simple to OpenBSD was the "monolithic repository". It's a
self-contained system, with the kernel and user space built together out of a
single repository. OpenBSD folks changed both user space and kernel space in
"one shot".

IOW, a monolithic repository makes some things easier:

a) make a dramatic change to A

b) rebuild the world

c) see what's broken, patch it

d) while there are regressions or build breakages, goto (b)

e) commit everything
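
In OpenBSD terms that loop is roughly (a rough sketch; see release(8) for the
real procedure):

    cd /usr/src
    # (a) apply the time_t/clock_t change to the kernel and headers
    make obj && make build     # (b) rebuild kernel, libc and userland together
    # (c)/(d) fix whatever breaks and rerun the build until it is clean
    cvs commit                 # (e) then commit the whole tree in one go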

[0] [http://www.openbsd.org/55.html?hn](http://www.openbsd.org/55.html?hn)

[UPDATE: fixed spelling errors... umm, some of them]

-ss

~~~
glandium
The reason why y2038 problem looked so simple to OpenBSD has little to do with
"monolithic repository" and everything to do with "happy to break kernel ABI
compatibility". You're saying as much yourself.

Monolithic repository might have been a tool that helped enforce it, but
that's not what made it happen. It's the decision that ABI could be broken
that did.

And that's also why it hasn't happened in Linux yet. Even if there was a
monorepo containing all the open source and free software in the world (or at
least, say, that you can find in common distros), the fact that there's a
contract to never break the ABI makes it simply hard to do.

~~~
senozhatsky
> Monolithic repository might have been a tool that helped enforce it, but
> that's not what made it happen. It's the decision that ABI could be broken
> that did.

Well, there are probably some subtle details which I'm missing, and maybe you
are totally right.

The way it looks to me is as follows: they are "happy to break kernel ABI
compatibility" because the repository is monolithic - they break the ABI, and
they immediately fix the user space apps.

E.g. the OpenBSD time_t 64-bit commit: [https://marc.info/?l=openbsd-
cvs&m=137637321205010&w=2](https://marc.info/?l=openbsd-
cvs&m=137637321205010&w=2)

They patched the kernel:

    
    
    	 sys/kern       : kern_clock.c kern_descrip.c kern_event.c
    	                 kern_exit.c kern_resource.c kern_subr.c 
    	                 kern_synch.c kern_time.c sys_generic.c 
    	                 syscalls.conf syscalls.master vfs_getcwd.c 
    	                 vfs_syscalls.c vfs_vops.c
    
    

and fixed broken user space at the same time:

...

    
    
    	 sys/msdosfs    : msdosfs_vnops.c
    	 sys/netinet6   : in6.c nd6.c
    	 sys/nfs        : nfs_serv.c nfs_subs.c nfs_vnops.c xdr_subs.h
    	 sys/ntfs       : ntfs_vnops.c
    	 sys/sys        : _time.h _types.h dirent.h event.h resource.h
    	                 shm.h siginfo.h stat.h sysctl.h time.h types.h 
    	                 vnode.h 
    	sys/ufs/ext2fs : ext2fs_lookup.c 
    	sys/ufs/ufs    : ufs_vnops.c 
    

...

There is no "transitional" stage, when the kernel is already patched, but no
user space apps are ready for those changes yet. It all happens at once.

-ss

~~~
flukus
> There is no "transitional" stage, when the kernel is already patched, but no
> user space apps are ready for those changes yet. It all happens at once.

What about third party apps? It's not a fully self-contained system; there are
binaries out there running on OpenBSD that the OpenBSD devs have never heard
of, and they were broken by the change.

~~~
int_19h
BSDs simply don't guarantee ABI stability, so no third party app should ever
make a syscall directly. It all goes via libc. So, yes, from that perspective,
it is a fully self-contained system.

In practice, third-party apps sometimes think that they know better, and get
broken. Anything written in Go, for example:

[https://github.com/golang/go/issues/16272](https://github.com/golang/go/issues/16272)

~~~
trasz
Sure they do guarantee ABI stability (within a major release), but the main
thing here is that FreeBSD - like pretty much any other operating system, but
unlike Linux - maintains the stability at the libc level, not at the syscall
level.

~~~
int_19h
I meant kernel ABI specifically, of course.

And it's a guarantee that is, essentially, useless for any purpose other than
the interaction between the base system and the kernel - i.e. not for third
party software.

------
ChrisCinelli
Managing dependencies and versions across repos is a pain, and refactoring is
quite hard when your code spreads across repos, considering the tree of
dependencies.

Unfortunately Git checks out all the code, including history, at once, and
that does not scale to big codebases.

The approach that Facebook chose with Mercurial seems a good compromise (
[https://code.fb.com/core-data/scaling-mercurial-at-
facebook/](https://code.fb.com/core-data/scaling-mercurial-at-facebook/) )

~~~
antt
Git works very well when the code is _distributed_. Which funnily enough is in
the name. That we are using git as a centralized repository is a case of "Why
do I need a screwdriver when I have a hammer?".

~~~
shub
There's nothing about git that requires you to use it like the kernel does.
Centralized version control is just a special case of decentralized, if you're
using git. You still get the benefits of your repo being a peer of the master
repo, like local branches.

------
whack
Maybe I'm not cool enough to understand this, but I don't see the draw for
monorepos. Imagine if you're a tool owner, and you want to make a change that
presents significant improvements for 99.9% of people, but causes significant
problems for 0.1% of your users. In a versioned world, you can release your
change as a new version, and allow your users to self-select if/when/how they
want to migrate to the new version. But in a monorepo, you have to either
trample over the 0.1%, or let the 0.1% hold everyone else hostage.

Conversely, imagine if you're using some tools developed by a far off team
within the company. Every time the tooling team decides to make a change, it
will immediately and irrevocably propagate into your stack, whether you like
it or not.

If you were at a startup and had a production critical project, would you
hardcode specific versions for all your dependencies, and carefully test
everything before moving to newer versions? Or would you just set everything
to LATEST and hope that none of your dependencies decide to break you the next
day? Working with a monorepo is essentially like the latter.

~~~
anyfoo
I've worked both at Google (only as an intern, though) and at other very very
big companies with gargantuan code bases. At that scale, with software that is
constantly in flux, pretty much the last thing you want is having to keep
compatibility between several versions of a component. It's bad enough if you
have to do it for external reasons, but if the only reason is so that "others
in the company have a choice" then... no, just no.

You might think this ought to be trivial by having clear API contracts, but
that's a) not how things work in practice if all code is effectively owned by
the same, overarching entity and, more importantly, b) now you have an
enormous effort to transition between incompatible API revisions instead of
just being able to do lockstep changes, for no real gain.

Even if you manage to pull that off (again, for what benefit?), it will bite
you that 1.324.2564 behaves subtly differently from 1.324.5234 even though the
intent was just to add a new option and they otherwise _ought_ to have no
extensional changes in behavior.

~~~
whack
It took me a while to figure out that you're disagreeing with me, because your
last paragraph is a perfect example of why monorepos are so dangerous.

Imagine a tooling team on a different continent that makes some changes this
afternoon. Like you said, their intent is just to add a new option, and it
ought to have no extensional changes in behavior, but it still ends up
behaving subtly differently. The next morning, all your services end up broken
as a result.

In a versioned world, you can still freeze your dependency at 1.324.5234, and
migrate only when you want to, and when you're feeling confident about it.

In a monorepo world, you don't have a choice. You've been forcefully migrated
as soon as the tooling team decides to make the change on their end. They had
the best of intentions, but that doesn't always translate to a good outcome.

FWIW, I'm currently working at a large famous company that uses a monorepo.
Color me not-impressed. I do think that having a single repository for an
entire team/project is a good idea. Hundreds of different projects and teams
who've never seen one another? Not so much.

~~~
anyfoo
> In a versioned world, you can still freeze your dependency at 1.324.5234,
> and migrate only when you want to, and when you're feeling confident about
> it.

The correct course of action is to either reverse/fix the code change to the
library you depend on, or if your code is clearly using the library wrong and
can be easily fixed, to do that. Not to let the whole ecosystem slowly spiral
out of control.

Either way, the point is that it will force the issue to be resolved, quickly,
and the code base to move forward.

The tools/libraries you depend on are themselves dependent on other libraries
and tools. They may have made changes that are necessary to continue working,
which you are not picking up if you stay behind. They will do IPC and RPC and
always rely on their infrastructure being current.

>In a monorepo world, you don't have a choice. You've been forcefully migrated

Yes, and that's good, because:

> and migrate only when you want to, and when you're feeling confident about
> it.

... does not help in moving the code forward.

If your change will break others, you need to coordinate with those others so
that the transition happens gracefully, not let them live on what amounts to
unsupported (and slowly more incompatible) code.

~~~
whack
> > In a monorepo world, you don't have a choice. You've been forcefully
> > migrated

> Yes, and that's good

In your projects, have you configured your build system to always auto-pull
the latest version of every single dependency you have? If not, you're not
practicing what you've claimed above.

FWIW, java-maven used to allow specifying LATEST/RELEASE versions, so that the
latest version will always be auto-pulled on every build. They later removed
that option entirely, because they realized how dangerous that is.

[https://stackoverflow.com/questions/30571/how-do-i-tell-
mave...](https://stackoverflow.com/questions/30571/how-do-i-tell-maven-to-use-
the-latest-version-of-a-dependency)

------
makecheck
This is clearly detrimental to external projects such as Go packaging, since
_their own_ developers will never be looking at dependency problems in the
same way as outside groups.

Monorepos also bug me because there will _always_ be some external package you
need, and invariably it's almost impossible to integrate due to years of
colleagues making internal-only things that assume everything imaginable about
the structure and behavior of the monorepo. There will be problems not
handled, etc., and it leads to a lot of NIH development because in the end
it's almost easier.

Also, it just feels _risky_ from an engineering perspective: if your
repository or tools have _any_ upper limits, it seems like you will inevitably
find them with a humongous repo. And that will be Break The Company Day
because your _entire_ process is essentially set up for monorepo and no one
will have any idea how to work without it.

~~~
robaato
What about Android and 800-1,000 git repos?!

I have seen the pain of trying to manage that across larger teams (e.g.
thousands of devs) - and no, the "repo" tool is not sufficient.

~~~
nwlieb
I'm very curious, what pain did you see with the repo.py tool?

------
tzhenghao
Having worked at different companies adopting both the monorepo and the
multiple-repo approach, I find a monorepo a better normalizer at scale in
consolidating all the "software" that runs the company.

Just like what many commenters here have mentioned, the monorepo approach is a
forcing function on keeping compatibility issues at bay.

What you don't want is to end up in a situation where teams reinvent their own
wheels instead of building on top of existing code, and at scale I think the
multiple-repo approach tends to breed that kind of codebase smell. [1] I'm
sure 8000 repos is living hell for most organizations.

[1] -
[https://www.youtube.com/watch?v=kb-m2fasdDY](https://www.youtube.com/watch?v=kb-m2fasdDY)

~~~
shiift
I really liked that talk! Lots of relevant information, and I can definitely
relate, working at Amazon. I wouldn't say that we are hurt by all of the same
problems (we have solutions that work very well for some of them), but we are
definitely aware of them.

------
mlthoughts2018
One of my former managers had worked a long time at Google and was present for
the advent of Google’s in-house tooling developed around their monorepo.

His account was that it was basically accidental, at first resulting from
short term fire drills, and then creating a snowball effect where the momentum
of keeping things in the Perforce monorepo and building tooling around it just
happened to be the local optimum, and nobody was interested in slowing down or
assessing a better way.

He personally thought working with the monorepo was horrible, and in the
company where I worked with him, we had dozens of isolated project repos in
Git, and used packaging to deploy dependencies. His view, at least, was that
the development experience and reliability of this approach was vastly better
than Google’s approach, which practically _required_ hiring amazing candidates
just to have a hope of a smooth development experience for everyone else.

I laugh cynically to myself about this any time I ever hear anyone comment as
if Google’s monorepo or tooling are models of success. It was an accidental,
path-dependent kludge on top of Perforce, and there is really no reason to
believe it’s a good idea, certainly not the mere fact that Google uses this
approach.

~~~
gefh
Do you wonder whether he is a reliable narrator?

~~~
mlthoughts2018
I don’t, but it’s fair to ask. He was unequivocally the best senior manager
I’ve worked with. Extremely technically smart but skilled at letting people
under him work autonomously, good communicator, cared a lot about pushing best
practices past bureaucratic barriers.

His description of Google made it seem like it had the same dysfunction every
place has. And the monorepo was a totally mundane, garden variety eyesore kind
of in-house framework that you’ll find anywhere.

I think he recognized the usefulness of just working with it and picking
battles. He was just dumbfounded that any outsider would see the monorepo
project and think it possibly had any relevance for anyone else. It was just a
Google-history-specific frankenstein sort of thing that got wrangled with
tooling later. The supposed benefits are all just retrofitted on.

------
haglin
Google's handling of their source code makes me wanna work there.

I don't like distributed version control systems with hundreds of repositories
spread out. It makes management more complicated. I understand this is a
minority view, but that is my experience. It was easier to work in a single
Perforce repository than hundreds of Git or Mercurial repos.

~~~
djur
Distributed vs. centralized VCS has very little directly to do with many vs.
monolithic repos. After all, git was originally developed for a project with a
large monolithic repo. Distributed VCS and many small repos got popular around
the same time, but that's partly coincidental (microservice architectures
getting popular, npm community preferring extremely small libraries) and
partly because of GitHub making it very cheap in money/time to have many git
repos.

------
a-dub
It should be noted that the monolithic model is somewhat encouraged by the
client mapping system in Perforce, which was Google's first version control
system so it is unclear to me if this was deliberate or just a side effect of
the best VCS of the time.

I also still have doubts about the value of a monorepo. In the article they
claim it's valuable because you get:
Unified versioning, one source of truth;

Extensive code sharing and reuse;

Simplified dependency management;

Atomic changes;

Large-scale refactoring;

Collaboration across teams;

Flexible team boundaries and code ownership; and

Code visibility and clear tree structure providing implicit team namespacing.

With the exception of the niceness of atomic changes for large-scale
refactoring, I don't really see how the rest are better supported by throwing
everything into one repo, rather than having a bunch of little repos and a
little custom tooling to keep them in sync.

~~~
malkia
A monotonically increasing CL number is also useful. You can mark quite a lot
of things with it - not only binary releases, but other developments too
(configuration files, etc.). In the end your binary "version" consists of a
main base CL plus cherry-picked individual CLs, rather than a branch with
these fixes. I guess one can encode this with git/hg too, by using SHA
hashes, but that is much more information, and a human has to handle it.

I guess it's not a very strong point, but using CL numbers (I'm working with
Perforce mostly these days) makes things easier. And having one CL number
monotonically increasing over all the source code you have is even better -
you can reference things more easily: just type cl/123456 and your browser can
turn it into a link. Among many other not-so-obvious benefits...

~~~
lpghatguy
Most popular Git frontends (GitHub and GitLab too, I believe) let you link to
commits with just the first 5-6 characters of the hash. I don't think that's
much different to remember than a Perforce CL number.
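
For example (a small sketch; the hash shown is made up):

    git rev-parse --short=6 HEAD   # prints an abbreviation like 1a2b3c
    git show 1a2b3c                # any unique prefix resolves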

~~~
malkia
To me the issue is mentally working with these numbers: P4 & G4's numbers
increment, so I can tell which one came before the other - I can't do that
with hashes. I'm sure I could get used to the other way, but this can't easily
be ignored.

~~~
Conan_Kudo
Mercurial gives you those incrementing numbers through its revlog, and they
map to revisions, so you get that facility there.

------
techbio
Previous thread:

[https://news.ycombinator.com/item?id=11991479](https://news.ycombinator.com/item?id=11991479)

------
ridiculous_fish
> Google's monolithic software repository, which is used by 95% of its
> software developers worldwide, meets the definition of an ultra-large-scale
> system, providing evidence the single-source repository model can be scaled
> successfully

This 95% number is the most surprising part of the article. It implies that
the sum of engineers working on Android + Chrome + ChromeOS + all the Google X
stuff + the long tail of smaller non-google3 projects (Chromecast, etc.)
constitutes only 5% of their engineers. Is e.g. Android really that small?

~~~
dlubarov
They must have meant that 95% of Google engineers use the monorepo in some
capacity, even if the majority of their work is done in a different repo.

------
stevesimmons
My company has a 50m LOC Python codebase in a monorepo. It works really well,
given the rate of change of thousands of developers globally. That is only
possible because of the significant investment in devtools, testing and the
deployment infrastructure.

Here is "Python at Massive Scale", my talk about it at PyData London earlier
this year:

[https://youtu.be/ZYD9yyMh9Hk](https://youtu.be/ZYD9yyMh9Hk)

------
timkrueger
We have been working with a monorepo since September 2017. I wrote about the migration:

[https://timkrueger.me/a-maven-git-monorepo/](https://timkrueger.me/a-maven-
git-monorepo/)

Our developers like it, because they can use 'mkdir' to create a new
component, search through the complete codebase with 'grep' and navigate with
'cd'.
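
For example (hypothetical component names):

    mkdir -p components/new-widget     # start a new component
    grep -rn "WidgetFactory" .         # search the whole codebase
    cd components/billing              # jump between components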

------
jamesmiller5
I wish more developers knew of the wonderful "repo" tool[0] developed by the
Android devs, which allows a monorepo _perspective_ over many git
repositories. Here is a breakdown of the repo tool and example manifest files:
[http://blog.udinic.com/2014/05/24/aosp-part-1-get-the-
code-u...](http://blog.udinic.com/2014/05/24/aosp-part-1-get-the-code-using-
the-manifest-and-repo/)

[0]
[https://source.android.com/setup/develop/repo](https://source.android.com/setup/develop/repo)

------
wrayjustin
> includes approximately one billion files

...

> including approximately two billion lines of code

_also_

> in nine million unique source files

I should insert a joke about how well the system would do if each source file
contained more than two lines of code.

But seriously, this summary could use some work.

~~~
rpcastagna
Binary files (arbitrary example: images used for golden screenshots in tests)
have no line counts and are likely skewing the numbers here -- in the way
you're (logically) looking to interpret them at least.

From a system design perspective, being able to handle a large number of files
regardless of type is an interesting challenge, as is being able to handle a
large number of highly indexed text files. All three of those statistics seem
potentially interesting for different audiences that might read this paper.

------
tsycho
It's not just devops that you need to pull off a large monorepo; the other big
thing is a strong testing culture. You have to be able to rely on unit tests
from across the code base being a sufficient indicator of whether your commit
is good. AND a presubmit process that can compute which parts of the monorepo
get affected by your diff, and run tests against them automatically before
committing your diff.

Google not only has the above but also has a strong pre-submission code review
process which catches large classes of bugs in advance.

------
malkia
Here is the video (with Rachel Potvin), predating the article by some months:
[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

------
vbezhenar
I've used a monorepo for a few small related projects and it worked just fine
for me. It is much easier to make related changes across several projects.

------
joe_fishfish
This is probably a stupid question, but I couldn't find an answer. Does this
mean Google keeps all of its different products in all their different
languages and environments in one repo? So like, Android lives in the same
repo as Gmail, which is the same repo as all the Waymo code and the Google
search engine code as well? That seems insane to me.

~~~
krackers
Android & chromium are kept outside the monorepo

------
paulddraper
Version controlled repositories are like business offices.

You can have your entire company in one location, or the entire company in
separate locations. The _most_ important thing is the logical rather than
physical organization: team structure, executive leadership, inter-org
dependencies, etc. You can achieve autonomy and good structure with or without
separate locations.

A single location reduces barriers, but at some point multiple locations can
solve physical and logistical challenges. The general rule of thumb is to own
and operate office space in as few locations as possible, but at some point
you have to take drastic measures one way or another.

(Notice that Google had to invent their own proprietary version control system
just for their monorepo. And not even Google _actually_ uses a single repo as
the source of truth: e.g. Chromium and Android.)

------
paulie_a
I'm sure properly organized it's okay, but from what I've seen it's mediocre
at best; especially with legacy/technical debt it's a huge mistake.

Start breaking that repo apart, because it probably isn't organized very well,
depending on the debt that exists.

~~~
oxguy3
One of the big advantages of the monorepo is actually that it prevents
technical debt from accumulating. If a change somewhere else breaks your code,
you can't put off dealing with it -- you are forced to fix the issue
immediately.

~~~
erik_seaberg
Tech debt is a useful tool and we shouldn't have zero tolerance. I can see
wanting to deprecate old versions promptly, but I can't see _instantly_
deprecating _every_ old version with no workaround for mitigating emergencies.

------
the_arun
Seems like Google uses its own custom Source Control & tools -
[https://www.quora.com/What-version-control-system-does-
Googl...](https://www.quora.com/What-version-control-system-does-Google-use-
and-why).

------
carapace
[https://en.wikipedia.org/wiki/Conway%27s_law](https://en.wikipedia.org/wiki/Conway%27s_law)

> "organizations which design systems ... are constrained to produce designs
> which are copies of the communication structures of these organizations."

Interestingly, in light of the above adage, this massive repo is organized (if
that's the word for it) like a bazaar or flea market. (Rather than like a
phone book
[https://en.wikipedia.org/wiki/Yellow_pages](https://en.wikipedia.org/wiki/Yellow_pages)
)

------
alexeiz
> Trunk-based development. ... is beneficial in part because it avoids the
> painful merges that often occur when it is time to reconcile long-lived
> branches. Development on branches is unusual and not well supported at
> Google, though branches are typically used for releases.

This sounds like the SVN model to me where branches are cumbersome and
therefore they are very rare. After getting used to the Git branching model
where branches are free and merges are painless, it would be very hard to go
back to the old development model without branches.

------
jbergknoff
How does CI work with a monorepo? Do you always have to run all the tests and
build all the artifacts? Or are there nice ways to say "just build this part
of the repo"?

~~~
dekhn
You specify targets. Just like using bazel: bazel build //tensorflow/blah/....

I maintain a small part of the monorepo, and it's really nice to be able say
"Run every test that transitively depends on numpy with my uncommitted
changes", so you can know if your changes break anybody who uses numpy when
you update the version.

Personally I think it would be neat if there was an external "virtual
monorepo" that integrated as-close-to-head of all software projects (starting
at the root, that's things like absl and icu, with the tails being complex
projects like tensorflow), and constantly ran CI to update the base versions
of things. Every time I move to the open source world, I basically have to
recompile the world from scratch and it's a ton of work.
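
Outside Google, the closest stock-bazel equivalent of that numpy example might
be something like this (a sketch; the numpy label is hypothetical):

    # find every test target that transitively depends on the in-repo numpy
    # target, then run them against the current (uncommitted) working copy
    bazel query "tests(rdeps(//..., //third_party/py/numpy))" | xargs bazel test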

------
nicodjimenez
I have slight experience with both monorepos and smaller repos and I think
they can both work. The advantage of smaller repos is that they force
different components to expose well-designed APIs. Bigger repos make sense for
products and embedded software; smaller repos make sense for platforms built
up of small services communicating over the internet.

~~~
djur
Smaller repos force different components to expose APIs, but I don't think
they force or even encourage the APIs to be well designed. In some cases, having
work spread across multiple repos can impede iterative development, meaning
that you risk half-assed or, uh, two-and-a-half-assed implementations.

Also, when someone's asking for review for a change that encompasses, say, a
change to a service, a change to a client library for that service, and a
change to 2-3 other services that use that client library, I know that I
cringe a little when suggesting a change, knowing that to implement it is
going to require a commit on all of these different repos, waiting for CI to
run on each one, etc. I try to only use that impulse to counter the urge to
bikeshed, but the temptation is there.

------
jorblumesea
Is this really relevant for anyone except for "google scale" companies? For
most teams, managing 30-40 services backed by git repos isn't a huge task and
doesn't cause many problems.

Is there mature tooling that helps teams manage this, or is this proprietary
google magic tooling?

~~~
fastball
Most teams can probably get by with much fewer than 30-40 services. Unless you
have 30-40 groups within your team.

~~~
jorblumesea
Even if they had that, managing the contracts between and within a small group
isn't super difficult.

------
testcross
I don't understand why gitlab/github/bitbucket don't provide better tools for
monorepos. This topic is pretty trendy, but there are absolutely no tools
helping with access control, good CI, ...

~~~
malkia
What's missing in these is cross-referencing, which is not possible without a
somewhat established BUILD system (caps "pun intended") - e.g. something like
bazel/BUILD - then a source code indexer, etc., etc.

This becomes very critical for doing reviews, since it allows you to "trace"
things without running them, apart from many other uses - for example, large
scale refactorings looking for usages of functions, and other cases like it.

Why can't github/gitlab/etc. do it? Well, because there could hardly be one
encompassing BUILD system to generate this index correctly.

~~~
testcross
They could create a standard file format that has to be generated by the build
system. GitHub is in a pretty powerful position; they could create even a
shitty version of it and people would follow.

I've been thinking about a tool like this for a long time: a way to attach to
each commit not only the diff of the code, but also the list of places
affected by the changes (usages of functions that are modified, for example).
Then during review we wouldn't have only a stupid diff; we would have a list
of places to check to be sure that the changes make sense in the context of
the project.

~~~
malkia
Even if they can, it's one thing to index your own source files every night,
and another to index a much bigger amount plus massive numbers of branches,
clones, etc. (I'm talking about github) - not practical, as there is no clear
way to say which branch (from git) must be indexed (obviously not all) - i.e.
there is no encompassing "standard" saying so.

That by itself is another BIG PLUS for the mono-repo (and "mono"-rules):
things are done one (opinionated) way, with trunk-based development - but that
gives you things that you won't otherwise be able to have.

Now, indexing source files is not an easy or cheap task - it's basically a
huge MapReduce done over several hours (just guessing), so there must be a
reason for it to be done.

------
IloveHN84
The giant monorepo only works if you're using SVN; with Git it would be a
tremendous pain

~~~
therealmarv
Unless you change git like Microsoft did:
[https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-
large...](https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-largest-git-
repo-on-the-planet/)

~~~
HNNewer
sure, but GitHub / GitLab / Bitbucket don't offer it

------
axaxs
Sorry, but as someone who has been in orgs that do both, the monorepo is a
mistake. Constantly needing to pull unrelated changes before pushing,
pipelines having to grab the whole repo for dependencies, etc. I understand
the arguments for the monorepo, but I don't think anything about it outweighs
the cons.

~~~
robaato
Well, those are issues with having a git monorepo - where the repo is the unit
of change: you get all of it or you don't.

With monorepos such as SVN or Perforce, you just work on whatever subset you
want.
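
For example (a sketch with a made-up URL):

    # SVN checks out only the subtree you ask for
    svn checkout https://svn.example.com/company/trunk/ProjectB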

------
prepend
I love these articles. Is there a wiki or collection of detailed descriptions
of large-company tech practices that isn't marketing blargh?

I read years ago about Google's data ingest/locator process but neglected to
bookmark it, so now I can't find the reference.

~~~
mlinksva
Me too. I don't know of a collection, but others can be found at
[https://ai.google/research/pubs/](https://ai.google/research/pubs/)
[https://research.fb.com/publications/](https://research.fb.com/publications/)
[https://www.microsoft.com/en-
us/research/search/?q&content-t...](https://www.microsoft.com/en-
us/research/search/?q&content-type=publications) and similar (though only a
small fraction give hints about at scale practices, and those would be neat to
collect in one place).

Closely related to this post: just noticed a 2018 case study on Advantages and
Disadvantages of a Monolithic Repository
[https://ai.google/research/pubs/pub47040](https://ai.google/research/pubs/pub47040)

------
gervase
Should probably have a [2016] tag.

~~~
guessmyname
Indeed, but to be fair, the information in the article is based on several
research papers from 2011 [1].

And I am 100% sure the idea of having a monolithic repository is several years
older than that.

I am grateful that the article is re-posted on multiple websites, because just
the other day I was in an interview and, while doing my coding challenge,
overheard a conversation between a young computer science graduate and another
interviewer. The interviewer asked him to explain what a monolithic repository
was and what its benefits were. The guy had no idea what the interviewer was
talking about, and right there I realized that much of what many of us take
for granted terminology-wise in the IT world will certainly be a foreign
language to young students who are just entering the workforce.

[1]
[http://info.perforce.com/rs/perforce/images/GoogleWhitePaper...](http://info.perforce.com/rs/perforce/images/GoogleWhitePaper-
StillAllonOneServer-PerforceatScale.pdf)

------
emmelaich
(2016)

------
tflinton
A repo with configuration, secrets and data?

Can we stop considering Google an engineering leader and just a search
algorithm leader?

------
curtis
I think monorepos make a lot of sense when you're talking about millions of
lines of code. I'm not at all sure they make sense when you're talking about
billions.

~~~
gravypod
I don't think the number of lines matters; I think the interconnection of your
code matters. If you have 2 sets of services that are completely uncoupled,
then having two monorepos for those two deployments makes sense. If you can
guarantee atomic changes across all services that interconnect, you have the
benefits monorepos give you.

~~~
mason55
Isn't this only true if you're doing full CI? Otherwise I could update my
service and you could update yours to work with mine, but unless we coordinate
deployments you still have to worry about interface mismatches. I guess the
alternative is that you can just never (for a loose definition of never) make
breaking changes to an interface; you can only enhance it or create a new
version.

------
fizixer
I don't care about that. For me this is incomprehensible:

Why the eff does Google have billions of lines of code in their repo?

I hope they are not counting revisions (e.g., if a single 1-million-line
project has 100 revisions, that's 1 million lines, not 100 million).

I have heard that they do count generated code (so it's not all handwritten
code). In that case, again, I have two things to say:

\- That's a bad metric. I could overnight generate a billion lines of code,
with each line a printf of number_to_word of the numbers from 1 to a billion
(something like the throwaway one-liner at the end of this comment). They want
to measure the size of the repo? Then they should tell us the gigabytes,
terabytes, etc. When it's lines of code, it's cheesy and childish to blow up
the measure by including lines of generated code.

\- But more importantly, I hope the generated code is 90% or more of that
repository, because anything less would mean that Google engineers have
handwritten 100 million or more lines of code throughout the lifetime of the
company, in which case I have to ask: what bloated mess do you have on your
hands? I thought you guys were the top engineers of the world.
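
The throwaway generator, for the record (numerals standing in for
number_to_word, but the point stands):

    # a billion lines of "code" by morning - LOC is a trivially inflatable metric
    seq 1 1000000000 | sed 's/.*/printf("&\\n");/' > generated.c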

