
Why Google Stores Billions of Lines of Code in a Single Repository - signa11
http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
======
rzimmerman
I've worked a bit with "the big monorepo" (though nothing like Google scale)
and my impression is that a lot of the benefits fall apart if you don't have
people working to maintain the build environment, automated tests, developer
environments. When it takes hours or more to run tests and produce a build it
can really slow the team down. The ability to force updates on dependencies in
a single big commit can be really worthwhile as long as you're willing to
spend the time to build the tools and do the maintenance.

~~~
wtbob
> a lot of the benefits fall apart if you don't have people working to
> maintain the build environment, automated tests, developer environments.

The thing is, a monorepo makes clear what multiple repos can obscure: if you
have a software system, then you need all those things anyway. Multiple repos
can hide the cost, but they don't eliminate it — in my experience, they
multiply it. E.g. if someone makes a breaking change in a monorepo, _he_
breaks the tests and _he_ fixes them before his changes can be merged in; but
with multiple repos someone can commit a breaking change and a dependent repo
won't know about it until days, weeks or even months later; the person
responsible for fixing it has no idea what really changed or why; and he has a
broken build until he _does_ figure it out.

> The ability to force updates on dependencies in a single big commit can be
> really worthwhile as long as you're willing to spend the time to build the
> tools and do the maintenance.

It's like maintaining your car: you can spend $20 to replace your oil every
3,000-5,000 miles, or you can spend thousands every 10,000 miles. Up to you.

~~~
slavik81
On the other hand, some of the code you have to update may be for products
that have been mothballed. That might end up being a waste of time if the
project is never revived.

Or the code might be safety-critical, and updating it introduces unnecessary
changes (and hence, unnecessary danger).

Some tests may include hardware integration. For example, aircraft software
may need some number of flight hours after significant changes. That's
probably not going to be a part of the CI suite, and the changes will
introduce a greater manual testing burden.

~~~
tamana
That's why you delete dead code at head (keeping an archive in history), and
you have dev branches and release branches.

------
hpaavola
With my current client we decided to go with multiple repositories and came to
regret that decision.

The product family we are developing contains website, two mobile apps (iOS,
Android), three PC applications (OS X, Windows and one legacy application) and
software for embedded devices.

Each product lives in its own repository, most repositories use one shared
component as a submodule, and many products share a common platform, which is
also used as a submodule and which the products build on top of. The test
automation core sits in its own repo. I built and maintain that test automation
core and it's a pain.

Each product repository has its own functional tests that use the test
automation core to make the tests actually do something. So whenever I make
changes to the test automation core, I need to branch each product and point it
at the feature branch of my core, then run all the tests in each repo and check
that I did not break backwards compatibility. If I do break it, I need to fix
it through pull requests to possibly multiple teams.

I'm not the greatest git wizard in the world, so maybe someone else could
easily maintain a good picture of the whole mess in their head, but for me this
is a pain. And everyone else who maintains a shared component shares my pain.

A monolithic repo would not magically make all the pain in the world
disappear, but it would be so much easier to just have one repo. That way I
would only need to branch and PR once.

~~~
rakpol
Just to be clear, wouldn't the alternative with a monorepo still require that
you go back and forth with multiple teams if the commit is not backwards
compatible? It seems like the main complaint you have is that it's difficult
to wrangle a number of related pull requests, so perhaps switching to
something like gitcolony [1] for code reviews would help.

[1]: [https://www.gitcolony.com/features](https://www.gitcolony.com/features)

~~~
ori_b
> Just to be clear, wouldn't the alternative with a monorepo still require
> that you go back and forth with multiple teams if the commit is not
> backwards compatible?

No. You just send one diff that changes all the teams' code and updates
everything in lockstep, so at any point in your history, everything is
compatible.

Instead of you going back and forth with multiple teams, you're bringing them
together to comment on a change in one place. You synchronize on version
control history instead of needing to wrangle multiple teams across multiple
repositories, and you no longer need to deal with fallout for code change
compatibility. You just make the change. Everywhere. In one shot.

~~~
skybrian
You may have to get multiple teams to review your change before being allowed
to commit it. And you have to run all their tests. If there is a problem the
whole thing will typically get rolled back, which is a drag because then you
have to fix the issue, run tests again and get approvals again.

So, in practice, for large-scale changes that affect many teams, we still try
to break up large patches into multiple smaller steps, even working in a
monorepo.

A single commit is nice for small patches that only affect a handful of teams,
though.

------
OneMoreGoogler
During my 16-month tenure at Google, I worked on:

1\. Android, using shell and Make

2\. ChromeOS, using Portage

3\. Chrome browser, using Ninja

4\. google3 (aka "the monorepo") officially using blaze (but often there were
nested build systems - I remember one that used blaze to drive scons to build
a makefile...)

The diversity of the build systems significantly steepened the learning curve
when switching projects. During orientation, they told me "All code lives in
the monorepo, and every engineer has access to all code", but this turned out
not to be true at all. If anything it was the opposite: there was more build
system diversity at Google than at other places I worked.

~~~
flamedoge
> I remember one that used blaze to drive scons to build a makefile...

Good lord.. then again I have to use MSBuild. Why can't someone write one that
consumes something like JSON and is async?

~~~
tinco
I thought msbuild took json files too nowadays.

------
scaleout1
Please don't do it unless you are Google and have built Google-scale tooling
(gforce/Bigtable for code cache).

Benefits of a monorepo:

* change code once, and everyone gets it at once

* all third-party dependencies get upgraded at once for the whole company

Cons (if you are not Google):

* git checkout takes several minutes X the number of devs in the company

* git pull takes several minutes X the number of devs

* people running git fetch as a cron job, screwing up and getting into a weird state

* even after stringent code reviews, bad code gets checked in and breaks multiple projects, not just your team's project

* your IDE (IntelliJ in my case) stops working because your project has millions of files; it requires creative tweaks to only include the modules you are working on

* GUI-based git tools like Tower/Git Source don't work, as they can't handle such a large git repo

Google has solved all the issues I mentioned above, so they are clearly an
exception to this, but for the rest of the companies that like to ape Google:
stay away from monorepos.

~~~
pmezard
> git checkout takes several minutes X the number of devs in the company
>
> git pull takes several minutes X the number of devs
>
> people running git fetch as a cron job, screwing up and getting into a weird state

I feel like you are describing teams of several hundred developers. Unless
they commit large binary files each day, this is hardly an issue for smaller
teams of a few dozen people.

> even after stringent code reviews, bad code gets checked in and breaks
> multiple projects, not just your team's project.

Revert such code immediately once it is detected by CI, which is harder to do
if the changes and their adjustments are spread across a dozen repositories.
Also, please compare the ease of setting up CI for a hundred repositories
versus a single one with tens of thousands of files.

> your IDE (IntelliJ in my case) stops working because your project has
> millions of files; it requires creative tweaks to only include the modules
> you are working on.

A monorepo does not mean you have to load everything in a single workspace. It
means everything gets committed at once. If your tools cannot handle so many
files or cannot be configured to work on subsets, blame your tools.

Yes, monorepos present some challenges, but handling hundreds of repositories
is no better. Having done both, I prefer the former.

And yes, not everyone is Google, with thousands of developers and billions of
lines of code.

~~~
avita1
> Also, please compare the ease of setting up CI for a hundred repositories
> versus a single one with tens of thousands of files.

It's not so black and white though. There's plenty of difficulty in keeping a
mono-CI system running and being helpful.

* The CI service becomes a single point of failure for all developer productivity

* Running a full build on every commit while accounting for semantic merge issues (so serially) is non-trivial.

Jane Street did a pretty good write up of how hard this can be:
[https://blogs.janestreet.com/making-never-break-the-build-sc...](https://blogs.janestreet.com/making-never-break-the-build-scale/)

------
stdbrouw
> Since all code is versioned in the same repository, there is only ever one
> version of the truth, and no concern about independent versioning of
> dependencies.

This sounds like horror to me: it's essentially a forced update to the latest
version of all in-house dependencies.

Interesting article though. It feels like there's a broader lesson here about
not getting obsessed with what for some reason have become best practices, and
really taking the time to think independently about the pros and cons of
various approaches.

~~~
lpolovets
It's been a long, long time since I worked at Google, but IIRC the build
system was like Maven: you could depend on the latest version of something, or
a specific version, and only the dependencies you needed would get pulled from
the global repo when you checked out code. If your code depended on a specific
version of some piece of code, then you'd be immune to future updates to that
piece of code unless you updated your dependency config. (Hopefully I'm
remembering all of this correctly..)

~~~
thirtyseven
Nope, that's not quite how it worked. Except for a few critical libraries that
were pinned to specific versions, every build built against HEAD. However, a
presubmit hook would run tests on the transitive closure of code depending on
the changed code, and if anything broke, it was your responsibility as the
committer to fix it or roll it back.

This is really what kept things stable. If you maintained a common library
and wanted to update it, you had to either keep things backwards compatible,
write an automated refactor across the codebase (which wasn't too hard because
of awesome tooling), or you would get dozens of angry emails when everything
broke. If you relied on a library, you had to be sure to write solid e2e tests
or your code might silently break and it would mostly be your fault.
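
A toy sketch of that presubmit idea (hypothetical target names, and a plain
dict standing in for the real build graph; the actual tooling is of course far
more involved):

    from collections import deque

    # Hypothetical reverse-dependency graph: target -> targets that depend on it.
    REVERSE_DEPS = {
        "//base/strings": ["//net/http", "//apps/mail"],
        "//net/http": ["//apps/search"],
        "//apps/mail": [],
        "//apps/search": [],
    }

    def affected_targets(changed):
        """Transitive closure of everything depending on the changed targets."""
        seen, queue = set(changed), deque(changed)
        while queue:
            for dependent in REVERSE_DEPS.get(queue.popleft(), []):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen

    def run_presubmit(changed):
        # Any failing test in the affected set blocks the submit.
        for target in sorted(affected_targets(changed)):
            print(f"running tests for {target}")  # stand-in for the real test runner

    run_presubmit(["//base/strings"])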

I definitely miss the big G, so many resources were dedicated to engineer
productivity and tooling. It's what makes the monorepo work.

~~~
lpolovets
Thanks! I appreciate the correction. I miss a lot of the resources, too. I
also miss the spirit of dedicating serious attention to code quality. Things
like Testing Fix-it and Documentation Fix-it days were awesome.

~~~
squeaky-clean
> Things like Testing Fix-it and Documentation Fix-it days were awesome.

Can you elaborate on this? Are they just days where the team goes "Ok, no need
to stress about new features today, let's just catch up on test coverage /
documentation" ? If so, that sounds pretty wonderful.

~~~
lpolovets
It was company-wide. Every month or two, there would be a "documentation fix-
it day" or something similar. The theme would vary: sometimes it was docs,
sometimes unit tests, sometimes localization, etc. If memory serves correctly,
the idea was that unless you had a high priority/production bug, you'd spend
your day working on the theme of the fix-it.

------
wtbob
Anyone have a cache?

Without reading it, I've used monorepos and multiple repos, and far prefer the
former. What people don't get is that any system consisting of software in
multiple repos is really a monorepo with a really poor unit & integration test
story. The overhead of managing changes to inter-repo dependencies within the
same system is utterly insane — I'd say that it's somewhere between 2^n and
n^2, where n is the number of repositories (the exact number depends on the
actual number of dependency relationships _between_ repos).

In fact, after these several years, I'm beginning to think that 'prefer' is
not the word: monorepos appear, more and more, to be Correct.

~~~
kuschku
Unless the pieces of software have nothing to do with each other.

Client software in Java, another client in C#, another in Obj-C, and server
software in C++ will be hard to test together, will share no code, and will
often be developed in independent development cycles, often by independent
teams.

Putting such things in monorepos would be stupid.

~~~
ryandrake
Totally disagree. Any serious development project that has an application that
needs to work across iOS, Android, Mac and Windows should share as much of its
business logic as possible in C or C++, which is usable on all those platforms.
The UIs of course will be platform-specific, hopefully as thin as possible, but
the majority of your code should be totally re-usable.

------
DanielBMarkham
Couple of notes:

\- This is a technique, and it's a toolset, but most importantly it's a
_commitment_. Google could have split this up many times. In fact, this would
have been the "easy" thing to do. It did not. That's because this strategic
investment, as long as you keep working it, keeps becoming more and more
valuable the more you use it. Taking the easy way out helps today -- kills you
in the long run.

\- This type of work isn't just as important as regular development, it's
_more_ important than regular development, because it's the work that holds
everything else together.

\- In order for tests to run in any kind of reasonable amount of time, there
has to be an architecture. Your pipeline is the backbone you develop in order
for everybody else to be able to work.

\- You can't buy this in a box. Whatever you set up is a reflection of how you
evolve your thinking about how the work gets delivered. That's not a static
target, and agreement and understanding is _far_ more important than
implementation. I'm not saying don't use tools, but don't do the stupid thing
where you pay a lot for tools and consultants and get exactly Jack Squat. It
doesn't work like that.

~~~
qznc
The paper does not compare or evaluate this "commitment". It is a data point
showing that a monorepo scales to 86TB, provided you heavily invest in
infrastructure to support it. This is significant, as many would have
considered that "impossible", as in "it is impossible to make git scale to
86TB".

We don't know if the monorepo approach is worth it. Google believes it is
(Facebook as well). Many others don't. The manyrepo approach also has
advantages.

~~~
DanielBMarkham
FTA, "...Over the years, as the investment required to continue scaling the
centralized repository grew, Google leadership occasionally considered whether
it would make sense to move from the monolithic model. Despite the effort
required, Google repeatedly chose to stick with the central repository due to
its advantages..."

The article goes into detail about both benefits and drawbacks.

 _The manyrepo approach also has advantages._

Care to elaborate on any of these? (aside from the obvious "works fast on my
machine")

------
deepsun
It works well when most of your code is written in-house. If you have a lot of
external dependencies -- not so good.

The problem is which version of a third-party dependency the various projects
in the BIG repo should depend on.

Article mentions that: "To prevent dependency conflicts, as outlined earlier,
it is important that only one version of an open source project be available
at any given time. Teams that use open source software are expected to
occasionally spend time upgrading their codebase to work with newer versions
of open source libraries when library upgrades are performed."

So if you have a lot of external dependencies -- you need a dedicated team to
synchronize them with all your internal projects.

~~~
willvarfar
The article doesn't say that Google needs _dedicated_ teams to synchronize with
external open source projects. The teams that _use_ each external dependency
are responsible for maintaining it as par for the course.

~~~
cmrdporcupine
When submitting a third party package to the repo you become the third party
package maintainer. And it's your responsibility to make sure you don't break
the world when updating it. If people using it want a newer version (only one
version can live in the repo, by policy) it is generally their responsibility
to manage that transition. It's been a long time since I managed a third party
package, but that's my rough recollection.

The other thing is that third party packages are checked in as source.
Everything is built from source at Google.

~~~
iamchrisle
This is still how it's done today.

------
oblio
I have a question for Googlers: I keep hearing about refactoring all across
the repo. How does that work with dynamic call sites (reflection, REST), etc.?

I mean, there's no way to prove something is actually used in those cases
except for actually running the thing.

Do you just rely (and hope) on tests across all projects?

~~~
vjeux
There is a good talk from a Facebook engineer about this for JavaScript:
[https://www.youtube.com/watch?v=d0pOgY8__JM](https://www.youtube.com/watch?v=d0pOgY8__JM)

It usually goes like this:

\- You make your API backward compatible.

\- You use grep to try and find all the ways people are using it.

\- You write and run a codemod to convert 70-90% of the call sites
mechanically.

\- You manually fix the remaining ones. This is usually a good opportunity to
throw a hackathon-style event where a bunch of people sit in a room for a few
hours and drive it down.

\- You remove support for the previous version.
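
A minimal sketch of the codemod step above (hypothetical old/new API names, and
a plain regex where a real tool would use an AST-based transform):

    import pathlib
    import re

    # Hypothetical migration: rewrite deprecated fetchUser(id) calls to getUser({id: ...}).
    OLD_CALL = re.compile(r"fetchUser\((\w+)\)")

    def codemod_file(path: pathlib.Path) -> bool:
        """Rewrite one file in place; return True if it changed."""
        text = path.read_text()
        new_text = OLD_CALL.sub(r"getUser({id: \1})", text)
        if new_text != text:
            path.write_text(new_text)
            return True
        return False

    def main(root="."):
        changed = [p for p in pathlib.Path(root).rglob("*.js") if codemod_file(p)]
        # Whatever the regex could not match (dynamic or reflective call sites)
        # is the remainder that needs the manual/hackathon pass described above.
        print(f"rewrote {len(changed)} files")

    if __name__ == "__main__":
        main()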

~~~
ionforce
> You use grep to try and find all the ways people are using it.

The mere thought of this makes me think of the dark ages and a time in my
career that I'd rather forget. All hail statically typed languages. Or
languages that can easily be tooled against.

~~~
_asummers
If Facebook uses Flow internally, tooling could be written against that. Hell,
just removing the old function and seeing the compile errors would expose all
the call sites if their whole code base is Flow typed.

------
ACow_Adonis
As someone outside of Google, I'm having a hard time seeing how this would
actually work. Not as in "it can't be done", but as in: is there actually
empirical evidence that the supposed benefits (how do you know silos are lower
than otherwise?) are happening as claimed because of the monolithic code? Do
silos really come down, do big changes to underlying dependencies really get
rolled out, or do people hunker down into their own projects and try to cut
out as many dependencies as possible?

Perhaps the extra tools, automation, testing, etc helps to a large extent, I
can see that being reasonable, but I don't see how they solve all the problems
I have in mind.

Perhaps more so, if you've invested in all these automated tools, I am,
perhaps (certainly?) ignorantly, not entirely certain what those tools
inherently have to do with the choice of a monolithic code base? Couldn't many
of them work on a distributed code base if they're automated? I mean, we're
talking about "distributed" in the sense that its all still in the one org
here...I realise that in practice, this distinction between monolithic and
distributed is possibly getting a bit academic...

~~~
md_
I think your last paragraph sort of makes the point.

A lot of these things depend upon the ability to quickly and easily see how a
given piece of code is being used universally. Whether that means a single
repo or simply a single unified index of all code and the ability to
atomically commit to every repo is immaterial, but the latter sounds a lot
like the former.

------
acqq
I've found an interesting detail, does anybody know more about this?

"The team is also pursuing an experimental effort with Mercurial, an open
source DVCS similar to Git. The goal is to add scalability features to the
Mercurial client so it can efficiently support a codebase the size of
Google's. This would provide Google's developers with an alternative of using
popular DVCS-style workflows in conjunction with the central repository. This
effort is in collaboration with the open source Mercurial community, including
contributors from other companies that value the monolithic source model."

~~~
trymas
Hasn't Facebook forked Mercurial to use it for their own mega monorepo?

~~~
xapata
As per this blog post:
[https://code.facebook.com/posts/218678814984400/scaling-merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/)

------
valarauca1
Google gave a talk at the @scale conference in 2015 about this very topic. You
can watch it here (30 minutes):
[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

~~~
Bytes
Thanks for linking the talk. It was much more descriptive than the article.

------
ComodoHacker
Interesting:

    
    
        The Google code-browsing tool CodeSearch supports simple edits
        using CitC workspaces. While browsing the repository, developers
        can click on a button to enter edit mode and make a simple change
        (such as fixing a typo or improving a comment). Then, without
        leaving the code browser, they can send their changes out to the
        appropriate reviewers with auto-commit enabled.
    

Do they still maintain CodeSearch for themselves? Was it so much of a burden
to maintain a reduced version of it for the public?

~~~
SwellJoe
They open sourced a tool for code search:
[https://github.com/google/codesearch](https://github.com/google/codesearch)

I don't know if it's the same, as I just heard of it from a YAPC talk a few
days ago and haven't tried it, but it's called "Code Search" so it seems
likely.

~~~
ComodoHacker
It looks like a local index and search tool like ack and ag, while above I'm
talking about the whole service.

~~~
DannyBee
"It looks like local index and search tool like ack and ag. While above I'm
talking about the whole service. "

The tool that was open sourced by Russ uses the same technology/scheme that
the original codesearch was built on.

Given Russ wrote code search (the service) as an intern in a few months, one
would think you should be able to take the pieces and put the rest together.
:)

As for cost to maintain it for the rest of the world --

Look, if you have, say, a team of 3-4 people, and your mandate is mainly to
support internal developers, and there is plenty to do there, you just aren't
going to end up keeping external folks happy. This is likely true of almost
anything in the software world. Even if it works, people want to see it evolve
and continue to get better, no matter what "it" is.

If the next question is "why is internal so different from external that this
matters", i could start by pointing out that, for example, internally it
doesn't need to crawl svn, etc repositories to be useful. There are tons and
tons and tons of things you don't need internally but need externally. What it
takes for a feature to be "good enough" to be useful is also very different
when you are talking about the world vs 20000 people.

So it's not really even a question of "cost to maintain" in some sense.

------
arviewer
Does this mean that one Googler could check out the complete system, and sell
it or put it online? How many people have access to the complete repository?
How big is one checkout?

~~~
OneMoreGoogler
Checkouts are apparently instantaneous. Behind the scenes, there is a FUSE
filesystem that faults in files by querying Google's cloud servers. So it does
not require a significant amount of space (but does require fast access to
Google's servers, which can be problematic).

Almost all engineers have access to the "complete system," which is really
Google's server-side software. Other repos like Android have historically been
more locked-down, but there's been some recent effort to open them up within
Google.

Presumably if you tried to copy all of the files, you'd first be rate-limited,
then get an access denied, and lastly a visit from corporate security. I
wouldn't want to try.

~~~
caustic
Is it a good idea to try to run the du command in the root directory?

~~~
arviewer
Probably even better to try sudo -i! ;-)

------
mac01021
Since acm can't handle the traffic:

[https://web.archive.org/web/20160625124824/http://cacm.acm.o...](https://web.archive.org/web/20160625124824/http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext)

------
makecheck
There are advantages and disadvantages to consider. Clearly monolithic
repositories allow you to leverage commonality but it’s not free.

For example, single-repository environments may require you to check out
_everything_ in order to do _anything_. And frankly, you shouldn’t have to
copy The World most of the time. Disk space may be cheap but it is never
unlimited. It’s a weird feeling to run out of disk space and have to start
deleting enough of your own files to make room for stuff that you know you
don’t care about but you “need” anyway. You are also wasting time: Subversion,
for instance, could be _really slow_ updating massive trees like this.

There is also a tendency to become too comfortable with commonality, and even
over-engineer to the point where nothing can really be used _unless_ it looks
like everything else. This may cause Not-Invented-Here, when it feels like
almost as much work to integrate an external library as it would be to hack
what you need into existing code.

Ultimately, what matters most is that you have _some_ way to keep track of
which versions of X, Y and Z were used to build, and _any_ stable method for
doing that is _fine_ (e.g. a read-only shared directory structure broken down
by version that captures platform and compiler variations).

~~~
slededit
Companies that do this don't require you to check out the world, that would be
craziness. Perforce allows for partial mappings and is used by pretty much all
the big companies. Even Microsoft's internal source depot is just a fork of
perforce.

Dependencies are usually handled by a binary import/export system where you
pull modules from a shared "build the world" system from a known checkpoint.

~~~
wocram
Both git and hg support shallow (low-history) and sparse (selective-files)
checkouts as well; Facebook has a blog post somewhere about how they built it
into hg.

~~~
kmiroslav
It's still not the same as perforce's mappings. Even with shallow repos, git
will force you to download the entire world each time you pull or fetch.

~~~
jahewson
That's not really the entire world though, because the history before the
shallow clone will not be pulled or fetched. You'll get all the new stuff
though.

~~~
slededit
I don't think the scale of these places is fully appreciated. Your local
enlistment could be a few hundred gigabytes and that is only the head revision
of stuff you work with daily. Keeping more than just the current revision
around is a lot of data for even a beefy workstation.

------
quux
See also:
[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

------
kev009
With history as a guide, we will look back on this for the tire fire it is.
The monorepo and projects like Chrome and Android look to me like a company
that is trying its best to hold itself together but is bursting at the seams,
the same way Microsoft did in the '90s with Windows NT and other projects.
Googlers frequently use appeal to authority to paper over the fact that they
are basically the new Microsoft.

~~~
commentereleven
OK, interesting point of view.

Can you justify it with examples?

~~~
kev009
That was just my reaction to the article. I saw Google and ACM so thought it
would be something good, but in the end I fell victim to the authority click
bait. The article is written in a scholarly style, but it is neither academic
nor profound. It is the engineering equivalent of an intra-office memo
describing how a massive company is going to organize their sales regions and
Take Over The World. Imagine for a moment if IBM, Microsoft, or the Department
of Defense wrote this exact piece today... would ACM have published it? Would
it have made the front page of HN? I simply felt disappointed and expressed
that reaction.

For examples of Google non-superiority (remember, this is a
hacker/entrepreneurial forum, we should seek solidarity in outdoing behemoth
entities with agility), I simply encourage you to think for yourself and not
put any credence into lore, methodology, or tech simply because it comes out
of Google. I see a 1:1 comparison between Android and the non-WinNT kernel
Windows releases. Google put a festering pile of code out and allowed handset
makers and carriers to basically never patch any type of vulnerability. The
permissions model and app overreach are just barely now contained in Android
6... seven years after release. Chrome bundles tons of third party
libraries... it's another moving train wreck with enough bodies to somehow
deal with the naive vendoring, scope creep, and general upkeep but it's still
a nightmare to correctly build and package for an OS. By comparison I have
immense respect for the Servo developers who are making an interesting reach
with far less resources than Google.

------
zellyn
One thing to consider is that monorepo tooling is (outside of Google) still
pretty immature.

At Square, we have one main Java "monorepo", one Go "monorepo", and a bunch of
Ruby repos. The Java repo is the largest, by a huge factor (10x the Go repo,
for example).

The Java repo is already noticeably slow when doing normal git operations. And
I'm told we've seen nothing of the pain the folks at Twitter have had with
git: their monorepo is huge.

We periodically check to see how the state of Mercurial and Git is progressing
for monorepo-style development. Seems like Facebook has been doing interesting
stuff with Mercurial.

But I still miss the fantastic tooling Google has internally. It really is so
much better than anything else I've seen outside.

~~~
huherto
How big is your monorepo?

We have ~90k lines of Java code. I don't think it will be a problem unless it
grows ten times. We spend time grooming it, removing old code, etc. I believe
that is the case for most companies, unless you are Google, Facebook, Square,
etc.

~~~
zellyn
Our Go monorepo has 1,325,332 lines of non-comment code, according to `cloc`,
although about half of that is generated protobuf code. I'm still waiting for
`cloc` to finish on the Java repo…

~~~
durin42
It's likely that Mercurial would scale for you today, using Facebook's stack.
Drop me an email if you'd like to chat sometime.

------
kngspook
Seems to be down... (I can't imagine the ACM is _that_ easy to take down..?)

~~~
dammitcoetzee
Remove the /fulltext and it loads instantly.

~~~
rzimmerman
Thanks! How on Earth did you figure that out?

~~~
dammitcoetzee
I have to fix a lot of broken URLs from the submissions sent in to
hackaday.com. I guess I don't even think about fiddling with them anymore until
they show up, haha :) What a weird skill.

------
hbsnmyj
For me, it seems that the power of this model is not that "we have a single
head", but rather that "we can force everyone to use the newest library version
(except when sometimes creating a whole new interface)".

Let's suppose we have a tool that can

* automatically check out the HEAD of each dependent repo

* run a complete integration tests across all the repo before any push/check-in.

This will work fine even with a multi-repo model, won't it?
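
As a rough sketch of such a tool (the repository list, layout, and test entry
point are all made up), the multi-repo flow might look like:

    import subprocess

    # Hypothetical set of repositories that together form the system.
    REPOS = {
        "core": "git@example.com:core.git",
        "service-a": "git@example.com:service-a.git",
        "service-b": "git@example.com:service-b.git",
    }

    def run(cmd, cwd="."):
        subprocess.run(cmd, cwd=cwd, check=True)

    def presubmit():
        # Check out (or fast-forward) the HEAD of every dependent repo...
        for name, url in REPOS.items():
            try:
                run(["git", "clone", "--depth=1", url, name])
            except subprocess.CalledProcessError:
                run(["git", "pull", "--ff-only"], cwd=name)
        # ...then run the full integration suite before allowing the push.
        for name in REPOS:
            run(["make", "integration-test"], cwd=name)  # assumed test entry point

    if __name__ == "__main__":
        presubmit()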

Also, as mentioned earlier by others, the reason Google can do it is because
Google can:

* maintain a powerful cloud-based FUSE file system to support fast checkouts

* run automated build tests before any submit to ensure the correctness of any build

So they don't need to maintain multiple dependency versions (most of the time).

------
qznc
How does Google deal with external dependencies? E.g. the Linux kernel. Do
they have a copy of Linux in the monorepo for custom patches? Is there a
submodule-like reference to another repository? Is there a tarball in the
monorepo, a set of patches, and a script to generate the Custom-Google-Linux?
What happens when a new (external) kernel version is integrated?

------
mk89
It's amazing to read this kind of article, because it is the best argument
against all those very opinionated (or simply arrogant) people who claim that
"one repository is s __* " or similar, such as "this technology is better than
that one", "unit tests yes/no"; the list of religious wars could go on
forever...

Thanks a lot for sharing!

------
andrewguy9
This works because they have teams dedicated to reproducing the work everyone
else gets from their ecosystem's community.

For us non-Googlers: would you trade away the benefits of pip/gem/npm for a
single source of truth?

~~~
timv
I can't see why I can't have both...

If my central package repo had genuinely reproducible builds, such that I
could download the source, run the build process and know that I had the same
output that the repo held, then I think that I would love to have a setup
where:

\- I could "add a dependency" by having an automated process that downloads
the source, commits it as its own module (or equivalent) in my source repo,
tags it with the versioned identifier of the package, and then builds it to
make sure it matches what the package repo holds.

\- I could make local modifications to the package if I needed to, and my
source control would track my deviation from the tagged base version

\- I could upgrade my package by re-running the first step, potentially
merging it with my local changes.

 _Hmm, I think I just described the ultimate end state for Go package
management..._
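
A minimal sketch of that first "add a dependency" step, assuming a tarball per
package version and skipping the reproducible-build verification:

    import io
    import pathlib
    import subprocess
    import tarfile
    import urllib.request

    def vendor(name, version, url, dest="third_party"):
        """Download a package's source and commit it as a vendored module."""
        target = pathlib.Path(dest) / name
        data = urllib.request.urlopen(url).read()
        with tarfile.open(fileobj=io.BytesIO(data)) as tar:
            tar.extractall(target)  # the vendored source now lives in-tree
        subprocess.run(["git", "add", str(target)], check=True)
        subprocess.run(["git", "commit", "-m",
                        f"third_party: vendor {name} {version}"], check=True)
        # Tag the pristine base version so local patches can be diffed against it.
        subprocess.run(["git", "tag", f"vendor/{name}/{version}"], check=True)

    # Example (hypothetical package and URL):
    # vendor("leftpad", "1.3.0", "https://example.com/leftpad-1.3.0.tar.gz")

Upgrading would then be a matter of re-running the same step against a newer
tarball and merging it with any local modifications.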

------
jjnoakes
The monorepo solves source-to-source compatibility issues, but it doesn't
solve the source-to-binary compatibility issues. For that you need a solid
ABI, possibly versioned, unless every code checkin redeploys every running
process everywhere.

Say version N of the code is compiled and running all over the place, and you
make a change to create N+1. Well, if you don't understand the ABI
implications of building and running your N+1 client against version N servers
(or any other combination of programs, libraries, clients, and servers), then
you'll be in a mess.

And if you do understand those ABI boundaries well and can version across
them, I'm not sure you need a monorepo much at all.
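
To make that concrete, here is a toy sketch (a hypothetical record format,
nothing Google-specific) of the tolerance an N / N+1 mix of binaries needs,
regardless of how the source is stored:

    # A version-N server writes records without the "region" field; a version-N+1
    # reader must still accept them (and must not repurpose old field names).
    def parse_user(record):
        return {
            "id": record["id"],                          # present since version N
            "name": record["name"],                      # present since version N
            "region": record.get("region", "unknown"),   # added in N+1: optional, defaulted
        }

    old_record = {"id": 7, "name": "ada"}                # written by a version-N binary
    new_record = {"id": 8, "name": "grace", "region": "emea"}
    assert parse_user(old_record)["region"] == "unknown"
    assert parse_user(new_record)["region"] == "emea"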

~~~
fishywang
protobuf and grpc (both open-sourced) solved that: just don't reuse a protobuf
field number and you are fine.

------
MichaelBurge
I find it really hard to find anything when it's split into a bunch of smaller
repositories. If you're going to do that, you should at least have one master
repository that adds the tiny ones as git submodules.

------
sytse
This is a very interesting article. I believe there is value to using
libraries, but there is something to be said for monorepos. Interesting that
Google and Facebook are working together to extend Mercurial for large repos.

The article gave me the following ideas to extend GitLab:

* Copy-on-write FUSE filesystem: [https://gitlab.com/gitlab-org/gitlab-ce/issues/19292](https://gitlab.com/gitlab-org/gitlab-ce/issues/19292)

* Auto-commit using auto-assigned approvers: [https://gitlab.com/gitlab-org/gitlab-ce/issues/19293](https://gitlab.com/gitlab-org/gitlab-ce/issues/19293)

* CI suggesting edits: [https://gitlab.com/gitlab-org/gitlab-ce/issues/19294](https://gitlab.com/gitlab-org/gitlab-ce/issues/19294)

* Coordinated atomic merges across multiple projects: [https://gitlab.com/gitlab-org/gitlab-ce/issues/19266](https://gitlab.com/gitlab-org/gitlab-ce/issues/19266)

------
carrja99
At a previous job I had Google to thank for the team's god-awful decision to
choose Perforce over git, thanks to some silly whitepaper or article. They
acted like git was some fringe version control system that no one would use
professionally... just for fun toy projects.

~~~
packetslave
Perforce is a perfectly reasonable choice given certain technical and
organizational constraints. It is a bad choice for others. Git is a good
choice for many projects. It is a terrible choice for others, and was a much
worse choice in the past.

------
dabn
My experience working in large corporations and smaller companies with both
approaches tends to make me lean towards the multi repo approach.

Some details:

* Amadeus: huge corporation that provides IT services for the airline industry, handling the reservation process and the distribution of bookings. Their software is probably among the biggest C++ code bases out there. We were around 3000 developers, divided into divisions and groups. Historically they were running on mainframes, and they were forced to have everything under the same "repository". With the migration to Linux they realized that that approach was not working anymore at the scale of the company, and every team/product now has its own repository.

All libraries are versioned according to the common MAJOR.RELEASE.PATCH naming
and upgrades of patch level software are done transparently. However Major or
release upgrades have to be specifically targeted. What is more important for
them is how software communicates, which is through some versioned messages
API. There is also a team that handles all the libraries compatibility, and
package them into a common "middleware pack". When I left around 2012 we had
at least 100 common libraries, all versioned and ready to use.

* Murex: financial software used in front/back office for banks. We had one
huge Perforce repo; I can't even begin to tell you what a pain it was. You
could work for a day on a project, and then have to wait weeks for a slot to
merge it into master. Once you had a slot to merge your fix into master,
chances are the code had changed somewhere else in the meantime and your fix
couldn't be merged anymore. That led to a lot of fixes being done on a
pre-merge branch, manually, in the Perforce diff tool.

Also given the number of developers and the size of the repository, there was
always someone merging, so you had to request your slot far in advance. Maybe
the problem was that the software itself was not modular at all, but this
tends to be the case when you don't force separation of modules, and the
easiest way is to have separate repositories.

* Small proprietary trading company: We didn't have a huge code base, but
there were some legacy parts that we didn't touch often. We separated
everything into different repos, and packaged all our libraries as separate
RPMs. It worked very well and it eased the rebuild of higher-level projects.
If releasing some project used to take ~1h, with the separation of libraries
it would only take 5 minutes. It worked well because we didn't often change
the base libraries that everyone depended on.

------
plandis
I'd be curious to see how this works with third-party libraries, etc...

For instance, did everyone at Google migrate to Java 8 at the same time? That
seems like a huge amount of work in a monorepo.

------
doozler
Cached version:
[http://webcache.googleusercontent.com/search?q=cache:XDtepkB...](http://webcache.googleusercontent.com/search?q=cache:XDtepkBDv3MJ:cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext+&cd=1&hl=en&ct=clnk&gl=uk)

------
gravypod
A lot of the benefit that comes from this code storage method doesn't really
seem to require it, and it doesn't seem like the best solution.

The presenter in the video linked in this thread says that this is very
advantageous due to cross-dependencies. I don't think that this is the correct
way to handle a cross-dependency.

I'd much rather handle it by abstracting your subset of the problem into
another repository. Have some features that two applications need to share?
That means you're creating a library in my mind. This is much better suited
for something like git as you can very simply use sub-modules to your
advantage.

Hell, you can even model this monolithic system within that abstracted system.
Create one giant repository called "libraries" or "modules" that just includes
all of the other sub-modules you need to refer to. You now have access to
absolutely everything you need within the Google internal code base. You can
now also take advantage of having a master test system to put overarching
quality control on everything.

This can be done automatically: pull the git repo, update all the sub-modules,
run your test platform (sketched below).

I'd say that's a better way to handle it: creating simple front-end APIs for
all of the functionality that needs to be shared.
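
A rough sketch of that automation (assuming an umbrella "modules" repo whose
submodules pin every library, and a placeholder test command):

    import subprocess

    def run(cmd, cwd="."):
        subprocess.run(cmd, cwd=cwd, check=True)

    def refresh_and_test(meta_repo="modules"):
        # Pull the umbrella repo that pins every library as a submodule...
        run(["git", "pull", "--ff-only"], cwd=meta_repo)
        # ...move each submodule to the commit the umbrella repo records...
        run(["git", "submodule", "update", "--init", "--recursive"], cwd=meta_repo)
        # ...and run the overarching test suite against that consistent snapshot.
        run(["python", "-m", "pytest"], cwd=meta_repo)  # placeholder test runner

    if __name__ == "__main__":
        refresh_and_test()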

------
oxplot
A great talk about this:
[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

------
fizixer
Why does google have billions of lines of code?

This is more of a rhetorical question. As a tech minimalist, the preferable
answer by a long shot would be that, internally, Google is keenly aware of the
severe bloat and technical debt of their codebase and has clear plans going
forward to drastically reduce its scale, by more than 1000x at least, without
sacrificing any of the features/bug-fixes/performance of any of the code.

~~~
Mahn
Realistically, I don't think you can have a minimalist, clean and focused code
base with thousands of engineers pounding away at the keyboard. You have to
make some trade-offs with that many employees.

~~~
fizixer
That sounds like reverse justification, i.e.,

Google's operations probably require X lines of clean code added every day by
10% of the engineers (so ~5,000 instead of ~50,000 engineers), but because of
the sheer number of engineers Google has, who are supposed to do something
continuously to show performance, Google has ended up with ~10X or ~100X lines
of daily bloat accumulation.

Sounds like a textbook example of too-many-hands-spoiling-the-broth.

------
NicoJuicy
I suppose they don't do git pull :p, what's their source control management
system?

~~~
criddell
I believe it's called Piper.

~~~
nolepointer
Pied ... Piper?

~~~
bradfitz
Piper is Piper expanded recursively.

[https://99designs.com/t-shirt-design/contests/t-shirt-design...](https://99designs.com/t-shirt-design/contests/t-shirt-design-wanted-google-86935)

~~~
oaktowner
Oh, God, that is awesome.

------
justinlardinois
> The Google codebase includes approximately one billion files

> approximately two billion lines of code in nine million unique source files

So what are the other ~991 million files? I don't doubt that there's a lot of
binary files, but what else? Also what does "unique" source files mean?

~~~
corpus
It's the next sentence in the article:

> The total number of files also includes source files copied into release
> branches, files that are deleted at the latest revision, configuration files,
> documentation, and supporting data files; see the table here for a summary of
> Google's repository statistics from January 2015.

~~~
justinlardinois
> see the table here for a summary of Google's repository statistics from
> January 2015.

Is it just me or is there not actually a table there?

------
amelius
I wonder if they also distribute binaries internally. Otherwise, setting up a
developer machine could take really long. Like installing Gentoo Linux :)

~~~
zhengyi13
With CitC meaning all the source is instantly, "locally" available and up to
date, and something like bazel.io meaning you always have a reliable and fast
way to rebuild something out of cached/distributed build outputs[1], you don't
_need_ to distribute a binary in the general case.

Whoever needs a tool can cd to the source of the tool, build it right then and
there, and run it.

[1] [http://google-engtools.blogspot.com/2011/10/build-in-cloud-d...](http://google-engtools.blogspot.com/2011/10/build-in-cloud-distributing-build.html)

------
twoy
The idea of a working directory in the cloud is dangerous, because it reveals
that I haven't started anything yet, haha.

