
Why Google stores billions of lines of code in a single repository (2016) - fagnerbrack
https://dl.acm.org/citation.cfm?id=2854146
======
nickm12
I've worked with both a distributed repo model and a monorepo model and vastly
prefer the distributed approach (given the right tooling). The trade-offs are
complementary, and no doubt with proper discipline you can maximize the
benefits while minimizing the downsides. But here's what I don't like about
working in a large monorepo:

1) Difficult to track changes to the code I'm interested in. Every day there
are hundreds of changes in the repo and almost all of them have nothing to do
with what I'm working on.

2) All sorts of operations (pulling, grepping source, etc.) take longer in
order to support code I couldn't care less about.

3) Frequently have to update the world at once. Unless the repo can store
multiple versions of the same module, all the consumers have to be updated at
once, even when it's inconvenient. Sometimes migrations are better
done gradually.

4) Encourages sloppy dependency management. There are frequently unclear
boundaries between software layers.

I'm sure people will say "if you're having those problems, you're doing it
wrong" but the same thing could be said to people who find the distributed
model problematic.

~~~
geocar
Software A version 1 consumes format F1 and produces format G1 data, and
software B version 1 consumes format G1 and produces H1.

To upgrade to format G2 we _must_ change both software A and B.

First, software B version 2 must accept both G1 and G2. To do this we may need
to build software A version 2 and try them in a sandbox environment to gain
confidence that ∀F1 we produce the correct G2. If F1 is complete, we may be
able to do this exhaustively, but if F1 is sufficiently diverse, Monte Carlo
simulation might be used.
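
In code, B version 2's input side might look roughly like this (a toy sketch;
the field names and the version marker are invented for illustration):

    # Toy sketch of "software B version 2": accept both G1 and G2 during the
    # migration window. Field names and the version marker are invented.
    def parse_g(record: dict) -> dict:
        """Normalise a G1 or G2 record into the form B works with internally."""
        version = record.get("format_version", 1)  # G1 records carry no marker
        if version == 1:
            return {"id": record["id"], "payload": record["data"], "tags": []}
        if version == 2:
            return {"id": record["id"], "payload": record["payload"],
                    "tags": record.get("tags", [])}
        raise ValueError(f"unknown G format: {version}")

    def produce_h(record: dict) -> dict:
        """B's output (H1) is unchanged; only the input side needs to widen."""
        g = parse_g(record)
        return {"id": g["id"], "summary": len(g["payload"])}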

Then, if there's a 1:1 relationship between A/B we can upgrade pairs.

If there's a N:M relationship, we need to upgrade all of the instances of
software version B1 to B2 (at least within a shard). If you're running in a
non-stop environment, this might have its own challenges. Only then can we
_begin_ the upgrade from A1 to A2.

Now:

Something, somewhere needs to record what and where we are in this journey. It
is relatively straightforward to do this with a monorepo, but it is much less
clear how to do it with distributed repositories:

Almost everyone I know punts and uses some other golden record (like a
continuous integration server, or a ticketing system, or an admin/staging
system), and like it or not: that's your monorepo.

~~~
Joeri
You can also design software A to produce both G1 and G2 side by side, deploy
it, and then develop new software B against G2, submitting bug reports to
project A when there’s a problem detected in G2.

If you’re doing the multirepo strategy it’s best imho to make the projects
truly independent, as if they were developed by different companies. That way
every project only needs to think about its own dependencies and consumers,
and how to do migrations, without needing to have the big picture mapped out.

~~~
geocar
> You can also design software A to produce both G1 and G2 side by side

This can be impractical if G is a database table that is very large.

> it’s best imho to make the projects truly independent, as if they were
> developed by different companies.

One of our systems might cost £300k, so completely desynchronising them so
that code paths can build both G1 and G2 simultaneously (allowing B to develop
separately) means "simply" doubling the costs. That might put our team at a
disadvantage against someone who figures out another way.

~~~
Groxx
> This can be impractical if G is a database table that is very large.

If this is true, you _have no choice_, and must run things side by side while
you convert to G2. Or shut everything down to make the migration atomic, which
is increasingly not an option.

~~~
geocar
> you have no choice

It depends on your database.

If you imagine a simple postgres or mysql server and an "alter table G..."
then you're right.

If (however) G1⊂G2 then a document store or a column-based database can
usually partition the table somehow.
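
A rough illustration of why G1⊂G2 helps in a schemaless store (the field
names are invented and the "store" here is just a stand-in):

    # If every G1 record is a valid G2 record minus the new fields, readers can
    # upgrade lazily and writers can switch to G2 immediately -- no table-wide
    # "ALTER TABLE"-style rewrite. Field names are invented for illustration.
    G2_DEFAULTS = {"region": "unknown", "tags": []}  # fields introduced by G2

    def read_as_g2(doc: dict) -> dict:
        """Old G1 documents get the new fields filled in at read time."""
        return {**G2_DEFAULTS, **doc}

    def write_g2(store: list, doc: dict) -> None:
        """New writes always carry the full G2 shape."""
        store.append({**G2_DEFAULTS, **doc})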

------
matttproud
Large-scale refactorings are actually pleasant.

My team owns a framework and set of libraries that are widely used within the
Google monorepo. We confidently forward-update user code and prune deprecated
APIs with relative ease, with the option of doing it staged or all at once
atomically.

It's imperfect, but maintenance in distributed repositories is infinitely
worse. Still, I remember the earlier days of the monorepo and keeping Perforce
client file maps; that was a pain!
[https://www.perforce.com/perforce/r15.1/manuals/dvcs/_specif...](https://www.perforce.com/perforce/r15.1/manuals/dvcs/_specify_mappings.html)

~~~
Roritharr
How many people work on the repository tooling at Google?

I'm asking because I wouldn't know how to set up a monorepo at my 50-person
startup even if we deemed this to be necessary.

~~~
rifung
> I'm asking because I wouldn't know how to set up a monorepo at my 50-person
> startup even if we deemed this to be necessary.

Sorry if this is a really dumb question. If you only have 50 people I'm
assuming your codebase isn't that big, so why can't you just make a repo, make
a folder for each of your existing repos, and put the code for those existing
repos into the new repo?

I imagine there's a way to do it so that your history remains intact as well.

~~~
aurelianito
Yes, there is. Move the entire content of each repo to a directory and then
force-merge them all in a single repo. I did this a few years ago with 4 small
Mercurial repositories that belonged together.
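
With Git the equivalent is a handful of commands per repository; a rough
sketch below, written as a small Python wrapper around git (the repo names,
paths and branch names are placeholders):

    #!/usr/bin/env python3
    # Rough sketch of folding several Git repos into one monorepo while keeping
    # each repo's history, via the classic subtree-merge recipe. Run from inside
    # the new monorepo; names, paths and branch names are placeholders.
    import subprocess

    def git(*args):
        subprocess.run(("git",) + args, check=True)

    repos = {"api": "../api", "www": "../www"}  # subdirectory -> old repo

    for name, path in repos.items():
        git("remote", "add", name, path)
        git("fetch", name)
        # Record the merge without touching our tree, then graft the old repo's
        # files in under a subdirectory named after it.
        git("merge", "-s", "ours", "--no-commit", "--allow-unrelated-histories",
            f"{name}/master")
        git("read-tree", f"--prefix={name}/", "-u", f"{name}/master")
        git("commit", "-m", f"Import {name} into {name}/")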

------
chias
Google's mono-repo is interesting though, in that you can check out individual
directories without having to check out the entire repo. It's very different
from checking out a bajillion-line git repo.

~~~
nerfhammer
It's kind of interesting that nowadays people assume that version control
system == git.

For a huge, non-open codebase there are some pretty large downsides to a fully
distributed VCS in exchange for relatively few benefits.

~~~
malux85
Could you please elaborate? I've only used svn and git, and the largest
codebases I've worked on have only been about 150k lines of code.

What are the other ones, and what are the main differences? Really curious.

~~~
vvanders
Perforce is really common in a few domains because it handles 1TB+ repo sizes
cleanly, has simple replication, locking of binary files and a good UI client
for non-programmers.

Was pretty much used exclusively back when I was in gamedev, not sure if
that's still the case.

------
haglin
I work in a large company and I have used a central repository for six years
and a distributed one for six years. I think a central repository is better. The
benefits are:

1) Transparency. I can see what everybody else is doing and if somebody has an
interesting project I can find it quickly. You can also learn a lot from
looking at other people's changes.

2) Faster. To check out the source code for the project I now work on takes an
hour in the distributed system, while it only took 5 minutes in the
centralized system.

3) Always backed up. All code that is checked into the central repository is
backed up. It has happened twice that employees have left and code was lost
because they only checked it in locally.

Many have only used CVS or SVN, which are horrible. I'd rather use Git or
Mercurial, but Perforce is really good.

~~~
JoshTriplett
> 1) Transparency. I can see what everybody else is doing and if somebody has
> an interesting project I can find it quickly. You can also learn a lot from
> looking at other people's changes.

This doesn't require a _single_ central repository, just that all repositories
live in a common location.

> 2) Faster. To check out the source code for the project I now work on takes
> an hour in the distributed system, while it only took 5 minutes in the
> centralized system.

What distributed repository management system do you use, and what centralized
system did you use?

> 3) Always backed up. All code that is checked into the central repository is
> backed up. It has happened twice that employees have left and code was lost
> because they only checked it in locally.

As with point 1, this doesn't require a _single_ central repository, just that
all repositories live in a common location.

~~~
candiodari
I use git-svn to work against a central repository. Let me list the advantages:

1) Faster

There is no comparison. But let me count the ways

a) checking out stuff

It is faster than just downloading a directory using SVN.

b) just trying something out (ie. branch)

Creating a branch, making a few changes takes me seconds, and does not require
me to change paths like it does for the svn victims I work with. Throwing it
back out again takes seconds, and all operations are reversible for when I
fuck up (which is often).

c) merging

Git's merging. Oh my God. In half the cases I just have to check stuff over,
if that.

d) submitting

We use code review. Unlike most of the subversion folks I can easily have 5
co-dependent changes in flight (5 changes, each depending on the previous one)
without going insane, and I have gone up to 13, not counting experimental
branches. I observe around me that it takes a good developer to manage 2 with
subversion. 5 is considered insane; I bet if I showed them the 13 that were in
flight at the same time they'd have me taken away as a danger to humanity.

2) always backed up

Subversion doesn't back up until you commit and people don't commit anywhere
near quickly enough ... The way people lose code around here 99.9% of the time
is by accidentally overwriting their in-flight code contributions (the
remaining 0.1% involves laptop upgrades and overenthusiastic developers. Even
then cp -rp will just copy my environment and just work, and yet the same is
absolutely not true for the subversion guys).

Now with Git, I commit every spelling fix I make, every semicolon I have
forgotten, on occasion separately, other times with "\--amend". And only then
make my share of stupid mistakes, after committing, something that's
technically not impossible on subversion but not practical, mostly because of
code review ("just commit it" on subversion takes ~5 minutes in the very fast
case (that requires a colleague dropping everything _that very second_, AND
can't involve any actual code changes, as that trips a CI run that takes 3
minutes assuming zero contention), and 20-30 minutes is a more typical time,
measured from "hey, I'd like to commit this" to actually in the repository).
Committing on git takes me the time to type "<esc>! git commit % -m
'spellingfix'". The subversion commit time means that developers often go for
weeks without committing. Weeks, as in plural.

I get that a git commit isn't the same thing as a subversion commit. But it
does allow me to use the functionality of source control, and that's exactly
what I'm looking for in a source control system. Subversion commit doesn't
allow me to use source control without paying a large cost for it, that's what
I'm getting at.

So I have backups guarding against the 99.9% problem (and an auto-backup
script that does hourly incremental backups for the 0.1% case). The subversion
guys are probably better covered for the 0.1% problem. Good for them!

3) actual version control

Git's branches, rebase, merge, etc. mean I can actually work on different
things within short time periods in the same codebase.

The fact that other developers are using subversion means I can have my own
git hooks that I use for various automated stuff. Some fixing code layout,
some warning me about style mistakes, bugs, ... (you'd be surprised how much
your reputation benefits from these). Some updating parts of the codebase when
I modify other parts, ... you have to be careful as these are part of the
reason subversion is so slow (esp. the insistence on CI, I hear a CI run at
big G, which is required before even code review can happen, takes upwards of
an hour on many projects with some taking 8-9 hours)
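
For the curious, such a hook is nothing fancy; a pre-commit along these lines,
for example, where the single check is just a placeholder (mine do more):

    #!/usr/bin/env python3
    # Illustrative .git/hooks/pre-commit: complain about style problems in the
    # files staged for this commit. The one check here is a placeholder.
    import subprocess, sys

    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True).stdout.split()

    problems = []
    for path in staged:
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if line.rstrip("\n") != line.rstrip():
                    problems.append(f"{path}:{lineno}: trailing whitespace")

    if problems:
        print("\n".join(problems), file=sys.stderr)
        sys.exit(1)  # block the commit; bypass with --no-verify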

~~~
icebraining
The discussion wasn't really about SVN vs Git; you can have one or multiple
repositories with either system.

~~~
candiodari
You'd have the same problems with any other centralized versioning system the
way companies use it these days (i.e. with CI and code review).

~~~
joshuamorton
Not really. I work at Google. I work on a leaf, so my CI takes less than a
minute. I can also send out multiple chained changes, in a tree, to multiple
reviewers, and have them reviewed independently.

Certainly, CI takes a long time for certain changes, but those are changes
that affect everything. You'd have the same problem in a multi-repo approach
if you updated a repo that everything else depended on. At some point, you
have to run _all_ of the tests on that change.

~~~
candiodari
Cool. I've wondered about Google's CI a lot, but there are a lot of horror
stories online. Most people are complaining about it taking an hour for simple
changes (something called "tap", I wonder what that stands for).

Chained code review changes, I refuse to believe that in Google version
control (which is perforce according to Linus' git talk at Google) chained
changes are easy. Branching in perforce is literally worse than SVN, it's a
bit more like the old CVS model, and they've sort-of tried to get the SVN
copy-directory model forced into the design afterwards. Also the tool support
(merges ...) is bad compared to subversion and stone-age compared to Git's
tools.

The one reason I keep hearing for using perforce is that perforce allows the
administrator to "lock off" parts of the repository to certain users.

I've done branches and merges in Git, Subversion and CVS (and I've had someone
talk me through one in Perforce, but I don't really know). Google's
branch/merge experience is very likely to be somewhere between SVN and CVS,
and those can accurately be referred to as "disaster" and "crime against human
dignity". It's certainly not impossible, but it's very hard and you can't
expect me to believe (normal developer) people can reasonably do that in
Perforce.

Also: what would happen if you send out 20 chained commits, 10 of which are
spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot
semicolon, "]" that should have been ")", etc ...), 2 of which are small
changes to single expressions and 3 of which introduce a new function and some
tests. Perforce, like subversion and cvs doesn't have any way of tracking
stuff unless you commit it and you can almost never commit without CI and code
review, so would you track changes like that, or would you just leave them in
your client untracked until you're ready for a code review ?

~~~
joshuamorton
>Cool. I've wondered about Google's CI a lot, but there are a lot of horror
stories online. Most people are complaining about it taking an hour for simple
changes (something called "tap", I wonder what that stands for).

Well, like I said, it's possible to modify things that have a lot of
dependencies, at which point you run a lot of tests, but that would be true-ish
anyway. Consider the hypothetical situation where you're modifying the
`malloc` implementation in your `/company/core/malloc.c`. Everything depends on
this, because everything uses malloc. If you have a monorepo, you make this
change, and run (basically) every unit and integration test, and it takes a
while.

Alternatively, if `core` is its own repo, you run the core unit tests, and
then later, when you bump the version of `core` that everything else depends
on, you run those tests too. In the monorepo, if there's a rarely encountered
issue that only certain tests exercise, you notice it immediately when you run
all the monorepo tests, and you can be sure that the malloc change is the
breakage. If you don't do that, then you notice breakages when you update
`core`, or maybe you _don't_ notice it, because it's only one test failing per
package, and it could just be flakiness. So noticing it is harder, identifying
the issue once you've decided there is one is harder, and now you need to roll
back instead of just not releasing.

>Chained code review changes, I refuse to believe that in Google version
control (which is perforce according to Linus' git talk at Google) chained
changes are easy. Branching in perforce is literally worse than SVN, it's a
bit more like the old CVS model, and they've sort-of tried to get the SVN
copy-directory model forced into the design afterwards. Also the tool support
(merges ...) is bad compared to subversion and stone-age compared to Git's
tools.

Google no longer uses Perforce; we use Piper (note that this is a
Google-developed tool called Piper, not the Perforce frontend called Piper --
yes, this is confusing; afaik Google's Piper came first). Piper is inspired by
Perforce, but is not at all the same thing. (See CitC in the article.) The
exact workflow I use isn't public (yet), but suffice it to say that while Piper
is Perforce-inspired, Perforce is not the only interface to Piper. This
article even mentions a Git-style frontend for Piper.

>Google's branch/merge experience is very likely to be somewhere between SVN
and CVS, and those can accurately be referred to as "disaster" and "crime
against human dignity". It's certainly not impossible, but it's very hard and
you can't expect me to believe (normal developer) people can reasonably do
that in Perforce.

Suffice to say you're totally mistaken here.

>Also: what would happen if you send out 20 chained commits, 10 of which are
spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot
semicolon, "]" that should have been ")", etc ...), 2 of which are small
changes to single expressions and 3 of which introduce a new function and some
tests. Perforce, like subversion and cvs doesn't have any way of tracking
stuff unless you commit it and you can almost never commit without CI and code
review, so would you track changes like that, or would you just leave them in
your client untracked until you're ready for a code review ?

So, Piper doesn't have a concept of "untracked". Well it does, in the sense
that you have to stage files to a given change, but CitC snapshots every
change in a workspace. Essentially, since CitC provides a FUSE filesystem,
every write is tracked independently as a delta, and it's possible to return
to _any_ previous snapshot at any time. One way to think of this concept is
that every "CL" is vaguely analogous to a squashed pull request, and every
_save_ is vaguely analogous to an anonymous commit.

This means that in extreme cases, you can do something like "oh man I was
working on a feature 2 months ago, but stopped working on it and didn't really
need it, but now I do", and instead of starting from scratch, you can, with a
few incantations, jump to your now-deleted client and recover files at a
specific timestamp (for example: you could jump to the time that you ran a
successful build or test).

>Also: what would happen if you send out 20 chained commits, 10 of which are
spelling corrections, 5 of which are trivial, compile-fixing bugs (forgot
semicolon, "]" that should have been ")", etc ...), 2 of which are small
changes to single expressions and 3 of which introduce a new function and some
tests.

I'd logically group them so that each resulting commit-set was a successfully
building, and isolated, feature. Then, each of those would become its own CL
and be sent for independent review.

------
cletus
So as someone who previously worked for Google and now works for Facebook,
it's interesting to see the differences.

When people talk about Google's monolithic repo they're talking about Google3.
This excludes ChromeOS, Chrome and Android, which are all Git repos that have
their own toolchains. Google3 here consists of several parts:

\- The source code itself, which is essentially Perforce. This includes code
in C++, Java, Python, Javascript, Objective-C, Go and a handful of other minor
languages.

\- SrcFS. This allows you to check out only part of the repo and depend on the
rest via read-only links to what you need.

\- Blaze. Much like Bazel. This is the system that defines how to build
various artifacts. All dependencies are explicit, meaning you can create a
true dependency graph for any piece of code. This is super-important because
of...

\- Forge. Caching of built artifacts. The hit-rate on this is very good and it
consumes a huge amount of resources given the number of artifacts produced.
Forge turns build times for some binaries from hours (even days) into minutes
or even seconds.

\- ObjFS. SrcFS is for source files. ObjFS is for built artifacts.

This all leads to what is usually a pretty good workflow: you check out
directories if you want to modify them and just use the read-only version if
you don't. You can still step through the read-only code with a debugger,
however.
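
To make the explicit-dependencies point about Blaze concrete, this is roughly
what targets look like in Bazel, its open-source cousin (BUILD files are
written in Starlark, a Python dialect; the paths and names below are
invented):

    # Hypothetical //base/strings/BUILD (Bazel/Starlark syntax)
    cc_library(
        name = "strings",
        srcs = ["strings.cc"],
        hdrs = ["strings.h"],
        visibility = ["//visibility:public"],  # who may depend on this target
    )

    # Hypothetical //frontend/server/BUILD: the dependency is declared
    # explicitly, so the build system knows exactly which targets and tests a
    # change to //base/strings can affect, and the cache covers the rest.
    cc_binary(
        name = "server",
        srcs = ["main.cc"],
        deps = ["//base/strings"],
    )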

Now Facebook I have less experience with (<6 months) but broadly there are
four repos: www, fbobjc, fbandroid and fbcode (C++, Java, Thrift services,
etc). At one point these were Git but for various reasons ended up being
migrated to Mercurial some years ago.

The FB case (IMHO) highlights just how useful it can be to have one repo.
Google uses protobufs for platform independence. FB uses GraphQL at a client
level and Thrift at the service level.

So one pain point is that, for example, you can modify a GraphQL endpoint in
one repo but it's used by clients in others (i.e. mobile clients). There are lots
of warnings about making backward-incompatible changes, some of them
excessively pessimistic because deterministically showing something will break
some mobile build in another repo is hard.

Google3 has fewer of these problems because the code is in the same repo. On
top of that, Google has spent a vast amount of effort making it so the same
build and caching systems can handle C++ server code as well as Objective-C
iOS app code. If you're working on Google3 you basically compile very little
to nothing locally.

Engineers on Android, Chrome and ChromeOS however compile a lot of things
locally and thus get far beefier workstations.

At FB the mobile build system doesn't seem to be as advanced in that there is
a far higher proportion of local building.

IIRC the Git people seemed to reject the idea of large code bases. Or, rather,
their solution was to use Git submodules. There were (and maybe are?) parts of
the Git codebase that didn't scale because they were O(n). Apologies if I'm
misspeaking here but I peripherally followed these discussions on HN and
elsewhere years ago as someone from the outside looking in so I'm no authority
on this.

The problem of course is that Git submodules don't give you the benefits of a
single repo and I've honestly not heard anyone say anything good about Git
submodules.

Just to stress, the above is just my personal experience and I hope it's taken
as intended: general observations rather than complaints and definitely not
arguing that one is objectively better than the other. There are simply
tradeoffs.

Also, there are definite issues with Google3, like the dependency graph
getting so large that even reading it in and figuring out what to build is a
significant performance cost and optimization issue.

~~~
epage
I have two main concerns when I see monorepos being used.

First, like in other areas, I see companies that want to "google scale" and
blindly copy the idea of monorepos but without the requisite tooling teams or
cloud computing background / infrastructure that makes this possible.

Second, I worry about the coupling between unrelated products. While I admit
part of this probably comes from my more libertarian world view but I have
seen something as basic as a server upgrade schedule that is tailored for one
product severely hurt the development of another product, to the point of
almost halting development for months. I can't imagine needing a new feature
or a bug fix from a dependency but being stuck because the whole company isn't
ready to upgrade.

I've read of at least one less serious case of this from Google, with JUnit:

> In 2007, Google tried to upgrade their JUnit from 3.8.x to 4.x and struggled
> as there was a subtle backward incompatibility in a small percentage of
> their usages of it. The change-set became very large, and struggled to keep
> up with the rate developers were adding tests.

[https://trunkbaseddevelopment.com/monorepos/#third-party-dep...](https://trunkbaseddevelopment.com/monorepos/#third-party-dependencies-1)

~~~
bunderbunder
> I worry about the coupling between unrelated products.

I even worry about coupling among related products.

I could see monorepos working out well for a company that just does SaaS, and
is able to get away with nice things like maintaining a single running version
of the app, and continuous delivery.

Having mostly worked in companies that do shrinkwrap software or that allow
different teams or clients to manage their own upgrade schedule, though,
monorepo seems to me like a recipe for a codebase that is horribly resistant
to change. Not just in the "big bang upgrades like JUnit4 are awful" ways
described above, but also in a, "We never clean up old stuff, because most of
the time when we try it breaks a bunch of other teams' code and we just nope
out of that whole hassle, so barely-supported code sort of collects
continuously, like dead underbrush in a forest that's never allowed to burn,
until eventually it all explodes in a horrible conflagration," sort of way.

Seeing the list of things that Google keeps in a monorepo, vs things that
Google keeps in Git repos, it seems like they might be thinking similarly.
They've really only got a precious few products that typically run on non-
Google-owned hardware, and apparently the major ones live outside the
monorepo.

~~~
smallnamespace
The dynamics actually played out very differently at Google. Because it was a
monorepo with automated testing, if you didn't want other teams to break you
when they change the dependencies, then you had better have a robust test
suite.

Breaking changes would then lead to a discussion with your team, rather than
your fruitlessly trying to binary search to find the commit that broke you.

Over time, the culture at Google became that all teams need to write tests at
the unit, functional, and (usually) integration level.

------
sytse
I think that what is missing when using multiple git repos is the ability to
make a code change that spans multiple projects. We're open to adding that to
GitLab.

~~~
jacobr
This will help if you _don't_ want to run a monorepo, but many in the
industry consider a monorepo suitable for their organisation. The biggest
blocker with GitLab in my opinion is not being able to only run a job if some
specific folder was modified
([https://gitlab.com/gitlab-org/gitlab-ce/issues/19232](https://gitlab.com/gitlab-org/gitlab-ce/issues/19232)).

~~~
sytse
I agree that functionality would be great to have. We're planning it in the
next 3 months, but we're also very open to someone contributing it earlier.

------
siliconc0w
I think the monorepo is better but it's really hard to pull off without
Google-like tooling and engineering practices. Git + meticulous dependency
tracking and strict versioning conventions is probably the better move for
most companies.

~~~
jorblumesea
+1, I really dislike the "Let's do it because Google does it" mentality in
software engineering. Google operates at a scale and encounters problems that
the average company would probably never encounter.

------
borplk
How would they do CI/CD if it's all in one big repo?

I suppose you could do it if you had a very strict rule where absolutely
everything that could affect a "unit" was inside its own directory (and
nothing higher up than that "project root").

So you could check which sub-directory is affected within a commit and so on.
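
Something like this, I imagine (rough sketch; it assumes each top-level
directory is one deployable unit, which is made up):

    #!/usr/bin/env python3
    # Rough sketch: list which top-level "units" a commit touches, so CI can run
    # only their pipelines. Assumes one deployable unit per top-level directory.
    import subprocess, sys

    commit = sys.argv[1] if len(sys.argv) > 1 else "HEAD"
    changed = subprocess.run(
        ["git", "diff-tree", "--no-commit-id", "--name-only", "-r", commit],
        capture_output=True, text=True, check=True).stdout.split()

    affected = sorted({path.split("/", 1)[0] for path in changed if "/" in path})
    print("run CI for:", ", ".join(affected) or "(nothing)")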

~~~
tehlike
Oh it works really well. One benefit is that all tests affected by your target
are run at presubmit. Another benefit is that everything is at head, so things
like library bugs and security bugs are all handled naturally as part of the
new releases. This usually happens twice a week for most server binaries.

~~~
Xorlev
So long as the presubmits are configured correctly, and nobody ignores
failures and force submits. :)

It's usually quite nice though, and changes that break lots of projects are
rolled back quite fast. Usually.

~~~
tehlike
As much as people hate them, force submits are a fact of life.

------
pipu
If someone's interested, I made a summary of the paper a year back:

[https://www.extreg.com/blog/2017/02/googles-ultra-large-scal...](https://www.extreg.com/blog/2017/02/googles-ultra-large-scale-monolithic-source-code-repository/)

~~~
JediSWEng
It's a good write up.

To me, Piper is a monolithic version control system which is geared towards
good engineering practices.

As far as I know there are only two such systems in use today and the other
one is very dated and older than a lot of things out there.

When people say they have worked in a monolithic repo, they typically mean a
single repo under one of the open-source version control systems, but none of
these actually do or support what is needed when working with a monolithic
repo AND modern/good engineering practices.

For that a specialised VCS is required, and there are very few examples of
that, none of which are open source.

Git could probably be made to do this kind of stuff, but it would require some
extensions to the DAG as well as extending its already verbose command-line
set. But I think it is doable.

The question is who can do it? Most are probably under some strict NDAs.

------
ridiculous_fish
What fraction of Google's code is in the monorepo these days? Android, Chrome,
and ChromeOS are not, and those are certainly large projects.

~~~
gsnedders
I believe anything not open source, or reasonably expected to become open
source.

~~~
tehlike
Plenty of open-source projects are in google3 (and are frequently merged out
to the external versions).

~~~
jingwen
Yep. Projects use Copybara to import/export OSS code.
[https://github.com/google/copybara](https://github.com/google/copybara)

------
luckydude
I scanned this, the usual mono/multi arguments. All made moot by BitKeeper's
nested approach:

[http://mcvoy.com/lm/bkdocs/nested.html](http://mcvoy.com/lm/bkdocs/nested.html)

in an open source system [http://bitkeeper.org](http://bitkeeper.org)

What nested brings to the table is the semantics of a mono repo with the
advantages of a multi repo. The whole thing walks in lockstep, if you have a 3
week old version of the kernel and you add in the testing component (subrepo)
then you get the 3 week old version of the testing component, all lined up
with the same heads in the same tip commit.

I get it that git won but at least steal the ideas.

Edit: BTW, bk has a bk fast-import that usually works (it doesn't like octopus
merges but other than that....)

------
wiradikusuma
The lazy me thinks maybe 1 kitchen sink repo is better.

Here's my problem:

I can be working on multiple projects at the same time. Each project has
multiple modules (core, api, www, admin, android, etc -- I use microservices
on Google App Engine). Sometimes, some modules have "feature" branches. Oh,
did I tell you I work on both Desktop and Laptop?

The problem is syncing. Before traveling, I need to make sure the
projects/modules I'll be working on while on the go all have the latest
commits from the desktop.

My question:

Is there some "dashboard/overview" for all Git projects? So I can quickly
tell, "Ok, all projects are at latest commit, and oh I'm working on feature
branch for project X and Y."

~~~
erikb
There is a dashboard called github? Or what are you looking for?

I think if you want to make sure that all your changes are in, you need to do
what most programmers do and learn to finish a programming session with a
commit and push, just like you finish a sentence with a period. Once you are
used to it, the chance of forgetting is really low.

~~~
wiradikusuma
GitHub.com doesn't tell you if you have uncommitted changes, or even unpushed
commits.

(The desktop app does, but you need to check each project one by one.)
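
What I want is basically something like this (a quick sketch; the projects
directory layout is hypothetical):

    #!/usr/bin/env python3
    # Quick sketch of such an overview: walk a directory of Git checkouts and
    # report branch, dirty and unpushed state for each. Paths are hypothetical.
    import pathlib, subprocess

    def git(repo, *args):
        return subprocess.run(["git", "-C", str(repo), *args],
                              capture_output=True, text=True).stdout.strip()

    root = pathlib.Path.home() / "projects"
    for repo in sorted(p.parent for p in root.glob("*/.git")):
        branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")
        dirty = "dirty" if git(repo, "status", "--porcelain") else "clean"
        ahead = "unpushed" if git(repo, "log", "--oneline", "@{u}..") else "synced"
        print(f"{repo.name:20} {branch:15} {dirty:6} {ahead}")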

------
czardoz
Does anyone have "aha" moments to share about working in a single repository?

Stuff that just made you say, "Great, I wouldn't have been able to do that if
it was in separate repositories!".

~~~
BrandonY
Once, I noticed that a minor piece of how Google's Python system tests worked
could be very slightly cleaner and more consistent. It would be a tiny change, it
seemed very safe, and it was easily accomplishable with a short sed script,
but it'd also be a backward-incompatible change across the projects of
hundreds of teams and thousands of build targets. I was able to make that
change with only a few commits and without needing to bother most of those
teams.

These sorts of small, general, large scale cleanup commits are quite common at
Google, and they're encouraged. They help keep the codebase healthy. There are
special groups that review them so that all of the individual teams affected
don't have to bother, and there are tools to manage the additional testing and
approval requirements for such a change.

At my previous company, making such a change would have been a major
undertaking. I never would have considered a refactor of that scale without a
critical need. They had thousands of packages, each of which had its own
repository and an incredibly complex web of build and runtime dependencies. It
was a nightmare, and fiddling to find a working set of versions of internal
dependencies took up way, way too much of my time each day.

------
soroso
I wonder what the bots commit in the monorepo. It's something that has been
making me curious for a while.

~~~
greydata
Nothing too interesting. They're restricted so they can only update their own
source code.

~~~
Xorlev
Except for that _one_ time. ZRH SRE has never been the same.

------
pedro1976
The clear advantage of a mono repo over mono-purpose repos is around technical
debt: the goal is that you never build up technical debt, because you
immediately patch all references in an atomic change.

Example: let's say you introduce a breaking change in lib A that is used in
libs B and C. The first problem is visibility: A's owners do not necessarily
see that it is used in B and C. Second, the build should break immediately,
not only once someone later builds B or C.
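
A toy illustration of the second point (all names invented): with one repo and
one shared presubmit, the consumer's test fails before the breaking change can
land, instead of at some later build.

    # Toy illustration: lib A changes its signature; a test belonging to
    # consumer B, living in the same repo and the same presubmit run, fails
    # before the change can land. All names are invented.

    # -- lib A, after the breaking change (a new required parameter) --
    def render(text: str, width: int) -> str:
        return text[:width]

    # -- consumer B's test, not yet updated for the new signature --
    def test_render():
        assert render("hello") == "hello"  # TypeError: missing 'width'

    if __name__ == "__main__":
        try:
            test_render()
            print("presubmit: OK")
        except TypeError as err:
            print("presubmit failure in consumer B, caught before submit:", err)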

~~~
afro88
Tech debt isn't just breaking changes though, and a mono repo does nothing to
curb all the other types of tech debt:

* Accumulation of FIXMEs

* Partial refactors cut short after a change in business reqs

* Quick hacks near release time

* "This could be done better if I had time"

* Overdue re-architecture after accumulated changes and additions

* Orphaned code

* Commented-out tests

* etc etc

------
bookwormAT
Can someone further explain how the pre commit phase works? I don't get
how/why "pre commits" work without feature branches?

How are my changes shared with the reviewer, if I there is no feature branch?
Is my local code uploaded to that review tool mentioned in the article? And
then what happens if the reviewer requests changes?

I probably did get this completely wrong, so thanks in advance for pushing me
in the right direction.

~~~
tdeck
Perforce (the origin of Piper) has a concept of changelists. Some changelists
are submitted (committed) while others are not. So the review works by
uploading your changelist to Perforce, then pointing people to that
changelist. It's like an unnamed feature branch that can't easily be rebased
off of. Changelists do have a base, and you do a "g4 sync" to essentially
rebase off of master. Does that make sense?

~~~
mjw1007
When people talk about making a single global change to update all clients
when they change an interface, are they updating unsubmitted changelists too?

~~~
jsolson
No.

Unsubmitted changes at Google usually come in one of two flavors, short-lived
(abandoned or submitted within a few days) or perpetual. The latter flavor is
often for "I think we might want this". It's not uncommon for those to be
completely rewritten if they're actually needed. There's usually a preference
for submitting useful things (with tests!) and flag gating them to cut down on
bitrot.

I have seen exceptions -- I reported a bug in a fiddly bit of epoll-related
code and an engineer on my team had a multi-year-old fix -- he hadn't
submitted it because he wasn't confident he'd found an actual bug. The final
changelist number was more than double the original CL number (unsubmitted
changes get re-numbered to fit in sequence when they're submitted -- the
original number redirects to the final submitted version in our tooling).

------
shakencrew
previous discussion:
[https://news.ycombinator.com/item?id=11991479](https://news.ycombinator.com/item?id=11991479)

------
huherto
The model that I've liked the most...

\- Monorepo for all the services code in the enterprise. (Java)

\- UI code is kept in their own repositories.

\- Enterprise services are exposed using well-defined, stable REST-like APIs
(JSON, HTTP, Swagger, etc.). We only expose what is needed.

\- Within the services monorepo, services can call other services directly
using regular java calls.

\- Services in the monorepo are refactored all the time. This is the advantage
of using a strongly typed language like Java.

\- Several instances of the services run at a time; they scale horizontally.
Releases are done one instance at a time.

\- We had good unit and integration test suites on the services.

Now I am working for a different company. We have hundreds of services
deployed. No one knows what is running where. Or what the dependencies are.
Once something is released, everybody is afraid to make a change, as no one
knows how it affects other systems.

------
ssamuli
So, monorepo seems to be great, but why don’t we have open source tools to
deal with that?

------
oblio
How do they handle Android? It's one thing when you can just go "use the
latest" for web service and another when you have N branches. I'm not sure a
mono repo is as big of a benefit then.

~~~
GoatOfAplomb
Android Open Source Project is still in gerrit, where the code is stored as a
set of many git repositories. There's a "repo" tool that adds a helpful layer
of abstraction to make those git repositories (mostly) look like a single
repository.

------
randomsearch
Doesn’t this create a single point of failure for the development of the
entire software ecosystem at google? What happens if something goes
disastrously wrong with their repo? Does everyone twiddle their thumbs until
it gets fixed?

Also, isn’t this the opposite of “separation of concerns”... code should be
divided into small units of functionality that don’t overlap. Minimise
interdependencies. Eliminate unnecessary work. Is this the most efficient and
future-proof approach?

~~~
boulos
Yes, when the system is down, it's really bad. Though it's an extremely
reliable system as so many folks depend on it (note there's also lots of
layers of tooling, so maybe you just need commits to work versus code review
versus your local edits, etc.). In my experience, the worst things are
actually when the build/test system is running slowly (thus blocking commits
until tests are complete) or the bog standard "ugh, someone decided to force
submit even though it said the tests were failing. Roll that back, please".

As for separation of concerns, there's actually a lot of scoping / visibility
stuff precisely to avoid letting people depend on things they shouldn't. I
think you now have to explicitly open up your visibility to let random other
projects depend on you. A monorepo doesn't _require_ that you share, it just
permits it.

~~~
randomsearch
Interesting. Every system is going to be down at times, so I don't worry about
that too much. I'd be more worried about a design flaw or bug that doesn't get
exposed until you're painted into a corner and it's hard to escape.

Good point about the scoping. I suppose it really depends on how developers
use it - do they default to sharing or not?

------
nsm
There is a great talk by their C++ dev team about how a
build-everything-from-source monorepo really helps with C++ change enablement:

[https://m.youtube.com/watch?v=tISy7EJQPzI&t=2589s](https://m.youtube.com/watch?v=tISy7EJQPzI&t=2589s)

Personally, I’d have to agree that at a minimum a mono repo per distinct
product role (server side in one mono repo, desktop software in one mono repo)
is immensely valuable.

------
mc32
I heard updating pager duty [on-call] commits used to be a mess --perhaps it
was apocryphal, but if not, why not use a different system for the pager duty
updates?

~~~
foota
Using the existing thing was probably easier than anything else.

------
tomerbd
"The two are not mutually exclusive. I utilize what I call ensemble
repositories that, for us, submodule the individual repositories" \-
[http://disq.us/p/1hj9nmu](http://disq.us/p/1hj9nmu)

------
symboltoproc
I'm a bit disappointed that they don't back their publication with any numbers
that compare the two approaches. I don't see the scientific value of what
would also be a good click-bait blog post.

------
fenollp
Here is yet another tool for managing multiple repos as one:
[https://supertanker.github.io/tsrc/](https://supertanker.github.io/tsrc/)

------
TobiasA
I'm working on a project where we have two projects: a desktop app and a web
backend. They each live in their own repo. However, this approach is proving
tedious for us now, as every new feature often means two branches (one in each
repo) which also means two separate pull requests when the feature is ready to
be merged. Has anyone encountered this kind of issue, and is the solution to
simply merge the two repos into one monorepo?

------
crucifiction
The real reason is why any large enterprise does something unusual: “it has
always been like that”.

------
known
So that developers can REUSE code.

------
guftagu
I develop an application that has a SPA frontend and an API backend. They can
live inside separate repos, but I prefer to keep them together because if I
change the API signature, I'll also make the same changes in the frontend and
they will deploy together at once.

------
ericfrederich
Good Google Talk video about it too.

~~~
adyavanapalli
Can you post the link? Thanks!

~~~
bch
Could be this one:
[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

------
ashwins227
Can anyone quickly summarize the article in a few pointers? I'm looking for
benefits/drawbacks and trade-offs.

~~~
cbcoutinho
Just open up the PDF and look at the front page.

> Key insights

> Google has shown the monolithic model of source code management can scale to
> a repository of one billion files, 35 million commits, and tens of thousands
> of developers.

> Benefits include unified versioning, extensive code sharing, simplified
> dependency management, atomic changes, large-scale refactoring,
> collaboration across teams, flexible code ownership, and code visibility.

> Drawbacks include having to create and scale tools for development and
> execution and maintain code health, as well as potential for codebase
> complexity (such as unnecessary dependencies)

------
diebir
At this point I normally assume that if Google does something a certain way,
or uses a particular proprietary technology, then it likely should NOT be
used.

I work for a place that tried to model itself culturally after Google and
Facebook and has had a lot of engineers moving back and forth. If Google is
anything like us, then it creates wrong incentives to invent in-house stuff.
See, in their expense scheme the salaries of engineers are not a huge deal.
It's cheap to have people do things. There is also an incentive for an
engineer to try and "leave a mark". There are also a bias to hiring new and
unexperienced people, who fail to learn the existing tech and replace it with
something different, simpler and less functional (by the time it matures, it
becomes just as complex, however).

I live in Java world, so I am seeing Google reinvent (poorly) every bit of
Java tech: DI (Dagger), build tools (Gradle), commons libraries (Guava) and
the list goes on and on. Well, apparently, they also reinvented internally
Git, Jenkins and the rest of the tools. The rest of us should probably NOT do
it this way.

~~~
cromwellian
Gradle was not a Google invention. Dagger was invented by Square. Google
invented Guice (Bob Lee now at Square) which was way better than Spring and
J2EE. Guava is far better designed than Apache Commons.

Dagger2 was invented by Google for good reason because it works purely as an
annotation processor without runtime bytecode classloader magic. That means it
is easier to debug, faster to startup on mobile, and can be crosscompiled with
j2objc and GWT/j2cl.

Much of the time, in-house stuff is created to deal with scalability or
maintainability problems.

------
hk__2
That’s a 2016 article and it’s called "Why Google stores billions of lines of
code in a single repository".

~~~
sctb
Thanks, updated.

------
throw7
I'd like to see the world checked into an rcs.

------
bjork
I guess you can’t blame the weaknesses of Google’s products (e.g. searching in
gmail) on the repository system.

------
2_listerine_pls
Facebook doesn't, y tho?

~~~
dmnd
Doesn't it?
[https://code.facebook.com/posts/218678814984400/scaling-merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/)

------
srtjstjsj
"Why"? Because Google has billions of lines of code.

Does any company not store all their related code in a single repository?

~~~
eadmund
We don't, and it's hell. I may make a change to X, but some other developer
doesn't know it, and then one day when he makes a change to X and tries to
pull those changes into Y, now he has to change the bits of Y which are
affected by my changes as well as his.

I would _love_ to have a monorepo, although I'd advocate for branch-based
rather than trunk-based development. Yes, merges can be their own kind of
hell, but they impose that cost on the one making breaking changes, rather
than everyone else.

~~~
scarmig
That just means that no one makes big breaking changes even if they might be
necessary.

------
ngold
Because of course they do. Just like we all do. Everyone has a file they
backup just because they might need to reference it. Index.txt is my cross.

In terms of sheer code. I hope they have some Jedi ai to cross reference it.
If they don't they will. In fact we help everyday.

No point in burning the printing presses. We have to learn to get along.

Enjoy living, learning will come naturally as you pursue your interests. Those
interests will change, growing, cultivating other fields of knowledge,
grappling for your attention.

Culture you enjoy as only a human can, and cultivate it. Distribute and share
it. As humans we are denying our own renaissance with a ministry of silly
walks. Sometimes I think high school UN clubs are better at running the world.
Why not. It doesn't seem to matter anymore.

