
Why Google Stores Billions of Lines of Code in a Single Repository (2016) - tosh
https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
======
cmrdporcupine
Seems like this gets re-posted every few months.

We don't actually have a single repository, although Google3/Piper really is
huge and dominant. Android, Chrome, Fuchsia, and myriad other projects
don't exist in the monorepo (well, they are mirrored there) because the
tooling and libraries there are really geared towards server deployment, not
devices.

~~~
melling
tosh and others do a lot of reposts, Wikipedia links, etc

karma farming?

~~~
Tomte
Why does it matter to you? Does this submission hurt you?

~~~
melling
I’m simply pointing out why we get lots of repeat submissions.

Some stories are great and worth seeing every few years.

Your anger gave away the fact that you do the same thing:

[https://news.ycombinator.com/submitted?id=Tomte](https://news.ycombinator.com/submitted?id=Tomte)

The one downside is that it creates a lot of noise on HN. It forces new
stories off the first page quicker.

Also, stories about Jerry Lee Lewis’s marriage to a 13-year-old aren’t HN
material, for example.

~~~
Tomte
I'm not angry, actually. I just wonder why you're so invested in what I'm
doing. I'm actually proud of it. More people should un-earth cool stories and
submit them.

About Jerry Lee Lewis: I disagree. The sheer fact that moral compasses were so
wildly divergent between America and England, of all pairs of countries, so
recently, is quite on-topic, in my opinion.

Also, I think it's great that you found only one story you disagree with in
the thirty stories my submission page shows. Clearly you approve of my
selection.

> I’m simply pointing out why we get lots of repeat submissions.

That's explicitly okay, as the moderators point out all the time.

~~~
melling
You joined the conversation. I mentioned another user. I quickly looked at
your account once you became indignant. I scanned your last submissions page
and saw you’re karma farming too and picked out the most egregious submission.

I didn’t say it wasn’t ok. I’m sure I do it occasionally.

I’m waiting for the first person to write a karma farming AI. Take great
stories, determine a decay time, then repost.

“HN Karma Farming with Deep Learning (in Rust)”

~~~
hombre_fatal
You’re forgetting that people actually have to upvote submissions. Presumably
not everyone sees every submission on HN every time, so they are candidates to
upvote it.

Sure, write that karma farming bot. Seems like a good service to those of us
who want to see good content we may have missed, generate new discussions on
classic content.

Complaining about “karma farming” has always seemed like a weird jealousy
thing to me. Where you presumably care so much about karma that you’re annoyed
over your arbitrary concept of karma-fairness. Who cares? If someone can get
karma by reposting classic content for HNers who missed it the other times,
more power to them. They reposted the content to an audience that wanted to
see it. You can’t always take it personally or lash out at others just because
you’re not part of that audience.

~~~
melling
I’m not complaining. I’m providing deeper insight into why people are seeing
repeated posts.

I did point out a downside, but have at it.

------
w0mbat
In the 90s, a startup called WebTV managed to negotiate and buy a perpetual
SOURCE CODE license for Perforce (P4). WebTV built their own good version
control system called "SourceDepot" on top of P4.

When Microsoft bought WebTV they got this license, which enabled them to
cheaply roll out P4/SourceDepot to massive teams for whom it was a big
improvement over what they had been using. With access to the source, they
could modify it to their needs and scale away.

Google tried to buy a source license early on, but Perforce had learned their
lesson from what happened with WebTV and wouldn't sell them one. That's how
Google ended up doing elaborate projects for years to prop up and eventually
replace P4.

Source : I worked at WebTV, Microsoft and Google.

------
tosh
> Most notably, the model allows Google to avoid the "diamond dependency"
> problem (see Figure 8) that occurs when A depends on B and C, both B and C
> depend on D, but B requires version D.1 and C requires version D.2. In most
> cases it is now impossible to build A. For the base library D, it can become
> very difficult to release a new version without causing breakage, since all
> its callers must be updated at the same time. Updating is difficult when the
> library callers are hosted in different repositories.

~~~
JMTQp8lwXL
This seems to me to be a package manager implementation detail. The package
manager I am most familiar with, npm, won't have an issue with the diamond
problem, since node modules can install nested modules, allowing B and C to
both depend on unique versions of D.
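
The nested-install idea can be sketched in a few lines of Python. The package
names, versions, and both resolver helpers below are hypothetical stand-ins,
not npm's actual algorithm:

```python
# The diamond from the article: A depends on B and C, B needs D@1, C needs D@2.
requirements = {
    "B": {"D": "1"},
    "C": {"D": "2"},
}

def flat_resolve(requirements):
    """One shared copy per package: B and C must agree on D's version,
    so this diamond fails with a conflict."""
    chosen = {}
    for pkg, deps in requirements.items():
        for dep, version in deps.items():
            if chosen.setdefault(dep, version) != version:
                raise RuntimeError(
                    f"diamond conflict on {dep}: {chosen[dep]} vs {version}")
    return chosen

def nested_resolve(requirements):
    """npm-style nesting: each dependent gets its own private copy of D,
    so both versions coexist and the diamond resolves."""
    return {pkg: dict(deps) for pkg, deps in requirements.items()}
```

The trade-off, as the sibling comment notes, is that "both versions coexist"
is exactly what you don't want when one of them has a security hole.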

~~~
blablabla123
In times of security fixes there is sometimes only one secure version, but the
whole ecosystem is built around the fact that older, flawed versions can still
be included. Another drawback is performance: npm install is amazingly slow.
In fact, package resolution is NP-complete:
[https://research.swtch.com/version-sat](https://research.swtch.com/version-sat)

I generally agree but the problem is even worse :)

~~~
JMTQp8lwXL
Yarn is a bit more sophisticated and has a concept of "resolutions" which
would be great for the security use case of forcing a particular version. Even
if B and C need different major versions, say D@v1-with-security-fix and
D@v2-with-security-fix, yarn supports that too, since resolutions can be
nested.

[https://yarnpkg.com/lang/en/docs/selective-version-resolutions/](https://yarnpkg.com/lang/en/docs/selective-version-resolutions/)
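
For reference, a selective resolution in `package.json` looks roughly like
this (the `B`/`C`/`D` package names and the pinned versions are placeholders):

```json
{
  "resolutions": {
    "B/**/D": "1.0.1",
    "C/**/D": "2.0.3"
  }
}
```

The glob path on the left scopes the override, which is what lets B and C each
get a different forced version of D.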

------
twblalock
Don't try a monorepo unless you have a team of engineers who can work full
time on maintaining the build tooling, enforcing good practices, and training
the rest of the engineers on how to behave in a monorepo.

~~~
vanniv
Indeed, maintaining the greenness of the build in a giant monorepo is
extremely expensive.

It isn't clear to me that it is worth the investment, even for Google --
though it clearly works "well enough"

~~~
ori_b
So is maintaining the greenness of a fragmented repo. The only difference is
when you find out about the breakages.

~~~
cipherzero
Yes. Thank you! In my experience multi-repo doesn’t make problems go away,
it just defers them and introduces others. I’m not against multi-repo or a
strong proponent of monorepos... each has its strengths and drawbacks, and
there is a time and place for either. I struggle with folks who sit firmly
in one camp or the other.

~~~
vanniv
Agreed!

Multi-repo makes integrations harder, because you find out about breakages
later -- which means that it might be harder to identify the cause and more
likely that someone depends on the behavior that broke you by the time you
notice it.

But, multi-repo makes local development faster and cheaper -- you're insulated
from the churn of everybody else's check-ins. You don't have to constantly
refactor everything every time some dependency makes a minor tweak.

You get more done and have a smoother development lifecycle -- but you're
going to keep falling behind your dependencies unless you invest in keeping up
-- and that part of the process is more unfun the less often you do it.

I currently live in a monorepo. I don't love it. But I've also not loved the
multi-repos I've lived in, so... _shrug_

------
jsnell
Previous discussion:

[https://news.ycombinator.com/item?id=11991479](https://news.ycombinator.com/item?id=11991479)
[https://news.ycombinator.com/item?id=15889148](https://news.ycombinator.com/item?id=15889148)
[https://news.ycombinator.com/item?id=17605371](https://news.ycombinator.com/item?id=17605371)

~~~
svat
Also:
[https://news.ycombinator.com/item?id=19303082](https://news.ycombinator.com/item?id=19303082)
\-- a survey of Google engineers on what they like and dislike about this.

------
jayd16
Monorepos are cool but I hate perforce with a passion. How can we use git?

A lot of work is going into making git scale better which is nice but there
will always be small open source repos. You could mirror them but it seems
like it would be nice to not have to do that.

Is there any research going into figuring out how to keep many repos in sync
with a single commit? Maybe you could use submodules across your
company...have a single master repo that essentially just keeps track of
working version sets. That might work but it sounds torturous. Is anything on
the horizon to solve this problem with git?
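
One low-tech version of that "master repo of version sets" already exists in
git itself: a superproject whose submodules pin each component repo to an
exact commit, so a single commit in the superproject records one consistent
cross-repo snapshot. A rough sketch (repo names and paths are made up):

```shell
# Sketch: a git "superproject" as the master repo of working version sets.
cd "$(mktemp -d)"

# A component repo standing in for one of many per-team repositories.
git init -q libfoo
git -C libfoo -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "libfoo v1"

# The superproject: each submodule entry pins an exact commit SHA, so a
# single commit here records one consistent snapshot across many repos.
git init -q meta
cd meta
git -c protocol.file.allow=always submodule add -q ../libfoo libfoo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Record a known-good version set"

# The pinned SHA is stored as a "gitlink" (mode 160000) in the tree:
git ls-tree HEAD libfoo
```

Syncing everything to "the version set as of commit X" is then a checkout
plus `git submodule update`. It works, but as the comment above suspects, it
gets torturous at scale; tools like Android's `repo` exist for exactly this
reason.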

~~~
forwhomst
Why do you dislike perforce? I’ve used them all and git is the worst, in my
opinion, of the ones that aren’t dead and buried (like cvs). Perforce is my
fave.

Internal research at google suggested that ‘git5’ users were objectively less
productive than average, but there’s some friction around ‘git5’ so that
doesn’t necessarily implicate git itself.

~~~
rak00n
You have to be bluffing. I never heard of such internal research. How is
productivity even measured in a context like this?

~~~
joshuamorton
Google has an entire engineering productivity research team. They do research
on tooling and such all the time.

~~~
rak00n
I wasn’t doubting the existence of a research team like that. I was doubting
the research mentioned. You cannot just say people who use a particular tool
(like git or git5) are less productive. That’s just absurd.

~~~
joshuamorton
I don't think the research is public, because it's all entirely about internal
tools, but yes, the conclusion was that the git interface for Perforce led to
less efficient development across basically any metric you could look at (LoC,
commit count, self-reported efficacy, etc.). Note that judging developers
on commit count is a bad metric, but comparing the LoC throughput of
developers who are evaluated on other metrics isn't problematic.

~~~
rak00n
I just searched in moma with variations of “git5 productivity” and no such
study showed up. Neither LoC nor commit count is a good metric to represent a
developer’s productivity.

In my personal opinion, using a git/hg-like interface makes it a lot easier to
work on a complicated CL, because you can maintain internal local branches and
you can easily revert your incremental changes. That’s not at all possible in
perforce. I just can’t see how a git/hg interface can make anyone less
productive.

~~~
joshuamorton
> Neither LoC nor commit count is a good metric to represent a developer’s
> productivity.

They're not a good metric to judge a developer on. But when you can do a large
controlled study (or even before/after with the same developer), without the
developer knowing they're being watched, it's a good metric.

------
dehrmann
Posting these because I don't see them mentioned so often in the mono/polyrepo
debate.

Choosing a monorepo? Be prepared for:

\- Investing in the build system

\- Investing in CI/CD

\- More painful upstream dependency management

\- A serious investment in architecting modules and thoughtful dependencies

\- Issues because of a bad deployment sequence

\- Slow version control

Polyrepo your thing?

\- Managing your own artifact repo (pypi, artifactory, etc)

\- Painful cross-project changes

\- Repeated effort around builds

------
tomrod
Naive question

Can't we call Github a monorepository, as it is a single area one can use to
access code?

Implementation details likely differ from whatever Google and her engineers
are cooking up, but at the end of the day I point to an archive and pull
artifacts. Certainly not every Googler clones the entire monorepo...

~~~
alexhutcheson
GitHub isn't a monorepo because you can't change multiple projects with a
single commit. In a monorepo you can do things like change a function
signature and change all of its callers in one commit, without breaking
anyone's build. If the implementation and its callers live in different
projects within GitHub, then you can't do this safely, and you need to do more
work to provide backward- and forward-compatibility in your changes.

~~~
juiyout
In your example (change a function signature and change all of its callers in
one commit), would it require code reviews and approvals from every affected
project?

~~~
alexhutcheson
Yes, although if you can prove convincingly that it's a 100% safe change you
can often get approval from someone who has approval power for most of the
repository.

In practice, if you're changing 100+ individual callsites, then you would
probably make a backwards-compatible change, then use an automated system to
send out and manage a bunch of different commits to clean up call sites, then
clean up your old function signature once all the commits are submitted. If
your code isn't a really widely used library then you probably have fewer than
20 callsites, though, so it's nice to be able to do it in one commit.
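
A minimal sketch of that staged migration, with all names invented for
illustration:

```python
# Step 1: land the new signature; keep the old name as a thin shim so no
# existing caller breaks. (fetch_rows/fetchRows are invented examples.)
def fetch_rows(table, *, limit=None):
    """New signature: keyword-only limit."""
    return [f"{table}-row-{i}" for i in range(limit if limit is not None else 3)]

def fetchRows(table, limit=None):
    """Deprecated shim: delegates to fetch_rows(). Deleted in step 3, once
    automated changes have migrated every call site."""
    return fetch_rows(table, limit=limit)

# Step 2 (automated, many small commits): rewrite call sites, e.g.
#   fetchRows("users", 2)  ->  fetch_rows("users", limit=2)
# Step 3: delete fetchRows once no call sites remain.
```

With few call sites, steps 1-3 collapse into the single atomic commit the
parent comment describes.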

------
Waterluvian
Is a monorepo more than just semantics?

At that scale I need tools to manage the monorepo. What about making tools to
manage many repos together?

~~~
tomasGiden
Ericsson has another philosophy where they connect multiple repos at CI/CD
time. They have open sourced a tool for that called Eiffel [1]. There is also
a book [2] written by the author of Eiffel that is quite good. One of his
arguments is that when, as an enterprise, you buy a company with a big mature
code base, you can’t just move it into another common repo with all custom
tooling (it's also very anti-agile to force everyone into the same suit). A
big difference though might be that Ericsson deals with a lot of custom
hardware for telecom networks, so their CI tooling might be more complex than
Google’s. Also, continuous deployment is not really an option for them. Then
it is better to just have each piece send out events on what’s happening
(builds, test runs, etc.) and let event listeners in other parts of the CI/CD
pipeline work out what to do.

(I have worked for Ericsson previously for 7 years but that was before Eiffel)

[1] [https://eiffel-community.github.io/](https://eiffel-community.github.io/)

[2] [https://www.amazon.com/Continuous-Practices-Strategic-
Accele...](https://www.amazon.com/Continuous-Practices-Strategic-Accelerating-
Production-ebook/dp/B07HJJN6S9)

------
ehvatum
One gigantic Perforce repo that achieved such Brobdingnagian scale, Google had
to roll their own ultra-giga Perforce that properly belongs in a hellish mondo
alternate dimension of largeness. “But it’s searchable”, Google says. Yes. By
the devil, it is.

~~~
gravypod
Is writing/improving tools to fit your needs considered bad? If Google needed
to store an exabyte of data in MySQL and they rewrote something with a
MySQL-compatible API (PingCAP), would this have been bad?

~~~
dahfizz
I think the sentiment is that would be solving the wrong problem.

Why do they need to store so much data in a single DB, and could they use a
different approach? Rewriting tools that already work well is a bit of an
extreme solution, but Google also has the resources to do something like that.

~~~
gravypod
> Why do they need to store so much data in a single DB, and could they use a
> different approach? Rewriting tools that already work well is a bit of an
> extreme solution, but Google also has the resources to do something like
> that.

I see the choices as: a) rewrite one core tool infrastructure component or b)
rewrite everything - currently and soon to be - built on that infrastructure.

It's also a question of level of abstraction. Why does my application care
about database sharding, replication, leaders, followers, etc.? It should only
care about connecting & querying. The complications of running a system at
scale should be hidden as well as possible from the code you've written.

From a monorepo perspective: why do I care where my code is stored? As long as
I can control visibility, hide build complexity, and expose an API to other
pieces of code I have all the functionality I need. Why should the nitty
gritty of a package manager, VCSs, or other multi-repo concepts influence how
I write my code?

------
fizixer
If you're a type 0 company [1] and you have a billion lines of code, much less
multiple billions, you're most likely sitting on massive amounts of code
bloat.

[1]
[https://en.wikipedia.org/wiki/Kardashev_scale](https://en.wikipedia.org/wiki/Kardashev_scale)

------
discordance
How do they avoid leaks? Surely having a giant monorepo poses an access
risk.

~~~
atesti
I don't remember where I read or heard this, but if you try to download all
the source code at once, they will probably block your account

------
jsw
I've always been curious about the general directory structure they use for
this. Does anyone have insight?

~~~
username90
Top level directories by product area, then team/project based sub directories
most of which would be their own repositories if you didn't do a monorepo.
There are more exceptions to this than can fit in a post, but that is the
general structure.

------
Havoc
More automated commits than human?

------
Speakeasys
I can only imagine how long ‘git status’ takes to complete

~~~
yoz-y
The equivalent is really fast. Perforce and derivatives have more in common
with SVN than with Git, so your work is filesystem-based and per-folder.

------
aantix
Is it really “billions”?

~~~
CydeWeys
Don't underestimate how many of these changes are entirely automated. There's
a lot of versioned data and configuration stored in the monorepo, not just
human-authored code commits.

~~~
hinkley
The first “big” project I worked on was a monorepo, and it ended up somewhere
in the neighborhood of 300k loc/sloc, I can no longer recall which, but over a
third of that was generated code.

This was a data point for a theory I have that projects where all the code is
being actively stewarded (instead of abandoned code nobody understands) have
less than about 15k (human) lines per developer. If Google’s ratio of
generated code is near that, then with 40k developers they’re not far from
that ratio.

------
sabujp
we're going to make this scale even more

------
mlthoughts2018
“It was a Perforce accident that didn’t scale and later got retrofitted with
mythology about monorepo designs.”

