
Google Is 2B Lines of Code, All in One Place - sk2code
http://www.wired.com/2015/09/google-2-billion-lines-codeand-one-place/
======
antics
Just because people are talking about it: I work at MSFT, and the numbers
Wired quotes for the lines of code in Windows are not even close to being
correct. Not even in the same order of magnitude.

Their source claims that Windows XP has ~45 million lines of code. But that
was 14 years ago. The last time Windows was even in the same order of
magnitude as 50 million LOC was in the Windows Vista timeframe.

EDIT: And, remember: that's for _one_ product, not multiple products. So an
all-flavors build of Windows churns through a _lot_ of data to get something
working.

(Parenthetical: the Windows build system is correspondingly complex, too. I'll
save the story for another day, but to give you an idea of how intense it is,
in a typical _day_, the amount of data that gets sent over the network in the
Windows build system is a single-digit _multiple_ of the entire Netflix movie
catalog. Hats off to those engineers, Windows is really hard work.)

~~~
ohitsdom
> single-digit _multiple_ of the entire Netflix movie catalog

Strange unit of comparison, although I may start using it.

~~~
RyJones
Facebook gets a Flickr's worth of photos every few days.

~~~
toomuchtodo
As someone who subscribes to ArchiveTeam's philosophy, it's going to be a dark
day when the time comes to scrape Facebook before it goes under with that much
data behind the scenes.

~~~
zeckalpha
It's already backed up to Blu-ray. They'll just hand it over.

~~~
ivank
Yeah, right. How, and to whom exactly, would they hand that over, given that
there are privacy settings on the photos that people expect to have respected?

~~~
BtM909
Sarcasm son....

~~~
ivank
You'd be surprised at how many people assume these bigcos are open to doing
the right thing when they're shutting down.

~~~
dx211
Hey, I've still got my complimentary MySpace zip file kicking around
somewhere.

~~~
jsmeaton
I had a hard drive crash on me with all of my photos some years back and my
"backup" strategy failed. Dumping a myspace backup got me some of my most
precious photos back. Thanks MySpace!

------
dekhn
I'm a google software engineer and it's nice to see this public article about
our source control system. I think it has pluses and minuses, but one thing
I'll say is that when you're in the coding flow, working on a single code base
with thousands of engineers can be an intensely awesome experience.

Part of my job - although it's not listed as a responsibility - is updating a
few key scientific Python packages. When I do this, I get immediate feedback
on which tests get broken, and I fix those problems for other teams alongside
my upgrades. This sort of continuous integration has completely changed how I
view modern software development and testing.
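
To sketch the mechanism (a toy model, not our actual tooling, and all target
names are made up): the build system knows the full dependency graph, so a
change to one package can be mapped to every test that transitively depends
on it:

    def affected_tests(changed, deps, tests):
        # deps maps each target to the targets it depends on;
        # tests is the set of targets that are tests.
        # Invert the graph: who depends directly on whom?
        rdeps = {}
        for target, ds in deps.items():
            for d in ds:
                rdeps.setdefault(d, set()).add(target)
        # Walk outward from the changed target.
        affected, frontier = set(), [changed]
        while frontier:
            for dependent in rdeps.get(frontier.pop(), ()):
                if dependent not in affected:
                    affected.add(dependent)
                    frontier.append(dependent)
        return affected & tests

    deps = {
        "//py/scipy": [],
        "//teamA/model": ["//py/scipy"],
        "//teamA/model_test": ["//teamA/model"],
        "//teamB/tool_test": ["//py/scipy"],
    }
    tests = {"//teamA/model_test", "//teamB/tool_test"}
    print(sorted(affected_tests("//py/scipy", deps, tests)))
    # ['//teamA/model_test', '//teamB/tool_test']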

~~~
amelius
Could you tell us something about the level of documentation? For instance, do
you have to write a paragraph of documentation for every function that you add
to the system? How about adding preconditions and postconditions, and other
invariants?

Also, is the code that you add to the repository always inspected by other
people? Is that done systematically?

~~~
lrem
It is mandatory that each code change is inspected for correctness and
language style, and approved by a code owner (all three may come from the same
person, or may require multiple people, depending on the situation).

~~~
amelius
But I guess the original programmer can never be the same person as those
other 3 roles? :)

~~~
mvgoogler
You always need to get at least _one_ other engineer to review your code.

OWNER and readability approvals may require additional reviewers but not
always.

------
sytse
So a monolithic codebase makes it easier to make an organization-wide change,
while microservices make it easier to have people work and ship in independent
teams. The interesting thing is that you can have microservices with a
monolithic codebase (as Google and Facebook do: both are composed of many
services). But you can also have a monolithic service with many codebases
(like our GitLab, which uses 800+ gems that live in separate codebases). And
of course you can have a monolithic codebase with a monolithic service (a
simple PHP app). And you can have microservices with diverse codebases (like
all the hipsters are doing).

I'm wondering if microservices force you to coordinate via the codebase, just
like using many codebases forces you to coordinate via the monolithic service.
Does the coordination have to happen somewhere? I wonder if early adopters of
microservices in many codebases (SoundCloud) are experiencing coordination
problems when trying to change services.

~~~
durin42
Google has _tons_ of services internally that talk via RPC. The monolithic
repo means that it's much easier to hunt down and find people that are (say)
using an outdated RPC method and help them fix their code.

(Just one example of how it's useful even when things are mostly services.)

~~~
Touche
This discourages you from ever making breaking changes to an API. On the face
of it that sounds good but sometimes you do have to make breaking changes. My
guess is that there are many duplicate projects in the Google code base for
when a breaking change is needed. This is a way to sidestep the problem.

~~~
tuckerman
I think sometimes it's actually the opposite. I was able to make a breaking
change to an API and update all of the callers to use the new one in a single
commit.

Tests would run on all the clients and, since in my workspace the server was
updated simultaneously, I could be more sure it would work.

~~~
Touche
That's fine for small breaking changes like an API being renamed, but
sometimes breaking changes require actually refactoring code, which is
hard/impossible to do without intimate knowledge of a codebase.

Think Python 2->3 or Angular 1->2. These types of changes do happen, and I
bet they happen at Google. I don't think anyone is rewriting a downstream app
when they make these changes. Most likely they are doing something like
forking the library and renaming it, which is just another form of versioning.

~~~
packetslave
A talk and a paper on how we do large-scale refactoring in the C++ parts of
the codebase:

[https://isocpp.org/blog/2015/05/cppcon-2014-large-scale-
refa...](https://isocpp.org/blog/2015/05/cppcon-2014-large-scale-refactoring-
google-hyrum-wright)

[http://research.google.com/pubs/pub41342.html](http://research.google.com/pubs/pub41342.html)

------
ChuckMcM
I will say that I saw and experienced many things that changed my definition
of 'large' at Google, but the most amazing was the source code control / code
review / build system that kept it all together.

The bad news was that it allowed people to say "I've just changed the API to
<x> to support the <y> initiative; code released after this commit will need
to be updated," and have that affect hundreds of projects. But at the same
time, the project teams could do the adaptation very quickly, with the orb on
their desk telling them that their integration and unit tests were passing.

I thought to myself, if there is ever a distributed world wide operating
system / environment, it is going to look something like that.

~~~
devit
The solution to the excessive API change problem is to force whoever changes
the API to fix all the consumers himself before the change is accepted.

The Linux kernel generally uses this policy for internal APIs for example.

~~~
stock_toaster

      > The solution to the excessive API change problem is to force
      > whoever changes the API to fix all the consumers himself
      > before the change is accepted.
    

This doesn't seem scalable. Let's consider the case of one api endpoint being
changed by one developer, to add a new param to a function call. Further
assume that this impacts hundreds of projects.

Does it really make sense to make one developer update those hundreds of
projects? Not only will it take forever to get finished (possibly never, if
new consumers of this api keep coming online), but the developer of the core
api may not have any experience with the impacted consumers of this codebase.
I think the end result of this policy would be that nothing, once written,
would _ever_ get updated, and new apis would just be added all the time (api
explosion).

~~~
thrownaway2424
It maybe isn't scalable, but that's part of the benefit. If you want to make a
change to a widely-used API, it's going to be a lot of work, and it's not
going to be a lot of work for the users of the API, it's going to be a lot of
work for _you_ because _you_ are required to do it yourself. This prevents a
lot of API churn unless the benefit is clear and sufficiently large.

If it was any other way you'd rapidly reach a useless equilibrium where random
engineers were demanding that thousands of other engineers fulfill unfunded
mandates for what might turn out to be negligible benefits.

~~~
philwelch
That's one extreme. Another extreme is that you have the API versioning from
hell, where you can never get rid of technical debt because any and all API
changes will break someone, somewhere, who has no reason to migrate, so you're
left keeping ancient code on life support indefinitely.

------
sshumaker
Xoogler here. There were tons of benefits to Google's approach, but they were
only viable with crazy amounts of tooling (code search, our own version
control system, the aforementioned CitC, distributed builds that reused
intermediate build objects, our own BUILD language, specialized code review
tools, etc).

I'd say the major downside was that this approach basically required a 'work
only in HEAD' model, since the tooling around branches was pretty subpar (more
like the Perforce model, where branches are second-class citizens). You could
deploy from a branch but they were basically just cut from HEAD immediately
prior to a release.

This approach works pretty well for backend services that can be pushed
frequently and often, but is a bit of a mismatch for mobile apps, where you
want to have more carefully controlled, manually tested releases given the
turnaround time if you screw something up (especially since UI is really
inefficient to write useful automated tests around). It's also hard to
collaborate on long-term features within a shipping codebase, which hurts
exploration and prototyping.

~~~
nulltype
Could you elaborate on how the single-repo model causes the problem you
describe in your last sentence?

------
ksk
It's interesting that they compare LoC with Windows. I suppose that this
article wants us to be amazed at those numbers. However, my experience with
Google's products indicates a gradual decline in performance and a
simultaneous gradual increase in memory bloat (Maps, Gmail, Chrome, Android).
Ironically, FWIW, that hasn't been the case with Windows: I have noticed zero
difference in performance going from Windows 7 to 8 to 10.

~~~
branchless
I'd have to disagree with this. First, the baseline: Windows is _very_ slow.
Second, I found later versions slower. Third (and most maddening), every
version of Windows I've ever used has gotten slower over time (even without
installing new software, and despite defragmenting).

~~~
sz4kerto
Windows is slow? Compared to what? In what task? Running a game? Boot time?
Opening Firefox?

I have problems with Windows, but I think it's the fastest desktop OS, mostly
because its graphics stack is by far the best of all. Running number-crunching
C code is exactly the same on Windows or Linux. (See all the benchmarks on the
Internet.)

~~~
buffoon
It's really not that fast. The filesystem is a total dog (MFT contention), to
the point that manipulating lots of small files is up to two orders of
magnitude slower than on ext4. This is made bearable thanks to SSDs being on
the market. Also, the amount of friction in getting stuff built, running, and
maintained is detrimental to general productivity, meaning you regularly piss
execution time out of the window just fixing stuff.

Note: Windows programmer for 19 years now. Only because of the cash.

~~~
ksk
I can't say I'm surprised to see people eager to point out how Windows sucks.
And sure, maybe it does. However, the fundamental point you're missing is that
I don't think Windows was ever positioned as an OS designed for every single
type of workload out there (notwithstanding marketing noise). Windows is a
very general-purpose OS meant for general-purpose 'mainstream' things - things
that hundreds of millions of people might want to do. Specialty workloads are
simply not something Microsoft is ever going to invest any significant amount
of time in, unless they see some money there. In that sense, Windows would
probably be a far better OS if users could modify it to suit their needs, but
them's the breaks. Linux seems to fill that void for some.

The disadvantage of NTFS which you point out isn't because of a fuckup; it's
just not designed for your use case. You might even find Microsoft telling you
that themselves here:
[https://technet.microsoft.com/en-us/library/Cc938932.aspx](https://technet.microsoft.com/en-us/library/Cc938932.aspx)

As to your point about productivity, I can't comment without knowing
specifics. As a primarily C++ programmer, I haven't run into any Windows
showstoppers that prevented me from shipping. I have run into showstoppers
with their dev tools, but I see them as separate from the OS.

~~~
archimedespi
At least you don't ever really _need_ to defragment ext4, unlike NTFS.

~~~
thetruthseeker1
Can you solve all the problems in ext4 that NTFS claims to solve? No, you
can't. I am not saying that either of the systems is perfect, nor that either
of them is horrible. They are each suited to the use cases they were designed
for. If somebody had a file system structure that was unusual for NTFS (say,
lots of small files), I think it was his mistake in treating it as a black
box.

~~~
buffoon
The problem is that a large number of small files is a very common use case.
Even Windows itself consists of lots of small files i.e. the source code and
WinSxS.

It should handle general scenarios consistently. We've had a few minor
versions of NTFS and now ReFS. ReFS should solve this, but it doesn't, as it
started out as a copy-and-paste of the NTFS code rather than a complete
reengineering effort.

------
lighthawk
"The two internet giants (Google and Facebook) are working on an open source
version control system that anyone can use to juggle code on a massive scale.
It’s based on an existing system called Mercurial. “We’re attempting to see if
we can scale Mercurial to the size of the Google repository,” Potvin says,
indicating that Google is working hand-in-hand with programming guru Bryan
O’Sullivan and others who help oversee coding work at Facebook."

Why Mercurial instead of Git?

~~~
urda
Because Google and Facebook are using Mercurial over Git internally.

 _Edit:_ And for those that are just shocked that git isn't the answer.

Facebook: [https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

Google: [http://www.primordia.com/blog/2010/01/23/why-google-uses-
mer...](http://www.primordia.com/blog/2010/01/23/why-google-uses-mercurial-
over-git/)

~~~
Lewisham
Well, Piper conforms to the Perforce API-ish, and Android and Chrome are both
on Git.

Mercurial was pushed internally as being the "better" (for some dimension of
better) between it and Git back in 2010, but I think even the most hardline
Mercurial fans have realized that in order to meet developers in the middle in
2015, we need to use Git for our open-source releases. We have a large
investment in Gerrit [1] and Github [2] now.

So the Mercurial comment is probably entirely based on scaling and replacement
for the Piper Perforce API, rather than anything externally facing.

[1] [https://www.gerritcodereview.com/](https://www.gerritcodereview.com/) [2]
[https://github.com/google](https://github.com/google)

~~~
cap_theorem
Though both Android and Chromium are still built in a way more similar to that
of a monolithic repo. They use repo and depot_tools, respectively, as
abstractions on top of Git, in order to clone and manage all their Git
repositories together as if they were a single large repository.

------
k33n
Comparing "Google" to Windows isn't really a fair comparison. I'm sure all of
the code that represents products that Microsoft has in the wild far exceeds
2B lines.

~~~
DannyBee
Note that this is just the monolithic repository. Google also has other non-
piper repositories containing hundreds of millions of lines too :P

For example, android and chrome are git based.

Note also that when codesearch used to crawl and index the world's code, it
was not actually that large. It used to download and index tarballs, svn and
cvs repositories, etc.

All told, the amount of code in the world that it could find on the internet a
few years ago was < 10b lines, after deduplication/etc.

So while you may be right or wrong, I don't think it's as obvious that you are
right as you seem to think.

~~~
ocdtrekkie
I'm still trying to figure out why having everything dumped in one big pile is
something worth bragging about. I'd far rather have code sorted well into
proper repositories.

~~~
BooneJS
So much is shared, though, right? Which is why Android is sorted into proper
repositories but still has the 'repo' front-end wrapper to make sure you're
getting the right versions of everything you need.

If I wanted to change something fundamental - say I found a 10% speedup in
Protobuf wire decode by changing the message slightly - there are likely very
many services that would all need it.

Everyone at Google operates on HEAD. You're not allowed to break HEAD, and
pre-submit/post-submit bots ensure you don't and will block your submit.
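
A toy sketch of that gating logic (hypothetical helper names; the real
presubmit infrastructure is far more elaborate):

    def can_submit(change, find_affected_tests, run_test):
        # A change may only be submitted if every test it affects passes at HEAD.
        failures = [t for t in find_affected_tests(change) if not run_test(t)]
        if failures:
            print("Submit blocked; failing tests:", failures)
        return not failures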

~~~
ocdtrekkie
From my perspective, at least, this design seems to explain why Google
websites are so frequently broken in the ways different services integrate.
Because Googlers edit shared resources that affect products they don't
personally work on, and they just trust automated tests, which almost
certainly miss a lot of the edge cases I encounter.

I admit that I'm not an expert at large software development, but this seems
to nearly fully explain Google's declining code quality.

~~~
DannyBee
"I have literally no idea what i am talking about, but here's something that i
believe fully explains every problem i have ever encountered" :)

------
yongjik
One humorous side-effect of having all that code viewable (and searchable!) by
everyone is that the codebase contains whatever typo, error, or mistake you
can think of (and convert into a regular expression).

I remember seeing an internal page with dozens of links for humorous searches
like "interger", "funciton", or "([A-Z][a-z]+){7,} lang:java"...

~~~
wetmore
> "([A-Z][a-z]+){7,} lang:java

Yeah this one was my favorite of the code search examples, there are some
really good ones in there.

~~~
cag_ii
Can you explain this? It looks to me like a regexp that searches Java source
for words of 7+ characters that start with a capital letter?

~~~
yongjik
It searches for CamelCase identifiers that are made of seven or more "terms",
where each term is a capital letter followed by one or more lowercase letters.

E.g., ProjectPotatoLoginPageBuilderFactoryObserver.

(Disclaimer: I just made it up. Not an actual Google project name.)

"lang: java" is not a part of regexp; just a Google code search extension that
searches for Java.
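
You can check the regexp part in plain Python, if you're curious:

    import re

    # Seven or more CamelCase "terms": a capital letter followed by lowercase letters.
    pattern = re.compile(r"([A-Z][a-z]+){7,}")

    print(bool(pattern.search("ProjectPotatoLoginPageBuilderFactoryObserver")))  # True
    print(bool(pattern.search("LoginPageBuilder")))  # False: only three terms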

------
low_battery
Direct link to talk (The Motivation for a Monolithic Codebase):

[https://www.youtube.com/watch?v=W71BTkUbdqE](https://www.youtube.com/watch?v=W71BTkUbdqE)

~~~
Walkman
This is crazy :D I have never heard tools and workflows like this.

------
kazinator
I am unable to believe that Google has 2B lines of original code written from
scratch at Google.

Maybe they are counting everything they use. Somewhere among those 2B lines is
all the source code for Emacs, Bash, the Linux kernel, every single third-
party lib used for any purpose, whether patched with Google modifications or
not, every utility, and so on.

Maybe this is a "Google Search two billion" rather than a conventional,
arithmetic two billion. You know, like when the Google engine tells you "there
are about 10,500,000 results (0.135 seconds)", but when you go through the
entire list, it's confirmed to be just a few hundred.

~~~
roxmon
Google has been around for 17 years and employs roughly 10,000+ software
developers. I think it's reasonable to assume that the 2B LOC metric is
accurate...

~~~
hk__2
Windows has been around for 30 years, and Microsoft had 61,000+ employees (OK,
that's not only software developers, and they don't work only on Windows) in
2005; yet it's only ~50M LOC. I don't think the number of years + developers
really shows anything; you don't write new code every day.

~~~
scott_s
You pointed it out yourself, but I think you underestimated its importance:
Microsoft works on many other things. Office, XBox, Windows Phone, Exchange,
SQL Server, .Net, etc. I suspect Microsoft's total line count is similar to
Google's. The difference, however, is that it's not _one_ codebase.

------
hellbanner
"LGTM is google speak for Looks good to me" \- actually common outside of
Google.

~~~
malkia
SGTM

------
a3n
In the spirit of "You didn't build that," I wonder how many lines of code
comprise the software that Google's binaries run on? Windows, Linux, network
stacks, Mercurial, etc, etc.

I also wonder if there's a circular relationship anywhere in there.

~~~
Splines
It's turtles all the way down, and also includes all the hardware and people.

------
sytse
The CitC filesystem is very interesting: local changes are overlaid on top of
the full Piper repository, and commits are similar to snapshots of the
filesystem. Sounds similar to
[https://github.com/presslabs/gitfs](https://github.com/presslabs/gitfs)
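
A toy sketch of the overlay idea as I understand it (my reading of the
article, not Google's implementation; all names made up):

    import os

    class OverlayWorkspace:
        # Reads fall through to the repo snapshot unless the file was locally edited.
        def __init__(self, snapshot_root, workspace_root):
            self.snapshot_root = snapshot_root    # read-only view of the full repo
            self.workspace_root = workspace_root  # holds only the files you touched

        def read(self, path):
            local = os.path.join(self.workspace_root, path)
            if not os.path.exists(local):         # no local edit: use the snapshot
                local = os.path.join(self.snapshot_root, path)
            with open(local) as f:
                return f.read()

        def write(self, path, contents):
            local = os.path.join(self.workspace_root, path)
            os.makedirs(os.path.dirname(local) or ".", exist_ok=True)
            with open(local, "w") as f:           # edits never touch the snapshot
                f.write(contents)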

------
makecheck
I really wish there were a tendency to track all change/activity and not just
total size; maybe like the graphs on GitHub. _Removing things is key for
maintenance_, and frankly, if they haven't removed a few million lines in the
process of adding millions more, they have a problem.

Having a massive code base isn't a badge of honor. Unfortunately in many
organizations, people are so sidetracked on the next thing that they almost
never receive license to trim some fat from the repository (and this applies
to all things: code, tests, documentation and more).

It also means almost nothing as a measurement. Even if you believe for a
moment that a "line" is reasonably accurate (and it's tricky to come up with
other measures), we have no way of knowing if they're measuring lots of
copy/pasted duplicate code, massive comments, poorly-designed algorithms or
other bloat.

~~~
nhaehnle
The article claims 2 billion lines of code across 25000 engineers, which boils
down to 80k lines of code per engineer. I'm not sure what to think about that.

It seems to be in a reasonable order of magnitude for C++/Java-type languages
compared to projects that I have seen, but it does imply a significant chunk
of code that is not actively being worked on for a long time (which is not
necessarily a bad thing - don't change a running system and all that).

------
brozak
The comparison of Windows to all of Google's services is pointless and
misleading.

It's like comparing the weight of a monster truck and the total weight of all
the cars at a dealership...

------
temuze
Assuming these numbers are right...

(15 million lines of code changed a week) / (25,000 engineers) = 600 LOC per
engineer per week

Is ~120 LOC per engineer per workday normal at other companies?

~~~
_delirium
Elsewhere in this thread it's mentioned that Google makes use of large-scale,
automated refactoring tools:
[http://research.google.com/pubs/pub41342.html](http://research.google.com/pubs/pub41342.html)

Would be interesting to know what percentage of the total LoC touched are
typically from that kind of automated refactor. Depending on the codebase, you
can touch a _ton_ of lines of code in a very small amount of time with those
tools.

------
melling
I imagine that there's a lot of Java and C++. I do like Go but it makes you
wonder if a more expressive language that requires a fraction of the code
would be helpful. Maybe Steve Yegge will see Lisp at Google after all.

~~~
astrange
He claims to have stopped using it (#5):

[https://sites.google.com/site/steveyegge2/ten-
predictions](https://sites.google.com/site/steveyegge2/ten-predictions)

------
jakub_g
Some questions that immediately come to my mind:

- What is the disk size of a shallow clone of the repo (without history)?

- Can each developer actually clone the whole thing, or do you do a partial
checkout?

- Does the VCS support a checkout of a subfolder (AFAIK mercurial, same as
git, does not support it)?

- How long does it take to clone the repo / update the repo in the morning?

Since people are talking about huge across-repo refactorings, I guess it must
be possible to clone the whole thing.

Facebook faces scaling issues similar to Google's, so they wrote some
mercurial extensions, e.g. for cloning only metadata instead of the whole
contents of each commit [1]. It would be interesting to know what exactly
Google modified in hg.

[1] [https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

~~~
bruckie
Most of these questions are answered in the talk. The tl;dr is that you don't
clone or check out anything at all: instead, you use CitC to create a
workspace, and the entire repository is magically available to you to view or
edit.

This model precludes offline work, of course. But that's not much of a problem
in practice.

~~~
jakub_g
I did not follow the links in the Wired article, and didn't realize there was
a link to a YouTube talk. Thanks for the tl;dr; I need to watch the video!

------
therealmarv
What? This surpasses the complexity of the mouse genome. See these charts for
comparison:
[http://www.informationisbeautiful.net/visualizations/million...](http://www.informationisbeautiful.net/visualizations/million-
lines-of-code/)

------
Strikingwolf
Really interesting article. Sounds like a great solution to the problem of
submodules in git. Definitely worth looking at. Thanks for posting, OP.

IMO this system would be best suited to large companies, but I could see the
VCS they are developing being used by anyone if it gets a GitHub-esque
website.

------
ilurkedhere
Yeah, but it's only like ~200 lines rewritten in Lisp.

~~~
juhq
A serious question about Lisp and Google: is Lisp used within Google, and if
so, in what projects and why?

------
Apocryphon
Looks like someone's going to have to update this:
[http://www.informationisbeautiful.net/visualizations/million...](http://www.informationisbeautiful.net/visualizations/million-
lines-of-code/)

------
buro9
It hurts just to think about what the build, test, and deploy systems must
look like.

~~~
jsolson
Well, for build, take a look at Bazel - though imagine it attached to a
cluster of machines that can all read from Piper.
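
For flavor, a minimal BUILD file in (public) Bazel syntax - a library plus a
test that depends on it; the build system works out what to rebuild and retest
from declarations like these:

    cc_library(
        name = "stats",
        srcs = ["stats.cc"],
        hdrs = ["stats.h"],
    )

    cc_test(
        name = "stats_test",
        srcs = ["stats_test.cc"],
        deps = [":stats"],
    )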

------
michaelwww
For those interested, the source analyzer Steve Yegge was working on called
GROK has been renamed Kythe. I don't know how useful it turned out to be for
those 2B LOC. [http://www.kythe.io/docs/kythe-
overview.html](http://www.kythe.io/docs/kythe-overview.html)

Steve Yegge, from Google, talks about the GROK Project - Large-Scale, Cross-
Language source analysis. [2012]
[https://www.youtube.com/watch?v=KTJs-0EInW8](https://www.youtube.com/watch?v=KTJs-0EInW8)

------
Locke1689
What I'd like to know and no one seems to mention:

What's the experience like for teams _not_ running a Google service and
instead interacting with external users and contributors, e.g. the Go compiler
or Chrome?

~~~
bruckie
Many larger external projects are hosted in other repositories (Chrome and
Android are well-known examples).

Smaller stuff (like, say, tcmalloc or protocol buffers) is usually hosted in
Piper and then mirrored (sometimes bidirectionally) to an external repository
(usually GitHub these days).

~~~
Locke1689
Thanks, but I guess I was asking more about how this affects the other
development characteristics described. You still have to deal with the massive
repository and infrastructure, but if you're Go, for example, and you want to
change an API 1) you can't see the consumers because many or most won't be
Google-internal, and 2) even if you could see them, you can't change them.
Even the build/test/deploy systems are somewhat compromised because you can't
rely on all builders of your components being Google employees and having
access to those resources.

So in these scenarios, what does Google's infrastructure buy you, if anything?
And if it doesn't buy you anything, how does that influence Google culture?
Are teams less willing to do real open development due to infrastructure
blockage?

~~~
skybrian
Working with multiple source control systems, multiple issue trackers, and
multiple build systems has its challenges.

It's true that you don't know about all callers if you're working on open
source software. There's no magic there; you need to think about backward
compatibility. (On the other hand, if it's a library, your open source users
can usually choose to delay upgrading until they're ready, so you can
deprecate things.)

The main advantage for an open source project is that, though you don't know
about all callers, you still have a pretty large (though biased) sample of
them. If you want to know how people typically use your APIs, it's pretty
useful. Running all the internal tests (not just your own, but other people's
apps and libraries) will find bugs that you wouldn't find otherwise.

There were changes I wouldn't have been confident making to GWT without those
tests, and bugs that open source users never saw in stable releases because of
them. On the other hand, there were also changes I didn't make at all because
I couldn't figure out how to safely upgrade Google, or it didn't seem worth
it.

------
727374
Really? This article sounds very oversimplified, but I haven't worked at
Google so I wouldn't know. I'm assuming that if you want to change some
much-depended-on library, there's a way to bump the version number so you
don't hose all your downstream users. That's the way it worked at Amazon, at
least. Also, I wonder why the people in the story think Google's codebase is
larger than that of other tech giants, not that it really matters.

~~~
jsolson
Google mostly works at HEAD. Very little is versioned, and branches are almost
unheard of.

In general you change the much-depended-on library _and all of its consumers_
(probably over time, in multiple changes, but you _can_ do it in one go if it
really needs to be a single giant change).

------
sandGorgon
What are the best practices to follow in a single-repo-multiple-projects
world? Some people recommend git submodule, others recommend subtree.

How do you manage alerts and messages - does every developer get a commit
notification, or is there a way to filter out messages based on submodule?

How does branching and merging work?

I'm wondering what processes are used by non-Google/FB teams to help them be
more productive in a monolithic-repo world.

~~~
cmrdporcupine
Generally branching isn't really a thing at Google. Work is done at the code
review level per change list ("CL"). Most changes happen through incremental
submission of reviewed CLs, not by merging in feature branches. Every CL must
run the gauntlet of code review, as well as can not usually be submitted
without passing tests. There are rare cases where branching is used, but not
commonly.

As for notifications, the CL has a list of reviewers and subscribers. If you
want to see code changing, you watch those CLs. Most projects have a list
where all submitted CLs go.

~~~
sandGorgon
Can you explain this a little more - what is a CL vs. a changeset... and what
do you mean by watching changelists? It sounds like you're subscribing to
specific commits... but I'm asking more at a project/directory level within
the monolithic repo.

~~~
devinj
Changelist is Perforce-speak for changeset. Because each CL gets an individual
review before being applied to the codebase, there is no merge process --
there is no branching/merging.

There is a solution for project/directory-level CC / review requirements. I
didn't see it discussed in the talk, though.

------
nemesisrobot
The comparison between the total LOC across _all_ of Google's products and
just one of Microsoft's is a bit unfair.

------
h1fra
The comparison with Windows is really just there to give the casual reader
something to compare against, and it's not a very good one. An OS is a huge
project, but Google has hundreds of different projects, APIs, libraries,
frameworks... Even Unix, with an "unlimited" supply of developers, does not
reach that point.

------
dchichkov
I remember somebody wise had said once: "Every line of code is a constraint
working against you."

------
jfkw
How do the monolithic repository companies handle dependencies on external
source code?

Are libraries and large projects e.g. RDBMS generally vendored/forked into the
monolithic repositories, regardless of whether the initial intent is to make
significant changes?

~~~
jpollock
There's typically a subdirectory called third_party, with subdirectories for
each vendor, product and version. If the team is smart, they will also enact a
rule saying "only one version". If you're really, really smart, local changes
are kept as a set of patches, keeping them separate from the imported tar
file.

So, for source deliveries:

    
    
      third_party/apache/httpd/2.4/release.tgz
                                  /patch.tgz
                                  /Makefile (or other config)
      third_party/apache/httpd/2.2/release.tgz
                                  /patch.tgz
      ...
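
A sketch of what the corresponding import/build step might do (hypothetical
paths, assuming GNU patch is available):

    import pathlib
    import subprocess
    import tarfile

    # Unpack the pristine release, then apply the local patches on top,
    # so the delta from upstream stays isolated in patch.tgz.
    src = pathlib.Path("build/httpd-2.4")
    with tarfile.open("third_party/apache/httpd/2.4/release.tgz") as tgz:
        tgz.extractall(src)
    with tarfile.open("third_party/apache/httpd/2.4/patch.tgz") as tgz:
        tgz.extractall("build/patches")
    for p in sorted(pathlib.Path("build/patches").glob("*.patch")):
        subprocess.run(["patch", "-p1", "-i", str(p.resolve())], cwd=src, check=True)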

~~~
cpeterso
For example, here is Chromium's third_party directory:

[https://chromium.googlesource.com/chromium/src.git/+/master/...](https://chromium.googlesource.com/chromium/src.git/+/master/third_party/)

------
breatheoften
Is the source of Piper and the build tools also in the monorepo, and also
developed/deployed off the head branch? Seems like a random engineer could
royally fubar things if they broke a service which the build system depends on
...

~~~
thrownaway2424
You said "developed/deployed" as if it were the same thing. Even if you
somehow checked in the giant flaw, bypassing all code review and automated
testing, it's not like that would suddenly appear in production. Google isn't
some PHP hack where you just copy a tarball to The Server. Binaries of even
slightly important systems typically go through many stages of deployment,
first into unimportant test systems, then usually very, very slowly into
production with lots of instrumentation and of course, quick and easy methods
of rolling back to the previous release.

~~~
breatheoften
I see - it was something of a half-baked thought, but in my defense I wasn't
trying to suggest that I thought HEAD was automatically deployed to production
... Deployment to testing rounds 1 ... N is still a "deployment", isn't it
...? The shared boilerplate for how that magic works in a scalable way for so
many different projects must be quite complex, and itself hard to test ...

------
dblotsky
Even if the numbers are off, the assumption that 40M lines of code take less
effort to write than 2B lines commits the fallacy that effort is proportional
to the number of lines of code. Come on, Wired, you can do better.

------
amelius
Is this article saying that all developer employees have access to the "holy"
search algorithm internals? I can hardly believe that to be true, given that
SEO is an entire industry.

~~~
enf
Once upon a time it was all in one repository. Shortly after I started there
in late 2005, the "HIP" source code (high-value intellectual property, I think
it stood for) was moved to its own source tree, with only precompiled binaries
available to the rest of the company.

Looks like there is a Quora question that mentions this too:
[https://www.quora.com/How-many-Google-employees-can-read-
acc...](https://www.quora.com/How-many-Google-employees-can-read-access-all-
of-the-source-code-for-Googles-search-engine)

------
known
How frequently does Google do
[https://en.wikipedia.org/wiki/Code_refactoring](https://en.wikipedia.org/wiki/Code_refactoring)?

------
rbinv
Those are mind-boggling numbers.

Although I kind of doubt that "almost every" engineer has access to the entire
repo, especially when it comes to the search ranking stuff.

~~~
Lewisham
FWIW, apart from the previously mentioned sensitive stuff, we give engineering
interns the same level of access we give full-time engineers. We keep things
open because it makes things faster; we have an excellent code search tool
that's great for navigating through the Piper repo (e.g. finding subclasses,
finding uses of an API) which really speeds up dev time.

When we're not talking about the sensitive stuff, there's not much magic to
what many engineers write every day, it's the same "glue technology X to
technology Y" stuff you see everywhere, so I don't think there's any value to
hiding that in the name of secrecy.

~~~
maximilianburke
How are changes that affect sensitive code handled? Are the owners of that
code on the hook for making any API updates that the person pushing the change
can't make?

~~~
Lewisham
Having never worked on the secret sauce, I honestly don't know. There is a
small team of people who tend to do many of the global refactors; I'd expect
that they are given special permission.

------
dblock
A giant repo works for Google, and works for Facebook, and Microsoft, but it's
bad for the development community at large.

If you start centralizing your development you’re killing any type of
collaboration with the outside world and discouraging such collaboration
between your own teams.

[http://code.dblock.org/2014/04/28/why-one-giant-source-
contr...](http://code.dblock.org/2014/04/28/why-one-giant-source-control-
repository-is-bad-for-you-and-facebook.html)

------
wedesoft
With 2 billion lines of code I would consider the problem of developers
stepping on each other's toes essentially solved.

------
rbanffy
What I find most distressing is that their Python code indents with two
spaces... This is so wrong, Google.

------
izzydata
If they were to recompile all of it on a standard desktop PC how long would it
take? A week?

------
sa2015
I wonder how close the "piper" system is to the code.google.com project.

~~~
DannyBee
I worked on code.google.com; I can tell you they are 100% unrelated.

Piper grew out of a need to scale the source control system the initial
internal repositories were using.

code.google.com was a completely separate thing, supporting completely
different version control models at a very different scale (a very large
number of small repositories, vs. a very small number of very large
repositories).

------
MrBra
Am I the only one who initially read 28 instead of 2B ? :)

~~~
MrBra
downvoter: laughter is good for your health.

------
wellsjohnston
What is a "line of code"? out of the 2b lines of code google has, how much of
it was auto-generated? how many of those lines are config files? This is a
very silly article that has little to no value.

------
therealmarv
So they do not suffer from git submodules, I guess.

------
wgpshashank
Cool. How much front end and back end each?

------
rosege
How many lines is duckduckgo? :-)

~~~
creshal
Can't be that many, given they outsource the actual search engine to third
parties.

------
nootropicdesign
OMG it's all in one file? OMG OMG it's all on ONE LINE????!!!

------
Sven7
Now I know why my google plus page takes half a day to load.

------
aikah
lol git clone
[http://urlto.google.codebase.git](http://urlto.google.codebase.git) ...

I wonder how much time it takes to clone the repo, provided they use git.

~~~
robertk
It's 80TB. You don't clone, just ask for views.

------
kuschku
This explains quite a few things.

Still, this is not a very forward-thinking solution. Building and combining
microservices – effectively UNIX philosophy applied to the web – is the most
effective way to make progress.

EDIT: Seems like I misunderstood the article – from the way I read it, it
sounded like Google has a monolithic codebase, with heavily dependent
products, deployed monolithically. As zaphar mentioned, it turns out this is
just bad phrasing in the article and me misunderstanding that phrasing.

I take everything back I said and claim the opposite.

~~~
thomashabets2
That's why Google is so unsuccessful at scaling technical solutions, unlike
you they're not forward-thinking.

~~~
kuschku
No, it’s not that they are unsuccessful, it’s that they are unable to maintain
it properly. Already today they have tons of open security issues.

Or think about April 1st, when they set an Access-Control-Location: * header
on google.com because someone wrote the com.google easter egg.

Read the post from the SoundCloud dude from yesterday to find out how to do
software management properly (hint: modularization is everything).

~~~
captn3m0
Do you have a source for the easter egg security issue? Haven't read about it
anywhere, and can't seem to find anything either.

~~~
kuschku
Here is one: [http://arstechnica.com/security/2015/04/no-joke-googles-
apri...](http://arstechnica.com/security/2015/04/no-joke-googles-april-fools-
prank-inadvertently-broke-sites-security/)

The issue was that they wanted to load the page – with the user logged in, etc
– on com.google. For this they implemented an explicit URL parameter that
would allow this.

~~~
magicalist
> _Attackers could have seized on the omission of the X-Frame-Options header
> to change a user 's search settings, including turning off SafeSearch
> filters_

The horror!

