I've worked a bit with "the big monorepo" (though nothing like Google scale) and my impression is that a lot of the benefits fall apart if you don't have people working to maintain the build environment, automated tests, developer environments. When it takes hours or more to run tests and produce a build, it can really slow the team down. The ability to force updates on dependencies in a single big commit can be really worthwhile as long as you're willing to spend the time to build the tools and do the maintenance.
> a lot of the benefits fall apart if you don't have people working to maintain the build environment, automated tests, developer environments.
The thing is, a monorepo makes clear what multiple repos can obscure: if you have a software system, then you need all those things anyway. Multiple repos can hide the cost, but they don't eliminate it — in my experience, they multiply it. E.g. if someone makes a breaking change in a monorepo, he breaks the tests and he fixes them before his changes can be merged in; but with multiple repos someone can commit a breaking change and a dependent repo won't know about it until days, weeks or even months later; the person responsible for fixing it has no idea what really changed or why; and he has a broken build until he does figure it out.
> The ability to force updates on dependencies in a single big commit can be really worthwhile as long as you're willing to spend the time to build the tools and do the maintenance.
It's like maintaining your car: you can spend $20 to change your oil every 3,000-5,000 miles, or you can skip it and spend thousands on repairs later. Up to you.
On the other hand, some of the code you have to update may be for products that have been mothballed. That might end up being a waste of time if the project is never revived.
Or the code might be safety-critical and updating it is introducing unnecessary changes (and hence, unnecessary danger).
Some tests may include hardware integration. For example, aircraft software may need some number of flight hours after significant changes. That's probably not going to be a part of the CI suite, and the changes will introduce a greater manual testing burden.
> but with multiple repos someone can commit a breaking change and a dependent repo won't know about it until days, weeks or even months later; the person responsible for fixing it has no idea what really changed or why; and he has a broken build until he does figure it out.
Isn't that the whole point of versioning? If there is a breaking change, only the repos using the latest version will see it. At which point they can submit a bug report before any of that code enters production.
I still agree that avoiding all of that is preferred.
That would be great if people always used versioning correctly. I deal frequently with breaks caused by simple things like renaming, removing, or re-ordering protobuf fields in one repo, and not updating code in another. At least in a monorepo you can catch those sorts of errors.
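To make that failure mode concrete, here is a toy sketch in plain Python (not real protobuf) of why re-numbering fields in the producer's repo silently corrupts data for a consumer repo that was never updated; in a monorepo, the consumer's build and tests would run against the same change:

```python
# Toy illustration (not real protobuf): a wire format keyed by field number.
# The producer repo renumbers its fields; a consumer repo that still decodes
# with the old numbers silently reads the wrong data.

OLD_SCHEMA = {1: "user_id", 2: "email"}   # what the consumer repo still assumes
NEW_SCHEMA = {1: "email", 2: "user_id"}   # after a "harmless" re-ordering upstream

def encode(record: dict, schema: dict) -> dict:
    """Serialize by field number, the way protobuf does on the wire."""
    return {num: record[name] for num, name in schema.items()}

def decode(wire: dict, schema: dict) -> dict:
    return {name: wire[num] for num, name in schema.items()}

wire = encode({"user_id": "42", "email": "a@example.com"}, NEW_SCHEMA)
print(decode(wire, OLD_SCHEMA))
# {'user_id': 'a@example.com', 'email': '42'} -- silent corruption, no compile error
```

Nothing fails loudly here; only a test run on the consumer's side (or a shared commit gate) catches it.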
> a lot of the benefits fall apart if you don't have people working to maintain the build environment, automated tests, developer environments.
This is absolutely true. Without a very strong culture around testing and deep, continuous investment in infrastructure and tools, a monolithic repo would quickly devolve to being a giant midden heap of code.
How is that different with multiple repos? The total amount of code is the same. You also need to maintain build environments (not one but several), dependency management is much harder, etc.
I would say that any large project needs a "very strong culture around testing and deep, continuous investment in infrastructure and tools"
In my experience with small teams, maybe one person will take testing & automation seriously. If they've got a couple small codebases to worry about, it's possible for them to manage. One giant codebase would be too much work.
Obviously not the way it should be, etc, but reality bites.
So you've got a few small repos with good build infra, and many small repos without it. Your overall system is still a sinking ship, and you have no economy of scale force-multiplying the few people who take infra seriously.
There are dedicated and volunteer teams that work on the internal tools and a strong testing and accountability culture. Everything is built from source.
At Google it is somewhat connected, actually. Sometimes engineers have to run ALL tests over the repo. And it takes hours (better be scheduled overnight!).
Although they might have fixed that already.
I still cannot wrap my head around this. What merit does forcing internal code to update a dependency have? Having dependencies packaged like NuGet, and using an internal repository, you can always remove a particular version of a dependency if you must, and a build will fail if it cannot automatically use the latest minor version.
It means that I, as a dependency author, can make a breaking change and push not only the updates to my library, but also the updates to all my dependents in a single commit. Using a dependency packaging system like NuGet is a workaround for me not knowing who all my clients are. In a monorepo, I can have perfect knowledge.
> push not only the updates to my library, but also the updates to all my dependents in a single commit.
And what's more, you can run all their automated tests and address any problems your changes introduce. Which presumably means regular forced updates aren't the recipe for instability they would otherwise be.
Should you know who all your clients are? That seems to be a workaround for not having an adequate testing framework or a contract stating what your code will do. Do open source frameworks know about everyone's code that uses them?
Plus you do have perfect knowledge if your CI tooling builds everything on your commit. Which it will do anyway if you have a monorepo.
> That seems to be a workaround for not having an adequate testing framework or a contract
Sometimes you get that contract wrong, and you don't realize it until you've shipped it. If you want to evolve an API by introducing a breaking change, you can do it at Google. You just have to migrate the clients with the API change. It means your APIs can both improve and stay lean, without needing to carry around legacy cruft, which is great.
That's a good point about not carrying cruft forward. I've often seen projects where there are crazy methods like sendEmailOn3rdBirthdayAfterFullMoon because someone somewhere needed it for a single project, but who knows if it is still in use.
You will not have perfect knowledge. Even at Google they don't truly have a monorepo, as there are several other projects that live outside it. A build system will also detect changes, run tests, produce a build, and notify maintainers when you update a dependency. A monorepo is not needed for this.
You're not the first to make this claim, but it seems dubious.
It means the library authors must be experts on all projects using the library. How can you make changes to all the client code without knowing the code base well? Making changes willy-nilly to a codebase you are not familiar with, just because it uses your library?
That sounds like a recipe for disaster.
Or you need to get consensus and code review for each and every one of those projects and have all tests pass for all projects. This looks like slow-as-molasses progress. Submitting any changes would always be an extremely big endeavour.
The alternative is to have submits not affect other projects. You can do this in a monorepo by mandating that other projects not sync to head but to a given release tag of your library, and only move to a more recent tag on their own schedule, after they've tested the upgrade. IOW, pretty much how a packaged release would work. Pretty much how syncing to separate repos would work.
I fail to see what the benefits of a monorepo are. The downside is that either anyone can touch anything without being an expert, or you have an extremely long and complex submission process. (And woe to you if someone submitted something in the meantime... you have to redo your testing and fixing again?!?)
The world at large has lived with separate repos and explicit releases for ages. That's how third-party libraries are handled in every single software team I've ever seen.
The facts on the ground suggest that Google does not have trouble delivering high-quality software, so your arguments aren't very convincing. To me the biggest issue is that their customer-facing software is often not what people want, but that isn't really a development process issue.
"You're not the first to make this claim, but it seems dubious."
"It means the library authors must be knowledge expert on all projects using the library."
They must know enough to design good APIs or else you will fail anyway. If your APIs require expert knowledge of the system to use correctly, you've probably done it wrong.
" How can you make changes to all the client code without knowing the code base well?"
Good APIs and good refactoring tools.
" Make change willy-nilly on codebase you are not familiar with on the basis that it uses your library?"
With good tests and testing tools, I fail to see the issue.
"Or you need to get consensus and code review for each and everyone of those projects and have all tests pass for all projects. This look like slow-as-molasse progress. Submitting any changes would always be an extremely big endeavour.
"
Nope, there are tools to manage the changelist process for large refactorings. My experience says velocity is not an issue in terms of things moving too slowly, pretty much ever.
"The alternative is to have submit not affect other projects. You can do this in a monorepo by mandating that other projects not sync to head but to a given release tag of your library and only move to a more recent tag on their own schedule, after they've tested the upgrade. IOW, pretty much how a packaged release would work. Pretty much how one would have sync'ing to separate repos.
"
Try this sometime, and see what the lag time to upgrade is across a large enough set of teams and environments. It's "forever, until someone breaks you".
"I fail to see what are the benefits or monorepo."
I think they've been covered quite well.
You just think they don't happen in practice, despite people with experience saying they do.
That's okay. You don't have to believe them, but it's not clear what anyone could do to convince you. :)
" The downside is either anyone can touch anything without being and expert and having an extremely long and complex submission process."
The first is an explicit feature. If everyone needs to be an expert to touch most of your codebase, that sounds like a bad codebase to me. The latter is simply not correct. The rate of change is many times faster than any separate-repo process I've seen.
" (And woe to you if someone submitted something in the meantime... you have to redo your testing and fixing again?!?)
"
If it's broken, it's broken, whether it's packaged or not. The only thing you are talking about is whether you can tell it's broken before you get to the next packaged release.
"The world at large has lived with separate repos and explicit releases for ages. That's how third-party library are handled in every single software team I've ever seen.
"
Yes, and honestly, it's been a complete and total disaster everywhere I've seen it, with tons of teams with forked, old, buggy versions, the inability to reuse code (or bugs) because of subtle third-party dependency incompatibilities, etc.
I'm not sure why you claim it's a panacea.
Neither is, but not for the reasons you've given.
To know if a change is breaking you do not need the monorepo, just access to the build system; monorepo or not, access to the source alone is not enough. The build system will detect an update to a dependency, start a new build, run all tests, and produce an artifact if it's successful.
The large federated something-or-other doesn't exist because Google chose to build monorepo tools instead, and approximately no one else's code base is really big enough to matter. Why didn't/couldn't Google build the federated something-or-other?
It's an automated build system, it already exists. I think it's more likely that Googlers decided that they had the discipline for a monorepo; time will tell if that holds true.
Not theoretically, with [OpenGrok](https://opengrok.github.io/OpenGrok/) you can register all of your git repositories and then search for the dependency declaration of your library across the entire company's source code. With that list, you can use a simple bash script to autogenerate the git diffs that switch each repo to the appropriate security patch for whichever minor version of the library it's using.
You can also use this exact same approach to create tasks on teams to upgrade their library if you've EOLed the version they're using. You can be as narrow (requiring everyone on the latest version) or as broad (having a band of versions allowed, in case the change is significant enough to require updates to how it is used, or if there's a major component of the system using an older version for performance-regression reasons you're still addressing in the improved library) as you want.
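A minimal sketch of that kind of script, assuming every repo is checked out under one root and pins its dependencies in a requirements.txt-style file (the library name and patched versions below are made up):

```python
# Minimal sketch of the "search and patch across repos" idea. Assumes all repos
# are checked out under REPOS_ROOT and pin dependencies in requirements.txt;
# the library name and patched releases are hypothetical.
import pathlib
import re

REPOS_ROOT = pathlib.Path("/srv/checkouts")
LIBRARY = "acme-crypto"
PATCHED = {"1.2": "1.2.9", "1.3": "1.3.4"}   # minor series -> patched release

pin = re.compile(rf"^{re.escape(LIBRARY)}==(\d+)\.(\d+)\.(\d+)$")

for req in REPOS_ROOT.glob("*/requirements.txt"):
    lines = req.read_text().splitlines()
    changed = False
    for i, line in enumerate(lines):
        m = pin.match(line.strip())
        if not m:
            continue
        series = f"{m.group(1)}.{m.group(2)}"
        target = PATCHED.get(series)
        if target and line.strip() != f"{LIBRARY}=={target}":
            lines[i] = f"{LIBRARY}=={target}"   # stay within the repo's minor series
            changed = True
    if changed:
        req.write_text("\n".join(lines) + "\n")
        print(f"patched {req.parent.name}; now open a PR / run its CI")
```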
This actually simplifies your build system, as you can reuse the standard open source tools that come with your language of choice, which often assume they're running at the root of the project and can do what they want to the surrounding directory structure. Build-and-test cycles take mere minutes instead of hours, being isolated to only the component you're working with (and in the case of a library update across projects, parallelized across build servers thanks to this isolation, so still much faster than the monorepo approach).
This also makes the codebase healthier, in general. Libraries and services need to be designed to actually have a clear purpose and interface; they can't cheat by grabbing source files from each other while breaking the expected interface, so refactoring their internals has a near zero probability of causing downstream issues. Services can be rewritten by standing up a parallel service, shadowing traffic, and validating equal responses between them. You can also fork and repurpose service code for another project and know exactly what code is required by it immediately.
These things are possible in monorepos, but they require a very high degree of diligence, and even more so in a mixed-language environment.
In general, breaking everything into a separate repository is madness (see the modern Node.js ecosystem), so new services should start out as "mini-monorepos" where everything new they need that isn't already an available library should be contained within the service. Once a few services seem to be doing the same sort of thing, then you look at the commonalities between their implementations, learn from past mistakes (that we all make), and make the library to share between them, and allow future services to not have to reimplement that.
(This also has the advantage of making it easier to open source said library, since it would contain zero references to proprietary services, and requires no cleaning of the git history, and to get services to use your library, you already need to document it clearly.)
Frankly, the world Google is in with their monorepo is an amazing feat of engineering built upon the initial tools they had available ~15 years ago, but it really isn't an ideal situation, and there are better, more flexible ways of doing things now. I'd be sad if our industry hadn't improved this over such a period of time.
This actually touches upon a subject I hold dear. For .NET development, I use SimpleInjector, not because it is the best and most flexible DI container, but because it is deliberately designed to make it impossible to do things that go against best practice.
Sometimes it is annoying that it lacks the flexibility, but it means we just have to do the refactoring now, rather than let the code rot. A lead dev should always keep this in mind and choose tools that make the right way the easiest, and possibly only, way forward. A monorepo is an example of a tool that gives superb knowledge and flexibility, but it requires some serious discipline to avoid rapidly growing complexity.
Agreed. It's a huge prerequisite of a monorepo that is often not spelled out directly. Probably the reason is that the people who successfully implemented monorepos already live so deeply in that continuous integration world, maybe even continuous deployment, that they don't even realize anymore that a huge part of the world has not moved there yet (despite trying for a decade).
With my current client we decided to go with multiple repositories and came to regret that decision.
The product family we are developing contains a website, two mobile apps (iOS, Android), three PC applications (OS X, Windows and one legacy application) and software for embedded devices.
Each product lives in its own repository, most repositories use one shared component as a submodule, and many products share a common platform, which is used as a submodule and which the products build on top of. The test automation core sits in its own repo. I built and maintain that test automation core and it's a pain.
Each product repository has its own functional tests that use the test automation core to make the tests actually do something. So whenever I make changes to the test automation core, I need to branch each product and use the feature branch from my core in it. Then I run all the tests in each repo and check that I did not break backwards compatibility. If I do break it, then I need to fix it through pull requests to possibly multiple teams.
I'm not the greatest git wizard in the world, so maybe someone else could easily maintain a good image of the whole mess in their head, but for me this is a pain. And everyone else who maintains a shared component shares my pain.
A monolithic repo would not magically make all the pain in the world disappear, but it would be so much easier to just have one repo. That way I would need to branch and PR only once.
For repos to be independent, they need to have a defined interface, be able to test themselves against that interface as well as the expected behavior, and specify interdependencies with versioning (semver helps reduce effort on this point).
You can run integration tests in the dependent project across the interface of the dependency project, but tests in the dependency project should find 99% of breaking changes in the dependency.
So you can do whatever you want in any project, then test those changes. If your tests point out you've broken your contract with other projects, you decide to either solve that issue before releasing the change or cut a new major revision. Behavior that isn't covered by tests should basically be considered undefined; teams of dependent projects should contribute tests for things they want to rely on.
Either way the other project doesn't break -- it either updates to the new compatible version or stays behind on the last compatible release until it's been updated for the incompatible change. You should be confident having a robot automatically test projects with updated versions of dependencies and committing the change.
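As a minimal sketch of that kind of robot, assuming a Python project whose pins live in requirements.txt and whose tests run under pytest (all names here are illustrative, not a real tool):

```python
# Rough sketch of the "robot" described above: try the new dependency version,
# run the dependent project's own test suite, and only commit if it passes.
# Paths, the pin file format, and the test command are assumptions.
import pathlib
import subprocess

def try_upgrade(project: pathlib.Path, dep: str, new_version: str) -> bool:
    pin_file = project / "requirements.txt"
    original = pin_file.read_text()
    updated = "\n".join(
        f"{dep}=={new_version}" if line.startswith(f"{dep}==") else line
        for line in original.splitlines()
    ) + "\n"
    pin_file.write_text(updated)

    tests = subprocess.run(["python", "-m", "pytest", "-q"], cwd=project)
    if tests.returncode != 0:
        pin_file.write_text(original)   # stay on the last compatible release
        return False

    subprocess.run(["git", "commit", "-am", f"Bump {dep} to {new_version}"],
                   cwd=project, check=True)
    return True
```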
Of course, it's not always possible or desirable to have this degree of engineering around projects, but then the reality is that these aren't independent codebases, they are a single interdependent project with a single history and should be versioned/tested/released as such.
I understand your logic in theory, but not in practice. I maintain some 20 or so large applications that all use a shared data access layer. That DAL is safely cordoned off behind an interface layer, but I still have the same problem as the parent poster. If you make enough changes to any one of the subscribing applications, you will need to make an update to the DAL. If you update the DAL, you will need to make updates in the 19 other applications, and run the tests for all of them, if only to ensure they compile.
I could break the DAL further apart into 20 different pieces, one per application, but there is so much shared data access functionality between the applications that it doesn't make sense.
There's no "logic in theory" and "logic in practice" there's just logic and people failing to apply it in practice.
If you can't come up with a small, stable interface between the 20 applications and the shared component, then you don't have 20 applications + a shared component; you have one big app that is painfully maintained as separate repos across a large, arbitrary cross-sectional boundary.
If the history of the project is so interdependent as to be singular (everything is changed in lockstep across repos, effectively a single history) why not have a monorepo? If versioning them under the pretense that they are independent modules costs you so much effort, why do you labor to do so?
Just to be clear, wouldn't the alternative with a monorepo still require that you go back and forth with multiple teams if the commit is not backwards compatible? It seems like the main complaint you have is that it's difficult to wrangle a number of related pull requests, so perhaps switching to something like gitcolony [1] for code reviews would help.
> Just to be clear, wouldn't the alternative with a monorepo still require that you go back and forth with multiple teams if the commit is not backwards compatible?
No. You just send one diff that changes all the team's code and update everything in lockstep, so at any point in your history, everything is compatible.
Instead of you going back and forth with multiple teams, you're bringing them together to comment on a change in one place. You synchronize on version control history instead of needing to wrangle multiple teams across multiple repositories, and you no longer need to deal with fallout for code change compatibility. You just make the change. Everywhere. In one shot.
You may have to get multiple teams to review your change before being allowed to commit it. And you have to run all their tests. If there is a problem the whole thing will typically get rolled back, which is a drag because then you have to fix the issue, run tests again and get approvals again.
So, in practice, for large-scale changes that affect many teams, we still try to break up large patches into multiple smaller steps, even working in a monorepo.
A single commit is nice for small patches that only affect a handful of teams, though.
Never heard of gitcolony before. Looks interesting, thanks! That would probably solve one issue. Then there is another one.
I try my best to keep the head of the master branch in such a state that it can always be taken into use in all projects. Just last week one branch of one embedded device had a slight API change. Nothing big, but a backwards-incompatible change.
I branched the test automation core and made it work with the new API. All tests looked green and things looked nice. We agreed with the embedded device owner that we'd merge that change upwards soon. Soon as in one hour or so. I rushed and merged the test automation core changes to master.
At the same time I was working on another API change for one PC app. That branch was also looking good and those changes were merged upwards in both the test automation core and the PC app.
Now my test automation core master head was compatible with everything else but one embedded device, the one with the small API change that looked good in all tests. For some reason business did not want that change to go live yet with the device, so now I had changes in my test automation core that made it incompatible with the released embedded device.
Yes, it was my mistake to rush the merge. But because getting those changes upwards took two merges, one in the product itself and one in the test automation core, it was possible for them to get out of sync. If we had used a monolithic repository, it would have been just one merge and such a thing would not have been possible.
Sure, not a huge thing but still an annoyance I could live without.
Tests belong in the project they're testing. Test tooling belongs elsewhere. It's not clear what your setup is, but it sounds like the tests themselves live in another repo. That's bad. It's not bad to have a separate repo for your tooling/test runners. As you've just seen, new tests need to go out simultaneously with new code.
Yes, tests are with the code as are test resources that are product specific. In addition to those there are a lot of resources that are shared across all or most products that sit in the test automation core.
4. google3 (aka "the monorepo") officially using blaze (but often there were nested build systems - I remember one that used blaze to drive scons to build a makefile...)
The diversity of the build systems significantly steepened the learning curve when switching projects. During orientation, they told me "All code lives in the monorepo, and every engineer has access to all code", but this turned out to be not true at all. If anything it was the opposite: more build system diversity at Google than at other places I worked.
Can you quantify that? Honest question - Android does not seem like a small project to me, but perhaps it really is dwarfed by Google's web services.
What you wrote reminded me of a feeling I had: that these projects are these weird exceptions, only nominally part of Google. They have their own policies (style guide, code review, etc), have their own hardware, etc. The Android building even had their own desks (non-adjustable!)
In terms of percentage of engineers, by far the majority are working in Google3 ("monorepo"). Android and Chrome are the big exceptions, because they are open source projects, but most of us don't work on them.
You can quantify it. Look at the numbers in the OP article and think about how small Android must be: (1) all of the Android code fits on a single smartphone; (2) Android apps all have a corresponding server side.
You pretty much worked on that tiny 5% of projects that live outside of the monorepo. Good... job I guess? For 95% of engineers that will never happen.
That's not really true. More than 5% of engineers work on Android and Chrome, neither of which are developed in the core repo. There are a wide variety of other teams that choose to not work in the core repo for a wide variety of reasons.
This is the key comment in this thread. The article simply does not tell the full story. In fact, Google even released a nasty little tool called "git repo" that coordinates simultaneous accesses to multiple git repos.
Sure, but even "only 80% of Google" in a monorepo is still bigger than the combined size of most everyone else's constellation of small repos. And the the separation of repos is based on deployment environment: Server, desktop, and smartphone, where code has almost no cross dependencies because the environments of software never communicate via internal channels.
The article is a fair assessment for "traditional" projects at Google (server-side? hosted? native? not sure the best term. Think "gmail" or "search" or "brain").
As mentioned elsewhere, Android and Chrome are huge exceptions in that they're a) open source and b) pretty much standalone products not closely tied to the rest of the Google ecosystem.
Please don't do it unless you are Google and have built Google-scale tooling (gforce/Bigtable for the code cache).
benefits of monorepo
* change code once, and everyone gets it at once
* all third-party dependencies get upgraded at once for the whole company
cons (if you are not google)
* git checkout takes several minutes X number of devs in the company
* git pull takes several minutes X number of devs
* people running git fetch as a cron job, screwing up and getting into a weird state
* even after stringent code reviews, bad code gets checked in and breaks multiple projects, not just your team's project.
* your IDE (IntelliJ in my case) stops working because your project has millions of files. Requires creative tweaks to only include the modules you are working on.
* GUI-based git tools like Tower/Git Source don't work, as they can't handle such a large git repo
Google has solved all the issues I mentioned above, so they are clearly an exception to this, but for the rest of the companies that like to ape Google: stay away from the monorepo.
> git checkout takes several minutes X number of devs in the company
> git pull takes several minutes X number of devs
> people running git fetch as a cron job, screwing up and getting into a weird state
I feel like you are addressing teams of several hundred developers. Unless they commit large binary files each day, this is hardly an issue for smaller teams of a few dozen people.
> even after stringent code reviews, bad code gets checked in and breaks multiple projects, not just your team's project.
Revert such code immediately once detected by the CI. Which is harder to do if the changes and their adjustments are spread across a dozen repositories. Also, please compare the ease of setting up CI for a hundred repositories with doing it for a single one with tens of thousands of files.
> your IDE (IntelliJ in my case) stops working because your project has millions of files. Requires creative tweaks to only include the modules you are working on.
A monorepo does not mean you have to load everything in a single workspace. It means everything gets committed at once. If your tools cannot handle so many files or cannot be configured to work on subsets, blame your tools.
Yes, monorepos present some challenges, but handling hundreds of repositories is no better. Having done both, I prefer the former.
And yes, not everyone is Google, with thousands of developers and billions of lines of code.
Perforce works well for a monorepo at a certain scale. It also falls over with too much stuff, and when it falls over (due to capacity or otherwise), your entire company is dead in the water until it's resolved.
> your IDE (IntelliJ in my case) stops working because your project has millions of files.
This is not really an issue, because it depends on your local project files, not on the size of the repository. You only have problems with the IDE in a large project such as Chrome. Most projects will have moderate sizes that can still be handled by Eclipse.
To add, based on the experience of running ClearCase in a three-location multi-site setup in the past:
* Mapping the repository into the filesystem requires drivers which can add a whole dimension of fun if one has to support a larger team over multiple platforms.
* Doing replication is hard if you want uptimes of 99.9x%
Yes, done right it can work great. But usability does not degrade gracefully when things break.
The major build tools are insanely powerful, pragmatic, and maintained by dedicated teams. Volunteers also add features as needed.
CitC clients take seconds to create; many IDEs work (some better than others, with dedicated teams/volunteers supporting them), and internal GUI-based versioning, code review, testing, documentation, and bug tracking tools also work at scale.
> Since all code is versioned in the same repository, there is only ever one version of the truth, and no concern about independent versioning of dependencies.
This sounds like horror to me: it's essentially a forced update to the latest version of all in-house dependencies.
Interesting article though. It feels like there's a broader lesson here about not getting obsessed with what for some reason have become best practices, and really taking the time to think independently about the pros and cons of various approaches.
It's a "forced" update, but the downstream dependencies have veto power: if their tests fail, the change gets rolled back.
Of course in practice, it depends on how important the app is. If you break gmail it's definitely getting rolled back. If it's just a one-off abandoned tool somewhere doing something they're not supposed to do anyway, or some tests that flake all the time, maybe it will be ignored.
If you are upstream of a lot of important code, you might not be able to move as quickly as you like and that's because you have a lot of responsibility.
Migrations need to be planned carefully. Unlike in open source, you cannot make an incompatible change, bump the major version number, and call it a day. Migrating the downstream dependencies is your job. Version numbers are mostly meaningless and compatibility is judged by actually compiling code and running tests.
You soon learn that if you can fix a bug without changing a public API it's a whole lot easier. But if you need to change an API, there's a way to do it.
It's been a long, long time since I worked at Google, but IIRC the build system was like Maven: you could depend on the latest version of something, or a specific version, and only the dependencies you needed would get pulled from the global repo when you checked out code. If your code depended on a specific version of some piece of code, then you'd be immune to future updates to that piece of code unless you updated your dependency config. (Hopefully I'm remembering all of this correctly..)
Nope, that's not quite how it worked. Except for a few critical libraries that were pinned to specific versions, every build built against HEAD. However, a presubmit hook would run tests on the transitive closure of code depending on the changed code, and if anything broke, it was your responsibility as the committer to fix it or roll it back.
This is really what kept things stable. If you maintained a common library and wanted to update it, you had to either keep things backwards compatible, write an automated refactor across the codebase (which wasn't too hard because of awesome tooling), or you would get dozens of angry emails when everything broke. If you relied on a library, you had to be sure to write solid e2e tests or your code might silently break, and it would mostly be your fault.
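For readers outside Google, a rough approximation of that presubmit idea using open-source Bazel might look like the sketch below; this is not Google's actual tooling, and the changed-target list is assumed to come from whatever hook invokes it:

```python
# Sketch of a presubmit that tests the transitive closure of reverse dependencies
# of the targets touched by a change, using open-source Bazel.
import subprocess
import sys

def affected_tests(changed_targets):
    # rdeps() finds everything in the workspace that depends on the changed
    # targets; tests() narrows that set down to test targets.
    query = f"tests(rdeps(//..., set({' '.join(changed_targets)})))"
    out = subprocess.run(["bazel", "query", query],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

def main():
    changed = sys.argv[1:]          # e.g. //mylib:mylib (hypothetical target)
    tests = affected_tests(changed)
    if not tests:
        print("no affected tests")
        return
    subprocess.run(["bazel", "test", *tests], check=True)

if __name__ == "__main__":
    main()
```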
I definitely miss the big G, so many resources were dedicated to engineer productivity and tooling. It's what makes the monorepo work.
Thanks! I appreciate the correction. I miss a lot of the resources, too. I also miss the spirit of dedicating serious attention to code quality. Things like Testing Fix-it and Documentation Fix-it days were awesome.
> Things like Testing Fix-it and Documentation Fix-it days were awesome.
Can you elaborate on this? Are they just days where the team goes "Ok, no need to stress about new features today, let's just catch on on test coverage / documentation" ? If so, that sounds pretty wonderful.
It was company-wide. Every month or two, there would be a "documentation fix-it day" or something similar. The theme would vary: sometimes it was docs, sometimes unit tests, sometimes localization, etc. If memory serves correctly, the idea was that unless you had a high priority/production bug, you'd spend your day working on the theme of the fix-it.
Just curious, how does this automatic refactoring across the entire code base work? Do they build ASTs and then refactor? Or is it at a more surface level, where you are replacing text intelligently?
Yes, working in a product area that's directly responsible for generating $N*10^9 of revenue per year is just terrible. I'm sure they don't do ANYTHING interesting, technology-wise.
I'm being facetious and exaggerating a bit, sure. But for every engineer doing cool machine learning stuff, there are 10 mucking about with AdWords and AdSense front-ends -- on customer-facing stuff that they probably don't use themselves. It's energising to work on a product you use, it's draining to work on one you don't.
(And I guess I'm putting words into the mouth of the earlier poster, sorry about that. This is from my own experience, not theirs!)
Pretty much everything builds against head, and very little is in versioned archives.
There's a ton of fantastic artifact caching (that gets invalidated whenever source changes, which is frequently), distributed building, and some really nifty source code juggling.
Mmm, two things of note:
1) As has been mentioned, I'm fairly sure you can pin to a given version of a dependency.
2) Changes to upstream libraries trigger tests on consumers of that library, so breaking changes tend not to happen.
> Changes to upstream libraries trigger tests on consumers of that library, so breaking changes tend not to happen.
This doesn't prevent build-breaking changes; it just makes the team using the library aware of what is broken, so they can have a fix ready when they want to upgrade their dependencies.
Afaik you cannot submit code that breaks any test in the repo. I mean, there are ways to force it, but unless it's a special case you really don't want to.
I think the way you should think of it is that there is only one "canonical" version - otherwise the "canonical" version is fragmented, and the fragments can contradict each other.
And god damn, I hate Mark Shuttleworth for using that name for his company.
Without reading it, I've used monorepos and multiple repos, and far prefer the former. What people don't get is that any system consisting of software in multiple repos is really a monorepo with a really poor unit & integration test story. The overhead of managing changes to inter-repo dependencies within the same system is utterly insane — I'd say that it's somewhere between n^2 and 2^n, where n is the number of repositories (the exact number depends on the actual number of dependency relationships between repos).
In fact, after these several years, I'm beginning to think that 'prefer' is not the word: monorepos appear, more and more, to be Correct.
I have also worked with mono/multi repos. My conclusion is the same. A mono repo is much easier to manage. It is a shared repository but every team manages their own runtime instances. Each team releases at their own pace. The repository is always consistent, no need to manage inter-repo dependencies. You really get the best of both worlds.
Coincidentally, I have been reading "Building Microservices" from O'Reilly and I still don't see the benefits.
Unless the pieces of software have nothing to do with each other.
A client in Java, another client in C#, another one in Obj-C, and a server in C++ will be hard to test together, will share no code, and will often be developed in independent cycles, often by independent teams.
Totally disagree. Any serious development project with an application that needs to work across iOS, Android, Mac and Windows should share as much of its business logic as possible, in C or C++, which are usable on all those platforms. The UIs of course will be platform-specific, hopefully as thin as possible, but the majority of your code should be totally reusable.
Disagree. Typically, in this scenario, you're using some kind of protocol to communicate. In our case, it's protobufs: they are in fact one of the prime sources of breakages between repos, and one of the main reasons I wish we had a monorepo. Same would be true of json schemas, etc.
Oh but it wouldn't. While those clients might not share code, each of them does share an interface with a common dependency: the server. Having the server code and a given client code in the same repo is what you want, which means you want the server code and all the clients in the same repo.
Disagree. If they're all part of the same system and/or project, they ought to go in the same repo. Otherwise, you've not got one system/project at all - you've got several, and you shouldn't try to pretend otherwise.
You can of course just press on regardless anyway, but that doesn't mean it's a great idea!
- This is a technique, and it's a toolset, but most importantly it's a commitment. Google could have split this up many times. In fact, this would have been the "easy" thing to do. It did not. That's because this strategic investment, as long as you keep working it, keeps becoming more and more valuable the more you use it. Taking the easy way out helps today -- kills you in the long run.
- This type of work isn't just as important as regular development, it's more important than regular development, because it's the work that holds everything else together.
- In order for tests to run in any kind of reasonable amount of time, there has to be an architecture. Your pipeline is the backbone you develop in order for everybody else to work.
- You can't buy this in a box. Whatever you set up is a reflection of how you evolve your thinking about how the work gets delivered. That's not a static target, and agreement and understanding is far more important than implementation. I'm not saying don't use tools, but don't do the stupid thing where you pay a lot for tools and consultants and get exactly Jack Squat. It doesn't work like that.
The paper does not compare or evaluate this "commitment". It is a data point that monorepo scales to 86TB provided you heavily invest in infrastructure to support it. This is significant as many would have considered that "impossible" as in "it is impossible to make git scale to 86TB".
We don't know if the monorepo approach is worth it. Google believes it is (Facebook as well). Many others don't. The manyrepo approach also has advantages.
FTA, "...Over the years, as the investment required to continue scaling the centralized repository grew, Google leadership occasionally considered whether it would make sense to move from the monolithic model. Despite the effort required, Google repeatedly chose to stick with the central repository due to its advantages..."
The article goes into detail about both benefits and drawbacks.
> The manyrepo approach also has advantages.
Care to elaborate on any of these? (aside from the obvious "works fast on my machine")
It works well when most of your code is written in-house. If you have a lot of external dependencies -- not so good.
The problem is what version of a third-party dependency the various projects in the BIG repo should depend on.
Article mentions that: "To prevent dependency conflicts, as outlined earlier, it is important that only one version of an open source project be available at any given time. Teams that use open source software are expected to occasionally spend time upgrading their codebase to work with newer versions of open source libraries when library upgrades are performed."
So if you have a lot of external dependencies -- you need a dedicated team to synchronize them with all your internal projects.
The article doesn't say that Google needs dedicated teams to synchronize with external open source projects. The teams that use each external dependency are responsible for maintaining it, par for the course.
It can still be painful to use open source projects internally because of the monorepo's structure. There was/is a rule to have a maximum of one version of any given OSS package at a time (more or less).
In general, the rule was good, as it kept most teams up to date with the latest security patches, etc. It did, however, incur a fair bit of pain, as you frequently had to bump a myriad of dependents whenever bumping some package.
And npm packages were/are a real nightmare to manage due to that rule. Particularly when you have some popular npm packages that end up (transitively) depending on 3+ versions of a particular package.
When submitting a third party package to the repo you become the third party package maintainer. And it's your responsibility to make sure you don't break the world when updating it. If people using it want a newer version (only one version can live in the repo, by policy) it is generally their responsibility to manage that transition. It's been a long time since I managed a third party package, but that's my rough recollection.
The other thing is that third party packages are checked in as source. Everything is built from source at Google.
I have a question for Googlers: I keep hearing about refactoring all across the repo. How does that work with dynamic call sites (reflection, REST), etc.?
I mean, there's no way to prove something is actually used in those cases except for actually running the thing.
Do you just rely (and hope) on tests across all projects?
The automated refactoring tools are largely for C++/Java/Go. Reflection etc. are fairly rarely used in those languages at Google. They present a number of problems at Google scale, both in terms of run-time overhead and in terms of code obfuscation for new engineers reading the code. When they are used, it's often internal to some other library/minilanguage (eg. Guice or Protobufs) with well-defined semantics, and then the tools just special-case that code and rely upon tools exposed by it to track call sites.
- You use grep to try and find all the ways people are using it.
- You write and run a codemod to convert 70-90% of the call sites mechanically (a bare-bones sketch follows after this list).
- You manually fix the remaining ones. This is usually a good opportunity to throw a hackathon-style event where a bunch of people sit in a room for a few hours and drive it down.
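For what it's worth, the codemod step can be as simple as the regex-based Python sketch below; the old and new API names are hypothetical, and real tools usually work on ASTs rather than text:

```python
# Bare-bones codemod in the spirit of step two: mechanically rewrite the easy
# call sites and leave the weird ones for humans.
import pathlib
import re

OLD_CALL = re.compile(r"\bfetch_user\(\s*(\w+)\s*\)")   # old: fetch_user(uid)
NEW_CALL = r"fetch_user_by_id(\1, timeout_ms=500)"      # new signature

for path in pathlib.Path("src").rglob("*.py"):
    text = path.read_text()
    rewritten, count = OLD_CALL.subn(NEW_CALL, text)
    if count:
        path.write_text(rewritten)
        print(f"{path}: rewrote {count} call site(s)")

# Anything the regex couldn't match (reflection, string-built names, odd
# formatting) still has to be found by grep and fixed by hand.
```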
> You use grep to try and find all the ways people are using it.
The mere thought of this makes me think of the dark ages and a time in my career that I'd rather forget. All hail statically typed languages. Or languages that can easily be tooled against.
If Facebook uses Flow internally, tooling could be written against that. Hell, just removing the old function and seeing the compile errors would expose all the call sites if their whole code base is Flow typed.
As somebody who has done this a few times: you run all the tests of all the projects on your change, and you don't feel bad if somebody has code that breaks and doesn't have a test that caught it.
The Google build tool, blaze (which is what their open source Bazel is based on), does analysis of what could break from code changes, so it only needs to run tests related to that.
As someone outside of Google, I'm having a hard time seeing how this would actually work. Not as in, it can't be done, as in, is there actually empirical evidence that the supposed benefits (how do you know silos are lower than otherwise?) are happening as claimed because of the monolithic code? Do silos really come down, do big changes to underlying dependencies really get rolled out, or do people hunker down into their own projects and try to cut out as many dependencies as possible?
Perhaps the extra tools, automation, testing, etc helps to a large extent, I can see that being reasonable, but I don't see how they solve all the problems I have in mind.
Perhaps more so, if you've invested in all these automated tools, I am, perhaps (certainly?) ignorantly, not entirely certain what those tools inherently have to do with the choice of a monolithic code base. Couldn't many of them work on a distributed code base if they're automated? I mean, we're talking about "distributed" in the sense that it's all still in the one org here...I realise that in practice, this distinction between monolithic and distributed is possibly getting a bit academic...
I think your last paragraph sort of makes the point.
A lot of these things depend upon the ability to quickly and easily see how a given piece of code is being used universally. Whether that means a single repo or simply a single unified index of all code and the ability to atomically commit to every repo is immaterial, but the latter sounds a lot like the former.
I've found an interesting detail, does anybody know more about this?
"The team is also pursuing an experimental effort with Mercurial, an open source DVCS similar to Git. The goal is to add scalability features to the Mercurial client so it can efficiently support a codebase the size of Google's. This would provide Google's developers with an alternative of using popular DVCS-style workflows in conjunction with the central repository. This effort is in collaboration with the open source Mercurial community, including contributors from other companies that value the monolithic source model."
Are you under the illusion that a high-level language cannot "scale"?
Python is a high-level language, a bytecode language, a standard library, and a variety of interpreters and compilers. And a large community of shared libraries, forums, collaborators, etc.
Java is a high-level language, a bytecode language, a standard library, and a (smaller) variety of interpreters and compilers. Given their similarities, why would you expect that one language can, but the other cannot create a "scalable" software project?
> The Google code-browsing tool CodeSearch supports simple edits using CitC workspaces. While browsing the repository, developers can click on a button to enter edit mode and make a simple change (such as fixing a typo or improving a comment). Then, without leaving the code browser, they can send their changes out to the appropriate reviewers with auto-commit enabled.
Do they still maintain CodeSearch for themselves? Was it so much of a burden to maintain a reduced version of it for the public?
I don't know if it's the same, as I just heard of it from a YAPC talk a few days ago and haven't tried it, but it's called "Code Search" so it seems likely.
"It looks like local index and search tool like ack and ag. While above I'm talking about the whole service.
"
The tool that was open sourced by Russ uses the same technology/scheme that the original codesearch was built on.
Given Russ wrote code search (the service) as an intern in a few months, one would think you should be able to take the pieces and put the rest together.
:)
As for cost to maintain it for the rest of the world --
Look, if you have, say, a team of 3-4 people, and your mandate is mainly to support internal developers, and there is plenty to do there, you just aren't going to end up keeping external folks happy.
This is likely true of almost anything in the software world.
Even if it works, people want to see it evolve and continue to get better, no matter what "it" is.
If the next question is "why is internal so different from external that this matters", I could start by pointing out that, for example, internally it doesn't need to crawl svn, etc. repositories to be useful. There are tons and tons and tons of things you don't need internally but need externally. What it takes for a feature to be "good enough" to be useful is also very different when you are talking about the world vs 20,000 people.
So it's not really even a question of "cost to maintain" in some sense.
Steve Yegge has spoken about the search tool his team built to solve the "has someone else written this?" and "refactor this" for all the languages and systems across Google.
It is really sad that Google has internal development tools that run circles around Visual Studio but decides not to share them with the outside world. I don't mean as open-source software; I personally would pay money for a Google Studio, and I know my company would also pay for licenses if these were offered. It's especially painful on Linux, where the best dev tools are 30 or so years old.
Does this mean that one Googler could checkout the complete system, and sell it or put it online? How many people have access to the complete repository? How big is one checkout?
Checkouts are apparently instantaneous. Behind the scenes, there is a FUSE filesystem that faults in files by querying Google's cloud servers. So it does not require a significant amount of space (but does require fast access to Google's servers, which can be problematic).
Almost all engineers have access to the "complete system," which is really Google's server-side software. Other repos like Android have historically been more locked-down, but there's been some recent effort to open them up within Google.
Presumably if you tried to copy all of the files, you'd first be rate-limited, then get an access denied, and lastly a visit from corporate security. I wouldn't want to try.
> Presumably if you tried to copy all of the files, you'd first be rate-limited, then get an access denied, and lastly a visit from corporate security. I wouldn't want to try.
And there's no way it'd fit on a single hard drive.
I worked at a company where our engineering team was working with an engineering team at Google to help Google move one of their systems from one vendor to us.
We wanted to send them a script to help with the conversion process, and they flipped out and told us not to send them any code until we became authorized in some special capacity.
They then told us how, a week or so before, a coworker had sent himself some code so he could work on something from home (according to him at least), and within 20 minutes of emailing himself that code, security was already escorting him out the door.
That's 20 minutes for automated filters to catch it, security to review it, HR to process the termination paperwork, and security to go to his desk to escort him out.
There are Googlers who could check out the complete system, and I guess sell it online. There are pieces that aren't available to all Googlers, but the _large_ majority of the code is there, and browsable by everyone.
Generally, you don't check out all the code you're using, just what you modify. You work in repos that exist virtually, because checkouts would be huge.
There are advantages and disadvantages to consider. Clearly monolithic repositories allow you to leverage commonality but it’s not free.
For example, single-repository environments may require you to check out everything in order to do anything. And frankly, you shouldn’t have to copy The World most of the time. Disk space may be cheap but it is never unlimited. It’s a weird feeling to run out of disk space and have to start deleting enough of your own files to make room for stuff that you know you don’t care about but you “need” anyway. You are also wasting time: Subversion, for instance, could be really slow updating massive trees like this.
There is also a tendency to become too comfortable with commonality, and even over-engineer to the point where nothing can really be used unless it looks like everything else. This may cause Not-Invented-Here, when it feels like almost as much work to integrate an external library as it would be to hack what you need into existing code.
Ultimately, what matters most is that you have some way to keep track of which versions of X, Y and Z were used to build, and any stable method for doing that is fine (e.g. a read-only shared directory structure broken down by version that captures platform and compiler variations).
Companies that do this don't require you to check out the world; that would be craziness. Perforce allows for partial mappings and is used by pretty much all the big companies. Even Microsoft's internal Source Depot is just a fork of Perforce.
Dependencies are usually handled by a binary import/export system where you pull modules from a shared "build the world" system from a known checkpoint.
Both git and hg support shallow (limited history) and sparse (selective files) checkouts as well; Facebook has a blog post somewhere about how they built it into hg.
That's not really the entire world though, because the history before the shallow clone will not be pulled or fetched. You'll get all the new stuff though.
I don't think the scale of these places are fully appreciated. Your local enlistment could be a few hundred gigabytes and that is only the head revision of stuff you work with daily. Keeping more than just the current revision around is a lot of data for even a beefy workstation.
With history as a guide, we will look back on this for the tire fire it is. The monorepo and projects like Chrome and Android look to me like a company that is trying its best to hold itself together but is bursting at the seams, the same way Microsoft did in the '90s with Windows NT and other projects. Googlers frequently use appeal to authority to paper over the fact that they are basically the new Microsoft.
That was just my reaction to the article. I saw Google and ACM so thought it would be something good, but in the end I fell victim to the authority click bait. The article is written in a scholarly style, but it is neither academic nor profound. It is the engineering equivalent of an intra-office memo describing how a massive company is going to organize their sales regions and Take Over The World. Imagine for a moment if IBM, Microsoft, or the Department of Defense wrote this exact piece today... would ACM have published it? Would it have made the front page of HN? I simply felt disappointed and expressed that reaction.
For examples of Google non-superiority (remember, this is a hacker/entrepreneurial forum, we should seek solidarity in outdoing behemoth entities with agility), I simply encourage you to think for yourself and not put any credence into lore, methodology, or tech simply because it comes out of Google. I see a 1:1 comparison between Android and the non-WinNT kernel Windows releases. Google put a festering pile of code out and allowed handset makers and carriers to basically never patch any type of vulnerability. The permissions model and app overreach are just barely now contained in Android 6... seven years after release. Chrome bundles tons of third party libraries... it's another moving train wreck with enough bodies to somehow deal with the naive vendoring, scope creep, and general upkeep, but it's still a nightmare to correctly build and package for an OS. By comparison I have immense respect for the Servo developers, who are making an interesting reach with far fewer resources than Google.
One thing to consider is that monorepo tooling is (outside of Google) still pretty immature.
At Square, we have one main Java "monorepo", one Go "monorepo", and a bunch of Ruby repos. The Java repo is the largest, by a huge factor (10x the Go repo, for example).
The Java repo is already noticeably slow when doing normal git operations. And I'm told we've seen nothing of the pain the folks at Twitter have had with git: their monorepo is huge.
We periodically check to see how the state of Mercurial and Git is progressing for monorepo-style development. Seems like Facebook has been doing interesting stuff with Mercurial.
But I still miss the fantastic tooling Google has internally. It really is so much better than anything else I've seen outside.
We have ~90k lines of Java code. I don't think it will be a problem unless it grows tenfold. We spend time grooming it: removing old code, etc. I believe that is the case for most companies, unless you are Google, Facebook, Square, etc.
Mozilla may be a good company to look at that is open source and has a large codebase. They use a monorepo for Firefox (in mercurial). mozilla-central is around 154,000 files.
Won't the performance of version control software largely depend on the size of the project history and not necessarily LoC? I work on a project where we just surpassed 100k commits in Mercurial, which includes largefile assets. Cloning is a lengthy process, and other operations such as push/pull/stat and the occasional rebase are also slow.
Our Go monorepo has 1,325,332 lines of non-comment code, according to `cloc`, although about half of that is generated protobuf code. I'm still waiting for `cloc` to finish on the Java repo…
I have to fix a lot of broken URLS from the submissions sent in to hackaday.com. I guess I don't even think about fiddling with them until they show up anymore, haha:) What a weird skill.
For me, it seems that the power of this model is not that "we have a single head", but rather that "we can force everyone to use the newest library version (except for the times when a whole new interface is created)".
Let's suppose we have a tool that can
* automatically check out the HEAD of each dependent repo
* run complete integration tests across all the repos before any push/check-in.
This will work fine even with a multi-repo model, won't it?
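As a rough sketch of what such a tool could look like (the repo URLs and the test command are made up, and a real version would need caching and parallelism):

    import subprocess, sys

    # Hypothetical dependent repos; in practice this list would come from
    # a manifest or dependency metadata.
    REPOS = {
        "libfoo": "https://example.com/libfoo.git",
        "service-bar": "https://example.com/service-bar.git",
    }

    def main():
        for name, url in REPOS.items():
            # Check out the HEAD of every dependent repo.
            subprocess.run(["git", "clone", "--depth", "1", url, name], check=True)
        # Run the complete integration suite before allowing the push.
        result = subprocess.run(["python", "-m", "pytest", "integration_tests/"])
        sys.exit(result.returncode)

    if __name__ == "__main__":
        main()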
Also, as mentioned earlier by others, the reason Google can do this is that Google can
* maintain a powerful cloud-based FUSE file system to support the fast checkout
* run automated build tests before any submits to ensure the correctness of any build
So they don't need to maintain multiple dependency versions (most of the time).
How does Google deal with external dependencies? E.g. the Linux kernel. Do they have a copy of Linux in the monorepo for custom patches? Is there a submodule-like reference to another repository? Is there a tarball in the monorepo, a set of patches, and a script to generate the Custom-Google-Linux? What happens when a new (external) kernel version is integrated?
It's amazing to read this kind of article, because they are the best answer to all those very opinionated (or simply arrogant) people who claim that "one repository is s*" or similar, such as "this technology is better than that one", "unit tests yes/no"; the list of religious wars could go on forever ...
If my central package repo had genuinely reproducible builds, such that I could download the source, run the build process and know that I had the same output that the repo held, then I think that I would love to have a setup where:
- I could "add a dependency" by having an automated process that downloads the source, commits it as its own module (or equivalent) in my source repo, tags it with the versioned identifier of the package, and then builds it to make sure it matches what the package repo holds.
- I could make local modifications to the package if I needed to, and my source control would track my deviation from the tagged base version
- I could upgrade my package by re-running the first step, potentially merging it with my local changes.
Hmm, I think I just described the ultimate end state for Go package management...
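A rough sketch of that first step, assuming a package repo that publishes source tarballs plus a hash of the expected build output; the make invocation and the dist.bin path are placeholders for whatever the package's build actually produces:

    import hashlib, subprocess, tarfile, urllib.request

    def vendor(pkg, version, source_url, expected_build_hash):
        tarball = f"{pkg}-{version}.tar.gz"
        urllib.request.urlretrieve(source_url, tarball)
        with tarfile.open(tarball) as tf:
            tf.extractall(f"vendor/{pkg}")
        # Commit the unpacked source as its own module, tagged with the
        # upstream version so local deviations can be tracked against it.
        subprocess.run(["git", "add", f"vendor/{pkg}"], check=True)
        subprocess.run(["git", "commit", "-m", f"vendor {pkg} {version}"], check=True)
        subprocess.run(["git", "tag", f"vendor/{pkg}/{version}"], check=True)
        # Reproducibility check: build it and compare against what the
        # package repo claims the output should be.
        subprocess.run(["make", "-C", f"vendor/{pkg}"], check=True)
        digest = hashlib.sha256(open(f"vendor/{pkg}/dist.bin", "rb").read()).hexdigest()
        assert digest == expected_build_hash, "build does not match the package repo"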
It works primarily because code health is a number-one goal here: everything goes through rigorous mandatory code review, nothing is submitted into the repository without running all affected test targets, and the goal is that everyone can pretty much "always" (for various definitions) build off head.
It also helps that there are no walls between teams and projects, and that code sharing is a universal value across the company. At least outside of Chrome/Android, which are in their own worlds.
This works because the interests of developers of the dependency and the dependent part are aligned, or can be aligned/decided in the case of conflicting interests.
For the rest of us, we need the practical ability to make changes that might break dependencies, and because of that, the ability to pin to specific old versions and to create alternate or forked incompatible versions, simply because it allows people to develop in the direction they want without being tied down by others who may want a different direction, or no direction at all because the code isn't maintained anymore.
The main reasons (as usually is the case) are social/political, not technical.
The monorepo solves source-to-source compatibility issues, but it doesn't solve the source-to-binary compatibility issues. For that you need a solid ABI, possibly versioned, unless every code checkin redeploys every running process everywhere.
Say version N of the code is compiled and running all over the place, and you make a change to create N+1. Well, if you don't understand the ABI implications of building and running your N+1 client against version N servers (or any other combination of programs, libraries, clients, and servers), then you'll be in a mess.
And if you do understand those ABI boundaries well and can version across them, I'm not sure you need a monorepo much at all.
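For a toy illustration of the kind of boundary that matters (field names and version numbers are made up), consider an N+1 client that still has to talk to servers running version N:

    import json

    SERVER_VERSION = 1          # the old binary still running in production
    CLIENT_MIN_SUPPORTED = 1    # oldest server version the N+1 client accepts

    def old_server_response():
        # Version N of the server never sends "region"; only N+1 does.
        return json.dumps({"version": SERVER_VERSION, "user_id": 42})

    def new_client_handle(raw):
        msg = json.loads(raw)
        if msg["version"] < CLIENT_MIN_SUPPORTED:
            raise RuntimeError("server too old for this client")
        # New fields have to be optional with a sane default, otherwise the
        # N+1 client breaks against every N server still deployed.
        region = msg.get("region", "default")
        return msg["user_id"], region

    print(new_client_handle(old_server_response()))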
I find it really hard to locate anything when it's split into a bunch of smaller repositories. If you're going to do that, you should at least have one master repository that adds the tiny ones as git submodules.
This is a very interesting article. I believe there is value in using libraries, but there is something to be said for monorepos. Interesting that Google and Facebook are working together to extend Mercurial for large repos.
At a previous job I had Google to thank for the team's god-awful decision to choose Perforce over git, thanks to some silly whitepaper or article. They acted like git was some fringe version control system that no one would use professionally, just for fun toy projects.
Perforce is a perfectly reasonable choice given certain technical and organizational constraints. It is a bad choice for others. Git is a good choice for many projects. It is a terrible choice for others, and was a much worse choice in the past.
My experience working in large corporations and smaller companies with both approaches tends to make me lean towards the multi repo approach.
Some details:
* Amadeus:
huge corporation that provides IT services for the airline industry, handling the reservation process and the distribution of bookings.
Their software is probably among the biggest C++ code bases out there.
We were around 3000 developers, divided into divisions and groups.
Historically they were running on mainframes, and they were forced to have everything under the same "repository".
With the migration to Linux they realized that that approach no longer worked at the scale of the company, and every team/product now has its own repository.
All libraries are versioned according to the common MAJOR.RELEASE.PATCH scheme, and patch-level upgrades are applied transparently; major or release upgrades, however, have to be specifically targeted (a tiny sketch of this rule follows at the end of this comment).
What matters more to them is how software communicates, which is through versioned message APIs.
There is also a team that handles compatibility across all the libraries and packages them into a common "middleware pack". When I left around 2012 we had at least 100 common libraries, all versioned and ready to use.
* Murex:
financial software used in front/back office for banks.
We had one huge Perforce repo; I can't even begin to tell you what a pain it was.
You could work on a project for a day, then have to wait weeks for a slot to merge it into master.
Once you had a slot to merge your fix into master, chances were that the code had changed somewhere else in the meantime and your fix couldn't be merged anymore. That led to a lot of fixes being redone on a pre-merge branch, manually in the Perforce diff tool.
Also given the number of developers and the size of the repository, there was always someone merging, so you had to request your slot far in advance.
Maybe the problem was that the software itself was not modular at all, but this tends to be the case when you don't force separation of modules, and the easiest way to force it is to have separate repositories.
* Small proprietary trading company:
We didn't have a huge code base, but there were some legacy parts that we didn't touch often.
We separated everything in different repos, and packaged all our libraries in separate rpms.
It worked very well and it eased the rebuild of higher-level projects. Where releasing a project used to take ~1h, with the libraries separated it took only 5 minutes. It worked well because we rarely changed the base libraries that everyone depended on.
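The upgrade rule from the Amadeus section, as a tiny sketch (the version strings are just examples): a new version is picked up transparently only when the major and release components are unchanged.

    def is_transparent_upgrade(current, candidate):
        # MAJOR.RELEASE.PATCH: only patch-level changes are applied
        # automatically; anything else has to be explicitly targeted.
        cur_major, cur_release, _ = current.split(".")
        new_major, new_release, _ = candidate.split(".")
        return (cur_major, cur_release) == (new_major, new_release)

    print(is_transparent_upgrade("3.2.7", "3.2.9"))  # True: patch only
    print(is_transparent_upgrade("3.2.7", "3.3.0"))  # False: must be targeted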
A lot of the benefit that comes from this code storage method could be had without it; it doesn't really seem like the best solution.
The presenter in the video linked in this thread says that this is very advantageous due to cross dependencies. I don't think that is the correct way to handle a cross dependency.
I'd much rather handle it by abstracting your subset of the problem into another repository. Have some features that two applications need to share? That means you're creating a library, in my mind. This is much better suited to something like git, as you can very simply use submodules to your advantage.
Hell, you can even model this monolithic system within that abstraction. Create one giant repository called "libraries" or "modules" that just includes all of the other submodules you need to refer to. You now have access to absolutely everything you need, just as in the Google internal code base. You can also take advantage of having a master test system to put overarching quality control on everything.
This can be done automatically: pull the git repo, update all the submodules, run your test platform.
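A minimal sketch of that automation (the repo URL and the test command are placeholders, not anyone's real setup):

    import subprocess, sys

    def refresh_and_test(repo_url, workdir="modules"):
        # Clone the top-level "modules" repo that aggregates the shared libraries.
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        # Pull every shared library pinned as a submodule.
        subprocess.run(["git", "-C", workdir, "submodule", "update",
                        "--init", "--recursive"], check=True)
        # Overarching quality gate across all the shared code.
        return subprocess.run(["python", "-m", "pytest", workdir]).returncode

    if __name__ == "__main__":
        sys.exit(refresh_and_test("https://example.com/modules.git"))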
I'd say that's a better way to handle it: creating simple front-end APIs for all of the functionality that needs to be shared.
This is more of a rhetorical question. As a tech minimalist, the preferable answer by a long shot would be that, internally, Google is keenly aware of the severe bloat and technical debt of its codebase and has clear plans going forward to drastically reduce the scale, by more than 1000x at least, without sacrificing any of the features/bug-fixes/performance of any of the code.
Right. Google is going to reduce the code size to a million lines of code. For Search, YouTube, Maps, Docs, GMail, Hangouts and all the infrastructure projects that go with it. (And probably a ton of things I forgot to mention)
Compare that to, say, Apache alone: 1.8M lines of code. Or Riak: 250,000 lines of code.
Just because it sounds like a lot to you doesn't mean it actually is quite as much as you think.
One of the reasons: at that scale the cost of identifying and refactoring reusable code into shared modules just isn't worth it. Your project may have something in common with my project, but I wouldn't want to refactor my code into something that your project can use because I don't want to maintain that dependency. I expect the reverse to be true, so you often copy code and build very purpose-driven code for your project.
It's not like anyone could be well versed in more than a tiny percentage of the entire codebase, so the gains from that kind of reuse just aren't there.
Realistically, I don't think you can have a minimalist, clean and focused code base with thousands of engineers pounding away at the keyboard. You ought to make some trade offs with that many employees.
Google's operations probably require X lines of clean code added every day by 10% of the engineers (so ~5,000 instead of ~50,000 engineers), but because of the sheer number of engineers Google has, who are all supposed to do something continuously to show performance, Google has ended up with ~10X or ~100X lines of daily bloat accumulation.
Sounds like a textbook example of too-many-hands-spoiling-the-broth.
It's the next sentence in the article.
> The total number of files also includes source files copied into release branches, files that are deleted at the latest revision, configuration files, documentation, and supporting data files; see the table here for a summary of Google's repository statistics from January 2015.
I wonder if they also distribute binaries internally. Otherwise, setting up a developer machine could take really long. Like installing Gentoo Linux :)
With CitC meaning all the source is instantly "locally" available and up to date, and something like bazel.io meaning you always have a reliable and fast way to rebuild something out of cached/distributed build outputs[1], you don't need to distribute binaries in the general case.
Whoever needs a tool can cd to the source of the tool, build it right then and there, and run it.
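A toy sketch of the cached-build-output idea (nothing like Bazel's actual machinery, just the principle): key the output by a hash of the inputs, and only rebuild on a cache miss.

    import hashlib, os, shutil, subprocess

    CACHE_DIR = ".build-cache"

    def input_digest(sources):
        h = hashlib.sha256()
        for path in sorted(sources):
            with open(path, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def build(sources, output, build_cmd):
        cached = os.path.join(CACHE_DIR, input_digest(sources))
        if os.path.exists(cached):              # cache hit: reuse the old output
            shutil.copy(cached, output)
            return
        subprocess.run(build_cmd, check=True)   # cache miss: build for real
        os.makedirs(CACHE_DIR, exist_ok=True)
        shutil.copy(output, cached)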