Why Google Stores Billions of Lines of Code in a Single Repository (2016) (acm.org)
478 points by bwag on July 24, 2018 | 281 comments

I feel terrible for anyone who sees this and thinks, “ah! I should move to a monorepo!” I’ve seen it several times, and the thing they all seem to overlook is that Google has THOUSANDS of hours of effort put into the tooling for their monorepo. Slapping lots of projects into a single git repo without investing in tooling will not be a pleasant experience.

Same line of thinking, just different conclusions.

I feel terrible for anyone trying to run a company with open-source-style independent repos. On a popular GitHub project you have MANY potential contributors who will tell you if a PR or a release candidate breaks API compatibility, etc. In open source, thousands of hours are dedicated to fixing integration issues caused by the (unavoidable) poly-repo situation.

Monorepos in companies are relatively simple. You need to dedicate some effort to your CI and CD infrastructure, but you'll win by orders of magnitude through avoided integration issues. Enough tooling is out there already to make it easy on you.

Monorepos' biggest problem in an org is funding, as integration topics are often deprioritized by management. For some reason "we spend 10k per year on monorepo engineering" is a tough sell for orgs, who seem to prefer to "spend 5k for each of the 5 teams so that they maintain their own CD ways and struggle with integration, which incurs another 20k that just isn't explicitly labeled as such".

Developer team dynamics also play a role. I have observed the pattern now multiple times (N=3):

* Developers have a monolithic repo that has accumulated a few odd corners over time.

* The feeling builds up that this monolithic repo needs to be modularized.

* It is split up into libraries (or microservices); this is kind of painful, but feels liberating at first (now John finally does not break my builds anymore).

* Folks realize: John doesn't break my builds anymore, but now I need to wait for integration on the test system to learn whether he broke my code, and sometimes I only learn it in production.

* People start posting blog posts on monorepos.

That pattern takes 2-3 years to play out, but I have seen it at every job I've worked.

We have shared components between some of our projects; I'm not sure how a monorepo would fit in here. For integration there are a lot of build tools and repo-management apps available. For us it solves the problem of having no dependency between the versions of the same library used across multiple products.

> It is split up into libraries (or microservices)

It's a frequent problem to conflate organization/modularization with lifecycle/version management.

You can have a well-organized codebase just as easily in a monorepo.

That's a separate question from managing the lifecycle of the code. (What is released, and when? What tests are run? What process approves a change?)

Working as a dev with academic teams, I usually use many repos for "damage control", as git-ignorant scientists will dump irrelevant files into a repo.

With that in mind, is a monorepo a universally good approach, or is it more dependent on the good behavior of team members than a polyrepo?

I don't think there is a one-size-fits-all solution, especially if you can't expect basic knowledge of git.

A monorepo requires a good Continuous Integration infrastructure if it is supposed to work. Unless those small component projects are unit tested, you will not benefit from a monorepo.

Suppose your projects share a utility library `lib_a`. In a polyrepo situation, your projects will probably use it at different versions, which means coordination effort is necessary to get everyone onto the latest release. A monorepo would enable the developers of `lib_a` to get feedback from the downstream test suites directly on whether the changes they make break user code, so they can make their changes less intrusive up front. They can also roll out security-relevant changes much more easily. A monorepo will make the projects more homogeneous, which facilitates integration and operations (there are exceptions, of course).
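To make that feedback loop concrete, here is a minimal, hypothetical sketch (the `slugify` function and the downstream test are invented for illustration): in a monorepo, `lib_a` and its consumers live in one tree, so a maintainer's change is run against the consumers' tests in the same CI pass, before it merges.

```python
# Hypothetical monorepo layout:
#   lib_a/strings.py          (the shared utility library)
#   project_x/tests/...       (a downstream consumer's tests)

# lib_a/strings.py
def slugify(title):
    # A lib_a maintainer changing this behavior sees the downstream
    # test below fail in the same CI run, before the change lands.
    return title.strip().lower().replace(" ", "-")

# project_x/tests/test_slugs.py (downstream consumer)
def test_slug_is_url_safe():
    assert slugify("Hello World") == "hello-world"

test_slug_is_url_safe()
```

In a polyrepo setup, by contrast, `project_x` would only discover the breakage after the next `lib_a` release was published and picked up.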

Is there frequent code reuse? If so, monorepos are really nice. If not, separate repos make more sense.

I'm honestly getting a little tired of the repetition this cycle causes. Monorepos are Wrong, "too many" repos are also Wrong, and everyone needs to realize that strawmanning the opposing side is what really wastes our time, not broken builds or waiting on dependencies to build.

Monorepo people ignore the learned lessons of those who came before us, and are trying to drag their teams back into a simpler time that, while nice, does not exist anymore. If you use any dependencies at all, you don't live in a monorepo world, and lying to yourself and your coworkers will only leave you confused and angry that your expectations are constantly not being met.

The solution isn't to split every single component into its own repo, but pretending that's what anyone rational is proposing is not engaging with the best form of the argument. It's not always completely clear how to split up a growing codebase, but claiming that it's not usually worth splitting up is Wrong.

Both monorepos and "micro repos" end up falling apart at scale without some devops work involved. Either will work if you only have a few dozen projects. Neither will work once you hit tens of millions of lines of code.

But people seem to forget that it wasn't that long ago that git didn't exist, and making multiple repos was a pain in the butt. Managing multiple repos locally was hell. Monorepos were the norm.

Then, as the state of version control ramped up and making repos became easy, and as having so much code in one repo caused performance issues (overnight CVS/SourceSafe/SVN pull on your first day at work, anyone? Branches that took hours to create?), people started making one repo per project. The micro-service fad made that a no-brainer.

Now, for companies like Facebook and Google, or really any company that wrote code before the modern days and has a non-trivial amount of it, switching was not exactly a simple matter. So they just poured their energy into making the monorepo work. They're not the only ones to do it either (though not everyone has to do it at Google, Facebook, or Microsoft scale, obviously, so it's a bit easier for most). And so it works. And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it were a big deal or not a solved problem.

People also seem to forget that "Monorepo" or (many) "Microrepos" is not a binary choice.

You can have both tiny repositories which do a single thing and large repositories that consist of many projects. It's totally cool to have both, assuming your team can be trusted to make the appropriate choices as they create new projects.

> And then people forget how to make distributed repos work and claim things like "omg I have to make 1 PR per repo when making breaking changes!", as if it was a big deal or it wasn't a solved problem.

Is this a solved problem? I typically do make one PR per repo to resolve breaking changes, though it's certainly not a big deal. Still, if there's an easier way, I'd love to hear about it!

> Is this a solved problem

I don't mean that it's magical, just that it isn't sorcery either. Instead of making a breaking change, add a new method and deprecate the old one. Update the projects, then get rid of the old deprecated method. Because the repos are distinct, you can do this one by one, so some projects can reap the benefits without having to wait until all the problems are solved.
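A minimal sketch of that deprecate-then-remove cycle in Python (the `UserClient` class and method names are hypothetical, invented for illustration):

```python
import warnings

class UserClient:
    """Hypothetical client showing the deprecate-then-remove cycle."""

    def fetch_user(self, user_id):
        # New API: downstream projects migrate to this one by one.
        return {"id": user_id}

    def get_user(self, user_id):
        # Old API: kept working, but it warns until every caller has moved.
        warnings.warn(
            "get_user is deprecated; use fetch_user instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.fetch_user(user_id)
```

Consumers keep calling `get_user` on their own schedule; once no callers remain, the deprecated method is deleted in a final, now non-breaking, change.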

Some people in this thread act like it's freaking impossible. Avoiding breaking changes in APIs, or proper deprecation strategies, is an art everyone developing software should know: sooner or later they'll have to contribute to an open-source project, or make a more complicated breaking change, or SOMETHING, and will have to deal with it. Even if they use a monorepo. And when it happens, you don't want it to be the first time anyone has dealt with it.

We have about 400 repos in a team of about 20 developers. We do have extensive tooling to help coordinate all of these, but configuration management is still by far the biggest engineering challenge that we face.

I don't recall having such issues when I was working with Subversion and Perforce.

On the other hand, not everything was rosy in the 'good old days': MS Source Safe was (by far) the worst VCS experience that I have ever had.

Sounds excessive, and beyond the norms of what most people would encounter when not having a monorepo, i.e. the opposite extreme.

Yeah. It is a bit excessive, and it is a PITA to manage.

You can also just have tooling to find every reference to a function and then refactor them all at once, sending it out in a single PR, but that's a bit more advanced.

Microsoft went from multiple smaller repos for windows to one large one: https://blogs.msdn.microsoft.com/bharry/2017/05/24/the-large...

Absolutely! At some point you must invest in your tools. (Early, in my opinion.) I think the clarification I’d offer is that in the age of GitHub the “standard” model is multiple repos, so you’re actually giving up some tooling if you just shove everything in a single repo.

(I’m also not sure I’d generally categorize tools work as “dev ops,” though I can certainly see how they end up intertwined.)

I've seen hardly any tools to manage dependencies across multiple repos. Modifying multiple repos at the same time isn't an issue I see many resources devoted to, and managing those cross repo versions is almost never done well. In comparison, both buck and bazel offer pretty mature monorepo management tooling. On the VCS front, you can take native git/HG a long way.

>Both monorepo or "micro repo" end up falling apart at scale without some devops work involved

Wouldn’t any project fall apart without devops work?

yeah, that's my point.

We moved to a monorepo about 2 years ago and it has been nothing but success for us.

We have quite a few projects but only 4 major applications. It may be that a few of our projects intertwine a bit, so making spanning changes across separate repositories was a pain: separate PRs, etc. Now changes are more atomic. Our entire infrastructure can be brought up in development with a single docker-compose file, with all development apps communicating with each other. I don't think we've had any issues that I can recall.
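For reference, a compose file for that kind of setup might look roughly like this; the service names and directory paths below are invented, since the actual file isn't shown in the comment:

```yaml
# Hypothetical sketch only: one compose file at the monorepo root
# wires the applications together for local development.
version: "3.8"
services:
  api:
    build: ./services/api      # each app builds from its own subdirectory
    ports: ["8080:8080"]
  worker:
    build: ./services/worker
    depends_on: [api]
  web:
    build: ./apps/web
    ports: ["3000:3000"]
    depends_on: [api]
```

Because every service builds from a subdirectory of the same repo, `docker-compose up` always runs all of them at the same commit, which is what makes spanning changes easy to test locally.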

We are a reasonably small team though, so maybe that is part of it.

My previous job had a monorepo for the website and backend (but not the mobile apps), which was insane to work with (as a coder I had a dedicated 128-core box to work on; some engineers had more than one, less intense engineers shared one), and a substantial amount of my time was spent just finding code. I guess most engineers just end up working in some nook, so constantly searching code becomes less of an issue (it never did for me), but the code/debug cycle was dreadful.

I should add that a huge amount was invested in tooling. We had an in-house IDE with debug tools that could step through serverside code. We had a highly optimized code search tool. We had modified a major version control system so it could handle our codebase. (Indeed we picked our version control system because we needed to fork it and the other major version control system was less amenable to our PRs.)

My current job we have a micro service architecture and lots of small, focused repos. Each repo is self-documented. Anyone can checkout and build anything. We don’t need obscene dev servers. We have not hugely invested in tools or workflow.

Client apps are unavoidably larger repos than the services apps.

Based on my personal experience, I think monorepos are nuts.

You seem to have conflated a bunch of different things and confused them with a monolithic repository. There's no reason why a monorepo requires you to have a gigantic development box ... you identify and compile only the transitive dependencies of your target, not every line of code in the repo.

I recently saw a question on Quora asking whether the free food at Google boosted productivity. The reply that seemed strange at the time was that being able to focus on their job, and not having to do a bunch of other stuff, is what boosted productivity at Google.

I get some perspective from the comments above. There is seemingly an army of engineers at Google that keeps the monorepo functioning. I was at a meeting about bazel and angular. I thought I'd ask how they do things at Google. To my surprise, the presenter said he is not at liberty to discuss how things work at Google. I guess it wasn't so surprising in hindsight. I mean what would I do with that information, right? It would be way too overkill for my tiny crud application.

How is this

easier to browse than this?

You still need to find the project A repo if you don't use monorepos. And even if you do use monorepos, everything doesn't have to be one monolithic build hogging down your IDE; you can still have microservices with the code for each hosted in the same repo. You seem to conflate monorepos with lots of other things.

In the former, each project is a self-contained unit, and if I'm working on project B I can forget project A even exists, which is lovely because I've got enough to deal with on B as is. Each project can be branched individually, the log of project B is not polluted with commits to project A, and I can rebase without getting a bunch of commits I don't care about.

The latter forces me to be aware of the entire universe in that repo.

In the latter you can also forget project A exists. git log works in subdirectories out of the box, and if you didn't touch any files in project A, a rebase will be trivial, without any merge conflicts. Branching is also free (unless you are still in the SVN stone age that requires a copy of each file), so it doesn't matter whether you branch the whole monorepo or just a single project.
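The subdirectory-scoped log can be demonstrated with a short, self-contained script (the `project_a`/`project_b` names are invented; requires git on the machine):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "dev@example.com", cwd=repo)
git("config", "user.name", "Dev", cwd=repo)

# Two independent projects side by side in one repo, one commit each.
for project in ("project_a", "project_b"):
    os.makedirs(os.path.join(repo, project))
    with open(os.path.join(repo, project, "main.txt"), "w") as f:
        f.write(project + "\n")
    git("add", project, cwd=repo)
    git("commit", "-q", "-m", "touch " + project, cwd=repo)

# Scoping the log to project_b's directory hides project_a's history.
log_b = git("log", "--oneline", "--", "project_b", cwd=repo)
```

`log_b` contains only the `touch project_b` commit, so from project B's point of view project A's churn is invisible.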

> unless you are still in the SVN stoneage that requires a copy of each file

Maybe you are thinking of CVS? In SVN, creating branches has always been cheap in both space and time.

We had this same experience with a large golang project, consisting of about 8 individual services. Switching to a monorepo made it incredibly easy to make changes to common code and inter-service communication. Huge breath of fresh air.

A single team is really helpful. Where I’ve seen it get particularly unhelpful is with multiple teams. I’m also not opposed to the concept, I just think it requires work to do correctly.

How do you create branches in a monorepo?

For example, I want to use branch rev5 from project A and rev3 from project B.

How do I do that in a monorepo? I could not do it in hg, and I'm not sure about git.

Depends on the system. When I used to manage SVN we would branch independent projects and then releases would be a snapshot of each into the server section of the repo. Those were then pulled down to their respective machines.

In SVN a branch is simply a convention. You copy (at almost zero cost) things around into your branches directory.

Branch rev5 of project A's folder into a folder in your project's folder, branch rev3 of project B's folder into a folder in your project's folder. Get your project to refer to its own copy, rather than the shared copies you branch from. (This isn't quite like git submodules, but that's probably the closest thing git has.)

You do the usual 3-way merge thing to push your changes upstream, or pull upstream changes into your copy. As with git, the VCS tracks which revision of upstream your copy is up to date with, which is how it determines the base for the 3-way merge.

Can’t you create a branch and merge the two branches you are interested in into that?

My understanding is that if you branch, you branch the entire repo (not sure about some special-case extensions). If you have two projects stored in a single repo, you are forced to use whatever rev each project is at at the point of the branch, rev 5543 for example.

In Perforce, which is more or less what Google is using, you can branch any directory within the repo. (You would never branch the whole repo; that makes no sense.) So if you wanted to construct a directory with one version of one subdirectory, and a different version of another subdirectory, that's quite straightforward.

Google branches google for releases. Otherwise, you would end up with version skew across dependencies, which would be a mess.

Yep, and if you want to keep folders you didn't branch up to date with master you have to continuously rebase.

Not having it this way would be equivalent to having a subrepo that refers to HEAD instead of a specific commit, which is normally considered a big anti-pattern.

The best remedy is to not do branching like this in the first place, just try to stay on trunk all the time.

Just curious, how small is small? How many kloc? How many people?

I sense that Google invests more in its infrastructure than most companies make in revenue.

I've worked with monorepos, and I'd be loath to recommend them as well; the combination of culture shift and tooling it takes to keep a monorepo system running makes most CD processes you see today look like child's play.

There is a lot of very good free software that supports most of the open source approach to CD these days; but very, very little freely available monorepo tooling. Just check out https://github.com/korfuri/awesome-monorepo - it's a quick read. I haven't found many other notably superior compilations. Compared with available OSS workflows and tooling, it's rather sparse, filled with bespoke approaches everywhere.

Agreed about the lack of monorepo tooling. There's just not that much out there. A couple of other links I didn't see in the awesome-monorepo:

- https://github.com/facebookexperimental/mononoke - I hear this is a real thing and not a science fair project

- https://github.com/bors-ng/bors-ng - Needed in a monorepo to handle high arrival rate of commits / merges

What problems specifically did you see? Was this because the repo was too large?

I understand that at Google scale you'd need lots of tooling, but why at the smaller scale of merging a dozen small repos?

The biggest problems are always cultural. Most monorepo workflows really reinforce constant integration, and once you have separate teams with separate managers, I've always witnessed constant conflict that ended up trying to establish spheres of control. It's bizarre - but it's something I've seen at pretty much every place I've worked at.

With all that integration, your single CI toolchain is front and center, since everyone's success or failure is tied to it. While projects like Bazel exist, how many developers do you know who work with Bazel every day? I know nobody who does. And most want documented IDE support and ease of use, not some optimal CI workflow. I've found Gradle to be OK, but even that kind of pushes everyone toward using JetBrains tooling. In the end, almost all real monorepos have significant custom CI tooling that wires together different toolchains, and they may have to maintain custom tooling for use on developer machines. That custom tooling can get expensive to maintain as the project scales up.


I call this "Google Imposter Syndrome". Because Google (insert Facebook, Apple, Amazon, etc) has success with Monorepos (insert gRPC, Go, Kubernetes, React/Native, etc), it must be a great idea, we should do it. You see this everywhere. Also known as an Appeal to Authority.

My personal opinion: very few companies will hit a point where sheer volume of code or code changes makes a monorepo unwieldy. Code volume is a Google-problem. But every company will have problems with Github/Gitlab/whatever tooling with multiple repos; coordinating merges/deploys across multiple projects, managing issues, context switches between them, etc. And every company will also have problems with CI/CD in a monorepo.

Point being... there are problems with both, and there are benefits to both. I don't think one is right or wrong. I personally feel that solving the problems inherent to monorepos, at average scale, is easier than solving the problems inherent to distributed repos. The monorepo problems are generally internal technical, whereas the distributed repo problems are generally people-related and tooling outside of your control.

Someone at some point said: "Google may not be successful because of the interview practices they use; they're big enough that they could very well be successful despite the interview practices they use."

It stuck with me, and is applicable to so many things. Including, maybe, this?

Another question is just the sheer scale of the FAANG companies, making things work at that scale is likely to be counterintuitive sometimes.

I just looked it up, Facebook has 2.2 billion users monthly. That's almost a third of the entire planet.

Shit that makes sense for them won't make sense for 99% of everyone else.

I've seen multiple companies struggling with maintaining interdependencies between multiple repos. It often results in an expensive custom solution. As a general guideline I'd say "when in doubt, put the code in a single repo"

Most people/companies aren't Google, but as you say they assume that if one or more of the tech gigants (or other very public tech companies) are doing something, then it must be good.

Often the focus is extremely weird. When people noticed that WhatsApp only employed something like 45 engineers, most assumed it was because they used Erlang and FreeBSD. The thought that maybe their success was due to hiring the very best engineers and paying accordingly is less attractive.

Monorepos are just another item on the heap of things that may be a good idea, but it depends.

also known as "cargo culting"

There's also the fact that monorepos have issues when you don't have one organization responsible for all the code. The Linux kernel and NetHack don't live in the same repository for good reason.

I dunno, the BSD distribution included a wide gamut of games along with the kernel source in the same tree. In fact, NetHack is derived from Hack which itself is derived from Rogue, which was distributed within BSD. And BSD represented a cross-organization responsibility (see the history of AT&T and BSD).

Wasn't it because it was the same group of people who worked on both? And when it ceased to make sense, the games were split off - the only remaining ones in FreeBSD are things like banner(6) or pom(6).

It does seem like the Linux model scales better than the BSD model.

Fine, replace NetHack with Quake 3. :)

Slapping a whole bunch of projects into multiple repos with dependencies isn't a pleasant experience either. What is the solution then? I certainly don't want to host my own npm/composer/maven/clojars repos or even use those dependency managers to manage my own code which constantly changes and relies on multiple libraries both on the backend and frontend. I've tried this and, at least with a small team of two, it's not a pleasant experience at all. So how can I solve this problem? Cause the monorepo is very enticing after dealing with multiple repos and multiple dependencies pulled through dependency managers that clearly do not do well with dependencies that are constantly in flux.

Submodules are cutting edge and have cutting edges. The user experience on some corner cases can be painful. Example: if you happen to have unrelated conflicts when you rebase some patch across a submodule update, you're most likely going to end up committing a reversal of the submodule update.

I've seen this a few times in the .NET world, mainly as a carry-over from Subversion when we had moved to Mercurial and git.

Some mad genius in a company will write a fuck-ton of helper classes and utilities that take the heavy lifting out of everything remotely hard, to the point where you almost never need to touch a third-party API for a CMS, email send service, or cloud-hosting provider. Instead of supplying these as private NuGet packages to be installed into an application, they sit in solutions in their entirety, in case they are needed. That application then goes to a new developer team, and they have zero idea why there are millions of lines of code and dozens of projects for a basic website that doesn't really seem to do anything.

It's a nice idea, but it has resulted in some very tightly coupled applications. I remember one time when a new developer changed some code in one of the utilities that handled multi-language support, and for some reason our logs reported that the emails were broken.

Are you suggesting that there’s a solution to managing large amounts of code that doesn’t involve large amounts of tooling?

There’s an important distinction between lots of code and lots of projects. I agree; if you have a ton of code you’d better invest in tooling. But if you just have several normally sized projects, a monorepo can make your life much more difficult than simply using several repos.

Sure. You could also replace the term “monorepo” with “separate repos” and your statement would be just as valid. Either way you go has pros and cons.

Agreed. I think the thing is that GitHub basically supports the separate repo approach fairly well out of the box. Using only a single repo requires more thought around your strategy, especially if you have multiple teams.

heh, thousands, it's probably at least an OOM greater, if not two.

Hah, you know I started with millions and then did some fuzzy math and started debating team sizes (since I know google often doesn’t have giant teams), ended up somewhere in the hundreds of thousands, and rounded to thousands. But I made it all caps so you know it’s the SERIOUS kind of thousands. :P

My brain sees OOM and thinks "out of memory", which might be applicable too.

Most folks who consider a monorepo don't have billions of lines of code, and often not even millions.

Linux kernel is a monorepo.

Linux kernel is one functional piece of work though.

Imagine if we combined KDE, Gnome, Linux Kernel, ZFS etc all in the one monorepo.

And gnucash, libreoffice, a couple copies of android, three other things that forked the linux kernel, and then all of apache to boot.

And we'll call it something crazy, like a Linux distribution!

A linux distro is a bunch of package metadata and binaries, they don't push the code of every package into one repository.

No if it's Google that's much too reasonable of a name, you have to prefix and/or suffix a bunch of stuff on there.

Where there are obvious and pronounced functional boundaries, often backed by administrative boundaries, separate repos makes total sense.

Otherwise, it's an optimization; see "premature optimization" for cautions.

But what are the alternatives to the monorepo in git?

All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

Fun fact. I asked Facebook why they built their monorepo on Mercurial instead of Git. They said there were scaling issues in Git that made it unusable for large repos and the Git maintainers would not work with them to fix these issues. However, they were able to work with Mercurial to make it capable of holding their entire company in one repo.

Someone from FB did a really cringe-inducing presentation a few years ago about how "X can't handle our scale" (I think the predicate was iOS, but they went into IDEs and SCM systems). They had to pull the video and slides because it was so bad.

Xcode also had issues working with large repos. Perhaps they were talking about that?

Responding late, so you might not see this. The thing that was so ridiculous to me was Facebook pretending that their app is somehow orders of magnitude more complicated than everyone else's.

Their whole schtick was "we're Facebook and we have unique scaling needs that no one else does," which makes absolutely no sense from the perspective of one user's content being rendered on their phone.

Plus the presenter was pretty smug, like all of this was good, when he wasn't convincing anyone that it was even necessary.

Found the slides: https://www.columbia.edu/~ng2573/zuggybuggy_is_2scale4ios.pd...

IIRC their iPhone app had about 20000 classes for some insane reason, and the system didn't handle that well.


Facebook patched Android Dalvik to increase the "max methods per app" limit.

Don't know if there's a similar iOS story.

> multiple git repos for one project

If it's one project, it's not a monorepo. It's a repo.

We wanted to have a "common" subsystem that was common across projects. Being able to add and work on the common area and new projects at the same time was important. Pushing the common area back and being able to deploy to the older projects and test was important.

This seems difficult in git.

There are "submodules" and "subtrees" but none seemed particularly great and as far as I could tell each came with a bunch of caveats.

I'll admit my Git skills aren't great, but I've used a variety of source control and tried to suss out the best way to deal with a small team.

We ended up using "git subrepo" which is an add on thing I don't love, but it works.

Part of the motivation is that "common" and "project 2" are to be open-sourced, but "project 1", which also uses "common", isn't.

> what are the alternatives to the monorepo in git?

A monorepo in Perforce!

I don't understand why people are against this. You can have per-repo branches/tags, the history is clean and relevant, it's easy to triage breakages, easy for different apps to have different versions of code, etc. Plus, for CI/CD it's trivial to just have one Jenkins job per repo and simple Git commit triggers.

The entire programming world revolves around libraries, and yet when it comes to our own code we are afraid of them? Strange.

> All the ways of splitting code up and deploying multiple git repos for one project seem terrible.

Of course they are, git isn't the tool for this. You don't want multiple repos for a single project, you want one per project (this is not a monorepo). If there are things like code common to multiple projects then they are their own project with their own repo and release schedule, releases go into some sort of package manager (even if it's just the file system) and the projects depending on that common code update as they go.

See, this is where your argument broke down for me. Once you’ve decided there is some library of common code, and assuming you factor out that code into another repo, you’ve just lost your ability to easily make breaking changes to the common code, which is something trivially easy to do in a monorepo. Why would you want that? It seems to me that if you have multiple projects sharing a base of common code then a monorepo is clearly superior.

> you’ve just lost your ability to easily make breaking changes to the common code

It should be hard to make breaking changes in common code. Even 'trivial' breaking changes seem to have a way of breaking things even when they shouldn't. If you need to make a breaking change to common code, the proper way to do it is add the new functionality separately, deprecate the old functionality (i.e. with javadoc so it gets called out explicitly by the IDE), and incorporate it 1-by-1 into consumers until none are using the deprecated version anymore.

And you can do that in a monorepo. But realistically, there are plenty of trivial breaking changes (renaming Foo to FooX) that don't warrant that effort, and so usually don't get done outside of monorepos.

You should apply care when making breaking changes. Having it be hard is a separate issue - I'd say distractions from multi repo tooling would introduce more risks overall. Having a unified CI system in a monorepo is really nice.

> you’ve just lost your ability to easily make breaking changes to the common code

I haven't lost anything, I've gained the ability to make breaking changes because I don't have to update everything that breaks all at once. I don't have to do it at all because that's the job of the team responsible.

With a monorepo what happens when there are 17 projects using the common code and I'm not familiar with 16 of them? Do I have to dive into the code of all 16 and fix them?

What you're proposing goes a step beyond multiple repos and into package versioning.

That is one viable workflow: Make a change to the common code and publish it as a new package version while allowing all existing code to continue to use the old package. Then, migrate other projects to the newer version of the dependency one by one.

Allowing multiple versions of the same code to exist in production at once adds complexity. It's a trade-off.

Also, if you're doing this with code that is ultimately webpacked to run in a web browser and you don't pay attention to the full tree of dependencies you're working with, there's a chance you end up loading two versions of the same library into a single web page, increasing the page weight and possibly causing incompatibilities in event handling.

Google prefers to simply have one master version of the entire company at a time.

I've spent a lot of time wondering which solution is the best and I'm still not sure.

> Also, if you're doing this with code that is ultimately webpacked to run in a web browser and you don't pay attention to the full tree of dependencies you're working with, there's a chance you end up loading two versions of the same library into a single web page, increasing the page weight and possibly causing incompatibilities in event handling.

You probably should have a way to visualize bundle size increases in PRs easily, so that this becomes obvious. Alternatively, some package managers like Yarn let you flatten the dependency tree, forcing you to pick one version of everything. Even with a monorepo, since you'll likely be using 3rd party dependencies, it's always an interesting exercise because of how hard NPM makes this: getting to a point where you only have 1 version of every 3rd party package can be very, very hard as some combinations of libs are mutually exclusive.
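For example, Yarn's selective resolutions (a real Yarn feature; the package names below are made up) force one version of a transitive dependency from package.json:

```json
{
  "dependencies": {
    "some-ui-kit": "2.1.0",
    "other-widget": "3.0.2"
  },
  "resolutions": {
    "lodash": "4.17.21"
  }
}
```

Here both hypothetical dependencies are forced to resolve lodash to 4.17.21, so only one copy lands in the bundle.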

It almost certainly depends on the company. Consider General Electric's microwave oven firmware and hydroelectric control station software. Both might actually share some code. Maybe they both use GEEventLoop or GEHardwareButton or something. But there's no reason to be concerned about having different versions in production at once.

I don't think there's a universal answer to your question.

At Google yes, you would be expected to fix everybody. This will also give you exposure to client use cases and form a good basis for arguing your change is safe, necessary, and correct.

The idea that clients can run on the old library forever is a nightmare, especially for security-relevant changes. When I see a binary built yesterday I want it to contain yesterday’s code, not a copy of libpng 0.1 from 1996.

Someone will need to update those 16 unfamiliar projects, whether it's you or those projects' owners. From what I hear, Google's process is that the developer making the breaking change has to either update the other projects, or coordinate with the project owners, before they can check in the breaking change. It helps that Google has standard build and test systems for all their projects in the one monorepo.

Yes, that's what you do. And you commit the change when all the tests pass.

And that's insane, people don't scale like that. It's hard enough keeping your head wrapped around one large project, let alone every project a company has that you might have to jump into at any point.

You're arguing for the nonexistence of something that obviously exists. There are tens of thousands of engineers at Google working in this manner on one of the largest codebases ever assembled.

What exactly makes automated refactoring difficult?

You can soft-deprecate the old code path, and communicate that warnings will turn into errors at some future date.

You can send your pull request to the affected team leads, and request that they approve it, once they make changes on their end.

I mean, the alternative is that you have 17 different projects, each using one of five different versions of the common code. Heaven forbid one of them makes an incorrect assumption about another. Getting 17 different teams to dance together around a breaking change is always going to be hard.

If the common code is a versioned package, then each of the 17 different projects could update their code to handle breaking changes in the common package independently and update the version dependency after thorough testing.

You can have versioned packages inside a mono-repository, too, though. /common_libs/foo_lib_v1.13/, /common_libs/foo_lib_v1.14/, etc.

By that point you're creating micro-repositories in your mono-repository and getting the worst of both worlds.

No, you're getting the best of both worlds, because it's incredibly clear to the infrastructure maintainers what version everyone's on, whether or not the old version can be safely deprecated, who is responsible for deprecating it, etc.

I don't have a horse in this race but if you have, say, a security issue and that needs to propagate downstream where does your responsibility end in that situation? Do you try to track down the dependencies and open issues in their trackers? Or maybe a more common problem is a change to a library that consumes an API that's changing so updating the library has a drop dead date.

> I don't have a horse in this race but if you have, say, a security issue and that needs to propagate downstream where does your responsibility end in that situation?

This is an issue that needs to be managed; from the systems I've seen it tends to be managed poorly, both in mono-ish repos and in multi-repo setups, as well as when everyone is using third-party packages. I don't think committing everything to trunk is a good way to resolve it, though; the only upside to this approach is that it might force you to resolve it.

What I have to deal with much more frequently is the opposite problem, we have an urgent update that will break several things but has to be deployed for one dependent binary ASAP and fixing the rest of the universe first is not an option.

Worst case it might create some security issues: something that should be a breaking change gets kludged into a "non-breaking" change but is still broken.

As a data point, we put the dependency graphs in a database at build time. When we have an emergency and need to push a library update to thousands of repos, we make the change, then trigger builds for all the dependents. We don't auto deploy (too risky), but we use the data we have to start nagging the owner of all the repositories to tell them they have to deploy asap. Since all projects are very small, they build and deploy very, very quickly (a few minutes at most for the big ones).
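A rough Python sketch of the reverse-dependency query behind a workflow like that (the graph here is in-memory and the repo names are invented; the commenter's version reads from a database and triggers real builds):

```python
from collections import deque

# deps[repo] = set of libraries that repo depends on, captured at build time.
deps = {
    "billing-svc": {"libauth", "libjson"},
    "login-svc":   {"libauth"},
    "report-job":  {"libjson"},
    "libauth":     {"libjson"},
    "libjson":     set(),
}

def dependents_of(changed, deps):
    """Walk reverse edges to find every repo that transitively uses `changed`."""
    reverse = {}
    for repo, uses in deps.items():
        for lib in uses:
            reverse.setdefault(lib, set()).add(repo)
    seen, queue = set(), deque([changed])
    while queue:
        lib = queue.popleft()
        for repo in reverse.get(lib, ()):
            if repo not in seen:
                seen.add(repo)
                queue.append(repo)
    return seen

# After an urgent fix to libjson: trigger builds and nag each owner.
for repo in sorted(dependents_of("libjson", deps)):
    print("rebuild + nag owner:", repo)
```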

If you are making a breaking change to a public interface you maintain, wouldn't you want to know how that interface is being used first, before justifying such a major breaking change? Not just change it for the sake of your own library's internal convenience and hope that users of the library adapt. Since you know how the API is supposed to be used and all its best practices, understanding how to change the parts of the 16 projects using it should be quite easy; you shouldn't have to dig into the domain knowledge for all those projects.

What I can advise against is repo partitioning prematurely. I have been on multiple teams that have thought "Oh this will be a common library for all our projects" or "this is a sample project" or "this is the android version and this is the iOS version" and split projects up into different repos, only to wind up with crazy dependencies between repos which have fallen out of sync or require another repo to be on a specific branch/hash to work correctly, causing all kinds of chaos. Split your repos by dependencies, and once your system architecture is kind of fleshed out. Just use branches on the same repo until then.

Spot on! I've seen org wide mono repos at Microsoft and they had their custom tooling and build systems built on top of SourceDepot.

Which is just rebadged Perforce :)

I kinda disagree, we’re a dev team of 30, 3.5 years in, 150k lines of code and we’ve always had a monorepo. We had to switch from maven to bazel after about 2 years because test times got out of control; bazel has been about 50% more annoying than maven but the incremental builds work perfectly.

Interesting. Have you written something about that migration?

No, though that would be a good blog post. We tried to make multi-module maven work for a while, eventually gave up and wrote some scripts that would convert maven to bazel, using many assumptions that applied only to our particular case. We did the cutover in one day but kept maven around for a couple weeks in case we decided to bail on bazel. It worked out; we even found CircleCI works great. I would say the weak link in the bazel ecosystem is the IntelliJ plugin, which is very functional but also very slow.

Google used Perforce for a very long time before they built their own version control system.

Tooling is required for coordinating configuration management on multiple repositories too.

Also, why isn't such tooling available as open source? I'm trying to do my bit, but we could do with more effort being put into this, somehow.

Really probably closer to millions of hours.

Dumping all code in a single repo, even for a 30-man development shop, was really tough. Doing so for a company of a few thousand must be truly crazy.

I advise Google to replace the person in their internal IT who came up with that idea.

> and the thing they all seem to overlook is that Google has THOUSANDS of hours of effort put into the tooling for their monorepo

That's a huge understatement. They haven't just slapped a few scripts on top of git/svn, they've created their own proprietary scm to manage all of this. They've thrown more at this beast than most companies will throw at their actual product.

I'm also not convinced they haven't reinvented individual repositories inside this monorepo; it sounds like you can create "branches" of just your code and share them with other people without committing to the trunk. This is essentially an individual repository that will be auto-deployed when you merge to master.

Your last paragraph doesn't sound like anything at Google. Most engineers will never use branches at all, and even fewer will use branches that merge into trunk (instead of away from it).

There is a set of code changes locally and those changes are bundled off to the test server to run the full test suite? That's a branch.

Now let's say I break a project sharing this code and because I'm not an expert in all 2 billion LoC and 3000 projects google is running I need to enlist some help in fixing what I broke. Presumably there is a way for the developers on that downstream project to pull in my change set? That's a shared branch.

Now assuming I can get all of these planets aligned correctly I'm going to need to take this set of changes and put it into the master version aren't I? That's merging my branch into trunk.

Your mental model of how this works within Google is completely foreign to me. I think you've made an unfounded assumption somewhere.

Yeah that second thing doesn't exist. That first thing doesn't really exist the way you conceptualize either, I don't think.

Can anyone articulate how it does work and where I'm going wrong then? The conversation's feeling pretty one-sided here.

You said the first thing doesn't exist? Do you not have local changes or are these changes not shared with the test/build server? Having a set of patches, code changes, whatever sounds like a branch to me, are you being too literal with the word branch?

For the second part what doesn't exist? Do you not make changes that break other people's code? Do you not get them to help fix it? Can you not share your work-in-progress changes with others? Can you guys collaborate on changes at all?

I can try. I think people are averse to doing this because it can sort of require a deep dive into how Piper and CitC work, and the linked article does a good job of explaining that; beyond what the article says, it's not clear what you can discuss.

[Everything I'm about to explain is for the average user's workflow, like others have mentioned, "real" branches do exist, but most engineers will never use them, and my current workflow works differently than what I'm explaining, but I used to do it this way.]

Piper generally speaking doesn't have the concept of commits or "sets of patches". You have clients. A client is trunk@time + some local changes. You could maybe call this a branch, but you can't stack multiple commits[1], so it's a branch of length exactly one. It can only be merged back into trunk. Then you delete the client and start a new one. You can patch changes from one client into another, but this isn't generally done or super useful because, again, you can't stack changes.

A given client has an owner, and the owner has write access. Everyone else has read access.

So to answer your questions:

>Do you not have local changes or are these changes not shared with the test/build server?

There are local (sort of) changes. And you can test/build them, but they lack many of the concepts one would expect of a branch, so I'm not sure that's a good name for them.

>Do you not make changes that breaks other peoples code?

Sure you do. But you're responsible for fixing it (as I said elsewhere).

>Do you not get them to help fix it?

Yeah, but normally this is done by having them review the change, or talking in person. There's nothing like multiple commits by multiple people which are then squashed and merged.

>Can you not share your work in progress changes with others?

Sure, but they can't edit them.

>Can you guys collaborate on changes at all?

Kind of, but not with multiple authors.

[1]: There's a hack that allows chained CLs, but it's a hack, a leaky abstraction, and still doesn't provide multiple authors squashing and merging.

Google internally has many concepts of branches. What you describe is not any of them.

Is it just me, or are a lot of people here conflating source control management and dependency management? The two don't have to be combined. For example, if you have Python Project X that depends on Python Project Y, you can either have them A) in different scm repos, with a requirements.txt link to a server that hosts the wheel artifact, B) have them in the same repo and refer to each other from source, or C) have them in the same repository, but still have Project X list its dependency of project Y in a requirements.txt file at a particular version. With the last option, you get the benefit of mono-repo tooling (easier search, versioning, etc) but you can control your own dependencies if you want.
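A hypothetical layout for option C (all names invented), where the code lives in one repo but dependencies stay version-pinned:

```
monorepo/
  libs/
    project_y/           # built and published to an internal index as project-y
      setup.py
  apps/
    project_x/
      requirements.txt   # contains: project-y==1.4.0  (pinned, despite living in-repo)
```

Project X only picks up Project Y's changes when it deliberately bumps the pin, while still enjoying repo-wide search and shared tooling.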

edit: I do have one question though, does googles internal tool handle permissions on a granular basis?

The key here is reverse dependency management. “If I change X, what would influence this change?”.

This can be achieved with single repo better than multi-repo due to the completeness of the (dependency) graph.

Exactly this. Or at least it's a way this can be achieved, assuming solid testing & some tooling in the mix.

For folks unfamiliar with it, the issue is something like:

1. You find a bug in a library A.

2. Libraries B, C and D depend on A.

3. B, C and D in turn are used by various applications.

How do you fix a bug in A? Well, "normal" workflow would be something like: fix the bug in A, submit a PR, wait for a CI build, get the PR signed off, merge, wait for another CI build, cut a release of A. Bump versions in B, C and D, submit PRs, get them signed off, CI builds, cut a release of each. Now find all users of B, C and D, submit PRs, get them signed off, CI builds, cut more releases ...

Now imagine the same problem where dependency chains are a lot more than three levels deep. Then throw in a rat's nest of interdependencies so it's not some nice clean tree but some sprawling graph. Hundreds/thousands of repos owned by dozens/hundreds of teams.

See where this is going? A small change can take hours and hours just to make a fix. Remember this pain applies to every change you might need to make in any shared dependency. Bug fixes become a headache. Large-scale refactors are right out. Every project pays for earlier bad decisions. And all this ignores version incompatibilities because folks don't stay on the latest & greatest versions of things. Productivity grinds to a halt.

It's easy to think "oh, well that's just bad engineering", but there's more to it than that I think. It seems like most companies die young/small/simple & existing dependency management tooling doesn't really lend itself well to fast-paced internal change at scale.

So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this. Folks like Netflix stuck it out with the multi-repo thing, but lean on tooling [0] to automate some of the version bumping silliness. I think most companies that hit this problem just give up on sharing any meaningful amount of code & build silos at the organizational/process level. Each approach has its own pros & cons.

Again, it's easy to underestimate the pain when the company is young & able to move quickly. Once upon a time I was on the other side of this argument, arguing against a monorepo -- but now here I am effectively arguing the opposition's point. :)

[0] https://github.com/nebula-plugins/gradle-dependency-lock-plu...

> So having run into this problem, folks like Google, Twitter, etc. use monorepos to help address some of this.

I think you’re retroactively claiming that Google actively anticipated this in their original choice of Perforce as an SCM. They may believe that it’s still the best option for them, but as I understand it, to make it work they bought a license to the Perforce source code, forked it, and practically rewrote it.

Here’s a tech talk Linus gave at Google in 2007: https://youtu.be/4XpnKHJAok8

My theory (I wonder if someone can confirm this), is that Google was under pressure at that point with team size and Perforce’s limitations. Git would have been an entirely different direction had they chosen to ditch p4 and instead use git. What would have happened in the Git space earlier if that had happened? Fun to think about... but maybe Go would have had a package manager earlier ;)

> I think you’re retroactively claiming that Google actively anticipated this in their choice at the beginning of using Perforce as an SCM.

Oh I didn't mean to imply exactly that, but really good point. I just meant that it seems like folks don't typically _anticipate_ these issues so much as they're forced into it by ossifying velocity in the face of sudden wild success. I know at least a few examples of this happening -- but you're right, those folks were using Git.

In Google's case, maybe it's simply that their centralized VCS led them down a certain path, their tooling grew out of that & they came to realize some of the benefits along the way. I'd be interested to know too. :)

Maybe Google’s choice for monorepo was pure chance. However, on many occasions the choice was challenged and these kinds of arguments were (successfully) made in order for it to stay.

There's a subtler, and potentially more important thing that can crop up with your scenario:

Library A realises that its interface could be improved, but the improvement would not be backwards compatible. In the best case scenario, with semver, there is a cost to this change. Users have to bump versions and rewrite code, and maybe the maintainer of Library A has to keep 2 versions of a function to ease the pain for users. It may just be that B, C and D trust A less because the interface keeps changing. All this can mean an unconscious pressure to not change and improve interfaces, and adds pain when they do.

Doing it in a monorepo can mean that the developers of A can just go around and fix all the calls if they want to make the change, allowing for greater freedom to fix issues with interfaces between modules. And that is really important in large complex systems with interdependent pieces.
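A crude sketch of that "fix all the calls" approach as a Python codemod over a monorepo checkout (the function and symbol names are invented; a real large-scale change tool would work on the AST, not regexes):

```python
import pathlib
import re

def rename_symbol(root, old, new):
    """Rewrite every whole-word use of `old` to `new` across a source tree.

    Word-boundary matching only; AST-based codemod tools are safer in practice.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    changed = []
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text()
        new_text = pattern.sub(new, text)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path)
    return changed
```

In a monorepo, a single change like `rename_symbol(repo_root, "Foo", "FooX")` plus one CI run covers library and all callers atomically; across many repos the same rename becomes a multi-release migration.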

This is my biggest gripe in discussions like this as well, dependency management and source control are two completely different things. It should be convenient to use one to find the other but they should not necessarily be 1-1 coupled together with each other.

1. A single repo should be able to produce multiple artifacts.

2. It should be possible to use multiple repos to produce one artifact.

3. It should be possible to have revisions in your source control that don't build.

4. It should be possible to produce artifacts that depend on things not even stored in a repo, think build environment or cryptographic keys etc. An increase in version number could simply be an exchange of the keys.

Number three I disagree with. Bisection depends on build (and test) always working on trunk.

Single repo is one design that coherently addresses source control management and dependency management.

The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

> The key is to let the repo be a single comprehensive source of data for building arbitrary artifacts.

By that do you mean it's one way of doing it, or that it's the only way?

Seems clear to me that it's not the only way. For instance .Net code tends to be Git for the project source + NuGet for external dependencies. It works pretty well.

It's one way. There isnt any problem that can only be solved in one way.

I don't know what this means.

How is "single repo" a "design" and how does this design dictate dependency management?

Yes, if you have a single repo then that would be a single source of data for building your stuff. That seems redundant.

See Bazel: you have the deps manifested as source-controlled data, and then you can build everything as deterministically as possible.

Then you can manage dependency as part of the normal source control process.

A single repo makes it a bit tricky to use some library in version A for project X and version B for project Y.


You can consider that a bad thing or a good thing.

Most languages' package systems (C/C++, Java, Python, Ruby) don't permit running multiple versions of the same library at runtime. The single-version policy is one way of addressing dependency hell.
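A small Python demonstration of the one-version-per-process constraint (module name invented): whichever version is imported first wins, because imports are cached in sys.modules.

```python
import pathlib
import sys
import tempfile

# Two directories, each holding a different "version" of the same module name.
v1 = pathlib.Path(tempfile.mkdtemp())
(v1 / "mylib.py").write_text("VERSION = '1.0'\n")
v2 = pathlib.Path(tempfile.mkdtemp())
(v2 / "mylib.py").write_text("VERSION = '2.0'\n")

sys.path.insert(0, str(v1))
import mylib
first = mylib.VERSION  # '1.0'

# Putting v2 first on the path changes nothing: the import is cached.
sys.path.insert(0, str(v2))
import mylib
still = mylib.VERSION  # still '1.0' -- one version per process
```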

I think that's actually a good thing. Allowing different projects to use different versions of a 3rd-party package may be convenient for developers in the short term, but it creates bigger problems in the long term.

It depends on the industry. In some places changing a dependency, no matter how trivial the change, entails a lot of work. Think for example about embedded systems where deploying is a lot harder than pushing a Docker image somewhere. It is often far cheaper to analyze whether the fixed bug can be triggered to avoid upgrading unless necessary.

In those situations, why not go ahead and keep the code up-to-date and consistent, and simply not deploy when you don't need to?

Because that costs money now that could be spent on something that actually produces a profit.

If I recall, in Google's build system, a dependency in the source tree can be referenced at a commit ID, so you can actually have a dependency on an earlier version of artifacts in source control.

No, that's not true since at least 2013 (the year I joined Google).

Yes, Google's internal tool handles permissions based on directory owners.

They use the same OWNERS-file model as in the Chromium project [1], the only difference being the tooling (Chromium is git, google3 is ... its own Perforce-based thing).

[1] https://chromium.googlesource.com/chromium/src/+/lkcr/docs/c...

I can't comment specifically on Google's tool, but I know it's based on perforce. perforce does have granular permissions - https://www.perforce.com/perforce/r15.1/manuals/p4sag/chapte...

> The two don't have to be combined.

They do have to be combined in some way, at least to be reproducible. Your requirements.txt example is one way of combining version control + dependencies: give code an explicit version and depend on it elsewhere by that version.

Google has chosen to combine them in a different way, where every commit of a library implicitly produces a new version, and all downstream projects use that.

> googles internal tool handle permissions on a granular basis?

Not sure what you mean... its build tool handles package visibility (https://docs.bazel.build/versions/master/be/common-definitio...). Its version control tool handles edit permissions (https://github.com/bkeepers/OWNERS).

It is very tempting to believe that a monorepo will solve all your dependency issues. If you have a project that's say pure python consisting of a client app, a server app, and then a dozen libs, that might actually be true, since you force everyone to always have the latest version of everything, and always be running the latest version. Given a somewhat sane code base and smart IDE, refactoring is really easy and and updates everything atomically.

In reality you often have different components, some written in different languages, at a certain size, not everyone has all the build environment set up and might be working with older binaries, and now it's just as easy to have version mismatches, structural incompatibilities, etc. So you need a strong tooling and integration process to go along with your monorepo. The repo alone doesn't solve all your problems.

Maybe this is a reflection of modern tools using the version control system to store built artifacts, like npm and "go get" do. Anyway, depending on the programming language, you can have a monorepo and still bind your modules with artifact dependencies, not necessarily depending on the code itself.
