Advantages of monolithic version control (danluu.com)
198 points by Tomte on Feb 12, 2018 | 138 comments

> With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. Using a single repo also reduces overhead from managing dependencies.

This is the major thing I miss about Subversion, and the fact that in Subversion a subdirectory in a repository can be checked out on its own.

At the top level of our Subversion tree, there were 'web', 'it', and 'server' directories, reflecting the departments in the company (at least those departments that dealt with source code).

In the 'server' directory, there were things like 'payments', 'reports', and 'support', for things like payment processing, reporting, and stuff to help the customer support people.

So let's say we had a programmer in the server department working on the credit card storage system, and on a script to make a quarterly tax report. That programmer would just have to check out /server/payments/cc_storage and /server/reports/quarterly_tax from the company Subversion repository. When he checks the history in either of those, he only sees commits that affected those directories or their subdirectories. It really is like they are separate repositories.

Suppose another programmer is working on the whole payment system. He can check out /server/payments, and automatically gets cc_storage under that, but also order_processor, paypal_callback, cc_updater, and subscription_biller.

I was in charge of the whole server department. So I could check out /server and have a copy of everything we did. The other programmers would usually save work in progress to a personal development branch, and so every morning I could update from the server, and then see what everyone had done the day before. I could do a quick review to check the less experienced programmers' work, and it also gave me what I needed to write a short note to my manager letting him know what the server department was up to and how we were progressing.

I like this about perforce too. Another thing I like is that I can easily substitute subdirectories with other ones, or mask out directories I don't need. I can swap in the new code for a single module into a legacy version with a one line config change.

I also get the ability to use the depot explorer to poke around at absolutely everything that folks are working on. The search function in gitlab is basically a joke compared to this.

Unfortunately for perforce, git/gitlab seems better in basically every other way. It took me 2 years to have anything nice to say about perforce; my regular development flows were just that painful.

I have poked around in the git file format a tiny bit and I think the hash tree semantics aren’t incompatible with subversion’s commit tree semantics, which are what let you grab a particular sub tree cleanly.

I kinda think you might be able to convince git to let you check out a subdirectory. But I’m not sure if any of the plumbing exposes that ability or if it would take significant surgery.

Git supports this scenario as "sparse checkouts": https://git-scm.com/docs/git-read-tree - see this QA: https://stackoverflow.com/questions/4114887/is-it-possible-t...
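For anyone who wants to try it, here is a minimal sketch of the read-tree style of sparse checkout that the linked docs describe. The repository layout (foo/bar/baz plus an unrelated "other" directory) and the demo identity are made up for illustration; newer git versions also ship a dedicated sparse-checkout command, which this sketch deliberately does not use.

```shell
set -e
# Throwaway repository with a hypothetical layout: foo/bar/baz plus an unrelated directory.
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
mkdir -p foo/bar/baz other
echo "wanted" > foo/bar/baz/file.txt
echo "unwanted" > other/file.txt
git add .
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "Initial layout"

# Turn on sparse checkout and list the only path that should stay in the work tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo "/foo/bar/baz/" > .git/info/sparse-checkout

# Re-read HEAD into the index and update the work tree to match the sparse patterns.
git read-tree -mu HEAD

# other/file.txt is now gone from disk but still tracked in history.
ls
```

Note that the work tree still starts at foo/, so the kept files sit at their full path from the repository root rather than at the top level.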

Does this allow you to file PRs against the repository you have checked out? Or does this only work for read-only use? What about CI? How do you convince TC or Jenkins or Bamboo to do the same thing?

If this does do all that, I think this functionality needs some SEO love because this pretty much never comes up when I search for the latest ways to grab part of a repo. All I find are conversations where people are trumpeting the wrong tools for the job.

Edit: Also, this doesn't seem to let me check out a subtree the way people mean "check out a subtree". When I check out 'just foo/bar/baz' I expect to have a directory called baz as my project root. Not a directory named foo with a single grandchild named baz.

perhaps you could rename the repo to something else/move it somewhere else, then ln -s the grandchild, if that’s important?
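Roughly what I have in mind, as a sketch only. The directory names (checkouts/repo, baz) are placeholders, and plain directories stand in for the real checkout:

```shell
set -e
# Hypothetical layout: a checkout whose only interesting content is foo/bar/baz.
work=$(mktemp -d)
mkdir -p "$work/checkouts/repo/foo/bar/baz"
echo "hello" > "$work/checkouts/repo/foo/bar/baz/README"

# Expose the grandchild as a top-level project directory via a symlink.
ln -s "$work/checkouts/repo/foo/bar/baz" "$work/baz"

# Reads the file through the symlink.
cat "$work/baz/README"
```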

It breaks down if you need more than a couple of them. But I guess that could go both ways.

A problem we saw with perforce and the clientspec: when people see a directory structure, they forget they don’t have all the bits. They make errors of judgment based on bad info.

You can get all this functionality with git, gitlab and looking through the consolidated activity page, no?

Perhaps, although this sort of reinforces the point that the design of git tends to encourage creating a lot of individual, single-purpose repositories, leaving it up to the user to string them all together with gitlab or github or a custom server, if that's what you want.

I'm not saying I prefer the subversion architecture, but with subversion the pattern described is quite natural and requires no additional technology or design beyond a directory structure.

> the design of git tends to encourage creating a lot of individual, single-purpose repositories

Which is weird, considering it was designed to be used as the Linux repository control system, which is a monolithic repo.

I've heard people longing for this, but unless you have huge binary assets, what's the point?

> When he checks the history in either of those, he only sees commits that affected those directories or their subdirectories.

And in git, you just enter the directory and do "git log ."
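A quick sketch to show what that looks like, reusing the directory names from the Subversion example above. The file names, commit messages and demo identity are invented for the demo:

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
# Helper so the demo does not depend on a configured git identity.
commit() { git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "$1"; }

mkdir -p server/payments server/reports
echo "v1" > server/payments/cc_storage.txt
git add . && commit "Add cc_storage"
echo "v1" > server/reports/quarterly_tax.txt
git add . && commit "Add quarterly_tax report"
echo "v2" > server/reports/quarterly_tax.txt
git add . && commit "Fix quarterly_tax totals"

# From inside a subdirectory, "git log ." lists only commits touching that path.
cd server/payments
payments_log=$(git log --oneline .)
echo "$payments_log"
```

The repository has three commits, but only the one that touched server/payments shows up.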

On the other end of the spectrum, a colleague of mine recently told me about when his previous company (big, 100k+ employees) were starting to adopt git, some people seriously considered having a separate repository per file! "Because then you can just set up your project as using version X of file A and version Y of file B"

It's good we now have (at least) Google, Facebook and Microsoft as examples of companies using monorepos. Those are names that carry some weight when thrown into a conversation.

That sweet abuse of version control reminds me of the fabled JDSL from http://thedailywtf.com/articles/the-inner-json-effect

That can't be real... is it? No. No way.

TDWTF started modifying their stories significantly to avoid people getting in trouble due to telling the stories. Sadly this was used as an opportunity to significantly embellish the stories to the extent that often the major weird thing told about is all made up.

It has to be made up. I am with you, that cannot be real. There are trolls for everything.

It is so well crafted that it must be rooted in truth... I just... no. No way. This is too much madness...

Seems interesting that he would put in comment examples that would delete the entire database. Perhaps the guy telling the story is what happened when little Bobby Tables grew up.

It strikes me as /r/nosleep but for IT.

Google, Facebook, and Microsoft also make significant investments in tooling to get their monorepos to work. Picking a good package manager (or artifact repository) and making it a part of company culture could be less work, depending on the context.

Those people probably were used to a VCS like Clearcase where this is standard practice. If your process grew around something like that and depends on it it might seem reasonable to try and emulate it in the new system, to reduce the amount of stuff you need to change and learn.

For a close enough real life example, and very often just one file (PKGBUILD), ArchLinux uses one (orphaned) branch per package:



Orphan branches allow you to have multiple independent trees in the same git repo, so it's basically a way to stuff many "repos" (as in history) into a single one (as in object storage).

That seems very weird. Do you happen to know the reasoning behind it?

Arch maintainer here. Arch Linux doesn't use git, it uses svn still. Each arch package-repository (core, extra, community...) is a single SVN repository. Packages are split in subdirectories in that repository and checked out individually. Every maintainer deals with a few dozen packages individually, not the whole. There are seldom any commits that span multiple repositories.

What you were linked is the "svn to git" mirror which has to somehow translate that aspect of the setup to git.

Incidentally, the way Arch svn is currently set up works really well for arch devs and it is very hard to find a good replacement for it using git. Having a monorepo split up in lots of microrepos is straight up not possible in git.

> Having a monorepo split up in lots of microrepos is straight up not possible in git.

Not saying you should, but if your goal is to have a single repo (VCS) per repo (Arch) within which all packages are stored, and replicate the {core/{foo,bar,baz},extra/{qux,tor,meh}} tree, there could be many ways to do it depending on the exact needs, by leveraging GIT_DIR, GIT_WORK_TREE, GIT_OBJECT_DIRECTORY, GIT_ALTERNATE_OBJECT_DIRECTORIES, clone --single-branch and checkout --orphan.

Single monorepo with orphan branches, single clone, multiple work trees:

    # create monorepo
    mkdir core
    cd core
    git init --bare .git
    export GIT_DIR="$PWD/.git"
    # add package in its own isolated branch
    mkdir bash
    export GIT_WORK_TREE="$PWD/bash"
    cd bash
    git checkout --orphan bash
    touch PKGBUILD
    git add PKGBUILD
    git commit -m "Add package: bash"
    # switch package (go back up first so the work tree path is core/readline)
    cd ..
    mkdir readline
    export GIT_WORK_TREE="$PWD/readline"
    cd readline
    git checkout --orphan readline
    touch PKGBUILD
    git add PKGBUILD
    git commit -m "Add package: readline"
    # back to bash
    cd ..
    export GIT_WORK_TREE="$PWD/bash"
    cd bash
    git checkout bash

But each time you switch you have to switch the branch too since HEAD is in core/.git, so to work around that you can share the object dir.

Single monorepo with orphan branches, multiple clones but shared object dir, multiple work trees:

    mkdir core
    cd core
    export GIT_OBJECT_DIRECTORY="$PWD/.git_objects"
    mkdir bash
    cd bash
    GIT_DIR="$PWD/.git" git init # otherwise .git will be created next to .git_objects
    git checkout --orphan bash
    touch PKGBUILD
    git add PKGBUILD
    git commit -m "Add package: bash"
    cd ..
    mkdir readline
    cd readline
    GIT_DIR="$PWD/.git" git init
    git checkout --orphan readline
    touch PKGBUILD
    git add PKGBUILD
    git commit -m "Add package: readline"
    # back to bash
    cd ../bash
    git branch # just to check it's "bash", not "readline"
    # clone existing package (explicit target dir, otherwise the clone lands in "core")
    cd ..
    git clone some.where:core.git --single-branch --branch libarchive libarchive
    cd libarchive

That's off the top of my head, and well, it very much depends on what you want to achieve as a workflow.

Orphan branches remove the (seldom-used, but still used) ability to commit on multiple microrepos at once.

Haha definitely!

That's a trade-off I'd be willing to make though given other advantages git offers. I wonder how mercurial would fare on that regard (possibly enhanced with a bespoke extension). Not that there's any reason for your tooling to change since it fits the bill so well :)

Anyway for archmac I went the boring way and stuffed everything as subdirs inside a single monorepo as I reckoned it'd just be easier for contributions. Definitely not the same scope as Arch Linux though.

FWIW there is a move towards git in Arch, which is happening slowly and I'm not sure what it'll end up looking like (probably 1 repo per pkgbuild on a custom git server, just like the AUR works right now). I was mostly talking about why svntogit looks the way it does :)

I often think about this issue of monorepos... git is such a wonderful tool, but the problems of subrepos and monorepos come up so often it's surprising there isn't a definitive answer in git.

Microsoft doesn't really use a monorepo (although they may have done so in the past). Everything is moving to git, and while they've invested significant effort into getting git to handle larger repos, you don't have everything living in one repo quite the same way that Google does it.

My understanding of Google's monorepo is that search, mail, maps, etc. all live inside one repo.

Uh, they moved all of Windows into a single git-repo. It used to sprawl over 40+ Source Depot repos but now it's one gigantic monorepo [1].

[1] https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

Right, but Office, SQL Server and Azure don't live beside it in the same repo. Windows is one project. The FreeBSD base system lives in one repository but I wouldn't call it a monorepo.

> On the other end of the spectrum, a colleague of mine recently told me about when his previous company (big, 100k+ employees) were starting to adopt git, some people seriously considered having a separate repository per file! "Because then you can just set up your project as using version X of file A and version Y of file B"

This basically sounds like CVS.

If that is how you were using CVS, it's no surprise that it gets so much hate.

OTOH, I think the common CVS workflow actually matches the modern "we don't do stable branches" workflow a lot better than git does. Basically, if you had upstream CVS branches for more than released versions of software in maintenance mode you were doing it wrong.

I also tend to yearn for the days when I didn't spend 20% of my time rebasing and merging patches together, or rewriting dozens of patches worth of git history in order to move a couple minor commit hunks between patches for some reviewer. Or just juggling 20 different -next style remote repos.

Git is one of those tools that let you endlessly play with your tools rather than getting the job done.

> Git is one of those tools that let you endlessly play with your tools rather than getting the job done

I vehemently disagree with this sentiment. In fact, I find the opposite to be true. Git is the first VCS I used that is useful during coding instead of after it, when it's time to publish the final product, ie. a changeset. I can commit, merge, branch, rewrite and share changes freely and effortlessly whenever I need to without committing a bunch of crap to the shared repository that is of no interest to anyone.

After I'm finished getting shit done, I can then spend some time reviewing and thinking about the logical progression of changes so that the commit log will be readable to other people and older me. This is often just as valuable as writing the code itself.

What workflow do you use with git that has you doing so much juggling?

Pick one. I have yet to encounter a workflow that did not introduce tons of extra fiddling and stupid busywork.

http://scottchacon.com/2011/08/31/github-flow.html is the style I favour. It certainly involves steps but every distinction it forces you to draw feels meaningful (at least to me), with the possible exception of the commit/push distinction which is a performance optimisation (I certainly wouldn't want to have to wait for a push on every commit).

Exactly the same workflow you were using with CVS?

The argument against this is that other external-to-Google-but-started-by-Google projects like Golang and Android use the multirepo model, with gerrit.

There are pros and cons to each - do you want to have a hugely churning "I always have to rebase/merge" repo under you, or multiple repos and trouble keeping them in sync?

Having done both, I'm not sure which is better - it's probably very project specific.

Android uses the multirepo model not because they want to, but because they have to. You can't have a single repo of Android's size in git and still have it run snappily. All the tools, including the old repo tool and the new tooling around gerrit, exist to make the actual multirepo model work like a single repo.

Yes, and lack of proper tools is of course the main reason against a monorepo. Google had to develop most of its dev tools internally.

I always find it a bit funny how monorepos are now this big exotic new age thing. "Monorepos are the future!".

That very well may be, but I think we're at a point where the majority of developers started after git came out (since the industry experienced explosive growth just in the last few years). They kind of forget (or didn't know) that it's not that long ago we did monorepos because we HAD to. The tooling to do a system in multi-repos was just not there. It wasn't practical.

Companies like Microsoft, Google and Facebook predate the days when building your company on top of 3000 repos was practical, and they certainly were not going to convert everything if they could help it. Thus, they built an enormous amount of tooling to make it work. It certainly has benefits (and tradeoffs). With similarly advanced tooling to support you, multiple repos also have a lot of very nice properties and scale quite nicely.

To each their own.

> They kind of forget (or didn't know) that it's not that long ago we did monorepos because we HAD to.

That was never ever the case. We did centralized repos because we had to, but every single place I worked at, whether they were using SCCS, RCS, CVS, SourceSafe or SVN, had multiple repositories for different projects. No place had more than 20 repos, but then, the largest of those was about 250 people with VCS access.

Most places I worked at had everything in one repo and folders for projects /shrugs. Though some did have multiple repos, though in my book, that's just "many monorepos". 20 repos for 250 people was quite manageable with old tools. When one system is split in 4000 repos, that's a different story.

I can't wait for someone to implement a C interpreter in Haskell or something and boldly proclaim that imperative programming (called by another name, of course) is the way of the future.

Like Golang?

Using multiple repositories seems like the most natural workflow to have. It's easy to make a new one; it's lightweight to do, and it allows your code base to scale naturally. You can set permissions for each repository, so if you did include some sensitive code within one repository or another, it's easy to narrow the access to them. (Yes, don't include sensitive code in a repository—but in an early-stage company, you or your engineers may not know that.)

Multiple repositories make for a forgiving structure for your code base. You can tailor them however you like.

But once you have a _lot_ of code, they become hard to manage. I see the utility of a monolithic repository there—now you know exactly where all your code is: it's in this one repository!

Package managers mitigate a lot of the trouble with pulling in internal dependencies from other repositories. Nowadays, most languages have a package manager that can work with a private codebase, so monorepos aren't necessary to help with that. But monorepos can help if you have a ton of versions floating around and you don't want to support version 1.5.x of libjohnny when it's now on 4.3.x. Your code either works with libjohnny as it is right now, or it doesn't. (Which in turn makes it very clear to you how important it is to manage API-breaking changes!)

This feels a little bit rambling, but my thought is that there is some analogy between monorepos and microservices; don't use them until you _need_ to use them! You'll know it when you get there.

With multiple repos, how do you solve the issue that the private dependencies installed by package managers are not themselves under source control when you edit them? They are in /vendor or /node_modules or whatever. It's easy to pull them down and of course, for other people's dependencies, this is fine. But say I pull out one component of my app that's used by multiple apps. I set up my private repo or maybe the package manager works straight with git. But when it pulls the dependencies down, it doesn't pull the git subdirectories and so that code is not under source control. I make edits, test, then I have to copy that code back to its original repo, resolve any conflicts, and check it in. If I need to make a change, I have to do that again. If someone worked on that code, I have to pull that and merge it in separately. Even if I manually pull it myself in the right place, I'm still dealing with a git repo inside a git repo but the two are unrelated so many tools won't work with the inner repo. Short of writing a bunch of custom scripts, is there a standard way to handle this situation that I assume anyone with multiple repos that share internal, private dependencies has?

Any changes you need to make to internal libraries you install via package manager would need to be made to those separate repositories that house the libraries, and then those changes would have a release made (generally this is done through a git tag, and using semantic versioning). Once you update the version, your package manager will allow you to install that update to the other places you're using it, so all you need to change is the package file.
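As a rough sketch of the release side of that flow: tag the library repo with semantic versions, then let tooling pick the newest one. The three version numbers, the file name and the demo identity are made up, and real package managers have their own resolution rules that this does not model.

```shell
set -e
lib=$(mktemp -d)
git init -q "$lib"
cd "$lib"
echo "code" > lib.txt
git add lib.txt
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "Initial library"

# Tag three hypothetical releases on the same commit, just to have something to sort.
git tag v1.2.0
git tag v1.9.3
git tag v1.10.0

# Version-aware sort, newest first; a plain lexical sort would rank v1.9.3 above v1.10.0.
latest=$(git tag --sort=-v:refname | head -n 1)
echo "$latest"
```

A consumer would then bump the version in its package file to whatever release it wants to pick up.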

It might seem like a pain to do it this way, especially if you're rapidly iterating on a library—it might be that your library is not really mature, or even used by more than one repository, so it may not even make sense to have that library in a separate repository to begin with! But once you do have a mature code base, semantic versioning is a really sane way of managing dependency updates for the N number of other projects which use your library.

And how do you test if a change in one of these repos fixes the problem you’re seeing? This is what we’re failing at with our multirepo.

That and resectioning code to split or combine responsibilities in different ways. Something a monorepo makes trivial.

Ideally, your library code that you're pulling in has some unit testing to demonstrate that things are working as they should be. (If not, consider adding unit testing! It's really useful!)

If that is the case, then you can isolate the likelihood of an issue as either in the library (because unit tests fail there), or in the project consuming the library (because unit tests succeed in the library).

Without testing, it's hard to have a ton of confidence in where the problem lies—which is exactly the problem you cite. And while a monorepo (or just a standalone repository with no separate libraries, which is frankly an easier setup to manage than monorepos or multirepos!) may make debugging a bit easier, it's not going to give you much more confidence in your code.

Once your organization outgrows the paradigm where you just have N standalone projects with N completely separate code bases, and you do need to commit to either a multi-repo or a mono-repo configuration, it's really, really helpful to have unit testing to allow you to isolate where issues are.

The class of problems I’m talking about are integration issues. Unit tests look good but when you put the pieces together...

When the tests that matter cross version control boundaries you pay for it. Whether the costs outweigh the benefits is something you have to think about.

What does your test coverage look like? Perhaps you're missing something there that would have caught that bug?

Testing is, of course, no silver bullet. Tests are written by humans, and humans make mistakes—and it's pretty difficult to achieve 100% test coverage in a production system. The goal of testing is to have confidence in the code you've written.

Tests often don't need to cross version control boundaries. You can use mock data—like would be produced by the library—on the consumer side, because you can delegate responsibility for testing of that library to the library repository itself. If your tests work great with the mock data, but things are still failing, then you can infer that the mock data and the actual data are different, and your bug is in the library.

For instance, I have a piece of code in a particularly gnarly module that at this moment is disabled and has been for two sprints due to emergent behavior. First sprint it had adequate unit tests but not enough functional tests to exhibit a problem. Second sprint I fixed the testing deficiency and got the code to work end to end.

Or so I thought. The first time I turned it on in preprod I couldn't turn it back off again because some piece of data that came from five function calls away was being shared, and nobody who participated in the PR recalled that fact.

Most of the code I'm dealing with is in a single module. I have been chipping away at fixing the insane ball of mud as I can. My coworkers often aren't that lucky. They come to me for advice on how to deal with this sort of problem but crossing two or three modules.

There's no low-friction way for them to fix any of this. They can't just refactor because of the coordination costs, and also the loss of historical information when you move a block of code across module boundaries or try to change module boundaries. This is the prime argument for monorepos in the literature - not making irreversible decisions on Law of Demeter problems. It's not my biggest reason, but it's sufficient for most people.

Yeah, I sympathize—that's a tough situation.

I think there's two ways you can look at your choice of configuration: ease of debugging, and ease of organization. When Google lays out why they use a monorepo, they are doing so because it simplifies their organization—there are no longer so many versions of so many libraries and apps they need to support; there's only one version of anything to support. Either everything works or everything fails.

But in your case, you're looking at it from the debugging point of view. It's easier to play around with the code in a monorepo. And that's a totally fair point of view to have, particularly in your predicament.

That choice of a monorepo doesn't necessarily improve the quality of your code organization and interoperability. It's still going to be a bad bug to fix. It's just a little bit easier to debug.

>Tests often don't need to cross version control boundaries. You can use mock data—like would be produced by the library—on the consumer side, because you can delegate responsibility for testing of that library to the library repository itself.

I take it as a rule of thumb that a more realistic test is better than a less realistic test.

I can't think of any reason why you would want to mock anything if using the real thing is cheap and easy, building mocks is expensive, and testing against the real thing will increase the chances of detecting real bugs.

I also prefer it when my tests detect bugs in other libraries which my code depends upon because as far as users are concerned, bugs in libraries my code depends upon are bugs in my code.

I have in the past written a bunch of functional tests which check out / pull and build code from other repos to run with my code.

Indeed, plus using a mock adds a whole new class of bugs, arising from the mock not precisely mimicking the real component or subsystem. As the sorts of bugs you are trying to flush out in integration testing generally result from unforeseen interactions between components, it is more likely than chance that the mock will not behave in a way that will reveal the presence of the error.

Locally reasonable decisions can have globally unreasonable effects.

Some (most?) package managers allow you to install to your local cache for local testing. Nuget and Maven allow this at least.

Back in the day, eclipse would handle this fine - you normally have the jars, but if you need to edit the underlying project you can tell it that the jar is this folder on disk (which is a different repo), and it loads it correctly as another project in the workspace, replacing the jar.

Yes, tho I'd go further and say semver is table stakes -- and so is determinism (achieved on the library-consuming side by committing and maintaining the yarn.lock file, in the specific case of node projects).

No, semantic versioning and package management is really not a sane way to manage development. You're saying every time I make a code change, I should release a patch version BEFORE I can even test it, since I cannot test that dependency by itself. I'm sorry, but that's just insane, not to mention time consuming. And yes, my libraries are used by multiple apps. Why else would I need to split them out as dependencies? Actually, after thinking through this, the only solution I can think of is symlinks and those would work with a monorepo or multiple repos just as well.

Sorry—I just saw this! But, as it was, I was not saying that you should release a patch version every time you make a code change. That would indeed be pretty arduous!

What you should do is test your library independently from the consumer that uses it. If that's not possible, ask yourself why that is; maybe this code should not be sequestered into a library after all, or maybe it needs to be designed a bit differently to remove some of the coupling between the library and app that seems to be at issue.

Ideally, your testing in your app should not be testing that the library works; it should be testing that your use of the library works. Too bad the real world is much messier than the ideal world.

Good luck either way!

> Short of writing a bunch of custom scripts, is there a standard way to handle this situation that I assume anyone with multiple repos that share internal, private dependencies has?

You write custom scripts to reïmplement everything you'd get for free with a monorepo! Having worked at organisations with a monorepo and with many repos, I can confidently say that any team which is using multiple repos is very probably wrong — and the more repos, the more likely wrong they are. If you have more repos than team members, you are almost certainly wrong. You end up spending far more time managing cross-repo dependencies and changes than you would merging changes in a monorepo.

Multiple repos: not even once.

We used git submodules. The "app" repo didn't contain any code, just configuration and the submodules, which are essentially links to other repos (the app modules, in this case). Those libraries could be freely linked from multiple apps, of course.

Disadvantage: it's not a package manager; it doesn't read all the dependencies from each library, resolve the duplicates and install them centrally. Instead of that, we only had submodules in the top repo (the app) - a bit like having a single requirements.txt file.
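For illustration, a minimal sketch of that layout using throwaway local repositories. The names (app, modules/, libjohnny) are placeholders rather than our real setup, and the protocol.file.allow override is only there because the demo clones from a local path instead of a server.

```shell
set -e
demo=$(mktemp -d)
ident="-c user.name=demo -c user.email=demo@example.invalid"

# A hypothetical shared library living in its own repository.
git init -q "$demo/libjohnny"
cd "$demo/libjohnny"
echo "lib code" > lib.txt
git add lib.txt
git $ident commit -q -m "Initial library"

# The "app" repository holds no code of its own, only links to module repos.
git init -q "$demo/app"
cd "$demo/app"
# Recent git versions block local-path submodule clones unless this is allowed.
git -c protocol.file.allow=always submodule --quiet add "$demo/libjohnny" modules/libjohnny
git $ident commit -q -m "Link libjohnny as a submodule"

# .gitmodules records the path and URL; the commit pins the exact library revision.
cat .gitmodules
git submodule status
```

Every submodule is listed in the top-level .gitmodules file, which is the "single requirements.txt" flavour mentioned above.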

I didn't notice a link to "Software Engineering at Google" which is a great article that goes into the monorepo argument as well as a lot of other cool practices.


The SRE book is also a great resource:


[disclaimer: I work at Google]

Even more relevant IMHO is this: Why Google Stores Billions of Lines of Code in a Single Repository


wait hang on...

> At Google, we have found, with some investment, the monolithic model of source management can scale successfully to a codebase with more than one billion files, 35 million commits, and thousands of users around the globe.

so each commit adds on average over 28 new files?
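The back-of-the-envelope division, using the rounded figures from the quote (one billion files, 35 million commits) and treating both as exact, which they certainly are not:

```shell
files=1000000000
commits=35000000

# awk handles the floating-point division that plain shell arithmetic cannot.
avg=$(awk -v f="$files" -v c="$commits" 'BEGIN { printf "%.1f", f / c }')
echo "$avg"   # prints 28.6
```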

There are probably commits that add hundreds or thousands of auto-generated files, raising the average.

Also, it depends on how much you squash - my latest 10 commits turned into just one when merging to master.
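A small sketch of that effect, with made-up branch and commit names: three work-in-progress commits on a feature branch land on the main branch as a single commit.

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
# Helper so the demo does not depend on a configured git identity.
commit() { git -c user.name=demo -c user.email=demo@example.invalid commit -q -m "$1"; }

echo "base" > file.txt
git add file.txt && commit "Base"
main_branch=$(git symbolic-ref --short HEAD)   # master or main, depending on git config

# Three small commits on a feature branch.
git checkout -q -b feature
for i in 1 2 3; do
  echo "step $i" >> file.txt
  git add file.txt && commit "WIP $i"
done

# Squash-merge: stage the combined change, then record it as one commit.
git checkout -q "$main_branch"
git merge -q --squash feature
commit "Add feature"

git rev-list --count HEAD   # 2: the base commit plus one squashed commit
```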

I initially found the idea of monolithic repositories hard to digest. But now I think it's a good idea for some of the reasons outlined in the article. Namely, it's very easy to depend on other code that the organization has created.

In the open source world, I have found some Unix distros use the same model. I know it's not as extreme, but the principle is quite similar. For example, in Nixpkgs all package definitions (which are actually code in the functional language Nix) are in the same repository and thus they can all depend on each other in a very easy and transparent way.

> I initially found the idea of monolithic repositories hard to digest. But now I think it's a good idea for some of the reasons outlined in the article. Namely, it's very easy to depend on other code that the organization has created.

I remember the days when monorepo was the norm, and distributed version control was the weird, kooky idea. Mainstream programmers had knee-jerk notions that all managed environments were too slow.

For game development, monorepo is simpler. If one is using git, one needs to use some other software to turn the part of your repository for media into a monorepo, otherwise the asset files become a burden. (git-annex, for example)

Do you really mean a monorepo, though, or just a project repo whose scope is one entire game? Because a monorepo for multiple games - including released and in progress ones - seems likely to create a lot of pain in the long term. The release cycle for games seems much more suited to a release branch model, which would kind of require per-game repositories, and some sort of package versioning for common dependencies. I guess maybe with things like mobile games where you have a constantly moving target platform even ‘released’ games are live code so maybe I’m just betraying an outdated ‘gold master’ kind of mindset here?

>For game development, monorepo is simpler. If one is using git, one needs to use some other software to turn the part of your repository for media into a monorepo, otherwise the asset files become a burden. (git-annex, for example)

I think this has more to do with how git handles diffs than with monorepo vs. distribution. As you said, git lfs solves the issue with centralization, but that's not the same as a monorepo. You can still split all your libraries out in such a system without issue.

> Namely, it's very easy to depend on other code that the organization has created.

On the other hand, I've seen the other sides of this in monorepos:

1. It's too easy to depend on code, so there is dependency bloat when something simpler would work just as well.

2. It's relatively hard to depend on things not in the repo, reinforcing not-invented-here culture.

2 is simple. Import everything you need to depend on into the repo. Google has a 3rdparty directory in its monorepo for this.

That's an approach. Don't know if I'd call it simple. Your monorepo would start looking like an artifact repository at some point, with multiple versions of products.

And Google's approach, I'm sure, requires a bit of standardization and tooling investment. It's not clear to me that equivalent conformance and investment in monorepos and a package manager wouldn't work just as well.

>Your monorepo would start looking like an artifact repository at some point, with multiple versions of products.

It shouldn't. One of the big gains of a monorepo is that there's only one version of everything. You don't need to version your dependencies within the repo, which means you only need to maintain one version of any external dependency.

I wouldn't call getting every project, internal or external, on the exact same version of every dependency "simple". That's a lot of hand waving around a really hard problem.

It's hard to do if you don't start out that way early on, yes. I don't really think it's hard to maintain that state.

With multi-repo environments, btw, you still need to do that kind of dependency version management for certain upgrades. Essentially you can desync, but you have to occasionally re-sync everything. In a monorepo environment you can just prevent desynchronization.

It's a hard problem, but it's not necessarily harder than other approaches to dependency management. There's pretty much no approach to dependency management that isn't hard in one way or another.

A large chunk of third party usage is open source Google libraries like Guava. Anecdotally, I don't end up using that many libraries from third party.

As for JavaScript, the hassle of importing the entire transitive closure of an NPM library you want into third party means it's much more attractive to go with NIH syndrome. I looked at importing ESLint, but it has something like 110 dependencies.

The catch is the tooling. If you have the time and resources to make the tooling that is necessary to make it work specifically for your org, then great!

But if you don't, then a monorepo will generally slow you down because it will require coordinating changes across a much bigger group of people.

Monorepos are great for very small companies with a low communication overhead, and very large companies with the resources to build the tooling to make it work.

For everyone in between, I feel that small repos and microservices give the best developer velocity.

But I can say the same thing about multiple repos!

I've seen more than one company now that has had the same problem: how do they patch atomic cross-repo changes onto their multiple git repos? The reasons for this can vary, but the core problem is always that. As far as I see it, there are two solutions:

- Use a monorepo

- Create some external database that ties multiple hashes together for use in your ecosystem. This also requires re-inventing bisect on top of this database. I'm sure the intelligent people of HN can come up with the multitude of other tools you need to modify to make this work, but it's not trivial.

If you're willing to manage that overhead somehow, that's fine, but I can't imagine it's fun.
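To make the second option concrete, here is a minimal sketch of what "re-inventing bisect" over such a database might look like. Everything here is hypothetical: the manifest format (repo name mapped to commit hash), the function names, and the assumption that manifests are recorded in order and that once a manifest is bad, all later ones are bad too.

```python
from typing import Callable, Dict, List

# Hypothetical manifest: one snapshot of "which hash of each repo goes together".
Manifest = Dict[str, str]

def first_bad_manifest(history: List[Manifest],
                       is_good: Callable[[Manifest], bool]) -> int:
    """Binary-search an ordered manifest history for the first failing entry.

    Assumes history[0] is good and history[-1] is bad, like git bisect does
    for commits.
    """
    lo, hi = 0, len(history) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(history[mid]):
            lo = mid
        else:
            hi = mid
    return hi

def changed_repos(before: Manifest, after: Manifest) -> List[str]:
    """List the repos whose pinned hash differs between two manifests."""
    return sorted(r for r in after if before.get(r) != after[r])

# Made-up example data: the build breaks when "web" moves from w1 to w2.
history = [
    {"api": "a1", "web": "w1"},
    {"api": "a2", "web": "w1"},
    {"api": "a2", "web": "w2"},
    {"api": "a3", "web": "w2"},
]
bad = first_bad_manifest(history, lambda m: m["web"] == "w1")
print(bad, changed_repos(history[bad - 1], history[bad]))
```

This only narrows the failure down to a manifest and the repos that moved in it; you would still have to bisect inside each of those repos afterwards.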

> Create some external database that ties multiple hashes together for use in your ecosystem. This also requires re-inventing bisect on top of this database. I'm sure the intelligent people of HN can come up with the multitude of other tools you need to modify to make this work, but it's not trivial.

The thing is, there are already many mature well understood tools that you're probably already using in your organization whose goal is to tie multiple hashes together. Back in the day they were called package managers and had names like dpkg and rpm, and we'd use something like "pkg-config" to link to a specific one. These days we have docker and nix and a hundred language-specific dependency managers to solve every little variation on "tie these specific hashes together".

Monorepos still require tooling to manage effectively, but that tooling is at a disadvantage because it's not already being ubiquitously used. And unless you're bringing your entire dependency tree, recursively, all the way down to your OS, into the monorepo then you're still going to have to be dealing with those external dependencies one way or another.

And if you're not versioning the world, then it's really just an argument about the appropriate size of a unit of functionality in a multi-repo, and in that case "small enough to be well supported by existing tooling" isn't a terrible upper bound to pick for most people.

Eventual consistency. The great thing about multi repo is the ease of decoupling the pieces so they can evolve separately (you can do that in monorepos too, but it's not quite as natural).

You're free to PR changes gradually, making sure things work a couple of repos at a time, until you eventually get everything. If you can tolerate temporary inconsistencies, it allows you to scale to infinity, essentially for free.

>Eventual consistency. The great thing about multi repo is the ease of decoupling the pieces so they can evolve separately (you can do that in monorepos too, but it's not quite as natural).

I don't see how you can do this any better in a multi-repo than a monorepo though, unless you mean to the extent of simultaneously having multiple versions of the same library in your transitive deps (and thus kind of kludgily sidestepping the diamond dependency issues). Would you mind elaborating?

>You're free to PR changes gradually

This is possible in a monorepo too, by much the same means, I'd expect: you define an adaptor that you slowly migrate everyone onto, deprecate the old thing, and then optionally remove the adaptor and deprecate it too. Am I missing something?
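A rough sketch of the adaptor approach described above, in case it helps. The function names and the cents/currency change are made up for illustration; the only real API used is Python's standard `warnings` module.

```python
import warnings

def send_payment_v2(amount_cents: int, currency: str) -> str:
    """Hypothetical new API: integer cents plus an explicit currency."""
    return f"{amount_cents} {currency}"

def send_payment(amount_dollars: float) -> str:
    """Hypothetical old API, kept as a thin adaptor over the new one.

    Callers keep working while they migrate; the warning nudges them along.
    Once nothing calls this, the adaptor can be deleted.
    """
    warnings.warn(
        "send_payment is deprecated; use send_payment_v2",
        DeprecationWarning,
        stacklevel=2,
    )
    return send_payment_v2(round(amount_dollars * 100), "USD")

# Old call sites still get the same result through the adaptor.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    print(send_payment(12.5))  # prints "1250 USD"
```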

> I don't see how you can do this any better in a multi-repo than a monorepo though, unless you mean to the extent of simultaneously having multiple versions of the same library in your transitive deps (and thus kind of kludgily sidestepping the diamond dependency issues). Would you mind elaborating?

Leaf repos (projects nothing depends on, like apps) can do literally whatever they want at any time without affecting anyone else. Right there is a big win. Dependencies can have multiple versions and the leaf can depend on whichever version they want at any given time. You can do the same thing in a monorepo if everything is in independent folders, but that's just the worst of both worlds.

> you define an adaptor that you slowly migrate everyone onto

No no. The way we do it is: make breaking change, people upgrade to it whenever (with gentle pushes so that we're eventually all on the same version sooner than later). No adapter, no transient state within a service. Just upgrade repos one by one until you got them all, no magic involved. This assumes that your repos represent loosely coupled components (micro services or micro apps).

These two things you state only work if you have no shared deps. If I depend on A and B, and B also depends on A, I can't use whatever version of A I want, it has to be compatible with B. This means that I'm forced to delay my upgrade of A until B has done so.

The longer the chain of deps is, the worse this is. Even if everyone takes just a couple of days to upgrade, which ime is generous, your leaves end up being forced to wait weeks to upgrade in the worst case.

Depends: if B declares that it is compatible with both versions of A (because it uses methods that did not break), then you sure can upgrade whenever you want (and potentially start using new methods that contained breaking change).

If the change is breaking for B, then yeah, you have to wait until B upgrades (or you can upgrade it yourself!). During that time, other projects that don't depend on B can go on using the new A.

The alternative is "stop the press, everyone is upgrading to the latest A NOOOOOOOOOW", which, if the change is not automatable, might be unrealistic, be pretty large in scope, or require you to never make drastic changes in A (which is tricky if A is a 3rd party you don't control). You can also just have a long-running transient state where A is somehow always compatible with everything via compatibility wrappers.

It's tradeoffs. I like our world where we don't have to migrate everything at the same time all the time, even with drastic changes. Works quite well for us, with thousands of repos and 10s of millions of lines of code. We enjoy the flexibility. It makes certain cross-project efforts harder. That's the tradeoff.
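The "B declares that it is compatible with both versions of A" case above can be sketched as a set intersection. This is only an illustration: the package names and the idea of declaring accepted major versions as plain sets are assumptions, not how any particular package manager does it.

```python
from typing import Dict, Set

def allowed_versions(declared: Dict[str, Set[int]]) -> Set[int]:
    """Intersect the major versions of a shared dependency that each package accepts.

    `declared` maps a package name to the set of major versions of A it
    claims to work with. An empty result means someone has to upgrade first.
    """
    sets = list(declared.values())
    allowed = set(sets[0])
    for accepted in sets[1:]:
        allowed &= accepted
    return allowed

# The leaf wants A v2, and B works with both v1 and v2: the leaf can upgrade now.
print(allowed_versions({"leaf": {2}, "B": {1, 2}}))  # prints {2}

# B only works with A v1: no common version, so the leaf waits for B (or patches it).
print(allowed_versions({"leaf": {2}, "B": {1}}))  # prints set()
```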

The diamond problem is more due to the lack of tooling to alert you when that's going on (tooling on the build side and on the CI/CD/Jenkins side). Jenkins should be able to kick off other builds that depend on what's being built.

Developers are way too lazy for this to work in practice. Imagine every second commit you make has to be backwards compatible. That is a great way to effectively stop any refactoring of your code base, unless all your interfaces are already perfect and very fixed and you consider this a feature. Or people will not care about partial cross-repo dependencies, resulting in randomly broken builds for others for a short time. I don't want to pull down a broken build just because some other team was only halfway through pushing a feature at that moment. What should I do about that? Retry pulling all repos and rebuild again?

I think the problem is that people misuse their version control as a package manager. If you want that type of behaviour, with semver and all to manage compatibility, just use a package manager to manage your dependencies, not a multirepo git contraption. You can still store your packages in one repo each if you like, but at least you now get some control over interface compatibility, which is a requirement when you have multiple repos.

> Imagine every second commit you make has to be backwards compatible

No. The whole point is that they don't have to think about it. Aside from micro services used machine to machine with breaking changes that cannot exist in parallel, the point is that they don't have to worry about this. Do whatever you want, other projects do whatever the hell they want, and we just slowly go toward consistency as it becomes convenient until everyone's the same. Repeat. I can upgrade a dependency with a breaking change to my repo. The other team isn't ready yet so they keep using the old version (which works) until they are.

This is great in theory, but it requires _a lot_ of discipline and responsibility around dependency management from the developers.

What ends up happening in practice is that all these seemingly independent components have very strict—and sometimes even unspecified—dependencies between each other.

People create "base" packages that all other packages have a strict or loose dependency on, so when that package changes, it's a guessing game if it introduced breaking changes downstream.

With a monorepo, these dependencies are tracked and always visible, and integration testing between all dependent components becomes much easier.

It's very unlikely that you'll find truly decoupled code bases within an organization. It goes against the point of grouping people to work on a common goal to begin with.

> This is great in theory, but it requires _a lot_ of discipline and responsibility around dependency management from the developers.

Why? Again, the point is eventual consistency. If I make a breaking change, apps/services that are ready to migrate do so. The ones that aren't keep using the old version. Eventually, everyone's on the same page. The whole point is that it requires a lot less discipline.

> how do they patch atomic cross-repo changes onto their multiple git repos

If you need atomic cross repo changes, then you're doing multi-repo wrong. You need to have an upgrade path, so that you support both old and new in parallel, so you can upgrade one repo at a time.

You'd need that anyway during deployment.

I don't particularly see why you need that during deployment. And maintaining an upgrade path can be very dangerous. It's both a bunch of added complexity and a very easy way to introduce subtle issues where upgrading is fine and dandy, but where it's impossible to revert a breaking change, because the "change" happened over the course of 3-4 commits and so a revert puts you in an invalid situation, so you're forced to roll back entirely instead.

Yeah, the catch is always the tooling. It's like GitHub Flow. On the surface, it sounds simple - run everything from master. But in reality (https://github.com/blog/1241-deploying-at-github) it's Hubot, Janky, locking, monitoring, merging API, etc.

The thing people need to remember is the whole labor force of a normal company is a rounding error at Google and Facebook.

It will slow you down because you're forced to confront the consequences of your change in real time, not down the road when someone notices something changed somewhere else that may or may not have broken their invariants in ways they now have to figure out.

Tooling isn't the issue here, you're basically arguing for kicking the can down the road and encouraging a buildup of technical debt.

That's not true at all.

When you write code, do you put everything into a single function, or do you use separate functions for different parts of the code? Multi-repo is just the extension of that.

Software quality is orthogonal to your development methods. You can have good quality multi-repos or bad ones, and you can have good quality mono-repos, or bad ones.

When I write two functions in the same project I get a compile error if I change the signature of one without changing the call site. This is analogous to a monorepo, in which I have projects in separate directories but with a unified build.

If you really want to take this analogy to its logical conclusion for separate repos, you're gonna have to do something more like compile every function as a dynamically loaded library. And then hope you notice a function signature change because the dynamic linker won't tell you.

I keep hearing the tooling argument in discussions about monorepos, and I'm not entirely sure what is missing in this area when compared to projects with multiple repositories.

Tools like Pants[1] or Lerna[2] solve a lot of the issues related to builds and dependency management.

What exactly are you missing _today_ that prevents your organization from adopting a monorepo approach?

[1]: https://www.pantsbuild.org/

[2]: https://lernajs.io/

Isn't the point of the article that you need less tooling with monorepos, not more (examples given: cross-repo changes, code searching, git bisect, dependency management)?

Now, you might be referring to the fact that Facebook and Google have built up lots of tooling to help git/hg scale since their repos are too large and operations take too long. But that is not a problem you're going to have until you have at least a thousand engineers. At that size you need tooling for everything anyways.

I think that the question of whether or not to use a monorepo is not a function of "how to organize code for efficient coding". It's more of an actual organizational question; how and how often do you deploy, what are the actors and their needs, and what's the domain? Are you producing statically linked C libs? interrelated NPM packages? SaaS?

If you look at monorepos from a package management standpoint, they are usually just high-level graphs of dependencies. Or, more accurately, sources of dependencies. How these are rolled up, shipped, and ultimately deployed is a function of the operational culture and business needs more than it is source control or even language choices, in my opinion. Business needs impact source control in any sufficiently complex, source-controlling org.

That's not to say that, for example, small companies benefit from monorepos, while large ones benefit from small repos and packages (or the opposite). I think the pros and cons are entirely decoupled from codebase size and complexity. In order to do one or the other well, you need the right business needs, operational parameters, engineering culture, and resources. So, I always find it interesting to read about Google or Microsoft leveraging one, the other, or both approaches with their own codebases.

I'd say the monorepo vs. small packages choice is a little bit like rendering a frame on a CPU vs. a GPU. Do you build/test/deploy each "frame" (iteration) as a top-down, more-or-less-discrete block of work, or can you parallelize it and "ship" multiple compatible streams at once?

I suggest it depends almost entirely on the problem space and the "hardware" (business needs), far more than it does the actual code or volume of code.

Perhaps Conway's law applies here in a sense, i.e. any organization that manages source code will produce a source control management scheme that is representative of how they deploy to downstream consumers.

How do people handle the case where, in a large repository with people committing almost constantly, pushing your changes to a git server on the other side of the world becomes quite tricky? By the time my push gets to the server, it seems someone has gotten in before me and my repo is out of date. I have to pull the new changes and try again. During busy parts of the day (my afternoon, the U.S. morning) you might have to loop this process a few times. If I were to check that the code still builds after each of those merges I'd be there all day, so there is some risk there. When we had cvs this wasn't an issue, since most changes are to completely different parts of the code base, so you don't need their changes.

You don't ever push straight to the repo. You enqueue your changeset to be pushed by a central system. At Facebook we call this "asynchronous landing" and you're right, before this became a thing about five years back, at certain times of day it was very tricky to push your changes out.

Once you switch to this model, you can do convenient things like landing straight from the review system (by a "Ship it" button), landing after some checks were successfully performed, and so on.

Incidentally, this is similar to how merging pull requests by rebasing works on GitHub.
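For anyone trying to picture the enqueue-and-land model, here is a toy sketch. It is not how Facebook's system works internally (nothing in this thread describes that); the class, the dict-based "tree", and the check function are all made up to show the shape of the idea: one central worker applies each queued change on top of the current head, runs the checks, and either lands or rejects it, so nobody races to push.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Change:
    author: str
    edits: Dict[str, str]  # path -> new file content

@dataclass
class LandingQueue:
    # Hypothetical pre-land check, e.g. "does the tree still build?"
    check: Callable[[Dict[str, str]], bool]
    head: Dict[str, str] = field(default_factory=dict)  # current tip of main
    pending: deque = field(default_factory=deque)
    landed: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)

    def enqueue(self, change: Change) -> None:
        """Developers only enqueue; they never push to main directly."""
        self.pending.append(change)

    def process(self) -> None:
        """Land queued changes one at a time, each on top of the latest head."""
        while self.pending:
            change = self.pending.popleft()
            candidate = {**self.head, **change.edits}  # stand-in for a rebase
            if self.check(candidate):
                self.head = candidate
                self.landed.append(change.author)
            else:
                self.rejected.append(change.author)

# Toy check: the tree "builds" as long as no file contains the marker BROKEN.
queue = LandingQueue(check=lambda tree: "BROKEN" not in tree.values())
queue.enqueue(Change("alice", {"a.py": "ok"}))
queue.enqueue(Change("bob", {"b.py": "BROKEN"}))
queue.enqueue(Change("carol", {"c.py": "ok"}))
queue.process()
print(queue.landed, queue.rejected)
```

Real systems also have to deal with textual conflicts during the rebase step, which this sketch skips entirely.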

This does sound like the answer. I wonder if our Atlassian solution supports this, perhaps they call it by another name.

One solution is to coordinate who 'owns' master, mutex style, using something like an intranet wiki.

When you 'hold the lock', you get to rebase and push, and 'release the lock', and if you've any sense the system will automatically ensure you haven't broken the build.

Of course, it's important not to waste time, as this serializes the commit process, as it were.

I warmed up to the monorepo a while ago, but tooling and reference material that helps guide an org into adopting a monorepo and using it properly is hard to find.

People always talk about how it takes all this tooling to have an effective monorepo— it definitely does; many moons ago I was a Googler too and experienced all the fun little perforce wrappers that would check out the portions of the tree you needed to build whatever it was you were trying to build.

But having things split across many repos takes a lot of tooling too. So really, for each approach you're looking at which portions of it are covered by the version control system itself, and where the gaps are that have to be plugged by auxiliary stuff. And of the auxiliary stuff, how much of it is standard enough to be things you can inherit pretty much off the shelf from your distro or other ecosystem vs. something you need to actually build and maintain in-house.

Other organizational priorities come into play too, like the importance of open source in your codebase, and what your relationships are to your upstreams, if you have them. It actually surprises me that there aren't more/better tools out there that help with synchronizing commits (or portions of commits) in and out of external standalone repos. The main patch-management tool I'm aware of is Debian's quilt, which pretty much just boils down to a handful of bash scripts and arcane conventions. Why isn't there more stuff in this space?

The difference is that multirepo large project dev and dep mgmt are problems solved today with existing, widely used tooling. With monorepo, I’m DIYing scalability onto a system not designed for it.

Oh, I totally agree with you. I do robotics, and my company inherited a multi-repo approach kind of by default from our upstream ecosystem (ROS). So we get a lot of the benefit of our upstream having tackled a number of the issues around having a product made up of hundreds of repos.

Now, our upstream doesn't _actually_ ship very much, and what they do ship is relatively slow moving. So we've had to extend the supplied tooling in various ways to truly meet our needs.

Thinking that git is the only option is a harmful and pervasive form of Stockholm syndrome.

Thinking that SVN is an option is even more harmful.

I'm supportive of this approach; it's been an interesting shift from "multirepo". I think an advantage does come when a common language is used for projects, which is what I am familiar with. It can make it easier to enforce design patterns and re-use common code/dependencies (grep, text search, etc all in the same place that you're working in); which I think has a large positive effect on "loading working context" for whatever project you are working on - when there are common phrases and patterns and libraries used, it is easier to discuss with or to get advice from co-workers.

I personally find multi-repo thinking leads to better architecture. Cross project change history seems nice, but if the projects are that coupled in the first place, why are they different solutions to begin with? If you're building decoupled code you shouldn't need cross project changes.

That said, I understand the worth of getting things done at the expense of rigor so I chalk this topic of discussion up to personal taste. It's akin to the dynamic vs static typing debate.

> If you're building decoupled code you shouldn't need cross project changes.

That's the theory, but in practice designing robust, future proof APIs has proven to be really hard in a lot of cases. You're then left with the option of supporting old APIs forever or migrating dependent code to new APIs, both of which are difficult in their own ways.

That has nothing to do with requiring an atomic cross project change. If the library you depend on needs to be updated, update it as needed.

Upgrading other projects to support the new dependency version can come in a different commit. You only run into trouble when you've set up your projects to always use the latest version of their dependencies. That's a recipe for disaster.

I think it boils down to two things that are not mutually exclusive: 1. Are you building a monolith? 2. What does your org chart look like?

I work for a fairly large company where a monorepo doesn’t make a whole lot of sense because each team runs several services that get released or patched independently. If you have a large product composed of many components that need to function together as a cohesive whole, go ahead, use a monorepo.

Here is a naive question:

Let's say I want a small application with flask and angular.

I create a single repo for both flask and angular. I put everything flask in one subfolder called backend and everything angular in another folder called frontend.

WIP here: https://github.com/kusl/flaskexperiment or https://git.sr.ht/%7Ekus/flaskexperiment/

Now the problems are just starting: how do I set up ci for all my projects? Travis ci expects a single file at the root of the project and so does gitlab ci (hi sid, big fan)

I am sad I can't talk to the experts at Google about how they navigated these problems. I understand Google has many enemies that are constantly trying to exploit whatever but I still wish we could have a more open conversation here.

> Travis ci expects a single file at the root of the project and so does gitlab ci

GitLab has a related issue to support "Several .gitlab-ci.yml for monorepos" (https://gitlab.com/gitlab-org/gitlab-ce/issues/18157).

We also have an issue for running jobs only when modifications are made to a given file or files in a directory, it's tentatively scheduled for 10.7 (April's release).


Not Sid, but thanks for being a fan ;)
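Until something like that is built in, one workaround is a small script at the start of the pipeline that decides which sub-projects a commit touched. A minimal sketch, assuming the backend/frontend layout from the question above; the function name is hypothetical, and in practice you would get the changed file list from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set

def affected_projects(changed_files: Iterable[str],
                      project_dirs: Set[str]) -> List[str]:
    """Return the top-level project directories touched by a set of changed files."""
    affected = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Only the first path component decides which project a file belongs to.
        if parts and parts[0] in project_dirs:
            affected.add(parts[0])
    return sorted(affected)

# A commit touching only the Flask side should trigger only the backend job.
changed = ["backend/app.py", "backend/tests/test_app.py", "README.md"]
print(affected_projects(changed, {"backend", "frontend"}))  # prints ['backend']
```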

Is there any Free Software VCS that specifically targets monorepos?

If not, which of the "usual" VCS are best suited for monorepos? CVS? Subversion? Darcs? Bazaar? Mercurial? Git?

Should one use Mercurial simply because Facebook uses (and patches) it, or are there better choices for small-to-midsize organizations?

OMG someone is finally going to tell me about how SVN can be better than GIT in particular cases! I have been asking this question for years out of pure curiosity, yet all I've received in response were accusations of trolling.

There is a people aspect to this trend.

There is a lot of developer movement between the companies cited and it's not surprising that people take the practices they're familiar with to their new employer.

Companies incentivize people to deprecate feature X and replace it with feature Y and celebrate the win. Much harder without a monorepo.

The counterargument is open source, where development follows the distributed model, along with the difficulty of syncing the monorepo with a custom build system. Figuring out a way to leverage the QA work distro people do in coming up with a consistent cut + patches would benefit everyone regardless of repo structure.

You can successfully use this approach only if you do not have external dependencies. If you use some library, you have basically the same problem as with multirepos. I worked in companies with a monorepo and it worked really well, except that all external dependencies were copied into the monorepo, outdated and with unpatched security problems, because no one ever updated them unless some functionality was missing. Also, there were some internal libraries for logging etc. which were worse than publicly available alternatives, but it was easier to maintain them than to use something standard.

There should be " (2015)" added to the title.

(BTW, I was slightly confused not to find a single date on this blog post, and the HTTP headers were also useless. But at least there are rough timestamps at the main site.)

As a non-Google/Microsoft/Twitter person, if my org wanted to start using a monorepo, where would I begin? Or is the relevant tooling all closed?

Doesn't history become a mess with Monorepos?

Alternatively, it is a mess to track connected changes through the history of many repos. In this case, I'd think default tooling is actually better to solve the problem that the mono-repo scenario has.

I suppose you can use `git log -- path/to/specific/project` to get a specific history.
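If you want that wrapped up in a script, something like the following would do. The helper name and its defaults are made up; `--oneline`, `--max-count` and the `--` path separator are standard `git log` options.

```python
from typing import List

def project_log_command(path: str, max_count: int = 20) -> List[str]:
    """Build a git log invocation limited to commits touching one subdirectory."""
    return [
        "git", "log",
        "--oneline",
        f"--max-count={max_count}",
        "--",  # everything after this is treated as a path, not a revision
        path,
    ]

cmd = project_log_command("path/to/specific/project", max_count=5)
print(" ".join(cmd))
# To actually run it inside a checkout: subprocess.run(cmd, check=True)
```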

Relevant talk by Google at CppCon last year focused on C++, but really talks about getting rid of versioning in general.


re: "I’ve seen APIs with thousands of usages across hundreds of projects get refactored and with a monorepo setup it’s so easy that it’s no one even thinks twice."

This makes it sound much easier than it is. You still need to get approvals from all the teams whose code is touched. In the meantime, code may be changing out from under you. And the more code you touch, the more tests you need to run.

For anything but the most obviously risk-free changes (where a global owner can approve it), splitting up a large change into independent pieces and sending out a bunch of changelists in parallel will make more sense. There are tools to do that too.

Can anybody point me to a free version control tool that can be used with a monorepo?

I mean, do we simply use one Git repo with a folder for each project/library?

We build our 2-4 iOS apps out of 70 separate git repos, each creating a dynamic framework. I told some folks at Apple once and they were ROFL.

Monorepos are great if you are a monoculture. I don't find it surprising that Google, Facebook and Twitter all enjoy and benefit from monorepos. I also don't find life inside the monochromatic empire particularly appealing.

I think the benefits that massive corporations derive from monorepos demonstrate how massive corporations are a net negative to society. Imagine if, instead of an increasingly centralised and closed culture in technology, companies were small enough and interdependent enough that a decentralised model became a net positive.

Monorepo? Why not just keep everything in one directory? Let's go 8.3. Everyone in same functional division? No way. Dependencies in your source are managed at a higher level. This whole thing is nonsense because the orgs can't figure it out. If you cannot figure out git or mercurial integration it's either institutional break down or...
