Monorepos: Please don’t (medium.com)
332 points by Artemis2 3 months ago | 391 comments

My advice is that if components need to release together, then they ought to be in the same repo. I'd probably go further and say that if you just think components might need to release together then they should go in the same repo, because you can in fact pretty easily manage projects with different release schedules from the same repo if you really need to.

On the other hand if you've got a whole bunch of components in different repos which need to release together it suddenly becomes a real pain.

If you've got components that will never need to release together, then of course you can stick them in different repositories. But if you do this and you want to share common code between the repositories then you will need to manage that code with some sort of robust versioning system, and robust versioning systems are hard. Only do something like that when the value is high enough to justify the overhead. If you're in a startup, chances are very good that the value is not high enough.

As a final observation, you can split big repositories into smaller ones quite easily (in Git anyway) but sticking small repositories together into a bigger one is a lot harder. So start out with a monorepo and only split smaller repositories out when it's clear that it really makes sense.

Components might need to be released “together”, but if they are worked on by different teams, they’ll have different release processes, with different timelines and different priorities.

First of all this is normal, because otherwise the development doesn’t scale.

In such a case the monorepo starts to suck. And that’s the problem with your philosophy ... it matters less how the components connect; it matters more who is working on them.

Truth of the matter is that the monorepo encourages shortcuts. You’d think that the monorepo saves you from incompatibilities, but it does so at the expense of tight coupling.

In my experience people miss the forest for the trees here. If breaking compatibility between components is the problem, one obvious solution is to no longer break compatibility.

And another issue is one of responsibility. Having different teams working on different components in different repos will lead to an interesting effect ... nobody wants to own more than they have to, so teams will defend their components against unneeded complexity.

And no, you cannot split a monorepo into a polyrepo easily. Been there, done that. The reason is that working in a monorepo versus multiple repos influences the architecture quite a lot and the monorepo leads to very unclear boundaries.

> Components might need to be released “together”, but if they are worked on by different teams, it means they’ll have a different release process, as in different timeline, different priorities.

released "together" == part of the same feature. Timelines, release process and team priorities are all there to help to deliver features. If they stand in the way, they need to be adjusted. Not the other way around.

Multi repos encourage silos. Silos encourage focusing on the goals of the silo and discourage poking around the bigger picture. Couple that with scrum, which conveniently substitutes meaningless points for real progress metrics, and soon enough you end up with an IT department, full of processes but light on delivering value.

> And no, you cannot split a monorepo into a polyrepo easily. Been there, done that. The reason is that working in a monorepo versus multiple repos influences the architecture quite a lot and the monorepo leads to very unclear boundaries.

I think you are conflating a monorepo (where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used) with a "monoproject"/"monomodule", where there is no modularization at all.

Edit: expanded wording

No, there's no such confusion.

> where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used

Unfortunately this isn't a technical issue and that's the problem.

If the projects within the monorepo are decoupled and have clear boundaries then why not have them in separate repositories?...

In my opinion monorepos make refactoring dependant projects much easier. However it is much harder to establish and enforce clear boundaries...

With monorepos you don't have to manage PRs for 8 different repositories when adding a feature.

In my experience it's hard to establish clear boundaries, regardless of repository kind. It may be more difficult to create features which are tightly coupled across multiple repositories, but people do it regularly. And when they do, you suddenly have to manage and maintain synced features across multiple repositories.

In fact, the repo tool for the android project makes it quite easy to develop features across repositories, thus lowering the boundaries significantly.

I have a monorepo that contains a few different early-stage frontend web projects that do not interact with each other at all. They do, however, use a shared component library that is also placed inside the monorepo. Tools like Yarn workspaces make sharing the library easy if the projects are located in the same repo.

When I change something in the library, I can easily run tests across all the projects that depend on it with the latest changes of the library and make sure that my change is not breaking things all over the place, which is also pretty nice.

I am not sure yet if using a monorepo is actually the best way to deal with this kind of project, but for now it feels better than having them in separate repos and then having to deal with the complexity of sharing the library across repos by publishing it somewhere or using git submodules or something.
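The workspace setup described above can be sketched with a Yarn (1.x) workspaces layout. The package names here are invented for illustration, and the Yarn commands are shown as comments since they assume Yarn is installed:

```shell
# A minimal monorepo layout for Yarn workspaces: one shared UI library and
# one app that depends on it (package names are made up).
mkdir -p monorepo/packages/ui-lib monorepo/packages/app-a && cd monorepo

# root manifest: marks this repo as a workspace root
cat > package.json <<'EOF'
{
  "private": true,
  "workspaces": ["packages/*"]
}
EOF

# the shared component library
cat > packages/ui-lib/package.json <<'EOF'
{ "name": "@acme/ui-lib", "version": "1.0.0" }
EOF

# an app consuming the library; Yarn links it from source, not a registry
cat > packages/app-a/package.json <<'EOF'
{
  "name": "app-a",
  "version": "1.0.0",
  "dependencies": { "@acme/ui-lib": "1.0.0" }
}
EOF

# With this in place:
#   yarn install              # links @acme/ui-lib into app-a from source
#   yarn workspaces run test  # runs every project's tests against your
#                             # local library changes at once
```

Because the library is linked from source, a change to `ui-lib` is immediately visible to every app's test run, which is the cross-project testing the comment describes.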

I work on a project structured into microservices and use both. There is one global repo that includes the subrepositories as submodules.

So when someone only wants a submodule they can happily clone just that, but when someone wants everything (which is the default case), they can clone and install it all at once.

The downside is that I have to commit twice.
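A minimal sketch of that layout, using throwaway local repos in place of real remotes (the names are invented, and newer Git requires opting into file-protocol submodules with `protocol.file.allow`):

```shell
# An umbrella repo whose services live in submodules.
mkdir sub-demo && cd sub-demo

# a stand-in for an existing service repo
git init -q -b master auth-service
git -C auth-service -c user.email=d@example.com -c user.name=demo \
    commit -q --allow-empty -m "service history"

# the umbrella ("global") repo referencing the service as a submodule
git init -q -b master platform
cd platform
git config user.email d@example.com
git config user.name demo
git -c protocol.file.allow=always submodule add -q ../auth-service
git commit -qm "add auth-service as a submodule"

# The two clone styles the comment describes:
#   git clone --recurse-submodules <platform-url>   # everything at once
#   git clone <auth-service-url>                    # just one service
```

The "commit twice" downside shows up here too: after committing inside `auth-service`, the umbrella repo needs its own commit to update the recorded submodule pointer.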

> Having different teams working on different components in different repos will lead to an interesting effect ... nobody wants to own more than they have to

... and so nobody really understands how all of the components tie together and as a result it takes weeks of manual testing to release.

My rule of thumb is: if you need to do PRs in several repositories to do one feature, you should probably merge the repositories. At work, we have code spread among a bunch of repositories, and having to link to the 2/3 related PRs in other repos is a major PITA, and even more so for the reviewers.

My rule of thumb is: if you need to do PRs in several repositories to do one feature, your projects are either tightly coupled enough that they should be one monolithic piece of software, or your tight coupling is a problem you should work on resolving.

Requiring multiple PRs to multiple repos to roll out one user-facing feature is fine, as long as your independent modules/projects are not actually interdependent (i.e. one of those PRs will not break another independent repo that lacks a corresponding PR).

Sometimes a feature needs to change a shared dependency library.

But in that case you could consider the change to the dependency a single release. And ingesting it into another app a separate release.

At a past job, I had to edit roughly 5 different repositories in order to do some trivial programming task (send an email or some such). It was quite easily the least productive / most demoralizing workflow I've ever experienced.

Context switching really sucks. You should aim to reasonably avoid it

Sending an email can have a few different responsibilities:

Who is the email being sent to?

What is the content of the email?

What data does the email content and recipient depend on?

What are you tracking on the email?

How is the email visually formatted?

All those things might be in different apps as the logic gets more complicated.

Don't mix up downsides of multirepo and bad composition of your microservices

Just because things change in tandem, that does not mean that they're all the same thing. When I add a new function to my backend service, all frontends that consume its API also need to be adjusted. But that doesn't mean that the backend service, its command-line clients and its web GUI client should live in the same repo.

It's probably a matter of taste - but I think they should be in the same repo. I like tying test failures/regressions to a specific commit for documentation and admin purposes. Having a test fail or regression due to an 'unrelated' commit in another repo sounds like a nightmare waiting to happen when you try investigating.

I think the difference of opinion is between developers who work on self-hosted "evergreen" products where the latest version is deployed, and others who work with multiple release branches with fixes/features constantly being cherry-picked.

Why? You are just creating more work for yourself by keeping the components in different repos. Now you need to create N commits when updating something. If your future self wants to investigate how the software has evolved there are N times as many commits to analyze.

I really think it should if possible. Makes life much easier in my experience.

Not always. It absolutely makes sense to have a repository for the GUI and one for the server. When writing a new feature you usually write some GUI code and some server code and create different pull requests. I think monorepos are seriously wrong and I completely agree with this article.

Well... Why does that make sense? I have a repository containing both the GUI and the server, and sometimes I have to make changes to both. Locating those related changes together in the same commit and/or PR makes a lot of sense to me: the changes depend on each other, and thus should be reviewed together. What's the advantage of splitting them up?

Because obviously the changes that you make in the gui are completely isolated from the changes you make on the server. When you are working on the gui the server code is just noise and vice versa. And it gets even worse when you use two different languages for the gui and the server.

That only really matters if your backend developers are a different team to your frontend developers where they'd want to be working concurrently. And even then, they could work in different branches and both teams merge into a development branch when finished.

The idealistic discussions for or against monorepos often overlook the most important detail: who's working on the code and how would you want them to version control it?

If it's separate projects with their own versioning then it makes sense to have them as separate repositories. If it's a single project but with individual components you'd want to version (eg because it's developed by different teams with different release timelines) then there you also have a situation where you'd want to version the code separately so once again there is a strong argument for separate repositories. However if it's one product with a single release schedule then splitting up the frontend from the backend can often be a completely unnecessary step if you're doing it purely for arbitrary reasons such as the languages being different. (I mean Git certainly doesn't care about that. A project might have Bash scripts, systemd service files, Python bootstrapping, code for an AOT compiled language (eg Rust, Go, C++, etc), YAML for Concourse, etc. They're all just text files required for testing and compiling so you wouldn't split all of those into dozens of separate repos).

> That only really matters if your backend developers are a different team to your frontend developers

What if there is one team, but different developers (one working on the frontend, another on the back)? What if QA can test the API while the frontend development is ongoing?

What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines)?

I’m not sure what your point is. There’s obviously going to be thousands of different scenarios that I didn’t cover; it would be impossible to cover every imaginable use case.

> What if there is one team, but different developers (one working on the frontend, another on the back)?

Then presumably everyone in that team is full stack? (Otherwise it would be different teams in the same department.) So it still makes sense to have a monorepo, because you could have a situation (holiday, sickness) where someone would be working on both the front end and back end. Thankfully git is a distributed version control system and supports feature branches, so you can still have multiple people working on the same repo and then merge back into a development branch.

> What if QA can test the API while the frontend development is ongoing?

Testing isn’t the same as released versions. You can (and should) test code at all stages of development regardless of team structures, git repo structures, or release cycles.

> What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines).

I’d already covered that point when talking about different languages in the same repo. You’re making a distinction about something that version control doesn’t care in the slightest about.

I think it’s fair to say any significant cross-project tooling should be its own repo (you wouldn’t include the web browser or JVM with your frontend and backend repos). But if it’s just bootstrapping code that is used specifically by that project then of course you’d want that included. Eg you wouldn’t have Makefiles separate from C++ code. But you wouldn’t include GCC with it because that’s a separate project in itself.

Ultimately though, there is no right answer. It’s just what works best for the release schedule of a product and teams who have to work upon that project.

> Because obviously the changes that you make in the gui are completely isolated from the changes you make on the server.

In my experience, that is almost never the case. Often, the frontend requires a new endpoint or a modification to an existing endpoint. If you don't coordinate this change, you end up with a non-functional PR that cannot even be tested. Same happens when the backend proposes an endpoint change that affects the frontend.

We have moved the frontend and backend to the same repo to make coordination and testing of such cases simpler.

You make the endpoint first, and test it without the UI. What challenges do you foresee here?

* Changing graphql schemas.

* Any non-backwards compatible change in the interface between the components. Yes this can be solved. But when working in a smaller team on proprietary software, why spend time solving a problem you don't need to solve?

(This is from experience.)

> why use time solving a problem you don't need to solve?

Unless they're running on the same computer and deploy literally simultaneously, this is already a problem you need to solve.

A surprising number of companies are prepared to accept an hour of downtime for an internal system if it saves them money. In my experience the best business practice is to offer the product owner/manager the costed options in such a situation and allow them to choose.

This is true for systems where there is a well-defined protocol between GUI and servers and a proper versioning process in place, i.e. most "old-school" client/server systems.

I expect lots of people on HN are working on systems with very tight coupling between client/GUI and server and no proper versioning between them, as is common in web applications. Hence the replies to the contrary: you're probably from quite different worlds :)

(Now, I personally think that maintaining sound versioning practices is a good idea even if you do have tightly coupled control of both the client and the server side. But that may just be me...)

I think what matters in the end is Conway's law. Conway's law is frequently misinterpreted as an observation when it's actually advice: structure your applications/repos like you structure your teams. You're going to end up with that code structure anyway, so you might as well save some time.

Hmm, that's not really obvious to me. Sometimes the server has to deliver new data that is to be used in the GUI, so it's nice to be able to present those together in the same PR. If it then happens that the server-side changes do not match what you need in the GUI, it's relatively painless to add those changes in the same branch that hasn't yet been merged. In other words: although you can make changes in one without breaking the other, that doesn't make them completely isolated.

You should always have a communication layer between the gui and the server. For example using protobuf you would update the proto definition (that can be in a shared repo) and when building the gui and the server the protobuf layer is regenerated. So the only place where you make your changes for the new data contract is the shared repo and the gui and server would automatically have the new changes.
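As a sketch of that shared-contract workflow: one .proto file is the single source of truth, and each side regenerates its stubs from it. The file, package, and field names here are invented, and the protoc commands are shown as comments since they assume protoc is installed:

```shell
# A shared protobuf contract repo: both GUI and server builds regenerate
# their communication layer from this one definition.
mkdir -p contracts && cat > contracts/user.proto <<'EOF'
syntax = "proto3";
package acme.api;

message User {
  string id    = 1;
  string email = 2;   // new field added for the feature
}
EOF

# Each build then regenerates its own layer from the shared definition, e.g.:
#   protoc --java_out=server/src  contracts/user.proto   # server (Java)
#   protoc --csharp_out=gui/src   contracts/user.proto   # GUI (C#)
```

The point of the design is that the only hand-edited artifact for a contract change is the shared .proto file; the generated code on each side is a build output, not something to coordinate by hand.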

So now we're at three repos, one of which is shared by the other two, and changes will have to be coordinated across them. I fail to see how that is an improvement over having both in the same repo.

In the end, I think the other comments are right that it mostly depends on who's working on something. If it's different teams, then different repos probably make sense. But if I'm responsible for both the back-end and the front-end, they're usually not isolated at all, at least in terms of project requirements, and hence keeping them together makes sense.

(But of course, even then there are nuances. I think the article is mostly arguing against company-wide monorepos. I'm willing to believe Googlers that it works well for Google, and I'm not in a position to claim what it'd be like for other companies. Team-wide monorepos for different parts of the same project, however, make a lot of sense to me.)

> I'm willing to believe Googlers that it works well for Google

It doesn't. In my entire career, that was the only environment in which some random would break us and we couldn't do anything about it other than hope for a rollback and then wait for hours for the retest queue to clear before we could deploy anything at all.

Maybe not all the time, but you need the escape hatch of pinning healthy deps, because HEAD of everything is not guaranteed to work.

Well, I'm willing to believe you that it didn't work well for you as well. My point is that company-wide monorepos are largely irrelevant to my point, as I'm not arguing in favour of or against those (I'm leaving that to people who've worked with them).

It'll be really typical for a gui/server pair to want to share some is_valid_payload() function: the client to validate a payload before sending, and the server to do its own validation.

If it's a monorepo your PR might be a 2 line patch to that function, then adding the GUI and server code.

If you split it you'll first need a PR on the "validation-lib" repo, then once that gets in a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need to deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.

Better just to have them in a monorepo if they're logically the same code and want to share various components.

> If it's a monorepo your PR might be a 2 line patch to that function, then adding the GUI and server code.

> If you split it you'll first need to have a PR on the "validation-lib" repo, then once that gets in a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need do deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.

The above is exactly why I am so firmly opposed to multirepo[0]-first. And it's really just a throwaway example: a real change would involve multiple different library and executable repos, all having separate PRs. And then there's the relatively high risk of getting a circular incompatibility.

This can be worth the cost, for organisational reasons. But until you need it, don't do it. It's very easy to split a git repo into multiple repos, each retaining its history (using git filter-branch). Don't incur the pain until you need to, because honestly, you're not likely to need to. You're probably not going to grow to the size of Google. Heck, most of Google runs in one monorepo, with a few other repos on the side: if they can make it work at their scale, so can you. And if, as the odds are, you never grow to their size, then you'll never have wasted time engineering a successful multirepo system instead of delivering features to your business & customers.

0: 'polyrepo,' really? https://trends.google.com/trends/explore?date=all&q=multirep... clearly shows that 'multirepo' is the term.

These are two separate functions; why would you ever want a function that checks both GUI and server? The GUI validation logic belongs to the GUI layer, the server validation logic to the server layer. If you have a function that contains logic from both layers, there is something seriously wrong with your design.

The classic reason for any validation is that you want the validation to be done in the frontend (to save a network roundtrip and provide better, immediate feedback), on the backend (so that if the frontend is compromised and maliciously circumvents that validation, it still gets validated), and both of the validations to be the same to prevent inconsistencies.

A good way to fulfil those requirements is to have the exact same function available in both places.

If you have the same functionality that can't be re-used (for no reason), then I'd call that a design flaw.

I'll need a few more validation functions for each client. I don't want to write and maintain multiple functions that do the same thing, even if it's just copy+paste.

It's "data" validation. So let's put that in the "data layer" repo.

We now have, at least:

- Server

- Web (GUI)

- Android

- iOS

- Data

- More clients?

We'll also have branches for each development task. How do we know what branch the other branches should use? One "simple" feature can easily spread over multiple repos. Does each repo refer to the repo+branch it depends upon (don't forget to update the references when we merge!), or we add a "build" repo which acts as the orchestrator?

Most PRs will need to be daisy-chained - who reviews each one? Will they get committed at the same time?

How do we make the builds reproducible? commit hashes? tags? ok, we now need to tag each repo, and update the references to point to that tag/hash... but that changes the build.

Well, I'm glad our code base is split over multiple repos because "scalability".

Imagine something like "curl" where a client needs to validate a manually provided request before making it.

In any case, if you're nitpicking that example you're missing the point. The same goes for any number of other pieces of shared code you could imagine between a client and server that logically make up one program talking over a network.

I still can’t see how you would have a shared library for a C# gui and a Java server for example. Your communication layer would obviously live in both repositories. Even in case you are using the same language and you do have shared libraries then what is the problem? The shared libraries would surely be shared with other projects so it makes sense to have them in a separate repository.

In cases where there's a high degree of churn (i.e. early-stage startups) in shared libraries, updating those libraries can cause a large amount of busywork and ceremony.

If you had a `foo()` function shared between the GUI and the server (or two services on your backend, or whatever), in a monorepo your workflow is:

   - Update foo()
   - Merge to master
   - Deploy
In a polyrepo where foo() is defined in a versioned, shared library your workflow is now:

   - Update foo()
   - Merge to shared library master
   - Publish shared library
   - Rev the version of shared library on the client
   - Merge to master
   - Deploy client
   - Rev the version of shared library on the server
   - Merge to master
   - Deploy server
This problem gets even more compounded when your dependencies start to get more than one level deep.

I recently dealt with an incredibly minor bug (1 quick code change), that still required 14 separate PRs to get fully out to production in order to cover all of our dependencies. That's a lot of busywork to contend with.

It seems to me that the real problem is your toolchain. In a previous project the workflow was like this:

   - Update foo()
   - Merge to master
   - Publish shared library
   - Deploy

So as you can see, the only step added was publishing the shared library, which would automatically update the version in all the projects using it. If you are really doing everything manually I can understand that this is a pain, but it has nothing to do with the monorepo / multiple repo distinction; this is a tooling problem.

But you've just invented a sharded monorepo, and now have all the monorepo problems without the solutions.

What if updating foo() breaks something in one of the clients (say due to reliance on something not specified). Then you didn't catch that issue by running client's tests, now client is broken, and they don't necessarily know why. They know the most recent version of shared broke them, but then they have to say "you broke me" or now one of the teams needs to investigate and possibly needs to bisect across all changes in the version bump under their tests to find the breakage.

How is that handled?

(the broader point here is that monorepo or multirepo is an interface, not an implementation; it's all a tooling problem. There are features you want your repo to support. Do you invest that tooling in scaling a single repo or in coordinating multiple ones? Maybe I should write that blog post).

Some package managers that support git repos as dependency versions can offset this in development.

>It makes absolutely sense to have a repository for the gui and one for the server.

Not really. You can have a single repo with top level directories tigershark-gui and tigershark-server.

What is the point instead of having them in two separate repos?

Any full stack change will be represented by one PR that changes from pre change to post change. Two repos would introduce a new possible state where one has the change applied and the other doesn't.

And later you add the iOS and Android clients too. Will those go into the same repo? Better to keep server and clients apart, especially if release schedules are different.

Sure if the release schedules are different then have them in separate repos so things like tagging makes sense. But often people work with a single release schedule. There's just so many variables that go into these decisions that the thread here is bonkers.

Smart people can work through problems to get the job done. Monorepo vs polyrepo won't stop people from moving forward.

> As a final observation, you can split big repositories into smaller ones quite easily (in Git anyway) but sticking small repositories together into a bigger one is a lot harder. So start out with a monorepo and only split smaller repositories out when it's clear that it really makes sense.

If you only need to do this once, subtree will do the job, even retaining all your history if you want.

I'm not sure what the easier way to split big repos is.

To split, you can duplicate the repo and pull trees out of each dupe in normal commits.
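Both directions can be sketched with `git subtree` (assuming it is installed alongside git, as it usually is; the directory names here are invented):

```shell
# A monorepo with two components, to split and re-merge with git subtree.
mkdir subtree-demo && cd subtree-demo

git init -q -b master mono
cd mono
git config user.email demo@example.com
git config user.name demo
mkdir gui server
echo "gui code" > gui/app.txt
echo "server code" > server/main.txt
git add . && git commit -qm "initial layout"

# split: extract gui/ with its history onto its own branch,
# then pull that branch into a brand-new standalone repo
git subtree split --prefix=gui -b gui-only
git init -q -b master ../gui-repo
git -C ../gui-repo pull -q ../mono gui-only

# merge: graft that standalone repo back into the monorepo
# under tools/gui, history and all
git subtree add --prefix=tools/gui ../gui-repo master
```

The split direction is a one-shot operation per directory, which matches the "only need to do this once" caveat above; repeated back-and-forth syncing is where subtree gets painful.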

In principle: Yes.

In practice, I can tell you from first-hand experience that this isn't all that simple in bigger, organically grown cases (you'll have many other things to consider if you want to keep the history in a useful way). Especially the broken branching model of SVN and co. is a problem here: in the wild, it immediately leads to "copy&paste branching" (usually through multiple commits). Migrating that to Git or Hg and splitting it up can be a challenge.

I haven't tried in Git, but with Mercurial merging repos is as simple as pulling from an unrelated repository and merging, that's it. It's a lot simpler than splitting a repo up unless you accept that all of the old history can remain, then you just make a clone and delete what should no longer be a part of the repository.
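For what it's worth, Git can do essentially the same thing, though you have to opt in explicitly with `--allow-unrelated-histories`. A sketch with two throwaway repos (all names invented):

```shell
# Merge two repos with completely unrelated histories into one.
mkdir merge-demo && cd merge-demo

git init -q -b master repo-a
git -C repo-a -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m "history of A"

git init -q -b master repo-b
git -C repo-b -c user.email=b@example.com -c user.name=b \
    commit -q --allow-empty -m "history of B"

# pull repo-b into repo-a and merge the two unrelated histories
cd repo-a
git fetch -q ../repo-b master
git -c user.email=a@example.com -c user.name=a \
    merge --allow-unrelated-histories -m "merge repo-b" FETCH_HEAD
```

In a real merge of two projects you would first move each repo's files into a subdirectory (in a commit on its own branch) so the trees don't collide when merged.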

But a monorepo leads to tight coupling, and that is just as much a pain to work with as versioning; or two teams end up simultaneously working on the same shared code, and you have not only merge conflicts but conflicting functionality.

So why is that? Why do we need to couple software development efforts to releases? Based on my experience there is no difference between the monorepo and multirepo approaches from the deployment point of view.

After trying to get the best of both with Subversion Externals and Git Submodules, I'd have to agree. At least until things are so loosely coupled they're begging for a public release.

That said, some packaging solutions can bridge the gap reasonably well. Unless you need instantaneous, atomic releases.

I switched to using submodules about a year ago, and they work very well for a project + a set of 4 dependencies. I handle that zoo from VS Code + Git Lens plugin.

Funnily, I only use Code to handle commits to submodules, because Git Lens is not available for the full VS IDE.

What are you talking about! In my perfect micro services world I just have these enforced bounded contexts that are so perfectly designed they never need to change. Consequently all parts of the system are perfectly independent snowflakes that can be deployed without thinking about any other parts of the system. It’s beautiful really when you think about the mess that things were before we could do this!

I generate Coq proofs of Swagger descriptions that were compiled from a speech to text dump during a 10 person Hangout. Downside is that some of the protobufs aren't laid out as cleanly as one would like.

While I know you are being sarcastic, I really have heard bushy-tailed young “architects” say something similar who had just read about Domain-Driven Design and then decided to “educate us”.

Oh I worked on a project like this, which still hasn’t launched any software yet 5 months after I left...

I can think of situations where components 'need' to release together because of organizational rules and not any actual binding between the components, in that case of course they do not need to be in the same repository.

I agree that you should always start with one repo and split as needed, it's the MVR way (minimum viable repository)

My problem with polyrepos is that often organizations end up splitting things too finely, and now I'm unable to make a single commit to introduce a feature because my changes have to live across several repositories. Which makes code review more annoying because you have to tab back and forth to see all the context. It's doubly frustrating when I'm (or my team is) the only people working on those repositories, because now it doesn't feel like it gained any advantages. I know the author addresses this, but I can't imagine projects are typically at the scale they're describing. Certainly it's not my experience.

Also I definitely miss the ability to make changes to fundamental (internal) libraries used by every project. It's too much hassle to track down all the uses of a particular function, so I end up putting that change elsewhere, which means someone else will do it a little different in their corner of the world, which utterly confuses the first person who's unlucky enough to work in both code bases (at the same time, or after moving teams).

My current team managed to break a single "component" out into a separate repository. Then that repository broke into two, then those broke into more, until eventually we had around 10 or so different repositories that we work on every day.

An average change touches 4 of them, and touching one of them triggers releases of 2 or 3 others on average. Even building these locally is super tedious, because we don't have any automation in place (and no formal plans for any) for chain-building them locally.

This is a nightmare scenario for me. A simple change can require 4 pull requests and reviews, half a day to test, and a couple of hours to release.

Yet my team keeps identifying small pieces that can be conceptually separated from the rest of the functionality, even if they are heavily coupled, and makes new repos for these!

I’ve come to the conclusion that an organisation should ideally have no more than one primary repo, with maybe a handful of ancillary repos for stuff that really doesn’t make sense in the primary. What does ‘organisation’ mean there? Well, it could mean a company, or a team, or a division. Just as software conforms to organisational structure (Conway’s Law), so too should repo structure.

Once you start having lots of peer repos being worked on within the same organisation on a daily basis, you know that you’ve partitioned far too far, and you need to roll back.

Otherwise one ends up in exactly the position you’re in. The ultimate slippery-slope end-state would be hilariously bad: a repo for each ASCII character, with repos for each word or symbol constructed out of those characters, with repos for each function constructed out of those words & symbols, with repos for each module constructed out of those functions, with repos for each system constructed out of those modules, with any change requiring a massive, intricate, failure-prone dance in order to update anything, all while patting oneself on the back about how one has avoided complexity.

Noöne sane would argue for that situation, and yet I’ve seen smart people argue that requiring coördinated changes to half a dozen repos is fine & dandy.

even if they are heavily coupled,

So don't use polyrepos for heavily coupled projects, then. Or even better...

... try to avoid heavy coupling in the first place.

Unfortunately, these debates tend to be of the bikeshed variety.

Q: Why are we debating the merits of mono-repos over poly-repos?

A: Because managing dependencies is really hard and needs expertise.

It's an interesting social problem in how you manage those project / library / repository boundaries. On the flipside, though, it's been well documented that among many of the major monorepos those boundaries still exist, they just become far more opaque because no one has to track them. You find the weird gatekeepers in the dark that spring out only when you get late in your code review process because you touched "their" file and they got an automated notice from a hidden rules engine in your CI process you didn't even realize existed.

In the polyrepo case those boundaries have to be made explicit (otherwise no one gets anything done) and those owners should be easily visible. You may not like the friction they sometimes bring to the table, but at least it won't be a surprise.

http://wiki.c2.com/?ConwaysLaw "Conway's Law" is something like "organisation of code will match the organisation of people". It's a neat description.

I think it's more common to merge or split modules and classes than repositories. I wonder if there'd be less tension if repos and teams were 1:1 though.

> I wonder if there'd be less tension if repos and teams were 1:1 though

Anecdotally, yes I think it helps a lot. I was once part of an organization for which each "team" having a repo is the only thing that prevented violence :-)

I've also seen arbitrary separations of repos because 2 people didn't get along and couldn't work together.

Can a monorepo support module- or subdirectory-level ownership controls? Or do teams using a monorepo just do without them?

Partially answering my own question: SVN, recommended in a prior comment [0], supports path-based authorization [1]. But what about teams using another version control system?

[0] https://news.ycombinator.com/item?id=18810313

[1] http://svnbook.red-bean.com/en/1.5/svn-book.html#svn.serverc...
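For anyone curious what that looks like in practice, here's a minimal SVN authz file. The syntax (`[groups]`, per-path sections, `@group` references) is standard; the team names and paths are made up:

```ini
# conf/authz -- SVN path-based authorization (hypothetical teams/paths)
[groups]
payments = alice, bob
platform = carol, dave

# Everyone can read the whole repository...
[/]
* = r

# ...but only the owning team can write to its subtree.
[/services/payments]
@payments = rw

[/libs/core]
@platform = rw
```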

In Google, this is pretty explicit with a plaintext OWNERS file in the directory. Internal IDEs not only have an understanding of that, but can automatically suggest a minimal set of reviewers in a possibly close time zone and not out of office.

Piper, Google’s implementation of monorepo, has that, and it is very important and widely used.

With Phabricator, yes, you can setup herald rules that stops a merge from happening if a file has changed in a specific subdir.

We use service owners, so when a change spans multiple services, they are all added automatically as blocking reviewers.

For my clients, I use an open source monorepo submodule inside the client's proprietary monorepo. I can maintain the organizations' software while sharing common code.

So (mono)repos are composable.

If I remember correctly, Gitlab has introduced some sort of ownership control where you can say who owns what directories for things like approving merge requests that affect those directories.

Hey, did you mean of assigning approvers based on code owners [1]? You can find more info about Code Owners and syntax at the documentation [2].

[1] https://gitlab.com/gitlab-org/gitlab-ee/issues/1012 [2] https://docs.gitlab.com/ee/user/project/code_owners.html

Github has this feature[1], which we use extensively in our monorepo.

[1] https://help.github.com/articles/about-codeowners/
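For reference, a CODEOWNERS file gives you that directory-level ownership with one pattern per line, and the last matching pattern takes precedence. The teams and paths below are hypothetical:

```
# .github/CODEOWNERS -- last matching pattern wins
*                    @acme/platform-team
/services/billing/   @acme/billing-team
/libs/auth/          @acme/security-team @alice
*.tf                 @acme/infra-team
```

With this in place, a PR touching `/libs/auth/` automatically requests review from the security team rather than the default platform owners.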

google3 uses the same model that Chromium does, see an example here: https://github.com/chromium/chromium/blob/master/chromeos/OW...

SVN allows you to create multiple repos within a repo. (That's probably why the path-based auth works.)

Git has the idea of sub-modules, but they're really just filters. (They're in the same repo). So ultimately, you don't have that kind of control.

Git submodules are not in the same repo, they are a link from one repo to another, and you need to push to both if you make a change to the submodule. Maybe you're thinking of subtrees? I've never used those.

OctoLinker really helps when browsing a polyrepo on Github:


You can just click the import [project] name and it will switch to the repo.

It's very much possible to make changes to internal libraries used all over the place, but it does require versioning to be something that people think about, and a mechanism for depending on those libraries other than pulling them straight from source control. Once you've got some sort of dependency management, such as an internal gem/npm/whatever source, you can treat internal dependencies the same way you treat external ones, instead of having to somehow coordinate a release of absolutely everything in one go.
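Concretely, with npm this can be as small as routing a scope to an internal registry (the scope name and URL here are made up for illustration):

```ini
# .npmrc -- send the @acme scope to the internal registry
@acme:registry=https://npm.internal.acme.example
```

After that, a consuming service just declares `"@acme/feature-flags": "^2.1.0"` in its package.json like any other semver-ranged dependency, and library releases decouple from consumer releases.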

That's not really that different in a monorepo since you often need reviews from the same number of people anyway.

I once had to wait 9 months to get a complex change through in a monorepo setting because of all the people involved, the amount of stuff it touched, and the fact that everything was constantly in flux, so I spent half my time tracking changes. I'm not saying it would have been faster in a polyrepo. I'm saying that complex changes are complex regardless of how the source is organized.

I do however think that polyrepos force you to be more disciplined, and that in a monorepo it is easier to slip up and turn a blind eye to tight coupling.

Multi-repository code review is an interesting concept. Here at RhodeCode we're actually working on implementing such a solution, first of all to solve our own problem: our release code review usually spans two projects at once.

This is a hard and complex problem, especially keeping the code review from getting too messy when you target 5-8 repos at once.

I think this article is complete horseshit. A monorepo will serve you 99% of the time until you hit a certain level of scale when you get to worry about whether a monorepo or a polyrepo is actually material. Most cases are never going to get there. Before that point, a polyrepo is purely a distraction and makes synchronous deployment really painful. We had to migrate a polyrepo to a monorepo and it was not fun because it was a migration that should have never had to be done in the first place. Articles like this are fundamentally irresponsible.

I work on CI/CD systems, and that’s one thing that definitely gets harder in a monorepo.

So you made a commit. What artifacts change as a result? What do you need to rebuild, retest, and redeploy? It doesn’t take a large amount of scale to make rebuilding and retesting everything impossible. In a poly repo world, the repository is generally the unit of building and deployment. In monorepo it gets more messy.

For instance, one perceived benefit of a monorepo is it removes the need for explicit versioning between libraries and the code that uses them, since they’re all versioned together.

But now, if someone changes the library, you need a way to find all of its usages and retest them to make sure the change didn't break anything. So there's a dependency tree of components somewhere that needs to be established, but now it's not explicit, and no one is given the option to pin to a particular version if they can't/won't update. This is the world of Google, and it influenced the (lack of) dependency management in Go.

You could very well publish everything independently, using semver, and put build descriptors inside each project subdirectory, but then, congratulations, you just invented the polyrepo, or an approximation thereof.

> So you made a commit. What artifacts change as a result? What do you need to rebuild, retest, and redeploy?

If you're using Git, then typically for each push to the remote repository you get a notification with this data in it:

  BRANCH        # the remote branch getting updated
  OLD_COMMIT    # the commit the branch ref was pointing to before the push
  NEW_COMMIT    # the commit the branch ref was pointing to after the push

  # To get the list of files that changed in the push:
  git diff --name-only "$OLD_COMMIT" "$NEW_COMMIT"
Once you know which files changed in a push you can figure out which artifacts you need to build. Right now you'll have to write that tooling yourself since I don't know of any off-the-shelf tools that do it. In my company's case, we have "project.yml" files scattered through the repo telling us which directories have buildable artifacts and what branches each one needs to be built for. The tooling to support this is a few hundred lines of Bash and Python. In our case we're still small enough that we can brute force some stuff, but we can easily improve the tooling as we go along.
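As a sketch of that "figure out which artifacts" step (the project.yml layout and this function are hypothetical, not the poster's actual tooling), you can map each path from `git diff --name-only` to the deepest enclosing directory that carries a project.yml:

```python
def affected_projects(changed_files, project_roots):
    """Map changed files to the projects that must be rebuilt.

    project_roots: repo-relative directories that contain a project.yml
    (assumed layout). A file belongs to the deepest root that is a
    prefix of its path; files outside every root are ignored.
    """
    affected = set()
    for path in changed_files:
        best = None
        for root in project_roots:
            prefix = root.rstrip("/") + "/"
            if path.startswith(prefix) and (best is None or len(root) > len(best)):
                best = root
        if best is not None:
            affected.add(best)
    return sorted(affected)
```

In practice you'd discover `project_roots` by globbing the repo for project.yml files, then hand the result to whatever kicks off builds.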

This is something I've been working on a bit myself.

Figuring out which files changed is relatively easy (as you've demonstrated). Figuring out the impact of that is quite hard in non-compiled languages (tools like Maven, Buck, Bazel, etc. do this well for compiled languages). I.e. in a repo which is primarily JavaScript, I can get the list of changed files, and hopefully have unit test files that map one-to-one onto those. However, knowing whether these are depended on by other files/modules (at some depth) is much harder. Same for integration tests: which of those are related?

I believe the typical approach is to have each project.yml list its dependency projects, build a DAG (erroring on cycles), and then build all changed and downstream projects.
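A minimal sketch of that approach (the data shapes are assumed, not taken from any particular tool): invert the dependency edges, fail fast on cycles, then walk downstream from whatever changed:

```python
from collections import defaultdict, deque

def projects_to_build(changed, deps):
    """deps maps each project to the projects it depends on.

    Returns the changed projects plus every transitive dependent,
    raising ValueError on dependency cycles.
    """
    # Invert the edges: for each dependency, who depends on it?
    dependents = defaultdict(set)
    nodes = set(deps)
    for proj, its_deps in deps.items():
        for d in its_deps:
            nodes.add(d)
            dependents[d].add(proj)

    # Cycle check via Kahn's algorithm: if we can't topologically
    # order every node, there's a cycle somewhere.
    indegree = {n: len(deps.get(n, [])) for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = 0
    while queue:
        n = queue.popleft()
        ordered += 1
        for dep in dependents[n]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)
    if ordered != len(nodes):
        raise ValueError("dependency cycle detected")

    # Walk downstream from everything that changed.
    to_build = set(changed)
    frontier = deque(changed)
    while frontier:
        for dep in dependents[frontier.popleft()]:
            if dep not in to_build:
                to_build.add(dep)
                frontier.append(dep)
    return sorted(to_build)
```

So a change to a base library rebuilds it plus everything that (transitively) depends on it, and nothing else.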

Rebuild and deploy everything, what's the actual problem? Like the OP said, that's a scale issue and most projects don't have it.

Also building/testing is far more effective at finding dependencies than just going by repo structure. There are numerous package managers available to solve versioning if you need separate components.

100% agree with your entire comment. This is what we do with our monorepo now -- it turns out the rebuilding and deploying everything is actually just fine. If your application services are stateless and decoupled from your state stores, it's completely harmless. If you need to do something fancy, congrats! You're at scale -- enjoy it but remember that it's something rare.

Yes! This brings to mind Donald Knuth: "Premature optimization is the root of all evil."

One thing I heavily enjoy about monorepos (I'm talking Java/C#/C++ projects) is the ability to navigate the entire codebase from within an IDE. That alone has caused me to migrate projects (medium projects, ~20 developers) from poly to mono repos, dropping tons of duplication in the build system in the process. I can think of good reasons to split projects along boundaries when it makes sense, but not blindly by default, and not without carefully considering the tradeoffs.

In the Java world this gets solved with Gradle's incremental build system, which uses a build cache, a user-configured dependency tree, and some hashing to determine what needs to build.
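For anyone wanting to try it: up-to-date checking (skipping tasks whose inputs and outputs haven't changed) is on by default in Gradle, while the build cache, which reuses task outputs across builds and branches, is an explicit opt-in:

```ini
# gradle.properties -- reuse task outputs across builds
org.gradle.caching=true
```

With a declared module graph, `./gradlew build` then only rebuilds the modules whose inputs actually changed.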

bazel/buck/pants all solve this, but independently of that, they're probably the best build systems.

I found it to be neither horseshit nor irresponsible. A bit overdrawn and skewed in some of its arguments, perhaps. But then again... so was your critique. For example:

We had to migrate a polyrepo to a monorepo and it was not fun because it was a migration that should have never had to be done in the first place

s/polyrepo/monorepo/ in the above and you have an assertion of about equal plausibility and weight.

No, it is horseshit. 99% of companies will never hit big-company VCS scaling issues, and once they do, they're on their own. To characterize that scale as common is one of the most embarrassing failures of modern software engineering. People are too embarrassed to use well-worn tooling and to accept that large scale is both uncommon and not something that invalidates tried-and-true patterns at smaller scales. It's utterly baffling to me.

> It's utterly baffling to me.

It's not hard to explain: Scale has been fetishized by the industry/trade. Everyone wants the cachet of working at scale. 1.5 GB of CSV text? That's Big Data, let's break out map-reduce. 1 load balancer and not enough servers to fill half a rack? That's a scalable architecture, we could scale to multiple datacenters at some point in the future, so let's design it now.

Deploying oversized solutions is partly due to outsiders jonesing for the scale of Google, Fb and gang, partly resume-stuffing ("I have worked with this tech before"), and lastly FANG diasporans who miss the tech they used and rewrite systems/evangelize the effectiveness of those solutions to much smaller organizations.

To be fair, part of the problem is that each of us has been bitten throughout his career by issues which could have been prevented by being able to predict the future. We then move from the truth that, had we known the future, we could have acted better yesterday, to the fallacy that today we finally know what we're going to need tomorrow.

This isn't isolated to our industry, of course: a constant refrain is that generals & admirals fight the last war; the financial industry is rife with products which are secure against the last recession, and so forth.

"To characterize that scale as common is one of the most embarrassing failures of modern software engineering."

This point cannot be stressed enough. Almost all the worst software engineering failures I have seen have been caused by premature scaling - which is way worse than premature optimization because the latter's effects are usually local. But premature scaling causes architectural decisions that affect the whole project and simply cannot be undone.

One example among many: some influential engineers insisted that we needed four application servers with fail-over because they had experienced servers crashing under heavy load. This complicated failover setup took a huge amount of time and resources to set up, delaying the project by months. In the end the site only attracted a few hundred visitors per day and was cancelled in under a year.

This complicated failover setup took huge amount of time and resources to setup, delaying the project by months.

Hmm - failover shouldn't be that hard to set up. If it was, that suggests that other issues (technical debt, inexperienced management) were the more likely culprits.

Not the simple fact that they chose to address the need for failover at all.

> [it] shouldn't be that hard ...

Now where have I heard those words before... :)

> 99% of companies will never hit big company VCS scaling issues

A much higher percentage of developers will. Number of companies is not a good metric for whether a topic is worthy of discussion.

Number of companies is a good metric, because companies own the repos, and if it becomes a pain point, only the developers working there at that point in time will be hit by it. Anyone who leaves before this inflection point or joins after it's been solved won't be hit, so I don't think the percentage of developers in that intersection is large.

> after it's been solved

I think a quick perusal of this page will show that it's not really "solved" after all. A far higher percentage of developers continue to be affected by large-repo issues than a Python-specific issue (currently #1 story on the front page) or anything to do with Ethereum (currently #7). Are those "horseshit" topics too?

I agree, it's not really solved, but solved "enough". You can't have your cake and eat it; there are tradeoffs involved. If you grow large enough to hit monorepo limitations, you are large enough to invest in tooling that manages your workflow (the tradeoff). However, if you're a small organization, you can't afford that tooling, and you're wasting time/quality coordinating polyrepo releases, so you are better off with a monorepo.

> A far higher percentage of developers continue to be affected by large-repo issues than...

Are you suggesting that the results of the HN ranking algorithm at this very moment in time is a good metric of measuring what affects developers? I don't agree, and besides @yowlingcat's opinion that the article is "horseshit" is unrelated to how well its ranked on HN.

> opinion that the article is "horseshit" is unrelated to how well its ranked on HN.

When the opinion is not just disagreement but outright dismissal of the topic as worth discussing, I'd say ranking is relevant. So is comment count. Clearly a lot of people do believe it's worth discussion, not irrelevant or a foregone conclusion as yowlingcat tried to imply.

A lot of people can think a lot of things are worth discussion, but it doesn't mean it's prudent to waste time on it.

Incidentally, I also think those are horseshit topics as well (Coconut is someone trying to daydream Python into Haskell with no practical reasons to do so and making Ethereum scale better doesn't make a legitimate use case for it emerge) but that's besides the point.

What you call large-repo issues I call organization issues. From your other comments, it's clear that we draw the lines at different places, but I think I'm right and you're wrong in this case, because I've seen engineers try to solve organizational issues with technology often enough to consider it an anti-pattern. Why don't we take your own words at face value?

"That hasn't been my experience. Yes, it's a culture thing rather than a technology thing, but with a monorepo the "core" or "foundation" or "developer experience" teams tend to act like they're the owners of all the code and everyone else is just visiting. With multiple repos that's reversed. Each repo has its owner, and the broad-mandate teams are at least aware of their visitor status. That cultural difference has practical consequences, which IMO favor separate repos. The busybodies and style pedants can go jump in a lava lake."

Why are there busybodies and style pedants working in your organization? Because your organization has an issue. Do you think that would be at the root of this pain, or a tool choice? I'll give you a hint, it's not the tool choice.

> Why are there busybodies and style pedants working in your organization?

Because to an extent they serve a useful purpose. In a truly large development organization - thousands of developers working on many millions of lines of code - fragmentation across languages, libraries, tools, and versions of everything does start to become a real problem with real costs. You do need someone to weed the garden, to work toward counteracting that natural proliferation. That improves reuse, economies of scale, smoothness of interactions between teams, ease of people moving between teams, etc. It's a good thing. Unfortunately...

(1) That role tends to attract the very worst kind of "I always know better than you" pedants and scolds. Hi, JM and YF!

(2) Once that team reaches critical mass, they forget that the dog (everyone else) is supposed to wag the tail (them) instead of the other way around.

At this point, Team Busybody starts to take over and treat all code as their own. Their role naturally gives them an outsize say in things like repository structures, and they use that to make decisions that benefit them even if they're at others' and the company's expense. Like monorepos. It's convenient for them, and so it happens, but that doesn't mean it's really a good idea.

Sure, it's a culture issue. So are the factors that lead to the failure of communism. But they're culture issues that are tied to human nature and that inevitably appear at scale. I know it's hard for people who have never worked at that scale to appreciate that inevitability, but that doesn't make it less real or less worth counteracting. One of the ways we do that is by putting structural barriers in the corporate politicians' way, to maintain developers' autonomy against constant encroachment. The only horseshit here is the belief that someone who rode a horse once knows how to command a cavalry regiment.

You realize many, if not most, people reading this work at places already big enough to have "VCS scaling issues". I've seen more than a few monorepos, but I've never seen one used as anything but a collection of small repos.

No, it is horseshit [because scale]

The thing is, scale was only one factor listed among many.

Was it? Once the scale problem is gone -- you assume that all code can be checked out on one machine, and you have enough build-farm capacity to build all the code -- most of the article's points no longer apply.

The downsides which still apply are Upside 3.3 (you don't deploy everything at once) and Downside 1 (code ownership and open source is harder).

And those are pretty weak arguments -- I would argue that deployment problems exist with polyrepos as well, and there are now various OWNERS mechanisms.

The fact that monorepos are harder to open source is a good point, but having to maintain multiple separate repos just in case we might want to open-source something one day seems like severe premature optimization.

In my experience, monorepos cause outrageous problems that have nothing to do with scale. Small or medium monorepos are equally as terrifying.

It’s much more about coupling and engendering reliance on pre-existing CI constraints, pipeline constraints, etc. If you work in a monorepo set up to assume a certain model of CI and delivery, but you need to innovate a new project that requires a totally different way to approach it, the monorepo kills you.

Another unappreciated problem of monorepos is how they engender monopolicies as well, and humans whose jobs become valuable because of their strict adherence to the single accepted way of doing anything will, naturally, become irrationally resistant to changes that could possibly undermine that.

It’s a snowball effect, and often the veteran engineers who have survived the scars of the monorepo for a while will be its biggest cheerleaders, like some type of Stockholm syndrome, continually misleading management by telling them the monorepo can always keep growing by accretion and will be fine and keep solving every problem, up to the point that it starts breaking in colossal failures and people sit around confused about why some startup is eating their lunch and capable of much faster innovation cycles.

Oddly enough, you could s/mono/multi in your post and that would exactly align with my own experience. I'm not kidding: everything from engendering reliance on weird homegrown tooling, CI & build pipelines to the pain of trying to break out to a different approach, to enforced bad practices, to developers (unknowingly) misleading management, to colossal failures.

I've worked on teams with monorepos and teams with multiple repos, and so far my experience has been that monorepo development has been better — so much so that I feel (but do not believe) that advocating multiple repositories is professional malpractice.

Why don't I believe that? Because I know that the world is a big place, and that I've only worked at a few places out of the many that exist, and my experience only reflects my experience. So I don't really believe that multiple repositories are malpractice: my emotions no doubt mislead me here.

I suspect that what you & I have seen is not actually dependent on number of repositories, but rather due to some other factor, perhaps team leadership.

Everyone always says this type of response about everything though. If you like X, you'll say, "In my experience you can s/X/Y/ and all the criticisms of X are even more damning criticisms of Y!"

All I can say is I’ve had radically the opposite experience across many jobs. All the places that used monorepos had horrible cultures, constant CI / CD fire drills and inability to innovate, to such severe degrees that it caused serious business failures.

Companies with polyrepos did not have magical solutions to every problem, they just did not have to deal with whole classes of problems tied to monorepos, particularly on the side of stalled innovation and central IT dictatorships. Meanwhile, polyrepos did not introduce any serious different classes of problems that a monorepo would have solved more easily.

Absolutely amazing to me how much engineers conflate organizational issues with tooling issues. Let's take a look at one of your comments:

"The last point is not trivial. Lots of people glibly assume you can create monorepo solutions where arbitrary new projects inside the monorepo can be free to use whatever resource provisioning strategy or language or tooling or whatever, but in reality this not true, both because there is implicit bias to rely on the existing tooling (even if it’s not right for the job) and monorepos beget monopolicies where experimentation that violates some monorepo decision can be wholly prevented due to political blockers in the name of the monorepo.

One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.

The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.

We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).

But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests."

What do you think is the cause of your woes, the monorepo, or the disagreement between your colleague in central IT tooling who disagreed with you? Where was your manager in this situation? Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline? Was that a discussion that could not healthily occur? Perhaps because your organization was siloed and so teams compete with each other rather than cooperate? Perhaps because it's undermanaged anarchy masquerading as a meritocracy? Stop me if this sounds too familiar.

I've been there before. I know what it feels like. But, I also know what the root cause is.

Nobody is conflating anything. Culture / sociological issues that happen to frequently co-occur with technology X are valid criticisms of technology X and reasons to avoid it.

To argue otherwise, and draw attention away from the real source of the policy problems (that the monorepo enables the problems) is a bigger problem. It’s definitely some variant of a No True Scotsman fallacy: “no _real_ monorepo implementation would have problems like A, B, C...”.

The practical matter is that where monorepos exist, monopolicies and draconian limitations soon follow. It’s not due to some first principles philosophical property of monorepos vs polyrepos — who cares! — but it’s still just the pragmatic result.

Also you mention,

> “Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline.”

but this was explicitly part of the product roadmap, where my team submitted budgets for the GPU machines, we used known latency and throughput specs both from internal traffic data and other reference implementations of similar live ML models. Budgeting and planning to know that it was cost effective to run on GPU nodes was done way in advance.

The people responsible for killing the project actually did not raise any concern about the cost at all (and in fact they did not have enough expertise in the area of deploying neural network models to be able to say anything about the relative merit of our design or deployment plan).

Instead the decision was purely a policy decision: the code in the monorepo that was used for serving compute tasks just as a matter of policy was not allowed to change to accommodate new ways of doing things. The manager of that team compared it with having language limitations in a monorepo. In his mind, “wanting to deploy using custom Docker containers” was like saying “I don’t want to use a supported language for my next project.”

This type of innovation-killing monopolicy is very unique to monorepos.

Hear, hear, yowlingcat. The article is way too prescriptive and, agreed, borders on irresponsible. Monorepo vs polyrepo is way too broad a subject to create generalized stereotypes like this. These opinions sadly get taken as facts by impressionable managers, new developers, etc., and have cascading effects on the rest of us in the industry. Use what makes sense for the project environment and team; don't just throw shade at teams who are successfully and productively using monorepos where they make sense. Sure, there is good reason to split things up on boundaries sometimes (breaking out libraries, RPC modules, splitting along dev-team boundaries, etc.), but not blindly by default. Will Torvalds split up the kernel into a polyrepo after reading this article? Something tells me that would be a bit disruptive.

It's interesting that you talk about "teams using monorepos". I think that's different from what the article is arguing against, which is an entire company (100+ devs) using a monorepo.

A team with 5 services and a web front-end in a single repo is doable with regular git. It's a different beast I think.

Thanks softawre. What triggered me is the sensationalist title and general bashing of monorepos (which a large percentage of impressionable readers will take away from this article, i.e. that monorepos are only for dummies and you're doing it wrong if you're not using a polyrepo). A less inflammatory title would be something along the lines of "Having trouble scaling development of a single codebase amongst hundreds of developers? Consider a polyrepo". This argument comes up in developer shops almost as much as emacs vs. vi, tabs vs. spaces, etc.

When you have 100+ developers on a project, managing inbound commits/merges/etc will become tedious if they're all committing/merging into one effective codebase.

IMHO, It depends on the project, the team makeup, the codebase's runtime footprint, etc whether or not/or when it makes sense to start breaking it up into smaller fragments, or on the other hand, vacuuming up the fragments into a monorepo.

I did enjoy reading the comment from Steve Fink of Mozilla (it's the top response on the OP's Medium article) and his counterarguments about monorepos vs. polyrepos in that ecosystem (also clearly north of 100 developers). It's easy to miss if you don't expand the Medium comment section, but very much worth reading.

> A monorepo will serve you 99% of the time until you hit a certain level of scale when you get to worry about whether a monorepo or a polyrepo is actually material

If you worked in a company that had a core product in a repo, and you wanted to create a slack bot for internal use, where would you put the code? I assume not within your core product's codebase, but within a separate repo, thus creating a polyrepo situation.

So when you say a monorepo will serve you in 99% of cases, are you not counting "side" projects, and simply talking about the core product?

This article is too aggressive and has a childish tone that is not to my taste.

My last 2 jobs have been working on developer productivity for 100+ developer organizations. One is a monorepo, one is not. Neither really seems to result in less work, or a better experience. But I've found that your choice just dictates what type of problems you have to solve.

Monorepos are going to be mostly challenges around scaling the org in a single repo.

Polyrepos are going to be mostly challenges with coordination.

But the absolute worst thing to do is not commit to a course of action and have to solve both sets of challenges (eg: having one pretty big repo with 80% of your code, and then the other 20% in a series of smaller repos)

Jesus, this. Look, you're going to run into issues either way, because you're trying to solve a difficult problem.

It's like thinking OOP or functional programming is going to solve all your issues... I mean, in some limited cases they could, but realistically you're just smooshing the difficulties around and hopefully moving them to somewhere where you are more able to deal with them.

FWIW, I've worked in a many-repo org and it sucked worse than huge companies with monorepos and good tooling, but I'm not going to make some blanket statement because it depends on the specifics of your code/release process/developer familiarity etc.

This. Every decision is a trade-off. There is no silver bullet. Context matters.

Sounds reasonable. I'll have to add, though, that the underlying technology factors into this as well.

For example: If you're stuck with a TFS monorepo (you poor soul), you actually get to deal with both problems to some extent, since TFS doesn't enforce that you check out the entire repository at once.

This can have very "funny" situations because someone forgot to checkout new changes in some folder. OTOH, at least for releases, you can remedy this by using CI everywhere.

Hilariously misguided.

Pretty funny to read that the things I do every day are impossible.

Monorepo and tight coupling are orthogonal issues. Limits on coupling come from the build system, not from the source repository.

Yes, you should assume there is a sophisticated "VFS". What is this "checkout" you speak of? I have no time for that. I am too busy grepping the entire code base, which is apparently not possible.

If the "the realities of build/deploy management at scale are largely identical whether using a monorepo or polyrepo", then why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo? Choices: 1) Google is dumb. 2) Mono and poly are not identical.

> then why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo? Choices: 1) Google is dumb. 2) Mono and poly are not identical.

I think, once you've chosen a path of mono or poly, you have quite a challenge ahead of you to migrate to the other.

At that point, the tradeoffs aren't based purely on the technical benefits, and "invest in monorepo tooling" may become a perfectly valid decision, as it's cheaper than "migrate to a polyrepo setup".

I'm not arguing either way for or against monorepo, just pointing out that "must be a good idea because Google does it" is invalid - technical merit is just one of the thousands of concerns to be balanced.

This makes sense to me. If you're considering a monorepo with millions of lines of code used concurrently by thousands of developers, it's absolutely possible. You just need a handful of developers working on the infra to make it happen.

I do agree with GP though. I wish the author hadn't decided the things I do everyday are impossible.

> You just need a handful of developers working on the infra to make it happen.

With thousands of developers banging on the code base, it's going to be more than "a handful of developers". It's going to be at least a few "handfuls" of developers full time plus probably many, many other full time equivalents spread out throughout the whole user base (testing, supporting other users, etc.).

In reality, it's several hundred developers working full time on infra, and they are all overworked, and gradually falling behind. Monorepos at the scale of Google/Facebook are hard.

We aren't talking about maintaining Mercurial here - we are talking about developing a brand new distributed VCS that happens to be 'Mercurial-compatible', and deploying/maintaining it for tens of thousands of developers working simultaneously.

Development with thousands of developers is hard. The problems with monorepo and polyrepo are only subtly different. Either way you need a fairly large team just to handle the tools you need to solve your problems. Some of your problems will be because of repo organization (again, both choices have downsides that you need custom tooling to solve).

Note that most of your problems will be related to having a thousands of developers and repo organization is irrelevant.

I'd refine this to say "it's a good idea the way Google does it".

3) Google is committed to a monorepo to the point that migrating away from it would be impractical.

Truth is, ending up with a monorepo is _really easy_. It usually starts with something that doesn't even _feel_ like more than one project: backend code, frontend templates and some celery/whatever tasks, maybe some minor utility CLI tools. And this happens at the stage nobody wants to even _think_ about more than one git repository.

Once those are big enough, it's likely too late.

But hey, you can always claim _you wanted it that way_. My cats always look good while pulling that one.

I've worked with both monorepo orgs, and polyrepo orgs. I think if you have only used git as your VCS and not something like perforce, you're likely to get the wrong idea of how it works.

Both CAN work, but for internal organizations with a reasonably sized team, I've come to realize that a monorepo is better. You attain "separation" by establishing different views of the code/data, and at scale the mental model of what's happening is much simpler.

> I think if you have only used git as your VCS and not something like perforce

I think if you worked with Perforce, you're likely to get the wrong idea that people who dislike monorepos didn't work with Perforce. But the reality is that anyone who worked in this industry long enough did at some point end up traumatised by it, thanks.

> the mental model of what's happening is much simpler.

How does introducing the concept of "views" to the VCS model make anything simpler?

It can work. That doesn't mean it is a universal solution. And it doesn't even mean it is a solution that is guaranteed to cover most projects. Whether or not a monorepo works depends on a lot of factors. In my experience the number of cases where it doesn't work appears to outnumber the cases where it works.

It can work nicely when you have disciplined and demonstrably above average programmers that are good at structuring the internal architecture of systems and will know how to design for plasticity. It is also an advantage if all your code is written in the same style and doesn't come from a bunch of older codebases. But even then you can end up with messes that you will be likely to conveniently forget about.

For instance while clear decoupling was a goal when I worked at Google, it wasn't always a reality. There were still lots of very deep and direct dependencies that should never have been there.

It does not work well if you have "average" developers or if you have undisciplined developers or excessive bikeshedders (which kill productivity).

Then there is the tooling. Most people do not work for Google and do not have the ability to spend as much money and time on tooling as Google does. What Google does largely works because of the tooling. It would suck balls without it. To be honest: some things sucked balls even with the tooling. Especially when working with people in different time zones.

Google isn't really a valid example of why a monorepo is a good idea, because your average company isn't going to have a support structure even remotely as huge as Google's. (If you disagree: hey, it's easy, go work for Google for a while and then tell me I'm wrong.)

At some point blaze added the concept of visibility to control dependencies, so teams can whitelist users of their code. Though you can always comment it out while developing locally.
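For readers who haven't seen it, this is how a monorepo enforces coupling limits in the build system rather than the VCS. In open-source Bazel it looks roughly like the following BUILD-file sketch (package and target names are hypothetical):

```python
# BUILD file sketch (Bazel/Blaze). Package paths are hypothetical.
cc_library(
    name = "billing_internal",
    srcs = ["billing_internal.cc"],
    hdrs = ["billing_internal.h"],
    # Only targets under //billing/... may depend on this library;
    # a dep from, say, //search/... fails at build time.
    visibility = ["//billing:__subpackages__"],
)
```

Loosening the list (or commenting it out locally, as the parent notes) is a one-line, reviewable change, which is the point: coupling becomes an explicit policy rather than an accident of what happens to be importable.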

“why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo?”

Didn't Google have a monorepo before git was created, and wasn't it created by academics? Legacy and momentum have a strong influence on the future. Hasn't Google also built a lot of tools for the monorepo, and doesn't it dedicate employees to it? That's exactly the issue this article is about.

From an external perspective, the speed and scale of product rollouts from the bigger tech companies is very slow. I don’t know if the tooling has much to do with it, but I suspect it might. I’ve heard some horror stories (some from here) about how it takes months to get small changes into production.

If you think companies like Google are slow, you should look at the enterprise world. I do some technical due diligence work from time to time, and although things have gotten better over the years, it wasn't that long ago that there was a more than 50% chance of seeing companies that only had source code repositories at a team level, and a 20% chance of them not even using a VCS to manage code.

I work for a company now where top management doesn't even understand what a repository is and what role it plays in software development.

Yes, it is that bad in much of the "enterprise" world.

Yes, Google had a monorepo before git was in widespread use. They used Perforce while I was there, which was a miserable, miserable experience. It only worked because they poured engineering effort into making it somewhat tolerable.

I think it would be wrong to say Google chose a monorepo because it was the best choice. To be honest, I don't think they really planned how to deal with many thousands of developers when they made the choice. They just did what seemed to make sense at the time and then had to make it work as the challenges started to mount.

Does Google require more engineers to support their build system than they would with a polyrepo? That question is not trivial to answer, IMO.

Any is more than 0 though. In my experience (probably shared by many devs), polyrepos don’t require a team, or even a single person, dedicated to version control. It’s a minor part of the software management (usually: “mind if I create a new repo for this?” “Yes/no”).

It does affect dependency management but no more than any external dependency.

A polyrepo setup at Google's scale would pretty obviously require some dev work. For example, their CI/build story would be way more complex.

While that may be true, I'm not convinced it is a given. Any complicated enough monorepo requires complex CI/build tools, and Bazel/Blaze exist for a reason ...

> Any complicated enough monorepo requires complex CI/build tools

Any complicated multi-repo setup requires tooling, processes, procedures, cross-repo PRs & issue tracking, &c. &c. &c.

The question is: which requires less cost in order to deliver business value? In my experience, on the teams I've been so far, the answer has been monorepos — but I don't know everything.

At Google scale, you'd either need tooling for automated version bumps, or some other infra to manage versioning.

You'd need cross repo bisection.

You'd need a way to run all tests in all repos reflecting a new change.

There are tens or hundreds more requirements I could list.

“You'd need a way to run all tests in all repos reflecting a new change.”

You really shouldn’t have to run every test on every product. Or really any other repos. Use semantic versioning, pin your dependencies, don’t make breaking changes on patch or minor versions.
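Pinning in the sense above, sketched as a pip requirements file (package names are hypothetical; the same distinctions exist in npm, Maven, etc.):

```
# requirements.txt sketch; package names are hypothetical.
libfoo==1.4.2      # exact pin: upgrades happen only via an explicit edit
libbar~=1.4.2      # compatible release: patch updates (1.4.x) float in
libbaz>=1.4,<2     # range: anything from 1.4 up to, but excluding, 2.0
```

The stricter the pin, the more reproducible the build, and the more deliberate (and deferrable) each upgrade becomes, which is exactly the trade-off the replies below argue about.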

Pinning your dependencies is an antipattern (or at least in the eyes of many people who support monorepos it is).

It results in one of three things:

1. People never update their dependencies. This is bad (consider a security issue in a dependency)

2. Client teams are forced to take on the work of updating due to breaking changes in their dependencies. If they don't, we're back at 1.

3. Library teams are forced to backport security updates to N versions that are used across the company.

But really, the question to ask is

>don’t make breaking changes on patch or minor versions

How can you be sure you aren't breaking anyone without running their code? You can be sure you aren't violating your defined APIs, but unless you're perfect, your API isn't, and there are undocumented invariants that you may change. Those break your users. Monorepo says that that was your responsibility, and therefore its your job to help them fix it. Polyrepo says that you don't need to care about that, you can just semver major version bump and be done with it, upgrading be damned.

No semver means that you, not your users, feel the pain of making breaking changes. That's invariably a good thing.

At AMZN, which has 1000s of separate repos, 1) was the general case, with 2) occurring whenever there was a critical security issue in some library that no one had updated for years. The resulting fire drill of beating transitive dependencies into submission could occupy days or weeks of dev time.

When you change a project, you have to test the effect of your changes on downstream dependencies. Semver is wishful thinking. Even a change that _you_ think is non-breaking could break something. Saying "well, the downstream team was using undocumented behaviour, so it's their fault" doesn't really hold much water when your team is Driving Directions, the downstream team is Google Maps Frontend, and your change caused a production outage.

The nice thing about polyrepo is that each repo doesn't care what the others are doing. They can use whatever tools they prefer (whatever is least work and most familiar for the team, perhaps). It might also encourage better documentation and adherence to good practices with regard to deprecation and maintenance.

> I don’t know if the tooling has much to do with it, but I suspect it might.

As a development team grows, time to market also grows, in a superlinear fashion. This is known since people shared code on dead-tree pages, so the odds of tooling being the cause are low.

3) A monorepo with significant investment in ecosystem and tooling is a better choice than a polyrepo

For other (smaller) companies, polyrepo might be the better choice because [significant investment in ecosystem and tooling] is not appealing, and the investments of Google et al. have not leaked through sufficiently into general available tools. Some headway is being made in the latter [1], so monorepo might be the "obvious" best choice in 10 years or so.

[1] For example, Git large file support is mostly from corporate contributors https://git-lfs.github.com/ https://github.com/Microsoft/VFSForGit

> For other (smaller) companies, polyrepo might be the better choice because [significant investment in ecosystem and tooling] is not appealing

That's not the choice, though: significant investment in tooling is a function of codebase size. In my own experience, polyrepos require more tooling, because you're not just dealing with files & directories, you're also dealing with repos (& probably PRs & issues & other stuff in a forge).

> In my own experience, polyrepos require more tooling

That's not my experience. In my experience, polyrepos significantly reduce complexity for a medium-sized (30-developer) project.

An example: the following things are good software development practices if you work with a master-PR branch model:

  1. Tests must pass on CI before merging a branch to master
  2. Before merging a branch into master, the latest master must be merged into the branch so that tests are still reliable
This quickly becomes untenable if 30 people all commit to the same repo. By the time your PR is reviewed, it's outdated. So you merge master into your branch. By the time you come back to check your test results and merge, it's outdated again. Repeat until 6 PM.

So you need partial builds to keep build time low, and would probably like to amend 1 & 2 with "unless your code has zero overlap with the changes in master". These are not standard features of any CI system I know of, hence the need for tooling.
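To be fair, the "branch must contain the latest master" half of rule 2 is mechanically checkable with plain git, even if the "skip when there's no overlap" relaxation isn't. A minimal sketch (branch names are hypothetical):

```shell
# A minimal gate for "branch must contain the latest master before merge".
# Branch names are hypothetical; in CI you would fetch first and pass
# origin/master rather than a local branch.
branch_up_to_date() {
    mainline="$1"   # e.g. origin/master
    branch="$2"     # e.g. HEAD
    # Succeeds iff every commit on $mainline is already reachable from $branch.
    git merge-base --is-ancestor "$mainline" "$branch"
}

# Typical CI usage (hypothetical):
#   git fetch origin master
#   branch_up_to_date origin/master HEAD ||
#       { echo "merge the latest master into this branch first" >&2; exit 1; }
```

The tooling pain the parent describes is everything around this check: re-running it (and the full test suite) every time master moves, for 30 concurrent PRs.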

Instead of tooling, polyrepos provide the above benefits out of the box. Just set your CI to build the repo, and it will do partial builds, and PR merging is uninfluenced by other repos. This is a huge advantage over monorepos.

The downside is that if your repos have tight coupling, you'll need simultaneous PRs in more than one place, or need to look up history/files in more than one repo. If this is more than a rare occurrence, this downside is so large that a polyrepo is not a suitable solution for your project.

The projects of this size I've worked with did not have this problem, or the problem was solvable without much difficulty.

Another thing is that the OP categorized "medium-sized monorepos" as being too large to fit on a laptop. Maybe my sample of companies is limited (mainly startups & consultancies), but the entire codebase, after several years of operation, still easily fits on a single laptop.

Of course there is a threshold, however this is typically a concern of a large organization or an organization that has been producing software for a decade or more.

> As described above, at scale, a developer will not be able to easily edit or search the entirety of the codebase on their local machine. Thus, the idea that one can clone all of the code and simply do a grep/replace is not trivial in practice.

Yeah this is a pretty widespread and fundamental misunderstanding that leads to a lot of bad policy decisions.

If 'grepping code' is your first resort then you're hitting things with a hammer. I'm writing code that a machine is supposed to understand. If the machine can't understand how the bits interact then I have much bigger problems than where my code is stored. Probably we're dealing with a lot of toxic machismo bullshit that is hurting our ability to deliver.

If you want discipline, if you want cooperation, hell if you just want to be able to hire a bunch of new people when you land a big customer, you need some form of support for static analysis and the code navigation that it enables. Stop the propeller heads from using magic and runtime inference to wire up the parts of the system, or find a new gig. Even languages where static typing isn't a thing have documentation frameworks where you can provide hints that your IDE can understand (ex: jsdoc for Javascript).

For a large team, working without any kind of static analysis is a recipe for a rigid oligarchy. Only people who have memorized the system can reason about it. Everybody else who tries to make ambitious changes ends up breaking something. See what happens when you trust new people with new ideas? New is bad. Be safe with us.

And even if by some miracle you do make the change without blowing stuff up, you're still in the doghouse, because we have memorized the old way and you are disrupting things!

Some crazy ideas work well. Some reasonable ideas fail horribly. To grow, people need the space to tinker and an opaque codebase ruins those opportunities. Transparency is also helpful when debugging a production issue, because people can work in parallel to the people most likely to solve the problem (even the person who is usually right is way off base occasionally). I should be able to learn and possibly contribute without jamming up the rest of the team by asking inane questions.

You need pretty good but entirely achievable tooling and architecture to get that, but man when you do it's like getting over a cold and remembering what breathing feels like.

At least the author gave us the courtesy of italicizing his broken assumption from the outset of the post.

> Because, at scale, a monorepo must solve every problem that a polyrepo must solve, with the downside of encouraging tight coupling, and the additional herculean effort of tackling VCS scalability.


But you have to get to "scale" first (as it relates to VCSs). Most companies don't. Even if they're successful. Introducing polyrepos front loads the scaling problems for no reason whatsoever. A giant waste of time.

Checkmate! I didn't even need a snarky poll. The irony of that poll is that it clearly demonstrates his zealotry, not other people's.

The author talks about proponents of monorepos, but I thought when I read it: actually they are victims of monorepos trying to explain to themselves as much as anyone why they choose to suffer with them. (Actual reason: for $$$).

Nobody would choose to drag around every historical afterthought in the development sequence of long forgotten software going back three decades that no longer builds with current tools, just so they can work on a small library off in a corner. Software is getting written and added to these monorepos at a much faster rate than hardware and networks are able to hide the bloat-upon-bloat growth of them.

>Nobody would choose to drag around every historical afterthought in the development sequence of long forgotten software going back three decades that no longer builds with current tools, just so they can work on a small library off in a corner.

If it doesn't work then it should be deleted. If it's still running somewhere then it should be maintained. Presumably you have a CI system so the monorepo actually requires everything in it to build.

In my experience, it's polyrepos that allow for dead and un-maintained code to just sit there for eternity. You forget about that unused repo right until the moment the service it deploys to (if you can track down that dependency) needs an update or goes down. Monorepos can more easily force system wide CI that checks for broken dependencies or other issues.

That old code worked 30 years ago... We have a closet in our office with a computer running Windows XP, no service packs, and whatever compiler was used to build at the time. If we ever have to release an update for whatever we were shipping, we can do so. (Assuming that computer still boots after all these years...)

In the embedded world, supporting software for 30 years is not unheard of. We avoid it, but it is in the back of our minds that someday we might have to release an update. Fortunately, 30 years ago nothing was internet-connected; we are worried that we might be releasing security updates for our current products 50 years from now...

Bloat and the failure to remove broken/bad legacy code have nothing to do with monorepos.

"Hey, what does this server do?"

"No idea; it hasn't been touched in years. What's deployed to it?"

"Some 'foobar-ng' thing, never heard of that. Says it was last updated 5 years ago. Pull up our source repo for that package, will you?"

"Hang on, we've got like 30 services with names containing 'foobar', let me find that one . . . oh god. You don't wanna know."

"Fuck it, I'm just gonna shut this server down and remove that ancient, dead, busted package."

"The main billing system just broke! What did you do?!"

You can't split monorepos after the fact, at least not without immense costs. You can always just put all your small repos into a big one.

> You can't split monorepos after the fact, at least not without immense costs.

That has not been my experience at all. At a previous employer, we did exactly that with a multi-language library. In fairness, having multiple languages enforced fairly good directory structure in our single repo. But isn’t that the real point: good structure makes life easier, period. The thing is, going into a project you often don’t know what the right structures are yet. Creating a new repo for each component you think you need ossifies those choices, making it far more difficult to walk back on them later on (first because you may not even see the architectural mistake, second because the maintainers of that component will have an investment in its existence).

> You can always just put all your small repos into a big one.

In my experience, that’s harder, precisely because over time so much tooling has been built into each repo to manage builds, images, deployment &c.

I’ve worked in monorepos & I’ve worked in multrepos, and so far my experience has been that monorepos enable faster velocity and more-maintainable software. I’ve not (yet) worked at Google- or Facebook-scale, though, and I’m completely open to the idea that at that scale a team really does need lots and lots of repos, and tooling to stitch them all together.

I think there's a nuance to this that should be pointed out: Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.). And if people can, they will use those hacks.

If you split your repo up from the get go, the worst thing you can get that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one. In Git, that could lead to multiple root commits, but that's about it.
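The "multiple root commits" outcome is worth seeing concretely. Git's classic subtree-merge recipe folds a standalone repo into a subdirectory of a bigger one with its full history; a minimal self-contained sketch (repo and path names are hypothetical):

```shell
# Fold a standalone repo (libfoo) into a monorepo under libfoo/,
# keeping its full history. All names here are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"

# A small repo to absorb:
git init -q libfoo
(cd libfoo && echo 'int foo;' > foo.c && git add foo.c && git commit -q -m 'libfoo: initial')

# The monorepo:
git init -q mono
cd mono
git commit -q --allow-empty -m 'monorepo: root'

# Subtree merge: join the unrelated history, then graft its tree under libfoo/.
git fetch -q ../libfoo HEAD
git merge -s ours --no-commit --allow-unrelated-histories FETCH_HEAD
git read-tree --prefix=libfoo/ -u FETCH_HEAD
git commit -q -m 'Merge libfoo into monorepo under libfoo/'
```

Afterwards `git log` shows two root commits, exactly as described above, and the merged files live under `libfoo/` with their history reachable from the merge.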

No. The worst case is that the engineering team spent more time working on “well encapsulated projects” than on the most important project for their business and are all now out of jobs. Most companies don’t fail because of tech debt. And certainly not because of version control tech debt.

Exactly. Whenever I see an engineer take a hardline position (eg: "no monorepos you zealots!") I always ask myself: is this person just annoyed?

Most of the time they're just annoyed.

One side effect of every successful business are annoyed worker ants that are sick of dealing with growth problems. I've been there. I know how annoying it can be.

Personally I've found comfort in embracing the chaos and learning to manage it responsibly. No dogma. No absolutes. Know how to do monorepos well. Know how to do polyrepos well. Learn the pitfalls of both. Don't assume other people are stupid zealots.

I agree. Any article (like the linked one) that states one side of a case as an absolute without giving any exceptions or caveats is going to be greeted by me with scepticism. Particularly as he keeps mentioning 3 large engineering organisations that disagree with him.

Not exactly. (At least small) companies can go out of business because of bugs. And one great way to "achieve" said bugs are implicit dependencies hidden from developers that didn't introduce them.

> The worst case is that the engineering team spent more time working on “well encapsulated projects” than on the most important project for their business

I'm not really sure how I should read this. Don't you use your repos to solve business problems? Why should that change because of the repo layout?

If you do a poly-repo approach from the start, and have dependencies between repos, you need to introduce component versioning from the start. Component versioning doesn't solve any business problems, but requires engineering effort.

Small businesses are not going to go out of business because of bugs unless those bugs aren't addressed. They'll go out of business because of poor sales and product management. Different things.

> Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.).

I've seen that with polyrepos as well: The entire project would require you to clone the individual repos into a specific directory structure so that things would work (no, not even submodules).

> Monorepos allow you to do very bad hacks (I need this other component over there; let me just put in a Symlink. Done.).

Why would you put in a symlink? You could just provide a path to the actual component and import it into your project.

> the worst thing you can get that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one

When you have multiple repos, you also have multiple versions and releases of things. Now every team has the following options:

1. Backport critical fixes to every version still in use (hard to scale)

2. Publish a deprecation policy, aggressively deprecate older releases, and ignore pleading and begging from any team that fails to upgrade (infeasible - there'll always be a team that can't upgrade at that moment for some reason)

You also have to solve the conflicting diamond dependency problem. This is when libfoo and libbar both depend on different versions of libbarfoodep. It's even more fun in Java because everything compiles and builds, but fails at runtime. So now you have to add rules and checks to your entire dependency tree - some package managers have this (Maven), others don't (npm IIRC).
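For Maven specifically, the check mentioned is the enforcer plugin's dependencyConvergence rule, which fails the build as soon as two paths through the dependency tree resolve a library to different versions. A minimal pom.xml sketch:

```xml
<!-- pom.xml fragment: fail the build when two transitive paths
     resolve the same dependency to different versions. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <executions>
    <execution>
      <id>enforce-convergence</id>
      <goals><goal>enforce</goal></goals>
      <configuration>
        <rules>
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>
```

That turns the diamond problem into a build-time error instead of a runtime surprise, at the cost of having to resolve every conflict explicitly (e.g. with dependencyManagement entries).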

> Why would you put in a symlink? You could just provide a path to the actual component and import it into your project.

Where do I need to put the path again? Ah what the heck, I'll just add a symlink inside a folder that's already somewhere in the build definitions.

What language and build tool is this that you're using?

I don't know anyone who has abused Maven or Cargo or Go like this. And I don't imagine Visual Studio Solutions for C# are used like this.

Is there an underlying disagreement here between JS/Ruby/Python scriptish coding (which creaks when a lot of developers work on it), C and C++ (which have astonishingly bad build-system stories), and big-iron languages that don't sweat when in a monorepo?

> And I don't imagine Visual Studio Solutions for C# are used like this.

At my workplace, we've just been cleaning up a whole bunch of instances of exactly that anti-pattern. Except that it's obviously not symlinks (which require specific user rights on Windows), but links to external files in VS.

Same problem, though: They're easy to introduce and a pain to deal with later on.

I have never done such a hack myself, but I've seen it. Mostly in C++ projects ;)

Could a VCS simply blacklist symlink files? Plus if you have developers doing crap like this (and their colleagues letting it slide in code reviews) you have problems that can't be solved by monorepo vs polyrepo. You have an engineering culture problem.

No VCS will ever protect you from crappy code though.

That's true, but we should at least make it as hard as possible to write crappy code ;)

> You can always just put all your small repos into a big one.

It's not quite as simple as that. You'll need to avoid rebuilding the entire repo for every change - using something like Bazel. This means your build tooling has to be replaced entirely, which is a non-trivial task and not something your devops/release engineering team will thank you for.

For any 3rd party libraries used by your projects, you need to either ingest those projects into your monorepo and update them forever, or keep npm/maven/pip/gem/whatever around just for managing 3rd party dependencies (+ whatever system you use to front the main language package registries, because of course you're not talking directly to NPM/Maven Central, are you? What if they go down or do a leftpad?).

I think either system - monorepos or polyrepos - works fine; just pick one and stick with it. Monorepos will probably give you better velocity starting out. Past a certain size, which most software shops will never hit, the already-available tools lend themselves better to polyrepo. And more devs know polyrepo tools (eg. Jenkins) than the corresponding monorepo solution (eg. Bazel). Things might swing in favor of monorepo on the VCS side if Twitter/Google/FB ever open-source their stuff.

Eh, if your small repos build separately before you put them into one big repository, they'll build separately after, even if you just have a Makefile in each. If building your software depended on building its parts first, you already have the tooling to do so.
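The point above — each subproject keeps its own build, and the combined repo only needs a thin layer that runs those builds in dependency order — amounts to a topological sort. A minimal sketch, with hypothetical project names (in practice the per-project step would shell out to each folder's existing `make` or equivalent):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Hypothetical subprojects of a monorepo and the siblings each builds on.
projects = {
    "common":   [],
    "backend":  ["common"],
    "frontend": ["common"],
    "cli":      ["backend"],
}

def build_order(deps):
    """Return one valid order in which to run each project's own build."""
    return list(TopologicalSorter(deps).static_order())

order = build_order(projects)
# "common" is guaranteed to come before everything that depends on it;
# the orchestrator can now simply run each project's build in this order.
```

Nothing about the individual builds had to change; only this ordering layer is new.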

This really does not work for all languages. Speaking from personal experience, trying to monorepo multiple JS projects managed by NPM, it's more work than that.

I know neither JS nor NPM. Could you elaborate a bit? What's the difference between having your code in folders A B C, each their own repo, and having your code in folders A B C, subfolders of some monorepo? Does NPM try to do something clever with your VCS?

>not something your devops/release engineering team will thank you for.

It's their job. If they actively don't want to do work then you probably made a hiring mistake somewhere. By that logic, what DevOps really wants is for the company to shut down, since then they'd have none of that tedious work to do.

Release engineering is a conservative profession. They won't thank you for upsetting all their established processes, introducing new software and systems that have to be maintained, and kept running just because you don't like how your code is laid out.

It's harder to make a business value case for this type of change - there are only vaguely worded promises of "improved developer velocity". Contrast that with a change that automates or makes faster some aspect of building and releasing - a professional release engineering team would be all over it because they can demonstrate value in that work.

In any company beyond 50 people, there are multiple engineering managers, directors or the VP of engineering that will need to back this initiative to make the release engineering team do it. It's really not as simple as "dump all the code in one repo". I'm speaking from experience.

They will have to work the entire Christmas break to pull off the switch. That will not make them happy. Rolling out massive changes like this is not easy; it needs to be planned in advance, tools built, dry-runs run, and then the final move. It also requires management to not schedule any releases for several weeks before or after the change. As a member of a tools team I wouldn't think of doing this level of change when other people are in the office, which means I miss my Christmas holiday.

Sure, it is their job, but it isn't an easy job and there are many opportunities for things to go wrong. It might or might not be the right choice for you, but don't overlook how hard it is.

Note that the above applies for going in either direction.

You're making a lot of assumptions about how such a move would be done which I don't feel are warranted. You're picking the hardest most painful option and then using it to claim the process is painful rather than that the option you chose is painful.

If I was moving many small repos into a single mono repo then I'd do it one repo at a time. Presumably your small repos are independent entities so there's no reason to do a single massive switch. Transition each repo to the new build system inside the existing repo. Once that works then you can transition that repo into the mono-repo and tie together the build systems. No need to stop releases, no massive chance of everything failing, no weeks of debugging while the world is stopped, etc. Rinse and repeat until everything is moved over. Process becomes more optimized and less error prone with each repo that is moved over.

Now I lose lots of weekends making each separate move.

Why weekends? Small moves means you do it during the week and during regular working hours. Done in branches with CI support so that it's pretty unlikely to break anything. If doing regular releases require you to be spending your weekends then you have bigger issues to fix first. Spread it over however long it takes. Why are you trying to make your life more difficult?

I have done this. Regardless of when you do the move everybody who works in the repo being moved is down for several hours at best if you do it during working hours.

It's funny you say that, because the currently top-voted comment says exactly the opposite: https://news.ycombinator.com/item?id=18811368

> You can't split monorepos after the fact, at least not without immense costs.

Sure you can. The difficulty of doing so depends on many (many) factors. If your team does their job well then the costs won't be immense. It might be annoying, but not that hard.

Speaking in absolutes or platitudes solves nothing. Sometimes monorepos make sense. Sometimes polyrepos make sense. It's entirely dependent on what your company does.

Of course, if everybody is very diligent in keeping things in the monorepo distinct and independent, then it's easy to split it later on. But relying on constant diligence doesn't work out in the long run in my experience.

It's not like polyrepos solve the long term diligence problem though.

E.g. if you rely solely on a package manager to keep coupled things in lock step, you need to make sure that version numbers are kept up to date for every little change made to every library.

You can easily end up with a situation where someone in another team makes a small change but doesn't change the lib version number. That's a people issue but it does happen.

You can get round that by using a repo SHA, but now you have two things to keep up to date for every library.

Likewise, you'll have to be diligent in versioning APIs. Anecdotally, I've found it easier to keep things in lock step when in a single repo and using a single pull request for each story than I have where separate teams have to keep separate repos in sync.

Both work, but the monorepo approach worked better for the projects I've worked on. It just led to fewer moving parts and more repeatability when there's a single SHA to watch.

I also have been lucky enough that I haven't worked on a project so large that we couldn't build a monorepo on a single machine with "normal" build tooling.

There's a lot wrong with this article. Most of the arguments are either not backed up or are misleading. I haven't heard anyone argue they can drop dependency management because of a monorepo.

The author lists downsides of monorepos without listing the upsides and downsides of polyrepos, so it's really only half complete.

I don't think anyone who likes a monorepo is suggesting you just commit breaking changes to master and ignore downstream teams. What it does do is give the ability to see who those downstream teams (if any) might be.

The crux of the author's argument is that added information is harmful because you might use it wrong. It's just as easy (far easier, in fact) to ignore your partners without the information a monorepo gives. It's not really an argument at all. There's really nothing here but "there be dragons".

Monorepos provide some cross-functional information for a maintenance price. It's up to you whether the benefit is worth the overhead.

"... Please don't" titles also give off a condescending vibe, which usually means the author has erected strawmen, is appealing to emotion, & has not thought things through.

Seems like the main point is that you'll still need to add additional tooling (search, local cloning, build, etc.) to handle scaling, something you can do just as well with polyrepos. Conversely, for polyrepos, you can add tooling to fix issues with dependency management and multi-project changes/reviews. However, the author figures that monorepos encourage bad code culture and points out that Git is hard to build a monorepo on.

To me this message seems a bit shallow, of course we can build tooling to hide the fact that we have a polyrepo. Given well enough built tooling and consistent enough polyrepo structure (all using same VCS, all being linked from common tooling, following common coding standards and using the same build tooling, etc.) the distinction from having a monorepo is more of an implementation detail.

Given the choice between a consistent monorepo where everyone is running everything at HEAD and a polyrepo where each project has its own rules and there's no tooling to make a multi-project atomic change, I'd go for the former.

Given the choice between identical working environments but different underlying implementations I would go for whatever the tools team think is easier to maintain.

What is the tooling for multi-repo atomic synchronized commits? Monorepos give you that for free, which is the reason why I think monorepo projects exist. SVN kind of gave you partial checkouts, which was helpful.

Polyrepo argues that is a non-feature and don't give it to you. You can figure out where things are, but you never get synchronization.

This is a good thing, because when you have to make the multirepo commit you make the change and then update each downstream repo one at a time. Each change is much smaller and so easier to review (and also easier to find the right reviewer).

Of course the downside is you either have to maintain both ABIs (not just the API), have a rollout scheme where two versions of the upstream library exist side by side, or don't release.

Nothing is perfect.

Yes I think so too. But of course, as the article points out, nothing is entirely free. At some point we will have to build tools to handle scaling, and then the trade offs between a mono and polyrepo becomes less obvious. I'd lean towards monorepos as a base either way, but given sufficiently well working tooling it might not matter much.

> Given well enough built tooling and consistent enough polyrepo structure (all using same VCS, all being linked from common tooling, following common coding standards and using the same build tooling, etc.) the distinction from having a monorepo is more of an implementation detail.

Exactly. Sure, you can manually recreate a monorepo from a multirepo system, but … why do that? That takes software engineering effort that you could spend on your product instead.

I’ve found monorepos to be extremely valuable in an immature, high-churn codebase.

Need to change a function signature or interface? Cool, global find & replace.
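That "global find & replace" can also be scripted when the rename spans more files than an editor session comfortably handles. A minimal sketch (function names hypothetical; a real refactor would want a syntax-aware tool, since a whole-word regex can't tell a call site from a string literal):

```python
import re
from pathlib import Path

def rename_symbol(root, old, new, suffix=".py"):
    """Rewrite every whole-word occurrence of `old` under `root` in place."""
    pattern = re.compile(r"\b%s\b" % re.escape(old))
    changed = []
    for path in Path(root).rglob("*" + suffix):
        text = path.read_text()
        updated = pattern.sub(new, text)
        if updated != text:
            path.write_text(updated)
            changed.append(path)
    return changed  # files that were touched, e.g. for a single review/commit
```

In a monorepo the whole rename lands as one commit; across polyrepos the same change becomes N coordinated pull requests.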

At some point monorepos outgrow their usefulness. The sheer number of files in something that's 10K+ LOC (not that large, I know) warrants breaking the codebase apart into packages.

Still, I almost err on the side of monorepos because of the convenience that editors like vscode offer: autocomplete, auto-updating imports, etc.

Monorepos and packages are not mutually exclusive. You can and should have many different projects in subfolders of your monorepo, each with their own builds and tests and artifacts (though hopefully somewhat standardized). The point is that now it's easy to release changes across multiple projects, integration test between them on a specific global patch, etc., without a whole pile of complex tooling.

Agreed. When I wrote the parent comment, I was thinking of a time I prematurely abstracted an API wrapper to a private git repo and how painful simple, frequent changes were.

Though, as you say and I commented below they’re not mutually exclusive. A wrapper or even an entirely separate service can exist alongside others.

One dark side of this is being able to “reach inside” other parts of the monorepo and blur application boundaries.

This guy gets it. Software Engineering is about using the appropriate tools and techniques for the task at hand. If your repo gets so large it can't be comfortably checked out, something needs to get split apart.

Monorepos are also a great technique for tackling large legacy codebases. When the rot is all in one designated place, it becomes easier to encourage good developer habits on new code created in new, separated repo(s).

Speaking from experience, I've worked on a team operating through a monorepo project that came out real well. The codebase was mostly golang, so everything lived in the GOPATH, but for the most part the TypeScript on the UI side of the repo didn't complain. Testing and code quality were a higher priority as well, which may have contributed to its success.

I have also worked on a monorepo project that had minimal tests and automation, that soon grew monstrous and ultimately needed refactoring. That was a big pile of coffeescript, es6 and java that ultimately refactored into three different node modules and two microservices.

Javascript and its module packaging tends to conform better to polyrepo patterns. golang code all wants to be in the same place, and java repos have their own desired nested directory structures. These two languages tend to encourage monorepo design patterns.

Monorepo or Polyrepo, the correct answer is whatever works for your team and task at hand.

Hold on, are we talking about monorepos, ie a set of projects with shared change history (and possibly 'build it all' type tooling) or single monolithic apps?

I'm seeing these two things conflated in this thread.

To me, a monorepo consists of a set of related or semi-related services or runtimes that can operate autonomously, but have a dependency on their siblings to operate correctly.

In some cases, this could be two separate backend projects where you want to re-use the same deployment pipeline.

Often, I find that API wrappers are something that I share across frontends and backends in the JS world, so it often makes sense to separate my projects into:

- backend

- frontend

- common

In Typescript I really like this pattern and can namespace shared types so that it’s very clear to the future reader that this type is probably used outside of the current context.

So, to reply to your comment — I think the term “monorepo” can encompass a lot of different project types.

I think Dan Luu covers the bases quite well here:


In fairness, a single repo does encourage a monolithic architecture (even though one can have multiple binaries inside a single repo), just as a monolithic app does encourage a lack of modularity (even though one can write a single app composed of well-chosen modules).

The biggest gripe I have with modern day monorepos is that people are trying to use Git to interact with them, which doesn't make a tremendous amount of sense, and results in either an immense amount of pain and/or the creation of a bunch of tools to try to coerce Git into behaving basically like SVN.

Which of course begs the question, rather than trying to perform a bunch of unnatural acts, why not just use SVN to start with? It works extremely well with monorepo & subtree workflows.

Sure it has some warts in a few dimensions around branching, versioning, etc. compared to Git when using Git in ways aligned with how Git wants to work, but those warts are minimal in comparison to what's required to pretzel Git monorepos into scaling effectively.

Maybe it's just that the author's cutoff is at the wrong team size, but the monorepo I work on (with ~150 devs) has almost none of the problems presented.

Unreasonable for a single dev to have the entire repo? I'm looking at a repo with ~10 million LoC and ~1.4 million commits. I have 74 different branches checked out right now. Hard drives are cheap.

Code refactors are impossible? I reviewed two of those this morning. They're essentially a non-event. I'm not sure what to make of the merge issue - does code review have to start over after a merge? That seems like a deep issue in your code review process. The service-oriented point seems like a non-sequitur, unless you're telling me I'm supposed to have a service for, say, my queue implementation or time library.

The VCS scalability issue is the only real downside I see here. And it is real, but it also seems worth it. It helps that the big players are paving the way here - Facebook's contributions to the scalability of mercurial has definitely made a difference for us.

In theory, yes - if the underlying repo changes, code review should start over. In practice though, it's a terrible idea ;)

Part of code review is to ensure the code "fits" with all other merged code - so a re-review is "needed" when other changes merge. E.g. if I merge a refactor that changes everything from Pascal case to 100% SHOUTING, reviews now need to take this into account.

In practice, this doesn't happen - it's way too much effort for far too little value.

I think the trick is to only re-review the areas that had merge conflicts, and to do the re-review aware of both the changes you already reviewed and the changes that caused the conflict. Merge conflicts, even in big code refactors, are fairly rare, so this ends up not being much additional work in practice.

> if I merge a refactor that changes everything from Pascal case to 100% SHOUTING, reviews now need to take this into account.

To be fair, if you get away with merging that refactor, the review that needed more attention was of that refactor ;-)

I do really like mono-repos, but Google's other significant new project, Fuchsia, is set up as a multi-git repo (and I believe Chromium too, maybe Android; haven't checked). For Fuchsia, they use a tool called "jiri"[1] to update the repos; previously (and maybe still in use) it was the "gclient" sync tool [2] from depot_tools[3].

[1] - https://fuchsia.googlesource.com/jiri/ [2] - https://chromium.googlesource.com/chromium/tools/depot_tools... [3] - https://chromium.googlesource.com/chromium/tools/depot_tools...

It even reflects a bit in the build system of choice, GN (used in the above, previously gyp), which feels similar on the surface (script) to Bazel but has some significant differences (GN has some more imperative parts, and it's a ninja-build generator, while Bazel, like Pants/Buck/please.build, is a build system in its own right).

Simply fascinated :), and can't wait to see what the resolution of all this would be... Bazel is getting there to support monorepos (through WORKSPACEs), but there are some hard problems there...

Having worked with some organisations building on Android (>1,000 repos), life is not easy when you are trying to build on top of it and regularly take updates etc.

I asked one company how many changes required changes to more than one repo and was told "a small percentage". We then did some basic analysis of issue IDs across commits and discovered that it was in reality nearer 30% of changes. Keeping those together was just plain very hard.

Start to scale this by teams of hundreds or thousands of devs and you get a lot of pain.

Managing branches is also hard - easy to create (with repo tool) - but hard to track changes.

Funny that there are so many reimplementations of git submodules but with support for "just give me HEAD" - Google has two (jiri and repo), my company has a home-grown one too.
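The common core of those "submodules, but give me HEAD" reimplementations is a manifest: day-to-day syncs take each child repo's floating branch, while a release snapshot freezes every branch to a concrete revision. A minimal sketch of that model, with hypothetical repo names and a stand-in for `git ls-remote`:

```python
# Hypothetical sketch of the manifest model behind tools like repo/jiri:
# normal syncs follow {repo: branch}; a snapshot pins {repo: sha} so a
# particular build can be reproduced later.

def snapshot(manifest, resolve):
    """Freeze a manifest of {repo: branch} into {repo: sha} via `resolve`."""
    return {repo: resolve(repo, branch) for repo, branch in manifest.items()}

# Stand-in for `git ls-remote <repo> <branch>` at snapshot time:
heads = {("device/foo", "master"): "a1b2c3",
         ("kernel", "master"): "d4e5f6"}

manifest = {"device/foo": "master", "kernel": "master"}
pinned = snapshot(manifest, lambda repo, branch: heads[(repo, branch)])
```

Plain git submodules give you only the pinned half, which is why everyone keeps rebuilding the floating-branch half on top.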

Android uses a top level repo that behaves as a monorepo with thousands of submodules inside. It's also designed for multiple companies sharing code and working with non shared code at the same time which introduces some constraints and challenges.

"Bazel is getting there to support monorepos" -> support multiple repos (was tired when I wrote this I guess)...

My polyrepo cautionary tale: Two repos, one for fooclient, one for fooserver, talking to each other over protocol. Fooserver can do scary dangerous permanent things to company server instances, of which there are thousands.

Fooserver sprouts a query syntax ("just do this for test servers A and B"), pushed to production. Fooclient sprouts code that relies on this, pushed to production. A bit later, Fooserver is rolled back, blowing away query syntax, pushed to production. "Just do this for test servers A and B" now becomes "Do this for every server in the company". Hilarity ensues.

Ouch. I suppose the lesson is that a monorepo with both client and server being developed and tested together would have reduced such risk.

Versioning the client/server interface would've also reduced such risk.
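One hedged sketch of what that versioning could look like (all names hypothetical, not from the original story): the client stamps each request with the minimum protocol version it needs, and the server refuses anything newer than it understands, so a rollback fails loudly instead of silently reinterpreting the request.

```python
SERVER_PROTOCOL = 1  # after the rollback: this build has no query syntax

def handle(request, server_protocol=SERVER_PROTOCOL):
    """Reject requests that need features this server build doesn't have."""
    if request["min_protocol"] > server_protocol:
        return {"ok": False, "error": "server too old for this request"}
    # Only now is it safe to interpret the (possibly absent) query field.
    targets = request.get("query", "ALL")
    return {"ok": True, "targets": targets}

# A client built against protocol 2 (query syntax) gets an error back
# instead of "do this for every server in the company":
risky = {"min_protocol": 2, "query": "test servers A and B"}
```

The monorepo and the version handshake are complementary here: the first makes the incompatibility visible at development time, the second at deploy time.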

I call bullshit on "our repository is too big for one machine".

Seriously, you have over 1 TB of code and 100 people wrote it?

Adding raw versions of binary assets (designs, video, ...) can quickly lift a repo beyond a TB. Now, you could say "don't do that", but there are valid use cases where you'd want to track all binary assets as part of the development cycle.

Ouch, well, yes that is a very good situation in which not to take the "mono" part of "monorepo" too seriously.

or use a VCS that allows partial checkouts of repositories. There's no DVCS that I know of that can do that, but for example SVN can. Git LFS might be an option, too. There are also commercial products that target that market.

I just wanted to point out that reaching a measly TB of data doesn't require much effort. (worked on a product that would version rendered clips for special effect production).

You can do that with content, you just partition the workspace/view of the monorepo to what each person needs rather than checking the entire thing out git style.

You can have large repos and it not only be code. I remember seeing repos many tens of gigs because all of VS was versioned as well for "reproducibility".

In 2016, Google's monorepo was 86TB

The keyword there is "Google". Everything is different at the extremes.

Better title would be "Monorepos don't fit with my particular use case."

I strongly agree. I hate this style of blog post.

Telling people what they should or should not do is generally absurd. Every situation is unique and you can't possibly know another project's requirements or acceptable trade-offs.

A better approach, in my opinion, is "Here's what we did and why". The author clearly has experience in the area. Great! Tell me about your problems. Tell me about your attempted solutions and what did or did not work. Tell me what you wish you had done! I'd love to use knowledge of your situation to inform my own decision making.

But don't be surprised if my circumstances are different and lead me to prefer different trade-offs and choose a different solution. That doesn't make me a zealot or an idiot.

Isn't your own post telling people what they should and should not do (specifically on how to give advice)?

The irony wasn't lost on me. It's a fine line. Let me try a slightly different approach.

When I blog I've had much better luck telling people "here's what I did and why". I don't know your circumstances and can't tell you how to solve your problems. You may need to choose different trade-offs than I did. With that said, here is my problem, how I solved it, and what I learned along the way. Hopefully you can learn from my experiences and make a more informed decision for how to handle problems you may encounter.

I disagree. The thesis statement repeated several times throughout is "Monorepos don't scale in exactly the same ways that polyrepos don't scale, the tools to solve the scaling problems are the exact same except monorepos need more of them and encourage bad habits along the way."

You may disagree with that thesis, but it definitely seems to cover more than one use case.

...the author's title is literally "monorepos: please don't!"

To me, the key point is this: Splitting your code into multiple repos draws a permanent architectural boundary, and it's done at the start of a project (when you know the least about the right solution).

The upsides and downsides of this are an interesting debate, but there is a cost to polyrepos if you want to change the system architecture. There is a cost to monorepos too as argued by this post, and its up to the tech leads as to which cost is greater.

"The frank reality is that, at scale, how well an organization does with code sharing, collaboration, tight coupling, etc. is a direct result of engineering culture and leadership, and has nothing to do with whether a monorepo or a polyrepo is used. The two solutions end up looking identical to the developer. In the face of this, why use a monorepo in the first place?"

.....because, as the author directly stated, the type of repo has nothing to do with the product being successful. So stop bikeshedding, pick a model, and get on with the real business of delivering a successful product.

Could you get the best of both worlds by having a monorepo of submodules? Code would live in separate repos, but references would be declared in the monorepo. Checkins and rollbacks to the monorepo would trigger CI.

There's not much good to either world.

You need fairly extensive tooling to make working with a repo of submodules comfortable at any scale. At large scale, that tooling can be simpler than the equivalent monorepo tooling, assuming that your individual repos remain "small" but also appropriately granular (not a given--organizing is hard, especially if you leave it to individual project teams). However, in the process of getting there, a monorepo requires no particular bespoke tooling at small or even medium scale (it's just "a repo"), and the performance intolerability pretty much scales smoothly from there. And those can be treated as technical problems if you don't want to approach social problems.

To put it another way, we're comparing asymptotic O(n) with something bigger, neglecting huge constant factors on the former. There's a lot of path-dependence, since restructuring all your repos with new tooling is hard to appreciate.

This is the approach taken by a number of projects. The ones I am most familiar with are the OpenEmbedded/Yocto/Angstrom family that build Linux for embedded devices. They have a 'root' repo that references the layer repos (using metadata files rather than submodules), and there is a tool that does the pulling. It's optimised for pulling not committing though, I don't think the tooling helps much with bumping versions.

It can be misused though - the releases of the root repository reference the children by tags usually. Someone retagged a child repo and we suddenly had build failures.

We actually did this. When I started at Uber ATG one of our devs made a submodule called `uber_monorepo` that was linked from the root of our git repo. In our repo's `.buckconfig` file we had access to everything that the mobile developers at Uber had access to by prefixing our targets with `//uber_monorepo/`

We did however run into the standard dependency resolution issue when you have any loosely coupled dependency. Updating our submodule usually required a 1-2 day effort because we were out of sync for a month or two.
