Monorepos and the Fallacy of Scale (presumably.de)
104 points by loevborg on Jan 8, 2019 | 118 comments



Whenever I hear smart and reasonable people argue well for both sides of an engineering issue, my experience is that it will turn out that we're arguing the wrong question. The perspective is wrong. We can't get past thinking in terms of our old terminology.

What we all really want is a VCS where repos can be combined and separated easily, or where one repo can gain the benefits of a monorepo without the drawbacks of one.

Another crazy tech prediction from me: just as DVCS killed off pre-DVCS practically overnight, the thing that will quickly kill off DVCS is a new type of VCS where you can trivially combine/separate repos and sections of repos as needed. You can assign, at the repo level, sub-repos to include in this one, get an atomic commit hash for the state of the whole thing, and where my VCS client doesn't need to actually download every linked repo, but where tools are available to act like I have.

(This will also enable it to replace the 10 different syntaxes I've had to learn for one project to reference another, via some Dependencies list and its generated Dependencies.lock list.)
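To make the "atomic commit hash for the state of the whole thing" idea concrete, here is a minimal sketch. It assumes each sub-repo is identified by a path and its current commit id; the function name and the example paths and ids are made up for illustration and are not taken from any existing VCS.

```python
import hashlib
from typing import Mapping


def combined_state_hash(subrepo_heads: Mapping[str, str]) -> str:
    """Return one hash covering every (path, commit id) pair.

    Entries are processed in sorted order so the result does not
    depend on the order in which sub-repos were listed.
    """
    digest = hashlib.sha256()
    for path in sorted(subrepo_heads):
        commit_id = subrepo_heads[path]
        # NUL separates the two fields, newline separates entries.
        digest.update(path.encode("utf-8") + b"\0")
        digest.update(commit_id.encode("utf-8") + b"\n")
    return digest.hexdigest()


# Hypothetical sub-repo layout and commit ids.
state = {"libs/auth": "a1b2c3", "services/api": "d4e5f6"}
reordered = {"services/api": "d4e5f6", "libs/auth": "a1b2c3"}
bumped = {**state, "libs/auth": "ffffff"}
```

Here `state` and `reordered` hash to the same value, while `bumped` does not, so one id pins the whole combination without the client having to download any of the linked repos.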

In a sense, we already have all of these features, in folders. You can combine and separate them, you can make a local folder mimic a folder on a remote system, and access its content without needing to download it all ahead of time. They just don't have any VCS features baked in. We've got {filesystems, network filesystems, and VCS}, and each of the three has some features the others would like!

I don't have much money right now but I'd pay $1000 for a good solution to this. I'd use it for my home directory, my backups, my media server, etc.


I don't think it is quite the revolution from (or killer of) DVCS that you think. A lot of what you ask for can be done with DVCS tools if you go low level enough. The git graph supports more complex shapes in the raw than the git UI tools ('porcelains') tend to make easy to work with, especially when you throw in new tools like VFS for Git (GVFS) for virtualizing some or all of the git object database. To some extent what you ask for is simply a UX problem in the DVCS world: how to make such power available in an easy-enough-to-use way without providing too many new footguns.

> Whenever I hear smart and reasonable people argue well for both sides of an engineering issue, my experience is that it will turn out that we're arguing the wrong question.

Right, though my feeling is the root underlying discussion is actually sociopolitical rather than technical. The preference between monorepos and polyrepos seems more to do with the organizational structures of organizations and people than about the actual technical merits of either approach. I think the issue is that developers just feel safer trying to tie things to technical merits than engage with the problem at a sociopolitical level.


> tie things to technical merits than engage with the problem at a sociopolitical level.

This hits the nail on the head. The real advantage of monorepos is that they are reproducible. Many-small-repos aren't. A monorepo, ideally the kind Bazel[1] pushes, will produce the same bitstream every time you run "make".

Projects terminated, going in a different direction, servers moved, down, forked, ... it all just doesn't matter. Your stuff works. Network or no network.

The problem is that these repos allow programmers to be very undisciplined in a way that sabotages projects. In a monorepo you are updating libraries yourself, in every individual project. That's a big plus when it comes to "my software works", but of course it is work. Your software doesn't suddenly stop working, but that's because it doesn't change, which means it won't change unless you make it change. Which means: new database driver? You're upgrading it. No apt-get, no sysadmin deploying a shared library or DLL install doing it for you.

There is always a strong drive by higher up programmers and managers to fix "undisciplined" programming, and ... I've never seen this work. Or at least, I've never seen this do more than move the problem.

The thing I don't understand about some languages, like Golang, is the (ridiculous) insistence on hermetic binaries (not using anything on the system, not even libc), and yet their very strong insistence on non-hermetic source (even to the point of sabotaging the tools for people who wish to do it, and sabotaging to some extent the community to achieve this). Either do one or the other. I feel like C/C++'s support of both approaches is a superior option. You want shared (nobody does, but hey that's my opinion)? You can do that. You want everything hermetic? Not a problem.

In my opinion sharing things between projects is a mistake.

[1] https://bazel.build/


There are just as many tools for reproducible builds of polyrepos, between various package management strategies and CI tools.

> Projects terminated, going in a different direction, servers moved, down, forked, ... it all just doesn't matter. Your stuff works. Network or no network.

I think it becomes a forest/trees thing. People that prefer a monorepo are most focused on the forest, and people that prefer polyrepos are most focused on the individual trees. The end goal is usually the same (a healthy forest), and done right both approaches are generally isomorphic (the "canopy" of the forest looks generally the same to the users either way).

To push the metaphor perhaps to the breaking point, a lot of the arguments between monorepos and polyrepos break down into management of the "root system" in the forest. Monorepos often don't care as much how deeply the roots are entangled between trees so long as the forest is healthy, and polyrepos tend to encourage a more "bonsai" approach of tending to each tree on its own. Either way, it rarely seems to affect the "canopy" (end user experience) of the forest, but can matter a great deal to things like technical debt and project management structure and project deadlines.


> We've got {filesystems, network filesystems, and VCS}, and each of the three has some features the others would like!

We actually have a fourth one that has nice features but missing what you exactly need: packages (and associated package managers). A package has (on top of build instructions) a list of files, and can depend on another package for working properly. At the top, you could have your own "machine.pkg" that defines all that is needed and is updated as some files are updated. A package can go from one machine to another and only the missing dependencies will be fetched.

A package manager has the pro that any part can be defined as a package and abstracted away, and the con that packages aren't expected to change much so it'd become heavy to recursively update all packages, but that could be a way to hack something.
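A rough sketch of the "only the missing dependencies will be fetched" part, assuming the dependency graph is already known as a plain mapping. The package names (apart from "machine.pkg" from the comment above) and the function name are invented for illustration, and no real package manager works exactly like this.

```python
from typing import Dict, List, Set


def missing_packages(root: str, deps: Dict[str, List[str]], installed: Set[str]) -> List[str]:
    """Walk the dependency graph from root and list packages not yet installed."""
    missing: List[str] = []
    seen: Set[str] = set()
    stack = [root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        if pkg not in installed:
            missing.append(pkg)
        # Keep walking even through installed packages: their own
        # dependencies might still be absent on this machine.
        stack.extend(deps.get(pkg, []))
    return sorted(missing)


# Hypothetical graph: "machine.pkg" is the top-level package for a machine.
deps = {
    "machine.pkg": ["editor", "shell"],
    "editor": ["libui"],
    "shell": ["libc"],
    "libui": ["libc"],
}
installed = {"shell", "libc"}
to_fetch = missing_packages("machine.pkg", deps, installed)
```

With this data `to_fetch` contains `editor`, `libui` and `machine.pkg`; `shell` and `libc` are already present and are skipped.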


> You can assign, at the repo level, sub-repos to include in this one, get an atomic commit hash for the state of the whole thing, and where my VCS client doesn't need to actually download every linked repo, but where tools are available to act like I have.

Mercurial subrepos work a lot like you describe.


With remote depots, perforce has the functionality you want, or nearly so. There used to be a free tier, which AFAICT is gone now.

The google implementation internally supports virtual checkouts like you suggest.


I mean free and open source, of course. "Only exists as an internal Google project" is isomorphic, from my perspective, to "Does not exist".


> "Only exists as an internal Google project" is isomorphic, from my perspective, to "Does not exist".

I mean, it shows that it's possible...


While certainly interesting, the article leaves out the juicy bits monorepos and their workflows offer:

- library developers can easily see the impact their changes will have on their consumers

- library changes that introduce regressions in their consumers can be caught pre-merge given good test coverage

- dependency version updates between packages cause less mayhem because they are performed atomically and only merged when green

At the same time, many drawbacks are also left out:

- the incentive to have long living branches for stability reasons can negate most benefits mentioned above

- build times for compiled languages can become problematic even for moderately sized organizations (I’m looking at you, C++)

- in my experience, you pretty much need a dedicated dev team working on workflow tooling because ready solutions are fragmented and hard to integrate (code review, merge bots, CI/CD, ...)


"- library developers can easily see the impact their changes will have on their consumers"

So, to commit breaking changes you need to file 30 Jira tickets first and wait for a synced commit with 30 branches? Or how do you do it? Do monorepo shops work without releases and stable versions?

If dependent projects (in use) pull your head and there is no synced development I assume you will end up with a complete legacy mess to work around other projects unit tests to get anything done.


If I update a library function to require a new parameter, then in the same commit, I will also change every application which uses that library to supply the new parameter. The promise of the monorepo is "you can pick any random commit from any point in history, and all of those applications and libraries will be compatible with each other".

For major changes that are too big to migrate every user in a single commit - create a new API, do development there, migrate users from the old API to the new API, delete the old API.

(I wonder if you're coming from a world where library developers and application developers don't have permission to touch each other's code? At FB at least, the library developer would be allowed to create a pull request modifying the application code - though they may need a stamp of approval from the application owner before that pull request is merged into master)
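For the "create a new API, migrate users, delete the old API" route, a minimal sketch of the intermediate state could look like this. The function names and the `include_inactive` parameter are hypothetical, picked only to show a new required parameter.

```python
import warnings


def fetch_user_v2(user_id: int, include_inactive: bool) -> dict:
    """New API: callers must say explicitly whether inactive users are allowed."""
    return {"id": user_id, "include_inactive": include_inactive}


def fetch_user(user_id: int) -> dict:
    """Old API, kept only until every caller has migrated; then it gets deleted."""
    warnings.warn(
        "fetch_user is deprecated; use fetch_user_v2",
        DeprecationWarning,
        stacklevel=2,
    )
    # Preserve the old behaviour by supplying the old implicit default.
    return fetch_user_v2(user_id, include_inactive=False)
```

Each caller can then be moved to `fetch_user_v2` in its own small commit, and the commit that removes `fetch_user` only lands once nothing in the repo references it.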


How does that work in code bases where you're not authorized to change everything?


Painfully. File a ticket with another team and pray they don't put it off until next quarter because your team's priorities don't match up with theirs.


The impact of changes across an ecosystem can also be checked in other ways. See for example how Rust checks the whole cargo ecosystem for problems caused by compiler changes.


Sure… monorepos provide solutions to certain problems, but aren’t the only way to solve those problems.


You list three items here:

“- library developers can easily see the impact their changes will have on their consumers

- library changes that introduce regressions in their consumers can be caught pre-merge given good test coverage

- dependency version updates between packages cause less mayhem because they are performed atomically and only merged when green”

But I think these are actually signs of big failure modes of monorepos, and each one has an analog for solving it with polyrepos and/or versioned artifacts that’s actually much safer and more practical.

Firstly, wishing to see how library changes will cause issues in downstream consumers is often a sign of very deep problems, because the library designer should be free to make changes or enhancements as they deem necessary to solve the problems they need to solve, which may require validating new changes don’t introduce unexpected problems for consumers, but, critically, also may not involve that, and in fact the usual case should be that it doesn’t involve that.

Instead, it’s up to consumers to consume versioned artifacts of the library (more on this in a moment), so that consumers are totally in control of “opting in” to new changes. If the consumer doesn’t want to opt in, they should not be forced to (e.g. in monorepos where a successful merge of library X is a silent, de facto upgrade for all consumers required to consume library X from the same commit, etc.)

Instead, the consumer should be the one creating an experimental branch with the intended version upgrade, and running it through the consumer’s validation tests to see if opting to upgrade the version will work.

If it fails, they can open a bug report, and the library maintainer will decide if that regression is now necessary because of a constraint in the new version’s design, or if it can or should be patched.

To be clear, this all should be happening inside a monorepo or outside of one, truly doesn’t matter. The point is to get rid of the horrible practice of forcing de facto software upgrades on one section of the code merely via a successful merge in another section. That idea is terrible in concept, a very misguided thing that shouldn’t be desired.

Instead, when you merge library changes in one part of the monorepo, CI or other tooling should produce a fixed artifact from it, like a Python wheel, jar file, Docker container, shared object binaries, whatever, and automatically upload to an artifact store like artifactory, with versioned identifiers.

The code in some other section of the monorepo then can just happily keep plugging along, not blindsided by new changes, hit by errors uncaught in testing, etc., and the maintainer of that code can plan for their version upgrade according to their own timeline and testing.
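As a toy illustration of that workflow, assume the artifact store is just a mapping from (name, version) to an artifact, and each consumer keeps its own pins. The names `publish`, `resolve`, `libx` and `billing_pins` are invented here; a real store such as Artifactory has its own API, which this does not try to imitate.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for an artifact store.
ArtifactStore = Dict[Tuple[str, str], str]


def publish(store: ArtifactStore, name: str, version: str, artifact: str) -> None:
    """Record a new versioned artifact; published versions are never overwritten."""
    key = (name, version)
    if key in store:
        raise ValueError(f"{name} {version} already published")
    store[key] = artifact


def resolve(store: ArtifactStore, pins: Dict[str, str]) -> Dict[str, str]:
    """Look up exactly the versions a consumer has pinned."""
    return {name: store[(name, version)] for name, version in pins.items()}


store: ArtifactStore = {}
publish(store, "libx", "1.0.0", "libx-1.0.0.whl")

billing_pins = {"libx": "1.0.0"}  # the consumer's own choice
before = resolve(store, billing_pins)

# The library team merges and CI publishes a new version.
publish(store, "libx", "2.0.0", "libx-2.0.0.whl")
after = resolve(store, billing_pins)
```

`before` and `after` are identical: the merge produced a new artifact, but the consumer only moves when it edits its own pin.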

In particular, this is what actually reduces “mayhem” as you put it in dependency upgrades, because the standard of doing it implicitly when there is a green merge is just a form of putting your head in the sand and acting like the library author can trust green merge status as a reliable indicator that downstream consumers can auto-accept new changes, when really only those downstream consumer team devs actually know if that’s true or desirable.


>Firstly, wishing to see how library changes will cause issues in downstream consumers is often a sign of very deep problems, because the library designer should be free to make changes or enhancements as they deem necessary to solve the problems they need to solve, which may require validating new changes don’t introduce unexpected problems for consumers, but, critically, also may not involve that, and in fact the usual case should be that it doesn’t involve that.

One of my pet peeves is when another developer tells me something "should" happen. I know library changes should not cause unexpected problems. The question is does that library change cause unexpected problems. Which is one of the benefits of a monorepo. You can make a library change and with at least some level of backing say that it doesn't break anything.

>Instead, it’s up to consumers to consume versioned artifacts of the library (more on this in a moment), so that consumers are totally in control of “opting in” to new changes. If the consumer doesn’t want to opt in, they should not be forced to (e.g. in monorepos where a successful merge of library X is a silent, de facto upgrade for all consumers required to consume library X from the same commit, etc.)

>Instead, the consumer should be the one creating an experimental branch with the intended version upgrade, and running it through the consumer’s validation tests to see if opting to upgrade the version will work.

The problem is that in an enterprise you will have many things that aren't under active development. So if you rely on consumers to pin a version and then upgrade you might end up with dozens of different versions of a library out there. And then what happens if you find a security issue or something else that forces everyone to upgrade to the latest version? Suddenly you have a bunch of applications that no one has touched for a while that all need to be upgraded.


> “One of my pet peeves is when another developer tells me something "should" happen.”

Exactly, like when someone tells you that you should upgrade your dependency simply because some new code was merged.

> “So if you rely on consumers to pin a version and then upgrade you might end up with dozens of different versions of a library out there.”

The same thing happens in a monorepo, usually with dozens of incompatible legacy bundles of feature flags / toggles.


In theory, sure. In practice, consumers end up pinned on old/outdated versions of the libraries they depend on, because upgrading is hard, and they often lack a specific incentive to stay on top of the most recent version of each library.

The "consumers are responsible for library upgrades" approach also requires every single consumer team to become knowledgeable about changes between library versions, rather than having one specialized team or individual that can stay on top of it and do the work for the whole organization.


I agree with both of your paragraphs, except that I see them as being applicable in practice not in theory, and I think both things you mention are positive things.

Using pinned / outdated versions to preserve legacy behavior is better than the alternatives for preserving old functionality in a monorepo, where it leads to uncontrollable feature flag growth and an inability to delete code.

Likewise, compelling consumer teams to program defensively around all dependencies is a huge positive, not negative.


> Using pinned / outdated versions to preserve legacy behavior is better than the alternatives for preserving old functionality in a monorepo, where it leads to uncontrollable feature flag growth and an inability to delete code.

I wonder if it becomes an incentive to delete code. Keeping old projects up to date with every library change is a continuous maintenance burden, incentivizing their removal. The cost of updating exists either way, of course, but the matter of who pays that cost and when it must be paid would probably affect decision making.


I agree, but a further point is that not all consumers need to or care about upgrading. I worked on a project before that used an 8 year old version of an in-house dependency. It worked perfectly for us, and our use case did not jam up or limit the team that owned that library. It morphed into something completely different after years of development, and the library team would have been hamstrung if they could not have innovated the ways they did on account of needing every merged commit of the code to continue to support our old legacy use case.

You could think of it in terms of Python 3 and Python 2 for example. Imagine if Python was an in-house library at Bank of America. Tons of teams there still happily use Python 2, and plan to keep on using it for a while even after official support is deprecated from libraries, etc.

If it lived in a monorepo with this deployment policy (that everything had to be in sync on every merged commit), then you could never attain implementation of Python 3.

The same idea plays out at all sorts of scales. It’s not even necessarily part of monorepo tools, it just happens to be a misguided way that a lot of organizations set up their monorepos.


Very true. Coming back to OP, long before this point would have been reached, you might have switched to a different process (probably including VCS setup). But as OP states: many of us are not Bank of America.


> Firstly, wishing to see how library changes will cause issues in downstream consumers is often a sign of very deep problems, because the library designer should be free to make changes or enhancements as they deem necessary to solve the problems they need to solve, which may require validating new changes don’t introduce unexpected problems for consumers, but, critically, also may not involve that, and in fact the usual case should be that it doesn’t involve that.

...

> In particular, this is what actually reduces “mayhem” as you put it in dependency upgrades, because the standard of doing it implicitly when there is a green merge is just a form of putting your head in the sand and acting like the library author can trust green merge status as a reliable indicator that downstream consumers can auto-accept new changes, when really only those downstream consumer team devs actually know if that’s true or desirable.

The only reason for a library's existence is to help solve problems for consumers. I would say that if the usual case does not involve worrying about issues for consumers then whether or not you choose polyrepo or monorepo is the least of your issues.


> “The only reason for a library's existence is to help solve problems for consumers.“

I agree and this is the whole line of thinking behind my comment. In order to help consumers with what they need to solve, I might be required to break backward compatibility, delete old code to reduce complexity, refactor/redesign in ways that require consumers to change usage patterns, etc. If I cannot do these necessary things because every change I make is beholden to every regression test for every consumer for every legacy behavior, then it means I am failing to help my consumers, and the library is failing to give them what’s needed.

Instead, if they can isolate their dependence on the library away from possibly breaking changes needed to reach new solutions in the newest version, then the library is still healthy and capable of solving consumer requirements.


> I might be required to break backward compatibility

> ...

> I cannot do these necessary things because every change I make is beholden to every regression test for every consumer

Given that backward-incompatible changes are going to be a fact of life, then at some point, somebody will have to deal with them. We can either write one commit which deals with the library and the consumers at the same time; or we can modify the library in isolation and 3 months later the consumer tries to upgrade and is surprised to find out that the API has changed and the author of that change got hit by a bus. It's the same amount of code either way, but I find small incremental atomic changes are easier to manage than large batches of non-atomic changes.


You’re missing the point, which is that not all consumers care about the upgrade, and some may be fine living with the existing version for months, years or even forever. Holding up library development to force resolving breaking changes for that consumer is a waste of time that just prevents the library from moving on and providing changes that other consumers really do want or need. Meanwhile, forcing every consumer to also continuously pay the cost of upgrading is also a waste of time, because that consumer team (and not the library team) knows what their timetable and prioritization should be for when and how to spend effort resolving breakages when choosing to upgrade at whatever point they want.

I agree that resolving breakages soon is usually smoother and allows adoption of the new version most quickly. It just turns out this isn’t important, and very often consumers are happy to pay a slightly higher friction cost to deal with breakages in an upgrade on their own decided and deliberate timetable, rather than be forced to have changes thrust on them with monorepo-wide implicit upgrades at every successful merge of code from the library.

This is why introducing the extra degree of separation to make dependencies into versioned artifacts stored in an artifact repository is so critical. It allows for setting up whatever automated regression tests you want based on automatically pulling the latest artifacts. You can get all the same information about what consumer breakages there might be, but instead of that acting as a barrier to merging the code (which is a decision that only the library team should be making), it allows consumer teams to file bug reports or backward compatibility requests that can actually be individually prioritized and planned, instead of enforced always all at one single commit / revision despite the effort to do that not really being a priority for the vast majority of consumer teams.

In this sense, it is more atomic. Each set of changes addresses a smaller scope of overall state change of the library. Multiple tiny commits, each addressing different specific bugs or backward compatibility requests in isolation, even if involving coordinated commits in different repos, are more atomic because the scope of the state change is smaller and isolated in each step.


Two points:

1. Consumers that "didn't care" about new versions will end up caring very quickly in the case of bugs in the version that they are using, and now they will have to make fixes to adjust for a lot of breaking changes at once just to get a bugfix (or more likely they will fork the library to get just the fix they want and eventually you end up with 1 consumer because all the other consumers are using forks of old versions).

2. The number of consumers matters. If you have 2 or 3 consumers of the library versus 10s of consumers it makes a big difference for whether or not the extra degree of separation you are talking about makes sense. 3 consumers of an internal library might be "a lot" at a small to medium sized company, while 10s of consumers wouldn't be unheard of at a large company.


The benefits of monorepos (in my experience) are all people/organisation based e.g. it can be easier to enforce standards/processes among 100s-1000s of engineers with a monorepo, or it can be easier to manage/release a very large interdependent codebase/eco-system being worked on/coordinated between dozens of teams.

However, this linked post makes a great point: those benefits are all 'scale' problems which 99.99% of orgs don't have. The corollary is I've seen how hard it is to go from multi-repo -> monorepo when you reach the scale where you would see some benefit.

I also think that the tooling/UX doesn't publicly exist to solve the multi-repo problem with 100s-1000s of engineers working on 100s of repos. It becomes so hard to navigate, understand and grok and so much is buried in dark corners. My experience is that that tooling is less hard to build around monorepos (Google for example).


Interestingly, the article linked here argues that monorepos are good at small scale (e.g., startups) and multi-repos are good at large scale (e.g., Twitter).

The author does not discuss the fact that Google has used a monorepo. (Not sure if it still does.)


It still does. Someone might point out that a few percent of Google's code is not in the monorepo, but most of it is in the same repo.


We are a small 20ish dev company using a monorepo with mostly python. Our tooling is Bazel and Drone.

- Ease of onboarding. Being able to quickly build or test any target is awesome for the new employee.

- Ease of collaboration. I can see all of the code easily and can learn from these patterns. I can also quickly contribute or extend apis and fix all usages without concern for breaking changes.

Our use of Bazel quickly gets us around git scale issues by enabling external dependencies that can be loaded into the workspace without fully vendoring everything.


I would also say that while monorepos are not as necessary for the small, non-unicorn team, they improve the ability to get there by increasing transparency and preparing for a growth rate where the number of engineers doubles every year.


I'm interested in using Bazel at work and would appreciate any input:

1) Was it hard to change projects and workflow to use Bazel (assuming the company didn't use Bazel at first)?

2) Is the dev workflow too different than typical git branch/commit/PR?


1) This is mostly dependent on language and the rules available. In our case python is a mess regarding 2/3 compatibility and external dependencies. There is a roadmap being executed on to fix these issues. Regarding our transition, it was slow at first as we pulled in services and tackled some tech debt. We wrote a handful of custom rules that helped. Bazel does have the ability to import repositories and that can be used as a little bit of a crutch in the process.

2) Exactly the same process, except there is a single PR instead of multiple PRs across multiple repositories.

One of our challenges was integrating one of our public repos into our repo, and that is more of a difference in release patterns (semantic versioning vs nightlies). We use copybara to help with this problem.

- https://github.com/google/copybara


That sounds cool. Could you elaborate on this "Our use of Bazel quickly gets us around git scale issues by enabling external dependencies that can be loaded into the workspace without fully vendoring everything."?


Not GP, but Bazel requires you to explicitly declare your dependencies for each 'target' you want to build. These dependencies can be within the same Bazel workspace, or imported at build time via say git - there's no need to have all the files directly in your repo. The nice thing is you can declare the commit id or file hash for the dependency you're importing to make sure you're getting what you expect, and keep Bazel's reproducibility properties.

Source: one happy Bazel user :)
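The "declare the file hash so you get what you expect" idea can be sketched in a few lines. This is not Bazel's implementation, only an illustration of the check a pinned hash buys you; `verify_pinned` and the example bytes are made up.

```python
import hashlib


def verify_pinned(content: bytes, expected_sha256: str) -> bytes:
    """Return content unchanged if it matches the pinned hash, else fail loudly."""
    actual = hashlib.sha256(content).hexdigest()
    if actual != expected_sha256:
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
    return content


# Stand-in for a downloaded dependency archive.
archive = b"pretend this is a downloaded dependency archive"
# In a real setup this value is written into the build file once,
# when the dependency is added, and reviewed like any other change.
pinned = hashlib.sha256(archive).hexdigest()

verified = verify_pinned(archive, pinned)
```

If the upstream archive ever changes (moved tag, compromised server, silent re-release), the check fails instead of quietly producing a different build.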


The other question is: how stable is Bazel and how easy is it to extend and/or find open source extensions for it?


We haven't had any issues regarding stability. For the most part we extend by creating additional rules. Many of these are supported by the community (search for bazel and rules_* and you should get a bunch of results on GitHub).


I’ve never understood the debate between mono and multi repos. With the right tooling, the line seems to vanish and you just have folders anyway.

Each repo may have their own policies and permissions, which is the biggest reason I see to keep them separate, but again the distinction still seems little more than a folder.

Am I missing something?


In my case, I'm struggling to find good tooling to support monorepos. We're running a microservice architecture, but our CI is triggered by GitHub webhooks. Currently we're either doing a full build of everything on each webhook event or we're doing some error-prone git-diffing to try to make sure we're only rebuilding when necessary. I've looked at tools like Buck and Bazel, but they seem really heavy for our ~30 person engineering team and they also seem to have odd ways of doing things (no support for pulling from a package repository; dependencies are assumed to be vendored, which incidentally the author of this blog post characterizes as a Bad Thing). Folks who are using monorepos successfully: what tooling do you use to solve these problems?


I come back to this tooling issue again and again. It really, really depends on the platform, too.

Some platforms, like C++, require a lot of tooling work. So it's kind of a natural thing to just add on to it. And in fact can speed up the build a bunch when you start tracking change of intermediate objects.

For almost every web service kind of app I've been involved with, build tooling is typically very standard, and supports semantic version locking out of the box (e.g., bundler, Maven/Gradle, etc). Dealing with multiple repos is just easy to manage in a very controlled way.

The places I've been at that use monorepos were C++ shops from long ago. Their tooling grew out of that approach.


make solved dependency resolution for partial rebuilds over 30 years ago. There has been lots of improvement since. I don't doubt you have a real challenge, but the problem comes from some immature tool in your toolchain (which I'd be willing to speculate exists because someone decided only JavaScript tools are interesting after 2010, because reimplementing features of classic tools is more fun than learning old reliable tools), not from an aspect of the monorepo.
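For anyone who hasn't looked at it in a while, the core rule make applies is small enough to sketch. This is a simplification that assumes modification times are already collected in a dict and that every prerequisite exists; the file names are only examples.

```python
from typing import Dict, List


def needs_rebuild(target: str, prerequisites: List[str], mtimes: Dict[str, float]) -> bool:
    """make's basic rule: rebuild if the target is missing or older than any prerequisite."""
    target_time = mtimes.get(target)
    if target_time is None:
        # The target has never been built.
        return True
    # Assumes every prerequisite has an entry in mtimes.
    return any(mtimes[p] > target_time for p in prerequisites)


# Example modification times (arbitrary units).
mtimes = {"app.o": 100.0, "app.c": 90.0, "util.h": 120.0}
```

With these times, `app.o` is stale relative to `util.h` but not relative to `app.c` alone. Note that the rule leans entirely on timestamps from one filesystem where previous outputs are still lying around.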


How does make solve the problem of only building what has changed in the current PR? It seems like this would only work if you have a single CI server and all of your artifacts live in the filesystem on that server, no? Solving that seems sufficiently difficult that saying "make solves the problem" seems disingenuous, but maybe it's easier than it appears to me?


We have been using Bazel in a 20-dev Python org. I think most of us have been happy with the transition: collaboration has improved and onboarding has become much easier. The biggest pain point is the Python part and not having the resources to vendor everything, but the Bazel team is working to fix this, albeit very slowly.

We get around some of these nonhermetic issues with Python and dependencies by using a consistent Docker-based dev environment for everyone. It only takes a few minutes to bazel build/test/run anything in our repo.

EDIT: We run Bazel in Drone as part of our CI/CD system, and this is probably the least optimized part, given that the two do not work that well together. Bazel query allows us to do targeted builds/tests based on the git changes and the dependents of those changes.
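To make "changes and dependents of those changes" concrete, here is a minimal Python sketch of the idea, not this team's setup. It assumes a hand-written dependency graph (`DEPS`, with hypothetical target names) and walks it backwards from the changed targets. Bazel's query language offers `rdeps` for the same kind of reverse-dependency lookup; how the changed files get mapped to targets is left out here.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: target -> targets it depends on.
DEPS = {
    "//services/api": ["//libs/auth", "//libs/db"],
    "//services/worker": ["//libs/db"],
    "//libs/auth": ["//libs/common"],
    "//libs/db": ["//libs/common"],
    "//libs/common": [],
}


def affected_targets(changed, deps):
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the graph: dependency -> targets that depend on it.
    reverse = defaultdict(set)
    for target, target_deps in deps.items():
        for dep in target_deps:
            reverse[dep].add(target)

    # Breadth-first walk over the reverse edges.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

With this graph, a change to `//libs/auth` selects only `//libs/auth` and `//services/api`, while a change to `//libs/common` selects every target.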


Any chance you would be willing to describe your setup in greater detail? I'd really like to know more.


Especially about how you're building Python. Seems like Bazel still doesn't support Python 3 at all.


I'm working to solve the same issue with our CI/CD pipeline.

Have you tried checking for version differences between your existing images and the services in your monorepo as an indicator of which services to rebuild (instead of using a git-difference)?


I'm not sure what you're proposing? Hashing the inputs to the build function and associating that with an image and using that checksum to determine if there's a change? If so, that doesn't seem to solve the problem of reliably identifying inputs to include in the checksum in the first place, which is a big part of the problem. I imagine we could do the build in a container and use the Docker build cache to identify what things need to be rebuilt, but that doesn't seem like it would scale well since afaik there's no good way to share the cache across multiple nodes.
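For reference, the checksum half of that idea is the easy part. Below is a minimal Python sketch that assumes, unrealistically, that a service's inputs are exactly the files under one directory; `inputs_digest` and `needs_rebuild` are hypothetical names. It does nothing to answer the harder question raised above, which is how to know that the directory really contains every input.

```python
import hashlib
from pathlib import Path


def inputs_digest(root, patterns=("**/*",)):
    """Hash the relative paths and contents of the files under root, in sorted order."""
    root = Path(root)
    files = sorted(
        {p for pattern in patterns for p in root.glob(pattern) if p.is_file()}
    )
    digest = hashlib.sha256()
    for path in files:
        # Include the path so that renames change the digest too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rebuild(root, last_built_digest):
    """True when the inputs no longer match the digest stored with the last image."""
    return inputs_digest(root) != last_built_digest
```

The digest could be stored as an image label or tag, so any CI node can compare against it without a shared local cache.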


Agreed. Would like to see an article on tooling, to address common pain points. This article simply says "try a monorepo", and when you hit problems, figure them out. It mentions that Twitter had issues, but no specifics.


>I’ve never understood the debate between mono and multi repos. With the right tooling, the line seems to vanish and you just have folders anyway.

First, that's assuming the "right tooling" exists. Microsoft, Google, and FB have all written custom tooling or custom extensions to tooling (e.g. to version control tools) to fit their monorepos and get acceptable performance. And all three have more resources to throw at their problems compared to a 10, 100 or 400 person company.

Second, this is a little like "C and Haskell are both turing complete, so what's the big deal with adopting either?".


The first point is valid. No one seems interested in releasing such tech publicly.

The second point misunderstands the point being made. It is more like saying you don’t care if your program is running on X86 or ARM because the compiler/interpreter abstracts that detail away from you anyways.


> because the compiler/interpreter abstracts that detail away

This assumes the first point though.


The people who are using the tech aren’t bothered by the first point.


MS, Google, and FB also have much larger monorepos than a 10, 100 or 400 person company (which is a point TFA makes).

Also, if you want to make an atomic commit between two git repositories, it's possible to do it, but you are at some point no longer working with git, you are working with some VCS that is building on top of git.


I agree tooling can resolve much of the difference. I think the tooling for monorepos is currently more advanced, although it tends to be company specific. At GitLab we're working on a feature to group merge requests so people can get closer to an atomic commit across repos. This will help the multi repo users.


Good point! I haven't looked into the tools available to small companies and just assumed open source solutions existed.

As an intern at Microsoft a few years ago I extended an internal tool that provides abstractions such that it can work seamlessly across repos using different version control systems.


The benefits and drawbacks of both approaches are mostly:

-what do existing tools do for you by default

-what does creating artificial boundaries around projects do to your org (this is purely social/psychological).

In both cases there are good and bad things, but given enough tooling and enough process, they do end up exactly the same.

In practice you don't have enough time to write an infinite amount of tooling to achieve this, and the psychological aspect of the repository as a boundary is extremely strong.

(disclosure, I'm personally for having many repos in a medium to large organization, because I think those boundaries are good).


> I’ve never understood the debate between mono and multi repos. With the right tooling, the line seems to vanish [...] Am I missing something?

The space of development styles has high diversity. Tools are opinionated and constraining, with limited coverage of the space. Tools depending on tools exacerbates this. People underestimate the diversity. People have limited awareness of the characteristics of their own style that permit a particular tooling to work. People have even less awareness of how their tooling would degrade with an altered style. People similarly underestimate the cost and limitations of using additional tooling and workarounds to adapt tools to a different style. And so, the "debate".

So git is perfect, doing everything the way you want. And git is a joke toy, badly solving a tiny fraction of what you need from a VCS. And who could think otherwise? The pair of OP blog posts repeatedly illustrate this dynamic.

(In javascript/node land...) For solo personal projects, where extensive custom tooling can easily be a rathole, I generally use a mix of mono- and multirepos. And shift code between them based on tool ecosystem friction.


The arguments against circle around unnecessary coupling.

The arguments for revolve around necessary coupling. If a change triggers a change in multiple repos, it is easier to review those changes for completeness and administer them (commits/rollbacks to the code) in sync if they are in the same repo.

My own jury is still out as to which argument is more valid, I don't have enough experience with mono repos yet.


> If a change triggers a change in multiple repos, it is easier to review those changes for completeness and administer them (commits/rollbacks to the code) in sync if they are in the same repo.

This sounds like what submodules are for. A and B depend on C and C is kept in its own repo, referenced as a submodule by A and B.


When you think of monorepo, think of 1000 daily engineers and upward of 10,000 commits per day in the repo.

Think of running a company the size of google or Facebook off of a single repository.

Of course you can have monorepo at small scales. However, it is at the larger scales where differences become apparent and trade-offs become non-trivial.


We made the switch at less than 10 engineers and would never go back.


That perspective does change things a bit :) All their eggs are literally in one basket!


Even at a 1/5th of that scale the problems become obvious.


With a monorepo you can do an atomic commit over everything. Without you can not.


This is a negative quality of monorepos.


We're pretty small scale (< 20 services, < 10 devs), and happily use a monorepo (recently moved from multiple repos when that became unwieldy as services grew). If you have a lot of services/projects with some shared dependencies they can make tracking that easier. I agree with the article that in general they make life easier.

It depends on what tooling you're using, and whether it is tied to the version control system. Clearly if the tooling makes assumptions about one deployable per repo and works on git hooks that's going to cause pain, but the answer is don't use monorepos if your tooling doesn't support it, or change the tooling so it does.

Most companies won't scale past a few hundred employees, so they're never going to hit any sort of scale issues with monorepos, and if they do, they'll have the resources to deal with it.

Does this have to be a religious war? Does one size fits all really apply here?


> We're pretty small scale (< 20 services, < 10 devs)

This is really important and the whole point, indeed.

Eg: We consider our org to be "many repos" (we have several thousand). However, hundreds of them contain 5, 10, or 20+ packages/projects/services. It's funny because we'll talk about creating "monorepos" (plural) for certain parts of our product, and it confuses the hell out of people.


Several thousand repos - ye gads! How do you document them? How do they fit together? Do they reference each other?


The answers to all those questions are "it depends". There are a few thousand libraries, and those obviously refer to each other. Some have readme files and that's enough, some have full documentation "books", some have comments in the code and that's enough.

We don't mandate a company wide development process, so each team and group can choose their own process and how they track their stuff.

We do have automation and tooling to keep track of things though.


You are talking about creating "oligorepos".


Maybe an "omnirepo"?! (Though it risks connotations with "omnishambles" - https://en.wikipedia.org/wiki/Omnishambles )


How does CI work here? Do you kick off tests and deploys for 20 services when any one changes?


Most CI (and a decent amount of CD) tools I know allow you to specify what to run based on subfolder in a repo.
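As a rough illustration of what such a path filter does, here is a minimal Python sketch. It assumes a hypothetical layout where every service lives in its own folder under `services/`, and the function name is made up. A change to shared code outside those folders is not handled here and would need dependency information.

```python
from pathlib import PurePosixPath


def services_to_run(changed_files, services_root="services"):
    """Return the sorted names of services whose folders contain a changed file."""
    hit = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Expect services/<service-name>/<something>; ignore everything else.
        if len(parts) >= 3 and parts[0] == services_root:
            hit.add(parts[1])
    return sorted(hit)
```

The list of changed files would typically come from something like `git diff --name-only` between the PR branch and its base.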


No, only those that are impacted. Most services are pretty standalone, those that share code we try to put tests for that shared code in with the shared code.

I don't think CI tooling needs to be tied to a given repo, that's just one way to do it.


The right way to write this kind of content is like Digital Ocean did: https://blog.digitalocean.com/cthulhu-organizing-go-code-in-...

Rather than these back and forth about the theoretical implications of a monorepo, actual stories of implementing one are 10x more useful to me.


I think there is a balance. It's always possible to argue that DO would have done better or worse with a polyrepo. However if you ignore evidence, then the theoretical arguments can get silly very quickly.


Another problem is confusing causal effects with confounders. Did a monorepo cause success, or did they succeed in spite of a monorepo? Studying individual cases for which there could never have been a counterfactual outcome (e.g. FB or Google) literally cannot provide evidence of a causal effect.


That's essentially what I was saying. We have evidence that it is possible to be successful with both a monorepo and a polyrepo setup, which is perhaps uninteresting other than to contradict people saying that either setup is guaranteed to be an unmitigated disaster.

We also, however, have subjective feedback from people working on those teams as to how they think it would have been different. It's not rigorous data, but it shouldn't be altogether ignored either, particularly since rigorous data is so hard to come by.


> Developers are not arguing children that need to be confined to separate rooms to prevent fights

Has the author seen the fights that go on? We're extremely opinionated.


The original "Monorepos please don't" article really just convinced me how great monorepos are when you aren't at scale. So you know, put your shit in a monorepo, and then when it gets painful, break it out.


That process will take you 3-7 years, depending on how many resources you throw at it. Can your business survive 3-7 years of #seriouspain?

This article hand-waves over many of the criticisms while ignoring a few cold realities. If you're following an infrastructure as code pattern and/or if you run bare metal, at some point you _WILL_ determine that some things are too sensitive to keep in the monorepo. Here is one of the places where coupling will screw you the hardest.

Your lightweight production deployment repo will have a hard dependency on some nightmarish 35-40GB monorepo. Your collaboration tools like rietveld/gerrit will choke under the load and you will struggle to get big enough servers to maintain it. You'll do things like push to one target and pull from another. You'll deal with all sorts of transient failures trying to push or pull. Your CI/CD platform will start taking an eternity to do anything.

Monorepos absolutely result in coupling and coupling is one of those nasty things that you don't realize how much of a problem it is until you're drowning.

None of the above-mentioned complaints are theoretical. I've lived through them all.


I imported a 20 year old SVN monorepo to git with 100s of thousands of commits and tens of thousands of branches/tags and it was under 10GB. Removing a few large .tgz files that were inadvertently committed brought it down to 5 GB.

Linux has 25Mloc and ~800k commits; I think the pack is on the order of 2GB?

I don't doubt that 40GB nightmarish monorepos exist, I'm just wondering how and why.


Linux is a highly focussed project trying to accomplish a single thing well and with rigorous standards.

If you have 100 developers working average US work schedules and making 5-10 commits per workday (debatable number, depends on culture, but i'm averaging between the "big" commit and lots of small commits), you're going to end up with 100k commits _per year_. And many large startups have a multiplier of that number of developers and they're much, much messier than kernel devs.

Referencing the ideal case as counter-example is a bit silly.


So the nightmare monorepos are caused by unfocused teams trying to accomplish many different things poorly?


Monorepo vs polyrepo summary notes from previous HN discussions: https://github.com/joelparkerhenderson/monorepo_vs_polyrepo

I'm adding notes from this HN discussion today. Feedback welcome.


Merging/integrating code & styles is difficult and error prone. At the end of the day, if two systems interact they will need to be "merged" at some point. It seems to make more sense to handle this in tests/at a source code level than to risk doing it in the runtime environment alone.

I think tooling and granular permissions (still part of tooling) can be blockers, though. It makes less sense outside an enterprise/company perspective such as developing a discrete component that gets pushed to a public repo (Maven, pypi, npm, etc)


What are the actually serious downsides of having a repo for each project again? Serious question. Mercurial supports Subrepositories for example. Just define your rules for pulling stuff.

From my own experience, if you are arguing about whether to use convention A or convention B, the answer should be to have C that allows both, and then configurations on top of C for A and B.

This applies for example to lookups in the database by an index.


> Does the practice of keeping all code together in one place lead to better code sharing? In my experience that's clearly the case.

This is where abstraction comes in. When done correctly abstractions are necessary so that you can separate your work from things you don't want to work on. In my application I want to be able to access and modify files on the local filesystem. I don't care about the differences between opening files in Windows versus Linux or the intricacies of how filesystems work at the bit level. My application evaluates some code and writes some output to a file. I use Node.js to solve for a universal file management API. This is an example of a good abstraction because the separation is clear and explicit.

The simple rule for abstractions is if you can do the very same job at a lower level you don't need the higher level code. In the Node.js example you cannot access the filesystem at a lower level, because no such standard library exists in JavaScript.

Bad abstractions don't provide separation. Many times developers want to use an abstraction to solve for complexity, but inadvertently do the very same things the abstraction is supposedly solving for, just in a different style or syntax. Many JavaScript developers use abstractions to access the DOM or XHR. XHR is simple: assign a handler to the onreadystatechange property, open the connection, and then send the request. You lose huge amounts of performance by abstracting these and dramatically increase your code base, and the separations between the API, the framework performing the abstraction, and the code you are writing are all superficial and self-imposed.

By using and enforcing good abstractions while avoiding bad abstractions you keep your application far more lean and restrict the focus of your development team to the goals of the project. Without that your code isn't a monorepo, it's a dependent library of another repo.


I just have this to say: the discussions here are painfully oriented around SaaS. Once you're doing stuff on-premise or making desktop applications (things requiring long lived release branches), the discussion is totally different.


I don’t see why shipped software is mutually exclusive with monorepos. You can always check out new subdirectories of repos and treat them as a form of branching where eventually directories are merged together in a separate commit.


I find that shipped software actually operates better in a normal, branched monorepo. You just branch the whole thing. The alternative is several repos and using a package manager. That minimizes merging in many branches, as you can just point the package manager at the updated module version, but brings its own hassles.

Either way, as I said, different world from the discussion here, which I'd summarize as "SaaS monorepo vs SaaS polyrepo".


Can you elaborate a bit?


Once you're shipping software off prem you need to patch it between major and minor releases.

Typically one way to do that is to branch when you do a release, to a branch named for the release. Say 1.2. Then when issues pop up you fix them in the branch, then see if the fix applies to the trunk or other branches after that.


Yup. I've worked on projects where there were 6-7 major branches active at the same time and several smaller ones, besides the master branch. Then you'd have to merge everywhere applicable, etc.

Totally different from the Google monorepo approach of "master only", basically. And probably one of the main reasons why Golang is having a ton of difficulties in the outside world by not having a proper versioning story.


I think I can summarize my thinking on this pretty succinctly: I want to build a product, not tooling for software development, and I certainly don't want to spend any time trying to keep different repos synchronized, etc.


It turns out that effectively separating dependencies is a huge part of building a product.

This is like saying you really want to swim but you don’t want to get wet.


I'm part of a company that went from a boring VCS strategy to jumping on the monorepo bandwagon against my advice to keep our git usage simple. It's been fairly terrible - merge conflicts, code going to the wrong environment, nobody can actually do a hot patch, and even long running feature branches, which should be stupidly simple, run into immense problems.

It also caused issues with our npm repo solution, and has created the worst case of dependency lock we've ever had.

Do yourself a favor and say no to monorepos. It is massive complexity for no benefit.


1. It's really hard to tell if any of the people writing blog posts about these things have ever experienced the larger scale monorepos or not for any length of time.

As best I can tell, the answer is "no", and they are mostly writing based on perception. They don't appear to even do things like "try to talk to people who have experienced the good and bad of it".

While the writing is fun, it makes it a lot less useful in both directions, IMHO.

2. The author is right that planning more than 6 months for smaller scale companies makes no sense. However, both of these authors seem to fundamentally miss the actual problem in large companies, which they assume is around engineering and scaling large systems. In fact, it is not. The underlying issue is that engineering a thing is no longer your main cost. This is one of many reasons larger teams/companies are fundamentally different (as this author does correctly point out).

There are 2080 work hours in a year.

If I have 8000 developers, and I have to spend an hour teaching them a new thing, I just spent ~4 people for a year.

If I spend a day teaching them something new, I just spent ~31 people for a year.

If I spend a work week teaching them something new, I just spent ~154 people for a year.
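The figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic; it assumes an 8-hour day and a 40-hour week (2080 hours is 52 weeks of 40 hours), and `person_years` is a name chosen for illustration.

```python
WORK_HOURS_PER_YEAR = 2080  # 52 weeks * 40 hours, the figure quoted above


def person_years(developers, hours_each):
    """Total training time across all developers, expressed in person-years."""
    return developers * hours_each / WORK_HOURS_PER_YEAR


# Assumed durations: 1 hour, an 8-hour day, a 40-hour week.
for label, hours in [("hour", 1), ("day", 8), ("week", 40)]:
    print(f"one {label}: {person_years(8000, hours):.1f} person-years")
```

Rounded to whole people, the three results match the ~4, ~31 and ~154 quoted above.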

That's just the basic learning costs, it doesn't include migration costs for code base or anything else[1].

But these costs certainly dominate the cost to engineer a solution as you get larger - the systems being talked about here (which have scaled engineering wise) are not 50 people a year (i work next to them :P). Not even close.

In some sense, talking about the engineering challenges makes no sense - they basically don't matter to the overall cost at large scale.

These same things apply to most of the broader (in the sense of who it touches) pieces of developer infrastructure like programming languages, etc.

As you can also imagine, you can't stand still, and so you will pay costs here, and need to be able to amortize these costs over the longer term. In turn, this means you have to plan much longer term, because you want to pay these costs over a 5-10 year scale, not a 1 year scale.

[1] It also excludes the net benefits, but you still pay the costs in actual time even if you get the benefits in actual time as well :)

Also, productivity benefits from new developer infrastructure are wildly overestimated in practice. Studies I've seen basically show people perceive a lot of benefit that either doesn't pan out or doesn't translate into real time saved. So at best you may get happiness, which while great, doesn't pay down your cost here ;)


> 1. It's really hard to tell if any of the people writing blog posts about these things have ever experienced the larger scale monorepos or not for any length of time.

Isn't one of the points of this article that it's not necessarily important for your decision about repo structure how it feels like at "larger scale" (for sufficiently large larger scale) if that's not where you are, and that "Google does X" or "Twitter had problem Y" isn't necessarily relevant for others not hitting their scale? Is it still important to have experience with a scale you're never gonna hit to make a decision about it? Or should the label "monorepo" not be applied to smaller entities, since the term (as far as I know) originates with the large ones?


What is the difference in cost between an 8000-person org all spending a week learning something new, and 800 10-person orgs all spending a week learning something new? It's the same fraction of available time.

Large orgs have more resources (revenue) and more costs. What matters is whether revenue:cost ratio is superlinear or sublinear in org size.


I'm not sure of your first point. The organizations i'm talking about are clearly agglomerations of teams anyway. Nobody has a flat 8000 person team structure i'm aware of :)

Also, the fraction is not the problem, it's certainly the same. But the absolute scale of the number matters at some point.

0.001 people a year of lost dev time is not likely to change what your company could accomplish.

200 is.

(I think you don't disagree, but i can't really tell, so if you do, let me know and happy to argue about it further :P)


If you don't do things in the way that is reasonably popular outside of your organization, you have to train every new hire (should you find people willing to work using unpopular technology). You also risk losing talent because they don't want to get locked into your way of doing things with nothing useful on their resume should you fire them.


This is also true, but appears to mostly be made up for with money or other factors. I say this because pretty much all large enterprisey companies (in tech or not) use unpopular or ridiculous technology stacks in various places (I expect you can collect thousands of stories from HN easily), but their attrition rates are not particularly high.

I could believe it affects the bay area and other areas of "high amounts of choice" more, but ....

So while I believe what you say, and I even believe it impacts recruiting/attrition more than it used to, it still doesn't seem like a dominant factor yet.


This can't really be discussed without also clarifying how your target language and tools ecosystem does package management. i.e. are you expecting to generate and version internal packages and then consume them from a feed?

It seems like, if you don't have this facility then the monorepo becomes more compelling.

Not to mention, are you building 1 or 2 apps, or a whole host of microservices.

Without knowing that, your experience of mono vs. multi-repos won't be much use.


I have a simple question: are monorepos possible in git? What is the upper limit of contributions per day for git to be effective in a monorepo?


Absolutely. Most monorepos are git.

The entire commit history of Linux kernel development exists as a single git repository which you can check out here: https://github.com/torvalds/linux

(Indeed, git was developed for this very purpose.)

And it's very doubtful you're going to become larger than Linux. If your company becomes so large that git can't handle all of your code then quite frankly you've succeeded beyond your wildest dreams. Worrying about Google scale for your code repo when you're a startup is the ultimate counting chickens before they hatch.


Am I reading correctly that Linux only has <300 commits per day? (<100K per year, 800K in history)

https://github.com/torvalds/linux/commits/master?after=3bd6e...

That's a lot for a code base, but not a lot for a configuration base that has commits automatically generated by various tools.

Anyway, Linus advertised that git could apply 3 commits/sec in 2005: https://git.wiki.kernel.org/index.php/GitBenchmarks


I really wouldn't call the Linux kernel a monorepo. Not compared to Microsoft keeping all of Windows in a single repo.

The problems start when you can't clone the repo within an hour or so, because git doesn't allow partial clones like SVN did.


Which is why Microsoft built GitVFS and has been contributing all sorts of interesting performance work to upstream git. Microsoft's devops work with Windows and the git transition is full of interesting blog posts and stories and metrics.


All of the Linux source code is in a single repo. How is that not a monorepo? How are you defining monorepo if not this?

The entire Windows codebase is also a monorepo. It just happens to be a bigger one.


I'd define a monorepo as multiple loosely (or un-) coupled projects in the same repository. The Linux kernel is a single strongly coupled project.

Nevertheless, whether or not you should keep your code in a single repository is more a question of the sheer size of the repository than of whether the code logically belongs in the same place.


Makes sense.

But yeah, even the total code of all projects combined at a startup isn't likely to get anywhere close to the scale of the Linux repo, so it's at least a good example of how far you can get with a single large git repo (regardless of the relatedness of the contents therein).


> Looking at it from the other side, could introducing strict borders somehow make it easier to reuse logic? I think it's clear that borders can only take away from your ability to perceive opportunities to use abstractions or to unify code.

This is not at all clear. I'd argue that a visible organization of your codebase into repositories makes it easier to reuse code in the same way that interface/implementation splits do: it makes it clearer which parts felt domain-specific and which felt like reusable libraries.

> The bottom line is that you should pick the right abstraction and the right place for a function or class based on the individual merits of the case - and not driven by facts about repos created a long time ago.

This seems to be assuming that repository boundaries are defined in the beginning and fixed for all time - the same mistake I see opponents of static typing making. Your repository structure reflects your logic and business structure; as those change you change your code structure to match.

> True, touching multiple subprojects in a single commit is not always desirable. For example, updating backend and frontend components incrementally in backward-compatible ways can be the better approach. But even so, it's useful to retain the option of cross-boundary commits for many reasons including simplicity and enforced coordination.

This needs to be justified. In much of programming we consider the benefits of strict isolation to outweigh the costs - e.g. private fields in OO languages, true parametric polymorphism, microservices. You can't just assert that having the option of bypassing the good practice is worthwhile.

> If you think about it, splitting a codebase into sub-repos is a ham-fisted way to enforce ownership boundaries. Developers are not arguing children that need to be confined to separate rooms to prevent fights. With sufficient communication and good practices, a monorepo will allow you to avoid the question “which repo does this piece of code belong to?” Instead of thinking about repo boundaries - effectively a distraction - a monorepo allows you to focus on the important question: where should we draw the boundaries between modules to keep the code maintainable, understandable and malleable in the light of changing requirements?

Communication and good practice are the most costly way to enforce important things; you could equally well argue that e.g. unit tests are a ham-fisted way to enforce non-breaking of code and developers are not arguing children that need to be reminded not to break each other's functionality.

Repo boundaries are higher-level than directory boundaries. No-one is arguing for having each directory in its own repo, but being able to represent "not directly involved, but versioned together" and "separate enough to be versioned separately" is a very valuable distinction to have in your toolbox.

> Many of us, especially in the world of startups, work in smaller teams - let's say less than 100 developers.

Do you find it practical to communicate and co-ordinate with 100 other developers before making any changes? Because that's the only case where a single repo makes sense - when you are working closely enough with every other developer sharing the repository that you don't need to go to any extra effort to organize who is changing what.

Once you're not attending the same standup, you shouldn't be working on the same repository. You need to have a release cycle with semVer etc. so that people who aren't in close communication with you can understand the impact of changes to your code area. Since tags are repository-global, the repository should be the unit of versioning/releasing.


OP here - interesting perspective, especially the parallel drawn to dynamic/static types. Thanks for the thoughtful post


Glad you wrote this and saved me the effort, pretty much what I was thinking of jotting down



