On the other hand if you've got a whole bunch of components in different repos which need to release together it suddenly becomes a real pain.
If you've got components that will never need to release together, then of course you can stick them in different repositories. But if you do this and you want to share common code between the repositories then you will need to manage that code with some sort of robust versioning system, and robust versioning systems are hard. Only do something like that when the value is high enough to justify the overhead. If you're in a startup, chances are very good that the value is not high enough.
As a final observation, you can split big repositories into smaller ones quite easily (in Git anyway) but sticking small repositories together into a bigger one is a lot harder. So start out with a monorepo and only split smaller repositories out when it's clear that it really makes sense.
First of all this is normal, because otherwise the development doesn’t scale.
In such a case the monorepo starts to suck. And that’s the problem with your philosophy ... it matters less how the components connect, it matters more who is working on it.
Truth of the matter is that the monorepo encourages shortcuts. You’d think that the monorepo saves you from incompatibilities, but it does so at the expense of tight coupling.
In my experience people miss the forest for the trees here. If breaking compatibility between components is the problem, one obvious solution is to no longer break compatibility.
And another issue is one of responsibility. Having different teams working on different components in different repos will lead to an interesting effect ... nobody wants to own more than they have to, so teams will defend their components against unneeded complexity.
And no, you cannot split a monorepo into a polyrepo easily. Been there, done that. The reason is that working in a monorepo versus multiple repos influences the architecture quite a lot and the monorepo leads to very unclear boundaries.
released "together" == part of the same feature. Timelines, release process and team priorities are all there to help to deliver features. If they stand in the way, they need to be adjusted. Not the other way around.
Multi repos encourage silos. Silos encourage focusing on the goals of the silo and discourage poking around the bigger picture. Couple that with scrum, which conveniently substitutes real progress metrics with meaningless points, and soon enough you end up with an IT department that is heavy on process but light on delivering value.
I think you are conflating a monorepo (where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used) with a "monoproject"/"monomodule", where there is no modularization at all.
Edit: expanded wording
> where boundaries can still be established, e.g. via a module isolation mechanism specific to the stack used
Unfortunately this isn't a technical issue and that's the problem.
In my opinion monorepos make refactoring dependent projects much easier. However it is much harder to establish and enforce clear boundaries...
In my experience it's hard to establish clear boundaries, regardless of repository kind. It may be more difficult to create features which are tightly coupled across multiple repositories, but people do it regularly. And when they do, you suddenly have to manage and maintain synced features across multiple repositories.
In fact, the repo tool for the Android project makes it quite easy to develop features across repositories, thus lowering the boundaries significantly.
When I change something in the library, I can easily run tests across all the projects that depend on it, with the latest changes of the library, and make sure that my change is not breaking things all over the place, which is also pretty nice.
I am not sure yet if using a monorepo is actually the best way to deal with this kind of project, but for now it feels better than having them in separate repos and then having to deal with the complexity of sharing the library across repos by publishing it somewhere or using git submodules or something.
So when someone only wants a submodule they can happily clone only that, but when someone wants everything (which is the default case), they can clone and install it all at once.
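For example (remote URLs and repo names here are made up), the two paths look roughly like:

    # Just the one component (it's its own repo):
    git clone git@example.com:org/validation-lib.git

    # Everything at once, submodules included:
    git clone --recurse-submodules git@example.com:org/everything.git

    # Or populate the submodules later in an existing clone:
    git submodule update --init --recursive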
Downside is that I have to commit twice
... and so nobody really understands how all of the components tie together and as a result it takes weeks of manual testing to release.
Requiring multiple PRs to multiple repos to roll out one user-facing feature is fine, as long as your independent modules/projects are not actually interdependent (i.e. one of those PRs will not break another independent repo that lacks a corresponding PR).
But in that case you could consider the change to the dependency a single release. And ingesting it into another app a separate release.
Context switching really sucks. You should aim to reasonably avoid it
Who is the email being sent to?
What is the content of the email?
What data does the email content and recipient depend on?
What are you tracking on the email?
How is the email visually formatted?
All those things might be in different apps as the logic gets more complicated.
I think the difference of opinion is between developers who work on self-hosted "evergreen" products where the latest version is deployed, and others who work with multiple release branches with fixes/features constantly being cherry-picked.
The idealistic discussions for or against monorepos often overlook the most important detail: who's working on the code and how would you want them to version control it?
If it's separate projects with their own versioning then it makes sense to have them as separate repositories. If it's a single project but with individual components you'd want to version (eg because it's developed by different teams with different release timelines) then there you also have a situation where you'd want to version the code separately so once again there is a strong argument for separate repositories. However if it's one product with a single release schedule then splitting up the frontend from the backend can often be a completely unnecessary step if you're doing it purely for arbitrary reasons such as the languages being different. (I mean Git certainly doesn't care about that. A project might have Bash scripts, systemd service files, Python bootstrapping, code for an AOT compiled language (eg Rust, Go, C++, etc), YAML for Concourse, etc. They're all just text files required for testing and compiling so you wouldn't split all of those into dozens of separate repos).
What if there is one team, but different developers (one working on the frontend, another on the back)? What if QA can test the API while the frontend development is ongoing?
What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines).
> What if there is one team, but different developers (one working on the frontend, another on the back)?
Then presumably everyone in that team is full stack (otherwise it would be different teams in the same department), so it still makes sense to have a monorepo because you could have a situation (holiday, sickness) where someone would be working on both the front end and back end. Thankfully git is a distributed version control solution and supports feature branches, so you can still have multiple people working on the same repo and then merge back into a development branch.
> What if QA can test the API while the frontend development is ongoing?
Testing isn’t the same as released versions. You can (and should) test code at all stages of development regardless of team structure, git repo structure, or release cycles.
> What if the front and backends have different toolchains, and ultimately separate execution environments (server app backend vs JS running on client machines).
I’d already covered that point when talking about different languages in the same repo. You’re making a distinction about something that version control doesn’t care in the slightest about.
I think it’s fair to say any significant cross-project tooling should be its own repo (you wouldn’t include the web browser or JVM with your frontend and backend repos). But if it’s just bootstrapping code that is used specifically by that project then of course you’d want that included. Eg you wouldn’t have Makefiles separate from C++ code. But you wouldn’t include GCC with it because that’s a separate project in itself.
Ultimately though, there is no right answer. It’s just what works best for the release schedule of a product and teams who have to work upon that project.
In my experience, that is almost never the case. Often, the frontend requires a new endpoint or a modification to an existing endpoint. If you don't coordinate this change, you end up with a non-functional PR that cannot even be tested. Same happens when the backend proposes an endpoint change that affects the frontend.
We have moved the frontend and backend to the same repo to make coordination and testing of such cases simpler.
* Any non-backwards compatible change in the interface between the components. Yes this can be solved. But when working in a smaller team on proprietary software, why spend time solving a problem you don't need to solve?
(This is from experience.)
Unless they're running on the same computer and deploy literally simultaneously, this is already a problem you need to solve.
I expect lots of people on HN are working on systems with very tight coupling between client/GUI and server and no proper versioning between them, as is common in web applications. Hence the replies to the contrary: you're probably from quite different worlds :)
(Now, I personally think that maintaining sound versioning practices is a good idea even if you do have tightly coupled control of both the client and the server side. But that may just be me...)
In the end, I think the other comments are right that it mostly depends on who's working on something. If it's different teams, then different repos probably make sense. But if I'm responsible for both the back-end and the front-end, they're usually not isolated at all, at least in terms of project requirements, and hence keeping them together makes sense.
(But of course, even then there are nuances. I think the article is mostly arguing against monorepos as in company-wide monorepos. I'm willing to believe Googlers that it works well for Google, and I'm not in a position to claim what it'd be like for other companies. Team-wide monorepos for different parts of the same project, however, make a lot of sense to me.)
It doesn't. In my entire career, that was the only environment in which some random would break us and we couldn't do anything about it other than hope for a rollback and then wait for hours for the retest queue to clear before we could deploy anything at all.
Maybe not all the time, but you need the escape hatch of pinning healthy deps, because HEAD of everything is not guaranteed to work.
If it's a monorepo your PR might be a 2 line patch to that function, then adding the GUI and server code.
If you split it you'll first need to have a PR on the "validation-lib" repo, then once that gets in a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need to deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.
Better just to have them in a monorepo if they're logically the same code and want to share various components.
> If you split it you'll first need to have a PR on the "validation-lib" repo, then once that gets in a PR on the "server" repo, bumping the "validation-lib" version dependency, and finally a PR on the "gui" repo bumping the dependency for both "validation-lib" and "server" (for testing etc.). That's before you need to deal with the circular dependency that "server" also wants "gui" for its own "I changed my server code, does the GUI work?" testing.
The above is exactly why I am so firmly opposed to multirepo-first. And it's really just a throwaway example: a real change would involve multiple different library and executable repos, all having separate PRs. And then there's the relatively high risk of getting a circular incompatibility.
This can be worth the cost, for organisational reasons. But until you need it, don't do it. It's very easy to split a git repo into multiple repos, each retaining its history (using git filter-branch). Don't incur the pain until you need to, because honestly, you're not likely to need to. You're probably not going to grow to the size of Google. Heck, most of Google runs in one monorepo, with a few other repos on the side: if they can make it work at their scale, so can you. And if, as the odds are, you never grow to their size, then you'll never have wasted time engineering a successful multirepo system instead of delivering features to your business & customers.
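As a rough sketch of such a split (paths and remotes are made up, and newer Git versions recommend git-filter-repo for the same job):

    # Work in a throwaway clone; filter-branch rewrites history in place.
    git clone git@example.com:org/monorepo.git validation-lib-split
    cd validation-lib-split
    git filter-branch --subdirectory-filter libs/validation-lib -- --all

    # The clone now contains only that directory, with its history,
    # ready to push to a new remote.
    git remote set-url origin git@example.com:org/validation-lib.git
    git push origin --all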
0: 'polyrepo,' really? https://trends.google.com/trends/explore?date=all&q=multirep... clearly shows that 'multirepo' is the term.
A good way to fulfil those requirements is to have the exact same function available in both places.
I'll need a few more validation functions for each client. I don't want to write+maintain multiple functions that do the same thing, even if it's just copy+paste.
It's "data" validation. So let's put that in the "data layer" repo.
We now have, at least:
- Web (GUI)
- More clients?
We'll also have branches for each development task. How do we know what branch the other branches should use? One "simple" feature can easily spread over multiple repos. Does each repo refer to the repo+branch it depends upon (don't forget to update the references when we merge!), or we add a "build" repo which acts as the orchestrator?
Most PRs will need to be daisy chained - who reviews each one? Will they get committed at the same time?
How do we make the builds reproducible? commit hashes? tags? ok, we now need to tag each repo, and update the references to point to that tag/hash... but that changes the build.
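Even the crude version of that bookkeeping is real work — a minimal sketch, assuming sibling checkouts with made-up repo names:

    # Record the exact commit of every repo that went into a build:
    for r in gui server validation-lib data-layer; do
        ( cd "$r" && printf '%s %s\n' "$r" "$(git rev-parse HEAD)" )
    done > build.lock

    # Reproduce that build later by checking each repo out at its pin:
    while read -r repo sha; do
        ( cd "$repo" && git checkout "$sha" )
    done < build.lock

And someone has to keep that lockfile honest across every branch and every daisy-chained PR.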
Well, I'm glad our code base is split over multiple repos because "scalability".
In any case, if you're nitpicking that example you're missing the point. The same would go for any other shared code you could imagine between a client and server that logically make up one program talking over a network.
If you had a `foo()` function shared between the GUI and the server (or two services on your backend, or whatever), in a monorepo your workflow is:
- Update foo()
- Merge to master
With separate repos, your workflow is:

- Update foo()
- Merge to shared library master
- Publish shared library
- Rev the version of shared library on the client
- Merge to master
- Deploy client
- Rev the version of shared library on the server
- Merge to master
- Deploy server
I recently dealt with an incredibly minor bug (1 quick code change), that still required 14 separate PRs to get fully out to production in order to cover all of our dependencies. That's a lot of busywork to contend with.
- Merge to master
- Publish shared library
So as you can see the only step added was to publish the shared library that would automatically update the version in all the projects using it.
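To be concrete, the automation I have in mind looks roughly like this (library and repo names are hypothetical, and the bump step could just as well be a bot such as Dependabot):

    # Publish the shared library:
    npm version patch && npm publish

    # Then bump the pin in every consuming repo:
    for repo in gui server; do
        git clone "git@example.com:org/$repo.git" && cd "$repo"
        npm install shared-validation@latest --save-exact
        git checkout -b bump-shared-validation
        git commit -am "Bump shared-validation" && git push -u origin HEAD
        cd ..
    done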
If you are really doing everything manually I can understand that this is a pain, but this has nothing to do with the monorepo / multiple repo distinction, this is a tooling problem.
What if updating foo() breaks something in one of the clients (say, due to reliance on something not specified)? Then you didn't catch that issue by running the client's tests, now the client is broken, and they don't necessarily know why. They know the most recent version of the shared library broke them, but then they have to say "you broke me", or one of the teams needs to investigate and possibly bisect across all the changes in the version bump, under their tests, to find the breakage.
How is that handled?
(the broader point here is that monorepo or multirepo is an interface, not an implementation, its all a tooling problem. There are features you want your repo to support. Do you invest that tooling in scaling a single repo or in coordinating multiple ones? Maybe I should write that blog post).
Not really. You can have a single repo with top level directories tigershark-gui and tigershark-server.
Smart people can work through problems to get the job done. Monorepo vs polyrepo won't stop people from moving forward.
If you only need to do this once, subtree will do the job, even retaining all your history if you want.
I'm not sure what the easier way to split big repos is.
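A minimal version of the subtree route (prefix and remote are made up):

    # Extract libs/validation-lib and its history onto its own branch:
    git subtree split --prefix=libs/validation-lib -b validation-lib-only

    # Push that branch as the starting point of a brand-new repo:
    git push git@example.com:org/validation-lib.git validation-lib-only:master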
In practice, I can tell you from first-hand experience that this isn't all that simple in bigger, organically grown cases (you'll have many other things to consider if you want to keep the history in a useful way). Especially the broken branching model of SVN and co. is a problem here: in the wild, it immediately leads to "copy&paste branching" (usually through multiple commits). Migrating that to Git or Hg and splitting it up can be a challenge.
But a monorepo leads to tight coupling, and that is just as much of a pain as versioning. Or two teams end up simultaneously working on the same shared code, and you have not only merge conflicts but conflicting functionality.
That said, some packaging solutions can bridge the gap reasonably well. Unless you need instantaneous, atomic releases.
Funnily, I only use Code to handle commits to submodules, because Git Lens is not available for the full VS IDE.
I agree that you should always start with one repo and split as needed, it's the MVR way (minimum viable repository)
Also I definitely miss the ability to make changes to fundamental (internal) libraries used by every project. It's too much hassle to track down all the uses of a particular function, so I end up putting that change elsewhere, which means someone else will do it a little different in their corner of the world, which utterly confuses the first person who's unlucky enough to work in both code bases (at the same time, or after moving teams).
An average change touches 4 of them, and touching one of them triggers, on average, releases of 2 or 3 of them. Even building these locally is super tedious, because we don't have any automation in place (nor any formal plan for it) for chain-building these locally.
This is a nightmare scenario for myself. A simple change can require 4 pull requests and reviews, half a day to test and a couple hours to release.
Yet my team keeps identifying small pieces that can be conceptually separated from the rest of the functionality, even if they are heavily coupled, and makes new repos for these!
Once you start having lots of peer repos being worked on within the same organisation on a daily basis, you know that you’ve partitioned far too far, and you need to roll back.
Otherwise one ends up in exactly the position you’re in. The ultimate slippery-slope end-state would be hilariously bad: a repo for each ASCII character, with repos for each word or symbol constructed out of those characters, with repos for each function constructed out of those words & symbols, with repos for each module constructed out of those functions, with repos for each system constructed out of those modules, with any change requiring a massive, intricate, failure-prone dance in order to update anything, all while patting oneself on the back about how one has avoided complexity.
Noöne sane would argue for that situation, and yet I’ve seen smart people argue that requiring coördinated changes to half a dozen repos is fine & dandy.
So don't use polyrepos for heavily coupled projects, then. Or even better...
... try to avoid heavy coupling in the first place.
Q: Why are we debating the merits of mono-repos over poly-repos?
A: Because managing dependencies is really hard and needs expertise.
In the polyrepo case those boundaries have to be made explicit (otherwise no one gets anything done) and those owners should be easily visible. You may not like the friction they sometimes bring to the table, but at least it won't be a surprise.
I think it's more common to merge or split modules and classes than repositories.
I wonder if there'd be less tension if repos and teams were 1:1 though.
Anecdotally, yes I think it helps a lot. I was once part of an organization for which each "team" having a repo is the only thing that prevented violence :-)
Partially answering my own question: SVN, recommended in a prior comment, supports path-based authorization. But what about teams using another version control system?
We use service owners, so when a change spans multiple services, they are all added automatically as blocking reviewers.
So (mono)repos are composable.
Git has the idea of sub-modules, but they're really just filters. (They're in the same repo). So ultimately, you don't have that kind of control.
You can just click the import [project] name and it will switch to the repo.
I once had to wait 9 months to get a complex change through in a monorepo setting because of all the people involved, the amount of stuff it touched and the fact that everything was constantly in flux, so I spent half my time tracking changes. I'm not saying it would have been faster in a polyrepo. I'm saying that complex changes are complex regardless of how the source is organized.
I do however think that polyrepos forces you to be more disciplined and that it is easier to slip up in a polyrepo and turn a blind eye to tighter couplings.
This is a hard and complex problem. Especially how to make code-review not too messy if you target 5-8 repos at once.
So you made a commit. What artifacts change as a result? What do you need to rebuild, retest, and redeploy? It doesn’t take a large amount of scale to make rebuilding and retesting everything impossible. In a poly repo world, the repository is generally the unit of building and deployment. In monorepo it gets more messy.
For instance, one perceived benefit of a monorepo is it removes the need for explicit versioning between libraries and the code that uses them, since they’re all versioned together.
But now, if someone changes the library, you need to have a way to find all of its usages, and retest those to make sure the change didn't break their use. So there's a dependency tree of components somewhere that needs to be established, but now it's not explicit, and no one is given the option to pin to a particular version if they can't/won't update. This is the world of Google & it influenced the (lack of) dependency management in Go.
You could very well publish everything independently, using semver, and put build descriptors inside each project subdirectory, but then, congratulations, you just invented the polyrepo, or an approximation thereof.
If you're using Git, then typically for each push to the remote repository you get a notification with this data in it:
BRANCH # the remote branch getting updated
OLD_COMMIT # the commit the branch ref was pointing to before the push
NEW_COMMIT # the commit the branch ref was pointing to after the push
# To get the list of files that changed in the push:
git diff --name-only "$OLD_COMMIT" "$NEW_COMMIT"
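From there, a rough sketch of rebuilding only what changed (component directory names are hypothetical):

    changed=$(git diff --name-only "$OLD_COMMIT" "$NEW_COMMIT")
    for component in gui server validation-lib; do
        if echo "$changed" | grep -q "^$component/"; then
            make -C "$component" test   # or kick off that component's CI job
        fi
    done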
Also building/testing is far more effective at finding dependencies than just going by repo structure. There are numerous package managers available to solve versioning if you need separate components.
We had to migrate a polyrepo to a monorepo and it was not fun because it was a migration that should have never had to be done in the first place
s/polyrepo/monorepo/ in the above and you have an assertion of about equal plausibility and weight.
It's not hard to explain: Scale has been fetishized by the industry/trade. Everyone wants the cachet of working at scale. 1.5 GB of CSV text? That's Big Data, let's break out map-reduce. 1 load balancer and not enough servers to fill half a rack? That's a scalable architecture, we could scale to multiple datacenters at some point in the future, so let's design it now.
Deploying oversized solutions is partly due to outsiders jonesing for the scale of Google, Fb and gang, partly resume-stuffing ("I have worked with this tech before"), and lastly FANG diasporans who miss the tech they used and rewrite systems/evangelize the effectiveness of those solutions to much smaller organizations.
This isn't isolated to our industry, of course: a constant refrain is that generals & admirals fight the last war; the financial industry is rife with products which are secure against the last recession, and so forth.
This point cannot be stressed enough. Almost all the worst software engineering failures I have seen have been caused by premature scaling - which is way worse than premature optimization because the latter's effects are usually local. But premature scaling causes architectural decisions that affect the whole project and simply cannot be undone.
One example among many: some influential engineers insisted that we needed four application servers with failover because they had experienced servers crashing under heavy load. This complicated failover setup took a huge amount of time and resources to set up, delaying the project by months. In the end it only attracted a few hundred visitors per day and was cancelled in under a year.
Hmm - failover shouldn't be that hard to set up. If it was then that suggests that other issues (technical debt, inexperienced management) were the more likely culprits.
Not the simple fact that they chose not to ignore the need for failover.
Now where have I heard those words before... :)
A much higher percentage of developers will. Number of companies is not a good metric for whether a topic is worthy of discussion.
I think a quick perusal of this page will show that it's not really "solved" after all. A far higher percentage of developers continue to be affected by large-repo issues than a Python-specific issue (currently #1 story on the front page) or anything to do with Ethereum (currently #7). Are those "horseshit" topics too?
> A far higher percentage of developers continue to be affected by large-repo issues than...
Are you suggesting that the results of the HN ranking algorithm at this very moment in time is a good metric of measuring what affects developers? I don't agree, and besides @yowlingcat's opinion that the article is "horseshit" is unrelated to how well its ranked on HN.
When the opinion is not just disagreement but outright dismissal of the topic as worth discussing, I'd say ranking is relevant. So is comment count. Clearly a lot of people do believe it's worth discussion, not irrelevant or a foregone conclusion as yowlingcat tried to imply.
What you call large-repo issues I call organization issues. From your other comments, it's clear that we draw the lines at different places, but I think I'm right and you're wrong in this case because I've seen engineers try to solve organizational issues with technology enough times that it's a presumable anti-pattern. Why don't we take your own words at face value?
"That hasn't been my experience. Yes, it's a culture thing rather than a technology thing, but with a monorepo the "core" or "foundation" or "developer experience" teams tend to act like they're the owners of all the code and everyone else is just visiting. With multiple repos that's reversed. Each repo has its owner, and the broad-mandate teams are at least aware of their visitor status. That cultural difference has practical consequences, which IMO favor separate repos. The busybodies and style pedants can go jump in a lava lake."
Why are there busybodies and style pedants working in your organization? Because your organization has an issue. Do you think that would be at the root of this pain, or a tool choice? I'll give you a hint, it's not the tool choice.
Because to an extent they serve a useful purpose. In a truly large development organization - thousands of developers working on many millions of lines of code - fragmentation across languages, libraries, tools, and versions of everything does start to become a real problem with real costs. You do need someone to weed the garden, to work toward counteracting that natural proliferation. That improves reuse, economies of scale, smoothness of interactions between teams, ease of people moving between teams, etc. It's a good thing. Unfortunately...
(1) That role tends to attract the very worst kind of "I always know better than you" pedants and scolds. Hi, JM and YF!
(2) Once that team reaches critical mass, they forget that the dog (everyone else) is supposed to wag the tail (them) instead of the other way around.
At this point, Team Busybody starts to take over and treat all code as their own. Their role naturally gives them an outsize say in things like repository structures, and they use that to make decisions that benefit them even if they're at others' and the company's expense. Like monorepos. It's convenient for them, and so it happens, but that doesn't mean it's really a good idea.
Sure, it's a culture issue. So are the factors that lead to the failure of communism. But they're culture issues that are tied to human nature and that inevitably appear at scale. I know it's hard for people who have never worked at that scale to appreciate that inevitability, but that doesn't make it less real or less worth counteracting. One of the ways we do that is by putting structural barriers in the corporate politicians' way, to maintain developers' autonomy against constant encroachment. The only horseshit here is the belief that someone who rode a horse once knows how to command a cavalry regiment.
The thing is, scale was only one factor listed among many.
The downsides which still apply are Upside 3.3 (you don't deploy everything at once) and Downside 1 (code ownership and open source is harder).
And those are pretty weak arguments -- I would argue that deploying problems exists with polyrepo as well, and there are now various OWNERS mechanisms.
The fact that polyrepos are harder to open source is a good point, but having to maintain multiple separate repos just in case we would want to open source one day seems like severe premature optimization.
It’s much more about coupling and engendering reliance on pre-existing CI constraints, pipeline constraints, etc. If you work in a monorepo set up to assume a certain model of CI and delivery, but you need to innovate a new project that requires a totally different way to approach it, the monorepo kills you.
Another unappreciated problem of monorepos is how they engender monopolicies as well, and humans whose jobs become valuable because of their strict adherence to the single accepted way of doing anything will, naturally, become irrationally resistant to changes that could possibly undermine that.
It’s a snowball effect, and often the veteran engineers who have survived the scars of the monorepo for a while will be the biggest cheerleaders for it, like some type of Stockholm syndrome, continually misleading management by telling them the monorepo can always keep growing by attrition and will be fine and keep solving every problem, up to the point that it starts breaking in colossal failures and people are sitting around confused why some startup is eating their lunch and capable of much faster innovation cycles.
I've worked on teams with monorepos and teams with multiple repos, and so far my experience has been that monorepo development has been better — so much so that I feel (but do not believe) that advocating multiple repositories is professional malpractice.
Why don't I believe that? Because I know that the world is a big place, and that I've only worked at a few places out of the many that exist, and my experience only reflects my experience. So I don't really believe that multiple repositories are malpractice: my emotions no doubt mislead me here.
I suspect that what you & I have seen is not actually dependent on number of repositories, but rather due to some other factor, perhaps team leadership.
All I can say is I’ve had radically the opposite experience across many jobs. All the places that used monorepos had horrible cultures, constant CI / CD fire drills and inability to innovate, to such severe degrees that it caused serious business failures.
Companies with polyrepos did not have magical solutions to every problem, they just did not have to deal with whole classes of problems tied to monorepos, particularly on the side of stalled innovation and central IT dictatorships. Meanwhile, polyrepos did not introduce any serious different classes of problems that a monorepo would have solved more easily.
"The last point is not trivial. Lots of people glibly assume you can create monorepo solutions where arbitrary new projects inside the monorepo can be free to use whatever resource provisioning strategy or language or tooling or whatever, but in reality this not true, both because there is implicit bias to rely on the existing tooling (even if it’s not right for the job) and monorepos beget monopolicies where experimentation that violates some monorepo decision can be wholly prevented due to political blockers in the name of the monorepo.
One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.
The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.
We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).
But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests."
What do you think is the cause of your woes, the monorepo, or the disagreement between your colleague in central IT tooling who disagreed with you? Where was your manager in this situation? Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline? Was that a discussion that could not healthily occur? Perhaps because your organization was siloed and so teams compete with each other rather than cooperate? Perhaps because it's undermanaged anarchy masquerading as a meritocracy? Stop me if this sounds too familiar.
I've been there before. I know what it feels like. But, I also know what the root cause is.
To argue otherwise, and draw attention away from the real source of the policy problems (that the monorepo enables the problems) is a bigger problem. It’s definitely some variant of a No True Scotsman fallacy: “no _real_ monorepo implementation would have problems like A, B, C...”.
The practical matter is that where monorepos exist, monopolicies and draconian limitations soon follow. It’s not due to some first principles philosophical property of monorepos vs polyrepos — who cares! — but it’s still just the pragmatic result.
Also you mention,
> “Where was the conversation about whether GPU accelerated ML jobs were worth the additional business value to change the deployment pipeline.”
but this was explicitly part of the product roadmap, where my team submitted budgets for the GPU machines, we used known latency and throughput specs both from internal traffic data and other reference implementations of similar live ML models. Budgeting and planning to know that it was cost effective to run on GPU nodes was done way in advance.
The people responsible for killing the project actually did not raise any concern about the cost at all (and in fact they did not have enough expertise in the area of deploying neural network models to be able to say anything about the relative merit of our design or deployment plan).
Instead the decision was purely a policy decision: the code in the monorepo that was used for serving compute tasks just as a matter of policy was not allowed to change to accommodate new ways of doing things. The manager of that team compared it with having language limitations in a monorepo. In his mind, “wanting to deploy using custom Docker containers” was like saying “I don’t want to use a supported language for my next project.”
This type of innovation-killing monopolicy is very unique to monorepos.
A team with 5 services and a web front-end in a single repo is doable with regular git. It's a different beast I think.
When you have 100+ developers on a project, managing inbound commits/merges/etc will become tedious if they're all committing/merging into one effective codebase.
IMHO, it depends on the project, the team makeup, the codebase's runtime footprint, etc. whether or when it makes sense to start breaking it up into smaller fragments, or on the other hand, vacuuming up the fragments into a monorepo.
I did enjoy reading Steve Fink from Mozilla's comment (it's the top response on the OP's Medium article) and his counter arguments about monorepos vs polyrepos in that ecosystem (also clearly north of 100 developers). It's easy to miss if you don't expand the Medium comment section, but very much worth reading.
If you worked in a company that had a core product in a repo, and you wanted to create a slack bot for internal use, where would you put the code? I assume not within your core product's codebase, but within a separate repo, thus creating a polyrepo situation.
So when you say a monorepo will serve you in 99% of cases, are you not counting "side" projects, and simply talking about the core product?
Monorepos are going to be mostly challenges around scaling the org in a single repo.
Polyrepos are going to be mostly challenges with coordination.
But the absolute worst thing to do is not commit to a course of action and have to solve both sets of challenges (eg: having one pretty big repo with 80% of your code, and then the other 20% in a series of smaller repos)
It's like thinking OOP or functional programming is going to solve all your issues... I mean, in some limited cases they could, but realistically you're just smooshing the difficulties around and hopefully moving them to somewhere where you are more able to deal with them.
FWIW, I've worked in a many-repo org and it sucked worse than huge companies with monorepos and good tooling, but I'm not going to make some blanket statement because it depends on the specifics of your code/release process/developer familiarity etc.
For example: If you're stuck with a TFS monorepo (you poor soul), you actually get to deal with both problems to some extent, since TFS doesn't enforce that you check out the entire repository at once.
This can lead to very "funny" situations because someone forgot to check out new changes in some folder. OTOH, at least for releases, you can remedy this by using CI everywhere.
Pretty funny to read that the things I do every day are impossible.
Monorepo and tight coupling are orthogonal issues. Limits on coupling come from the build system, not from the source repository.
Yes, you should assume there is a sophisticated "VFS". What is this "checkout" you speak of? I have no time for that. I am too busy grepping the entire code base, which is apparently not possible.
If the "the realities of build/deploy management at scale are largely identical whether using a monorepo or polyrepo", then why on earth would google invest enormous effort constructing an entire ecosystem around a monorepo? Choices: 1) Google is dumb. 2) Mono and poly are not identical.
I think, once you've chosen a path of mono or poly, you have quite a challenge ahead of you to migrate to the other.
At that point, the tradeoffs arent based purely on the technical benefits - and "invest in monorepo tooling" may become a perfectly valid decision, as it's cheaper than "migrate to a polyrepo setup'.
I'm not arguing either way for or against monorepo, just pointing out that "must be a good idea because Google does it" is invalid - technical merit is just one of the thousands of concerns to be balanced.
I do agree with GP though. I wish the author hadn't decided the things I do everyday are impossible.
With thousands of developers banging on the code base, it's going to be more than "a handful of developers". It's going to be at least a few "handfuls" of developers full time plus probably many, many other full time equivalents spread out throughout the whole user base (testing, supporting other users, etc.).
We aren't talking about maintaining Mercurial here - we are talking about developing a brand new distributed VCS that happens to be 'Mercurial-compatible', and deploying/maintaining it for tens of thousands of developers working simultaneously.
Note that most of your problems will be related to having thousands of developers, and repo organization is irrelevant.
Truth is, ending up with a monorepo is _really easy_. It usually starts with something that doesn't even _feel_ like more than one project: backend code, frontend templates and some celery/whatever tasks, maybe some minor utility CLI tools. And this happens at the stage nobody wants to even _think_ about more than one git repository.
Once those are big enough, it's likely too late.
But hey, you can always claim _you wanted it that way_. My cats always look good while pulling that one.
Both CAN work, but for internal organizations with a reasonably sized team, I've come to realize that a mono repo is better. You attain "separation" by establishing different views of the code/data and at scale, the mental model of what's happening is much simpler.
I think if you worked with Perforce, you're likely to get the wrong idea that people who dislike monorepos didn't work with Perforce. But the reality is that anyone who worked in this industry long enough did at some point end up traumatised by it, thanks.
> the mental model of what's happening is much simpler.
How does introducing the concept of "views" to the VCS model make anything simpler?
It can work nicely when you have disciplined and demonstrably above average programmers that are good at structuring the internal architecture of systems and will know how to design for plasticity. It is also an advantage if all your code is written in the same style and doesn't come from a bunch of older codebases. But even then you can end up with messes that you will be likely to conveniently forget about.
For instance while clear decoupling was a goal when I worked at Google, it wasn't always a reality. There were still lots of very deep and direct dependencies that should never have been there.
It does not work well if you have "average" developers or if you have undisciplined developers or excessive bikeshedders (which kill productivity).
Then there is the tooling. Most people do not work for Google and do not have the ability to spend as much money and time on tooling as Google does. What Google does largely works because of the tooling. It would suck balls without it. To be honest: some things sucked balls even with the tooling. Especially when working with people in different time zones.
Google isn't really a valid example of why monorepo is a good idea because your average company isn't going to have a support structure even remotely as huge as Google. (If you disagre: hey, it's easy, go work for Google for a while and then tell me I'm wrong)
Didn’t google have a monorepo before git was created? And was created by academics? Legacy and momentum have a strong influence on the future. Hasn’t google also built a lot of tools for the monorepo and dedicates employees to it? That’s exactly the issue this article is about.
From an external perspective, the speed and scale of product rollouts from the bigger tech companies is very slow. I don’t know if the tooling has much to do with it, but I suspect it might. I’ve heard some horror stories (some from here) about how it takes months to get small changes into production.
I work for a company now where top management doesn't even understand what a repository is and what role it plays in software development.
Yes it is that bad in much of the "entreprise" world.
Yes, Google had a monorepo before git was in widespread use. They used Perforce while I was there, which was a miserable, miserable experience. It only worked because they poured engineering effort into making it somewhat tolerable.
I think it would be wrong to say Google chose a monorepo because it was the best choice. To be honest, I don't think they really planned how to deal with many thousands of developers when they made the choice. They just did what seemed to make sense at the time and then had to make it work as the challenges started to mount.
It does affect dependency management but no more than any external dependency.
Any complicated multi-repo setup requires tooling, processes, procedures, cross-repo PRs & issue tracking, &c. &c. &c.
The question is: which requires less cost in order to deliver business value? In my experience, on the teams I've been so far, the answer has been monorepos — but I don't know everything.
You'd need cross repo bisection.
You'd need a way to run all tests in all repos reflecting a new change.
There are tens or hundreds more feature requests I could list.
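For comparison, bisecting a regression inside a monorepo is a stock git workflow — a minimal sketch (the tag and test script are made up):

    git bisect start
    git bisect bad HEAD
    git bisect good v1.4.0          # last known-good release
    git bisect run ./run-tests.sh   # exits non-zero on failure

Reproducing that across N repos with interdependent versions is exactly the kind of tooling you'd have to build yourself.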
You really shouldn’t have to run every test on every product. Or really any other repos. Use semantic versioning, pin your dependencies, don’t make breaking changes on patch or minor versions.
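For instance (package names are hypothetical), a compatible-release pin lets patch and minor updates flow in while blocking breaking majors:

    pip install 'validation-lib>=1.4,<2'     # Python
    npm install validation-lib@^1.4.0        # JavaScript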
It results in one of three things:
1. People never update their dependencies. This is bad (consider a security issue in a dependency)
2. Client teams are forced to take on the work of updating due to breaking changes in their dependencies. If they don't, we're back at 1.
3. Library teams are forced to backport security updates to N versions that are used across the company.
But really, the question to ask is
>don’t make breaking changes on patch or minor versions
How can you be sure you aren't breaking anyone without running their code? You can be sure you aren't violating your defined APIs, but unless you're perfect, your API isn't, and there are undocumented invariants that you may change. Those break your users. Monorepo says that that was your responsibility, and therefore its your job to help them fix it. Polyrepo says that you don't need to care about that, you can just semver major version bump and be done with it, upgrading be damned.
No semver means that you, not your users, feel the pain of making breaking changes. That's invariably a good thing.
As a development team grows, time to market also grows, in a superlinear fashion. This has been known since people shared code on dead-tree pages, so the odds of tooling being the cause are low.
For other (smaller) companies, polyrepo might be the better choice because [significant investment in ecosystem and tooling] is not appealing, and the investments of Google et al. have not leaked through sufficiently into generally available tools. Some headway is being made in the latter, so monorepo might be the "obvious" best choice in 10 years or so. For example, Git large file support is mostly from corporate contributors: https://git-lfs.github.com/ https://github.com/Microsoft/VFSForGit
That's not the choice, though: significant investment in tooling is a function of codebase size. In my own experience, polyrepos require more tooling, because you're not just dealing with files & directories, you're also dealing with repos (& probably PRs & issues & other stuff in a forge).
That's not my experience. In my experience, polyrepos significantly reduce complexity for a medium (30 developers) project.
An example: the following things are good software development practices if you work with a master-PR branch model:
1. Tests must pass on CI before merging a branch to master
2. Before merging a branch into master, the latest master must be merged into the branch so that tests are still reliable
So you need partial builds to keep build time low, and would probably like to amend 1 & 2 with "unless your code has zero overlap with the changes in master". These are not standard features of any CI system I know of, hence the need for tooling.
Instead of tooling, polyrepos provide the above benefits out-of-the-box. Just set your CI to build the repo, and it will do partial builds, and PR-merging is uninfluenced by other repos. This is a huge advantage over monorepos.
The downside is that if your repos have tight coupling, you'll need simultaneous PRs in more than one place or need to look up history/files in more than one repo. If this is more than a rare occurrence, this downside is so large that polyrepo is not a suitable solution for your project.
The projects of this size I've worked with did not have this problem, or the problem was solvable without much difficulty.
Of course there is a threshold, however this is typically a concern of a large organization or an organization that has been producing software for a decade or more.
Yeah this is a pretty widespread and fundamental misunderstanding that leads to a lot of bad policy decisions.
If 'grepping code' is your first resort then you're hitting things with a hammer. I'm writing code that a machine is supposed to understand. If the machine can't understand how the bits interact then I have much bigger problems than where my code is stored. Probably we're dealing with a lot of toxic machismo bullshit that is hurting our ability to deliver.
For a large team, working without any kind of static analysis is a recipe for a rigid oligarchy. Only people who have memorized the system can reason about it. Everybody else who tries to make ambitious changes ends up breaking something. See what happens when you trust new people with new ideas? New is bad. Be safe with us.
And even if by some miracle you do make the change without blowing stuff up, you're still in the doghouse, because we have memorized the old way and you are disrupting things!
Some crazy ideas work well. Some reasonable ideas fail horribly. To grow, people need the space to tinker and an opaque codebase ruins those opportunities. Transparency is also helpful when debugging a production issue, because people can work in parallel to the people most likely to solve the problem (even the person who is usually right is way off base occasionally). I should be able to learn and possibly contribute without jamming up the rest of the team by asking inane questions.
You need pretty good but entirely achievable tooling and architecture to get that, but man when you do it's like getting over a cold and remembering what breathing feels like.
> Because, at scale, a monorepo must solve every problem that a polyrepo must solve, with the downside of encouraging tight coupling, and the additional herculean effort of tackling VCS scalability.
But you have to get to "scale" first (as it relates to VCSs). Most companies don't. Even if they're successful. Introducing polyrepos front loads the scaling problems for no reason whatsoever. A giant waste of time.
Checkmate! I didn't even need a snarky poll. The irony of that poll is that it clearly demonstrates his zealotry, not other people's.
Nobody would choose to drag around every historical afterthought in the development sequence of long forgotten software going back three decades that no longer builds with current tools, just so they can work on a small library off in a corner. Software is getting written and added to these monorepos at a much faster rate than hardware and networks are able to hide the bloat-upon-bloat growth of them.
If it doesn't work then it should be deleted. If it's still running somewhere then it should be maintained. Presumably you have a CI system so the monorepo actually requires everything in it to build.
In my experience, it's polyrepos that allow for dead and un-maintained code to just sit there for eternity. You forget about that unused repo right until the moment the service it deploys to (if you can track down that dependency) needs an update or goes down. Monorepos can more easily force system wide CI that checks for broken dependencies or other issues.
In the embedded world supporting software for 30 years is not unheard of. We avoid it, but it is in the back of our mind that someday we might have to release an update. Fortunately 30 years ago nothing was internet connected, we are worried that we might be releasing security updates for our current products 50 years from now...
"Hey, what does this server do?"
"No idea; it hasn't been touched in years. What's deployed to it?"
"Some 'foobar-ng' thing, never heard of that. Says it was last updated 5 years ago. Pull up our source repo for that package, will you?"
"Hang on, we've got like 30 services with names containing 'foobar', let me find that one . . . oh god. You don't wanna know."
"Fuck it, I'm just gonna shut this server down and remove that ancient, dead, busted package."
"The main billing system just broke! What did you do?!"
That has not been my experience at all. At a previous employer, we did exactly that with a multi-language library. In fairness, having multiple languages enforced fairly good directory structure in our single repo. But isn’t that the real point: good structure makes life easier, period. The thing is, going into a project you often don’t know what the right structures are yet. Creating a new repo for each component you think you need ossifies those choices, making it far more difficult to walk back on them later on (first because you may not even see the architectural mistake, second because the maintainers of that component will have an investment in its existence).
> You can always just put all your small repos into a big one.
In my experience, that’s harder, precisely because over time so much tooling has been built into each repo to manage builds, images, deployment &c.
I’ve worked in monorepos & I’ve worked in multirepos, and so far my experience has been that monorepos enable faster velocity and more-maintainable software. I’ve not (yet) worked at Google- or Facebook-scale, though, and I’m completely open to the idea that at that scale a team really does need lots and lots of repos, and tooling to stitch them all together.
If you split your repo up from the get go, the worst thing you can get is that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one. In Git, that could lead to multiple root commits, but that's about it.
Most of the time they're just annoyed.
One side effect of every successful business is annoyed worker ants that are sick of dealing with growth problems. I've been there. I know how annoying it can be.
Personally I've found comfort in embracing the chaos and learning to manage it responsibly. No dogma. No absolutes. Know how to do monorepos well. Know how to do polyrepos well. Learn the pitfalls of both. Don't assume other people are stupid zealots.
> The worst case is that the engineering team spent more time working on “well encapsulated projects” than on the most important project for their business
I'm not really sure how I should read this. Don't you use your repos to solve business problems? Why should that change because of the repo layout?
I've seen that with polyrepos as well: The entire project would require you to clone the individual repos into a specific directory structure so that things would work (no, not even submodules).
Why would you put in a symlink? You could just provide a path to the actual component and import it into your project.
> the worst thing you can get is that you'll have to assemble multiple distinct, well-encapsulated (in terms of project structure) things into one
When you have multiple repos, you also have multiple versions and releases of things. Now every team has the following options:
1. Backport critical fixes to every version still in use (hard to scale)
2. Publish a deprecation policy, aggressively deprecate older releases, and ignore pleading and begging from any team that fails to upgrade (infeasible - there'll always be a team that can't upgrade at that moment for some reason)
You also have to solve the conflicting diamond dependency problem. This is when libfoo and libbar both depend on different versions of libbarfoodep. It's even more fun in Java because everything compiles and builds, but fails at runtime. So now you have to add rules and checks to your entire dependency tree - some package managers have this (Maven), others don't (npm IIRC).
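In the npm world the diamond usually surfaces as two separate copies of the same library installed side by side; here's a minimal TypeScript sketch with made-up packages (libfoo and libbar each pin a different major version of libbarfoodep):

    // app.ts -- libfoo re-exports Query from its own copy of libbarfoodep@1,
    // while libbar checks values against its own copy of libbarfoodep@2.
    import { Query } from "libfoo";
    import { runQuery } from "libbar";

    const q = new Query("SELECT 1");
    // The two Query types are structurally identical, so this compiles cleanly,
    // but libbar's internal `instanceof Query` check runs against a different
    // class object at runtime and rejects the value.
    runQuery(q);

With Maven-style flat resolution the failure mode is different but just as late: only one version wins a spot on the classpath, and whichever library was compiled against the other one blows up at runtime (hence the NoSuchMethodError fun in Java).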
Where do I need to put the path again? Ah what the heck, I'll just add a symlink inside a folder that's already somewhere in the build definitions.
I don't know anyone who has abused Maven or Cargo or Go like this. And I don't imagine Visual Studio Solutions for C# are used like this.
Is there an underlying disagreement here between JS/Ruby/Python scriptish coding (which creaks when a lot of developers work on it), C and C++ (which have astonishingly bad build-system stories), and big-iron languages that don't sweat in a monorepo?
At my workplace, we've just been cleaning up a whole bunch of instances of exactly that anti-pattern. Except that it's obviously not symlinks (which require specific user rights on Windows), but links to external files in VS.
Same problem, though: They're easy to introduce and a pain to deal with later on.
It's not quite as simple as that. You'll need to avoid rebuilding the entire repo for every change - using something like Bazel. This means your build tooling has to be replaced entirely, which is a non-trivial task and not something your devops/release engineering team will thank you for.
For any 3rd-party libraries used by your projects, you either need to ingest them into your monorepo and keep them updated yourself forever, or keep npm/maven/pip/gem/whatever around just for managing 3rd-party dependencies (plus whatever system you use to front the main language package registries, because of course you're not talking directly to npm/Maven Central, are you? What if they go down or do a leftpad?).
I think either system - monorepos or polyrepos - works fine; just pick one and stick with it. Monorepos will probably give you better velocity starting out. Past a certain size, which most software shops will never hit, the already-available tools lend themselves better to polyrepo. And more devs know polyrepo tools (eg. Jenkins) than the corresponding monorepo solution (eg. Bazel). Things might swing in favor of monorepo on the VCS side if Twitter/Google/FB ever open-source their stuff.
It's their job. If they actively don't want to do the work then you probably made a hiring mistake somewhere. By that logic what DevOps really wants is for the company to shut down, since then they'd have none of that tedious work to do.
It's harder to make a business value case for this type of change - there are only vaguely worded promises of "improved developer velocity". Contrast that with a change that automates or makes faster some aspect of building and releasing - a professional release engineering team would be all over it because they can demonstrate value in that work.
In any company beyond 50 people, there are multiple engineering managers, directors or the VP of engineering that will need to back this initiative to make the release engineering team do it. It's really not as simple as "dump all the code in one repo". I'm speaking from experience.
Sure it is their job, but it isn't an easy job and there are many opportunities for things to go wrong. It might or might not be the right choice for you, but don't overlook how hard it is.
Note that the above applies for going in either direction.
If I was moving many small repos into a single mono repo then I'd do it one repo at a time. Presumably your small repos are independent entities so there's no reason to do a single massive switch. Transition each repo to the new build system inside the existing repo. Once that works then you can transition that repo into the mono-repo and tie together the build systems. No need to stop releases, no massive chance of everything failing, no weeks of debugging while the world is stopped, etc. Rinse and repeat until everything is moved over. Process becomes more optimized and less error prone with each repo that is moved over.
Sure you can. The difficulty of doing so depends on many (many) factors. If your team does their job well then the costs won't be immense. It might be annoying, but not that hard.
Speaking in absolutes or platitudes solves nothing. Sometimes monorepos make sense. Sometimes polyrepos make sense. It's entirely dependent on what your company does.
E.g. if you rely solely on a package manager to keep coupled things in lock step, you need to make sure that version numbers are kept up to date for every little change made to every library.
You can easily end up with a situation where someone in another team makes a small change but doesn't change the lib version number. That's a people issue but it does happen.
You can get round that by using a repo SHA, but now you have two things to keep up to date for every library.
Likewise, you'll have to be diligent about versioning APIs. Anecdotally, I've found it easier to keep things in lock step when in a single repo and using a single pull request for each story than I have where separate teams have to keep separate repos in sync.
Both work, but the monorepo approach worked better for the projects I've worked on. It just led to fewer moving parts and more repeatability when there's a single SHA to watch.
I have also been lucky enough that I haven't worked on a project so large that we couldn't build the monorepo on a single machine with "normal" build tooling.
The author lists downsides of monorepos without listing the upsides and downsides of polyrepos, so it's really only half complete.
I don't think anyone who likes a monorepo is suggesting you just commit breaking changes to master and ignore downstream teams. What it does do is give the ability to see who those downstream teams (if any) might be.
The crux of the author's argument is that added information is harmful because you might use it wrong. It's just as easy (far easier, in fact) to ignore your partners without the information a monorepo gives. It's not really an argument at all. There's really nothing here but "there be dragons".
Monorepos provide some cross-functional information for a maintenance price. It's up to you whether the benefit is worth the overhead.
To me this message seems a bit shallow: of course we can build tooling to hide the fact that we have a polyrepo. Given good enough tooling and a consistent enough polyrepo structure (all using the same VCS, all linked from common tooling, following common coding standards, using the same build tooling, etc.), the distinction from having a monorepo is more of an implementation detail.
Given the choice between a consistent monorepo where everyone is running everything at HEAD and a polyrepo where each project has its own rules and there's no tooling to make a multi-project atomic change, I'd go for the former.
Given the choice between identical working environments but different underlying implementations I would go for whatever the tools team think is easier to maintain.
This is a good thing, because when you have to make the multirepo commit you make the change and then update each downstream repo one at a time. Each change is much smaller and so easier to review (and it's also easier to find the right reviewer).
Of course the downside is that you either have to maintain both ABIs (not just APIs), have a rollout scheme where two versions of the upstream library exist side by side, or not release at all.
Nothing is perfect.
Exactly. Sure, you can manually recreate a monorepo from a multirepo system, but … why do that? That takes software engineering effort that you could spend on your product instead.
Need to change a function signature or interface? Cool, global find & replace.
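A minimal sketch of what that looks like when everything lives in one repo (the paths and names here are made up): the new signature and every call site land in the same commit, so nothing ever drifts out of step.

    // libs/users/client.ts -- the signature gains a required tenantId parameter
    export function fetchUserName(tenantId: string, userId: string): Promise<string> {
      return Promise.resolve(`${tenantId}/${userId}`); // stubbed out for the sketch
    }

    // services/billing/invoice.ts -- updated in the very same change
    import { fetchUserName } from "../../libs/users/client";
    export const owner = fetchUserName("acme", "42");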
At some point monorepos outgrow their usefulness. The sheer number of files in something that's 10K+ LOC (not that large, I know) warrants breaking the codebase apart into packages.
Still, I almost err on the side of monorepos because of the convenience that editors like vscode offer: autocomplete, auto-updating imports, etc.
Though, as you say and I commented below they’re not mutually exclusive. A wrapper or even an entirely separate service can exist alongside others.
One dark side of this is being able to “reach inside” other parts of the monorepo and blur application boundaries.
Monorepos are also a great technique for tackling large legacy codebases. When the rot is all in one designated place, it becomes easier to encourage good developer habits on new code created in new, separated repo(s).
Speaking from experience, I've worked on a team operating through a monorepo project that came out really well. The codebase was mostly Go, so everything lived in the GOPATH, but for the most part the TypeScript on the UI side of the repo didn't complain. Testing and code quality were a higher priority as well, which may have contributed to its success.
I have also worked on a monorepo project that had minimal tests and automation, which soon grew monstrous and eventually needed refactoring. That was a big pile of CoffeeScript, ES6 and Java that was ultimately refactored into three different node modules and two microservices.
Monorepo or Polyrepo, the correct answer is whatever works for your team and task at hand.
I'm seeing these two things conflated in this thread.
In some cases, this could be two separate backend projects where you want to re-use the same deployment pipeline.
Often I find that API wrappers are something I share across frontends and backends in the JS world, so it usually makes sense to separate my projects into a frontend, a backend, and a shared API wrapper/types package.
In TypeScript I really like this pattern and can namespace shared types so that it's very clear to the future reader that this type is probably used outside of the current context.
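For what it's worth, a minimal sketch of that pattern (the "@acme/api-types" package and the User type are made up): importing the shared types under a namespace makes it obvious at every use site that the type crosses project boundaries.

    // packages/api-types/index.ts -- shared by frontend and backend
    export interface User {
      id: string;
      displayName: string;
    }

    // backend (or frontend) consumer
    import * as ApiTypes from "@acme/api-types";

    export async function getUser(id: string): Promise<ApiTypes.User> {
      const res = await fetch(`/api/users/${id}`);
      return res.json();
    }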
So, to reply to your comment — I think the term “monorepo” can encompass a lot of different project types.
I think Dan Luu covers the bases quite well here:
Which of course begs the question, rather than trying to perform a bunch of unnatural acts, why not just use SVN to start with? It works extremely well with monorepo & subtree workflows.
Sure, it has some warts around branching, versioning, etc. compared to Git used in the ways Git wants to work, but those warts are minimal compared to what's required to pretzel Git monorepos into scaling effectively.
Unreasonable for a single dev to have the entire repo? I'm looking at a repo with ~10 million LoC and ~1.4 million commits. I have 74 different branches checked out right now. Hard drives are cheap.
Code refactors are impossible? I reviewed two of those this morning. They're essentially a non-event. I'm not sure what to make of the merge issue - does code review have to start over after a merge? That seems like a deep issue in your code review process. The service-oriented point seems like a non-sequitur, unless you're telling me I'm supposed to have a service for, say, my queue implementation or time library.
The VCS scalability issue is the only real downside I see here. And it is real, but it also seems worth it. It helps that the big players are paving the way here - Facebook's contributions to the scalability of Mercurial have definitely made a difference for us.
Part of code review is to ensure the code "fits" with all other merged code - so a re-review is "needed" when other changes merge. E.g. if I merge a refactor that changes everything from Pascal case to 100% SHOUTING, reviews now need to take this into account.
In practice, this doesn't happen - it's way too much effort for far too little value.
To be fair, if you get away with merging that refactor, the review that needed more attention was of that refactor ;-)
 - https://fuchsia.googlesource.com/jiri/
 - https://chromium.googlesource.com/chromium/tools/depot_tools...
It even carries over a bit into the build system of choice: GN (used in the above; previously gyp) feels similar on the surface to Bazel, but has some significant differences (GN has more imperative parts and is a Ninja generator, while Bazel, like Pants/Buck/please.build, is a build system in its own right).
Simply fascinated :), and can't wait to see what the resolution of all this would be... Bazel is getting there to support monorepos (through WORKSPACEs), but there are some hard problems there...
I asked one company how many changes required changes to more than one repo and was told "a small percentage". We then did some basic analysis of issue IDs across commits and discovered that it was in reality nearer 30% of changes. Keeping those together was just plain very hard.
Start to scale this by teams of hundreds or thousands of devs and you get a lot of pain.
Managing branches is also hard: they're easy to create (with the repo tool), but the changes are hard to track.
Fooserver sprouts a query syntax ("just do this for test servers A and B"), pushed to production. Fooclient sprouts code that relies on this, pushed to production. A bit later, Fooserver is rolled back, blowing away query syntax, pushed to production. "Just do this for test servers A and B" now becomes "Do this for every server in the company". Hilarity ensues.
The upsides and downsides of this are an interesting debate, but there is a cost to polyrepos if you want to change the system architecture. There is a cost to monorepos too, as argued by this post, and it's up to the tech leads as to which cost is greater.
Seriously, you have over 1 TB of code and 100 people wrote it?
I just wanted to point out that reaching a measly TB of data doesn't require much effort (I worked on a product that versioned rendered clips for special-effects production).
Telling people what they should or should not do is generally absurd. Every situation is unique and you can't possibly know another project's requirements or acceptable trade-offs.
A better approach, in my opinion, is "Here's what we did and why". The author clearly has experience in the area. Great! Tell me about your problems. Tell me about your attempted solutions and what did or did not work. Tell me what you wish you had done! I'd love to use knowledge of your situation to inform my own decision making.
But don't be surprised if my circumstances are different and lead me to prefer different trade-offs and choose a different solution. That doesn't make me a zealot or an idiot.
When I blog I've had much better luck telling people "here's what I did and why". I don't know your circumstances and can't tell you how to solve your problems. You may need to choose different trade-offs than I did. With that said, here is my problem, how I solved it, and what I learned along the way. Hopefully you can learn from my experiences and make a more informed decision for how to handle problems you may encounter.
You may disagree with that thesis, but it definitely seems to cover more than one use case.
.....because, as the author directly stated, the type of repo has nothing to do with the product being successful. So stop bikeshedding, pick a model, and get on with the real business of delivering a successful product.
You need fairly extensive tooling to make working with a repo of submodules comfortable at any scale. At large scale, that tooling can be simpler than the equivalent monorepo tooling, assuming that your individual repos remain "small" but also appropriately granular (not a given--organizing is hard, especially if you leave it to individual project teams). However, in the process of getting there, a monorepo requires no particular bespoke tooling at small or even medium scale (it's just "a repo"), and the performance intolerability pretty much scales smoothly from there. And those can be treated as technical problems if you don't want to approach social problems.
To put it another way, we're comparing asymptotic O(n) with something bigger, neglecting huge constant factors on the former. There's a lot of path-dependence, since restructuring all your repos with new tooling is hard to appreciate.
It can be misused though - the releases of the root repository reference the children by tags usually. Someone retagged a child repo and we suddenly had build failures.
We did, however, run into the standard dependency-resolution issue you get with any loosely coupled dependency: updating our submodule usually took a 1-2 day effort because we were out of sync for a month or two.