I've worked in environments across the version-control gamut. The best-run places have had monorepos. But, for the life of me, I would not trust the companies that didn't have monorepos to operate one.
To go mono is to make an org-level engineering and cultural commitment: you're going to invest in build tools, dependency-graph management, third-party vendoring, trunk-based development, and CI/CD infrastructure.
Can you make a mono repo work without all of those things? Yes, but you are sacrificing most of its benefits.
If your eng org cannot afford to make those investments (e.g. headcount is middling but there is zero business tolerance for investing in developer experience, or the company is old and the eng org is best described as a disconnected graph), forcing a monorepo is probably not the right idea for you.
Monorepo vs. microrepos is analogous in some ways to the static vs. dynamic typing debate. A well-managed monorepo prevents entire classes of problems, as does static typing. Dynamic typing has much lower table stakes for a "running program", as do microrepos.
edit:
It's worth noting that open-source solutions for build tooling, dependency-graph management, etc. have gotten extremely good in the last 10 years. At my present company (eng headcount ~200), we spend roughly two engineers' worth of time per year on upgrades and maintenance of this infrastructure. These tools are still quite complex, but the table stakes for a monorepo are lower today than they were 10 years ago.
It's strange that you make this incorrect comparison to static and dynamic typing. As with static typing, an automated tool prevents you from doing something. It comes for free at the cost of you needing to satisfy the tool. Where is this in a monorepo? You make it sound like the company itself must invest in assuring that the rules are not broken. That sounds like _dynamic_ typing to me.
> Where is this in a monorepo? You make it sound like the company itself must invest in assuring that the rules are not broken.
My argument, as made in the original comment, is that a monorepo is not just a git repository. It is a git repository, a build tool / dependency graph manager, and a continuous integration system.
In the same way, a statically-typed project is not just a collection of source files. It is a collection of source files, a linker, and a compiler.
The linker and compiler do not come "for free" any more than a build tool / ci system do, although they have been around for a very long time so you might consider them part of the background. In a well-structured monorepo, the build tool and ci system are as much in the background as `gcc` or `go build` might be.
> As with static typing, an automated tool prevents you from doing something. It comes for free at the cost of you needing to satisfy the tool. Where is this in a monorepo?
In a monorepo, if your change breaks someone else's code you'll know immediately, because you won't be able to check in code that breaks the build. If you have the correct monorepo tooling set up, it will help you figure out what you're breaking.
With multi-repos others will have to upgrade to your new version and then see that their code is breaking.
GP's analogy was imperfect because as you pointed out the company has to invest in the monorepo tooling. But it makes sense to me.
Git + Bazel will be fine for scaling up to 100s of engineers from my experience (at Lyft's L5 autonomous division). My other data point is Google, with 10,000s of engineers and a bespoke VCS. I'm not sure how things work in the middle (1000s of engineers), but I think you can solve that problem when you get to it, and Bazel has some features (look at the git_repository rule) to help you split a big repo if you need to.
You may also want a service like Artifactory for hosting binary blobs that feed into the build process.
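A minimal sketch of the git_repository suggestion above, assuming a recent Bazel version; the repository name, URL, and commit are hypothetical:

```
# WORKSPACE -- pulling a split-out repo into the build with git_repository
# (load path is for recent Bazel versions).
load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
    name = "shared_utils",                                    # hypothetical name
    remote = "https://git.example.com/org/shared-utils.git",  # hypothetical URL
    commit = "0123456789abcdef0123456789abcdef01234567",      # pin an exact commit
)
```

Targets in the main repo can then depend on labels like `@shared_utils//lib:utils`, so a split still shows up explicitly in the dependency graph.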
I work in a project with 50 engineers. Our git repo is 2GB plus 5GB of submodules. Git is painfully slow and I would imagine with more engineers it would become unusable. For example, git-fetch takes at least a minute.
To some degree we are doing it wrong: using Windows, creating too many tags, and committing binary blobs (despite LFS). Still, scaling it up by a factor of two or three would not change things significantly.
I would avoid Git submodules; it seems like a fairly poorly designed feature and I'm constantly getting my submodules into a bad state (e.g. when switching branches). The alternative that has worked a lot better for me, in repos that use Bazel, is Bazel external repositories (using git_repository or http_archive). A more direct replacement that I hear is better is git subtree although I haven't used it.
Also, you're probably already doing this, but make sure you're running Git on an SSD! Spinning rust will slow down VCS workloads, which are disk intensive, considerably.
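As a sketch of the submodule alternative described above, an external dependency can be pinned as a checksummed archive instead of a submodule ref; the name, URL, and checksum below are placeholders:

```
# WORKSPACE -- replacing a submodule with a pinned, checksummed archive
# fetched at build time via http_archive.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "vendor_lib",  # hypothetical name
    urls = ["https://git.example.com/org/vendor-lib/archive/v1.4.2.tar.gz"],  # hypothetical URL
    strip_prefix = "vendor-lib-1.4.2",
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder; use the real archive checksum
)
```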
Our next try is probably to rely more on Conan and Artifactory to download binary artifacts. In theory, that could result in similar behavior to Bazel cloud builds though not as integrated and granular.
Interesting... when you say fetch, are you talking about a cold-fetch?
`du -sh repo` for me is 15GB, and we don't experience the slowness you describe. If I'm checking out a branch that diverged from master 1 year ago, it takes maybe... 30 seconds?
But I don't know, maybe we've cast different git-spells from you all? We don't use sub-modules, for example.
Have you considered making shallow clones?
Also, you are indeed doing many things that harm usability. I am surprised, though, that you have issues with blobs even with LFS.
Shallow clones and submodules do not go well together though: for shallow clones, git is set up to not track all remote branches. This can result in an error message when the submodule has not fetched the commit you want to switch to.
This is more or less what we use at my current place (100s of engineers). We haven't run into any dramatic issues at this scale.
Our architecture is arcanist/phab over git --> stash (git) --> custom build server that's basically identical to gitlab --> artifactory.
We use Pants, but we made that choice before Bazel was a thing.
Bazel is probably the most "scalable" solution, but it isn't always the most intuitive/developer-friendly. It's worth exploring the alternatives (the biggest ones imo are Pants and Buck) to see what you like the most. They're all pretty similar, but Bazel has some hard-to-replicate bells and whistles that mean it'll probably be the eventual winner.
Things that change together, go together.
Monorepos prevent people from having to dig through every imaginable corner of your version control system to find all the pieces of your application.
At the same time, if you have 10 micro services which are accessed by 1 frontend, it may be a little bit messy to keep all that code in the same place.
Common sense (which is not that common) is what should be used to decide. Ask yourself some questions:
- Do these repositories change together?
- If the code is not together, how should one repo be linked to the other (a git tag? a version number?)
- Are all pieces of the application contained within one context, or can they be split (for example, 10 microservices and 1 frontend, but 5 of those microservices are also used by another frontend in another project)?
The main goal is to make things easier; some PoCs and experiments may make it clearer, because it really depends on the situation.
> if you have 10 micro services which are accessed by 1 frontend, it may be a little bit messy
I see another differentiator here. If I have a typed API interface (both endpoints and payloads), and I publish a (generated) client library for the frontend, then that frontend will only use the new API version if it is also upgraded to use the new client lib. Here I think multiple repos are more suitable because the pieces of code can improve independently.
In case the API is not typed/versioned, the case for a monorepo is much stronger: the change to the API on the BE should come along with the change in the FE.
Thus typing/versioning (on specific barriers) helps with breaking up a code base (along those barriers).
Even with an untyped API, I'd argue that you want it to be in your face that you cannot ship the front end and the back end together. Getting them in one commit will not get them in one deploy.
In mono/multirepo debates, I frame this as: multirepos make hard, scary things look hard. Monorepos make it easy to introduce changes that break APIs at deploy time but are hidden by the atomic commit looking OK.
I think API versioning is orthogonal to how you organize your code. If you have multiple repos, then you still have that "atomic commit" where you update the version of the client library and adjust the code to handle the new semantics. It then breaks when you deploy it to production because the server code isn't deployed yet (or vice versa).
Basically, if you make RPCs, breaking up your repositories or combining your repositories doesn't eliminate this problem. It's a separate problem that you have to tackle. The reason it doesn't come up as often as it should is because a lot of code is liberal in what it accepts, and most changes are strictly additive (i.e. it's "add more information to the response", not "remove information from the response", because we're generally pushed in the direction of making our applications do more). But it is something you have to attack head-on. No tool will auto-fix the problems for you.
> API versioning is orthogonal to how you organize your code
They are, but they aren't, because we have to basically increment two counters when things change. If you want to be statically correct, you now have a distributed transaction to update both systems.
There is some path through a call graph (in-proc and RPC) that is statically valid. Then there are duck-typed results that are only additive, which you are describing, provided the system can handle this (dictionaries over structs). You see this in protobuf and other static-centric serialization systems: they start to encode dynamic information using the static building blocks. GCC's internal IR is like this. I think what you are advocating for only applies to the domain of microservices, loosely connected using languages and formats that are open under union. Other systems without those properties will not fare so well.
The folks using the monorepos probably don't disagree, but having atomic commits with hashes makes the identity function a useful crutch. I believe there are other solutions where the VCS could actually make this problem a non-issue.
I think they were referring to a false atomic commit to the front and back end. In a mono repo, that can be one commit. However, it has to be two deploys, so it should have been two commits.
At least, that is my argument. So, I could be reading my view into it. :)
> Basically, if you make RPCs, breaking up your repositories or combining your repositories doesn't eliminate this problem.
I kinda agree with this.
But as someone who has always worked in a microservice environment kept in different git repos, I don't understand the benefits of a monorepo. I can see the benefits of a monorepo in a massive code base that produces a binary at the end (e.g. Windows). But could you elaborate more on the benefits of a monorepo in the case of many small microservices?
>At the same time, if you have 10 micro services which are accessed by 1 frontend, it may be a little bit messy to keep all that code in the same place.
Why is this messy? Using folders to separate your code has no intrinsic organizational difference from using an entire repo. There's no issue in throwing every app in your company under one repo and just using a folder to organize your stuff.
Separating code into multiple repos makes one repo less aware of changes in another repo. It actually makes things harder and worse.
This is the same issue as with monoliths and microservices. You don't need to separate your code onto several computers just for organizational reasons. You can use folders to organize things.
Pros:
* Commits that go across components/apps are atomic.
Cons:
* When it gets big, those features matter less
* Churn from other devs' stuff gets in your merge/rebase work.
* 'git log' and other commands can be painfully slow
* Mistakes in the repo (e.g., committing a password) now affect many more people.
Use them for highly-coupled source bases, where releasing together and atomic commits are very useful. Every other time, you're probably better off splitting early and taking a little time to build the coordination facilities needed for handling dependency versioning across your different repos.
Note: I used to really prefer monorepos, but I've had some time away now to take a better look at it. Now they feel like megaclasses where small things are easier (because everything is in one place) but large things are way harder (because there's no good split point).
> Churn from other dev's stuff gets in your merge/rebase work
Wouldn't other people's work only cause issues if they are changing the same files, in which case conflicts would happen even if the work were spread across multiple repos?
1- You finished your work, so you pull from the central repository and merge your branch into master
2- You do your merge work, test it a bit, and commit
3- Now you push and, oops, another team did the same; you are now left with two heads
4- You merge or rebase the two heads. Probably a simple task, but you may still need to run some tests again. If you are lucky, you are done; otherwise, back to step 3
Also, if you are rebasing before push, you should be able to keep a clean history. However, if you are merging, and there are good reasons for a "merge only" policy, you are going to have a mess of merge commits every time the previous situation happens.
That's something you can work around with good management. But the more freedom you give individual teams to push to the common branches, the more that situation will arise, the more you try to control access, the more chance you will have for teams to go in different directions, making the merges infrequent but tricky.
When you say "But the more freedom you give individual teams to push to the common branches", what do you mean? I've typically only seen CI merge to master, and disable pushing.
Imagine you're in a branch for large project P. You want to merge to trunk but there are conflicts. The upstream library project L was changed in your branch but not merged down by the submitter.
Even if P simply relies on a binary of L that is already in the artifact store, you still have to deal with merging this code down when normally you wouldn't.
In fact, the much bigger issue is that you now must deal with the double edged sword of updating the entire org when a shared dependency must be updated.
Perhaps you even completed that task in your project branch. Now you get to deal with merging into every project in the org that was touched.
Good point. I've gotten hit by this in monorepos but I probably just tried to merge instead of rebase. Both are crazy painful on large source bases.
It still takes forever!
I should replace that with 'most commits in history are irrelevant to you, so you have to dig smarter to understand what's been going on with your components'
Invest in using a gui merge tool. It makes this process 10x easier. Also your complaint is unrelated to monorepos vs. multiple repos.
Additionally, rebase or merge won't make the conflict issue go away. I recommend companies stop using rebase, as it produces an inaccurate history of what actually happened in your repository.
I'm a magit-er now. In multiple repos, the history for one repo is limited to that repo, which is usually more topical for people working on it.
If I've got changes on an old version of the repo, and try to merge against a new one, I find conflicts in the changes others have made to the same repo -- I have the 'before' version, they have the 'after'. I frankly don't investigate too much when it hits.
As for history, I love rebase for two reasons: (1) I can squash mistakes in my history so they're gone forever. (2) In reconciliation I'm watching my changes filter through the history of the repo. When I understand what's up, this is great. When I don't, it's a catastrophe of me re-applying changes that haven't actually been applied yet, then conflicting with their application later. Ugh.
GitHub (it might depend on settings) refuses to merge if your branch is out of date. Rebasing on master re-triggers CI/CD, which might mean lots of waiting in this case...
Good list. I’ll add: If your company does feature freezes or general code freezes around holidays or corporate earnings, it can be harder to get “run the business” changes approved during these periods in a monorepo. So even in monorepo companies it’s common to have a separate “infra” repo for network and cloud configuration, sometimes for auth too. Makes the politics easier.
For me the biggest pro and con are the same thing. With a monorepo your respective projects' codebases become tightly coupled.
Why can that be a bad thing? Good software engineering practices tend to be associated with loose coupling, modularity, and small, well-defined interface boundaries. Putting each project into its own repo encourages developers to think of it as a standalone product that will be consumed by third-party users at arm's length. That's going to engender better design, more careful documentation, and less exposed surface area than the informal cross-chatter that happens when separate projects live in the same codebase.
Why can that be a good thing? The cost of separate repos is that each project has to be treated as a standalone product with its own release schedule, lifecycle support, and backwards compatibility considerations. Say you want to deprecate an internal API in a monorepo: just find all the instances where it's called and replace accordingly. With a multi-repo it's nowhere near as easy. You'll find yourself having to support old-style calling conventions well past the point you'd prefer, to avoid breaking the build for downstream consumers.
If a monorepo is generating multiple binaries, there’s no reason it can’t use separate compilation units. In many languages, separate compilation units give you that arm’s length separation you need, without introducing the problem of making breaking changes that you can’t reasonably detect prior to commit, because the code is used in some obscure module you don’t even have checked out.
For that reason, monorepos favor the new programmer, which makes it easier to ramp your team size.
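A rough illustration of that arm's-length separation inside a monorepo, using build-level visibility to enforce the boundary; the package paths and target names are invented:

```
# libs/billing/BUILD -- hypothetical package. Only the payments service may
# depend on this library, so the interface boundary is explicit and
# machine-enforced even though everything lives in one repository.
cc_library(
    name = "billing",
    srcs = ["billing.cc"],
    hdrs = ["billing.h"],
    visibility = ["//services/payments:__pkg__"],
)
```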
let’s split our project into multiple libraries so it can be reused within the company or even better if we put it on GitHub everyone can use it.
After a few months you notice nobody within the company cares about your nicely versioned libraries and on GitHub you get more and more complaints that the libraries are very limited and need more features.
After that you merge everything back into your monorepo and try to forget all the time wasted on git bureaucracy, versioning, dependency handling and syncing changes between repos.
I have to laugh at that. Two of the biggest banks, Goldman Sachs and JP Morgan, are heavily monorepo.
It actually makes all certifications and auditing easier. The shared tooling/platform can be checked, everything else can ride on it. Half the questions of certifications are about tracking changes... easy when it's all tracked by the repo.
We've had to go through similar-to-PCI compliance hoops for our monorepo, and settled on a solution that didn't degrade the median developer's velocity too much.
I'm curious to know what other monorepo companies had to go through to satisfy the compliance people.
Thanks for sharing that. It's mind-blowing that Google runs everything from a single repo. Most places that I work with create several repos for just one project.
Yeah, it was a significant infrastructure cost. I heard at one time that the single largest computer at Google was the Perforce server. They ended up completely re-writing it (called "Piper") for scaling.
This is sort of what I was alluding to with my comment about Google having different concerns than many other companies. They can afford to dedicate a number of engineers to maintaining a monorepo system and then re-writing when it doesn't scale. That said, I personally believe that there are a lot of benefits to monorepos, and I think those tradeoffs are worth it for other companies too.
It depends. At a high level, "at scale" you'll have to solve all the same problems for both, to the point where you have a dedicated team or teams solving those problems. Monorepos don't automatically solve issues of version skew or code search or code consistency, and multirepos don't automatically solve problems of access control or performance or partial checkouts. All a monorepo strategy does is say that all your source files will share the same global namespace, and all a multirepo strategy does is say that they can have different namespaces (often corresponding to a binary or grouping of closely coupled binaries). Everything after that is an orthogonal concern. As far as it goes, conceptually monorepos appeal to me, and they offer more discoverability and a simpler, more consistent interface than multirepos. It's also worth considering that there must be some kind of trade-off if you need to pull in the abstraction of "separate repos" to handle code: typically you have fewer guarantees about the way source files will interact when they're in separate namespaces, which makes some things harder.
But if you're just starting out, you're going to be going with off-the-shelf components. Usually this is git hosted on GitHub, GitLab, or something similar; there's a good chance you're going to be using git. Vanilla git works sub-optimally once you reach a certain number of contributors, a certain number of different binaries being generated, and a certain number of lines of code, as a lot of its assumptions (and the assumptions of folks who host git) focus on "small" organizations or single developers. You aren't going to have a good time using a vanilla git monorepo with tens of millions of lines of code, and hundreds of developers, and dozens of different projects, even though in principle you could have a different source control system that would function perfectly well as a monorepo at that scale.
My general approach would be to start with a git monorepo, do all development within it, and once that becomes a pain point migrate to multirepo.
You always have a monorepo whether you realize it or not. The only difference is how your monorepo is organized.
If you split your codebase into lots of little SCM repositories and manage dependencies by pushing prebuilt artifacts here and there, all you've done is create a new meta-repo system on top of your SCM. And usually, this dynamic meta-repo coordination system is not only implicit, not only undocumented, but also not even understood by any single human being --- yet your project's success depends on the operation of this meta-repository built accidentally and unconsciously from coordinated development in tons of little repositories.
I'm strongly in team formal-monorepo. Just make a single big repository that holds everything you depend on. Be honest and direct about the interactions of your components --- explicit, not accidental, as in a microrepo ecosystem.
The performance-based arguments against monorepos don't apply as strongly as they once did: Mercurial and Git (moreso the former) have seen a lot of work for optimizing big repositories, and both support a sparse checkout model that allows people focused on specific tasks to focus on a small part of the larger repository without splitting the repository apart.
The only legitimate reason I have encountered so far for not using a monorepo is open-source dependencies that the company maintains patches to/contributes changes to. In this sort of situation, you'll want to have them in their own repository so that you can go from "foolib v5 plus our changes" to "foolib v6 plus our changes" without too much fuss.
Use them whenever things are one project, only split them up when it's legitimately separate projects.
At work I set up a repo with our back-end code in one directory, the front-end in another directory, and the Ansible playbooks/other deploy stuff in a third.
This allows us to automatically create servers for every branch because we know that everything goes together. Working on other projects where they have all of these separate is a nightmare. Especially when they use microservices and each service has its own repo and each library has a repo... I want to smack anyone who ever thinks that is a good idea
The big advantage of a monorepo is that dependency on changes to a "library" or component that gets reused in multiple places can be made explicit. A user can commit a change that safely modifies a library or component in a non-backward-compatible fashion AND changes all the places that use it. This is far simpler than the alternative of releasing a non-compatible version of the library/component and maintaining both in parallel for a period while the users transition.
The big advantage of separate repositories is that the work done on them is very clearly marked as independent. It becomes easier to take one component out and utilize it elsewhere... but at the cost of having to deal with dependency management. Dependency management is simple and automatic in easy cases and really terrible in hard cases (like when two components you use mandate mutually exclusive versions of a third component).
So I think the driving factor in the decision is the tradeoff between the degree to which you are willing to invest in careful packaging of each component (and the corresponding dependency management), versus your willingness to force everyone and everything to utilize the monorepo.
(Notice I didn't mention performance issues. I really don't think they are important enough to control the decision.)
A monorepo is a transfer of convenience from people who make services to people who make libraries. A library author's change automatically runs the tests (pre land) and CI/CD pipelines (post land) of every dependent service, where previously it would have taken months of wrangling to distribute a library release this widely. But service author laptops are now struggling under the weight of every line of code at the company. Git operations take long enough for you to lose interest and tab over to HN. After a git pull, many of your dependencies have changed which can break things (at worst) or just invalidate a lot of cached compilation and make the next build slow (always). You need the build system to generate a "mask" for your IDE so that code intelligence doesn't try to index the whole thing in memory, etc. Buck and Bazel are less familiar and less ergonomic than languages' native or customary toolchains. You need further tools (or a lot of manual effort) to keep the monorepo build system's own configuration files in sync.
Mono-repos are either a shortcut to avoid release management and dependency management, or are a way to manage development at scales approximating Google's.
First, for almost all companies copying non-selectively what Google does is harmful - you are not Google.
Second, if you are a small team and your code only produces one binary that is shared across team boundaries, then you might be able to make do with a monorepo. But if you are refactoring pieces of your code into libraries (you probably should be) and sharing those libraries for use by other teams in your company (you maybe should be), then you probably shouldn't use a monorepo.
However, converting may be a painful process. Using multiple repos, one per shared product (library, SDK, web site, documentation package) is harder to manage because you need to manage your dependencies much better and pull in the correct versions for the current product.
You also will need to communicate changes on products better across your teams, ideally with automatic notification of new releases that include a changelog and examples of new/changed features.
As part and parcel of the above, you'll need better release management, including a release repository. What you use will depend on your environment, the language(s) your team uses, and your budget. There are some good open-source dependency repositories out there that can accommodate npm-style dependencies, Java dependencies, and others (e.g., Artifactory).
In summary, migrating away from monorepos is going to mean: investing in good DevOps people and giving them what they need to create/install/manage good processes; learning these new processes and dealing with the added work they impose on developers. But it also will likely give you better products and, over time, both speed up development and reduce defects (in part through more well-defined API touchpoints).
> Mono-repos are either a shortcut to avoid release management and dependency management
In a small company, release and dependency management are pointless busy work. And in larger companies, you have enough labor to manage monorepo tooling.
> Using multiple repos, one per shared product (library, SDK, web site, documentation package) is harder to manage because you need to manage your dependencies much better and pull in the correct versions for the current product.
This is the pointless busy work I was alluding to. See also: the diamond dependency problem. Meanwhile the benefits are unclear.
Multi-repos makes sense for entirely disconnected projects or products, or for collaboration in the open-source world. Most companies aren't going to run into significant issues with scaling a monorepo and they'll save a ton of work.
Avoiding release and dependency management is not a shortcut - it's avoiding a needless detour. Management is overhead, overhead requires resources like the 'good devops' people you mention. Many companies and teams are lean and must find ways to work more efficiently - mono repos are simpler so they are the default option. Only if you have a really good reason to support multiple versions of software simultaneously or have real independent teams with boundaries should you consider making your process more complicated by breaking this up.
"The first rule of distributed systems is don’t distribute your system until you have an observable reason to."
Mono repos can be useful if you have different services that are highly dependent on each other. This lets you package releases in a simple way by just tagging the repo. That tag/release gives you the exact version of all the components that you need.
But CI becomes substantially more complicated. You have to figure out with every commit what actually changed and based on that change what needs to be tested and built. You don’t want to run a build of all components if you only changed a small piece of one component.
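A toy sketch of that change-based selection, assuming the build tool can give you a target-to-dependencies map; the target names are made up:

```
# Given a map of target -> direct dependencies and the set of targets whose
# files changed in a commit, walk reverse dependencies to find everything
# that needs to be rebuilt and retested.
from collections import defaultdict, deque

def affected_targets(deps, changed):
    rdeps = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].add(target)
    affected, queue = set(changed), deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

deps = {
    "services/api": {"libs/auth", "libs/db"},
    "services/billing": {"libs/auth"},
    "frontend/web": {"libs/ui"},
}
# A change to libs/auth hits both services that depend on it, but not the
# unrelated frontend.
print(affected_targets(deps, {"libs/auth"}))
```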
We just build and test everything every time. Maybe it costs us more $$$ on CI, but it costs far less in developer time. That being said, I mostly work at startups and larger companies will certainly cross a threshold where this isn't feasible.
> larger companies will certainly cross a threshold where this isn't feasible.
Nope. I worked at the largest US bank, which happens to use a monorepo. They had compute grids with thousands of cores running tests 24/7. That cost is not sizable compared to the cost of the developers writing and maintaining the software.
Google, Twitter, and Facebook have open sourced their solutions to this. Bazel, Pants, and Buck. Each have some rough edges but I'm currently migrating a bunch of stuff to Bazel and the workflow and tooling for it is pretty amazing.
I don't have a lot of experience with monorepos but I've found them very useful for:
- Simple projects with a server and an SPA component - frontend and backend code for the same feature is on the same feature branch, can be tested and reviewed together.
- Projects with a couple of microservices that share some common libraries - these libs can be directly referenced by the microservices instead of being published on an internal package server. This of course has its drawbacks too but overall, the benefits have outweighed them.
It's a tradeoff. Yes, you can only remove something if all callsites are gone, need to take care of breaking changes etc, but on the other hand you have clear insight into that and individual components can't as easily drag behind and require you to keep old versions around/maybe even backport fixes/...
And ideally a monorepo workflow gives you the tools to a) find all callsites, b) run tests to ensure your changes won't break anything, c) apply changes in lockstep. Those tools are an engineering effort, but the equivalents for a large multi-repo setup are too.
I find internal libraries are incredibly frustrating to manage with microservices. I spend more time changing the library, updating each service to the new version, getting PR approvals, and deploying than I would with a monorepo. Additionally, verifying that changes in my library don't break any of my clients is more difficult with distributed repos.
This is all with a fairly robust CI/CD pipeline, so it's not like I'm marred by inefficiencies elsewhere; it's just an annoying process that makes me (and others) not want to write libraries.
edit: I should qualify that this is most annoying when you own the library and the services that consume the library. If the burden of the upgrade was left to my customers then the tradeoff might be different for me :)
I’ve pondered the same thing in my time working for web startups.
I think a good goal to instead shoot for is being able to extremely trivially build and ship a “latest” of your entire product.
You know your product doesn’t grow in sane predicable ways, and you know you won’t have resources dedicated to maintaining internal dependencies, so why try to half-ass a complex internal dependency tree? What ALWAYS happens is unexpected feature requests break clean APIs setup by engineers, which creates all this fake work of “making the breaking change” across the dozens of dependents.
Something like a monorepo could help prevent issues like this, but it’s not a guarantee you’re making it easier to ship one latest version of your software.
What makes Amazon's multirepo setup "tick" isn't the multirepos themselves but the coordination mechanisms around them (version sets, Brazil, and pipelines). Whereas Bazel/Buck are open source and work well enough to scale monorepos, I don't think there are equivalent tools out there for multirepo, and it shows in the comments from people with multirepo pain.
EDIT: that being said, I do think multirepo is better, but only when you have the above tools to manage it.
Agreed that Amazon build and deploy is vastly ahead of bazel/buck. That said, git just works so much faster on a smaller repo. There's significant overhead trying to use git or hg on a huge monorepo that really takes a toll on development.
This isn't answerable in short form comment thread like you would find here.
Besides the structure of the repo itself, you have to consider the "base" SCM software, as well as the interfaces that are and can be built around it.
You are probably limiting yourself to a specific implementation of git, and for that, a specific and relatively concise answer can probably be given. But the issue in general is much, much more expansive than that. The question as posed doesn't give us enough information to give a proper answer. As such, you can see how the answers so far are all over the board.
Pros:
- You can more easily switch to code dependency instead of binary dependency. With things like Gradle becoming able to pull and build git commits, this is less of a win.
- It's easier to kick off downstream builds, because everything is in the monorepo and you can more easily track dependencies.
Cons:
- It's a technical challenge. Git isn't great at it, although it's getting better.
- If you do want to lean into linking source instead of binaries, you now need a more consistent build system across your codebases.
I prefer working in them, and in my opinion it's approaching a personal-preference choice.
Having experienced both the monorepo approach (at Google and Lyft's L5 autonomous division) and the manyrepo approach (at Lyft's main rideshare division), my conclusion is that keeping as much code in a single repository as possible (i.e. a monorepo) is generally the best approach.
The downside of manyrepos is that you often have to merge multiple changes into different repos in order to achieve a single logical change, and each of these changes requires a code review, waiting for CI, etc. For example, you may have a library that is shared by several services that your team owns. If you want to change some logic that lives in the library and propagate that change to a service, you have to make the change in the library and then bump the library version in the service. The latter change, while simple, is pure busywork, and at Lyft this second step easily adds 1-2 hours of overhead between waiting for code reviews and waiting for CI. Monorepos therefore make it much easier to share code and meaningfully reduce overhead for your team.
One of the cited downsides of monorepos, VCS scalability, only really kicks in for very large teams. At Lyft L5, we have a single shared Git monorepo hosted on GitHub that hundreds of engineers contribute to daily, and to my knowledge we haven't hit serious problems with Git itself (although I last worked in that org about a year ago). We did run into a few peripheral issues though:
- People kept inadvertently merging things that broke master. This would happen when two incompatible changes were merged at nearly the same time, or two incompatible changes were merged a few days apart but the second change was based off a stale master from before the first change, so tests pass on the branch but not after merge. We ended up solving this by having "deliver" branches that are basically master branches for a single team; if you break the deliver branch the only people you have to answer to are your immediate teammates, and it doesn't stop all other merges across the org. Deliver branches are periodically merged into master by a release manager who handles merge conflicts.
- CI got progressively slower. We addressed this by making improvements to the build system, optimizing tests, granularizing dependency graphs, using C++ idioms like forward declarations and PImpl to reduce dependency chains, and so on.
If you have a monorepo, you probably want to use a tool like Bazel. And since Bazel is, to my knowledge, the best tool in the world for doing what Bazel does, that means you probably want to use Bazel. Bazel has you specify your dependency structure as a DAG, and then allows you to quickly re-run tests on PRs based only on the code that changed. It also makes builds for C++ and other compiled languages blazingly fast, and if you're building C++ I don't think there's a better build system out there, monorepo or no. Bazel is a complex tool though so I'd encourage you to read through its highly detailed user manual, and if you have any questions ask on Stack Overflow where you'll often get a response directly from one of the core maintainers.
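A small sketch of what such a DAG looks like in BUILD files; the package layout and target names here are invented for illustration:

```
# libs/auth/BUILD (hypothetical)
cc_library(
    name = "auth",
    srcs = ["auth.cc"],
    hdrs = ["auth.h"],
    visibility = ["//visibility:public"],
)

# services/api/BUILD (hypothetical)
cc_library(
    name = "api",
    srcs = ["api.cc"],
    deps = ["//libs/auth"],  # an edge in the dependency DAG
)

cc_test(
    name = "api_test",
    srcs = ["api_test.cc"],
    deps = [":api"],
)
```

A change to //libs/auth invalidates //services/api:api and :api_test but leaves unrelated targets cached, which is what makes PR-level selective testing cheap.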
We are breaking our monorepo into 3 kind-of monorepos based on release cadence: a repo for each of two large applications, plus a third repo for shared library code. We import the shared library code into the app repos using a git submodule. Has anyone else used git submodules to import common code?
We've used submodules pretty extensively in the past but we're working to get rid of as many of them as possible. A friend and former colleague liked to say they were the worst solution... with the exception of all the other solutions. Practically speaking they are probably more confusing to new developers who haven't had a lot of experience with them. We would often run into problems that stemmed from someone making a change in the submoduled repo and then forgetting to update the ref in the parent. Or people would navigate to the submodule to do the work there rather than checking it out separately, and then be confused over the detached HEAD status and check out a branch (not a problem, necessarily, but confusing to some extent). For some of our dependencies we've gotten rid of submodules in favor of hosting internal pypi packages.
Git submodules have a comically bad user interface imho, but they work really well, so I use them despite this. As with git in general, the core functionality is solid. There is also something to be said for using as much of the core functionality as possible without resorting to another piece of infrastructure. People can use whatever front end they prefer on their machine, but in a pinch the core git package is all you need.
I've heard many teams are breaking repos up by team, and I think that makes sense for many non-Google companies, but what I wonder about is whether or not clients and backends should ever be bundled in one repo.
My gut tells me they should be separate, but I'm curious what other people think.
Is there any webpage or book that explains how to set up a monorepo and what not to do? I haven't really thought about it until now, but certain things sound kind of complicated, though I could see it simplifying other things.
Simplicity, and being able to check out and work on single folders. The latter is not directly related to the repo being mono, but usually you wouldn't use git as your VCS when working with monorepos.
Code reviews and ownership boundaries are less clear on a large code base in a monorepo. Depending on your team culture, this will either be a pro or a con. While careful design of the build system can mitigate build times, there is likely to be some disagreement on basic practices, such as how to launch a service or when to use Spring vs. Guice vs. Rails, that will be tougher to resolve than simply letting everyone go their own way.
Refactors and cross-service deployments can become much simpler in a monorepo.
Do people with big monorepos always clone the whole thing? Our CI/CD (TeamCity) always seems to check out the whole repo even if you want just one subdirectory.
It would also be interesting to hear about the tooling available/recommended for maintaining monorepos (ideally FOSS tooling, but good proprietary options as well).
I won't try to be exhaustive here, but I think it's worth mentioning a few things:
One con is that most open source (as well as publicly available but proprietary) tooling is geared towards the non-monorepo approach. So if you want to use a monorepo, you're going to have to fight a bit of an uphill battle because most tools and processes assume you have lots of little repos. For example, build triggers in a lot of CI/CD tools operate on a per-repo basis.
There are some random benefits - it's easier to make everything public-by-default, which encourages people to look at source code written by other teams, and creates a culture of transparency and internal openness.
But the big thing in my opinion isn't directly the monorepo itself, but what's known inside Google as the "One Version Rule"[1]. Basically this means that only a single version of any package should be in the repo at a time. There are exceptions, but that requires going to extra effort to exempt your package from this rule.
I guess Chrome and Android are examples of this - they are made up of lots of little Git repos that are stitched together, but they generally follow the One Version Rule. On the other hand, if you just stick a lot of npm modules in there and every single one has a separate package.json file, then it's technically a "monorepo" but it's not following the One Version Rule.
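To make the One Version Rule concrete, here's a hypothetical sketch (not Google's actual layout): every consumer depends on a single third_party target, so the pinned version lives in exactly one place:

```
# third_party/requests/BUILD (hypothetical). Teams depend on
# //third_party/requests rather than pinning their own copy; bumping the
# version behind this alias updates every consumer at once.
alias(
    name = "requests",
    actual = "@pypi_requests//:pkg",  # hypothetical external repo target
    visibility = ["//visibility:public"],
)
```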
You also need good test coverage. Not just in terms of line coverage or some other artificial metric, but to the point where you could say "I feel reasonably comfortable that a random change will be caught by my tests". This lets people in other teams catch regressions without having to have a detailed understanding of your team's codebase.
So once you've got these three things - 1) a monorepo 2) with everyone following the One Version Rule and 3) lots of tests - it means that dependency owners can update all of their consumers at once without much effort. They just make a change to their base library, and the build system will walk the dependency tree and figure out all the consumers that could possibly break, and runs all the appropriate regression tests.
This is the inverse of how it normally works at large companies, where each team pulls in their dependencies and pins them to a specific version. At most companies, updates require extra effort, so the default is to let everything go stale. This is especially problematic when security vulnerabilities are released (e.g. to an ancient version of jQuery) but teams can't update until they migrate off of an old API. It also means that library owners regularly have to maintain multiple old branches for months or even years after the initial release, because everyone's too afraid to update.
I personally think it's a myth that you need to be "Google scale" to benefit from a monorepo. In my opinion, you only need a few tens of repos before all the different combinations of semvers get unwieldy. For me, going from Google's monorepo to a company that is built around lots of little repos in GitHub Enterprise felt like going back to the CVS/RCS days, where every single file had a separate revision number and changes weren't made atomically.
Agreed - that's why I said that the big thing is the One Version Rule, not the monorepo itself. If you're not following the One Version Rule, then end-to-end tests have to deal with an explosion in the different combinations of versions of everything. It quickly becomes unmanageable.
1. I've only used monorepo professionally once in a big org and it was SVN based
2. I'm going to look at monorepos from the perspective of git and github
The main argument I can think of in favor of monorepos is to keep cohesion between different parts of your system as high as possible. For example, if you want to make a change to the load balancer regarding TTL, you can also go and make the corresponding change in your API and your mobile clients, and in the end you create one single PR and have your tests run against that single revision.
Compare and contrast the same scenario in the traditional multi-repo approach: you make your changes to your `infra` repo, then you make the same change in your `api`, `android-client`, and `ios-client` repos, and in the end you create 4 PRs that need to be reviewed at the same time. Which one would you prefer to do?
Potential arguments against are:
1. Too much noise - If multiple people are working on the same repo, you'd be getting emails for PRs for parts of the code that you may not care about.
2. Longer clone/pull times - In the same vein as before, the initial `git clone` and maybe every `git pull` after that will be bringing in a lot of code that you may not care about, increasing the time of each operation and increasing your frustration.
3. Access control - How do you limit who has write access to which part of your repo? AFAIK this isn't possible in git but it may be possible with codeowners in github, I don't know.
4. Organization - How do you structure your monorepo? Do you know in advance how many components it's going to have? How is a restructuring going to affect your history and/or dependencies between different components of your code?
5. History and code sharing - With a monorepo it's more difficult to share/open source just part of your code and keep the history intact.
Having said all that, I think the monorepo is a good match at a service level, not at an organization level. I know Google and Facebook have gone all in at an org level, but first of all they don't use git and second, they have the luxury and the resources to make it work for their use case. At a service level you should have a good idea of your boundaries, your applications and your infrastructure and assuming a relatively small engineering team (~10-12 people) eventually everyone would be confident enough with all parts of your system.
Personally I really like the idea of having one hash define the entire state of the world. I have a few common utilities that I use across my projects and I find it too annoying to keep them all up-to-date between different repos. I hope that eventually I will find some time (or be annoyed enough) that I will converge everything in one repo.
There is zero difference between using repos to organize things and using folders to organize things, except that dependencies across repos require extra steps to update when they change. More repos = more annoyance.
Use folders to organize things; it's common sense.
Additionally, don't use classes to organize your code, use combinators.
Folders and combinators will, imo, solve basically 95 percent of all organizational problems related to design.
Things like microservices, multiple repos, and GoF design patterns only serve to make the organizational problem worse.