I think that having one large repo helps identify cross-project dependency breakage faster, e.g. on a small team without fully automated integration tests, by increasing the likelihood that the person or team who broke the integration notices, rather than the person or team who maintains the affected components.
There's also the issue, as jacques_chester notes, of shared components, some of which are far too small to be their own projects but don't necessarily make sense thrown into a pile with other projects.
Project-specific repos make a lot of sense from an organised, a-place-for-everything-and-everything-in-its-place perspective, but real life is often quite messy and mutable, and the proper organisation for a project can change frequently (as the article notes); there's no sense chiselling it into stone.
This may be better for them, but it wouldn't work for example for OpenStack, which has many projects available and released separately. Putting nova, stevedore, anchor, bandit, etc. in one repository just wouldn't make sense - they have their own versioning and live their own lives, even if they will be frozen/released in one go as a single working deployment in October.
So when the author writes "When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories." they just don't have that use case. They don't have to care about releasing different bits separately.
The question really is whether you can scale your version control practices, build tools, and source organization habits across many diverse projects.
As far as I can tell, at least the Android bits live in many repositories: https://android.googlesource.com/ - I'd say over 300, with one for each utility.
In fact, with hundreds of developers and everything in one repo, I don't see how you'd ever be able to get a commit through: you'd constantly be merging commits that others just made, and would have to get lucky?
The problems arise when you have to combine code from different repositories into a single deployable product. Most of us don't take Amazon's hard-line stance of making absolutely everything a microservice, so we end up with libraries of reusable code that are referenced by multiple projects. But when you store those libraries in separate repositories, it becomes impossible to describe the state of your deployed code without listing the version of every single dependency. That makes it easy for subtle inconsistencies and bugs to creep in, especially when the dependencies are multiple levels deep and are owned by different teams. If everything lives in the same tree, then a single commit ID reproducibly describes a complete system from top to bottom. And you can atomically make changes that cross module boundaries, which is difficult to do safely with separate repositories.
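A minimal sketch of that last point, using hypothetical repo names (`libfoo`, `app`): in a monorepo, a single commit hash is a complete deploy manifest, while with multiple repos the manifest is one (repo, version) pair per dependency.

```shell
#!/bin/sh
# Sketch: one commit ID pins the whole monorepo; with many repos you
# must record a version for every repository. Repo names are made up.
set -e
work=$(mktemp -d)

# Monorepo: libfoo and app live in one tree.
mkdir -p "$work/mono/libfoo" "$work/mono/app"
cd "$work/mono"
git init -q
git config user.email dev@example.com && git config user.name dev
echo 'v1' > libfoo/lib.txt
echo 'uses libfoo' > app/main.txt
git add -A && git commit -qm 'initial'
echo "deploy manifest: $(git rev-parse HEAD)"   # a single hash

# Multi-repo: the manifest is one line per repository.
for r in libfoo app; do
  mkdir -p "$work/multi/$r"
  cd "$work/multi/$r"
  git init -q
  git config user.email dev@example.com && git config user.name dev
  echo 'v1' > file.txt
  git add -A && git commit -qm 'initial'
  echo "$r $(git rev-parse HEAD)" >> "$work/manifest.txt"
done
cat "$work/manifest.txt"   # one line per repo; real systems have many more
```

With two repos the list is manageable; with dozens of transitively-owned dependencies it is exactly the bookkeeping problem described above.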
I don't really follow your comment about merging. Pretty much every version control system since forever has been smart enough to realize that, if I make changes only to foo/src/ and you make changes to bar/src/, our changes don't conflict and can be merged automatically without user intervention. (There might be technical difficulties; for example, if you're using Git, I would imagine that trying to view the list of commits of a small subtree of a gigantic repo might not be terribly efficient. But just like the issue of managing multiple repos, that's something that you can solve with better tool support, if you really need to and are motivated enough.)
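To make the merging point concrete, here is a small runnable demo (hypothetical `foo/src/` and `bar/src/` layout): two branches that each touch a different subtree merge cleanly with no manual intervention.

```shell
#!/bin/sh
# Demo: edits confined to foo/src/ and bar/src/ merge automatically.
# Assumes git >= 2.28 for `git init -b`.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email dev@example.com && git config user.name dev
mkdir -p foo/src bar/src
echo 'foo v1' > foo/src/a.txt
echo 'bar v1' > bar/src/b.txt
git add -A && git commit -qm 'initial'

git checkout -qb alice         # Alice only touches foo/src/
echo 'foo v2' > foo/src/a.txt
git commit -qam 'alice: update foo'

git checkout -q main
git checkout -qb bob           # Bob only touches bar/src/
echo 'bar v2' > bar/src/b.txt
git commit -qam 'bob: update bar'

git checkout -q main
git merge -q alice             # fast-forward
git merge -q --no-edit bob     # clean three-way merge, no conflict
grep 'v2' foo/src/a.txt bar/src/b.txt   # both changes present on main
```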
This is a tooling problem, that is somewhat solved with a gigantic mono repo. All of our shared components are published with NuGet, and referenced as such. There is no interrepo dependence, and there shouldn't be.
Of course you effectively organise your repos in a 'tree' like e.g. happens with the Linux kernel, but, well, then you again have many repos. One per component that has someone responsible for signing off on them. It reduces the amount of people committing to each specific repo, but in the end, it's still many repos. So that's not 'one monolithic repo' in my book.
You're running into problems because you have multiple people all trying to push to the same branch, and Git errs on the side of extreme caution rather than creating a merge that you didn't specifically request. If you use merge requests instead, as supported by Github/Gitlab, you don't get blocked by other people's commits.
(Actual merge conflicts are an orthogonal problem. They'll happen whenever you have multiple people editing the same code, regardless of what VCS you use or how it's organized.)
The Linux kernel example is a bit of a tricky one. Yes, kernel.org hosts a lot of repos, but they're all different versions of the same codebase. They have a common commit ancestry, and commits get merged from one to another. So they're not really separate modules in the sense that we're talking about; they're more like namespaces that each contain a set of branches.
In the good case (almost every time), there were no conflicts, and the merge went fine (we had unittests, builds and regression tests as extra checks in our CI system).
In the bad case, the developer request was rejected and the developer was told to rebase or merge his code on his own, so the merge issues would be handled.
The problems arise when you have to combine code from
different repositories into a single deployable product.
But when you store those libraries in separate repositories, it
becomes impossible to describe the state of your deployed code
without listing the version of every single dependency.
If everything lives in the same tree, then a single commit ID
reproducibly describes a complete system from top to bottom.
And you can atomically make changes that cross module
boundaries, which is difficult to do safely with separate repositories.
So I just don't recognize any of:
I don't want to waste time dealing with multiple commands
to manage multiple repositories. I don't want to waste
time or expend cognitive load dealing with submodule,
subrepository, or big files management. I don't want to
waste time trying to find and reuse code, data, or
documentation. I want everything at my fingertips, where
it can be easily discovered, inspected, and used.
I have a directory with a tree of repos. They could be in one big repo, but aren't. Otherwise, on disk, everything is the same. If I change things in two related libs, I have to commit twice, but I'm also changing two independent libs that I have to release separately anyway, just like in a monolithic repo. Finding and reusing code is independent of how many repos.
I just don't see any concrete problem in the article that I can relate to that explains how what we do is more work, causes problems, or can be optimized.
There is still a lot in the monolithic repo. I heard "1TB" when I asked a Googler. Also, they don't use Perforce anymore, at least for the central repo, because Perforce requires a single host. They built a distributed monolithic-repo server themselves.
(disclaimer: I have no proof for any of this and might misremember)
When I've brought this up before, people occasionally mention submodules or subrepositories, and those are equally broken. They make big assumptions about how you're going to organize repositories (i.e., a strict tree), and if your design doesn't fit that organization scheme, you're up a creek. For practical development, the subrepo tree effectively becomes one monorepo anyway: touch the innermost subrepo, and now you need to add a commit to all the outer repos to record the new version of the innermost subrepo.
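The commit cascade is easy to demonstrate with git submodules (hypothetical `inner`/`outer` repos): one commit inside the submodule leaves the outer repo dirty until it commits a pointer bump of its own.

```shell
#!/bin/sh
# Demo: a change in the innermost submodule forces an extra commit in
# every containing repo just to update the recorded submodule pointer.
set -e
work=$(mktemp -d)

git init -q "$work/inner"
git -C "$work/inner" config user.email dev@example.com
git -C "$work/inner" config user.name dev
( cd "$work/inner" && echo v1 > lib.txt && git add -A && git commit -qm v1 )

git init -q "$work/outer"
git -C "$work/outer" config user.email dev@example.com
git -C "$work/outer" config user.name dev
( cd "$work/outer" \
  && git -c protocol.file.allow=always submodule --quiet add "$work/inner" inner \
  && git commit -qm 'add inner submodule' )
git -C "$work/outer/inner" config user.email dev@example.com
git -C "$work/outer/inner" config user.name dev

# A change committed in the innermost repo...
( cd "$work/outer/inner" && echo v2 > lib.txt && git commit -qam v2 )

# ...leaves the outer repo dirty (status reports inner as modified)...
git -C "$work/outer" status --porcelain

# ...until the outer repo commits the new pointer.
git -C "$work/outer" commit -qam 'bump inner pointer'
```

(The `protocol.file.allow` override is only needed on newer git versions, which disable file-protocol submodules by default; it is ignored harmlessly on older ones.)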
A saner way to handle repos is to recognize that it's not necessary for people to have the full history of everything in the repository stored on their local machine most of the time. This is something that SVN does better--you can check out subdirectories of an SVN repo, but the commits are still atomic across the entire repository.
Granted, this is based on experience at several startups and a single huge company with a famous monorepo. Maybe there is a size in the middle where a single dev can still understand all the code well enough to make a sweeping commit without breaking lots of things.
Local history enables rebasing. Of course you can do it without local history already present; in SVN, say, you'd use separate branches for changes and 'rebased' changes, where you merge your work on top of some new state of trunk. But this involves creating branches (and at least one checkout for the separate branch folder) and communicating with the server (which takes time). This all means that people almost never do this.
With git, rebase is a snap.
But of course one can live without rebase, it's not oxygen or something.
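The whole SVN branch-and-merge dance described above collapses to one command in git. A self-contained sketch (hypothetical `feature` branch):

```shell
#!/bin/sh
# Demo: replay local feature commits on top of a trunk that moved on,
# with a single `git rebase`. Assumes git >= 2.28 for `git init -b`.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.email dev@example.com && git config user.name dev
echo base > trunk.txt && git add -A && git commit -qm base

git checkout -qb feature
echo feature > feature.txt && git add -A && git commit -qm 'feature work'

git checkout -q main                 # trunk moves on in the meantime
echo more >> trunk.txt && git commit -qam 'trunk moves'

git checkout -q feature
git rebase -q main                   # one step, no server round-trips
git log --oneline                    # linear history: feature atop trunk
```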
I spend a lot more time on the build and administration side of things than the code side, and I personally prefer more smaller repos. Builds are faster and less error prone, less disk space is used overall (regardless of cloning scheme - I have used them all), and I do believe the separation and inherent difficulty aids quality at the expense of productivity. I'm about the only person in the company who does, though, and that tells me this discussion depends more on how you personally interact with source control than any abstract 'monolithic vs. not' ideal. Or that I'm crazy, but I refuse to accept that.
For example, Servo, the project I work on, consists of around 150 small repositories (in contrast to every other browser engine). Lots of them are maintained by the Servo project, but lots of them aren't. They're separate projects of which Servo is only one user of many. The fact that we can take advantage of the fantastic work of the Rust community by simply adding a couple of lines to our Cargo files has been invaluable in helping our small team get the browser off the ground. A culture of monolithic repositories would discourage code sharing, leading to more wheel reinvention and less code collaboration overall.
I think the model works well for Google and Facebook because they're big centralized companies with thousands of engineers under one management and reporting structure. But that's a far cry from the aggressive, fine-grained code reuse culture that the Ruby, JS, and Rust communities (for example) have fostered.
- is there one build system for everything, or one per project?
- are there dependencies between projects?
- are they symlink-vendored? (which means that potentially I need more than one toolchain if projects A and B are in different languages)
- are they completely separate? (do they always assume latest version of your projects, or do they have reasonable version qualifiers)
- are any elements included cross-project, or can I just copy one directory and package it separately?
Those and other similar questions just don't exist for well-maintained projects that have separate repos. I know that I can clone one project and build it.
- How does the build system integrate multiple, discrete repositories into a unified system?
- What are the dependencies between the repositories?
- How are the sub-repositories laid out on disk? Do the separate repositories use separate toolchains?
- How do I decide when to update the reference to a sub-repository? Are they completely separate? Versioned as one logical entity?
- Do separate repositories reference elements in each other? Can I copy files between repositories or should certain files live in certain repositories?
These and other similar questions exist when you use multiple repositories.
(I hope you see that multiple, discrete repositories aren't a panacea and there is a counterpoint to each of your points.)
1. (build integration) All popular build systems have some dependency management answers. Even down to C's `autotools` and `pkg-config` which will at least tell you what you're missing. But more likely something like pip / gem / cargo which can just get it for you. Whether that's a released version, or another repo - none of my business.
2. (dependencies between repos) Same as 1. They're separate projects.
I don't think these apply at all:
3. (sub-repositories) I don't see a difference between sub-repos and symlinking to a repo outside. This is a problem of single repo.
4. (sub-repositories update) Same as 3 - it's the same as one repository - avoid sub-repos unless you want to pretend you have one big repo with everything.
5. (moving elements) I think that's a straw man. Does anyone have a reasonable expectation that a file containing code can be moved between repositories without issues?
While no repository layout is perfect and there are always pros and cons, I think those examples are really bad as counterpoints.
Disclaimer, these opinions do not necessarily represent the opinions of my employer.
I posit that if Git or Mercurial allowed you to clone a subset of directories, the difference between a monolithic repository and a set of smaller repositories would become indistinguishable, as a clone of a sub-directory is functionally equivalent to a standalone repository! The problem is that narrow clone is not implemented in any popular DVCS tool today (but Mercurial is working on it).
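For what it's worth, modern git (>= 2.25) can approximate part of this with sparse checkout: the worktree materializes only one subdirectory, though the clone still carries the full history, so it is not the true narrow clone discussed here.

```shell
#!/bin/sh
# Sketch: a sparse checkout puts only libfoo/ on disk, but the object
# database still contains everything. Repo names are hypothetical.
set -e
work=$(mktemp -d)
git init -q "$work/big"
git -C "$work/big" config user.email dev@example.com
git -C "$work/big" config user.name dev
( cd "$work/big" && mkdir libfoo app \
  && echo lib > libfoo/lib.txt && echo app > app/main.txt \
  && git add -A && git commit -qm initial )

git clone -q "$work/big" "$work/narrow"
cd "$work/narrow"
git sparse-checkout set libfoo   # keep only libfoo/ in the worktree
ls                               # libfoo is present; app/ is gone from disk
```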
Well, let's wait and see how partial clones actually handle this situation. I'm not convinced a partially cloned monolithic repository will be better than what submodules currently do.
Where possible, the goal is to decouple software components by design, not by backpressure from the toolchain.
I've seen the many-repo approach. It's particularly frustrating on distributed systems when a shared component migrates from repo to repo like a sad ronin, sometimes alighting in some of them more than once.
Taken to the extreme, this produces a single main.c file for an entire organisation.
Good software design should dictate the kinds of toolchain backpressure tradeoffs that need to be managed.
The monolithic approach makes more sense for big companies that have a large portfolio of projects, an army of software engineers and the resources to develop their own tools around an SCM. This solves several problems like multiple commits per minute, ever-occurring merge conflicts, code search queries and code sharing. Someone might argue that it can slow down the development process by having to load tons of data onto a local machine, but this issue can be solved by building custom tooling to download a sub-node of the overall repo. At this scale, this is the only feasible way to keep developers productive and to avoid headaches in keeping track of a long array of projects.
The project-based repositories solution has nice qualities as well: service-oriented projects, clear-cut responsibilities and clear dependencies. This seems to be a good solution for a small- to middle-scale organization. You can design a well-documented interface for every service and even expose some of those services to external clients when needed https://www.nginx.com/blog/microservices-at-netflix-architec... . Not that this service design is incompatible with a monolithic approach; it's just that it's easier to come up with a simpler answer when you have silos of knowledge.
At the lower scale, where most startups reside, either one seems a reasonable approach. Although a startup doesn't have the resources or the time to spend on developing custom tools to manage a monolithic repo, it turns out they don't need to, because they have few developers and little ongoing commit traffic. A plain simple git repo works as well as a multi-repo setup. This is a matter of taste and of the organization the founders/early employees wish to create for their project.
It appears the main argument for monolithic repositories is that it improves developer productivity by giving the developer access to the entire organisation's codebase.
What a terrible hack to provide something that could be better managed with other tools. No DCVS is an island.
I've been using it for years mainly because I have a few repos shared between projects and jobs, and using one monolithic repo would mean having to copy those repos around. (At least AFAIK - or how else do people using one big repository use external projects?)
It uses the check in/check out model, so there's no problem with unmergeable binary files. There are per-user access permissions. Branches are folder copies. There's a GUI tool, but you can do everything from the command line as well (I believe that is exactly how the GUI tool does everything).
(UX is a bit hit or miss though. There's no git-style index, and the command line tool's output isn't as convenient to parse as you might like. On the other hand the diff/merge tool is alright and the UI for keeping your branches in sync is fine.)
I've never minded using it.
There are a lot of good points made about the benefits of monorepos, and at Airbnb we enjoy several of them. What hasn't been mentioned is the effort required to do them well: you need specialized build and dependency tools to ensure that you only run builds+tests for the single project that's being changed; engineers have to check out extremely large amounts of data to work on a single subdirectory, or else you need custom tooling to allow them to only check out portions while still contributing to the larger whole; if someone mistakenly breaks a piece of shared code and merges it to master, every project is now broken and engineering work may be stalled unless you have very good debugging tools and testing frameworks to quickly recover from and prevent these kinds of issues.
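A toy version of the "only build what changed" tooling mentioned above (project names are made up): diff the last commit and invoke each top-level project's tests only if one of its files changed.

```shell
#!/bin/sh
# Sketch: derive the set of changed top-level projects from git and
# run tests only for those. The echo stands in for a real test runner.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com && git config user.name dev
mkdir -p payments search
echo a > payments/app.txt
echo b > search/app.txt
git add -A && git commit -qm initial

echo change >> search/app.txt        # a commit touching only `search`
git commit -qam 'tweak search'

changed=$(git diff --name-only HEAD~1 HEAD | cut -d/ -f1 | sort -u)
for project in $changed; do
  echo "running tests for $project"  # only `search` gets tested
done
```

Real monorepo build systems do this per build target with a full dependency graph rather than per top-level directory, but the principle is the same.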
The upfront costs of doing monorepos well are high, and doing them poorly is in my experience a net productivity loss. For large companies with established business models (and I'd consider Facebook, Google, and Airbnb to be some degree of "large" and with "established business models," although Airbnb is clearly still much smaller and pre-IPO), the tradeoff of allocating some number of engineers to work on in-house tools for a much larger engineering team is usually worthwhile, and monolithic repos start to become an attractive option. I'd caution small companies or early-stage startups against monorepos, though: when you're a twenty-person team, that amount of tools work just isn't worth it. Use open-source tools, and spend the rest of your time shipping product.
TL;DR: Facebook and Google have optimized their workflows for their size; if you're not a Facebook or a Google, your mileage will probably vary.
Mercurial has no clear technical advantage over git though. That might change if Mercurial gets narrow clones (only check out subdirectories) working and git fails to clone (haha) that.
Ironically, to make narrow clones happen, Mercurial is going to change its internal structure representing the files in the repository (the manifest) to use a separate manifest for each sub-directory... like git has been doing for trees from day one.
Both of these features are already part of git.
A current experiment is an untracked files cache, which speeds up stuff on large repos considerably. This stuff is actively being worked on -- the git project has always valued performance.
A shallow clone misses history after a certain point in time (well, commit order). A narrow clone misses history except in a subtree. A partial clone looks like a narrow clone in the workspace, but a narrow clone should have substantially fewer objects to download and store.
Large repositories are painful, as each developer bears the pain of integration every time they have to make a change to the repository integrating with all of the other code.
Small, distributed repos are painful when change accrues in larger increments and must be resolved at release/integration time.
As a developer in one of the largest monolithic repos in the world, I feel the pain every day of that pattern.
Distributed repos make writing software easier.
Monolithic repos make testing and running software easier.
Large monolithic repos are an artifact of scale where the individual cost of lost productivity in development is worth the increased performance at deployment and testing.
With a monolithic repo, especially without development branching - everyone is running against HEAD all the time. So while you are developing your component, everything else is changing around you.
We've tried both. We had one large repo (2G in size, mostly due to historical mistakes of checking in large binaries). Now we have many little repos, but we've spent I don't know how many man-months writing custom scripts (based on gyp) to manage this collection of repos.
So far I would say it is a toss-up. For completely separate products that share almost no code, it might make sense to have separate repos. But if you find yourself building a custom set of git porcelains on top of your multiple repos -- you have probably gone too far.
"(I work on source control at Facebook.)
Every problem you mentioned with monolithic repositories is a well-known problem with Git (though some of them do have workarounds, such as clone with --depth). None of them are issues in principle.
With Mercurial we're aiming to provide tooling that scales well while still maintaining DVCS workflows like local commits."
Whether or not one should go monolithic or project-based depends on how tightly integrated those projects are. Folks like Google and Facebook and OpenBSD's dev team develop all their things under a single source tree because those things are pretty tightly integrated and designed to interoperate. Other folks develop each of their things in a separate repo because they expect those things to be independently useful (OpenBSD's subprojects do happen to be independently useful, but AFAIK the priority is generally "OpenBSD first, everyone else second").
Multi-repository environments can also be very manageable depending on the language, runtime, etc. Dividing things into do-one-thing-and-do-it-well chunks (perhaps gems in Ruby land, or crates in Rust land, or packages in Perl land, or whatever) goes a long way toward alleviating the typical non-atomicity of multi-repo setups. Of course, this tends to be easier for open-source projects than closed-source projects (though the closed-source camp can have some of this fun, too; using Ruby as an example, one can run a private gem server just by running the "gem server" command, or - if using Bundler - can even install dependencies directly from Git repos), but it's still a possibility.
Plus, the whole "well Google says it's more productive for them, so we should take it as a general rule" vibe doesn't really sit well with me.
With the monolithic approach everybody is forced to be on the latest version (or to update to the latest version). There can be a lot of contention when trying to get things out - especially when things are moving fast. You need a lot of discipline to make it work and you also are forced to constantly update parts of the code you own. Also, we tend to forget about deployment and management of the build artifacts. I am willing to bet that bigCo's that do this have a separate system for managing the artifacts and performing the deployments.
In the presence of a decent build and deployment system, the only way to go is multiple repos - a logical separation per component / service. The build system can easily track dependencies between components via metadata associated with each of them. You can easily orchestrate builds when something happens, it's easier to rebuild only what changed, and you get granular control over build artifacts.
We have 6 projects on bitbucket - and we would love to move to a monolithic model (we have already seen integration issues because of developer carelessness with deployments, that would go away)... but I'm just not sure on how we will do things like Slack/Hipchat integration, emails on commits, etc.
- Database DDL scripts
- Selenium automatic tests
- A hierarchy with our Java code where we have our libraries, modules, final products and client personalizations of our products.
Our build solution relies on Maven with a Nexus server and Jenkins to auto-deploy Maven artifacts.
- We can do a single commit across everything to update many modules/libs/final products to fix an issue or implement something new, so tracking an issue fix or a new feature across all of them is easier.
- We can't work like in git, where we can create a branch to develop a new feature or fix an issue in isolation from other changes.
- Enforces big commits.
Actually, I managed to use git svn to mirror our Subversion repository, but I needed to create many small local git repositories to track every module/lib/product/personalization that I work on. Also, I have a few scripts to update my local git repos (master branch) against the Subversion repo. This allows me to be more flexible when I work on a new feature or a potentially dangerous change, as I can create local branches, do local commits (and squash them before sending to Subversion), stash, etc.
I think that with this I get the best of both ways of working. I can be more flexible and more productive, as I can switch branches of a project/module/lib with a few keystrokes and do local code versioning. The only thing I miss is being able to do a single commit across modules/libs/products when resolving an issue requires changes in many of them. In that case, I try to use the same commit message, or at least put the issue number in the commit message, to make it easy to track in Fisheye.
The monolithic approach works so smoothly because Perforce makes it trivial to only check out the directories we're working on. It has huge advantages when we're working on interdependent projects.
Perforce does have some historical oddities (top level directories are called "depots" and are slightly different than normal directories), but the ability to branch, merge, and check out using normal filesystem concepts is a huge usability boon.
That alone justifies a single repository.
TortoiseSVN has a feature of pegging externals at a specific release when tagging the main project, but this only works at one level.
Also, they have a ridiculous number of machines dedicated to running automated tests. My tech lead told me "feel free to add as many tests as long as they're under 30 seconds each".
I was super skeptical of perforce before starting at Google, but it works really well, especially in conjunction with the code review tools and processes
How are the dependencies specified? Is this a language-specific tool?
However, that's not enough when you touch a fundamental library and nearly everything is a downstream dependency. Then you really do need to run all the tests, so the build system should be able to run them in parallel on a build farm. Also, test results need to be cached by the build farm so every user doesn't need to run the same tests again for the same change.
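The result-caching idea can be sketched in a few lines: key each test run by a hash of its inputs and skip the run on a hit. (Bazel does this per build action with a remote cache; this toy version just shows the principle, and all paths are hypothetical.)

```shell
#!/bin/sh
# Sketch: a content-addressed test cache. Identical inputs hash to the
# same key, so the second run is skipped; changed inputs re-run.
set -e
work=$(mktemp -d)
mkdir -p "$work/src" "$work/cache"
echo 'code v1' > "$work/src/lib.txt"

run_tests() {
  key=$(cat "$work"/src/* | sha256sum | cut -d' ' -f1)
  if [ -f "$work/cache/$key" ]; then
    echo "cache hit: reusing result for $key"
  else
    echo "running tests for $key"    # stand-in for the real test suite
    touch "$work/cache/$key"         # record the (passing) result
  fi
}

run_tests   # first run: executes the tests and stores the result
run_tests   # same inputs: cache hit, nothing runs
echo 'code v2' > "$work/src/lib.txt"
run_tests   # inputs changed: new key, tests run again
```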
See Bazel for how Google does it. (Well, it doesn't provide the build farm, but you can see what the build files look like.)
The notion of zero-config CI is a recent invention, thanks to conventions of small repos with well-known project file names per language/platform ecosystem.