On Monolithic Repositories (gregoryszorc.com)
84 points by tristanz on Aug 5, 2015 | 72 comments

I've worked with monolithic and project-based repositories, and in practice I've found the problems with monolithic repos are less than the problems with project-based repos, and the benefits of monolithic repos are more than the benefits of project-based repos. Certainly there can be issues at a very large organisation with very many extremely large projects—but most of us don't work at those organisations with that many projects that large.

I think that having one large repo helps identify cross-project dependency breakage faster, e.g. on a small team without fully-automated integration tests by increasing the likelihood that the person or team who broke the integration notices rather than the person or team who maintains the affected components.

There's also the issue, as jacques_chester notes, of shared components, some of which are far too small to be their own projects and which don't necessarily make sense thrown into a pile with other projects.

Project-specific repos make a lot of sense from an organised, a-place-for-everything-and-everything-in-its-place perspective, but real life is often quite messy and mutable, and the proper organisation for a project can change frequently (as the article notes); there's no sense chiselling it into stone.

I think there's a missing distinction here: is it a monolithic repo for products available separately, or monolithic final deployable thing(s) (google has multiple products, but it's still "google services").

This may be better for them, but it wouldn't work for example for OpenStack, which has many projects available and released separately. Putting nova, stevedore, anchor, bandit, etc. in one repository just wouldn't make sense - they have their own versioning and live their own lives, even if they will be frozen/released in one go as a single working deployment in October.

So when the author writes "When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories." they just don't have that use case. They don't have to care about releasing different bits separately.

Google has self driving cars, YouTube, various Android apps, and web search in the same repository. That does seem a lot more varied at the first look than e.g. OpenStack, and different products certainly have extremely different release cycles and even deployment platforms (hardware in cars, Android phones, web servers).

The question really is whether you can scale your version control practices, build tools, and source organization habits across many diverse projects.

Do they really store youtube, self driving car and Android in the same repo?

As far as I can tell, at least android bits live in many repositories: https://android.googlesource.com/ - I'd say over 300, with one for each utility.

Exactly. I frankly have no clue what this article is about. I work at a company with < 10 developers. We started with one repo and now have over 20 repositories for various bits and pieces of our code. Each one maps to its own releasable component. I don't recognize the 'I don't want to waste time dealing with multiple commands to manage multiple repositories' at all. The only time there is a difference on the cli is when you clone a repo. If anything about this would hurt, we'd change it: optimizing our workflow is something we pay a lot of attention to.

In fact, with hundreds of developers and everything in one repo, I don't see how you'd ever be able to get a commit through: you'd be merging commits that others just did all the time and would have to get lucky?

The issue with multiple repositories has nothing to do with the number of commands you have to run. As you say, that's the kind of thing that can easily be automated.

The problems arise when you have to combine code from different repositories into a single deployable product. Most of us don't take Amazon's hard-line stance of making absolutely everything a microservice, so we end up with libraries of reusable code that are referenced by multiple projects. But when you store those libraries in separate repositories, it becomes impossible to describe the state of your deployed code without listing the version of every single dependency. That makes it easy for subtle inconsistencies and bugs to creep in, especially when the dependencies are multiple levels deep and are owned by different teams. If everything lives in the same tree, then a single commit ID reproducibly describes a complete system from top to bottom. And you can atomically make changes that cross module boundaries, which is difficult to do safely with separate repositories.

I don't really follow your comment about merging. Pretty much every version control system since forever has been smart enough to realize that, if I make changes only to foo/src/ and you make changes to bar/src/, our changes don't conflict and can be merged automatically without user intervention. (There might be technical difficulties; for example, if you're using Git, I would imagine that trying to view the list of commits of a small subtree of a gigantic repo might not be terribly efficient. But just like the issue of managing multiple repos, that's something that you can solve with better tool support, if you really need to and are motivated enough.)

> it becomes impossible to describe the state of your deployed code without listing the version of every single dependency...If everything lives in the same tree, then a single commit ID reproducibly describes a complete system from top to bottom.

This is a tooling problem, that is somewhat solved with a gigantic mono repo. All of our shared components are published with NuGet, and referenced as such. There is no interrepo dependence, and there shouldn't be.

About merging (I'll get back to the other bit later): if I want to push my change, I already regularly encounter the situation where git tells me: you are not up to date, even if I pulled say 15 minutes ago. So I have to fetch and rebase before I can push. Even if that doesn't require manually solving any merge conflicts (and I think I have to in about 1/4th of the cases), it takes a bit of time. In that bit of time another developer can push a commit. If that already happens on occasion with 10 developers, I imagine it becomes very problematic with 1000 developers on the same repo.

Of course you can effectively organise your repos in a 'tree' like e.g. happens with the Linux kernel, but, well, then you again have many repos. One per component that has someone responsible for signing off on them. It reduces the number of people committing to each specific repo, but in the end, it's still many repos. So that's not 'one monolithic repo' in my book.

Still, that's a limitation of the specific tools and the way you're using them, not a fundamental problem with monolithic repositories.

You're running into problems because you have multiple people all trying to push to the same branch, and Git errs on the side of extreme caution rather than creating a merge that you didn't specifically request. If you use merge requests instead, as supported by GitHub/GitLab, you don't get blocked by other people's commits.

(Actual merge conflicts are an orthogonal problem. They'll happen whenever you have multiple people editing the same code, regardless of what VCS you use or how it's organized.)

The Linux kernel example is a bit of a tricky one. Yes, kernel.org hosts a lot of repos, but they're all different versions of the same codebase. They have a common commit ancestry, and commits get merged from one to another. So they're not really separate modules in the sense that we're talking about; they're more like namespaces that each contain a set of branches.

I guess it comes down to organisational style. Using merge requests requires some benevolent dictator (poor sod?) to perform the merges. For us, if your commit is approved in Gerrit, it's your task to merge it. If some other commit was merged in between, you have to rebase, possibly solving merge conflicts. So I guess that with a monolithic repo, that wouldn't scale, but it could be made to scale by appointing someone to perform the merge requests. I'm not sure I would like to be that person...

I helped set up such a system for a few hundred developers. We had an 'automated Linus Torvalds', which did the merge, and aborted whenever a file was changed on both sides of the merge.

In the good case (almost every time), there were no conflicts, and the merge went fine (we had unittests, builds and regression tests as extra checks in our CI system).

In the bad case, the developer request was rejected and the developer was told to rebase or merge his code on his own, so the merge issues would be handled.
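
For readers who want the 'automated Linus Torvalds' rule spelled out, here is a minimal Python sketch of just the "abort whenever a file was changed on both sides" check. The function name and the idea of passing in two sets of changed paths are assumptions made for illustration; the system described above is not shown in the thread, and its unit tests, builds and regression tests are not modelled here.

```python
from __future__ import annotations


def try_auto_merge(changed_on_target: set[str],
                   changed_in_request: set[str]) -> tuple[bool, list[str]]:
    """Accept a merge request only if no file was changed on both sides.

    Both arguments are sets of file paths changed since the common
    ancestor (hypothetical inputs; a real bot would ask the VCS for them).
    Returns (accepted, overlapping_paths).
    """
    overlap = sorted(changed_on_target & changed_in_request)
    # Any overlap means the developer is told to rebase or merge by hand.
    return (not overlap, overlap)


# Disjoint changes merge automatically; overlapping ones are rejected.
print(try_auto_merge({"foo/src/a.c"}, {"bar/src/b.c"}))
print(try_auto_merge({"foo/src/a.c", "common/util.h"}, {"common/util.h"}))
```

This is deliberately stricter than a textual three-way merge: two edits to different parts of the same file would still be rejected, which matches the rule as described.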

  The problems arise when you have to combine code from 
  different repositories into a single deployable product. 
OK, we only have a single deployable product, which combines code from all those repositories, so it seems we match this criterion.

  But when you store those libraries in separate repositories, it 
  becomes impossible to describe the state of your deployed code 
  without listing the version of every single dependency. 
Yes, and that is what we do: each released lib has a version. The deployed version only uses released libs and lists exactly what it uses. The libs cannot be overwritten or changed after release: once released, that's what that version of the lib is.

  If everything lives in the same tree, then a single commit ID 
  reproducibly describes a complete system from top to bottom. 
In our case a single commit ID also reproducibly describes what has been deployed. Each library version was built from a specific commit ID in their respective repos.

  And you can atomically make changes that cross module 
  boundaries, which is difficult to do safely with separate 
  repositories.
We don't have to atomically change anything, because a new backwards incompatible release of a certain lib doesn't affect any other lib, until we explicitly opt to upgrade it to the new version and release a new version. And if it's backwards compatible, it gets automatically picked up, because version management allows it.
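
A rough Python sketch of the opt-in behaviour described above, assuming a semantic-versioning style convention in which a new major number signals a backwards-incompatible release. That convention, and the names `parse` and `resolve`, are assumptions for illustration; the comment does not say which rules the commenter's version management actually applies.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resolve(allowed_major: int, released: list[str]) -> str | None:
    """Pick the newest released version within the major line we opted into.

    Compatible releases (same major) are picked up automatically;
    a new major is ignored until the consumer changes allowed_major.
    """
    candidates = [v for v in released if parse(v)[0] == allowed_major]
    return max(candidates, key=parse) if candidates else None


released = ["1.0.0", "1.2.0", "1.10.1", "2.0.0"]
print(resolve(1, released))  # still on the 1.x line
print(resolve(2, released))  # after explicitly opting in to 2.x
```

Because released versions are immutable, the resolved version string plus the consumer's own commit ID is enough to reproduce a build, which is the point being made in the comment.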

So I just don't recognize any of: I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used.

I have a directory with a tree of repos. They could be in one big repo, but aren't. Otherwise, on disk, everything is the same. If I change things in two related libs, I have to commit twice, but I'm also changing two independent libs that I have to release separately anyway, just like in a monolithic repo. Finding and reusing code is independent of how many repos.

I just don't see any concrete problem in the article that I can relate to that explains how what we do is more work, causes problems, or can be optimized.

Google has a few departments that do their own thing. Android and Chrome for example.

There is still a lot in the monolithic repo. I heard "1TB" when I asked a Googler. Also, they don't use Perforce anymore, at least for the central repo, because Perforce requires a single host. They built a distributed monolithic repo server themselves.

(disclaimer: I have no proof for any of this and might misremember)

I'm very much a fan of monolithic repositories, because I have had horrible experiences working with project-based repos. I've been around long enough to recall discussions of why people should stop using CVS, and invariably one of the bullet points on that list is "CVS lacks atomic commits." Projects that use multiple repositories invariably fall into the trap of having non-atomic commits, and people who advocate multiple repositories have likely never had the fun task of trying to do archaeology on those repositories, where the non-atomicity suddenly becomes painful.

When I've brought this up before, people occasionally mention submodules or subrepositories, and those are equally broken. They make big assumptions about how you're going to be organizing repositories (i.e., a strict tree), and if your design is not in that organization scheme, you're up a creek. For practical development, the subrepo tree effectively becomes one monorepo anyways: touch the innermost subrepo, and now you need to add a commit to all the outer repos to update to the newer version of the innermost subrepo.

A saner way to handle repos is to recognize that it's not necessary for people to have the full history of everything in the repository stored on their local machine most of the time. This is something that SVN does better--you can checkout subdirectories of an SVN repo, but the commits are still atomic across the entire repository.

Beyond a certain scale, atomic commits don't work in monolithic repos either. If a big commit touches code in many areas, it needs to be broken up into many small commits, so that experts in each of those areas can review, rework, and approve those changes without tripping over each other. Those small commits won't all happen at the same time, so they need to be backward compatible. Now depending on the tooling, you're back to essentially the same workflow you'd have with many repos.

Granted, this is based on experience at several startups and a single huge company with a famous monorepo. Maybe there is a size in the middle where a single dev can still understand all the code well enough to make a sweeping commit without breaking lots of things.

> A saner way to handle repos is to recognize that it's not necessary for people to have the full history of everything in the repository stored on their local machine most of the time.

Local history enables rebasing. Of course you can do it without local history already present: in SVN, say, you'd use separate branches for changes and 'rebased' changes, where you merge your work on top of some new state of trunk. But this involves creating branches (and at least one checkout for a separate branch folder) and communicating with the server (which takes time). This all means that people almost never do this. With git, rebase is a snap. But of course one can live without rebase; it's not oxygen or something.

You can download the history you need for rebasing when you do the rebase command, in effect turning the local .git (or .hg) store into basically a local cache of a remote repository.

I'm sitting here running git gc and repack on about 50 repos right now, of varying sizes. We just actually combined two of our larger repos into one for productivity reasons, so this article resonates with me a little on that front.

I spend a lot more time on the build and administration side of things than the code side, and I personally prefer more smaller repos. Builds are faster and less error prone, less disk space is used overall (regardless of cloning scheme - I have used them all), and I do believe the separation and inherent difficulty aids quality at the expense of productivity. I'm about the only person in the company who does, though, and that tells me this discussion depends more on how you personally interact with source control than any abstract 'monolithic vs. not' ideal. Or that I'm crazy, but I refuse to accept that.

Small repositories are great for open source because they encourage code sharing. When your project consists of many pieces all owned by different, often volunteer, teams, then having one big repository is a barrier to that.

For example, Servo, the project I work on, consists of around 150 small repositories (in contrast to every other browser engine). Lots of them are maintained by the Servo project, but lots of them aren't. They're separate projects of which Servo is only one user of many. The fact that we can take advantage of the fantastic work of the Rust community by simply adding a couple of lines to our Cargo files has been invaluable in helping our small team get the browser off the ground. A culture of monolithic repositories would discourage code sharing, leading to more wheel reinvention and less code collaboration overall.

I think the model works well for Google and Facebook because they're big centralized companies with thousands of engineers under one management and reporting structure. But that's a far cry from the aggressive, fine-grained code reuse culture that the Ruby, JS, and Rust communities (for example) have fostered.

Thought experiment: the entire github.com URL space is a single repository. Each organization/user has a top-level directory and projects/forks exist under them. Does that prevent/encourage code sharing? Why or why not?

Prevents. Instead of cloning just one of your repos and expecting it to work, I clone `indygreg2` and need to answer:

- is there one build system for everything, or one per project?

- are there dependencies between projects?

- are they symlink-vendored? (which means that potentially I need more than one toolchain if projects A and B are in different languages)

- are they completely separate? (do they always assume latest version of your projects, or do they have reasonable version qualifiers)

- are any elements included cross-project, or can I just copy one directory and package it separately?

Those and other similar questions just don't exist for (well-maintained) projects that have separate repos. I know that I can clone one project and build it.

Instead of cloning just one of your non-monolithic repositories and expecting it to "just work" I need to answer:

- How does the build system integrate multiple, discrete repositories into a unified system?

- What are the dependencies between the repositories?

- How are the sub-repositories laid out on disk? Do the separate repositories use separate toolchains?

- How do I decide when to update the reference to a sub-repository? Are they completely separate? Versioned as one logical entity?

- Do separate repositories reference elements in each other? Can I copy files between repositories or should certain files live in certain repositories?

These and other similar questions exist when you use multiple repositories.

(I hope you see that multiple, discrete repositories aren't a panacea and there is a counterpoint to each of your points.)

I don't agree with some of the counterpoints. Not because they couldn't exist in theory, but because we already worked out some common solutions for them and by cloning a separate repository, I expect that kind of problem to be one of: solved, documented in red big letters, or worthy of a bug report.

1. (build integration) All popular build systems have some dependency management answers. Even down to C's `autotools` and `pkg-config` which will at least tell you what you're missing. But more likely something like pip / gem / cargo which can just get it for you. Whether that's a released version, or another repo - none of my business.

2. (dependencies between repos) Same as 1. They're separate projects.

I don't think these apply at all:

3. (sub-repositories) I don't see a difference between sub-repos and symlinking to a repo outside. This is a problem of single repo.

4. (sub-repositories update) Same as 3 - it's the same as one repository - avoid sub-repos unless you want to pretend you have one big repo with everything.

5. (moving elements) I think that's a straw man. Does anyone have a reasonable expectation that a file containing code can be moved between repositories without issues?

While no repository layout is perfect and there are always pros and cons, I think those examples are really bad as counterpoints.

I've worked both at a large company (Facebook) that used the mono repo approach as well as a large company that uses the per project repo approach (Uber) and I have to say I'm personally a VERY big fan of the project based repo approach but every company is different. So is every team. If you're a small company with primarily one service and primarily in one programming language, the mono-repo way seems to be the best approach. On the other hand, if you're a company that has embraced a service oriented architecture, the per project repo approach is likely the way to go. Especially if your company is OK with services being written in a variety of different languages and so long as it is as easy to use open source code as it is to use third party code written within your org. It also goes a long way in supporting local (ie., laptop) development. Otherwise, the entire codebase would be too big to fit in RAM.

Disclaimer, these opinions do not necessarily represent the opinions of my employer.

If your concerns are driven by resource requirements, then I posit your concerns are driven by limitations of fully distributed version control tools of today. Shallow and/or narrow clone (like the Subversion model) limit the amount of data required on clients and thus facilitate monolithic repositories without the extreme resource requirements on clients.

I posit that if Git or Mercurial allowed you to clone a subset of directories, a monolithic repository and a set of smaller repositories would become indistinguishable, as a clone of a sub-directory is functionally equivalent to a standalone repository! The problem is that narrow clone is not implemented in any popular DVCS tool today (but Mercurial is working on it).

I posit that if Git or Mercurial had better workflows for submodules, monolithic repositories would seem less attractive.

I don't disagree. Although, for the case where you want to copy/move things across repositories, monolithic repositories still have the advantage that history is more easily preserved. Although you can argue that proper submodule support would handle this and preserve history.

> Although, for the case where you want to copy/move things across repositories, monolithic repositories still have the advantage that history is more easily preserved.

Well, let's wait and see how partial clones actually handle this situation. I'm not convinced a partially cloned monolithic repository will be better than what submodules currently do.

Atlassian recommends you consider git subtree as an option if you are looking at git submodules.

Subtrees are nice for some use cases. But that's the very problem, there's no general (and supported out-of-the-box) solution to the problem. DVCSes have solved the distributed development part for one (sub)project, but we currently don't have a compelling solution for distributed development across several (sub)projects.

Another way to look at this is that repos quickly ossify into unplanned Conway boundaries.

Where possible, the goal is to decouple software components by design, not by backpressure from the toolchain.

I've seen the many-repo approach. It's particularly frustrating on distributed systems when a shared component migrates from repo to repo like a sad ronin, sometimes alighting in some of them more than once.

> Where possible, the goal is to decouple software components by design, not by backpressure from the toolchain.

Taken to the extreme, this produces a single main.c file for an entire organisation.

Good software design should dictate the types of toolchain backpressure tradeoffs that need to be managed.

It all comes down to the scale factor.

The monolithic approach makes more sense for big companies that have a large portfolio of projects, an army of software engineers and the resources to develop their own tools around an SCM. This solves several problems, like handling multiple commits per minute, resolving ever-occurring merge conflicts, code search queries and code sharing. Someone might argue that it can slow down the development process by having to load tons of data on a local machine, but this issue can be solved by creating the right custom tooling to download a sub-node of the overall repo. At this scale this is the only feasible solution to empower developer productivity and to avoid headaches in keeping track of a long array of projects.

The project-based repositories solution has nice qualities as well: service-oriented projects, clear-cut responsibilities and clear dependencies. This seems to be a good solution for a small to mid-sized organization. You can design a well-documented interface for every service and even expose some of those services to external clients when needed https://www.nginx.com/blog/microservices-at-netflix-architec... . Not that this service design is incompatible with a monolithic approach; it's just easier to come up with a simpler answer when you have silos of knowledge.

At the lower scale, where most startups reside, either one seems to be a reasonable approach. Although a startup has neither the resources nor the time to spend on developing custom tools to manage a monolithic repo, it turns out they don't need to, because they have a small number of developers and little commit traffic. A plain simple git repo works as well as a multi-repo setup. This is a matter of taste and of the organization the founders/early employees wish to create for their project.

Option C: Custom tooling built for the purpose of managing collections of repositories.

It appears the main argument for monolithic repositories is that it improves developer productivity by giving the developer access to the entire organisation's codebase.

What a terrible hack to provide something that could be better managed with other tools. No DVCS is an island.

There's also the multiple repository tool 'mr' http://linux.die.net/man/1/mr which addresses exactly the author's "I don't want to waste time dealing with multiple commands to manage multiple repositories" complaint, and moreover can do so for repositories using different version control systems.
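
To illustrate the general idea of running one command across a tree of repositories, here is a small Python sketch. It is not how 'mr' works internally and does not use its configuration format; `find_repos` and `run_everywhere` are hypothetical helpers, and only git checkouts (directories containing a `.git` entry) are detected.

```python
from pathlib import Path
import subprocess


def find_repos(root):
    """Return every directory under root that contains a .git entry, sorted."""
    return sorted(p.parent for p in Path(root).rglob(".git"))


def run_everywhere(root, argv):
    """Run the same command in each repository; map repo path -> exit code."""
    return {
        repo: subprocess.run(argv, cwd=repo).returncode
        for repo in find_repos(root)
    }


if __name__ == "__main__":
    # Example: show a short status for every checkout below the current directory.
    for repo, code in run_everywhere(".", ["git", "status", "--short"]).items():
        print(repo, "ok" if code == 0 else f"exit {code}")
```

A dedicated tool earns its keep with things this sketch ignores, such as other version control systems and per-repository configuration.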

I've been using it for years mainly because I have a few repos shared between projects and jobs, and using one monolithic repo would mean having to copy those repos around. (At least AFAIK - or how else do people using one big repository use external projects?)

The Android team wrote the "repo" tool to manage multiple git repositories.


Someone needs to do for subrepos what git did for branches. Working with subrepos is a nightmare.

Subversion ;P (the reason their branches were so awkward was because they were essentially sub repositories...)

So we combine the best of git, subversion, and perforce. perverted-git, anyone?

What's the advantage of Perforce (I don't know as I've never used it before)?

It's quite fast and it seems to scale well to large (though not necessarily to ridiculously large) repositories. The largest one I've used it with had a 1TB head revision - I don't know how large the history was - and a zillion files. Performance was still fine. (This is probably its main advantage, really: you can just put all your files in it. Then it's easy for everybody to get them, and you didn't have to think too hard about it.)

It uses the check in/check out model, so there's no problem with unmergeable binary files. There are per-user access permissions. Branches are folder copies. There's a GUI tool, but you can do everything from the command line as well (I believe that is exactly how the GUI tool does everything).

(UX is a bit hit or miss though. There's no git-style index, and the command line tool's output isn't as convenient to parse as you might like. On the other hand the diff/merge tool is alright and the UI for keeping your branches in sync is fine.)

I've never minded using it.

I work on the Developer Infrastructure team at Airbnb — essentially our tools team — and have some experience with both sides of the coin. Airbnb's monolithic vs project-based repository organization is currently split along language lines on the backend: the Java folks prefer a single monorepo (and have one), and the Ruby engineers use project-based repos (and have many).

There are a lot of good points made about the benefits of monorepos, and at Airbnb we enjoy several of them. What hasn't been mentioned is the effort required to do them well: you need specialized build and dependency tools to ensure that you only run builds+tests for the single project that's being changed; engineers have to check out extremely large amounts of data to work on a single subdirectory, or else you need custom tooling to allow them to only check out portions while still contributing to the larger whole; if someone mistakenly breaks a piece of shared code and merges it to master, every project is now broken and engineering work may be stalled unless you have very good debugging tools and testing frameworks to quickly recover from and prevent these kinds of issues.
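
As a rough illustration of the "only run builds+tests for the project that's being changed" requirement, here is a minimal Python sketch. It assumes each top-level directory is one project and that a dependency map is available; the function name, the layout and the map are all hypothetical and not a description of Airbnb's tooling.

```python
from __future__ import annotations


def affected_projects(changed_paths: list[str],
                      deps: dict[str, set[str]]) -> set[str]:
    """Return the projects that must be rebuilt and retested.

    deps maps each project to the set of projects it depends on.
    A project is affected if it was changed directly or if it
    depends, directly or transitively, on an affected project.
    """
    # Assumption: the first path component names the project.
    affected = {path.split("/", 1)[0] for path in changed_paths}
    grew = True
    while grew:  # propagate until no new dependents are found
        grew = False
        for project, needs in deps.items():
            if project not in affected and needs & affected:
                affected.add(project)
                grew = True
    return affected


deps = {
    "web": {"libauth"},
    "libauth": {"libcore"},
    "batch": {"libcore"},
    "libcore": set(),
    "docs": set(),
}
print(sorted(affected_projects(["libauth/src/token.py"], deps)))
```

The same propagation shows why a broken commit to shared code is so costly: a change under `libcore` in this toy map marks every project except `docs` as affected.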

The upfront costs of doing monorepos well are high, and doing them poorly is in my experience a net productivity loss. For large companies with established business models (and I'd consider Facebook, Google, and Airbnb to be some degree of "large" and with "established business models," although Airbnb is clearly still much smaller and pre-IPO), the tradeoff of allocating some number of engineers to work on in-house tools for a much larger engineering team is usually worthwhile, and monolithic repos start to become an attractive option. I'd caution small companies or early-stage startups against monorepos, though: when you're a twenty-person team, that amount of tools work just isn't worth it. Use open-source tools, and spend the rest of your time shipping product.

TL;DR: Facebook and Google have optimized their workflows for their size; if you're not a Facebook or a Google, your mileage will probably vary.

Facebook uses Mercurial. Mercurial works well for small projects as well.

Mercurial has no clear technical advantage over git though. That might change if Mercurial gets narrow clones (only check out subdirectories) working and git fails to clone (haha) that.

IIRC there were experimental patches implementing it for git a few months ago.

Ironically, to make narrow clones happen, Mercurial is going to change its internal structure representing the files in the repository (the manifest) to use a separate manifest for each sub-directory... like git has been doing for trees from day one.

I found patches from 2008, but nothing recent.


Your link is about narrow checkout, not narrow clone (so the .git data is still transferred completely, as opposed to transferred partially). Maybe what I'm thinking about was just a discussion with no formal patch, I don't remember.

The two features of interest are "shallow clone" and "partial checkout".

Both of these features are already part of git.

A current experiment is an untracked files cache, which speeds up stuff on large repos considerably. This stuff is actively being worked on -- the git project has always valued performance.

No, "shallow clone" and "partial checkout" are both different to "narrow clone". Maybe "partial clone" would fit the git terminology better.

A shallow clone misses history after a certain point in time (well, commit order). A narrow clone misses history except in a subtree. A partial clone looks like a narrow clone in the workspace, but a narrow clone should have substantially fewer objects to download and store.
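
A toy Python model of the shallow versus narrow distinction drawn above, treating history as a simple oldest-first list of commits. This is only a way to picture which commits each kind of clone would need; it is an assumption-laden sketch and says nothing about how git or Mercurial actually store or transfer objects.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    ident: str
    paths: tuple[str, ...]  # files touched by this commit


def shallow(history: list[Commit], depth: int) -> list[Commit]:
    """Keep only the newest `depth` commits, whatever they touch."""
    return history[-depth:] if depth > 0 else []


def narrow(history: list[Commit], subtree: str) -> list[Commit]:
    """Keep every commit, old or new, that touches the given subtree."""
    prefix = subtree.rstrip("/") + "/"
    return [c for c in history if any(p.startswith(prefix) for p in c.paths)]


history = [
    Commit("c1", ("foo/a",)),
    Commit("c2", ("bar/b",)),
    Commit("c3", ("foo/a", "bar/c")),
    Commit("c4", ("bar/b",)),
]
print([c.ident for c in shallow(history, 2)])
print([c.ident for c in narrow(history, "foo")])
```

The shallow result is cut by age and the narrow result by location, so the two can be combined or used independently.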

Architecture is just dividing pain into different buckets.

Large repositories are painful, as each developer bears the pain of integration every time they have to make a change to the repository integrating with all of the other code.

Small, distributed repos are painful when change accrues in larger increments and must be resolved at release/integration time.

As a developer in one of the largest monolithic repos in the world, I feel the pain every day of that pattern.

Perhaps a better way of framing this is:

Distributed repos make writing software easier.

Monolithic repos make testing and running software easier.

Large monolithic repos are an artifact of scale where the individual cost of lost productivity in development is worth the increased performance at deployment and testing.

Why do you think integration is more painful in larger repositories? Presumably the architecture of the code is unchanged.

When you have separate repos, the external code and integration points remain relatively static. I.e., you may consume an API, but it's versioned/snapshotted at a given version.

With a monolithic repo, especially without development branching - everyone is running against HEAD all the time. So while you are developing your component, everything else is changing around you.

As someone already said, it is not "many repositories vs one" but often "many repositories + custom tooling to merge, aggregate, diff, checkout, etc. vs one".

We've tried both. We had one large repo (2G in size, mostly due to historical mistakes of checking in large binaries). Now we have many little repos, but we've spent I don't know how many man-months writing custom scripts (based on gyp) to manage this collection of repos.

So far I would say it is a toss-up. I would say for completely separate products that share almost no code it might make sense to have separate repos. But if you find yourself building a custom version of git porcelains on top of your multiple repos -- you have probably gone too far.

Interesting comment on that article by Siddharth Agarwal:

"(I work on source control at Facebook.)

Every problem you mentioned with monolithic repositories is a well-known problem with Git (though some of them do have workarounds, such as clone with --depth). None of them are issues in principle.

With Mercurial we're aiming to provide tooling that scales well while still maintaining DVCS workflows like local commits."

The most interesting takeaway for me is that Mercurial is planning to support "narrow clones" soon, i.e. sparse fetches (you only have to download some subtree of the whole repository). It would be great if Git followed suit at some point - the whole monolithic versus project-based repository debate would be a lot more interesting if it were just a matter of convention, as opposed to the former being associated with "ew, SVN/Perforce/[insert old-fashioned-feeling tool]".

I'd agree with this article's conclusions if the current-gen VCSes (like Git and Mercurial) actually provided the necessary features and tooling to effectively manage a monolithic repo. Comparing to SVN's ability to checkout portions of a repo instead of the whole repo doesn't do much good in systems that aren't designed to handle that sort of workflow.

Whether or not one should go monolithic or project-based depends on how tightly integrated those projects are. Folks like Google and Facebook and OpenBSD's dev team develop all their things under a single source tree because those things are pretty tightly integrated and designed to interoperate. Other folks develop each of their things in a separate repo because they expect those things to be independently useful (OpenBSD's subprojects do happen to be independently useful, but AFAIK the priority is generally "OpenBSD first, everyone else second").

Multi-repository environments can also be very manageable depending on the language, runtime, etc. Dividing things into do-one-thing-and-do-it-well chunks (perhaps gems in Ruby land, or crates in Rust land, or packages in Perl land, or whatever) goes a long way toward alleviating the typical non-atomicity of multi-repo setups. Of course, this tends to be easier for open-source projects than closed-source projects (though the closed-source camp can have some of this fun, too; using Ruby as an example, one can run a private gem server just by running the "gem server" command, or - if using Bundler - can even install dependencies directly from Git repos), but it's still a possibility.

Plus, the whole "well Google says it's more productive for them, so we should take it as a general rule" vibe doesn't really sit well with me.

I am definitely in the multiple repository camp, but with one caveat (below).

With the monolithic approach everybody is forced to be on the latest version (or to update to the latest version). There can be a lot of contention when trying to get things out - especially when things are moving fast. You need a lot of discipline to make it work and you also are forced to constantly update parts of the code you own. Also, we tend to forget about deployment and management of the build artifacts. I am willing to bet that bigCo's that do this have a separate system for managing the artifacts and performing the deployments.

In the presence of a decent build and deployment system the only way to go is multiple repos - a logical separation per component / service. The build system can easily track dependencies between components via metadata associated with each of them. You can easily orchestrate builds when something happens, it's easier to rebuild only what changes and you get granular control over build artifacts.
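A minimal sketch of that kind of orchestration, assuming each component publishes a list of the components it depends on. The component names and the `COMPONENT_DEPS` table are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical per-component metadata: component -> components it depends on.
COMPONENT_DEPS = {
    "common-lib": [],
    "auth-service": ["common-lib"],
    "billing-service": ["common-lib", "auth-service"],
    "web-frontend": ["auth-service", "billing-service"],
}


def build_order(deps):
    """Order components so every dependency is built before its users."""
    return list(TopologicalSorter(deps).static_order())


def rebuild_set(deps, changed):
    """Return `changed` plus everything downstream of it, in build order."""
    dirty = {changed}
    ordered = build_order(deps)
    # Walking in build order means a dependency is marked dirty before
    # any component that uses it is examined.
    for component in ordered:
        if any(dep in dirty for dep in deps[component]):
            dirty.add(component)
    return [c for c in ordered if c in dirty]
```

For example, `rebuild_set(COMPONENT_DEPS, "auth-service")` returns `["auth-service", "billing-service", "web-frontend"]` and leaves `common-lib` alone.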

Can someone talk about how you keep track of commit messages when committing to different subfolders/subprojects of a monolithic repository?

We have 6 projects on bitbucket - and we would love to move to a monolithic model (we have already seen integration issues because of developer carelessness with deployments, that would go away)... but I'm just not sure on how we will do things like Slack/Hipchat integration, emails on commits, etc.

Where I work we have a monolithic Subversion repo that contains:

- DBs ddl

- Selenium automatic tests

- A hierarchy with our Java code, where we have our libraries, modules, final products and client personalizations of our products.

Our build solution relies on Maven with a Nexus server, and Jenkins to auto-deploy Maven artifacts.

Pros :

- We can do a single commit across everything to update many modules/libs/final products to fix an issue or implement something new. So tracking an issue fix or a new feature across all of them is easier.

Cons:

- We can't work like in git, where we can create a branch to develop a new feature or fix an issue isolated from other changes.

- Enforces big commits.

Actually, I managed to use git svn to map our Subversion repository, but I need to create many small local git repositories, one for each module/lib/product/personalization that I work on. I also have a few scripts to update my local git repos (master branch) against the Subversion repo. This lets me be more flexible when I work on a new feature or a potentially dangerous change, as I can create local branches, make local commits (and squash them before sending to Subversion), stash, etc.

I think that with this I get the best of both ways of working. I can be more flexible and more productive, as I can switch branches in a project/module/lib with a few keystrokes and version code locally. The only thing I miss is making a single commit across modules/libs/products when resolving an issue requires changes in several of them. In that case, I try to use the same commit message, or at least put the issue number in the commit message, to make it easy to track in FishEye.

We use Perforce (since before DVCSs came along) and follow the monolithic approach with 100s of projects. We use git and mercurial when interfacing to clients' repositories and for small, local temporary repos to structure work that shouldn't clutter our main repository.

The monolithic approach works so smoothly because Perforce makes it trivial to only check out the directories we're working on. It has huge advantages when we're working on interdependent projects.

Perforce does have some historical oddities (top level directories are called "depots" and are slightly different than normal directories), but the ability to branch, merge, and check out using normal filesystem concepts is a huge usability boon.

I was a proponent of project-specific repositories before I actually had to work with a large-scale system that consistently handled development in this way. This has taught me the hard way that managing transitive SVN externals (project A refers to project B as an external, which refers to project C as an external; I've had a hierarchy of 5 levels) is an incredibly tedious affair (tagging to get a consistent, pegged state of the main project is now a multi-hour process instead of a single mouse click).

That alone justifies a single repository.

TortoiseSVN has a feature of pegging externals at a specific release when tagging the main project, but this only works at one level.
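A rough sketch of why the one-level limit hurts, using a made-up externals graph. This models only the situation as described in the comment, not TortoiseSVN's or Subversion's actual behaviour; the `EXTERNALS` and `HEAD_REVISIONS` tables and both function names are hypothetical.

```python
# Hypothetical externals graph: project -> projects it pulls in as externals.
EXTERNALS = {"A": ["B"], "B": ["C"], "C": []}

# Hypothetical current revision of each project at tagging time.
HEAD_REVISIONS = {"A": 120, "B": 87, "C": 45}


def peg_one_level(root, externals, head_revisions):
    """Peg the root and its direct externals only; deeper levels stay floating."""
    pegged = {root: head_revisions[root]}
    for child in externals[root]:
        pegged[child] = head_revisions[child]
    return pegged


def peg_transitively(root, externals, head_revisions):
    """Walk the whole externals tree and peg every project that is reached."""
    pegged = {}
    stack = [root]
    while stack:
        project = stack.pop()
        if project in pegged:
            continue  # already pegged via another path
        pegged[project] = head_revisions[project]
        stack.extend(externals[project])
    return pegged
```

Here `peg_one_level("A", ...)` fixes A and B but leaves C floating, so the tag is not reproducible; `peg_transitively("A", ...)` fixes all three, which is the part that had to be done by hand at every level.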

I agree with the author... In the context of startups that are very much in "discovery" mode (still making big changes to their product), breaking the repository into subparts is just a source of confusion for the tech lead. I would also argue it is bad for devs because they then need to manage version-binding between different repos. The many-repos approach is only justifiable if you have many teams with many tech leads, each independently doing releases.

How do people do CI with monolithic repositories? Do they run ALL the tests or is there an easy way around that?

At Google, the dependencies for everything are very well specified, so in most cases the code affected by a single change is quite small. Dependencies are often specified on a per file basis. It's kind of a pain in the ass, but being able to know for sure you won't break stuff is very powerful.

Also, they have a ridiculous number of machines dedicated to running automated tests. My tech lead told me "feel free to add as many tests as long as they're under 30 seconds each".

I was super skeptical of Perforce before starting at Google, but it works really well, especially in conjunction with the code review tools and processes.

> Dependencies are often specified on a per file basis. It's kind of a pain in the ass, but being able to know for sure you won't break stuff is very powerful.

How are the dependencies specified? Is this a language-specific tool?

Bazel is the open source version of the internal build tool, it is multi-language. BUILD files in each directory specify dependencies and more.

That's the job of the build system. After syncing to a new version, it should only rebuild and re-run the tests where a dependency changed. (It helps to have fine-grained dependencies to avoid running compiles or tests unnecessarily.)

However, that's not enough when you touch a fundamental library and nearly everything is a downstream dependency. Then you really do need to run all the tests, so the build system should be able to run them in parallel on a build farm. Also, test results need to be cached by the build farm so every user doesn't need to run the same tests again for the same change.

See Bazel [1] for how Google does it. (Well, it doesn't provide the build farm, but you can see what the build files look like.)

[1] http://bazel.io/
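The caching idea above can be sketched roughly as follows. This is not how Bazel implements it, just an illustration of keying test results on the contents of a test's transitive inputs; the `TARGETS` table, the label names and every function here are hypothetical.

```python
import hashlib

# Hypothetical targets: each lists its own source files and its direct deps.
TARGETS = {
    "//base:strings": {"srcs": ["base/strings.py"], "deps": []},
    "//net:http": {"srcs": ["net/http.py"], "deps": ["//base:strings"]},
    "//net:http_test": {"srcs": ["net/http_test.py"], "deps": ["//net:http"]},
    "//ui:widgets_test": {"srcs": ["ui/widgets_test.py"], "deps": []},
}


def transitive_srcs(target, targets):
    """Collect the source files of `target` and of everything it depends on."""
    seen, stack, files = set(), [target], set()
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        files.update(targets[name]["srcs"])
        stack.extend(targets[name]["deps"])
    return sorted(files)


def cache_key(target, targets, file_contents):
    """Hash every input's path and contents; the key changes only if one does."""
    digest = hashlib.sha256()
    for path in transitive_srcs(target, targets):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(file_contents[path].encode())
        digest.update(b"\0")
    return digest.hexdigest()


def run_tests(tests, targets, file_contents, cache, runner):
    """Run each test unless a result for identical inputs is already cached."""
    executed = []
    for test in tests:
        key = cache_key(test, targets, file_contents)
        if key not in cache:
            cache[key] = runner(test)
            executed.append(test)
    return executed
```

If `cache` is shared across users (the build farm's job in the comment above), a second run over unchanged files executes nothing, and editing `base/strings.py` re-runs only `//net:http_test`, since `//ui:widgets_test` does not depend on it.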

With SVN (or similar) you'd simply configure your CI to 1) check out one or more files and subdirectories and 2) run a particular build script.

The notion of zero-config CI is a recent invention, thanks to conventions of small repos with well-known project file names per language/platform ecosystem.

Your build agents should use a cached .git, and only pull updates. Different test groups can be a different build target in your build/make file, or you can have nested build files.

I feel like I'm in the minority here; even my dotfiles live in separate repositories, managed by vcsh (https://www.github.com/richih/vcsh).

Minority as in multiple repositories? Eh... I don't think so. I would say a lot of people, given the choice, would go for this approach.
