Debian Git Monorepo (liw.fi)
222 points by pabs3 on April 2, 2024 | 153 comments



Putting aside that this is an April Fool's joke, I like the last part:

    > This time, I’m cruel to Git: can it handle a repository of this size? In 2009 it could not. In 2024 it can.
That really hits home for me. Really: Think about this repo for a moment. 500 GB. 15 miiiiiiiiiiiiiiilion files. It is crazy to think that a "vanilla" git repo can handle it. Applause for the Git team!


i used to work in git analytics and saw the most unimaginable repos. 500gb repos with one commit. repos where the source was in the commit message. repos with millions of commits consisting of bogus utf8, and so on. repos where they don't use branches, only remotes, etc etc. it felt like every use case had annihilated some old startup mvp code somewhere

we wrote our own in memory client but even before that, stock git client would always finish, it was just a matter of how long you were willing to wait. maybe we were lucky overall


Do you still have a record of any of those weird case repos? Perhaps reported as issues somewhere? I’m working on a tool like git filter-repo and Reposurgeon combined and would love stress tests and edge cases like these.


I find the Gentoo ebuild repository, mirrored at http://github.com/gentoo/gentoo, is often a stress test for git clients due to the sheer number of commits.

It's interesting to see the definition of scale stretched in all kinds of different directions.


I didn't raise an eyebrow at the headline, which is apparently an April Fools' joke.

But what is it that makes it so absurd that it works as a joke, which I'm missing? The size seems absurd (500GB) but why would the Debian source be 500GB?

To me it's in no way obvious that "Debian" would include any third party packages repackaged by the distro maintainers. Are Debian people maintaining N source repositories for third party code today, which would be the code merged into this hypothetical monorepo? Why?


This includes the sources of all the Debian packages.

In general this is a good idea, and it is close to how nixpkgs works (which is, in some sense, a "distributed monorepo"), but git, the Debian tooling and so on wouldn't actually be able to support a good workflow here.


> it is close to how nixpkgs works

Not really...? Nix distributes package definitions that describe how to build and install the package. This is close to just about every other package manager, besides the fact that most don't have / use the description on how to build the package, and instead they just download a binary based on the package definition.

The Nix equivalent to this joke would be storing all of the package code directly in the nixpkgs repo, instead of storing links to the source code. Debian would be able to support the same flow as Nix just fine. You can just use source packages[0].

[0] https://wiki.debian.org/Packaging/SourcePackage


I’d say it is close. A purer example would be gittup[1], which has its (very few) upstreams as submodules. But ultimately, Git submodules are just URLs stored in .gitmodules in the repository root plus commit hashes recorded in the tree, which doesn’t seem all that different from the URLs and hashes of contents passed to fetchgit et al. in Nixpkgs. The capability of being able to patch an arbitrary package locally and have every transitive dependency rebuilt automatically is also the same.

On the other hand, Debian and Fedora maintain full collections of sources used in their releases, while, AFAIU, if you need an old version of something from Nixpkgs, it has fallen out of the official binary cache, and the upstream is gone, you’re on your own. (There have been some efforts to fix that[2], but the first attempt was very limited in scope, and there doesn’t seem to have been a second one.)

[1] https://gittup.org/gittup/

[2] https://www.tweag.io/blog/2020-06-18-software-heritage/


Things don't "fall out of" the binary cache. Garbage collecting it is an active research area.


I think it's less about maintenance and more about pinning versions of software to ensure stability. I guess that comes with maintaining a package manager like apt.

Definitely with you on not realizing it was a joke at first. Doesn't Google have a company-wide monorepo, which I'd guess includes all of search and Chrome? Or is that just an urban legend?


Doesn't google have a lot of internal tooling to make that monorepo actually work for them?

The issue isn't so much that large monorepos don't work, it's that you are trading one set of issues for another set of issues. And most popular open source tooling (and I assume all debian-specific tooling) is built around solving the issues of many repositories rather than solving the issues of monorepos.


yes, and a lot of it's open source.

The main issue IMHO is that git doesn't scale for a monorepo. A monorepo wants to spread a single virtual repo over a large number of devices, with no device having the entire repo. In contrast, git wants to do the opposite: it's built for a large number of devices to have the same copy of the entire repo.

But aside from tooling and needing a horizontally scalable VCS, there's also the cost and engineering resources required to keep the cluster running, available, and high performance.


> Doesn't google have a lot of internal tooling to make that monorepo actually work for them?

probably test coverage and workflow matter: you can work with a monorepo if you can check how your change will impact the rest of the ecosystem.


At its core, with separate repositories it's generally fine to pretend that any commit affects all code within that repo. So for example you might run all unit tests for every new commit in a PR, or run all deploy scripts when something is committed to master. Dependencies are usually set up one-way, so code pulls in its dependencies (by referencing specific git commits or released packages). So a change in one repository might mean that it pulls in new or updated dependencies, but software depending on you should only see changes once a change is made to its repository.

For a monorepo those assumptions don't hold: any given commit likely only affects a tiny fraction of the content of the monorepo. But a change in a library might also immediately radiate outward to anything in that repo that uses the library, so you can't do a naive check based on the directory tree. So you end up with a build system like Bazel that can evaluate the entire dependency graph to know what to run. And you have to do that for pretty much all tooling you want to be triggered by code changes.
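
For example, Bazel can answer "what depends on this changed library?" so CI only rebuilds and retests that set. A minimal sketch (the //libs/foo label is made up):

    # Everything in the repo that transitively depends on the changed library:
    bazel query 'rdeps(//..., //libs/foo:foo)'
    # Only the test targets in that set, handed straight to bazel test:
    bazel query 'kind(".*_test", rdeps(//..., //libs/foo:foo))' | xargs bazel test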

Add on top of that the scaling issues you have with git as your repository grows.


The stuff in Google's monorepo is mostly their own code, not snapshots from upstream.


Google's monorepo includes snapshots of everything upstream that Google's proprietary code depends on - starting from the kernel and glibc and core language toolchains, and all the way up to various open-source libraries and modules in c/python/java/whatever.


But third-party sources are presumably pinned to specific stable versions? Linus can't break Google by submitting a bug into the Linux kernel?


Yes.


Chrome has brought in third-party code, which makes Debian mad, and they do a lot of work to strip that out and use the upstream libraries. (Or at least that is how I understand it - I don't follow Debian closely)


Yeah, that's why I wrote "mostly" :)


I think Chrome has got its own thing, if I remember right from my internship. But almost everything in one repo sounds right.


Aren't git submodules a much leaner way of accomplishing this than a clunky 500 GB repo?


Debian does include the third party packages. This is either a mirror of the git repository, or a tarball of the upstream code.


it's simpler than that: it's just running `dpkg-source -x` on every `*.dsc` file it can find. So that's "all of the source used to build all of the debs" (using dists/stable/main/source/Sources.xz to get the list.) (I think then just a single commit of all of it; the fun bit would be doing this for each release and looking at the size of the diffs...)
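
Roughly, that boils down to something like this (the package name is illustrative; in practice you'd loop over every entry in Sources):

    # Get the list of source packages in stable/main:
    curl -s http://deb.debian.org/debian/dists/stable/main/source/Sources.xz | xz -d > Sources
    # For each source package named in that list, download and unpack it:
    apt-get source --download-only some-package
    dpkg-source -x some-package_*.dsc src/some-package
    # ...then `git add src && git commit` the whole lot as a single commit.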


I meant what upstream Debian includes.


April fool aside, what I really want is a Debian git monorepo of submodules, where each submodule points to the upstream git repository.


Why use submodules when you can properly vendor the upstream git, and export/import commits without breaking hashes on either side?

https://github.com/josh-project/josh

We've been using josh at TVL for years and it's just amazing.


How does that work with upstreams that often need patches?


It depends on exactly what workflow you're trying to accomplish!

The parent comment seemed to be about a workflow where the repos are in sync. If you're carrying patches around and you don't want to export them, you are probably going to be merging them. In either scenario, if you have some patches you want to export and some you don't want to export, you'll either have to carry two trees, or do cherry-picking or something similar.


You should check out dgit[1] which pretty much does exactly what you're looking for. It's an excellent way (with a little bit of pbuilder scripting) to patch Debian packages in place with sensible version numbering.

[1] https://wiki.debian.org/DgitFAQ


What do you think of the openbsd model? It has a mono repo for the os and a ports repo for everything else

(I think)


Yes and no. OpenBSD uses four primary repositories: ports, src, www, and xenocara. CVS change announcements are merged for all but ports and go out on two mailing lists: source-changes@ and ports-changes@. Patches are discussed on tech@ and ports@ (sometimes also on bugs@). Also, some key software such as OpenSSH, OpenSMTPD, etc. is developed out-of-tree and lifted into the main train when appropriate. I am not a developer, but I do follow development somewhat closely.


What do you do about upstream projects that don't use Git ?


Most non-git projects have git mirrors these days, even if they're not official.

If upstream only releases tarballs, you can still untar them and create a commit for each version so you can easily compare them and merge any additional patches you need.
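
A rough sketch of that, assuming the tarballs are sitting in a local directory (the glob is a placeholder):

    git init upstream-import && cd upstream-import
    for tarball in ../releases/foo-*.tar.gz; do
        rm -rf ./*                               # swap in the next release's tree
        tar -xzf "$tarball" --strip-components=1
        git add -A
        git commit -m "Import $(basename "$tarball")"
    done

Diffing two imports then shows exactly what changed between releases, and your own patches can live as commits on top.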


subtree might be better for exactly that? I have a project with about 100 repos, developing with them individually is so much saner, but every so often you want the one actual benefit of a monorepo: fast really-global search. So keeping a subtree-of-everything (using the bitbucket api to get the list of repos) and then just doing git-grep there when I want to see any use of something.

(Somewhere there's an apt hook for "grab sources for everything that apt installs"; having that extract the source and git commit it, or just get the upstream repo from the metadata and `git subtree add` (or pull) wouldn't take much...)
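
For what it's worth, the subtree-of-everything setup is roughly this (repo names and URLs are made up):

    git init everything && cd everything
    git commit --allow-empty -m "root"
    # One subtree per repo, driven by the list pulled from the Bitbucket API:
    git subtree add --prefix=repo-a https://example.com/repo-a.git main --squash
    git subtree add --prefix=repo-b https://example.com/repo-b.git main --squash
    # The one actual monorepo benefit: fast really-global search
    git grep -n "SomeSymbol"
    # Refreshing a subtree later:
    git subtree pull --prefix=repo-a https://example.com/repo-a.git main --squash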


Please never use submodules for anything; they cause a litany of problems now and down the road (have you ever tried deleting or renaming one?). The best thing to do is to pretend they don't exist.


This needs an "April Fools" in the headline so people don't waste time or energy reading it.


Definitely unfair to post this on the second of April without a disclaimer.


I think ideally we'd have month/year timestamps for every link submitted, but that's probably not going to happen, so you end up with instances like this, or months-old news being taken as new, etc.

I think the better solution, which I'm trying to do myself, is noting the date on the article before reading


> 2024-04-01 07:41

Seems to have been posted the 1st of April.


Post this to HN…


Ah :)

Seems it was also submitted 1st of April: https://news.ycombinator.com/from?site=liw.fi

I guess someone else posted it again, and the spam-filter ignored the previous entry as it didn't get any traction then.


I was always surprised that Perforce didn't have anything for open source. Git is basically eating their lunch because monorepos suck so much with git, so people do things in a more fragmented and git-friendly way.

Not that I really want to see a proprietary product succeed, but it's somewhat surprising that:

A) We don't have anything better

B) Perforce isn't trying to gain adoption by giving its software for free to distro maintainers

C) People keep coming up with different paradigms that attempt to solve problems monorepos solve inherently.


Perforce doesn't need to because nothing really threatens it. They're not interested in becoming more known, because they don't need to be.

I can't imagine Debian of all projects using a proprietary VCS though.


> Perforce isn't trying to gain adoption by giving its software for free to distro maintainers

Perforce has been giving its software for free to OSS projects since forever. FreeBSD used to use Perforce, for example.

OSS people and projects don't want Perforce, not the other way around.


FreeBSD developers used Perforce, but the official repo was CVS, and somehow the two synced.


For better or for worse, Perforce seems to see itself mainly in the niche of having binary or media files in source control. They are busy promoting themselves in game dev, automotive, etc, where this is a common pain point.

Maybe they should do more to highlight their strengths with regards to monorepos, but apparently their marketing hasn't identified that as a strength, or thinks it isn't a worthwhile market.


Games I understand, but what makes automotive a place that uses a lot of binary files?


Probably tons of 3D models


From what I've heard, git became much friendlier with monorepos in recent years because of Microsoft's involvement. The Windows repo, which is pretty sizeable, is now using git, and it supposedly works even without strange things like VFS for Git.


It's not only Windows that uses Git at Microsoft, but Sharepoint and Office (which includes the on-prem version of SharePoint). In terms of repo size Windows and Office are similar. I was part of the team that migrated Sharepoint from a Perforce clone to Git and helped build the tooling to allow Office to move as well. VFS for Git [1] and Scalar [2] are really good pieces of software.

[1] - https://github.com/microsoft/VFSForGit

[2] - https://github.com/microsoft/scalar


I hope one of these "large file/many files" extensions becomes part of git core eventually. They are too much of a hassle to deal with and many servers don't even support them.


I would actually think that monorepos suck with Github and others like that. Looking at the Linux kernel for example, that seems to me like a well executed monorepo outside of Github. From what I’ve heard and read, they specifically use Git and mailing lists instead of Github or others because Github wouldn’t work for what Linux is (aside from as a mirror).

There might very well be some context I’m missing, but that’s what I understand from that side at least.


they (well, Linus) invented git for their purposes, and not the other way around. They use git and mailing lists because that's how they work, and that's how they worked before git existed. git was built for them.


Github / Gitlab would work great and reduce friction for kernel contribution. The refusal to switch or provide the alternate path is mostly inertia.


I think you have a starry-eyed view of GitHub... You might want to go read Linus's own thoughts on the matter in the Linux GitHub backup repo

Git is a tool built for a project of the kernel's scope, scale and organization.

GitHub is a thin web interface on top of that; it cuts some corners here and there and gets opinionated about how you should manage code (pull requests).

Think of it this way: most GitHub projects end up with a monotonic output... the kernel isn't that. Between the current version someone is using, the next version that is being developed and the older versions getting backports, there's a lot going on there. Much more than GitHub and a pull request would cover.


Likewise, I think you have a starry-eyed view on how much friction it creates which equals less contribution. People see a bit of code they want to improve, they improve it, and then instead of just opening a PR (and perhaps iterating on it), now they have to learn a weird ancient e-mail workflow and will probably get chewed out for messing it up.

No one is saying to take away mail-in patches but it is positively archaic.


>> Likewise, I think you have a starry-eyed view on how much friction it creates which equals less contribution.

Is it friction? Or is it a filter?

You might remember being a kid and there was the sign in front of the ride that said "you must be at least this tall to ride"... The kernel dev process isn't for casuals. It's designed that way.

There's a lot of folks out there who have popular projects on GitHub who are over the endless stream of BS from AI generated pull requests.

You should really dig in deep to what goes on with the kernel, the work flow, why it is that way and why GitHub is outright incapable of supporting kernel dev (there are reasons)... You're going to look at git in a very different way, and many of GitHub's features are gonna feel on par with LinkedIn adding Twitch-style videos and Zoom adding mail features...


> Is it friction? Or is it a filter?

Friction.

> There's a lot of folks out there who have popular projects on GitHub who are over the endless stream of BS from AI generated pull requests.

So be stringent. First below-par PR gets some guidance, pointers and perhaps a reprimand. Second time, a warning; third time, a ban.

> You should really dig in deep to what goes on with the kernel, the work flow, why it is that way and why GitHub is outright incapable of supporting kernel dev (there are reasons)....

Code is code. If someone has an improvement, they can offer their new code.

I’m sure the kernel has a unique workflow, but it ultimately boils down to that, no?


Sometimes friction is necessary for quality. More often than not, I've seen GitHub repos where the issue tracker is treated as a project manager and there are lots of drive-by PRs.


Github and gitlab are both deeply, fundamentally, flawed because the unit of contribution is the branch rather than the commit.

Other than the UI not being very good, the code review experience is fundamentally hampered unless you enable squashing but that's a bit shit for different reasons.

On a purely UX level too, the velocity of getting patches in is terrible. They're designed for ad-hoc open source contribution, not tight loops of consistent work. People put up with the slowness because they know no different, but I promise it's slow. You shouldn't need to go and get a coffee to wait for something to get merged and start coding again.


A unit of contribution is always a branch, implicitly or explicitly. What you have in your local repository is a branch of what is in the central repository, which is a branch of what you have in your filesystem, which is a branch of what you have open in your editor. In the same way a contribution is always a merge, saving a file is merging the content of your editor in your filesystem, committing is merging your file in your local repository, and pushing is merging your local repository with the central repository.

That's git's main insight: branching is everywhere, so it is designed with branching and merging as fundamental, explicit, and regular operations. Seeing how successful git is, it looks like it was a good choice.

GitHub and GitLab are built on top of git, and follow its principles, so that making the branch the unit of contribution is simply natural.

Of course, you can make single-commit branches; in fact, that's what squashing is for. There is, of course, no obligation to wait for your change to be merged before you start working again: you can start from an earlier version and rebase later, merge back some changes, or do whatever you want really. You can tight-loop as much as you want, especially on your local machine.


The way (say) the kernel uses branches is not very similar to the way GitHub does it, where you actually push a branch and then let the machine do the merge.

It's the same ingredients but it's very hamfisted.

Squashing is like training wheels.

What I was hinting at with the loop concept is that it should be closer to Phabricator than GitLab; let me stack.


I know it's an April Fools' joke. But I don't get this recent trend of people arguing in favor of a monorepo (at my workplace, too). It's a nightmare to handle and use. What gives?


There are real, tangible benefits to monorepos. And real, tangible downsides, too. Like everything else, it's a tradeoff.

I'm sure people advocating for them have brought up some benefits. But in case they haven't, here's an example I recently encountered at work: for an application we are developing, we needed functionality X. We knew that we'd need X in other projects, too, so we made it a library. I published it to our internal package registry and referenced it in the project. So far, so good. Turns out, feature X wasn't as well understood as we thought it was and we are still changing things around all the time. That meant:

* changing the library

* cutting a new release and publishing it

* updating dependencies in the project and using the new feature

Since this was happening much too often for what was often a one or two line change, I moved the library into the project repo and build it as part of the project pipeline. This removes so much friction around these changes, I find it makes a significant difference. Eventually, I'll move the library back into its own repository, but for now the monorepo is the right choice in my eyes.


I don't think this is really advocacy for monorepo but rather advocacy against premature solidification of abstractions. Whenever I encounter a situation like this, the first step I take is to write the code without any API or interface boundaries to worry about. Then if I still feel it would be a good idea to separate it, I will separate it while keeping it in the source tree (this isn't really vendoring or monorepo, it's just abstraction into a module). Once this thing has survived a few rounds of revisions and you're ready to use it in another project, copy-paste it first. Then, once two projects have both had a chance to exercise the abstraction, you factor it out into something totally separate.

This process requires that you are able to work on rewrites and refactorings without having to beg. But it's truly the best way to get from an idea of a library to an actually well designed library.

In some situations it might be you already know what you want and can separate it out from the beginning, but if you are coming up with anything remotely novel, then you want to follow this process.

A monorepo will still contain libraries like these. It will also contain vendored code (although monorepos help reduce the amount of duplicate vendored bits of code). Making such changes will be easier in a monorepo once your abstraction starts to solidify. But the process you should go through should still be similar. The reason for not having two things depend on your wobbly abstraction too early is simply that it will inevitably lead to you prematurely solidifying uncertain abstractions.


Most people arguing one way or the other are in such a small code base that they don't face the problems the argument is really about in the first place. If you only have a million lines of code then by all means use a monorepo, as you probably aren't facing the downsides that make people reach for a multi-repo solution anyway.


The real problems with monorepos are that most of the benefits vanish as the scale increases unless you invest into building more monorepo tooling.

In your particular case, if your library becomes too popular, your one or two line implementation detail changes ripple out and trigger rebuilds of too many downstreams, many of which will have flaky tests and fail your MR. If most users are not actually depending on that functionality, or if you simply are doing semver properly, then you could avoid rebuilding those downstreams. Eventually the builds take too much time and CI rejects the pipelines, or you continuously bump the build timeout but wait longer and longer for your changes to go live. You can solve these problems, if you invest in more monorepo tooling.

Similarly, once you are too popular a library in a monorepo, you will never do any atomic breaking API changes since it would require updating too many downstreams. Instead you will fake version it: add a new API, migrate users to it in multiple commits, delete the old version. Some of these migrations run out of steam midway in the biggest phase: phase 2. This approach does have the benefit of forcing the upstream author to make the two versions of the API co-exist.

Of course I am talking about scales where real limits start to break down. When your codebase is larger than your ram, an index won't fit into RAM anymore and every code search requires disk or network I/O. Eventually your repository doesn't fit on disk anymore and you interact with it with a special IDE and only download things you start editing or perform sparse checkouts in the first place so discoverability is again a problem.

Edit: of course some problems crop up sooner than hard limits are reached, like the flakey test issue I mentioned as well as visibility and control of changes to actual maintainers.


> Similarly, once you are too popular a library in a monorepo, you will never do any atomic breaking API changes since it would require updating too many downstreams.

This happens no matter which repo type. Even worse, if a project chooses to update after a while, it's far more painful having to do the changes after losing the context you had when you did the original changes.

If you want a monorepo, libraries being on the same version is a feature, and it keeps you from diverging.


This doesn't happen in a poly repo because you can just do it. You release version 2.0.0 of something and downstreams update at their own pace. Diverging as you call it.

But this isn't a problem. If 1.0.0 is a finished product then why do you ever need to move to 2.0.0 if you don't need the new features?

The issue in the monorepo is that if you are too popular the change must happen all at once or with copying (fake versioning, like people who version excel files by suffixing with dates), which places pressure on maintainers to not fix design mistakes.

It isn't a feature of the monorepo because you can still diverge by copying, forking or merely stopping support for the old library and this becomes more and more necessary at scale: you lose the feature you thought you wanted the monorepo for.


> But this isn't a problem. If 1.0.0 is a finished product then why do you ever need to move to 2.0.0 if you don't need the new features?

Fair point. I assume (from personal experience at the places I've worked at) that updating the library is inevitable, and doing so at a later date tends to be more painful than doing these migrations all at once.

> It isn't a feature of the monorepo because you can still diverge by copying, forking or merely stopping support for the old library and this becomes more and more necessary at scale

This is a problem if all the projects in the monorepo are not actually related. But imagine if all these subprojects are bundled as one OS image; in that case it is very rare that you want multiple library versions.

At very large scale I can see your point, I don't have experience there so I can't really argue.


I think in a centralized environment (workplace), it could be argued that immediately triggering all the build failures and having good hygiene in cleaning them up is actually not a bad thing. It really depends on how that's set up.

And how is sparse checkout worse for discoverability? With multiple repos it's even harder to find what you want sometimes, if you are talking about hundreds of random repos that aren't organized well.


> I think in a centralized environment (workplace), it could be argued that immediately triggering all the build failures and having good hygiene in cleaning them up is actually not a bad thing.

In abstract I agree. However when I'm trying to get my code working having test failures in code that isn't even related to the problem I'm working on is annoying and I can't switch tasks to work on this new failure when the current code isn't working either.


How could broken code (or broken tests) be merged into master? That is a rhetorical question; of course it happens, and of course this is the root issue you would be facing


There are multiple ways that code gets merged into master and ends up broken.

First, the one where everyone does everything correctly: CI executions do not run serially, because when too many people are producing a lot of code, you need them to run at the same time. So you have two merge requests, A and B, done around the same time; each sees an earlier commit C but not the other. Say merge request A deletes a function or class or whatever that merge request B uses. Of course merge request A deleted all uses of that function, but could not delete the use by B since it was not seen. A + C passes all CI checks and merges. B + C passes all CI checks and merges. A + B + C won't compile since B is using a function deleted by A. If you are lucky, they touch the same files and B doesn't merge due to a merge conflict and the rebase picks it up; otherwise, broken master.

Then you will typically have emergency commits to hotfix issues which might break other things.

Then you will have hidden runtime dependencies that won't trigger retests before merge due to being hidden, but every subsequent change to that repo will fail.

Then you will have certificates, dependencies on external systems that go away.


As you may be aware, 100% broken code cannot be merged. However code that works 99.99% of the time can be merged and then weeks later it fails once but you rebuild and it passes. There are a lot of different ways this can happen.


Things become difficult at scale regardless of mono- or multirepo. You also have to build dedicated tooling if you heavily lean into splitting things into a lot of repositories, in order to align and propagate changes throughout them.


Sure, but polyrepos don't break with scale in the same way as monorepos. You only need additional tooling when you are trying to coordinate homogeneity at a scale larger than your manual capability. Autonomous services don't typically need the kind of coupling without cohesion that people naturally find necessary in a monorepo, and you can build cooperative and coexisting products without that kind of coupling.

When I read the white papers by google or uber on their monorepos, when I see what my company is building, it is just a custom VCS. Everything that was thrown away initially gets rebuilt over time. A way to identify subprojects/subrepositories. A way to check out or index a limited number of subprojects/subrepositories. A way to define ownership over that subproject/subrepository. A way for that subproject/subrepository to define its own CI. A way to only build and deploy a subproject/subrepository. Custom build systems. Custom IDEs.

The entirety of code on the planet is a polyrepo and we don't have problems dealing with that scale like we would have if we stuffed it all in one repo like this debian monorepo shows. Independence of lifecycle is important, and as a monorepo scales up people rediscover that importance bit by bit.


> for an application we are developing, we needed functionality X. We knew that we'd need X in other projects, too, so we made it a library

This is the mistake: you should make it a library when you do need the functionality, not when you "know" you "would need" it.


I disagree. If you will really need it, then just make it a library. This is not 1960 when we were still figuring out software. You are not writing something new anymore; the problem has been solved enough times that you can look around and figure out what will really be needed in other projects and what will not. This is something your architects should be doing, as getting it right is important and you will regret not doing so (when you have 10 different implementations of the same thing, all slightly different).

Now I will agree that developers are often wrong. We often think something we write is more important than it really is and so make it reusable when it never will be. We often think something won't be reused when it really would be. This is a hard problem, but that doesn't mean you should not think about it and work hard to get things right.


> Turns out, feature X wasn't as well understood as we thought it was and we are still changing things around all the time.

If you publish something as a library for someone else, you should generally commit to supporting the public interface of that library as if it were a long-term support release. Exactly how long that should be depends on many factors, of course. But as a rule of thumb, I would recommend supporting this library version for at least two years. If that doesn't make sense, don't release it or look for a subset, where it makes sense.


It depends.

If you drop a library for internal usage only and you want to change the contract, and the tests, library, and api are all in the same repo, you just change them in a single PR and that's done. This works because the single commit hash contains all information. It requires people put their code in the monorepo and hookup their tests, but assuming they do, you can build a reasonable degree of confidence on a green build.

Once you separate the library into a different repo, and let people consume that as something in their own repos, you need to do the versioning dance, as you have no idea if their code is still working.

A lot of people go for the latter because it contextually allows them to ignore the rest of the stack, and there are some pros to doing this, but testing, deploying, versioning, etc., all become more difficult, and that's something people struggle with.

Thus, unless you have a crap load of code / commits, monorepos arguably have more advantages than disadvantages.


If the same problem doesn't exist in a monorepo in a different form then your project is not large enough to be in this discussion in the first place.


It's actually just tradeoffs that we should be able to have professional discussions about; otherwise you're advocating for "always use a monorepo" unless your library is for public consumption.

The reality is, at some point, having repos that are hundreds of GB with hundreds of active PRs also has its own downsides that requires tooling and workflows to combat. Meanwhile, splitting it all up introduces integration problems.

It's definitely pick your poison, though specific requirements and circumstances make specific paths more or less potent.


The problem is that many people arguing for a monorepo feel like, at the size of their project, breaking into multiple repos would mean putting each function in a separate repository - which is obviously absurd. It is good if your project is that small, but you also are not facing the same pains as large projects are.


This is exactly the friction/overhead the other person means and what monorepo works around.

(but also keep in mind this is an internal library for likely a couple of internal apps, supporting BC for 2 years is in most cases unwarranted)


Monorepo doesn't "work around" it, it means you have to support everything (including migrating/BC) on the current version, instead of having to support old stuff on the old version.


You can see where the “library” is used. Often the bit you want to change is only used by your project so you can change it freely. If someone else is using it you know who to talk to about changing it.


You can “fork” inside the monorepo too, just take a copy.


This guidance is far too specific to make sense in the general case. Not every scenario has the same constraints. If I had to pay a 2 year maintenance cost for every library I write and wanted to reuse between projects I'd never end up writing it. If you can migrate all library clients over to the new version, there is no reason to maintain old versions.


I would not call this a "library", but a set of interrelated projects with "shared modules" instead. With a "library" I mean a piece of code where the publisher does not know who is using it for what exact purpose. Of course we could now fight over a proper general definition for "library", but this is how I would use the term in this context, because the long-term aspect is what really makes outsourcing such code meaningful -- in contrast to the parent's observation that it does not make sense to incorporate foreign code that constantly changes.


The argument then is also that monorepo allows you to use this as "shared modules" instead of published "libraries".

The point is you have some code that you want to reuse, and either within a monorepo or between multiple repositories.


The question is when it is wise to do so. Changing the public interface of a "shared module" or published "library" comes with a liability. Code reuse is almost a no-brainer in the case of a standard library that rarely or never changes. Using a permanently changing module whose modifications provide no real benefits for your project makes no economic sense. The sweet spot is somewhere in between.

What the parent actually did when copying the library/module code was creating a sort of long-term support version for himself, maintained by himself. His cost-benefit analysis told him that this is better than always trying to keep up with the changes of the library/module.

My comment was aimed at setting the bar rather high for when to share reusable libraries. And even for my own modules I rather prefer copying code instead of reusing it in different contexts. There is a reason why the contexts differ, and it is typically easier and more economic to address these differences in their particular place instead of preparing for them in shared code.

The more up-stream a library/module is, the more general it must be, but the less liable to change it should be.


With a monorepo he wouldn't have had to worry about this decision, since the cost of going back and forth would have been much lower. It's a low-cost decision either way, with low impact if you make the wrong one.


This is exactly why a monorepo is advised: you do not need to keep and maintain useless code for years. Instead, you can "git grep", then write and submit an MR to quickly change all users.

I have never encountered this use case, tho
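
To illustrate the kind of repo-wide change that enables (the names are purely for illustration):

    # Rewrite every caller of the old helper in one sweep, then one commit:
    git grep -l 'old_helper(' | xargs sed -i 's/old_helper(/new_helper(/g'
    git commit -am "Rename old_helper to new_helper everywhere"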


This isn't wrong, but somewhat orthogonal. Yes, I "published" it on our package registry, but it wasn't public. Access was still scoped to the team developing the application I was speaking about.


This is orthogonal to mono-repo vs not mono-repo and indeed distribution mechanism.


I find this too at work all the time. People are very quick to branch off some “independent” feature into its own service or lib and repo. Great for them when building, tiny repos do have that greenfield clean slate feeling, but awful when you realize the integration points aren’t as simple and well understood as they first thought.


Obviously that depends on your language's tooling and your company's processes, but directly depending on (a specific branch or commit of) a git repository rather than a release is a great in-between solution that allows you to have separate repositories while still allowing fast iteration.

Obviously you want to switch back to a release model once the feature stabilizes, to have all the advantages your internal package registry brings you.
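
With pip, for example, that in-between step can be as simple as pointing at a branch or a pinned commit (URL, ref and version are placeholders):

    # Fast iteration: depend on a moving branch or an exact commit of the library
    pip install "git+https://git.example.com/team/mylib.git@feature-x"
    pip install "git+https://git.example.com/team/mylib.git@3f2c1ab"
    # Once the interface settles, switch back to the internal registry release:
    pip install "mylib==1.4.0"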


> Since this was happening much too often for what was often a one or two line change

The famous 5 minute project.

Happens in coding, car repair, and interstellar travel.


That's not a monorepo.


NixOS uses a monorepo and I think everyone loves it.

I love being able to easily grep through all the packages' source code, and there are regularly PRs that harmonize conventions across many packages.

Nixpkgs doesn't include the packaged software source code, so it's a lot more practical than what Debian is doing.

Creating a whole distribution often requires changes synchronized across many packages, so it really makes things simpler.

https://github.com/NixOS/nixpkgs

I think it's important to add that they both have the biggest number of packaged software and the most up-to-date software of all distributions even though they have far fewer maintainers than Debian.

https://repology.org/repositories/graphs


Is nixpkgs really a monorepo as usually discussed? It's got packages, os modules, and some supporting scripts. Nix itself lives in a different place, so does the RFC planning, system artwork and a few other things. I would expect it all together to become a monorepo.

> even though they have far fewer maintainers than Debian.

Debian has ~1.5k contributors, nixpkgs lists >3.5k maintainers. (Although that list is not pruned over time)


> Is nixpkgs really a monorepo as usually discussed? It's got packages, os modules, and some supporting scripts. Nix itself lives in a different place, so does the RFC planning, system artwork and a few other things. I would expect it all together to become a monorepo.

I think we can say it's a monorepo of packages in this context. Not everything from the Nix ecosystem is there. It could also bundle the website, wiki, doc etc. but I don't think it matters too much.

> Debian has ~1.5k contributors, nixpkgs lists >3.5k maintainers. (Although that list is not pruned over time)

Thanks for the info, I heard that a long time ago and never checked myself! It's probably less as you say but still probably bigger than Debian.

I guess it makes sense because it's so much easier to contribute there than to Debian.


> Nixpkgs doesn't include the packaged software source code, so it's a lot more practical than what Debian is doing.

This is the key difference. OpenWRT is a "monorepo" too, but it downloads tarballs etc. of the upstream software. Makes sense.

Putting the source code of the upstream software in a monorepo sounds like a nightmare…


> I think it's important to add that they both have the biggest number of packaged software and the most up-to-date software of all distributions even though they have far fewer maintainers than Debian.

Only because of https://www.reddit.com/r/NixOS/comments/zp95a2/comment/j5ko9...

I also had plenty of issues with either packages not being available or not building, last I tried. At least with the AUR, it's generally only the latter you have to worry about ;)


Yeah, automation in Nixpkgs is crazy and goes far beyond any other distros (that I know of).

Why spend manual time when it can be automated?


There have already been a lot of points by other commenters, one point that I'd like to add:

It allows you to solve "problems of scale" with technical solutions rather than process solutions.

For example: Let's say there's application app-A and app-B, and they decide they want a library lib-1 that they can share.

If they are in separate repositories, this means multiple pull-requests, it means separate pipelines where the pipeline of lib-1 likely won't include the tests of the applications, it means there will be pull-requests to the library which won't immediately be integrated into the applications so that some poor sod has to take care of breaking changes down the road, etc.

If they are in a monorepo, each application can set up tests that need to be fine with the library code "as is", so any change to the library needs to work against existing application tests. The price one pays, however, is pipelines that need to perform well - nobody wants to wait hours to get pipeline feedback and such. A monorepo also allows ambitious folks to shine - it's easy to touch many things at once, or to touch things used by everyone, on a pull request with high visibility, which is an easy platform to get "street cred" as an ambitious engineer.


Having to switch branches in different repos to feature-xyz, remember to build it, and then finally work/test is more of a nightmare to me. In a monorepo, you just switch to the branch once, the build system builds everything that needs building, and you are ready to go.


The opensource tooling to handle large monorepos is just not there.

The monorepo at Google with the massively distributed build system and all other tooling make it very effective to work with.

You cannot just take one part of that system and expect to magically reap the benefits.

The main benefit I see in practice is that you can immediately see which parts of the system are broken by your change as you test it (as opposed to waiting for integration to happen later, when dependencies between modules are bumped in a multi-repo)


>You cannot just take one part of that system and expect to magically reap the benefits.

Yeah, specifically people only focus on the "repo" bit. Build system, PRs, history browsing, etc all get handwaved away after you stick all the code in a single repo. These are the extra hard parts though!

I think if you were to properly implement "monorepos" in the git world it would actually look a lot like a GitHub _organisation_, rather than a GitHub repository. Each git "repo" in the "org" would be a workspace with its own segregated UI so that you can, for example, check out just that workspace, or see only that workspace's commits or issues -- but some features such as GitHub Actions workflow triggers and PRs would be able to span multiple workspaces.

Github doesn't really seem to have put much thought into really supporting monorepos though. They have added some support (e.g. codeowners) to support "repo"-as-monorepo, but have also added some features that could go towards supporting "org"-as-monorepo (e.g. dependabot dependency updates and the dependency graph). But it all stops well short of a complete monorepo toolset.


I agree it's definitely an underserved tooling area. But so is a maze of submodules which is the realistic alternative. Open source just doesn't care too much about whole-company level projects because those don't really exist in the open source world.

That said, you can definitely make it work with Git & Bazel. Beats submodules by a mile.


Oh yes.

Also another thing: often people end up with the worst of both worlds: one (or two!) big monorepos plus a swarm of many little repos. Now you only pay the price of the monorepo but rarely ever notice the benefits


> It’s a nightmare to handle and use.

Why? If all your work and knowledge is confined to one repo, there still is a way to work only on that part. Or at least there should be.

But if you want/need to have access to other parts, it's much easier to use a monorepo than to search through hundreds of GitHub repos that are somehow dependent, but it's not obvious in what way.


it requires a lot of discipline for sure.

"It depends" is thrown around a lot; what I've seen working for large companies and on large projects (mostly AAA games, which do use a monorepo typically) is that having a single repo means your dependencies are more likely to be vendored, and updating the vendoring re-runs all tests across all services because dependencies are mapped across the whole project.

Incremental build is also something that becomes more possible, rather than constantly building all source from scratch as we tend to do for smaller projects.

There's a lot to be said for atomic commits, no "merge trains"[0], and things like a "common library" (that everyone ends up building in a large enough team) being just includes and not entire dependencies with all that entails (versioning, updating dependencies, etc.).

Having everything in one place also tends to force people to put documentation near their code instead of something else like confluence.

It's pros and cons, and the pros are kind of meek-sounding until you've experienced it.

[0]: https://docs.gitlab.com/ee/ci/pipelines/merge_trains.html


> It’s a nightmare to handle and use

Quite the opposite, IMHO. Juggling dozens of PRs across many repos for a simple change is the definition of a nightmare.


That tells me you prematurely split something. I worked on a high-scale, highly available system with multiple dozens of teams, each with several repos. If we had to sync changes across repos, we joined the repos because they were, well, joined. In the early days, we had shared libraries and keeping those in sync was a pita. Then we realized we should not be doing what we were doing.

The flow should be: (1) update a library and release it, preferring backwards compatible changes or a new major version; (2) code that wants/needs the new functionality updates the usage at its leisure. If you are having to keep multiple repos "in sync" with each other, yes, you picked the wrong abstraction. If the downstream repo(s) need to always be on the latest and greatest version of the library, the library code should be vendored with the calling code.


> (2) code that wants/needs the new functionality updates the usage at its leisure.

That's how you end up with infinite support costs in a large org. I think the point is that these changes are done atomically.


that is the theory. The reality is that your team can work on fixing an issue now, but the consuming team might be able to get to it next quarter due to competing priorities. Getting multiple teams to agree to work together at the same time is much, much harder than allowing teams to update on their own schedule. You do eventually have to press the organization and deprecate the older thing.

The alternative is you do the work for them - which you may or may not have the expertise for, and the other team may not even have time for the needed code review.


You can't deploy a distributed system atomically, and the closer you get the more risk you're taking with prod. The old and new versions need to coexist for at least a couple of weeks in case of rollbacks.


Agreed. For the last three years I’ve been working at a company with literal thousands of repos, and it’s an actual nightmare compared to a well maintained monorepo (eg Facebook/Meta). Sometimes updating some library code means I do 50+ PRs across random repos.


You’re comparing two extremes, and your last sentence just indicates that the modules don’t seem to be properly divided along stable abstraction boundaries.


Having lots of small repos also generates overhead: they need to be checked out separately, tagged individually, built individually, etc.

The reason that this is a bad April 1st joke is that the idea actually has merit. Google runs their company like this. The unthinkable bit is not doing this technically but Debian rearchitecting the way it works. Doing this with Git is a separate topic. It's probably not great for this. But monorepos at this scale are not impossible.


There was an article by some well-known CTO that hits the nail on the head for me. Essentially they said that a monorepo makes sense at small scale and enormous scale (eg Google), but anywhere in between microservices make more sense. Knowing when you are going to be at those scales is key.

In any event, as has been alluded to in this thread already, any complaint about microservices vs monorepo and switching between them is almost always a tooling complaint. Companies will invest in one of these “philosophies” around code organization then won’t take the time to build the appropriate tooling to support them and then three years down the road some enterprising architect will switch from one to the other and the cycle begins anew


It’s a nightmare only if you think of the monorepo as just throwing everything in one place (aka just a technical choice).

A monorepo requires org changes. It requires staffing teams to handle processes and tooling for everyone else. (And it may require something else than git… but git will work OK for reasonably large ones anyway.)

I wrote about some of this here: https://jmmv.dev/2023/08/costs-exposed-monorepo-multirepo.ht...


FWIW "Monorepo requires people to work on it" is organisationally a bit of a cop out, because everything requires this kind of work; it's just a difference between each team dedicating some time to it in their little bubble (in which they do things slightly differently) versus doing it once at a larger scale centrally - this requires some investment in the form of salary, but you would still spend this money anyway, you just don't notice.

Boiling frogs and all that.

You'd also be surprised just how little work this actually is if you get it right, versus having to fix 50 different hacked up solutions across your company. It does put strain onto some tooling, but these days git seems to be fine well beyond kernel-scale - when you are this size you have "made it" anyway.


Linux kernel is a monorepo and somehow works. On the other side of the spectrum we have leftpad nonsense and latest xz thing.


We have 4 internal-only projects that all kind-of depend on each other and are all rolled out at the same time:

backend, frontend and two shell wrappers for the frontend (electron for windows, cordova for ipad)

The total code base is still small enough for network/file system to not care (~10 devs working 10 years.)

Since we're on Azure DevOps, we currently have to create a pull request for every project. If you change the naming of an API variable, you'll have to create two pull requests. You'll have to review those individually. The automatically triggered test runs will fail against the respective "old" version of each other and create noise.

If, instead, we'd have a structure like this, the changes could be together:

    /src
      - backend (C#)
      - frontend (TypeScript)
      - ...

Note that we're not planning to have a "shared" library or "common" editorconfig files, or changing _anything_ about code internals. Just tracking the folders in git together.


I struggle to understand the upside of avoiding monorepos. A lot of people think it means you can't decouple releases, modularize builds and similar with a monorepo, but that's simply not true. The efficiency at which you can refactor and upgrade your codebase cannot be overstated.


The main upsides I see are:

* you can pretend you don't need a proper sandboxing build system like Bazel, Buck etc. and stick to Make/CMake etc.

* you don't need to learn how to use sparse checkouts

It basically lets you avoid learning how to solve the problems you'll inevitably run into for a little while. Not a great reason, but it is a reason.


every monorepo I've seen suffers from a shit CI solution. Other teams break your builds. Builds get longer and more complex. Feedback loops slow down. Engineering velocity slows down. It works at Google because they spent, literally, hundreds of millions of dollars getting it working.


It's not even clear whether a monorepo stretches across teams or not. I think many people are arguing about different things.


It probably depends on your project size, but I am one of those that find them easier to handle.

What do you find nightmarish?


All of the fastest-moving, easiest-to-contribute-to projects I've seen use monorepos, or at least understand the basic principle that things that are released together and ship together (and probably depend on each other) should be developed together.

So many of the supposed problems with monorepos (other than those of pure scaling) seem to basically boil down to "Dr Dr it hurts when I do this"

For example:

"What happens if two teams depend on different versions of the same internal library" - sounds like a reasonable concern no? No! This situation is fundamentally a disaster. On many levels.


It's much nicer to have everything you could ever need in one place, instead of having to know what is in what repo. Especially when you have versions that need to correspond in the different repos...


a hundred small repos are a different kind of nightmare.

the truth is, having a large code base is just hard no matter which way you handle it. you'll end up with custom repo tooling for the monorepo or blown up CI/CD infrastructure for many small repos either way. complexity will be conserved; it can be transferred, but can't be removed.


baq's law of code thermodynamics: complexity cannot be removed. it can be transferred but can not be removed. your CI/CD system will be as complicated as your repo structure isn't.


Saving


Most developers jump on the hype train and don't stop to think about ramifications or requirements or purpose. Having said that, monorepos have their use cases, and they help eliminate the need to have convoluted tool chains whose entire purpose is to ensure version compatibility between various components of the system... before you can even start making a line of change on the dev machine.. or some asinine docker abstraction that you must spin up... essentially, if everything is in one place (and not conceived by antisocial pranksters), you bet it's meant to be together. This makes it easier for any number of different life cycle teams to pick up a project and not waste time chasing down silly time wasting issues.


Glad to see I’m not the only one thinking that.


Annoyed reading this on the 2nd, but also a tad fooled, because I thought of this in the context of a nuclear-option remedy to the XZ fiasco that's still lingering in all our heads.


April fools or not, Debian actually does need this. NixOS's influence looms large.


I wish that half the executive initiatives I've seen ended the same way as this post.


It's worth noting that Void Linux (and maybe others) does indeed keep all its (largely declarative) package recipes and build tooling in a single git monorepo, although it does not keep copies of the upstream source, because (package, version, url, checksum) is assumed to uniquely and reproducibly identify the source, and some of the stated advantages are real:

>Simpler collaboration: every package uses the same process, and the same tools, and it’ll be easier than ever to help with other people’s packages.

>Enables distribution-wide changes in general: With all the source code for everything in one tree, in one repository, it’s feasible to make changes to Debian that affect many packages. For example, back in the day Debian took seven years to migrate /usr/doc to /usr/share/doc, and that can now be done in one commit.


Curious, what challenges would git still have to deal with to work with big repos?

500 GB isn't all that much data in 2024, as long as you can resume a broken network connection on fetch (like using HTTP range requests over http, and when using Git LFS). And while 15 million files is a pain in most file systems because of all the overhead, a relatively modern NVMe SSD should be able to cope with it just fine. But I don't know enough about the innermost internals of git and how it manages files. Is there something that's inherently slow/expensive that doesn't scale well with the number of files?

(Basically, I'm trying to find a non-rhetorical answer to the question of "Why not?")


Git does pretty well on large repos, when you use sparse checkout and Git LFS. It makes Git behave like a centralized VC system. https://www.anchorpoint.app/blog/scaling-git-to-1tb-of-files...
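
Roughly what that looks like in practice (the URL and directories are placeholders):

    git clone --filter=blob:none --no-checkout https://example.com/big-monorepo.git
    cd big-monorepo
    git lfs install                         # big binaries come down lazily via LFS
    git sparse-checkout init --cone
    git sparse-checkout set services/my-service libs/shared
    git checkout main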


That would be ok if git gave the option to track branches in subfolders and leave other branches alone. I had a look at submodules but I'm not clever enough to understand how it could help me.



Thanks, this is the pointer I was waiting for!


You could easily do it with Subversion (SVN), and only clone the directory you need.


git clone --sparse --filter=blob:none --depth=1


I wonder how many xz-like backdoors could be in there.


Disappointed to read that this is an April fools. I honestly believe that practical application of software freedom is held back by tooling, and that through the use of their monorepo and build system Google can actually benefit more than normal users from software freedom.

https://blog.williammanley.net/2020/05/25/unlock-software-fr...



