The benefits of having one repository do not seem to be worth the serious performance issues, the potential coupling that can happen with a gigantic code base, and the added difficulty of open-sourcing certain parts.
e.g. why doesn't FB use dependency management (or binary package management a la Maven, npm, etc.)? That is, have multiple repositories and cut releases, then build tools to help cut and manage the releases and their dependency graph.
There are even plugins and shell scripts that will make Mercurial act like a mono repository for many small repositories (I use it for our own Java Maven code base).
I must be missing some killer features and would love to see it in action (FB major repository).
An example of utility of monorepos is things like automation--if you change, for example, how you publish packages, and you need to maintain multiple stable branches, it is immensely useful to keep the automation steps in the same repository as code. If you don't, your automation repository then looks like
if version < 31:
    publish_the_old_way()      # hypothetical legacy publishing steps
elif version < 35:
    publish_the_interim_way()
That said I have no idea how Netflix does source control, only that they use BitBucket/Stash on-prem and that supports both mercurial and git.
The only immensely dangerous thing that can happen is if you drop your package repository or change formats of the repository which rarely happens.
And every update to the code requires other teams to then update their library/app to use the new version, and apps depending on that dependency... you get my drift.
You end up with a complicated dependency headache which hurts productivity.
With a monorepo, you can statically identify all places where your library is used and update those automatically with tooling, or manually. You can also monitor where and how a piece of code is used across the company, etc.
If you then ensure that only commits get accepted that pass the relevant tests, you end up with a sane and working HEAD that's always up to date.
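A toy sketch of that gate (everything here is hypothetical; real systems map changed files to build targets through a build graph):

```python
# Hypothetical commit gate: accept a change only if the tests of every
# affected target pass. The target->files mapping and test runner are
# stand-ins, not any real build system's API.
def affected_targets(changed_files, target_deps):
    """Targets whose source files (including their dependencies' files) changed."""
    hit = set()
    for target, files in target_deps.items():
        if set(files) & set(changed_files):
            hit.add(target)
    return hit

def gate(changed_files, target_deps, run_tests):
    """Accept the commit only if all affected targets' tests pass."""
    return all(run_tests(t) for t in affected_targets(changed_files, target_deps))

# //app depends on lib/util.py, so a change there affects both targets.
deps = {"//app": ["app/main.py", "lib/util.py"], "//lib": ["lib/util.py"]}
ok = gate(["lib/util.py"], deps, run_tests=lambda t: True)
print(ok)  # True: both //app and //lib tests ran and passed
```

The point is that the gate runs the tests of *everyone affected*, not just the author's project, which is what keeps HEAD green.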
Of course there are lots of drawbacks as well.
But Google and FB seem to have concluded that this is the better approach for them.
>With a monorepo, you can statically identify all places where your library is used and update those automatically with tooling, or manually. You can also monitor where and how a piece of code is used across the company, etc.
Aka you have to figure out dependencies regardless of monorepo or not; aka you need dependency management. Of course you could make the argument that everyone has to use the latest and greatest, but you now have the possibility of changing one dependency and requiring a redeploy of the whole company... I have seen what people do in these cases during desperate times: they copy the code and put it in their own project, which sort of defeats the purpose and, worse, is now not tracked (well, I suppose with a good enough SCM comment it sort of is).
> If you then ensure that only commits get accepted that pass the relevant tests, you end up with a sane and working HEAD that's always up to date.
It depends on workflow. For some, HEAD is what is actually deployed. If that is the case, it is fairly difficult to achieve that goal with a giant monorepository if you have tons of teams doing microservices and deploying often.
Or you can have your tooling use the convenient built-in versioning provided by the VCS. It's not called a "version control system" for nothing.
> And every update to the code requires other teams to then update their library/app to use the new version, and apps depending on that dependency... you get my drift.
This happens regardless of whether the codebase is organized as a monorepo or not. I've been subjected to monorepos at my previous and current employers, and we run into difficulties with this all the time. At least once a quarter, I get an e-mail from someone telling me that they updated such-and-such, and now the build fails in one of my projects. So I have to sit there and figure out wth they changed, why it's causing my code to fail the build, and how to fix it. Only once was the problem actually in my code (my Makefile, actually, which failed to bind some variable that was expected by the build system but wasn't documented anywhere, and of course the build system spits out a worthless error message, but I digress).
> You end up with a complicated dependency headache which hurts productivity.
Again, monorepos aren't immune from this, nor do polyrepos inherently suffer from it.
> With a monorepo, you can statically identify all places where your library is used and update those automatically with tooling, or manually. You can also monitor where and how a piece of code is used across the company, etc.
There is literally nothing preventing this from being doable with a polyrepo. In the case of updates, you may need to have the updated repos checked out, but it's not like you can only ever have one repo checked out at a time.
I've never worked anywhere on any project that disallowed commits that didn't pass the tests. It's typically been up to the committer to ensure that what they commit is acceptable. Furthermore, while my employers thus far have all been customers of AccuRev or Perforce (whose only real feature beyond what Subversion offers is that they cost lots of money), and have not personally had the pleasure of working with a DVCS at my day job, this workflow of "allow only test-passing commits" is wholly contrary to the "many small commits in quick succession" workflow afforded by DVCSs, which I contend is one of the biggest productivity boosts afforded by DVCSs (along with cheap branches and network unnecessity) and one of their main draws. And even so, it's still rather easy to maintain a sane and working HEAD with a DVCS (thanks to the cheap branches if nothing else).
If you find that you can't do anything you've listed here without a monorepo, that's a tooling issue rather than an organizational one. Frankly, I've never seen a large monorepo I felt was justified. Every single one was just baggage held over from a time long past when a single repo still made sense for the codebase.
* Ability to change an API and all its users at the same time.
* Circular dependencies become a non-issue in a lot of cases where they would be if you vendor your dependencies.
* Even if you vendor your dependencies, hunting for bugs is a lot easier: without a monorepo, your bisect of a bug in a library will just come down to the commit that upgraded it from 1.0 to 2.0; with a monorepo you can bisect anything down to a specific commit.
* Code discoverability / mass edits are a lot easier, and integrate seamlessly into a lot more tools than if you need to manage multiple checkouts etc.
The second biggest win: we used just enough of a module system to enforce modularity, by having a single code repo but separate compilation units. Circular dependencies between subprojects would show up at compile time (or at least on a clean build, which is what our CI machine did). That stopped a lot of obscure runtime issues and made people think about what they were trying to do.
Because it's a pain in the ass and a monorepo means you don't have to deal with that crap for internal code. It's especially important for refactorings or API changes where you can atomically perform a change at once rather than have to wait for the changes to trickle down across hundreds of repositories over days or weeks.
> potential coupling that can happen with a gigantic code base
The coupling is a feature, not a bug. Decoupling is a means, not an end, and it has costs; if you don't need it, you'd rather not pay them. Same as generic collections: if you don't need them, there's nothing wrong with specific ones, quite the opposite.
The fact that instead of "hg ci" I can type "thg ci" and get a commit window with cherry-picking and meld integration right off the bat is really powerful. I've not seen any good git GUIs that come close to its feature set.
I even use it for Git with hggit sometimes.
And coupling is what is explicitly being sought, not rejected. The whole point is to build everything off head, keep head sane at all times, and avoid version dependency hell.
- Do you have zero external dependencies on third-party libraries not owned by Google? Those must still be managed somehow. How do you deal with project X not being ready to move to external library version N while project Y needs version N?
- What about release branches? If you need to integrate a bug-fix in a sub-system, the magic mono-repo now means merging a fix is harder, as it may depend on other unrelated changes all over the repo. The laissez-faire attitude of not having to care about details in HEAD would seem to bite back in release branches.
I own third party policy, so I can answer this. I'll stick to public info.
There are thousands of third-party libraries. The rule of the shared codebase is "third-party libraries are not free headcount". You choose whether you use them or not (implicitly or explicitly). If you add a third-party library, you get to maintain it and stay within the support horizon of upstream. If you choose to use one, you get to stay up to date with the upgrades others make. If you need features not in the current version, you get to upgrade the library (and work with teams, who are not allowed to block you).
This is pretty much the only way to make it all work in practice (I'm aware of how it sounds). In practice, even upgrading stuff literally every Google target depends on takes a week. It's only stuff where folks let it go for 6 years [we now have better detection of out-of-date code] that becomes a problem to upgrade.
If you don't like it, don't use third party code.
Note that these rules make it just like any other code, because your problem is not specific to third party code.
Note that binary versioning, etc., is pretty much always a complete disaster in practice at a large scale.
"the magic mono-repo now means merging a fix is harder as it may depend on other unrelated changes all over teh repo?"
This is rarely true in practice, because usually fixes are targeted.
> An area of the repository is reserved for storing open source code (developed at Google or externally). To prevent dependency conflicts, as outlined earlier, it is important that only one version of an open source project be available at any given time. Teams that use open source software are expected to occasionally spend time upgrading their codebase to work with newer versions of open source libraries when library upgrades are performed.
So if an external library has to be updated, the entire codebase must be migrated all at once. Then Rosie is used to split the change into a lot of smaller changes (to be reviewed by all affected teams). Once all the smaller changes are LGTM'ed, it's submitted all at once.
I also would say "only check in what you change" is a simpler rule than "check in all dependencies needed".
I understand company source code needs to be different because of security reasons and convenience, but should it really be that much different from any OSS project (whose maintainers, I guarantee, would be pissed if you checked in all your dependencies)?
> I think its insane not to have dependencies and third party code in your scm
Conversely, there are some who might think the complete opposite (I wouldn't call it insanity, but I would say it's probably not the right thing to do in most cases).
So every commit in every repo has a map that maps the repo path to a commit hash.
You can use this info to sync the versions of other repos when you update/checkout a version in any of the repos. And all of these can be scripted.
Because then you can keep the repos separated. Not saying that you should. But if this is the only problem that forces you to use a monorepo, then I am just asking whether that problem can be solved in this manner?
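A minimal sketch of what such scripting could look like (the manifest format, file layout, and commands here are hypothetical): each commit carries a manifest mapping sibling repo paths to commit hashes, and a script derives the checkout commands that pin every repo to its recorded revision.

```python
# Sketch: pin sibling repos to the revisions recorded alongside a commit.
import json

def load_manifest(text):
    """Parse a manifest mapping repo paths to commit hashes."""
    return json.loads(text)

def sync_commands(manifest):
    """Return the VCS commands that would pin each repo to its recorded hash."""
    return [["git", "-C", path, "checkout", "--detach", rev]
            for path, rev in sorted(manifest.items())]

manifest = load_manifest('{"libs/core": "a1b2c3d", "services/api": "9f8e7d6"}')
for cmd in sync_commands(manifest):
    print(" ".join(cmd))
```

In practice you'd run these commands (and generate the manifest on commit) via a hook or wrapper script; the point is just that the mapping is mechanical and versioned with the code.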
>hacky scripts that everyone who wants to use the repository must follow when you can just let the VCS itself manage everything for you?..
What hacky scripts? Do you think they (Google, Facebook, etc.) don't have enough resources to build a "non-hacky" script for whatever they need done?
From talking to some companies that use monorepos that want to use Rust, they _do_ use these kinds of tools, but want support for it in the tool. Firefox, for example, is pretty much a monorepo, and we built tooling in Cargo to help.
I think that part of it is that "monorepo" can mean slightly different things. Does it only contain your company's code, or also your dependencies? Do you modify those dependencies in-repo?
I can definitely see an "app" (Firefox) as a monorepo but not giant web companies (particularly with the surge of microservices).
BTW Cargo is fantastic (and I'm fairly picky on package management / build tools).
Regarding why the hell you would do it: in a corporate setting you tend to throw all kinds of stuff into your repo. In my workplace that includes network configurations, 200 MB binary tools, and usernames (and who knows, probably passwords too).
We do this because we want all the different things, including external dependencies and "ops stuff", to be synchronised. That would be totally insane in the open source world where you are releasing to the public -- but it seems corporations can make it work internally.
I know personally that I have had to move components into separate projects (not necessarily separate repositories, but separate compile units) to keep developers from accessing things they shouldn't (e.g. accessing the database directly instead of going through a different layer).
Separate repositories would be enforcing it to a greater extent. But like I said earlier I suppose I could see it the other way (ie visibility and thus prevention).
Even in a distributed team like linux kernel development, code maintainers have the ability to say "this is how we (as core maintainers) want you to access certain data structure. If we catch you doing anything else illegally, your code won't be accepted, and we will keep rejecting your code until you adhere to our coding style and api design."
Separate projects allow less namespace collision accidents.
You could achieve all of this with monorepo but requires proper tooling.
But a whole company as big as FB and Google using a single tree seems like an incoherent nightmare without proper (somewhat proprietary) tooling (filtering of logs, branches, tags, etc).
Seems kinda backwards to me, why not project portions of a monorepo to behave like individual repositories? That way you get atomic commits too.
But on the other hand you now need custom scripts to filter out other projects that are completely irrelevant to you. Just the .hgtags file alone must be a 10k-line nightmare... or maybe the big boys just don't deploy/release often. It is a lot more tooling than you might think.
Linux and Firefox having a monorepository is not the same as a whole company as big as google and facebook having a single repository for all of their projects.
I'm not even sure if these companies really are doing that. The use of monorepository could just mean a single place of storage but maybe they do allow a couple of trees (or maybe not).
Your "working state" is now in an image (maybe a Docker file that checks out particular revisions of different repos), so putting everything into one giant VC repo is not necessary.
A versioned Dockerfile that says "this is how we built SystemX at version 1.5" is much better than doing it in Git, as it also covers how the underlying server was built, while a giant git repo might have every application you need at the correct version, but won't say anything about the server it is deployed to.
I checked it out years ago, but pretty much settled on Git.
What are the advantages?
2) As well, Git has multiple client implementations (like git, egit, jgit, etc.). Adding new features is a bit more complicated, as all the implementations need to add them before they can be more widely used. Mercurial has one implementation that everyone uses, so new features are easier to add.
3) The .git structure is simple, which is great, but it's become the API for git in a way. While mercurial explicitly says you should never rely on the structure of the .hg directory. If you want to interact with the .hg dir from other software, you should either issue 'hg' command or start up a command server to talk with it. So it creates a cleaner API barrier. Because of this, the Mercurial team can make changes to the .hg dir to better serve different needs (like those of a mono-repo), without breaking the world.
The nice thing about this is you can present the same logical model while being flexible about the way that model is persisted, unlike Mercurial, which has a fixed file format upon which operations are based.
tldr: Git is still faster overall, but those who want to extend a dvcs choose Mercurial for its API that exposes the data structures in a stable manner, albeit in Python, which limits use cases.
As for using Python on .NET, that is what Iron Python is for.
Still, despite its flaws, C is the common layer we have to expose an API that you want to be consumed everywhere. That, or a message passing interface with a client/server architecture. A client/server design may lead to zombie servers, while a tightly coupled C API might crash your application, though you can isolate the C API consumer in a supervised and automatically restarted server you talk to with messages, so that's the more flexible API to have.
The Rust rewrite of parts of Mercurial by Facebook is a no-brainer and given the possibility of GC-less C API in Rust, I wouldn't be surprised if a built-in-Rust C API for Mercurial were to follow. I don't like Rust when compared to high-level languages, but it's a viable C replacement with compile-time exclusion of certain bug classes, so I can get behind such a project. That said, the soundness bugs reported on github are worrisome, so I wouldn't trust Rust's checker to be correct or exhaustive, just yet. It's still a step up from C, that's undeniable.
Going off topic: maybe someone will eventually rewrite SQLite as well, and other projects critical to our modern stacks that still rely on C.
SQLite does not need to be rewritten. It has the best and most comprehensive test suite in the history of software development -- I would go so far as to say that there are no implementation bugs in SQLite (every single branch in the code has been extensively tested and also extensively tested with dummy failures and so on). So a rewrite in a safer language would benefit nobody (and would just be a huge time sink).
Now it probably isn't worth the effort for a very well tested project like sqlite, but that doesn't validate the premise.
It's 100% branch coverage, with 100% fault coverage as well. If there is an "issue lurking that a safer language would prevent" I would honestly be shocked. SQLite is not a good project to mention rewriting, because it is an incredible technical achievement in terms of how well tested it is.
Which, as I said, is not what I'm doing. I'm only disagreeing with the premise that 100% test coverage means 0% chance of an unsafe bug existing in the code base (for any code base, not just for SQLite).
More subjectively, most people I've chatted to about it seem to find Mercurial's interface much easier to grok / pick up as a new user than Git's (which is somewhat notorious for its quirks).
There are other differences, but these stand out to me. That said, I use Git because adoption + community (and my experience with hg-git has been less successful than some... though it's been a while since I gave it a spin)
If you use phases (draft/secret/public) correctly, it's really close to being mutable. For example, all our dev forks are non-publishing repositories, since a fork is a private, non-shared space. By keeping all commits as drafts it's easy to rebase and then push the changed commits.
The main repo, meanwhile, is publishing, and once pushed, commits are never mutated. We also keep a workflow where only dev forks can have multiple heads, while the production repo can't.
IMHO it's the nicer form of mutability
Mercurial, on the other hand, is architecturally set up in a way that considers the repo history to be a somewhat "sacred" truthful account.
You still get the same flexibility as git if/when you need the above mutability: it's not 100% immutable, it supports local rebases, and also global mutablity via "phases" - see Marcin's post on this. In an absolute worst case scenario you can also coordinate reclones of a repo of course, but generally speaking the point is that immutability is "on by default".
And by the way, git rebase doesn't really mutate history, does it? I believe it creates a parallel history (and updates the branch and head refs to point to the new history), but the old history still exists in git's storage and can be recovered (until you GC your storage).
In Mercurial, `hg rebase` will abort if you're trying to make it do something to public commits. Public commits are commits that have been shared on a publishing server. They contrast with draft commits, which have not been shared, or have only been shared on a non-publishing server.
If you really want to edit public commits, you have to manually force them back to the draft phase before you can rebase or rewrite them.
So I'm still not sure the distinction you are making between which one is "immutable" and which one isn't.
Btw, with Mercurial Evolve, there's no need to force-push, as Evolve will propagate meta-history to other users that indicates what commits replace which ones.
Thanks for clarifying.
You can say that because you can always force public commits into drafts that they're not really immutable, but that's a bit of a perversion of what Mercurial's phase system is intended to do.
So the check is in a different place. That doesn't seem to me to imply that Mercurial's history is "immutable" and git's is "mutable", especially since those words have precise meanings, and even WITH the --force, git doesn't change anything in the history, it just writes out a new history and updates the branch ref to point to the new history.
The only thing mutated (in both systems) is the ref to the branch head, right? So aside from warnings and errors being in different places (both before publish time), what is the difference between the two that leads you to argue that one is immutable and one isn't?
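That claim is easy to model with a toy object store (this is an illustration of the principle, not git's actual storage format): "rebasing" writes new commit objects and repoints the branch ref, while the old objects linger until something like a GC sweeps them.

```python
# Toy model of git's behavior: commits are immutable objects; a "rebase"
# writes new commits and moves the branch ref. Old commits linger until GC.
import hashlib

objects = {}   # simulated object store: id -> (message, parent_id)
refs = {}      # branch name -> commit id

def commit(message, parent=None):
    cid = hashlib.sha1(f"{message}:{parent}".encode()).hexdigest()[:7]
    objects[cid] = (message, parent)
    return cid

def rebase(branch, onto):
    """Replay the branch tip onto a new parent: a *new* object is created."""
    message, _old_parent = objects[refs[branch]]
    refs[branch] = commit(message, parent=onto)

base = commit("base")
refs["feature"] = commit("work", parent=base)
old_tip = refs["feature"]

new_base = commit("upstream change", parent=base)
rebase("feature", onto=new_base)

print(old_tip in objects)          # True: the old history still exists
print(refs["feature"] != old_tip)  # True: the ref now points at a new object
```

Nothing in the store was mutated; only the ref moved, which is exactly the distinction being debated here.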
I think if you do a code-review and you get the final state of the repo to check, you can nicely tell how it was evolved/squashed/re-ordered.
I find that workflow much more useful than just rewriting history like in git (I know about reflog).
BTW if you checked in a password... you really should now go change that password :)
I am still disappointed that Git won.
Things could get more competitive in terms of global mindshare if more Facebook engineers (some of whom the OP mentioned have forgotten how to use git) start speaking out in favor of Mercurial.
Better extension system. Some like hg evolve are science fiction compared to typical git workflows. Sounds like hg absorb is similar.
If you live in a saner world though, you'll probably benefit more from git's superior cli & tooling
But it can be fixed, too, which is what Facebook and Google are doing.
Our team just moved from hg to git for an enormous project that has ~25 years of history (CVS -> SVN -> hg|git). The biggest improvement to my daily life is that a git pull takes seconds, while an hg pull takes minutes (or even large fractions of hours when I've spent a week or two away from work).
It's possible that your slow pulls were due to the initial storage format being inefficient for very branchy repositories. I've documented migrating to generaldelta to solve this here: https://book.mercurial-scm.org/read/scaling.html#scaling-rep...
Additionally, using the 'clonebundles' feature, it's possible to speed up your initial clone by a huge amount (making it way faster than non-clonebundles Mercurial or Git): https://book.mercurial-scm.org/read/scaling.html#improving-s...
Of course, this is too late for you, I guess...
We always measure Mercurial pull performance using load tests. For example, between the 4.2.X and 4.4.X versions of our software we went from 1.8s to 1.4s average time for a pull to happen under load.
We only do this via HTTP since this can be really optimized for speed. So having to take minutes sounds like some backend problems like overloaded server, not enough workers to handle connection etc.
Part of the reason for the move is that we wanted to take advantage of the corp-wide infrastructure of an Atlassian stash server hosted in the cloud and professionally maintained, so as to get away from maintaining our own repo.
But the speeds I quote above were for the initial phase of the conversion, when both repos ran on the same ESX VM and direct comparisons were meaningful. Now that its hosted professionally in the cloud, it seems even faster.
Note that I'm across ~2500 miles of VPN from the home office, and that surely has something to do with it.
The initial clone time for hg and git is within the same order of magnitude (order of an hour), though git manages to be about 50% faster at that too.
Why do you need to keep all 25 years of commits?
Judging by the notes in the wiki, however, the purveyors of my preferred server, Kiln, are not so engaged lately:
> Available hosting solutions: Bitbucket, Kallithea (self-hosted), Kiln (still exists?)
I believe it is maintained and even if not maintained would continue to work for ages, but I suspect we will be held to 3.x for a while.
I do know that the person primarily behind kiln harmony left a while ago and now works at Khan Academy.
- Code review UI including comment pane with one level of sub-threading and easy click and drag linking to lines of code, small changeset selection pane, and large scrolling code pane are simple and work well. By contrast, I'm less sure about github and bitbucket pull requests UI.
- Single sign-on and linking with FogBugz (if we reference a FogBugz case in a commit message, then the commits show up in the case notes)
We worked back in the days with Unity which was using kiln to make the code-review workflows similar.
If you miss kiln, you should check RhodeCode out, and our community edition is even open-sourced.
Bitbuckets code review implementation doesn't handle big changes gracefully though, so I primarily inspect the changes in Beyond Compare.
Hilariously, we concluded internally that doing things that way was too complicated/weird for people to use, while GitHub concluded the exact opposite, and the rest is what you see.
hg push --review -r [REV|BOOKMARK|BRANCH]
I know Subversion is pretty much dead, Mercurial was sidelined, and Git seems to have conquered the world. The only thing I think would shake things up a little would be Perforce going open source. But I don't see that happening, as they seem to be very comfortable in their niche.
> Facebook is writing a Mercurial server in Rust. It will be distributed and
> will support pluggable key-value stores for storage (meaning that we could
> move hg.mozilla.org to be backed by Amazon S3 or some such). The primary
> author also has aspirations for supporting the Git wire protocol on the
> server and enabling sub-directories to be git cloned independently of a
> large repo. This means you could use Mercurial to back your monorepo while
> still providing the illusion of multiple "sub-repos" to Mercurial or Git
> clients. The author is also interested in things like GraphQL to query repo
> data. Facebook engineers are crazy... in a good way.
Here's a copy that should be readable on any device:
> Facebook is writing a Mercurial server in Rust. It will be distributed and will support pluggable key-value stores for storage (meaning that we could move hg.mozilla.org to be backed by Amazon S3 or some such). The primary author also has aspirations for supporting the Git wire protocol on the server and enabling sub-directories to be git cloned independently of a large repo. This means you could use Mercurial to back your monorepo while still providing the illusion of multiple "sub-repos" to Mercurial or Git clients. The author is also interested in things like GraphQL to query repo data. Facebook engineers are crazy... in a good way.
I really like that Mercurial is gaining some traction with the big guys, which tries to solve some nice problems at scale.
> - Goal is to open source the server, once it's more than just slideware.
We spend a lot of time on our own to scale our Mercurial backend. Currently with the http based vcs-server and gevent we can support a lot of concurrent hg operations, but imho that thing can put it on the next level...
I wonder if it will support all things like phases etc ootb.
I'd be happy to see a bit more diversity in source control. Git is great but falls over in more than a few scenarios (unmergeable large binary files come to mind).