"We could have spent a lot of time making it more modular in a way that would be friendly to a source control tool, but there are a number of benefits to using a single repository. Even at our current scale, we often make large changes throughout our code base, and having a single repository is useful for continuous modernization. Splitting it up would make large, atomic refactorings more difficult. On top of that, the idea that the scaling constraints of our source control system should dictate our code structure just doesn't sit well with us."
I remember reading about their Git troubles a while ago, and I still don't buy the argument that one large repository is better. Modularization is important for precisely the reason they are trying to get around it: it removes the ability to make sweeping changes easily, and thus increases reliability.
However, my understanding is that their desire to have one large repo is reflective of their "move fast and break things" philosophy, which means not being afraid of making large scale changes. So I would be interested in hearing how they mitigate the obvious downsides, given how many people they have committing to their codebase. It seems like you would just end up having to create constraints in other ways, so which constraints end up being the lesser of two evils?
We've found that "removing the ability to make large scale changes easy and thus increasing reliability" isn't actually correct.
As an example, suppose most of your codebase uses an RPC library, and you discover an issue with that library that requires an API change, one which will reduce network usage fleetwide by an order of magnitude.
With a single repo it's easy to automate the API change everywhere, run all the tests for the entire codebase, and then submit the changes, accomplishing the API change safely in weeks when it might otherwise take months, with higher risk.
Keep in mind that the API change equals real money in networking costs, so time to ship is a very real factor.
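A rough sketch of what that kind of automated change can look like in a monorepo. This is illustrative only: the file layout and the rpc_call_old/rpc_call_new names are made up, and a real migration would use a proper refactoring tool rather than sed.

```shell
# Illustrative only: rewrite every call site of a hypothetical old RPC
# API in one pass, so the change can land as a single atomic commit.
repo=$(mktemp -d)
mkdir -p "$repo/src"
printf 'x = rpc_call_old(endpoint)\n' > "$repo/src/client.py"

# Find all call sites and rewrite them mechanically.
grep -rl 'rpc_call_old(' "$repo/src" | while read -r f; do
  sed -i 's/rpc_call_old(/rpc_call_new(/g' "$f"
done

# In a real monorepo you would now run the full test suite and submit.
grep -r 'rpc_call_new(' "$repo/src"
```

With split repositories the same migration becomes dozens of coordinated commits, each waiting on a separate review and release cycle.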
It sounds like Facebook also has a very real time-to-ship need, even for core libraries.
What you've got here is a different kind of tradeoff.
Normally, libraries are written under the assumption that clients cannot be modified or updated when the library changes. This brings in the concept of a breaking change, and a set of design constraints for versioning. For example, modifying interfaces becomes verboten, final methods start becoming preferable to virtual methods, implementation detail classes require decreased visibility, etc.
The advantage is that library producers are decoupled from consumers. Ideally the library is developed with care, breaking changes between major versions are minimized, and breakage due to implementation changes are minimal owing to lack of scope (literally) for clients to depend on implementation details.
But under the model you describe, you're leveraging the Theory of the Firm as much as possible - specifically, reducing the transaction costs of potentially updating clients of libraries simultaneously with the library itself.
The downside is the risk of unnecessary coupling between clients and libraries - the costs of a breaking change aren't so severe, so the incentive to avoid them is lessened, and so the abstraction boundaries between libraries are weakened. If the quality of engineers isn't kept high, or they don't know enough about how and why to minimize coupling, there's a risk of a kind of sclerosis that increases costs of change anyway.
The risk you describe is a real risk. But we mitigate it at Google with strong code review and high bars for our core libraries teams.
All of our core libs are owned by a team, and you can't make changes to them without permission and a thorough code review. Our Perforce infrastructure allows us to prevent submits that don't meet these criteria, so we get the benefits of ownership, only we use ACLs instead of separate repos. It has so far worked very well for us.
You don't need a single repo to be able to run tests across everything. If you have proper dependency management set up, you make the change and push it, a CI server goes off and builds it, and then all dependent projects get rebuilt and tested...
Does this matter that much if you version everything?
Commit in library -> successful library build, failed client build.
Commit in client -> successful library build, successful client build.
In both situations you don't really care about the intermediary broken build - you still have the previous library/client versions and can use those. Once everything is committed & fixed you have the new versions and you can upgrade.
The only problem I see is if there's a long delay before the second commit. But this can be prevented by a fast CI cycle (always a good idea) and by sending notifications for failures across teams (i.e. the library committer is notified that the client build for their commit failed).
What version control system do you use, or is it a secret? Perforce? Git? I've seen Linus's talk that he gave about Git at the GooglePlex so perhaps you use Git. If so, how have you not run into Facebook's scaling issues?
This question is actually really complicated to answer. The short answer is that we run Perforce.
The long answer is that we run Perforce with a bunch of caching servers, custom infrastructure in front of it, and some special client wrappers. In fact there is more than one client wrapper. One of them uses git to manage a local thin client of the relevant section of the repo and sends the changes to Perforce. This is the one I typically use, since I get the nice tooling of git for my daily coding and can then package all the changes up into a single Perforce change later.
Google has invested a lot of infrastructure into code management and tooling to make one repo work. We've reaped a number of benefits as a result.
As others have mentioned though there are trade offs. We made the tradeoff that best suited our needs.
There is no such thing as high-performance git anything. People at Google who use git (I used to be one of them) suffer far worse performance than the people who just use Perforce directly. In particular, the cost of a git gc, which invariably occurs right when you're on the verge of an epic hack or fighting a gigantic outage, is unreasonable, and Perforce has no analogous periodic process.
We actually have an open source tool that allows you to carve off parts of your Perforce server as Git repos. The repos can overlap in Perforce, allowing you to share code seamlessly between different Git repos. You can generate new repos from existing code easily, and can even generate shallow Git repos that are usable for development.
The story I got from a Googler was that there are many separate projects in a single repository (which is normal for Perforce), and dependencies are handled by making your libraries subprojects (or whatever they're called - like subrepos in Git, essentially symlinks), rather than using an artifact-centred approach.
"Our engineers were comfortable with Git and we preferred to stay with a familiar tool, so we took a long, hard look at improving it to work at scale. After much deliberation, we concluded that Git's internals would be difficult to work with for an ambitious scaling project."
"And then Git itself wasn't working for us anymore because it wasn't scaling when we'd have an operating system release. So we ended up hiring most of the Git team - there's like only one or two core committers now for Git who don't work at Google,"
I love the Mercurial community. We use Mercurial at work and I'm able to get instant support in IRC for any issue we have with an awesome signal/noise ratio. I'm glad Facebook is contributing back so much as well. My suspicion is that open source projects tend towards Git because of GitHub but I think a lot of companies who don't have the option of external code hosting lean towards Mercurial. All anecdotal observations of course ;-)
>My suspicion is that open source projects tend towards Git because of GitHub but I think a lot of companies who don't have the option of external code hosting lean towards Mercurial.
GitHub is a consequence, not the cause (ponder for a moment why there is no MercurialHub...). It's about the ability to choose the best source control tool for multi-versioned, distributed, concurrent development. Open source devs have that choice, while corporate ones often do not. FB choosing Mercurial says a lot about the environment there.
That makes no sense. Facebook specifically said they might have chosen git! They had the same choices as any developer except that a) they don't care about github, and b) they wanted to modify the VCS to support massive scale. Apparently out of git and mercurial, mercurial is easier to extend.
I suspect many of us use Git because it was made by Linus Torvalds (I think everyone agrees that he is a great developer) and is used for the Linux kernel. If it can handle that, then you can be pretty sure it will handle whatever you throw at it (unless you are Facebook, it seems).
Add Github to that mix and you can see why so many developers sleep like babies at night. Of course it's not rigorous, but the choice is not irrational.
Mercurial has seriously improved over the past couple of years. If you tried mercurial a few years ago and were scared away due to speed or functionality issues, you might want to give it another shot.
Does the standard branch workflow still expect you to have a separate repository and directory per branch? I don't care about plugins, here; if the standard workflow doesn't include incredibly lightweight branches, I'll stick with a version control system that does.
Likewise, does the standard workflow still intentionally make it painful to rearrange changes in your local repository to construct a series of patches? Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
If you want git-like branches, Mercurial bookmarks are close, but not quite the same. They have some intentional differences, which make sense and do not impede you, but will not make sense if you absolutely want them to behave like git branches.
> "commit --amend", "rebase -i", and "add -p"?
hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
They're built in, but you have to flip the last two on with a config switch.
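For reference, enabling those bundled extensions is just a few lines in your hgrc (a sketch of a typical ~/.hgrc; rebase, histedit, and record are the standard extension names shipped with Mercurial):

```
[extensions]
rebase =
histedit =
record =
```

An empty value after the name enables a bundled extension; no separate install is needed.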
That post helped, and bookmarks might get the job done, but the mercurial culture doesn't seem to encourage them compared to branches. As a result, you tend to find a lot more separate mercurial repositories than single repos with bookmarks, and a lot more named branches with hard-to-eliminate commits as well (since even "closing" a branch doesn't really get rid of it).
> hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
> They're built-in, but you have to flip them on the last two with a config switch.
And that's the fundamental attitude that I can't stand in mercurial's culture: "careful, you might accidentally throw away that information you specifically wanted to throw away". Those commands aren't available by default because the culture doesn't encourage that workflow, so even if you use them for your own personal workflow you'll tend to find a lot more junk commits and forever-remembered branch names in the average mercurial repository.
History editing in git is not unsafe: because it's a part of the default set of tools, git offers functionality like the reflog to keep track of old history locally and avoid losing it. It's extremely difficult to permanently lose content in a git repository even using the history editing commands; it's always just a "git reflog" away.
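A small demonstration of that safety net in a throwaway repo (assuming git is on the PATH; the user name/email are dummies for the demo):

```shell
# Amend a commit, then show the pre-amend version is still recoverable.
cd "$(mktemp -d)"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "original message"
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --amend --allow-empty -m "rewritten message"

# The old commit object is intact and findable via the reflog:
git cat-file -e "$old" && echo "old commit still stored"
git reflog | grep "original message"
```

The amended-away commit stays in the object store and in the reflog until the reflog entry expires and gc runs.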
It's not technically unsafe, but it's unsafe from a UI point of view. It obviously confuses some inexperienced people (witness the confusion in this thread), making them feel helpless and like their work has been destroyed, even if it hasn't. Both git and hg actually have lots of fallbacks so that work is never lost, but without experience, these fallbacks may not be obvious.
Hg's rewriting extensions are sometimes disabled by default for this reason: until people understand how they work, they shouldn't use them. If they try to use them, hg tells them how to enable these features.
History editing in Git is actually perfectly safe. The old commits will (at least initially) still be there, though you may need "git fsck --unreachable" to locate them if you don't remember the hashes or have them in your reflog anymore.
What is unsafe in Git is garbage collection. Commits that aren't reachable from either a branch or the reflog will be deleted after a grace period (assuming that you don't disable garbage collection entirely). More importantly, unreachable commits will also not be pushed with "git push" (which means that if your laptop blows up, you may not be able to recover them from a repository server).
This happens not directly as a consequence of history editing, but generally because a branch is being pointed to a new commit or removed (which also implicitly happens during history editing).
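This can be simulated in a throwaway repo; the --no-reflogs flag below stands in for an expired reflog (assuming git on the PATH; dummy user name/email):

```shell
# Delete a branch, then locate its now-unreachable commit with git fsck.
cd "$(mktemp -d)"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m base
git checkout -q -b doomed
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "work on doomed"
lost=$(git rev-parse HEAD)
git checkout -q -          # back to the original branch
git branch -D doomed       # the commit is now unreferenced by any branch

# --no-reflogs simulates the reflog having expired; the object survives:
git fsck --no-reflogs --unreachable | grep "$lost"
```

Until gc's grace period elapses, the commit can be restored with something like `git branch rescued $lost`.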
> Those commands aren't available by default because the culture doesn't encourage that workflow
That's not the reason. The reason is that they want to keep the standard interface to hg minimal. Unfortunately, I cannot find a citation for that right now.
That said, I, too, would appreciate having more standard extensions enabled by default. But direct your disagreement to the idea that hg should have a minimal standard interface instead of making up reasons :)
Note that extensions that ship with Mercurial are exactly as supported as the tool itself, and come with the same kinds of backwards compatibility guarantees. It's not worth worrying about rebase, histedit, and record being extensions - if you want them, turn them on and use them.
They're not exactly as supported, because they're not on by default, and they're not the thing everyone in the Mercurial culture tells you to use as the obvious solution to problems. Version control is as much about what other people do as what you do, and what other people do tends to align most closely with the defaults and what the tools encourage.
No, they're exactly as supported. What I meant by that was that we promise to not break them, ever, to keep the output formats stable, and accept bug reports for them. That doesn't necessarily mean it's something we'll always recommend (eg mq isn't something I'd recommend for a new user, rebase/histedit/amend are way better and always will be.)
We don't turn them on by default for two reasons: newbie users not shooting their foot off unexpectedly, and UI clutter. mq alone (which we actually recommend avoiding now if you don't really need it) adds a bunch of verbs to the command line interface.
We've talked about having an 'hg hello' command or something that'd give you a proposed template hgrc that suggest you might want rebase/record/histedit/pager/color/progress, but nobody's worked out exactly what that should be like. Does something like that sound like it might be helpful?
Some people like it, but my basic complaint with it is that it introduces a new concept, the "patch", that is basically a crippled commit. When a patch is not a commit, you get rejected diff hunks when it fails to apply, and it can easily forget its history.
An MQ patch is basically a commit that doesn't know how to merge and doesn't keep backups. It's way too easy to make a mistake and lose work. I consider MQ one of hg's youthful mistakes before tools like histedit, rebase, and Evolve came to exist.
mq keeps its patches as an internal per-repo stack. This is much weaker than git's tree branches.
Here's an example: let's say I've been working on a repo, and have 2 current mq patches called A and B pushed. There's no way for me to pop A and B, then apply another patch called D in the current repo (at least as far as I could figure out). Also, even if there were, I'd have no way of then pushing A and B on top of D. The patches only work as a linear stack (D must come on top of B).
It's a very powerful tool, but also a complex one, and it brings in a bunch of new commands. There are a bunch of extensions which provide nice UIs to various subsets of mq's full power, and are much simpler.
If you need mq, use mq; but if you just need to, e.g., do some history editing, use histedit.
> They're not exactly as supported, because they're not on by default
I really don't get this. In almost any software there are features that are not on by default, but they're still implemented and supported (except where explicitly stated otherwise). How can this be a reason not to use the functionality when you need it?
It's like using Outlook to respond to mail on a mailing list. Sure, it's theoretically possible to use some combination of manual configuration and extra work to construct a post that won't violate social norms, but the tool doesn't encourage it, so mailing lists with a significant number of Outlook users on them tend to fail to follow the standard mailing list conventions of other lists, and develop their own conventions.
The same holds for any tool used in a workflow that involves anyone other than yourself: any tool with a default workflow tends to encourage people to follow that default workflow.
The fact that they are not on by default actually seems to be hurting Mercurial in the debate against Git, since every "Mercurial vs Git" review assumes they are unavailable, just because they are not enabled in the default out-of-the-box configuration.
Maybe Mercurial should start shipping them enabled?
> Does the standard branch workflow still expect you to have a separate repository and directory per branch?
You're probably confusing Mercurial with Bazaar. Mercurial has always had branches (though they're not quite the same as Git's; Mercurial's bookmarks are more closely related to Git branches) and anonymous heads (contrary to Git, an unnamed head is not stuck in limbo).
> I don't care about plugins
That's stupid, mercurial is very much about plugins: there are dozens of official plugins shipped in a standard mercurial install.
> the standard workflow
There is no such thing.
> Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
All of them are provided in the base install, you just have to enable the corresponding extensions.
> bazaar expects you to have a separate working copy (directory) for each branch, but you can have multiple branches stored in the same repository without any problem
Shared repositories didn't originally exist and IIRC were added to avoid data duplication. Furthermore, they don't fix the multiple working copies problem (you have to use lightweight checkouts or collocated branches for that).
Those are functionally equivalent, in that they both mean branching is not instantaneous.
They are not functionally equivalent. A "bzr branch" with a shared repository will only populate the working tree and does not have to duplicate the repository. It is functionally closer to "git-new-workdir" than "git clone" or "git branch".
To have instant branching in Bazaar, you can use co-located branches.
Branches with their own directories exist for the use case where the accumulated cost for rebuilds after branch switches is more costly than populating a new directory with a checkout or if you need two checkouts in parallel. They also exist in Bazaar for simulating a workflow with a central server and no local repository data.
This is a description which makes no sense whatsoever: given that Git and Mercurial are both DVCSes, there's no "client" or "server".
1. Mercurial branches are commit metadata, the branch name lives in the commit. Git branches are pointers to a commit (which move when a new commit is added on top of the existing branch commit), living outside the commit objects.
2. As a separate concern, Git has multiple branch namespaces ("remote" versus "local" branches, where each remote generates a branch namespace, and remote-tracking local branches). Mercurial only has a single namespace.
Yes, by default all branches are pushed (you would have to use --force if it creates new heads, and --new-branch to push new named branches). But I really don't see how that makes anything "server" or "client" side.
(and you could in any case decide to push just default: hg push -r default)
What parts? That's a very long post. I don't think the overall melancholy tone of defeat applies at all. Hg is a lively and healthy project, and the most frequently offered alternative to git with plenty of commercial backing (Facebook, Google, Atlassian, Microsoft...)
That sentiment accounts for a truly mind-blowing number of "bug" reports that hit the Mercurial mailing list every month. I honestly wonder sometimes whether CVS users did the same thing when Subversion shipped and I simply missed it, or what.
Not exactly, but close. Switching away from Git would require a tool that is both compellingly better in some key way as well as not worse for any existing workflow. I have yet to see any other version control system that meets both of those criteria, even leaving aside the network effects of using the most popular system.
I work in several branches in the same file system directory. Branching in mercurial is as simple as "hg branch branchname", and switching branches is just "hg update branchname". As long as you make sure you commit everything you can easily work in one directory.
Merging changes across branches is easy too, as you only need to specify the revision number.
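A sketch of that single-directory workflow in a throwaway repo (assuming hg is on the PATH; "demo" and "feature-x" are made-up names):

```shell
# Named-branch workflow in one working directory.
cd "$(mktemp -d)"
hg init .
echo base > a.txt
hg commit -Am "initial" -u demo
hg branch -q feature-x            # start a named branch in place
echo feature > b.txt
hg commit -Am "feature work" -u demo
hg update -q default              # switch back, same directory
echo mainline > c.txt
hg commit -Am "mainline work" -u demo
hg merge -q feature-x             # merge by branch name
hg commit -m "merge feature-x" -u demo
ls                                # both branches' files are now present
```

No second checkout or repository is needed; `hg update` just swaps the working directory contents.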
Mercurial won't push new branches unless you specify -f or --new-branch. You can selectively push only the branches you want with hg push -r branch_or_revision (hg push -r . for the current revision). You can also hide branches using phases, so they won't be pushed even with -f or --new-branch.
Mercurial will attempt to push all branches. It will then immediately stop with an error message, unless you specify -f or --new-branch, if you're trying to push a branch that does not already exist in the remote repository (or even to create a new head in an existing branch). In order to get rid of the error message you need to either (1) explicitly tell Mercurial to push all branches, (2) specify which existing branch(es)/revision(s) you want to push (they must already exist remotely), or (3) hide the branch with hg phase -f -s <rev>. This means that your local branches will not leak into the remote repository unless you specifically instruct Mercurial to do so.
Of these, only the phase approach is new (2.1+). The rest hasn't changed.
According to #mercurial, branches (even closed ones) begin to noticeably slow down the repository when you reach around 2000. That's not at all impossible to reach with fine-grained feature branches. Bookmarks are really just pointers to commits, not branches in any sense, so they cannot be used for feature branches either. As a result, we simply don't use feature branches in Mercurial. It's literally the only issue I can think of.
Well, if that's true, I'd prefer to investigate and make them fast rather than use bookmarks-as-branches, because of all the cool stuff Mercurial branches offer (like great log drawing with branch info).
At a previous job I migrated a quite large code base to Git. IIRC it was at least 500k files and a couple of tens of MLOC. We hit the same "scaling" issues Facebook mentions here when trying to place all of this inside one repository, so we ended up going with submodules. Another idea was that this might enable re-use of repositories and/or let us disconnect old legacy code.
We did take a quick look at Mercurial, but since lots of the upstream projects we used were on Git (Linux, U-Boot, Yocto, etc.) it was an obvious choice. I seem to recall there being two hg extensions that were of interest at the time (2010-ish): one to add inotify support and another to store large files outside the repository (hg-largefiles?).
It seems like Facebook's approach to the lstat(2) issue with Watchman is to use inotify on Linux. This has been discussed a couple of times for Git as well, but nothing has come of it so far.
I wrote a comment about the scaling of repositories (and specifically Facebook's issues) a few days ago that was wiped out by the HN crash, but I've managed to recover it from HNSearch:
Facebook's problem is that they were trying to scale with Git improperly. With conventional centralized systems like Perforce, you can scale a single repo nearly as large as any company will need. Emphasis on that nearly. At a certain point, with a large enough codebase (and, critically, enough throughput) you start to realize that you are about to hit a brick wall. With Perforce, this starts to manifest as service brownouts.
With perforce, you can reasonably expect to run into this brick wall somewhere in the neighborhood of terabytes of metadata and dozens of transactions per second. That changes depending on what sort of beastly hardware you are willing to throw at your version control team.
Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
Perforce is probably good enough for Facebook in the present, but if you are a company that large and if you are forward looking, it becomes apparent that with existing version control technology, "one repo per company" is not a long term solution. Even "one repo per department".
You can split it even further, but what you realize is that you are developing infrastructure that allows you to use many repos (for instance, your build servers and internal code search/browsing tools will now need to understand that concept), while losing many of the benefits of Perforce. While you are in the process of adapting your infrastructure to wrap its mind around many repositories, it makes sense to let dev teams really take advantage of this splitting. Develop infrastructure that allows Perforce and git repos to coexist in the company, allowing dev teams to spin up new git repos for every project at will. Done properly, git allows you to create massively scalable systems that you can count on supporting your company's needs for the foreseeable future.
Smooth migration, migration that does not disrupt development, takes months (assuming the right initial conditions), so it is best to recognize the problem and start early, before service becomes disrupted.
If I understand Facebook's situation correctly, they are still in the "try to make Mercurial scale" stage of denial, burning developer time and effort to push back that first brick wall (the same one that git hits, though Mercurial hits it later than git and earlier than Perforce...)
An example of building multi-repo infrastructure for large projects with git is Android's repo: http://en.wikipedia.org/wiki/Repo_(script) Repo is just one example though; other, better, solutions are very possible.
I guess the TL;DR here is that I think it is great that Facebook is making a single Mercurial repository work for their purposes right now, but I think they are kidding themselves if they think that is a long term solution. They are doing a pretty good job of making Mercurial scale like Perforce can scale, but that will only work for them for so long.
(In the above comment I talk about building a scalable system with git ("with" git, rather than making git itself scale), but the same can of course be done with mercurial instead of git. I don't mean this to be a comment suggesting that they should use git instead of mercurial.)
Please correct if I'm wrong but it looks like the decision to scale using HG instead of git was made on 2 points: 1) git maintainers basically said you should split repositories and left it at that, and 2) HG's cleaner code and abstractions made it easier to patch.
Fundamentally I like how this cleans up the design flaw of requiring history everywhere. 99% of the time when I clone or pull or diff or whatever I only care about HEAD. Why should I be forced to pull or store GBs of history or even MBs of metadata I can't use? Why not make leaving this data on the server optional? I can see how the decision to push history everywhere was made for simplicity but it doesn't reflect real world usage and clearly isn't scalable. Let's hope these history-option patches continue to be developed and make their way upstream. They certainly have my vote, not just as options but as defaults.
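For what it's worth, git already has partial support for leaving history on the server via shallow clones (sketch below, assuming git on the PATH; the file:// URL is needed for --depth to take effect on a local clone, and the user name/email are dummies):

```shell
cd "$(mktemp -d)"
git init -q src
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# A depth-1 clone fetches only the tip commit, not the full history:
git clone -q --depth 1 "file://$PWD/src" shallow
git -C shallow rev-list --count HEAD    # 1 commit instead of 2
```

What shallow clones don't solve is partial checkout of a huge tree, which is the other half of the scaling problem being discussed here.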
I get this impression every time I see a "look what neat scalability thing we did" post from Facebook Engineering. It's great that they're able to achieve such technical feats, but they refuse to acknowledge that maybe they're using the technology wrong. I'm reminded of the time they hacked the Dalvik VM on Android because apparently they had too many method names for the Dalvik VM to handle. http://jaxenter.com/facebook-s-completely-insane-dalvik-hack...
I mean, they run into a resource allocation problem for an application that is essentially a glorified web view, and they think "how can I hack up the VM to bypass this limitation?" Sounds like insanity to me.
It is insane, but it never comes all at once. I have been at a company that had this type of internal insanity, and it is a bit of a "boil the frog" thing.
Each individual step seems a reasonable and sane thing -- but when you look at the result of all those steps, you stand back and just stare in awe at the horror you have created... it hits you just how far you have drifted.
In a recent meeting, we had a bit of a corporate redirection -- and a co-worker simply asked the question "Is this a sane approach to what our actual problem is?" -- glad to be working with a crew that asks those kinds of questions.
I don't know if this is Facebook's case, but when I've seen stuff like that happening, it was because a bunch of (otherwise smart) developers were far too arrogant about the quality of their own work, too derogatory and, to some degree, too superficial about the quality of other programmers' work, and -- perhaps fatally -- too driven by the can-do-no-matter-what attitude that is so obnoxiously prevalent in today's corporate world, at the expense of common sense. Between the technical debt and the management pressure, I've seen a bunch of people refusing to admit it's their fault and cleverly save their asses (and earn a hefty bonus on one or two occasions) by doing something any sane programmer would have, in fact, fired them for.
I'm not convinced that HipHop is an example of that. As I understand it, HipHop allowed Facebook to increase performance hugely with a very modest investment, and very low risk. Rewriting all the critical bits of Facebook in a different language to realize the same speed-up would likely have required a lot more resources and been orders of magnitude more risky.
I can't agree. Instead of taking everyone's advice to either switch to Perforce (a less user-friendly, more frustrating tool) or split their repo into parts, Facebook built a solution where they don't have to make either compromise.
Just because their solution is complex doesn't mean they're using the technology wrong. They set out their requirements and they met all of them. All their engineers are more efficient now and all this complexity happens behind the scenes.
Amazon ran into this exact problem with Perforce. They threw hardware at the problem for a while but, fortunately, soon realized that they weren't getting anywhere with that. They were trying to work "with" git, i.e. one repo per service, and were in the migration phase when I left.
> An example of building multi-repo infrastructure for large projects with git is Android's repo: http://en.wikipedia.org/wiki/Repo_(script) Repo is just one example though; other, better, solutions are very possible.
I'm curious - do you know of any such better examples, or is this merely a theoretical "I feel like we could do better" statement?
My product is a read-only solution, which simplifies my technical requirements, but it basically de-emphasizes the repository, by making branches the focal point. If you have ever worked with UCM ClearCase by IBM, you'll understand that everything works at the stream/branch level. When you create a new UCM project, you construct it by picking points (baselines) from streams/branches, which can come from different pvobs (repositories). And UCM activities, which are like commits, can contain changes from different branches from different pvobs (repositories).
I personally think, if you want to take Git to the next level, you'll have to implement something like UCM ClearCase activities and projects. Commits should not be bound to a single repository and it should be very easy for people to say I want to create a product that uses starting points from branches x, y and z without having to think about what repo they belong to.
I have a couple of examples that show how my product de-emphasizes the repository. In the following example, you can see my Commits Finder tool, which lets you search for commits by branches.
As you can see, the search results show commits from 10 different branches from 7 different repositories. My other example is my GitHub Pulls Finder tool, which lets you search for pull requests by branches.
Here you can see pull requests from 6 different branches from 6 different repos.
Since my solution is read-only, it simplified things, but I don't think creating a nice layer on top of sub-modules would be that technically challenging. And it's the direction I would personally go if I wanted to make Git more enterprise-friendly.
+1. I was reading the Mercurial mailing list while the Guestrepo design was being hashed out, and I was really impressed: the goals and mechanism were carefully considered. I haven't actually used it, so I can't vouch for the implementation!
I've worked with more pleasant systems, but they are not public and I don't feel comfortable talking about them publicly in much detail.
Basically though, repo is good for a single project ("Android", as a collective entity), but a company looking at working with many repos will want something designed to handle many unrelated projects at once. (For example, all the repos for "Android" and all the repos for "ChromeOS" should be able to live alongside each other in this system without any developer hassle, even though they may not have anything to do with each other.)
Also, repo is really only one part of the solution a company should look at. Properly done, the sort of system that I am talking about is really several systems that are all basically on the same page. Your build servers should understand how to get the code they need to build something, your internal code search/browsing tools should understand where code lives, etc.
Furthermore, allowing different types of repos to exist alongside each other is a good idea. That way each team can individually decide whether they want to use Mercurial, Git, or even SVN. To my knowledge, the repo script is only good for many git repositories.
Would you be willing to chat about it via email? I'd love to know more, as I'm starting to look at implementing something like this and it'd be nice to reuse any work that someone has done in thinking through mistakes before I make them...
How do you import code from one repo into another?
This is the fundamental problem Facebook (and Google, Amazon, etc) are all trying to solve: how to share code company-wide. Sub-repos and making your tools hack around the repos doesn't solve that problem.
You publish packages/libraries/gems/jars/whatever your language calls them, and use a package manager. Needing to combine the actual source trees into a single repo, as submodules or subtrees allow you to, should be reserved for really rare cases.
That reduces the problem, though it doesn't eliminate it. You'll still run into rough spots when you go to rearrange what files are in what packages.
Say one aspect of a package suddenly starts to grow quite fast and take on a life of its own, and you would like to split it into its own dedicated package (and therefore, its own repo). How do you do that without losing the history of those files?
It's possible with git, but it isn't exactly what I would call straightforward. With a one-repo system you simply move those files to their new location, just like any other file move/rename.
But this isn't something that needs to be nice. Files moving between libraries should be rare in well designed libraries, and splitting a subtree out of git is easy enough for when the need does occur.
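For reference, the split the parent comment asks about can be sketched with stock git's `subtree` command; the repo and directory names here are hypothetical:

```shell
# Split libs/netstack out of a monorepo into its own repository,
# history included. "big-repo" and "libs/netstack" are placeholders.
git clone big-repo split-work && cd split-work

# Rewrite history so only commits touching libs/netstack remain;
# the result is a branch rooted at that directory.
git subtree split --prefix=libs/netstack -b netstack-only

# Seed the new standalone repository from that branch.
git init ../netstack && cd ../netstack
git pull ../split-work netstack-only
```

(`git filter-repo --subdirectory-filter` does the same job and handles more edge cases, but it's a separate install.)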
Keeping a giant repository on the other hand makes it very easy to do stupid things and end up with tightly coupled code. You don't want that to be the path of least resistance.
Advocating a solution that's only simple if all of your code is well-designed all the time is a poor strategy. Everyone has times they don't see the big picture clearly enough, or early enough, to segment or group responsibilities well. Or, even if you see the current big picture clearly, it's difficult to predict how future changes in business goals will change the conceptual model you have for your code today.
I think ultimately, past the scale that you can get with Perforce, that is a problem waiting to be solved by the next generation of VCS.
You can get all your many repos in one place, with all of your internal systems handling them nicely, but moving one file from one repo to another, in a way that history remains clear, is something of an unsolved problem.
>With perforce, you can reasonably expect to run into this brick wall somewhere in the neighborhood of terabytes of metadata and dozens of transactions per second. That changes depending on what sort of beastly hardware you are willing to throw at your version control team.
>Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
If you use Git the same way you'd use Perforce or Mercurial (one central server handling everything), you'd better stay with Perforce or Mercurial. Corporate dev processes are usually built around the notion of a "central" repo (which gets equated with "master"); just replacing a Perforce server with a Git server wouldn't change anything.
That is certain. If you are trying to run at a very large scale with a single repo, Perforce (or apparently Mercurial) is the way to go.
The problem is "how do you grow even further?" There isn't a perfect solution to that (yet), but I think the only existing workable (although absolutely imperfect) solution is splitting your codebase into many small repos.
Thankfully very few companies need to worry about this problem right now. A single Mercurial repo is apparently scaling for Facebook currently, and few companies will manage to stress Perforce.
At what point do we say "enough" though? I mean, start from the absolute maximum: how would one make a system to control all of the source code in the world?
For all of the source code in the world that is source-controlled, it exists in separate repos. It doesn't exist in one repo, and where there are inter-project dependencies, it is the dependency consumer's responsibility to keep abreast of the changes in the dependency and integrate them as necessary. It is self-organizing.
So, if Google and Facebook have grown beyond what can be done with a single repo, either they are not being mindful of their engineering practices (which puts the public at significant risk), or they are approaching a scale that is more similar to "world scale" than it is "corporate scale".
Which I would also say is putting the public at significant risk.
This is a very valid point. It is not clear at all that a single repo for the world is desirable in the long run, despite some of the individual advantages that sort of setup has. It is part of the reason that I don't think we will see the "next generation" of VCS anytime soon. (Which in turn makes me think that "holding out" for those systems, hoping they will rescue you from your scaling problem, is a bad decision. It is best to move to multiple repos sooner rather than later).
> If you are trying to run at a very large scale with a single repo, Perforce (or apparently Mercurial) is the way to go.
That is the problem: a single logical repo in Perforce/Mercurial does mean a single physical repo, which obviously causes issues at very large scale. Even at normal enterprise scale :)
Git solves it through a kind of distributed scaling where many operations can be performed on local physical repos without ever hitting the central server. With centralized solutions like Perforce, Mercurial, etc., most operations are performed on the central server, and your ability to scale vertically hits the ceiling pretty soon.
Compare the simplest case: in Perforce, each dev updating his local workspace/repo hits the central server, whereas with Git you can have (and normally would have) a small set of downstream repos through which updates are propagated/distributed. You can branch left and right in Git in your local and team repos without the "master" repo or "central" server being involved (while all your branch activity in Perforce happens right on the central server, in the central repo). The same goes for synchronizing your work inside the team: no need to hit the central server. Etc.
My bad; it's been 5+ years since I worked with Mercurial. Digging deep into painful memories, my Mercurial PTSD from that time comes from the absence of in-repo branching (you needed to clone, which is a killer for the very large repo we had) and the lack of partial commits (again a killer, aggravated by the above-mentioned absence of in-repo branches). Both issues made working with a large repo unreasonably and unnecessarily hard.
These days, the standard practice is probably to use bookmarks, which are more or less like Git branches.
No. You have the choice to use either named branches or bookmarks, and which one you choose is a matter of your workflow. Note that even a Git-like workflow does not necessarily require the use of bookmarks.
> With centralized solutions, like Perforce, Mercurial, etc...
I think you're incorrectly putting mercurial and perforce in the same bucket. Mercurial is a lot more like Git than perforce in that it's a DVCS. Facebook made it scale by using a lot of centralized caching, but most operations (diff, looking through history, branching, etc) can be done locally.
Mercurial itself runs into issues way before git does; it's just that Facebook has essentially gutted a bunch of things out of Mercurial in order to make the resulting thing fast (punting on things like computing status, downloading diffs, etc.).
Also, there's a difference between those open source projects using large repos with git and Facebook wanting to increase developer efficiency. Developers sitting around waiting for a rebase doesn't really pay off, whereas an open source project can get away with it.
It's still one of the coolest and best-paying companies out there! You really think they couldn't hire a few guys from the core git team, saying "Dude, just improve git and we'll pay you"? You must be kidding.
The idea is that you should not have to modify git or mercurial. Instead of modifying git or mercurial to handle a single repo for your entire company, you should create a system that uses them to handle many repos in your company. You don't need C programmers to do this with git.
I worked at Google (in a team using Perforce) and now work at a different company that uses multiple interdependent projects using Maven. Using a single monolithic codebase along with a build tool that statically builds everything at trunk has its advantages:
* You immediately get improvements from upstream projects without having to get them manually.
* You can unambiguously answer the question "What code am I using?" with a single number. With multiple repositories, you have to list all the versions of each project that you are using.
* Easy API refactoring. You don’t have to worry about coordinating version number bumps across different repositories/dependency manifests when you make major changes to inter-project APIs. With a monolithic repository, you fix all callers of an API using a single code commit. No need to edit the version numbers in your pom files.
* Low cost to split a project into multiple separate projects. With multiple repositories or version numbers, you are reluctant to create new projects because the APIs will forever be harder to refactor (since you will have to worry about version numbers).
* No diamond dependency problem of 2 dependencies using a different version of a base project. Everyone is using the same version of base.
With a monolithic repository+build system, upstream projects are responsible for never making a commit that breaks downstream callers. I feel like it’s similar to the question of optimal currency areas. If your organization is growing in lock-step, then you can all happily share a single gold-standard repository with little friction. But if you can’t trust your upstream projects, you introduce versioning between the projects and have to deal with the mental burden of wondering whether to upgrade to the newest upstream project and whether you’re actually running the latest code.
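The "fix all callers in a single commit" workflow can be sketched with plain shell tools; the function names and `make check` below are hypothetical stand-ins for whatever the codebase actually uses:

```shell
# Repo-wide atomic API change: rename a call everywhere in one pass.
# send_sync/send_with_retry are made-up names.
git grep -l 'send_sync(' -- '*.c' '*.h' \
  | xargs sed -i 's/send_sync(/send_with_retry(/g'

# Run the whole codebase's tests before committing; "make check"
# stands in for the monorepo's real test runner.
make check

# One commit updates the API and every caller at once.
git commit -am "rpc: rename send_sync to send_with_retry"
```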
> * You immediately get improvements from upstream projects without having to get them manually.
You also immediately get regressions. Not trying to be dismissive, but we fundamentally have different software philosophies if you think this point (which is the essence of most of your points) is a good thing that should be encouraged.
You’re right, there are disadvantages to a single codebase and builds at trunk. You pick up regressions, and it is difficult for your team to develop on its own while holding everything else constant.
But I don’t think the speed of receiving fixes is the essence of, or even primarily the source of, all my headaches with versions. The problem with mixing and matching versions within an organization is the enormous complexity that it introduces. Perhaps your downstream coworkers are still using an old version of your project, so they don’t want you to refactor their use of your API. Or perhaps you forgot to update your required upstream dependency when using a new function from a library, and your coworker’s program crashed because they’re still using the old dependency. Or perhaps someone forgot to bump the major version number when changing API or behavior, causing a previously built downstream project that is linking against the new upstream project to crash.
Now, these problems are all solvable if you and your coworkers are very disciplined in updating your version numbers and your required dependency versions. But it means that you constantly have to be aware of what APIs you export and what versions of APIs you are calling. You constantly have to edit the project manifests to bump version numbers. You must think about whether your changes will be major or minor. You carefully read the changelog before using the newest upstream projects. It is a mental burden.
Contrast this to a monolithic codebase and build system. There are no version numbers in the dependency manifests to other projects in the company. If you want to change an API, you are responsible for fixing all the downstream users (rather than the other way around). Making a new project adds little mental overhead. If there is no impedance mismatch between the different teams of your company, it can make life much easier.
Immediate regressions are good! If someone at Google breaks my code, I will know within half an hour at the latest and I will tell them to go fix it or just revert their changes myself. Immediate regressions also go perfectly with daily (or hourly!) releases. If there's a performance problem it will be identified early and I will only have thousands of changes to investigate instead of tens of millions.
Imagine if I had a regression and I had to go to the other team saying "We just upgraded from the Foo you released 2 years ago to the one from last year and the performance sucks. Help!" I would not get any help. However I get plenty of support when I go to Foo-team to tell them that my Foo-per-second is 10% worse in the noon release compared to the midnight release.
Having artifacts and stable interfaces and library releases and all that is very ivory tower hocus pocus stuff. In practice instant integration is better.
I don't just mean performance regressions. Someone upstream can change an API in a way that doesn't fit well with your use-case, goes in and "fixes" your code (makes sure all the tests pass) to fit the new API but makes it less maintainable in the process.
> * You can unambiguously answer the question "What code am I using?" with a single number. With multiple repositories, you have to list all the versions of each project that you are using.
Git submodules may have a number of problems of their own, but they solve this one. There's always an unambiguous version number: the commit hash of the top repo. Every subrepository's commit hash is stored in the top repo, and the top repo's version is an unambiguous version number of the entire code base.
I wish that more effort would be put in Git submodules, it's pretty much an afterthought addition but I've heard that there have been recent improvements and future improvements may be coming...
In 1.8.2 you can track a branch name instead of a commit hash. This has the benefit of allowing you to work always against the latest version, while foregoing the advantage of having a single unambiguous version number.
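Both modes can be seen with stock git commands; the URL and paths here are placeholders:

```shell
# Pinned mode: the superproject records an exact commit hash for each
# submodule, so the top-level commit hash identifies the whole tree.
git submodule add https://example.com/mylib.git libs/mylib
git commit -m "Pin mylib at its current commit"

# Branch-tracking mode (git >= 1.8.2): follow the tip of a branch
# instead of a fixed hash, trading reproducibility for freshness.
git config -f .gitmodules submodule.libs/mylib.branch main
git submodule update --remote libs/mylib
```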
Does this approach not isolate Google from the open source community? Most companies nowadays run on open source tools built from open source libraries written in open source languages. Version numbers and stable APIs are crucial.
Ideally, to make a breaking API change, you change the function and all the references within a single commit. Since this repository is only used for statically compiled programs, there is no need to keep the old API anymore.
For protocols and file formats, Google universally uses protocol buffers with many optional fields. The protocol buffer library’s default is that when you read a protocol buffer, modify it, and write it back out, the fields that you didn’t understand are passed through. This means that middleman servers don’t need to be recompiled when you add new optional fields that they don’t use.
But for the actual client and server, you generally don’t have the luxury of replacing them both at the same time. So you have to add the new field, disabled behind a flag; wait for it to be rolled out to both the client and server; then enable the new field and disable the old field using the flag; then remove the flag and the old field. It’s something that you coordinate with the release engineers. But it’s not formalized in the software version numbers.
To me it seems like having to change all references for a breaking API change could be a debilitating amount of work in some cases. Do you then make your breaking change to a branch and lobby for other teams to catch up before merging to the main branch? What about situations where you have a legion of stable legacy applications that may not be worth updating for any reason other than critical bugs?
Yes, it can be difficult to change a library used by everyone, because you need to get a code review and commit the change before merge conflicts start piling up. But you do it all in one commit; you don’t do refactoring in a branch as far as I remember. Occasionally one would hear from someone like Craig Silverstein touching hundreds of files. By the way, check out his talk on refactoring using clang <http://llvm.org/devmtg/2010-11/>.
If an application is still being used, it is always stored in the source tree, where the unit tests are automatically run. You do still have choices to lock its API or file formats: you can consider the API deprecated and tell everyone to use the new V2 API, or you can move the old program into a branch (but still in the source tree that everyone can see). But you want to branch as little as possible; large unmaintained branches quickly become unmaintainable.
>Splitting it up would make large, atomic refactorings more difficult.
If you split the code base up into multiple projects, you need some sort of meta-project to link them all together if you need to deploy them as a single entity. Doing a meta-project isn't exactly easy, so it tends to be better to stick with a single project if everything is still tightly coupled.
The project could be split up into smaller modules linked via shared common interfaces, but that takes more time to maintain.
I know nothing of the code Facebook is hosting, but one fairly obvious general benefit of having things in a single repository is that you can use relative paths to link dependent projects, like if you have a directory structure such as:
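Something along these lines, with directory names taken from the example path below:

```
repo/
  apps/
    myapp/
  libs/
    mylib/
```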
then myapp can reference library mylib via ../../libs/mylib
If all of these things are in separate repositories, all bets are off about how their paths relate to each other on the local filesystem, resulting in more complicated build procedures.
Please tell me if there's a way to make git-grep (or git-log for that matter) work between submodules.
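The closest workarounds I'm aware of (pattern names here are made up):

```shell
# Newer git (>= 2.12) can do it natively:
git grep --recurse-submodules 'some_function'

# Older git: brute-force it per submodule ("|| true" keeps the loop
# going for submodules with no matches).
git grep 'some_function' || true
git submodule foreach --recursive 'git grep "some_function" || true'
```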
While I may not personally like behemoth repositories, I do see why people would choose to endure them. They do have their advantages, even if they all eventually end up looking something like BSD ports. (FYI: I happen to like the extremely modular one-git-repo-per-component approach as used in Mer. Hell, I did an early implementation of the OBS+git -based autobuild system back in 2009!)
Same here. As outlined below, we tried to use bookmarks for feature branches a few times and it either doesn't work or we can't figure it out. We used permanent branches for feature branches for a while, but too many of those (even closed) would slow down the repository eventually.
Our coping strategy is MQ. It's not the same, but it solves the problem pretty well IMO. MQ patches can be sent around and backed up, so not being able to push them to the server is not such a big issue. It takes a little upfront investment to figure out how they work, though; Mozilla has a pretty good guide.
From what I've seen, bookmarks are not "branches" in any sense, really just pointers to commits.
I've tried several times to use bookmarks for feature branches (read: branches developed in parallel to each other and the default branch). I thought I just can't figure it out, but it really seems impossible at this point.
Lots of good stuff there, but a key one is slide 8. In the question session Richard confirmed that Source Depot (the custom version of Perforce) was at that time still used for the source management although TFS was used for bug tracking etc for Office and Windows.
Don't know what has happened in the intervening years...
We eat our own dog food, most teams have moved over to TFS by now. I don't know about the larger orgs (Windows, Office) but for smaller groups TFS is how it is.
MS doesn't have a unified build environment, every team generally does their own thing. It has its pluses and minuses.
Shared code would occasionally be useful at Microsoft, but not as often as you'd think. Generally, when relying on another team's code, it is preferred to take it as a binary drop when they do a product release, just like any other customer. This helps prevent needing to deal with churn in one's dependencies.
Microsoft uses .NET a-plenty. Holding them to some brief marketing hype from over a decade ago is a bit disingenuous. MS ships numerous apps in C#. Check out the Windows Store and WP Store; I know for a fact lots of MS's stuff in there is C#-based. I cannot say for sure which desktop apps are C#-based because, well, it isn't exactly obvious!
C# isn't being used as a systems programming language, but it isn't meant to be one. (That said, I've written high performance code in C# before, you have to know what you are doing and understand your GC, same as writing high performance code in Java!)
Tons of LOB software internal to MS (indeed I'd say the vast, vast, majority of internal LOB software) uses C#. Lots of plugins and extensions to various tools use C#, and I wouldn't be surprised if lots of the PowerShell Cmdlets and Modules are C# based.
Now I have worked on a number of commercial projects that were written in C#, but of course I am unable to discuss them!
(None of this is spoken of as an official MS employee of course, it is not but my own opinions!)
They didn't really scale Mercurial; they basically took it and replaced a lot of its functionality with remote services.
I understand their reasoning that technology shouldn't dictate the way they develop stuff, however I think splitting the project up and using submodules would have been a cleaner approach. If refactoring everything is something you do all the time, you might be doing it wrong.
To be honest, whilst we have no way to accurately determine whether the code is a mess without a chance to see it, the most surprising line of this article (in my opinion) was that the code base was larger than the Linux kernel. I'm not seeing anything on the front end that would warrant such complexity, guessing a large chunk of the code base is server code. Would be interested in reading a summary of the components of the Facebook code base.
I suspect that the kernel is one of the only things running on Facebook's servers that they didn't write from scratch. Alexandrescu has mentioned that a 1% speedup to HHVM saves FB about $100k per year, and at that sort of scale it's pretty easy for reinventing every wheel to make sense.
> Modularity tends to obviate the need for large, atomic refactorings.
But when you're dealing with code at Facebook's scale, things that "tend not to happen" actually happen quite a lot. In fact, you must plan for them as a matter of course.
So yes, modularity is great, and because I'm a nice guy I assume Facebook aren't a pack of idiots and that they're writing nice modular code. But even if that's the case, in an organization of Facebook's size you still need to make widespread, atomic refactorings on a regular basis.
I know this from experience, because I work at Google (much larger codebase than Facebook) on a low-level piece of our software stack. We face these issues regularly and while working in a single repo has its drawbacks, it also has real advantages.
Perhaps I don't understand the whole situation here. I hear "all of our code is in one repository" and I think "GMail and Google Maps are in the same repository, in the same repository with GoLang, in the same repository with AdWords."
The more I think about it, the more I think your post reveals a lack of maturity in our industry that lends credence to the pro-engineering-licensing argument, which I've argued against many times myself. That everyone can be so cavalier about this topic.
Because the fact that your companies are so large is EXACTLY why it makes no flipping sense that you're running a giganto-repository. You have so many products, so many projects going on, that I just really have a hard time believing that it was disciplined software development that led to all of your code being so interdependent.
But the part that started getting under my skin was the fact that we aren't talking about Bob's Local Software Consultancy here. We're talking about two companies that touch the lives of hundreds of millions, perhaps even billions of people in the world.
If OpenStreetMap doesn't have their code in the same repository as Postgres, Linux, and DuckDuckGo, then there is no excuse for the Facebook Android App to be in the same repository as HHVM.
I think you have this picture in your mind of just one big pile of spaghetti code. The truth is way more nuanced. All the code may be in one big repository, but that doesn't mean it is not well-managed. The code is still modular; code is managed in libraries with clean APIs, and so on.
But whether you keep your code in one big repository or many small repositories, you still need to track and manage the dependencies between the various parts.
For instance, when a bug is discovered in library X, you need to know which binaries running in production were compiled against a buggy version of that library. At Google we can say "A bug was introduced at revision NNNNN and fixed at revision MMMMM. Please recompile and redeploy any binaries that were built within that range." (And we have tools to tell us exactly which binaries those are.) This is something that using One Giant Repository gives us for free.
If you were taking the many-small-repos approach, for any given binary you'd need to track the various versions of each of its dependencies. You'd also need to manage release engineering for each and every one of those projects, which slows progress a lot (although we do have a release process for really critical components).
But like I said, there are relative advantages and disadvantages to either approach. To write software at this scale requires tools, processes, and good communication. Where you keep your code, at the end of the day, is actually a pretty minor concern compared to all the other stuff you need to do to ship quality products.
No, you're giving me the right picture, and it's mostly the picture I thought it was.
These issues are the same issues the rest of us in the world have to deal with when working with your APIs. Someone in one of the sibling comments has linked to an article discussing Bezos giving the command from on-high that Amazon would dog-food all of its APIs.
And apparently it isn't so minor of a concern if it warrants the first blog post out of Facebook in the last 3 weeks. Maybe that's just a coincidence that this is the first blog post of the year. It seems like they are trying to say "it's a big enough deal that we have and we're going to spend a lot of money on it."
Maybe the problem is that Facebook and Google are just too big. They might have to be as big as they are to be doing the work that they are doing, but is that really the best thing for the rest of the world?
The fact that Google is one of the largest, most successful software companies in history and you are arguing on the internet using the handle "moron4hire" just about sums up the merits of your position.
Just because a company is big doesn't mean they are working in the best way, or working in a way that is to the best benefit of the public. Might does not make right. We don't let large architectural engineering firms get away with doing whatever the hell they want just because they should have a proprietary interest in doing the best job possible, and we shouldn't be letting banks do it, either.
Yes, it's hard. Boo hoo. So is making safe cars. But you don't get the option to take the easy way out. Solve the hard problem, it's the job.
No, using a ramdisk these days usually makes things worse, not better. The reason is that the operating system already holds as much of the filesystem in caches (in RAM) as possible. So as long as you have enough RAM in your system, files will be cached and the result is better than using a ramdisk.
Partitioning may be the answer but this is a huge problem for corporations like Facebook (and the one I am working for). If things have been done with one giant repo from day one, splitting it is going to be a major engineering/political/social problem when there's thousands of engineers working on the code base and you can't just shut down business for the duration of the migration.