"We could have spent a lot of time making it more modular in a way that would be friendly to a source control tool, but there are a number of benefits to using a single repository. Even at our current scale, we often make large changes throughout our code base, and having a single repository is useful for continuous modernization. Splitting it up would make large, atomic refactorings more difficult. On top of that, the idea that the scaling constraints of our source control system should dictate our code structure just doesn't sit well with us."
I remember reading about their Git troubles a while ago, and I still don't buy the argument that one large repository is better. Modularization is important for precisely the reason they are trying to get around it: it removes the ability to make sweeping changes easily, and thus increases reliability.
However, my understanding is that their desire to have one large repo is reflective of their "move fast and break things" philosophy, which means not being afraid of making large scale changes. So I would be interested in hearing how they mitigate the obvious downsides, given how many people they have committing to their codebase. It seems like you would just end up having to create constraints in other ways, so which constraints end up being the lesser of two evils?
We've found that "removing the ability to make large scale changes easy and thus increasing reliability" isn't actually correct.
As an example, suppose most of your codebase uses an RPC library, and you discover an issue with that library that requires an API change, one which will reduce network usage fleetwide by an order of magnitude.
With a single repo it's easy to automate the API change everywhere, run all the tests for the entire codebase, and then submit the changes, accomplishing the API change safely in weeks when it might otherwise take months, with higher risk.
Keep in mind that the API change equals real money in networking costs, so time to ship is a very real factor.
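A rough sketch of what that kind of automated change can look like in a monorepo. This is illustrative only: the file layout and the rpc_call_old/rpc_call_new names are made up, and a real migration would use a proper refactoring tool rather than sed.

```shell
# Illustrative only: rewrite every call site of a hypothetical old RPC
# API in one pass, so the change can land as a single atomic commit.
repo=$(mktemp -d)
mkdir -p "$repo/src"
printf 'x = rpc_call_old(endpoint)\n' > "$repo/src/client.py"

# Find all call sites and rewrite them mechanically.
grep -rl 'rpc_call_old(' "$repo/src" | while read -r f; do
  sed -i 's/rpc_call_old(/rpc_call_new(/g' "$f"
done

# In a real monorepo you would now run the full test suite and submit.
grep -r 'rpc_call_new(' "$repo/src"
```

With split repositories the same migration becomes dozens of coordinated commits, each waiting on a separate review and release cycle.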
It sounds like Facebook also has a very real time-to-ship need, even for core libraries.
What you've got here is a different kind of tradeoff.
Normally, libraries are written under the assumption that clients cannot be modified or updated when the library changes. This brings in the concept of a breaking change, and a set of design constraints for versioning. For example, modifying interfaces becomes verboten, final methods start becoming preferable to virtual methods, implementation detail classes require decreased visibility, etc.
The advantage is that library producers are decoupled from consumers. Ideally the library is developed with care, breaking changes between major versions are minimized, and breakage due to implementation changes are minimal owing to lack of scope (literally) for clients to depend on implementation details.
But under the model you describe, you're leveraging the Theory of the Firm as much as possible - specifically, reducing the transaction costs of potentially updating clients of libraries simultaneously with the library itself.
The downside is the risk of unnecessary coupling between clients and libraries - the costs of a breaking change aren't so severe, so the incentive to avoid them is lessened, and so the abstraction boundaries between libraries are weakened. If the quality of engineers isn't kept high, or they don't know enough about how and why to minimize coupling, there's a risk of a kind of sclerosis that increases costs of change anyway.
The risk you describe is a real risk. But we mitigate it at Google with strong code review and high bars for our core libraries teams.
All of our core libs are owned by a team, and you can't make changes to them without permission and a thorough code review. Our Perforce infrastructure allows us to prevent submits that don't meet these criteria, so we get the benefits of ownership, only we use ACLs instead of separate repos. It has so far worked very well for us.
You don't need a single repo to be able to run tests across everything. If you have proper dependency management set up, you make the change and push it, a CI server goes off and builds it, and then all dependent projects get rebuilt and tested...
Does this matter that much if you version everything?
Commit in library -> successful library build, failed client build.
Commit in client -> successful library build, successful client build.
In both situations you don't really care about the intermediary broken build - you still have the previous library/client versions and can use those. Once everything is committed & fixed you have the new versions and you can upgrade.
The only problem I see is if there's a long delay before the second commit. But this can be prevented by a fast CI cycle (always a good idea) and by sending notifications for failures across teams (i.e. the library committer is notified that the client build for their commit failed).
What version control system do you use, or is it a secret? Perforce? Git? I've seen Linus's talk that he gave about Git at the GooglePlex so perhaps you use Git. If so, how have you not run into Facebook's scaling issues?
This question is actually really complicated to answer. The short answer is that we run Perforce.
The long answer is that we run Perforce with a bunch of caching servers, custom infrastructure in front of it, and some special client wrappers. In fact there is more than one client wrapper. One of them uses git to manage a local thin client of the relevant section of the repo and sends the changes to Perforce. This is the one I typically use, since I get the nice tooling of git for my daily coding and can then package all the changes up into a single Perforce change later.
Google has invested a lot of infrastructure into code management and tooling to make one repo work. We've reaped a number of benefits as a result.
As others have mentioned though there are trade offs. We made the tradeoff that best suited our needs.
There is no such thing as high-performance git anything. People at Google who use git (I used to be one of them) suffer far worse performance than the people who just use Perforce directly. In particular, the cost of a git gc, which invariably occurs right when you're on the verge of an epic hack or fighting a gigantic outage, is unreasonable, and Perforce has no analogous periodic process.
We actually have an open source tool that allows you to carve off parts of your Perforce server as Git repos. The repos can overlap in Perforce, allowing you to share code seamlessly between different Git repos. You can generate new repos from existing code easily, and can even generate shallow Git repos that are usable for development.
The story I got from a Googler was that there are many separate projects in a single repository (which is normal for Perforce), and dependencies are handled by making your libraries subprojects (or whatever they're called - like subrepos in Git, essentially symlinks), rather than using an artifact-centred approach.
"Our engineers were comfortable with Git and we preferred to stay with a familiar tool, so we took a long, hard look at improving it to work at scale. After much deliberation, we concluded that Git's internals would be difficult to work with for an ambitious scaling project."
"And then Git itself wasn't working for us anymore because it wasn't scaling when we'd have an operating system release. So we ended up hiring most of the Git team - there's like only one or two core committers now for Git who don't work at Google,"
I love the Mercurial community. We use Mercurial at work and I'm able to get instant support in IRC for any issue we have with an awesome signal/noise ratio. I'm glad Facebook is contributing back so much as well. My suspicion is that open source projects tend towards Git because of GitHub but I think a lot of companies who don't have the option of external code hosting lean towards Mercurial. All anecdotal observations of course ;-)
>My suspicion is that open source projects tend towards Git because of GitHub but I think a lot of companies who don't have the option of external code hosting lean towards Mercurial.
GitHub is a consequence, not the cause (ponder for a moment why there is no MercurialHub...). It's about the ability to choose the best source control tool for multi-versioned, distributed, concurrent development. Open source devs have that choice, while corporate ones often do not. FB choosing Mercurial says a lot about the environment there.
That makes no sense. Facebook specifically said they might have chosen git! They had the same choices as any developer except that a) they don't care about github, and b) they wanted to modify the VCS to support massive scale. Apparently out of git and mercurial, mercurial is easier to extend.
I suspect many of us use Git because it was made by Linus Torvalds (I think everyone agrees that he is a great developer) and is used for the Linux kernel. If it can handle that, then you can be pretty sure it will handle whatever you throw at it (unless you are Facebook, it seems).
Add Github to that mix and you can see why so many developers sleep like babies at night. Of course it's not rigorous, but the choice is not irrational.
Mercurial has seriously improved over the past couple of years. If you tried mercurial a few years ago and were scared away due to speed or functionality issues, you might want to give it another shot.
Does the standard branch workflow still expect you to have a separate repository and directory per branch? I don't care about plugins, here; if the standard workflow doesn't include incredibly lightweight branches, I'll stick with a version control system that does.
Likewise, does the standard workflow still intentionally make it painful to rearrange changes in your local repository to construct a series of patches? Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
If you want git-like branches, Mercurial bookmarks are close, but not quite the same. They have some intentional differences, which make sense and do not impede you, but will not make sense if you absolutely want them to behave like git branches.
> "commit --amend", "rebase -i", and "add -p"?
hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
They're built in, but you have to flip the last two on with a config switch.
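For reference, enabling those bundled extensions is just a few lines in your hgrc (a sketch of a typical ~/.hgrc; rebase, histedit, and record are the standard extension names shipped with Mercurial):

```
[extensions]
rebase =
histedit =
record =
```

An empty value after the name enables a bundled extension; no separate install is needed.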
That post helped, and bookmarks might get the job done, but the mercurial culture doesn't seem to encourage them compared to branches. As a result, you tend to find a lot more separate mercurial repositories than single repos with bookmarks, and a lot more named branches with hard-to-eliminate commits as well (since even "closing" a branch doesn't really get rid of it).
> hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
> They're built-in, but you have to flip them on the last two with a config switch.
And that's the fundamental attitude that I can't stand in mercurial's culture: "careful, you might accidentally throw away that information you specifically wanted to throw away". Those commands aren't available by default because the culture doesn't encourage that workflow, so even if you use them for your own personal workflow you'll tend to find a lot more junk commits and forever-remembered branch names in the average mercurial repository.
History editing in git is not unsafe: because it's a part of the default set of tools, git offers functionality like the reflog to keep track of old history locally and avoid losing it. It's extremely difficult to permanently lose content in a git repository even using the history editing commands; it's always just a "git reflog" away.
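A small demonstration of that safety net in a throwaway repo (assuming git is on the PATH; the user name/email are dummies for the demo):

```shell
# Amend a commit, then show the pre-amend version is still recoverable.
cd "$(mktemp -d)"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "original message"
old=$(git rev-parse HEAD)
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --amend --allow-empty -m "rewritten message"

# The old commit object is intact and findable via the reflog:
git cat-file -e "$old" && echo "old commit still stored"
git reflog | grep "original message"
```

The amended-away commit stays in the object store and in the reflog until the reflog entry expires and gc runs.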
It's not technically unsafe, but it's unsafe from a UI point of view. It obviously confuses some inexperienced people (witness the confusion in this thread), making them feel helpless and like their work has been destroyed, even if it hasn't. Both git and hg actually have lots of fallbacks so that work is never lost, but without experience, these fallbacks may not be obvious.
Hg's rewriting extensions are sometimes disabled by default for this reason: until people understand how they work, they shouldn't use them. If they try to use them, hg tells them how to enable these features.
History editing in Git is actually perfectly safe. The old commits will (at least initially) still be there, though you may need "git fsck --unreachable" to locate them if you don't remember the hashes or have them in your reflog anymore.
What is unsafe in Git is garbage collection. Commits that aren't reachable from either a branch or the reflog will be deleted after a grace period (assuming that you don't disable garbage collection entirely). More importantly, unreachable commits will also not be pushed with "git push" (which means that if your laptop blows up, you may not be able to recover them from a repository server).
This happens not directly as a consequence of history editing, but generally because a branch is being pointed to a new commit or removed (which also implicitly happens during history editing).
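This can be simulated in a throwaway repo; the --no-reflogs flag below stands in for an expired reflog (assuming git on the PATH; dummy user name/email):

```shell
# Delete a branch, then locate its now-unreachable commit with git fsck.
cd "$(mktemp -d)"
git init -q .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m base
git checkout -q -b doomed
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "work on doomed"
lost=$(git rev-parse HEAD)
git checkout -q -          # back to the original branch
git branch -D doomed       # the commit is now unreferenced by any branch

# --no-reflogs simulates the reflog having expired; the object survives:
git fsck --no-reflogs --unreachable | grep "$lost"
```

Until gc's grace period elapses, the commit can be restored with something like `git branch rescued $lost`.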
> Those commands aren't available by default because the culture doesn't encourage that workflow
That's not the reason. The reason is that they want to keep the standard interface to hg minimal. Unfortunately, I cannot find a citation for that right now.
That said, I, too, would appreciate having more standard extensions enabled by default. But direct your disagreement to the idea that hg should have a minimal standard interface instead of making up reasons :)
Note that extensions that ship with Mercurial are exactly as supported as the tool itself, and come with the same kinds of backwards compatibility guarantees. It's not worth worrying about rebase, histedit, and record being extensions - if you want them, turn them on and use them.
They're not exactly as supported, because they're not on by default, and they're not the thing everyone in the Mercurial culture tells you to use as the obvious solution to problems. Version control is as much about what other people do as what you do, and what other people do tends to align most closely with the defaults and what the tools encourage.
No, they're exactly as supported. What I meant by that was that we promise to not break them, ever, to keep the output formats stable, and accept bug reports for them. That doesn't necessarily mean it's something we'll always recommend (eg mq isn't something I'd recommend for a new user, rebase/histedit/amend are way better and always will be.)
We don't turn them on by default for two reasons: newbie users not shooting their foot off unexpectedly, and UI clutter. mq alone (which we actually recommend avoiding now if you don't really need it) adds a bunch of verbs to the command line interface.
We've talked about having an 'hg hello' command or something that'd give you a proposed template hgrc that suggest you might want rebase/record/histedit/pager/color/progress, but nobody's worked out exactly what that should be like. Does something like that sound like it might be helpful?
Some people like it, but my basic complaint with it is that it introduces a new concept, the "patch", that is basically a crippled commit. When a patch is not a commit, you get rejected diff hunks when it fails to apply, and it can easily forget its history.
An MQ patch is basically a commit that doesn't know how to merge and doesn't keep backups. It's way too easy to make a mistake and lose work. I consider MQ one of hg's youthful mistakes before tools like histedit, rebase, and Evolve came to exist.
mq keeps its patches as an internal per-repo stack. This is much weaker than git's tree branches.
Here's an example: let's say I've been working on a repo, and have 2 current mq patches called A and B pushed. There's no way for me to pop A and B, then apply another patch called D in the current repo (at least as far as I could figure out). Also, even if there were, I'd have no way of then pushing A and B on top of D. The patches only work as a linear stack (D must come on top of B).
It's a very powerful tool, but also a complex one, and it brings in a bunch of new commands. There are a bunch of extensions which provide nice UIs to various subsets of mq's full power, and are much simpler.
If you need mq, use mq; but if you just need to, e.g., do some history editing, use histedit.
> They're not exactly as supported, because they're not on by default
I really don't get this. In almost any software there are features that are not on by default, but they're still implemented and supported (except where explicitly stated otherwise). How can this be a reason not to use the functionality when you need it?
It's like using Outlook to respond to mail on a mailing list. Sure, it's theoretically possible to use some combination of manual configuration and extra work to construct a post that won't violate social norms, but the tool doesn't encourage it, so mailing lists with a significant number of Outlook users on them tend to fail to follow the standard mailing list conventions of other lists, and develop their own conventions.
The same holds for any tool used in a workflow that involves anyone other than yourself: any tool with a default workflow tends to encourage people to follow that default workflow.
The fact that they are not on by default actually seems to be hurting Mercurial in the debate against Git, since every "Mercurial vs Git" review assumes they are unavailable, just because they are not enabled in the default out-of-the-box configuration.
Maybe Mercurial should start shipping them enabled?
> Does the standard branch workflow still expect you to have a separate repository and directory per branch?
You're probably confusing Mercurial with Bazaar. Mercurial has always had branches (though they're not quite the same as Git's; Mercurial's bookmarks are more closely related to Git branches) and anonymous heads (contrary to Git, an unnamed head is not stuck in limbo).
> I don't care about plugins
That's stupid, mercurial is very much about plugins: there are dozens of official plugins shipped in a standard mercurial install.
> the standard workflow
There is no such thing.
> Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
All of them are provided in the base install, you just have to enable the corresponding extensions.
> bazaar expects you to have a separate working copy (directory) for each branch, but you can have multiple branches stored in the same repository without any problem
Shared repositories didn't originally exist and IIRC were added to avoid data duplication. Furthermore, they don't fix the multiple working copies problem (you have to use lightweight checkouts or collocated branches for that).
Those are functionally equivalent, in that they both mean branching is not instantaneous.
They are not functionally equivalent. A "bzr branch" with a shared repository will only populate the working tree and does not have to duplicate the repository. It is functionally closer to "git-new-workdir" than "git clone" or "git branch".
To have instant branching in Bazaar, you can use co-located branches.
Branches with their own directories exist for the use case where the accumulated cost for rebuilds after branch switches is more costly than populating a new directory with a checkout or if you need two checkouts in parallel. They also exist in Bazaar for simulating a workflow with a central server and no local repository data.
This is a description which makes no sense whatsoever: given that Git and Mercurial are both DVCSes, there's no "client" or "server".
1. Mercurial branches are commit metadata, the branch name lives in the commit. Git branches are pointers to a commit (which move when a new commit is added on top of the existing branch commit), living outside the commit objects.
2. As a separate concern, Git has multiple branch namespaces ("remote" versus "local" branches, where each remote generates a branch namespace, and remote-tracking local branches). Mercurial only has a single namespace.
Yes, by default all branches are pushed (you would have to use --force if it creates new heads, and --new-branch to push new named branches). But I really don't see how that makes anything "server" or "client" side.
(and you could in any case decide to push just default: hg push -r default)
What parts? That's a very long post. I don't think the overall melancholy tone of defeat applies at all. Hg is a lively and healthy project, and the most frequently offered alternative to git with plenty of commercial backing (Facebook, Google, Atlassian, Microsoft...)
That sentiment accounts for a truly mind-blowing number of "bug" reports that hit the Mercurial mailing list every month. I honestly wonder sometimes whether CVS users did the same thing when Subversion shipped and I simply missed it, or what.
Not exactly, but close. Switching away from Git would require a tool that is both compellingly better in some key way as well as not worse for any existing workflow. I have yet to see any other version control system that meets both of those criteria, even leaving aside the network effects of using the most popular system.
I work in several branches in the same file system directory. Branching in mercurial is as simple as "hg branch branchname", and switching branches is just "hg update branchname". As long as you make sure you commit everything you can easily work in one directory.
Merging changes across branches is easy too, as you only need to specify the revision number.
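A sketch of that single-directory workflow in a throwaway repo (assuming hg is on the PATH; "demo" and "feature-x" are made-up names):

```shell
# Named-branch workflow in one working directory.
cd "$(mktemp -d)"
hg init .
echo base > a.txt
hg commit -Am "initial" -u demo
hg branch -q feature-x            # start a named branch in place
echo feature > b.txt
hg commit -Am "feature work" -u demo
hg update -q default              # switch back, same directory
echo mainline > c.txt
hg commit -Am "mainline work" -u demo
hg merge -q feature-x             # merge by branch name
hg commit -m "merge feature-x" -u demo
ls                                # both branches' files are now present
```

No second checkout or repository is needed; `hg update` just swaps the working directory contents.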
Mercurial won't push new branches unless you specify -f or --new-branch. You can selectively push only the branches you want with hg push -r branch_or_revision (hg push -r . for the current revision). You can also hide branches using phases, so they won't be pushed even with -f or --new-branch.
Mercurial will attempt to push all branches. It will then immediately stop with an error message, unless you specify -f or --new-branch, if you're trying to push a branch that does not already exist in the remote repository (or even to create a new head in an existing branch). In order to get rid of the error message you need to either (1) explicitly tell Mercurial to push all branches, (2) specify which existing branch(es)/revision(s) you want to push (they must already exist remotely), or (3) hide the branch with hg phase -f -s <rev>. This means that your local branches will not leak into the remote repository unless you specifically instruct Mercurial to do so.
Of these, only the phase approach is new (2.1+). The rest hasn't changed.
According to #mercurial, branches (even closed ones) begin to noticeably slow down the repository when you reach around 2000. That's not at all impossible to reach with fine-grained feature branches. Bookmarks are really just pointers to commits, not branches in any sense, so they cannot be used for feature branches either. As a result, we simply don't use feature branches in Mercurial. It's literally the only issue I can think of.
Well, if that's true, I'd prefer to investigate and make them fast rather than use bookmarks-as-branches, because of all the cool stuff Mercurial branches offer (like great log drawing with branch info).
At a previous job I migrated a quite large code base to Git. IIRC it was at least 500k files and a couple of tens of MLOC. We hit the same "scaling" issues Facebook mentions here when trying to place all of this inside one repository, so we ended up going with submodules. Another idea was that this might enable re-use of repositories and/or let us disconnect old legacy code.
We did take a quick look at Mercurial, but since lots of the upstream projects we used were on Git (Linux, U-Boot, Yocto, etc.) it was an obvious choice. I seem to recall there being two hg extensions that were of interest at the time (2010-ish): one to add inotify support and another to store large files outside the repository (hg-largefiles?).
It seems like Facebook's approach to the lstat(2) issue with Watchman is to use inotify on Linux. This has been discussed a couple of times for Git as well, but nothing has come of it so far.
I wrote a comment about the scaling of repositories (and specifically Facebook's issues) a few days ago that was wiped out by the HN crash, but I've managed to recover it from HNSearch:
Facebook's problem is that they were trying to scale with Git improperly. With conventional centralized systems like Perforce, you can scale a single repo nearly as large as any company will need. Emphasis on that nearly. At a certain point, with a large enough codebase (and, critically, enough throughput) you start to realize that you are about to hit a brick wall. With Perforce, this starts to manifest as service brownouts.
With perforce, you can reasonably expect to run into this brick wall somewhere in the neighborhood of terabytes of metadata and dozens of transactions per second. That changes depending on what sort of beastly hardware you are willing to throw at your version control team.
Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
Perforce is probably good enough for Facebook in the present, but if you are a company that large and if you are forward looking, it becomes apparent that with existing version control technology, "one repo per company" is not a long term solution. Even "one repo per department".
You can split it even further, but what you realize is that you are developing infrastructure that allows you to use many repos (for instance, your build servers and internal code search/browsing tools will now need to understand that concept), while losing many of the benefits of Perforce. While you are in the process of adapting your infrastructure to wrap its mind around many repositories, it makes sense to let dev teams really take advantage of this splitting. Develop infrastructure that allows Perforce and git repos to coexist in the company, allowing dev teams to spin up new git repos for every project at will. Done properly, git allows you to create massively scalable systems that you can count on supporting your company's needs for the foreseeable future.
Smooth migration, migration that does not disrupt development, takes months (assuming the right initial conditions), so it is best to recognize the problem and start early, before service becomes disrupted.
If I understand Facebook's situation correctly, they are still in the "try to make Mercurial scale" stage of denial, burning developer time and effort to push back that first brick wall (the same one that git hits, though Mercurial hits it later than git and earlier than Perforce...)
An example of building multi-repo infrastructure for large projects with git is Android's repo: http://en.wikipedia.org/wiki/Repo_(script) Repo is just one example though; other, better, solutions are very possible.
I guess the TL;DR here is that I think it is great that Facebook is making a single Mercurial repository work for their purposes right now, but I think they are kidding themselves if they think that is a long term solution. They are doing a pretty good job of making Mercurial scale like Perforce can scale, but that will only work for them for so long.
(In the above comment I talk about building a scalable system with git ("with" git, rather than making git itself scale), but the same can of course be done with mercurial instead of git. I don't mean this to be a comment suggesting that they should use git instead of mercurial.)
Please correct if I'm wrong but it looks like the decision to scale using HG instead of git was made on 2 points: 1) git maintainers basically said you should split repositories and left it at that, and 2) HG's cleaner code and abstractions made it easier to patch.
Fundamentally I like how this cleans up the design flaw of requiring history everywhere. 99% of the time when I clone or pull or diff or whatever I only care about HEAD. Why should I be forced to pull or store GBs of history or even MBs of metadata I can't use? Why not make leaving this data on the server optional? I can see how the decision to push history everywhere was made for simplicity but it doesn't reflect real world usage and clearly isn't scalable. Let's hope these history-option patches continue to be developed and make their way upstream. They certainly have my vote, not just as options but as defaults.
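For what it's worth, git already has partial support for leaving history on the server via shallow clones (sketch below, assuming git on the PATH; the file:// URL is needed for --depth to take effect on a local clone, and the user name/email are dummies):

```shell
cd "$(mktemp -d)"
git init -q src
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first"
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "second"

# A depth-1 clone fetches only the tip commit, not the full history:
git clone -q --depth 1 "file://$PWD/src" shallow
git -C shallow rev-list --count HEAD    # 1 commit instead of 2
```

What shallow clones don't solve is partial checkout of a huge tree, which is the other half of the scaling problem being discussed here.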
I get this impression every time I see a "look what neat scalability thing we did" post from Facebook Engineering. It's great that they're able to achieve such technical feats, but they refuse to acknowledge that maybe they're using the technology wrong. I'm reminded of the time they hacked the Dalvik VM on Android because apparently they had too many method names for the Dalvik VM to handle. http://jaxenter.com/facebook-s-completely-insane-dalvik-hack...
I mean, they run into a resource allocation problem for an application that is essentially a glorified web view, and they think "how can I hack up the VM to bypass this limitation?" Sounds like insanity to me.
It is insane, but it never comes all at once. I have been at a company that had this type of internal insanity, and it is a bit of a "boil the frog" thing.
Each individual step seems a reasonable and sane thing -- but when you look at the result of all those steps, you stand back and just stare in awe at the horror you have created... it hits you just how far you have drifted.
In a recent meeting, we had a bit of a corporate redirection -- and a co-worker simply asked the question "Is this a sane approach to what our actual problem is?" -- glad to be working with a crew that asks those kinds of questions.
I don't know if this is Facebook's case, but when I've seen stuff like that happening, it was because a bunch of (otherwise smart) developers were far too arrogant about the quality of their own work, too derogatory and, to some degree, too superficial about the quality of other programmers' work, and -- perhaps fatally -- too driven by the can-do-no-matter-what attitude that is so obnoxiously prevalent in today's corporate world, at the expense of common sense. Between the technical debt and the management pressure, I've seen a bunch of people refusing to admit it's their fault and cleverly save their asses (and earn a hefty bonus on one or two occasions) by doing something any sane programmer would have, in fact, fired them for.
I'm not convinced that HipHop is an example of that. As I understand it, HipHop allowed Facebook to increase performance hugely with a very modest investment, and very low risk. Rewriting all the critical bits of Facebook in a different language to realize the same speed-up would likely have required a lot more resources and been orders of magnitude more risky.
I can't agree. Instead of taking everyone's advice to either switch to Perforce (a less user-friendly, more frustrating tool) or split their repo into parts, Facebook built a solution where they don't have to make either compromise.
Just because their solution is complex doesn't mean they're using the technology wrong. They set out their requirements and they met all of them. All their engineers are more efficient now and all this complexity happens behind the scenes.
Amazon ran into this exact problem with Perforce. They threw hardware at the problem for a while but, fortunately, soon realized that they weren't getting anywhere with that. They were trying to work "with" git, i.e. one repo per service, and were in the migration phase when I left.
> An example of building multi-repo infrastructure for large projects with git is Android's repo: http://en.wikipedia.org/wiki/Repo_(script) Repo is just one example though; other, better, solutions are very possible.
I'm curious - do you know of any such better examples, or is this merely a theoretical "I feel like we could do better" statement?
My product is a read-only solution, which simplifies my technical requirements, but it basically de-emphasizes the repository, by making branches the focal point. If you have ever worked with UCM ClearCase by IBM, you'll understand that everything works at the stream/branch level. When you create a new UCM project, you construct it by picking points (baselines) from streams/branches, which can come from different pvobs (repositories). And UCM activities, which are like commits, can contain changes from different branches from different pvobs (repositories).
I personally think, if you want to take Git to the next level, you'll have to implement something like UCM ClearCase activities and projects. Commits should not be bound to a single repository and it should be very easy for people to say I want to create a product that uses starting points from branches x, y and z without having to think about what repo they belong to.
I have a couple of examples that show how my product de-emphasizes the repository. In the following example, you can see my Commits Finder tool, which lets you search for commits by branches.
As you can see, the search results show commits from 10 different branches from 7 different repositories. My other example is my GitHub Pulls Finder tool, which lets you search for pull requests by branches.
Here you can see pull requests from 6 different branches from 6 different repos.
Since my solution is read-only, it simplified things, but I don't think creating a nice layer on top of sub-modules would be that technically challenging. And it's the direction I would personally go if I wanted to make Git more enterprise-friendly.
+1. I was reading the Mercurial mailing list while the Guestrepo design was being hashed out, and I was really impressed: the goals and mechanism were carefully considered. I haven't actually used it, so I can't vouch for the implementation!
I've worked with more pleasant systems, but they are not public and I don't feel comfortable talking about them publicly in much detail.
Basically though, repo is good for a single project ("Android", as a collective entity), but a company looking at working with many repos will want something designed to handle many unrelated projects at once. (For example, all the repos for "Android" and all the repos for "ChromeOS" should be able to live alongside each other in this system without any developer hassle, even though they may not have anything to do with each other.)
Also, repo is really only one part of the solution a company should look at. Properly done, the sort of system that I am talking about is really several systems that are all basically on the same page. Your build servers should understand how to get the code they need to build something, your internal code search/browsing tools should understand where code lives, etc.
Furthermore, allowing different types of repos to exist alongside each other is a good idea. That way each team can individually decide whether they want to use Mercurial, Git, or even SVN. To my knowledge, the repo script is only good for many git repositories.
Would you be willing to chat about it via email? I'd love to know more, as I'm starting to look at implementing something like this and it'd be nice to reuse any work that someone has done in thinking through mistakes before I make them...
How do you import code from one repo into another?
This is the fundamental problem Facebook (and Google, Amazon, etc) are all trying to solve: how to share code company-wide. Sub-repos and making your tools hack around the repos doesn't solve that problem.
You publish packages/libraries/gems/jars/whatever your language calls them, and use a package manager. Needing to combine the actual source trees into a single repo, as submodules or subtrees allow you to, should be reserved for really rare cases.
That reduces the problem, though it doesn't eliminate it. You'll still run into rough spots when you go to rearrange what files are in what packages.
Say one aspect of a package suddenly starts to grow quite fast and take on a life of its own, and you would like to split it into its own dedicated package (and therefore, its own repo). How do you do that without losing the history of those files?
It's possible with git, but it isn't exactly what I would call straightforward. With a one-repo system you simply move those files to their new location, just like any other file move/rename.
But this isn't something that needs to be nice. Files moving between libraries should be rare in well designed libraries, and splitting a subtree out of git is easy enough for when the need does occur.
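For reference, the split the parent comment asks about can be sketched with stock git's `subtree` command; the repo and directory names here are hypothetical:

```shell
# Split libs/netstack out of a monorepo into its own repository,
# history included. "big-repo" and "libs/netstack" are placeholders.
git clone big-repo split-work && cd split-work

# Rewrite history so only commits touching libs/netstack remain;
# the result is a branch rooted at that directory.
git subtree split --prefix=libs/netstack -b netstack-only

# Seed the new standalone repository from that branch.
git init ../netstack && cd ../netstack
git pull ../split-work netstack-only
```

(`git filter-repo --subdirectory-filter` does the same job and handles more edge cases, but it's a separate install.)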
Keeping a giant repository on the other hand makes it very easy to do stupid things and end up with tightly coupled code. You don't want that to be the path of least resistance.
Advocating a solution that's only simple if all of your code is well-designed all the time is a poor strategy. Everyone has times they don't see the big picture clearly enough, or early enough, to segment or group responsibilities well. Or, even if you see the current big picture clearly, it's difficult to predict how future changes in business goals will change the conceptual model you have for your code today.
I think ultimately, past the scale that you can get with Perforce, that is a problem waiting to be solved by the next generation of VCS.
You can get all your many repos in one place, with all of your internal systems handling them nicely, but moving one file from one repo to another, in a way that history remains clear, is something of an unsolved problem.
>With perforce, you can reasonably expect to run into this brick wall somewhere in the neighborhood of terabytes of metadata and dozens of transactions per second. That changes depending on what sort of beastly hardware you are willing to throw at your version control team.
>Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
If you use Git the same way you'd use Perforce or Mercurial (one central server handling everything), you'd better stay with Perforce or Mercurial. Corporate dev processes are usually built around the notion of a "central" repo (which gets equated with "master"); just replacing a Perforce server with a Git server wouldn't change anything.
That is certain. If you are trying to run at a very large scale with a single repo, Perforce (or apparently Mercurial) is the way to go.
The problem is "how do you grow even further?" There isn't a perfect solution to that (yet), but I think the only existing workable (although absolutely imperfect) solution is splitting your codebase into many small repos.
Thankfully very few companies need to worry about this problem right now. A single Mercurial repo is apparently scaling for Facebook currently, and few companies will manage to stress Perforce.
At what point do we say "enough" though? I mean, start from the absolute maximum: how would one make a system to control all of the source code in the world?
For all of the source code in the world that is source-controlled, it exists in separate repos. It doesn't exist in one repo, and where there are inter-project dependencies, it is the dependency consumer's responsibility to keep abreast of the changes in the dependency and integrate them as necessary. It is self-organizing.
So, if Google and Facebook have grown beyond what can be done with a single repo, either they are not being mindful of their engineering practices (which puts the public at significant risk), or they are approaching a scale that is more similar to "world scale" than it is "corporate scale".
Which I would also say is putting the public at significant risk.
This is a very valid point. It is not clear at all that a single repo for the world is desirable in the long run, despite some of the individual advantages that sort of setup has. It is part of the reason that I don't think we will see the "next generation" of VCS anytime soon. (Which in turn makes me think that "holding out" for those systems, hoping they will rescue you from your scaling problem, is a bad decision. It is best to move to multiple repos sooner rather than later).
> If you are trying to run at a very large scale with a single repo, Perforce (or apparently Mercurial) is the way to go.
That is the problem: a single logical repo in Perforce/Mercurial does mean a single physical repo, which obviously causes issues at very large scale. Even at normal enterprise scale :)
Git solves it through a kind of distributed scaling where many operations can be performed on local physical repos without ever hitting the central server. With centralized solutions like Perforce, Mercurial, etc., most operations are performed on the central server, and your ability to scale vertically hits the ceiling pretty soon.
Compare the simplest case: in Perforce, each dev updating his local workspace/repo hits the central server, whereas with Git you can have (and normally would have) a small set of downstream repos through which updates are propagated/distributed. You can branch left and right in Git in your local and team repos without the "master" repo or "central" server being involved (while all your branch activity in Perforce happens right on the central server, in the central repo). The same goes for synchronizing your work inside the team: no need to hit the central server. Etc.
My bad; it's been 5+ years since I worked with Mercurial. Digging deep into painful memories, my Mercurial PTSD from that time comes from the absence of in-repo branching (you needed to clone, which is a killer for the very large repo we had) and the lack of partial commits (again a killer, aggravated by the above-mentioned absence of in-repo branches). Both issues made working with a large repo unreasonably and unnecessarily hard.
These days, the standard practice is probably to use bookmarks, which are more or less like Git branches.
No. You have the choice to use either named branches or bookmarks, and which one you choose is a matter of your workflow. Note that even a Git-like workflow does not necessarily require the use of bookmarks.
> With centralized solutions, like Perforce, Mercurial, etc...
I think you're incorrectly putting mercurial and perforce in the same bucket. Mercurial is a lot more like Git than perforce in that it's a DVCS. Facebook made it scale by using a lot of centralized caching, but most operations (diff, looking through history, branching, etc) can be done locally.
Mercurial itself runs into issues way before git does; it's just that Facebook has essentially gutted a bunch of things out of Mercurial in order to make the resulting thing fast (punting on things like computing status, downloading diffs, etc.).
Also, there's a difference between those open source projects using large repos with git and Facebook wanting to increase developer efficiency. Developers sitting around waiting for a rebase doesn't really pay off, whereas an open source project can get away with it.
It's still one of the coolest and best-paying companies out there! You really think they couldn't hire a few guys from the core git team, saying "Dude, just improve git and we'll pay you"? You must be kidding.
The idea is that you should not have to modify git or mercurial. Instead of modifying git or mercurial to handle a single repo for your entire company, you should create a system that uses them to handle many repos in your company. You don't need C programmers to do this with git.
I worked at Google (in a team using Perforce) and now work at a different company that uses multiple interdependent projects using Maven. Using a single monolithic codebase along with a build tool that statically builds everything at trunk has its advantages:
* You immediately get improvements from upstream projects without having to get them manually.
* You can unambiguously answer the question "What code am I using?" with a single number. With multiple repositories, you have to list all the versions of each project that you are using.
* Easy API refactoring. You don’t have to worry about coordinating version number bumps across different repositories/dependency manifests when you make major changes to inter-project APIs. With a monolithic repository, you fix all callers of an API using a single code commit. No need to edit the version numbers in your pom files.
* Low cost to split a project into multiple separate projects. With multiple repositories or version numbers, you are reluctant to create new projects because the APIs will forever be harder to refactor (since you will have to worry about version numbers).
* No diamond dependency problem of 2 dependencies using a different version of a base project. Everyone is using the same version of base.
With a monolithic repository+build system, upstream projects are responsible for never making a commit that breaks downstream callers. I feel like it’s similar to the question of optimal currency areas. If your organization is growing in lock-step, then you can all happily share a single gold-standard repository with little friction. But if you can’t trust your upstream projects, you introduce versioning between the projects and have to deal with the mental burden of wondering whether to upgrade to the newest upstream project and whether you’re actually running the latest code.
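The "fix all callers in a single commit" workflow can be sketched with plain shell tools; the function names and `make check` below are hypothetical stand-ins for whatever the codebase actually uses:

```shell
# Repo-wide atomic API change: rename a call everywhere in one pass.
# send_sync/send_with_retry are made-up names.
git grep -l 'send_sync(' -- '*.c' '*.h' \
  | xargs sed -i 's/send_sync(/send_with_retry(/g'

# Run the whole codebase's tests before committing; "make check"
# stands in for the monorepo's real test runner.
make check

# One commit updates the API and every caller at once.
git commit -am "rpc: rename send_sync to send_with_retry"
```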
> * You immediately get improvements from upstream projects without having to get them manually.
You also immediately get regressions. Not trying to be dismissive, but we fundamentally have different software philosophies if you think this point (which is the essence of most of your points) is a good thing that should be encouraged.
You’re right, there are disadvantages to a single codebase and builds at trunk. You pick up regressions, and it is difficult for your team to develop on its own while holding everything else constant.
But I don’t think the speed of receiving fixes is the essence of, or even primarily the source of, all my headaches with versions. The problem with mixing and matching versions within an organization is the enormous complexity that it introduces. Perhaps your downstream coworkers are still using an old version of your project, so they don’t want you to refactor their use of your API. Or perhaps you forgot to update your required upstream dependency when using a new function from a library, and your coworker’s program crashed because they’re still using the old dependency. Or perhaps someone forgot to bump the major version number when changing API or behavior, causing a previously built downstream project that is linking against the new upstream project to crash.
Now, these problems are all solvable if you and your coworkers are very disciplined in updating your version numbers and your required dependency versions. But it means that you constantly have to be aware of what APIs you export and what versions of APIs you are calling. You constantly have to edit the project manifests to bump version numbers. You must think about whether your changes will be major or minor. You carefully read the changelog before using the newest upstream projects. It is a mental burden.
Contrast this to a monolithic codebase and build system. There are no version numbers in the dependency manifests to other projects in the company. If you want to change an API, you are responsible for fixing all the downstream users (rather than the other way around). Making a new project adds little mental overhead. If there is no impedance mismatch between the different teams of your company, it can make life much easier.
Immediate regressions are good! If someone at Google breaks my code, I will know within half an hour at the latest and I will tell them to go fix it or just revert their changes myself. Immediate regressions also go perfectly with daily (or hourly!) releases. If there's a performance problem it will be identified early and I will only have thousands of changes to investigate instead of tens of millions.
Imagine if I had a regression and I had to go to the other team saying "We just upgraded from the Foo you released 2 years ago to the one from last year and the performance sucks. Help!" I would not get any help. However I get plenty of support when I go to Foo-team to tell them that my Foo-per-second is 10% worse in the noon release compared to the midnight release.
Having artifacts and stable interfaces and library releases and all that is very ivory tower hocus pocus stuff. In practice instant integration is better.
I don't just mean performance regressions. Someone upstream can change an API in a way that doesn't fit well with your use-case, goes in and "fixes" your code (makes sure all the tests pass) to fit the new API but makes it less maintainable in the process.
> * You can unambiguously answer the question "What code am I using?" with a single number. With multiple repositories, you have to list all the versions of each project that you are using.
Git submodules may have a number of problems of their own, but they solve this one. There's always an unambiguous version number: the commit hash of the top repo. Every subrepository's commit hash is stored in the top repo, and the top repo's version is an unambiguous version number of the entire code base.
I wish that more effort would be put in Git submodules, it's pretty much an afterthought addition but I've heard that there have been recent improvements and future improvements may be coming...
In 1.8.2 you can track a branch name instead of a commit hash. This has the benefit of allowing you to work always against the latest version, while foregoing the advantage of having a single unambiguous version number.
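Both modes can be seen with stock git commands; the URL and paths here are placeholders:

```shell
# Pinned mode: the superproject records an exact commit hash for each
# submodule, so the top-level commit hash identifies the whole tree.
git submodule add https://example.com/mylib.git libs/mylib
git commit -m "Pin mylib at its current commit"

# Branch-tracking mode (git >= 1.8.2): follow the tip of a branch
# instead of a fixed hash, trading reproducibility for freshness.
git config -f .gitmodules submodule.libs/mylib.branch main
git submodule update --remote libs/mylib
```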
Does this approach not isolate Google from the open source community? Most companies nowadays run on open source tools built from open source libraries written in open source languages. Version numbers and stable APIs are crucial.
Ideally, to make a breaking API change, you change the function and all the references within a single commit. Since this repository is only used for statically compiled programs, there is no need to keep the old API anymore.
For protocols and file formats, Google universally uses protocol buffers with many optional fields. The protocol buffer library’s default is that when you read a protocol buffer, modify it, and write it back out, the fields that you didn’t understand are passed through. This means that middleman servers don’t need to be recompiled when you add new optional fields that they don’t use.
But for the actual client and server, you generally don’t have the luxury of replacing them both at the same time. So you have to add the new field, disabled behind a flag; wait for it to be rolled out to both the client and server; then enable the new field and disable the old field using the flag; then remove the flag and the old field. It’s something that you coordinate with the release engineers. But it’s not formalized in the software version numbers.
To me it seems like having to change all references for a breaking API change could be a debilitating amount of work in some cases. Do you then make your breaking change to a branch and lobby for other teams to catch up before merging to the main branch? What about situations where you have a legion of stable legacy applications that may not be worth updating for any reason other than critical bugs?
Yes, it can be difficult to change a library used by everyone, because you need to get a code review and commit the change before merge conflicts start piling up. But you do it all in one commit; you don’t do refactoring in a branch as far as I remember. Occasionally one would hear from someone like Craig Silverstein touching hundreds of files. By the way, check out his talk on refactoring using clang <http://llvm.org/devmtg/2010-11/>.
If an application is still being used, it is always stored in the source tree, where the unit tests are automatically run. You do still have choices to lock its API or file formats: you can consider the API deprecated and tell everyone to use the new V2 API, or you can move the old program into a branch (but still in the source tree that everyone can see). But you want to branch as little as possible; large unmaintained branches quickly become unmaintainable.
>Splitting it up would make large, atomic refactorings more difficult.
If you split the code base up into multiple projects, you need some sort of meta-project to link them all together if you need to deploy them as a single entity. Doing a meta-project isn't exactly easy, so it tends to be better to stick with a single project if everything is still tightly coupled.
The project could be split up into smaller modules linked via shared common interfaces, but that takes more time to maintain.
I know nothing of the code Facebook is hosting, but one fairly obvious general benefit of having things in a single repository is that you can use relative paths to link dependent projects, like if you have a directory structure such as:
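Something along these lines, with directory names taken from the example path below:

```
repo/
  apps/
    myapp/
  libs/
    mylib/
```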
then myapp can reference library mylib via ../../libs/mylib
If all of these things are in separate repositories, all bets are off about how their paths relate to each other on the local filesystem, resulting in more complicated build procedures.
Please tell me if there's a way to make git-grep (or git-log for that matter) work between submodules.
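The closest workarounds I'm aware of (pattern names here are made up):

```shell
# Newer git (>= 2.12) can do it natively:
git grep --recurse-submodules 'some_function'

# Older git: brute-force it per submodule ("|| true" keeps the loop
# going for submodules with no matches).
git grep 'some_function' || true
git submodule foreach --recursive 'git grep "some_function" || true'
```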
While I may not personally like behemoth repositories, I do see why people would choose to endure them. They do have their advantages, even if they all eventually end up looking something like BSD ports. (FYI: I happen to like the extremely modular one-git-repo-per-component approach as used in Mer. Hell, I did an early implementation of the OBS+git -based autobuild system back in 2009!)
Same here. As outlined below, we tried to use bookmarks for feature branches a few times and it either doesn't work or we can't figure it out. We used permanent branches for feature branches for a while, but too many of those (even closed) would slow down the repository eventually.
Our coping strategy is MQ. It's not the same, but it solves the problem pretty well IMO. MQ patches can be sent around and backed up, so not being able to push them to the server is not such a big issue. It takes a little upfront investment to figure out how they work, though; Mozilla has a pretty good guide.
From what I've seen, bookmarks are not "branches" in any sense, really just pointers to commits.
I've tried several times to use bookmarks for feature branches (read: branches developed in parallel to each other and the default branch). I thought I just can't figure it out, but it really seems impossible at this point.
Lots of good stuff there, but a key one is slide 8. In the question session Richard confirmed that Source Depot (the custom version of Perforce) was at that time still used for the source management although TFS was used for bug tracking etc for Office and Windows.
Don't know what has happened in the intervening years...
We eat our own dog food, most teams have moved over to TFS by now. I don't know about the larger orgs (Windows, Office) but for smaller groups TFS is how it is.
MS doesn't have a unified build environment, every team generally does their own thing. It has its pluses and minuses.
Shared code would occasionally be useful at Microsoft, but not as often as you'd think. Generally, when relying on another team's code, it is preferred to take it as a binary drop when they do a product release, just like any other customer. This helps prevent needing to deal with churn in one's dependencies.
Microsoft uses .NET a-plenty. Holding them to some brief marketing hype from over a decade ago is a bit disingenuous. MS ships numerous apps in C#. Check out the Windows Store and WP Store; I know for a fact lots of MS's stuff in there is C#-based. I cannot say for sure which desktop apps are C#-based because, well, it isn't exactly obvious!
C# isn't being used as a systems programming language, but it isn't meant to be one. (That said, I've written high performance code in C# before, you have to know what you are doing and understand your GC, same as writing high performance code in Java!)
Tons of LOB software internal to MS (indeed I'd say the vast, vast, majority of internal LOB software) uses C#. Lots of plugins and extensions to various tools use C#, and I wouldn't be surprised if lots of the PowerShell Cmdlets and Modules are C# based.
Now I have worked on a number of commercial projects that were written in C#, but of course I am unable to discuss them!
(None of this is spoken of as an official MS employee of course, it is not but my own opinions!)
They didn't really scale Mercurial; they basically took it and replaced a lot of its functionality with remote services.
I understand their reasoning that technology shouldn't dictate the way they develop stuff, however I think splitting the project up and using submodules would have been a cleaner approach. If refactoring everything is something you do all the time, you might be doing it wrong.
To be honest, whilst we have no way to accurately determine whether the code is a mess without a chance to see it, the most surprising line of this article (in my opinion) was that the code base was larger than the Linux kernel. I'm not seeing anything on the front end that would warrant such complexity, guessing a large chunk of the code base is server code. Would be interested in reading a summary of the components of the Facebook code base.
I suspect that the kernel is one of the only things running on Facebook's servers that they didn't write from scratch. Alexandrescu has mentioned that a 1% speedup to HHVM saves FB about $100k per year, and at that sort of scale it's pretty easy for reinventing every wheel to make sense.
> Modularity tends to obviate the need for large, atomic refactorings.
But when you're dealing with code at Facebook's scale, things that "tend not to happen" actually happen quite a lot. In fact, you must plan for them as a matter of course.
So yes, modularity is great, and because I'm a nice guy I assume Facebook aren't a pack of idiots and that they're writing nice modular code. But even if that's the case, in an organization of Facebook's size you still need to make widespread, atomic refactorings on a regular basis.
I know this from experience, because I work at Google (much larger codebase than Facebook) on a low-level piece of our software stack. We face these issues regularly and while working in a single repo has its drawbacks, it also has real advantages.
Perhaps I don't understand the whole situation here. I hear "all of our code is in one repository" and I think "GMail and Google Maps are in the same repository, in the same repository with GoLang, in the same repository with AdWords."
The more I think about it, the more I think your post reveals a lack of maturity in our industry that lends credence to the pro-engineering-licensing argument, which I've argued against many times myself. That everyone can be so cavalier about this topic.
Because the fact that your companies are so large is EXACTLY why it makes no flipping sense that you're running a giganto-repository. You have so many products, so many projects going on, that I just really have a hard time believing that it was disciplined software development that led to all of your code being so interdependent.
But the part that started getting under my skin was the fact that we aren't talking about Bob's Local Software Consultancy here. We're talking about two companies that touch the lives of hundreds of millions, perhaps even billions of people in the world.
If OpenStreetMap doesn't have their code in the same repository as Postgres, Linux, and DuckDuckGo, then there is no excuse for the Facebook Android App to be in the same repository as HHVM.
I think you have this picture in your mind of just one big pile of spaghetti code. The truth is way more nuanced. All the code may be in one big repository, but that doesn't mean it is not well-managed. The code is still modular; code is managed in libraries with clean APIs, and so on.
But whether you keep your code in one big repository or many small repositories, you still need to track and manage the dependencies between the various parts.
For instance, when a bug is discovered in library X, you need to know which binaries running in production were compiled against a buggy version of that library. At Google we can say "A bug was introduced at revision NNNNN and fixed at revision MMMMM. Please recompile and redeploy any binaries that were built within that range." (And we have tools to tell us exactly which binaries those are.) This is something that using One Giant Repository gives us for free.
If you were taking the many-small-repos approach, for any given binary you'd need to track the various versions of each of its dependencies. You'd also need to manage release engineering for each and every one of those projects, which slows progress a lot (although we do have a release process for really critical components).
But like I said, there are relative advantages and disadvantages to either approach. To write software at this scale requires tools, processes, and good communication. Where you keep your code, at the end of the day, is actually a pretty minor concern compared to all the other stuff you need to do to ship quality products.
No, you're giving me the right picture, and it's mostly the picture I thought it was.
These issues are the same issues the rest of us in the world have to deal with when working with your APIs. Someone in one of the sibling comments has linked to an article discussing Bezos giving the command from on-high that Amazon would dog-food all of its APIs.
And apparently it isn't so minor of a concern if it warrants the first blog post out of Facebook in the last 3 weeks. Maybe that's just a coincidence that this is the first blog post of the year. It seems like they are trying to say "it's a big enough deal that we have and we're going to spend a lot of money on it."
Maybe the problem is that Facebook and Google are just too big. They might have to be as big as they are to be doing the work that they are doing, but is that really the best thing for the rest of the world?
The fact that Google is one of the largest, most successful software companies in history and you are arguing on the internet using the handle "moron4hire" just about sums up the merits of your position.
Just because a company is big doesn't mean they are working in the best way, or working in a way that is to the best benefit of the public. Might does not make right. We don't let large architectural engineering firms get away with doing whatever the hell they want just because they should have a proprietary interest in doing the best job possible, and we shouldn't be letting banks do it, either.
Yes, it's hard. Boo hoo. So is making safe cars. But you don't get the option to take the easy way out. Solve the hard problem, it's the job.
No, using a ramdisk these days usually makes things worse, not better. The reason is that the operating system already holds as much of the filesystem in caches (in RAM) as possible. So as long as you have enough RAM in your system, files will be cached and the result is better than using a ramdisk.
Partitioning may be the answer but this is a huge problem for corporations like Facebook (and the one I am working for). If things have been done with one giant repo from day one, splitting it is going to be a major engineering/political/social problem when there's thousands of engineers working on the code base and you can't just shut down business for the duration of the migration.