I remember reading about their Git troubles a while ago, and I still don't buy this argument that it is better to have one large repository. One reason modularization is important is for the precise reason they are trying to get around it: removing the ability to make large scale changes easy and thus increasing reliability.
However, my understanding is that their desire to have one large repo is reflective of their "move fast and break things" philosophy, which means not being afraid of making large scale changes. So I would be interested in hearing how they mitigate the obvious downsides given how many people they have committing to their codebase. It seems like you would just end up having to create constraints in other ways, so which constraints end up being the lesser of two evils?
We've found that "Removing the ability to make large scale changes easy and thus increasing reliability." isn't actually correct.
As an example, most of your codebase uses an RPC library. You discover an issue with that library that requires an API change which will reduce network usage fleetwide by an order of magnitude.
With a single repo it's easy to automate the API change everywhere, run all the tests for the entire codebase, and then submit the changes, thus accomplishing an API change safely in weeks that might take months otherwise, with higher risk.
Keep in mind that the API change equals real money in networking cost so time to ship is a very real factor.
It sounds like Facebook also has a very real time-to-ship need, even for core libraries.
Normally, libraries are written under the assumption that clients cannot be modified or updated when the library changes. This brings in the concept of a breaking change, and a set of design constraints for versioning. For example, modifying interfaces becomes verboten, final methods start becoming preferable to virtual methods, implementation detail classes require decreased visibility, etc.
The advantage is that library producers are decoupled from consumers. Ideally the library is developed with care, breaking changes between major versions are minimized, and breakage due to implementation changes is minimal owing to lack of scope (literally) for clients to depend on implementation details.
But under the model you describe, you're leveraging the Theory of the Firm as much as possible - specifically, reducing the transaction costs of potentially updating clients of libraries simultaneously with the library itself.
The downside is the risk of unnecessary coupling between clients and libraries - the costs of a breaking change aren't so severe, so the incentive to avoid them is lessened, and so the abstraction boundaries between libraries are weakened. If the quality of engineers isn't kept high, or they don't know enough about how and why to minimize coupling, there's a risk of a kind of sclerosis that increases costs of change anyway.
All of our core libs are owned by a team and you can't make changes to them without permission and a thorough code review. Our Perforce infrastructure allows us to prevent submits that don't meet these criteria, so we get the benefits of ownership, only we use ACLs instead of separate repos. It has so far worked very well for us.
Commit in library -> successful library build, failed client build.
Commit in client -> successful library build, successful client build.
In both situations you don't really care about the intermediary broken build - you still have the previous library/client versions and can use those. Once everything is committed & fixed you have the new versions and you can upgrade.
The only problem I see is if there's a long delay before the second commit. But this can be prevented by a fast CI cycle (always a good idea) and sending notifications for failures across teams (i.e. the library committer is notified that the client build for his commit failed).
The long answer is that we run perforce with a bunch of caching servers and custom stuff in front of it and some special client wrappers. In fact there is more than one client wrapper. One of them uses git to manage a local thin client of the relevant section of the repo and sends the changes to perforce. This is the one I typically use since I get the nice tooling of git for my daily coding and then I can package all the changes up into a single perforce change later.
Google has invested a lot of infrastructure into code management and tooling to make one repo work. We've reaped a number of benefits as a result.
As others have mentioned, though, there are tradeoffs. We made the tradeoff that best suited our needs.
It would be great if someone used that to write a high performance git server.
We actually have an open source tool that allows you to carve off parts of your Perforce server as Git repos. The repos can overlap in Perforce, allowing you to share code seamlessly between different Git repos. You can generate new repos from existing code easily and can even generate shallow Git repos that are usable for development.
Details are at: http://www.perforce.com/product/components/git-fusion
I'm happy to answer questions here or on Twitter: @p4mataway
The story I got from a Googler was that there are many separate projects in a single repository (which is normal for Perforce), and dependencies are handled by making your libraries subprojects (or whatever they're called - like subrepos in Git, essentially symlinks), rather than using an artifact-centred approach.
So hearing about this (FB all in with hg) ensures that hg won't be falling behind... at least in the near term.
"And then Git itself wasn't working for us anymore because it wasn't scaling when we'd have an operating system release. So we ended up hiring most of the Git team - there's like only one or two core committers now for Git who don't work at Google,"
GitHub is a consequence, not the cause (ponder for a moment why there is no MercurialHub...). It is about the ability to choose the best source control tool for multi-versioned, distributed, concurrent development. Open source devs have that choice while corporate ones do not. FB choosing Mercurial tells a lot about the environment there.
It supports Git now as well, but it was only for Mercurial use when it started.
Not sure if it's still django though.
You might be thinking of http://beanstalkapp.com/ which supports svn.
> Over the past couple of months, I've been working on creating a simple but powerful hosting service for Mercurial.
Add Github to that mix and you can see why so many developers sleep like babies at night. Of course it's not rigorous, but the choice is not irrational.
Likewise, does the standard workflow still intentionally make it painful to rearrange changes in your local repository to construct a series of patches? Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
There is no real standard workflow. There are tools in place to build whatever you want. This is old, but remarkably still relevant:
If you want git-like branches, Mercurial bookmarks are close, but not quite the same. They have some intentional differences, which make sense and do not impede you, but will not make sense if you absolutely want them to behave like git branches.
> "commit --amend", "rebase -i", and "add -p"?
hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
They're built-in, but you have to flip the last two on with a config switch.
> hg commit --amend, hg histedit, hg record (but hg crecord is much nicer).
> They're built-in, but you have to flip the last two on with a config switch.
And that's the fundamental attitude that I can't stand in mercurial's culture: "careful, you might accidentally throw away that information you specifically wanted to throw away". Those commands aren't available by default because the culture doesn't encourage that workflow, so even if you use them for your own personal workflow you'll tend to find a lot more junk commits and forever-remembered branch names in the average mercurial repository.
Hg's rewriting extensions are sometimes disabled by default for this reason: until people understand how they work, they shouldn't use them. If they try to use them, hg tells them how to enable these features.
What is unsafe in Git is garbage collection. Commits that aren't reachable from either a branch or the reflog will be deleted after a grace period (assuming that you don't disable garbage collection entirely). More importantly, unreachable commits will also not be pushed with "git push" (which means that if your laptop blows up, you may not be able to recover them from a repository server).
This happens not directly as a consequence of history editing, but generally because a branch is being pointed to a new commit or removed (which also implicitly happens during history editing).
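To make the reachability point concrete, here is a toy model in Python. It is not how Git is implemented; the commit names, the `gc_candidates` and `pushed_by_default` helpers, and the tiny history are all hypothetical, and the grace period, tags and stashes are ignored. It only shows that a commit survives while a branch or the reflog can reach it, and that only branch-reachable commits get pushed.

```python
from collections import deque

def reachable(parents, roots):
    """Return every commit reachable from the roots by following parent links."""
    seen = set()
    queue = deque(roots)
    while queue:
        commit = queue.popleft()
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return seen

# Hypothetical history: c3 was rewritten into c3b (say, by an amend), so the
# branch now points at c3b and c3 is only remembered by the reflog.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c3b": ["c2"],
}
branches = {"master": "c3b"}
reflog = ["c3"]

def gc_candidates(parents, branches, reflog):
    """Commits that neither a branch nor a reflog entry can reach."""
    keep = reachable(parents, list(branches.values()) + list(reflog))
    return set(parents) - keep

def pushed_by_default(parents, branches):
    """Only commits reachable from a branch are sent; reflog entries stay local."""
    return reachable(parents, list(branches.values()))

print(sorted(gc_candidates(parents, branches, reflog)))  # reflog still protects c3
print(sorted(gc_candidates(parents, branches, [])))      # reflog entry expired
print(sorted(pushed_by_default(parents, branches)))
```

In this model the rewritten commit `c3` is safe locally for as long as the reflog entry exists, but it is never part of what a push would send.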
That's not the reason. The reason is that they want to keep the standard interface to hg minimal. Unfortunately, I cannot find a citation for that right now.
That said, I, too, would appreciate having more standard extensions enabled by default. But direct your disagreement to the idea that hg should have a minimal standard interface instead of making up reasons :)
We don't turn them on by default for two reasons: newbie users not shooting their foot off unexpectedly, and UI clutter. mq alone (which we actually recommend avoiding now if you don't really need it) adds a bunch of verbs to the command line interface.
We've talked about having an 'hg hello' command or something that'd give you a proposed template hgrc that suggest you might want rebase/record/histedit/pager/color/progress, but nobody's worked out exactly what that should be like. Does something like that sound like it might be helpful?
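As a rough idea of what such a proposed template could contain, here is a small Python sketch that prints an `[extensions]` section for the extensions named above. The `proposed_hgrc` function is hypothetical and is not an existing hg command; only the extension names come from the comment.

```python
import configparser
import io

# Extension names are the ones listed in the comment above. Everything else
# (function name, emitting the template as text) is an assumption.
SUGGESTED_EXTENSIONS = ["rebase", "record", "histedit", "pager", "color", "progress"]

def proposed_hgrc(extensions=SUGGESTED_EXTENSIONS):
    """Build the text of an hgrc fragment that enables the given extensions."""
    config = configparser.ConfigParser()
    # An empty value after "=" is how a bundled extension is switched on.
    config["extensions"] = {name: "" for name in extensions}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()

print(proposed_hgrc())
```

A user would review the printed fragment and paste the lines they want into their own hgrc, rather than having anything enabled silently.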
An MQ patch is basically a commit that doesn't know how to merge and doesn't keep backups. It's way too easy to make a mistake and lose work. I consider MQ one of hg's youthful mistakes before tools like histedit, rebase, and Evolve came to exist.
Here's an example: let's say I've been working on a repo, and have 2 current mq patches called A and B pushed. There's no way for me to pop A and B, then apply another patch called D in the current repo (at least as far as I could figure out). Also, even if there were, I'd have no way of then pushing A and B on top of D. The patches only work as a linear stack (D must come on top of B).
If you need mq, use mq, but if you just need to e.g. do some history editing, use histedit.
I really don't get this. In almost any software there are features that are not 'on by default', but they're still implemented and supported (unless explicitly stated otherwise). How can this be a reason to not use the functionality when you need it?
The same holds for any tool used in a workflow that involves anyone other than yourself: any tool with a default workflow tends to encourage people to follow that default workflow.
Maybe Mercurial should start shipping them enabled?
Commit --amend is now part of base mercurial. Rebase is available as an extension that is shipped with the client, like most advanced functionality. I don't know what add -p does.
You're probably confusing mercurial with bazaar, mercurial has always had branches (though they're not quite the same as git, mercurial's bookmarks are more closely related to git branches) and anonymous heads (contrary to git, an unnamed head is not stuck in limbo).
> I don't care about plugins
That's stupid, mercurial is very much about plugins: there are dozens of official plugins shipped in a standard mercurial install.
> the standard workflow
there is no such thing.
> Does Mercurial provide built-in commands equivalent to "commit --amend", "rebase -i", and "add -p"?
All of them are provided in the base install, you just have to enable the corresponding extensions.
bazaar expects you to have a separate working copy (directory) for each branch, but you can have multiple branches stored in the same repository without any problem
(I just wish that git people actually knew how other tools work... but, alas! Now it's too late for underdogs like bazaar or darcs to catch up)
Shared repositories didn't originally exist and IIRC were added to avoid data duplication. Furthermore, they don't fix the multiple working copies problem (you have to use lightweight checkouts or collocated branches for that).
Those are functionally equivalent, in that they both mean branching is not instantaneous.
They are not functionally equivalent. A "bzr branch" with a shared repository will only populate the working tree and does not have to duplicate the repository. It is functionally closer to "git-new-workdir" than "git clone" or "git branch".
To have instant branching in Bazaar, you can use co-located branches.
Branches with their own directories exist for the use case where the accumulated cost for rebuilds after branch switches is more costly than populating a new directory with a checkout or if you need two checkouts in parallel. They also exist in Bazaar for simulating a workflow with a central server and no local repository data.
1. Mercurial branches are commit metadata, the branch name lives in the commit. Git branches are pointers to a commit (which move when a new commit is added on top of the existing branch commit), living outside the commit objects.
2. As a separate concern, Git has multiple branch namespaces ("remote" versus "local" branches, where each remote generates a branch namespace, and remote-tracking local branches). Mercurial only has a single namespace.
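The two points above can be sketched with a toy data model in Python. This is only an illustration of where the branch name lives, not either tool's real storage format; the class names and commit ids are hypothetical. The `refs/heads/` and `refs/remotes/origin/` prefixes are Git's actual ref namespaces.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HgCommit:
    node: str
    parent: Optional[str]
    branch: str  # the named branch is recorded inside the commit itself

@dataclass
class GitCommit:
    node: str
    parent: Optional[str]  # no branch information is stored in the commit

@dataclass
class GitRepo:
    commits: dict = field(default_factory=dict)
    refs: dict = field(default_factory=dict)  # full ref name -> commit id

    def commit(self, ref, node):
        """Add a commit on top of whatever the ref points at, then move the ref."""
        parent = self.refs.get(ref)
        self.commits[node] = GitCommit(node, parent)
        self.refs[ref] = node

    def delete_branch(self, ref):
        """Dropping the pointer leaves the commit objects untouched."""
        del self.refs[ref]

# Mercurial-style: the name travels with the commit forever.
hg_commit = HgCommit("h1", None, "feature")

# Git-style: the name is a movable pointer, and local and remote-tracking
# branches live in separate namespaces.
repo = GitRepo()
repo.commit("refs/heads/feature", "g1")
repo.commit("refs/heads/feature", "g2")
repo.refs["refs/remotes/origin/feature"] = "g1"  # remote still at the older commit
print(hg_commit.branch, repo.refs)
```

Deleting `refs/heads/feature` in this model removes every trace of the name, while the Mercurial-style commit keeps saying "feature" no matter what happens later.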
1. Switch to the default branch
2. Cherry-pick (with the equivalent hg command) my changes from my own branch into default
3. Push the changes to the remote repo
What happened was that my local branch got pushed to the server, along with the default one. With git this wouldn't happen, it would push the local master to the remote master.
(and you could in any case decide to push just default: hg push -r default)
hg push -b default
Pushes only changesets from the default branch. You can also do this with phases by marking the changesets on your branch as secret.
(Source of this assertion is prolonged discussions with sussman, one of the early svn authors.)
Merging changes across branches is easy too, as you only need to specify the revision number.
Mercurial will push/pull all branches by default, while git will push/pull only the current branch.
Did this behavior change in recent versions of hg?
1 - http://stevelosh.com/blog/2009/08/a-guide-to-branching-in-me...
Of these, only the phase approach is new (2.1+). The rest hasn't changed.
It's conceptually lightweight.
As described in http://mercurial.808500.n3.nabble.com/named-branches-vs-book...
> We have users with thousands of named branches in production and have
> done tests on up to 10k branches and the performance impact is fairly
That's a polite way of saying "we write shitty code without any sort of plan."
> "Splitting it up would make large, atomic refactorings more difficult"
Actually, it's the other way around. Modularity tends to obviate the need for large, atomic refactorings.
And what, exactly, is the meaning of these graphs? This is leading me to believe that being a developer at Facebook is about quantity over quality.
> We already have some of the easily separable projects in separate repositories, like HPHP
Presumably some parts aren't so isolated.
As described here http://www.informationisbeautiful.net/visualizations/million... the Facebook code base (~60 MLOC) is almost 4 times bigger than Linux 3.1 (a mere 15 MLOC).
But when you're dealing with code at Facebook's scale, things that "tend not to happen" actually happen quite a lot. In fact, you must plan for them as a matter of course.
So yes, modularity is great, and because I'm a nice guy I assume Facebook aren't a pack of idiots and that they're writing nice modular code. But even if that's the case, in an organization of Facebook's size you still need to make widespread, atomic refactorings on a regular basis.
I know this from experience, because I work at Google (much larger codebase than Facebook) on a low-level piece of our software stack. We face these issues regularly and while working in a single repo has its drawbacks, it also has real advantages.
The more I think about it, the more I think your post reveals a lack of maturity in our industry that lends credence to the pro-engineering-licensing argument that I've argued against many times on my own. That everyone can be so cavalier about this topic.
Because the fact that your companies are so large is EXACTLY why it makes no flipping sense that you're running gigantorepository. You have so many products, so many projects going on, that I just really have a hard time believing that it was disciplined software development that led to all of your code being so interdependent.
But the part that started getting under my skin was the fact that we aren't talking about Bob's Local Software Consultancy here. We're talking about two companies that touch the lives of hundreds of millions, perhaps even billions of people in the world.
If OpenStreetMap doesn't have their code in the same repository as Postgres, Linux, and DuckDuckGo, then there is no excuse for the Facebook Android App to be in the same repository as HHVM.
But whether you keep your code in one big repository or many small repositories, you still need to track and manage the dependencies between the various parts.
For instance, when a bug is discovered in library X, you need to know which binaries running in production were compiled against a buggy version of that library. At Google we can say "A bug was introduced at revision NNNNN and fixed at revision MMMMM. Please recompile and redeploy any binaries that were built within that range." (And we have tools to tell us exactly which binaries those are.) This is something that using One Giant Repository gives us for free.
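A minimal sketch of that range query, in Python. The `Binary` record, the binary names and the revision numbers are all made up for illustration; this is not a description of Google's actual tooling. It assumes every binary records the single repository revision it was built from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Binary:
    name: str
    built_at: int  # repository revision the binary was compiled from

def needs_rebuild(binaries, introduced, fixed):
    """Names of binaries built at a revision in [introduced, fixed)."""
    return [b.name for b in binaries if introduced <= b.built_at < fixed]

# Hypothetical production fleet.
fleet = [
    Binary("frontend", 1000),
    Binary("indexer", 1250),
    Binary("mailer", 1400),
    Binary("billing", 1500),
]

# Bug introduced at revision 1200 and fixed at revision 1500.
print(needs_rebuild(fleet, introduced=1200, fixed=1500))
```

Because there is one revision counter for the whole tree, a single integer comparison per binary is enough; no per-library version bookkeeping is involved.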
If you were taking the many-small-repos approach, for any given binary you'd need to track the various versions of each of its dependencies. You'd also need to manage release engineering for each and every one of those projects, which slows progress a lot (although we do have a release process for really critical components).
But like I said, there are relative advantages and disadvantages to either approach. To write software at this scale requires tools, processes, and good communication. Where you keep your code, at the end of the day, is actually a pretty minor concern compared to all the other stuff you need to do to ship quality products.
These issues are the same issues the rest of us in the world have to deal with when working with your APIs. Someone in one of the sibling comments has linked to an article discussing Bezos giving the command from on-high that Amazon would dog-food all of its APIs.
And apparently it isn't so minor of a concern if it warrants the first blog post out of Facebook in the last 3 weeks. Maybe it's just a coincidence that this is the first blog post of the year. It seems like they are trying to say "it's a big enough deal that we have spent, and are going to keep spending, a lot of money on it."
Maybe the problem is that Facebook and Google are just too big. They might have to be as big as they are to be doing the work that they are doing, but is that really the best thing for the rest of the world?
Just because a company is big doesn't mean they are working in the best way, or working in a way that is to the best benefit of the public. Might does not make right. We don't let large architectural engineering firms get away with doing whatever the hell they want just because they should have a proprietary interest in doing the best job possible, and we shouldn't be letting banks do it, either.
Yes, it's hard. Boo hoo. So is making safe cars. But you don't get the option to take the easy way out. Solve the hard problem, it's the job.
But gmail and maps...well, they share a lot of code! For instance, they both run on web servers.
Most code, including ads code, is readable by Googlers (and they can propose changes as well).
We did take a quick look at Mercurial, but since lots of the upstream tools we used were using Git (linux, uboot, yocto, etc) it was an obvious choice. I seem to recall there being two hg extensions that were of interest at the time (2010-ish), one to add inotify support and another to store large files outside the repository (hg-largefiles?).
Seems like Facebook's approach to the lstat(2) issue with watchman is to use inotify on Linux. This has been discussed a couple of times for git as well but nothing has come of it so far.
* You immediately get improvements from upstream projects without having to get them manually.
* You can unambiguously answer the question, What code am I using? with a single number. With multiple repositories, you have to list all the versions of each project that you are using.
* Easy API refactoring. You don’t have to worry about coordinating version number bumps across different repositories/dependency manifests when you make major changes to inter-project APIs. With a monolithic repository, you fix all callers of an API using a single code commit. No need to edit the version numbers in your pom files.
* Low cost to split a project into multiple separate projects. With multiple repositories or version numbers, you are reluctant to create new projects because the APIs will forever be harder to refactor (since you will have to worry about version numbers).
* No diamond dependency problem of 2 dependencies using a different version of a base project. Everyone is using the same version of base.
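The diamond dependency point in the list above can be illustrated with a short Python sketch. The project names, version strings and the `conflicting_pins` helper are hypothetical; the single-repo case is modelled by pinning everything to the same "HEAD".

```python
def conflicting_pins(manifests):
    """manifests maps project -> {dependency: pinned version}.

    Returns the dependencies that are pinned at more than one version,
    together with which projects asked for each version.
    """
    seen = {}
    for project, deps in manifests.items():
        for dep, version in deps.items():
            seen.setdefault(dep, {}).setdefault(version, []).append(project)
    return {dep: by_version for dep, by_version in seen.items() if len(by_version) > 1}

# Separate repositories: liba and libb each pin their own version of base.
multi_repo = {
    "app": {"liba": "1.4", "libb": "2.0"},
    "liba": {"base": "1.0"},
    "libb": {"base": "2.0"},
}

# One repository: everything builds against the same checkout.
single_repo = {
    "app": {"liba": "HEAD", "libb": "HEAD"},
    "liba": {"base": "HEAD"},
    "libb": {"base": "HEAD"},
}

print(conflicting_pins(multi_repo))
print(conflicting_pins(single_repo))
```

In the multi-repo layout `app` ends up needing two versions of `base` at once; in the single-repo layout the question never arises.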
With a monolithic repository+build system, upstream callers are responsible to never make a commit that breaks downstream callers. I feel like it’s similar to the question of optimal currency areas. If your organization is growing in lock-step, then you can all happily share a single gold-standard repository with little friction. But if you can’t trust your upstream projects, you introduce versioning between the projects and have to deal with the mental burden of wondering whether to upgrade to the newest upstream project and whether you’re actually running the latest code.
Edit: added a couple more.
You also immediately get regressions. Not trying to be dismissive, but we fundamentally have different software philosophies if you think this point (which is the essence of most of your points) is a good thing that should be encouraged.
But I don’t think the speed of receiving fixes is the essence of, or even primarily the source of, all my headaches with versions. The problem with mixing and matching versions within an organization is the enormous complexity that it introduces. Perhaps your downstream coworkers are still using an old version of your project, so they don’t want you to refactor their use of your API. Or perhaps you forgot to update your required upstream dependency when using a new function from a library, and your coworker’s program crashed because they’re still using the old dependency. Or perhaps someone forgot to bump the major version number when changing API or behavior, causing a previously built downstream project that is linking against the new upstream project to crash.
Now, these problems are all solvable if you and your coworkers are very disciplined in updating your version numbers and your required dependency versions. But it means that you constantly have to be aware of what APIs you export and what versions of APIs you are calling. You constantly have to edit the project manifests to bump version numbers. You must think about whether your changes will be major or minor. You carefully read the Changelist before using the newest upstream projects. It is a mental burden.
Contrast this to a monolithic codebase and build system. There are no version numbers in the dependency manifests to other projects in the company. If you want to change an API, you are responsible for fixing all the downstream users (rather than the other way around). Making a new project adds little mental overhead. If there is no impedance mismatch between the different teams of your company, it can make life much easier.
Imagine if I had a regression and I had to go to the other team saying "We just upgraded from the Foo you released 2 years ago to the one from last year and the performance sucks. Help!" I would not get any help. However I get plenty of support when I go to Foo-team to tell them that my Foo-per-second is 10% worse in the noon release compared to the midnight release.
Having artifacts and stable interfaces and library releases and all that is very ivory tower hocus pocus stuff. In practice instant integration is better.
Git submodules may have a number of problems of their own, but they solve this one. There's always an unambiguous version number, which is the commit hash of the top repo. Every subrepository's commit hash is stored in the top repo and the top repo's version is an unambiguous version number of the entire code base.
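Here is a toy Python sketch of why the top-level hash pins everything. It does not reproduce Git's real object format; the `top_level_id` function, the paths and the ids are hypothetical. The only idea it borrows is that the superproject's identifier is a hash over content that includes each submodule's pinned commit id.

```python
import hashlib

def top_level_id(own_tree_id, submodule_pins):
    """Hash the superproject's own content together with every pinned submodule commit."""
    digest = hashlib.sha1()
    digest.update(own_tree_id.encode())
    for path in sorted(submodule_pins):
        digest.update(f"\n{path} {submodule_pins[path]}".encode())
    return digest.hexdigest()

pins = {"libs/mylib": "a" * 40, "libs/other": "b" * 40}
before = top_level_id("tree-1", pins)

# Moving one submodule to a different commit changes the top-level id.
after = top_level_id("tree-1", {**pins, "libs/mylib": "c" * 40})
print(before != after)
```

So quoting the one top-level id is enough to say exactly which commit of every subrepository is in use.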
I wish that more effort would be put into Git submodules. They're pretty much an afterthought addition, but I've heard that there have been recent improvements and future improvements may be coming...
For protocols and file formats, Google universally uses protocol buffers with many optional fields. The protocol buffer library’s default is that when you read a protocol buffer, modify it, and write it back out, the fields that you didn’t understand are passed through. This means that middleman servers don’t need to be recompiled when you add new optional fields that they don’t use.
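The pass-through behaviour can be mimicked with plain dictionaries. This Python sketch does not use the protocol buffer library or its API; the field names and the `middleman_rewrite` function are hypothetical stand-ins for a server that was compiled before a new optional field existed.

```python
# Fields this (older) middleman was built to understand.
KNOWN_TO_MIDDLEMAN = {"id", "name"}

def middleman_rewrite(message):
    """Modify a known field and write the message back, keeping unknown fields."""
    known = {k: v for k, v in message.items() if k in KNOWN_TO_MIDDLEMAN}
    unknown = {k: v for k, v in message.items() if k not in KNOWN_TO_MIDDLEMAN}
    known["name"] = known["name"].strip()
    # Unknown fields are carried along untouched instead of being dropped.
    return {**known, **unknown}

# A newer client adds an optional "priority" field the middleman has never heard of.
incoming = {"id": 7, "name": "  report  ", "priority": "high"}
outgoing = middleman_rewrite(incoming)
print(outgoing)
```

The newer server downstream still sees `priority`, which is why the middleman does not need to be recompiled when the field is added.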
But for the actual client and server, you generally don’t have the luxury of replacing them both at the same time. So you have to add the new field that is disabled using a flag, wait for it to be rolled out to both the client and server, then enable the new field and disable the old field using the flag, then remove the flag and old field. It’s something that you coordinate with the release engineers. But it’s not formalized in the software version numbers.
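A small Python sketch of that flag-guarded migration, with hypothetical field names (`timeout_s` as the old field, `timeout_ms` as the new one) and a plain boolean standing in for the flag. It shows only the middle of the sequence: both sides already ship code that understands both fields, and the flag decides which one is written.

```python
def write_message(timeout_ms, use_new_field):
    """Writer side: the flag decides which field goes on the wire."""
    if use_new_field:
        return {"timeout_ms": timeout_ms}
    return {"timeout_s": timeout_ms // 1000}

def read_message(message):
    """Reader side: accepts either field, so the flag can flip without a redeploy."""
    if "timeout_ms" in message:
        return message["timeout_ms"]
    return message["timeout_s"] * 1000

# Step 1: code is rolled out everywhere with the flag off.
print(read_message(write_message(5000, use_new_field=False)))
# Step 2: once every client and server has the new code, the flag is flipped.
print(read_message(write_message(5000, use_new_field=True)))
# Step 3 (not shown): delete the flag and the old-field branches.
```

The reader never needs to know the state of the flag, which is what lets the client and server be replaced at different times.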
To me it seems like having to change all references for a breaking API change could be a debilitating amount of work in some cases. Do you then make your breaking change to a branch and lobby for other teams to catch up before merging to the main branch? What about situations where you have a legion of stable legacy applications that may not be worth updating for any reason other than critical bugs?
If an application is still being used, it is always stored in the source tree, where the unit tests are automatically run. You do still have choices to lock its API or file formats: you can consider the API deprecated and tell everyone to use the new V2 API, or you can move the old program into a branch (but still in the source tree that everyone can see). But you want to branch as little as possible; large unmaintained branches quickly become unmaintainable.
>Splitting it up would make large, atomic refactorings more difficult.
If you split the code base up into multiple projects, you need some sort of meta project to link them all together if you need to deploy it as a single entity. Doing a meta project isn't exactly easy, so it tends to be better to do a single project if everything is all tightly coupled still.
The project could be split up into smaller modules that are linked via shared common interfaces, but takes more time to maintain.
then myapp can reference library mylib via ../../libs/mylib
If all of these things are in separate repositories, all bets are off about how their paths relate to each other on the local filesystem, resulting in more complicated build procedures.
For example, you can have a /facebook/ dir on every machine with the source, and all your subprojects can assume they are working off that, ie:
While I may not personally like behemoth repositories, I do see why people would choose to endure them. They do have their advantages, even if they all eventually end up looking something like BSD ports. (FYI: I happen to like the extremely modular one-git-repo-per-component approach as used in Mer. Hell, I did an early implementation of the OBS+git -based autobuild system back in 2009!)
Our coping strategy is MQ. It's not the same, but solves the problem pretty well IMO. MQ patches can be sent around and backed up, so not being able to push them to the server is not such a big issue. It takes a little upfront investment to figure out how they work, though; Mozilla has a pretty good guide.
I've tried several times to use bookmarks for feature branches (read: branches developed in parallel to each other and the default branch). I thought I just can't figure it out, but it really seems impossible at this point.
That's exactly what git branches are.
Even worse - with the next git gc you will lose your data, as parts of the graph with no pointer are assumed to be dead by git.
I believe that they do use TFS on a lot of the internal projects though and that Visual Studio is now done in TFS.
Lots of good stuff there, but a key one is slide 8. In the question session Richard confirmed that Source Depot (the custom version of Perforce) was at that time still used for the source management although TFS was used for bug tracking etc for Office and Windows.
Don't know what has happened in the intervening years...
Apparently they use several large TFS instances.
MS doesn't have a unified build environment, every team generally does their own thing. It has its pluses and minuses.
Shared code would occasionally be useful at Microsoft, but not as often as you'd think. Generally when relying on another team's code, it is preferred to take it as a binary drop when they do a product release, just like any other customer. This helps avoid needing to deal with churn in one's dependencies.
Except for that whole ".NET framework" thing.
All web sites are of course written in ASP.NET. Giant portions of the Xbox Live service are written in .NET. Heck lots of tools and utilities are also .NET based.
Is someone going to rewrite IE in C#? Not likely, that isn't what C# is for.
C# isn't being used as a systems programming language, but it isn't meant to be one. (That said, I've written high performance code in C# before, you have to know what you are doing and understand your GC, same as writing high performance code in Java!)
Tons of LOB software internal to MS (indeed I'd say the vast, vast, majority of internal LOB software) uses C#. Lots of plugins and extensions to various tools use C#, and I wouldn't be surprised if lots of the PowerShell Cmdlets and Modules are C# based.
Now I have worked on a number of commercial projects that were written in C#, but of course I am unable to discuss them!
(None of this is spoken as an official MS employee of course, these are nothing but my own opinions!)
I understand their reasoning that technology shouldn't dictate the way they develop stuff, however I think splitting the project up and using submodules would have been a cleaner approach. If refactoring everything is something you do all the time, you might be doing it wrong.
Facebook's problem is that they were trying to scale with Git improperly. With conventional CVS[sic] systems like Perforce, you can scale a single repo nearly as large as any company will need. Emphasis on that nearly. At a certain point, with a large enough codebase (and, critically, enough throughput) you start to realize that you are about to hit a brick wall. With perforce, this starts to manifest with service brownouts.
With perforce, you can reasonably expect to run into this brick wall somewhere in the neighborhood of terabytes of metadata and dozens of transactions per second. That changes depending on what sort of beastly hardware you are willing to throw at your version control team.
Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
Perforce is probably good enough for Facebook in the present, but if you are a company that large and if you are forward looking, it becomes apparent that with existing version control technology, "one repo per company" is not a long term solution. Even "one repo per department".
You can split it even further, but what you realize is that you are developing infrastructure that allows you to use many repos (for instance your build servers and internal code search/browsing tools will now need to understand that concept) but you are losing many of the benefits of Perforce. While you are in the process of adapting your infrastructure to wrap its mind around many repositories, it makes sense to allow dev teams to really take advantage of this splitting. Develop infrastructure that allows Perforce and git repos to coexist in the company, allowing dev teams to spin up new git repos for their every project at will. Done properly, git allows you to create massively scalable systems that you can count on supporting your company's needs for the foreseeable future.
Smooth migration, migration that does not disrupt development, takes months (assuming the right initial conditions), so it is best to recognize the problem and start early, before service becomes disrupted.
If I understand Facebook's situation currently, they are still in the "try to make Mercurial scale" stage of denial, burning developer time and effort to push back that first brick wall (the same one that git hits, though mercurial hits it after git hits it but before perforce hits it...)
Here is a google presentation about extreme scaling with Perforce: http://www.perforce.com/sites/default/files/still-all-one-se...
An example of building multi-repo infrastructure for large projects with git is Android's repo: http://en.wikipedia.org/wiki/Repo_(script) Repo is just one example though; other, better, solutions are very possible.
I guess the TL;DR here is that I think it is great that Facebook is making a single Mercurial repository work for their purposes right now, but I think they are kidding themselves if they think that is a long term solution. They are doing a pretty good job of making Mercurial scale like Perforce can scale, but that will only work for them for so long.
(In the above comment I talk about building a scalable system with git ("with" git, rather than making git itself scale), but the same can of course be done with mercurial instead of git. I don't mean this to be a comment suggesting that they should use git instead of mercurial.)
Here's the Feb 2012 thread with Facebook's Joshua Redstone regarding their experiments with git:
In August 2012, they hinted they were near a solution:
...and that teaser was just today updated with the remark, "Our Mercurial team has just posted the goods! Our problems are solved, mainly by not using Git."
Fundamentally I like how this cleans up the design flaw of requiring history everywhere. 99% of the time when I clone or pull or diff or whatever I only care about HEAD. Why should I be forced to pull or store GBs of history or even MBs of metadata I can't use? Why not make leaving this data on the server optional? I can see how the decision to push history everywhere was made for simplicity but it doesn't reflect real world usage and clearly isn't scalable. Let's hope these history-option patches continue to be developed and make their way upstream. They certainly have my vote, not just as options but as defaults.
I mean, they run into a resource allocation problem for an application that is essentially a glorified web view, and they think "how can I hack up the VM to bypass this limitation?" Sounds like insanity to me.
Each individual step seems like a reasonable and sane thing -- but when you look at the result of all those steps -- you stand back and just stare in awe at the horror you have created... it hits you just how far you have drifted.
In a recent meeting, we had a bit of a corporate redirection -- and a co-worker simply asked the question "Is this a sane approach to what our actual problem is?" -- glad to be working with a crew who asks those kinds of questions.
Just because their solution is complex doesn't mean they're using the technology wrong. They set out their requirements and they met all of them. All their engineers are more efficient now and all this complexity happens behind the scenes.
I'm curious - do you know of any such better examples, or is this merely a theoretical "I feel like we could do better" statement?
Basically though, repo is good for a single project ("Android", as a collective entity) but a company looking at working with many repos will want something designed with handling many unrelated projects at once. (For example, all the repos for "Android" and all the repos for "ChromeOS" should be able to live alongside each other in this system without any developer hassle, even though they may not have anything to do with each other).
Also repo is really only one part of the solution a company should look at. Properly done, the sort of system that I am talking about is really several systems that are all basically on the same page. Your build servers should understand how to get the code they need to build something, your internal code search/browsing tools should understand where code lives, etc.
Furthermore, allowing different types of repos to exist alongside each other is a good idea. That way each team can individually decide if they want to use Mercurial, Git, or even SVN. To my knowledge, the repo script is only good for many git repositories.
(my email is listed in my HN profile)
I personally think, if you want to take Git to the next level, you'll have to implement something like UCM ClearCase activities and projects. Commits should not be bound to a single repository and it should be very easy for people to say I want to create a product that uses starting points from branches x, y and z without having to think about what repo they belong to.
I have a couple of examples that show how my product de-emphasizes the repository. In the following example, you can see my Commits Finder tool, which lets you search for commits by branches.
As you can see, the search results show commits from 10 different branches from 7 different repositories. My other example is my GitHub Pulls Finder tool which lets you search for pull requests by branches.
Here you can see pull requests from 6 different branches from 6 different repos.
Since my solution is read-only, it simplified things, but I don't think creating a nice layer on top of sub-modules would be that technically challenging. And it's the direction I would personally go if I wanted to make Git more enterprise friendly.
This is the fundamental problem Facebook (and Google, Amazon, etc) are all trying to solve: how to share code company-wide. Sub-repos and making your tools hack around the repos doesn't solve that problem.
Say one aspect of a package suddenly starts to grow quite fast and take on a life of its own, and you would like to split it into its own dedicated package (and therefore, its own repo). How do you do that without losing the history of those files?
It's possible with git, but it isn't exactly what I would call straightforward. With a one-repo system you simply move those files to their new location, just like any other file move/rename.
I don't know how you did it, but I found `git subtree split` to be an easy solution for extracting a directory into its own repo:
(It's a shame git-subtree is still in contrib/; it should really be enabled by default. It can also work as a git-submodule replacement in some cases, by the way.)
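For anyone who has not used it, the split can be sketched roughly as below. This is only an illustration, not the exact commands the comment above had in mind: the directory `lib/rpc`, the branch name `rpc-only`, and both repo paths are hypothetical, and `git subtree` has to be installed from contrib/ for the first command to work. The Python helper only builds the command lines so they can be inspected before anything is run.

```python
from typing import List


def subtree_split_commands(prefix: str, branch: str,
                           old_repo: str, new_repo: str) -> List[List[str]]:
    """Return the git commands that extract `prefix` into a new repo.

    All four arguments are placeholders chosen by the caller, and the
    repo paths are assumed to be absolute. Nothing here is executed.
    """
    return [
        # In the old repo: create a branch with a rewritten history that
        # contains only the contents of `prefix`, with that directory
        # as the root of the tree.
        ["git", "-C", old_repo, "subtree", "split",
         f"--prefix={prefix}", "-b", branch],
        # Create the new, empty repository.
        ["git", "init", new_repo],
        # In the new repo: pull the split branch from the old repo.
        ["git", "-C", new_repo, "pull", old_repo, branch],
    ]


# Print the commands for a made-up layout.
for cmd in subtree_split_commands("lib/rpc", "rpc-only",
                                  "/path/to/big-repo", "/path/to/rpc-repo"):
    print(" ".join(cmd))
```

Once the paths are real, each list can be passed to `subprocess.run(cmd, check=True)` in order.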
Either isn't as nice as `p4 move ...` though.
Keeping a giant repository on the other hand makes it very easy to do stupid things and end up with tightly coupled code. You don't want that to be the path of least resistance.
You can get all your many repos in one place, with all of your internal systems handling them nicely, but moving one file from one repo to another, in a way that history remains clear, is something of an unsolved problem.
>Git of course hits a brick wall much sooner, somewhere around single-digit gigabytes of data (depending heavily on the average size of every object in the DAG), even ignoring throughput.
If you use Git the same way you'd use Perforce or Mercurial - one central server handling everything ... you'd better stay with Perforce or Mercurial. Corporate dev process is usually built around the notion of a "central" repo (and they equate it with "master") - just replacing the Perforce server with a Git server wouldn't change anything.
The problem is "how do you grow even further?" There isn't a perfect solution to that (yet), but I think the only existing workable (although absolutely imperfect) solution is splitting your codebase into many small repos.
Thankfully very few companies need to worry about this problem right now. A single Mercurial repo is apparently scaling for Facebook currently, and few companies will manage to stress Perforce.
For all of the source code in the world that is source-controlled, it exists in separate repos. It doesn't exist in one repo, and where there are inter-project dependencies, it is the dependency consumer's responsibility to keep abreast of the changes in the dependency and integrate them as necessary. It is self-organizing.
So, if Google and Facebook have grown beyond what can be done with a single repo, either they are not being mindful of their engineering practices (which puts the public at significant risk), or they are approaching a scale that is more similar to "world scale" than it is "corporate scale".
Which I would also say is putting the public at significant risk.
that is the problem - single logical repo in Perforce/Mercurial does mean single physical repo. Which obviously causes issues at very large scale. Even at normal enterprise scale :)
Git solves it through kind of distributed scaling where many operations can be performed on local physical repos without ever hitting the central. With centralized solutions, like Perforce, Mercurial, etc... most of the operations are performed on central server and your ability to scale vertically hits the ceiling pretty soon.
Compare the simplest case - each dev updating his local workspace/repo would hit the central server in Perforce, whereas with Git you can have (and normally would have) a small set of downstream repos through which the updates would be propagated/distributed. You can branch left and right in Git in your local and team's repos without the "master" repo and "central" server involved (while all your branch activity in Perforce is happening right on the central server in the central repo). The same happens for synchronizing your work inside the team - no need to hit the central. Etc...
My bad. It's been 5+ years since I worked with Mercurial. Digging deep into painful memories, my Mercurial PTSD from that time is the absence of in-repo branching (you need to clone, which is a killer for the very large repo we had) and no partial commits, again a killer aggravated by the above-mentioned absence of in-repo branches. Both issues made working with a large repo unreasonably and unnecessarily hard.
These days, the standard practice is probably to use bookmarks, which are more or less like Git branches.
No. You have the choice to use either named branches or bookmarks, and which one you choose is a matter of your workflow. Note that even a Git-like workflow does not necessarily require the use of bookmarks.
Mercurial is still designed under the assumption that people will use named branches and the Git-like aspects of bookmarks exist to ease transition from Git: http://www.selenic.com/pipermail/mercurial/2014-January/0464...
Personally, I use Fossil-like private branches [1] in order to facilitate local work that I may still want to reorganize before pushing it to a shared server; I use bookmarks solely for local tags.
[1] Emulated by putting a named branch in the secret phase. I actually have a simple extension to automatically make all branches that start with a dot (e.g., ".tempstuff") secret.
What do you mean by partial commit? I can guess but I would rather know. :-)
I think you're incorrectly putting mercurial and perforce in the same bucket. Mercurial is a lot more like Git than perforce in that it's a DVCS. Facebook made it scale by using a lot of centralized caching, but most operations (diff, looking through history, branching, etc) can be done locally.
But webkit and chromium, as well as other GIANT projects which as far as I know are larger than Facebook, seem to work fine on Git.
webkit and chromium have about 6 or 7 million lines of code and < 200,000 commits.
Facebook 6 months ago had 62 million lines of code and > 1000 commits a day.
Also, there's a difference between those open source projects using large repos with git and Facebook wanting to increase developer efficiency. Developers sitting around waiting for a rebase doesn't really pay off, whereas an open source project can get away with it.
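To put the figures quoted above side by side, here is a quick back-of-the-envelope calculation. It uses only the numbers from this comment, treated as rough bounds rather than measurements, and the variable names are made up for illustration.

```python
# Figures as quoted in the thread; all are rough bounds.
WEBKIT_LOC_LOW = 6_000_000
WEBKIT_LOC_HIGH = 7_000_000
WEBKIT_TOTAL_COMMITS = 200_000      # "< 200,000 commits"
FACEBOOK_LOC = 62_000_000
FACEBOOK_COMMITS_PER_DAY = 1_000    # "> 1000 commits a day"

# How many times larger Facebook's codebase is, by line count.
size_ratio_low = FACEBOOK_LOC / WEBKIT_LOC_HIGH
size_ratio_high = FACEBOOK_LOC / WEBKIT_LOC_LOW

# Days for Facebook to produce as many commits as the whole
# webkit/chromium history, at the quoted minimum rate.
days_to_match = WEBKIT_TOTAL_COMMITS / FACEBOOK_COMMITS_PER_DAY

print(f"codebase is {size_ratio_low:.1f}x to {size_ratio_high:.1f}x larger")
print(f"matches the full commit count in at most {days_to_match:.0f} days")
```

On these numbers, the entire commit history of those projects corresponds to well under a year of Facebook's commit traffic.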
If they used git with say only the last year of history in it they would be having zero issues.
Tracking down bugs in the history is a huge deal, and it's often absolutely critical. And yes, regressions go past 1 year.