What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.
A few examples:
* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.
* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O.
An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)
* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), and the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).
None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
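To make the lstat point concrete, here's a toy scan (hypothetical layout, not git's actual code) showing that a status-style check must issue one lstat per tracked file, whether or not anything changed:

```python
import os
import tempfile
import time

# A status-like scan lstat()s every file to compare mtime/size against
# the index, so the syscall count is O(number of tracked files) even
# when nothing changed. On a cold dentry/inode cache, each of those
# lstats can also mean disk I/O.
def scan(root):
    count = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            os.lstat(os.path.join(dirpath, name))  # one syscall per file
            count += 1
    return count

with tempfile.TemporaryDirectory() as repo:
    for i in range(1000):
        open(os.path.join(repo, f"file{i}"), "w").close()
    start = time.monotonic()
    n = scan(repo)
    elapsed = time.monotonic() - start
    print(n, "files lstat'ed in", f"{elapsed:.4f}s")
```

Scale the file count to a few million and even the warm-cache cost becomes noticeable; cold, it's dominated by seeks.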
So, presumably, does the cache when you use lstat. (Scratch "presumably": it does. Bonus points if you can't use Linux and are on an OS that seems to evict its caches as soon as possible.)
I hope I'm wrong, but the proper solution to this seems to be a custom file system - not only will it allow you to more easily obtain a "modified since" list of files, it also allows you to only get local files "on demand". (E.g. http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...)
That still doesn't solve the data structure issues in git, but at least it takes some of the insane amount of I/O off the table.
I'm looking forward to seeing what you guys cook up :)
So for your first item, it seems like it should be possible to add a (mostly immutable) cache file doing the job of Mercurial's files field in changesets, right? I.e. for each commit, list the files changed. Should be more efficient than searching through trees/manifests for changed files, at least.
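A minimal sketch of that idea (hypothetical data, not git's or Mercurial's real on-disk format): record the paths each commit touched, then invert it into a path-to-commits index so blame/log can skip commits that never touched the path:

```python
from collections import defaultdict

# Per-commit list of changed paths, like Mercurial's "files" field.
# (Toy hashes and paths for illustration.)
commit_files = {
    "a1b2c3": ["src/main.c", "README"],
    "d4e5f6": ["src/util.c"],
    "789abc": ["src/main.c"],
}

# Secondary index: path -> commits that touched it. With this, a blame
# of src/main.c inspects 2 commits instead of diffing trees for all 3.
by_path = defaultdict(list)
for commit, paths in commit_files.items():
    for p in paths:
        by_path[p].append(commit)

print(by_path["src/main.c"])
```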
For large (in files) trees, it seems like there's no easy solution, except for developing some kind of subtree support. However, that's similar to just splitting up the repository (along the lines of hg subrepo support), in the sense that now you have no real verification that non-checked-out parts of the tree will work with the changes in the part you do have checked out.
Still, the inotify daemon seems like it could alleviate things a bunch; particularly if the repository is on a server anyway, i.e. it's not rebooted that often.
Out of curiosity, why are these benchmarks using regular disk and flash disk? At only 15 GB, what happens using ram disk? Sure SSD is fast, but for these things it's still really slow.
Sorry for asking the obvious, but do you really need a huge amount of data to keep development productive? How often do you use history that is several years old? Could you not archive it?
Or is the sheer number of files the problem, even ignoring history?
This is not a "git is perfect, fix your workflow" post, but I'm genuinely interested in what you have to say. Also, it seems like making git faster is an increasingly difficult task, given the amount of effort that has already been put into it.
And that might dovetail nicely with an inotify daemon?
Why doesn't Facebook solve this git-on-huge-repos problem and put out a patch for others to see? Oh, right, you want somebody else to solve the problem for you, for free!
So instead of a potentially very enlightening conversation identifying and talking about limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
I don't know what facebook's use case is, so I have no idea if their repositories are optimally structured. However, I've used git on a very large repository and ran into some of the same performance issues that they did (30+ seconds to run git status), so I don't think it's terribly hard to imagine they're in a similar situation.
What we did to solve it is exactly what you're excoriating the people below for suggesting: we split the repos and used other tools to manage multiple git repos, 'Repo' in some situations, git submodules in others.
However, we moved to that workflow mainly because it had a number of other advantages, not just because it made day-to-day git operations faster.
I hope git gets faster; some of the performance problems described are things we saw too. But things are always more complicated, and I see nothing below that looks like the knee-jerk ignorant consensus you're describing.
Sometimes the answer to "it hurts when I do this" is "don't do that... because there's other ways to solve the same issue that work better for a number of other reasons and we haven't bothered fixing that particular one because most of the time the other way works better anyway."
Making stuff modular is often a good idea.
Solving a scaling problem by splitting it is, well, obvious.
And, yes, I also ran github on a couple of projects at $work, and the issues are real; I've seen them.
So, if it hurts when I try to use git - the answer will be don't use git... But the conveniences are so tempting...
Stat'ing a million files is going to take a long time. Perforce doesn't have this problem because you explicitly check out files (p4 edit). (Perforce marks the whole tree read-only, as a reminder to edit the file before you save.)
It seems like large-repo git could implement the same feature. You would just disable (or warn) for operations which require stat'ing the whole tree.
Then the question is how to make the rest of the operations perform well -- git add taking 5-10 seconds seems indicative of an interesting problem, doesn't it?
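A toy model of the Perforce-style "explicit checkout" layer suggested above (not a real git extension): the tree stays read-only, an edit command records which files are open for write, and status only has to look at that small set instead of lstat'ing the whole tree:

```python
import os
import stat
import tempfile

# Files start read-only, like a synced Perforce client. 'edit' makes a
# file writable and records it; 'status' is then O(files you opened),
# not O(all files in the tree).
opened = set()

def edit(path):
    os.chmod(path, stat.S_IWUSR | stat.S_IRUSR)  # make writable
    opened.add(path)

def status():
    return sorted(opened)

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "a.c")
    open(p, "w").close()
    os.chmod(p, stat.S_IRUSR)   # checked-in file: read-only reminder
    edit(p)
    print(status())
```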
At least that's what I'd like to see - it's functionality that's orthogonal to those tools.
Programs will still have to inspect those directories to find out what file(s) changed.
To quote https://developer.apple.com/library/mac/#documentation/Darwi...:
To better understand this technology, you should first understand what it is
not. It is not a mechanism for registering for fine-grained notification of
filesystem changes. It was not intended for virus checkers or other technologies
that need to immediately learn about changes to a file and preempt those changes
if needed. [...]
The file system events API is also not designed for finding out when a
particular file changes. For such purposes, the kqueues mechanism is more
The file system events API is designed for passively monitoring a large tree of
files for changes. The most obvious use for this technology is for backup
software. Indeed, the file system events API provides the foundation for Apple’s
This is a nice overview of FSEvents
I found the original email equally disappointing, though. It boils down to "We pushed the envelope on size, it's too slow, we'd like to speed it up." Well, duh.
He uses the word 'scalability' early in the email, but shows no indication that he knows what it means. I'd love to hear if different operations slow down at different rates as the repo accumulates commits. Do they scale linearly, sublinearly, or superlinearly as the repo grows? Are there step functions at which there's a sudden dramatic slowdown (ran out of RAM, etc.)?
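The kind of measurement the email never gives us can be sketched like this (placeholder O(n) operation; in practice you'd shell out to e.g. `git status` against repos of increasing size):

```python
import time

# Time the same operation at several sizes and compare growth rates.
def op(n):
    total = 0
    for i in range(n):          # stand-in for an O(n) repo operation
        total += i
    return total

timings = []
for n in (10_000, 100_000, 1_000_000):
    start = time.monotonic()
    op(n)
    timings.append(time.monotonic() - start)
    print(n, f"{timings[-1]:.4f}s")
# A roughly constant ratio between steps suggests linear scaling; a
# growing ratio suggests superlinear; a sudden jump is a step function
# (e.g. the working set no longer fits in RAM).
```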
You don't spill internal processes and configurations without some kind of disclosure agreements and certainly not in a public forum.
He's soliciting opinions. I'm not sure how anybody can comment meaningfully based on the data he's given.
How git performs as repo size grows to 15GB isn't hidden in a vault at facebook somewhere; I suspect they just haven't done anything more detailed than a superficial time measurement.
And as much as I'd like to hold being truly open as an ideal, it falls apart when you're dealing with competition (not cooperation) and money. At best you try to keep things open enough.
It's just surprising, given that git was designed for the Linux kernel, when all of us here have a GitHub mindset.
Git and HG:
1. Require you to be sync'ed to tip before pushing.
2. Cannot selectively check out files.
The former means that in any reasonably sized team, you will be forced to sync 30 times a day, even if you are the only one editing your section of the source tree. The latter means that Joe, who is checking in the libraries for (huge open source project) for some testing, increases everyone's repo by that much, forever, even if it's deleted later.
Needless to say, the universal response is that I'm doing it wrong.
Perforce 4 life!
But seriously, it says that Google adopted Git for their repo --- does anyone know how they use it? I would expect them to want a linear history, but their teams are way too big to be able to have everyone sync'ed to tip to push...
That's not the case. In fact, in the context of Linux kernel development, there's many emails on LKML where Linus is telling someone that they shouldn't be merging random-kernel-of-the-day into their development branch.
That's a common use for git at Google, but not the only one (I'm a SWE). When I do use perforce I've got enough rhythm that it doesn't get in my way, but I really like git at Google for local branches on rapidly-changing subtrees. A lot of the time I'll work on a branch to submit as a CL, but then realize I should do something else that depends on it. Perforce is a mess in this situation if the tree is changing much, and git is perfect if you just make a new branch.
A lot of posts on hn describing some problem elicit "Why, that's no problem at all!" responses or "That's the wrong problem to think about" responses.
Honestly that mindset is often really useful in programming, but when we get a problem that doesn't have a shortcut and is relevant, conversation goes to shit. Because I guess that's when programmers normally go into a hole and brute-force brain it out.
How to use mass comms to talk about a difficult open problem is, I suppose, itself an open problem.
It comes out of the Linux kernel, where you need a secure hash of a segment to prevent compromise. For big projects you have submodules; you can only aggregate at a higher level later.
In a company, you trust the sources. With Perforce you check out files and work with the part you want.
It is a design decision, and they could have known beforehand.
It is my guess (though I have no proof) that most places with particularly large repositories have lots of binary files in them. It's hard to get a 15GB repository if you just have text.
This sort of thing suggests a centralized check-in/check-out model, because binary files are difficult to merge sensibly, and nobody wants to spend terabytes of hard drive space storing the repository locally. And your centralized check-in/check-out needs, whatever scale they might be, are probably tolerably well served by one of the existing solutions.
15 GB is a tiny, tiny repo. I have a 7 TB repo "here" (really, spread among various drives, servers, S3, etc). :)
I think they are probably on par with craigslist in profits per employee (i.e. much higher than Google or Facebook. Interestingly I think Facebook has about 1/10 the employees of Google with 1/10 the profits -- off the top of my head feel free to correct -- so I don't think they blew it out of the park with their IPO filing).
If I were to make a snarky comment it would be that Git is for poor people and Perforce is what you use when you grow up. That's not an even remotely reasonable statement, but it does have a teeny, tiny hint of truth to it. :)
I've used it for 10+ years at work, and it's easy to explain to anyone how to use it (from production staff to artists and coders).
It has its gotcha moments, but I haven't seen better. We normally run a 50-100GB depot, often with whole branches of the game copied.
Perforce is a great system, but it's showing its age by now. I think there is probably room for someone to make another product in the high-end space and make boatloads of cash from big companies, but it's not easy.
Care to elaborate? Do you mean in terms of distributed -vs- centralised repos?
Another part of it is working disconnected -- with so many people coding on their laptops that's actually a pretty common use case.
Also the lack of need to do sysadmin work on git/hg is really nice. I used to run the free Perforce server a long time ago for myself, but it was annoying to do the backups. With git or hg you get whole-repository backups for free.
The "big repository with all dependencies model" has its drawbacks but it's interesting that facebook finds a lot of use for it, and that git is unsuitable for it. Perforce is probably still their best choice in that case.
Later this year they are adding p4 Sandbox which allows for disconnected work. When that is complete and working I'm honestly not sure what advantage git will have left other than being free.
So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.
Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.
People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.
So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools and continuous integration across all their projects. And Git just doesn't work for that.
At MS we also use Perforce (aka Source Depot), and I've toyed with the idea of doing something similar. Have you found any guides for "gotchas" or care to share what you've learned going this route?
So if you're planning on doing this at your own company, my advice is to write your own scripts that make whatever conventions you have automatic, and to move everyone over at the same time. That way, you won't be the weird one whose stuff is always broken.
I think most people got burned by cvs2svn and git-svn and think that using two version control systems at once is intrinsically broken. It's not. svn was just too weird to translate to or from. (People that skipped svn and went right from cvs to git had almost no problems, I'm told.)
Can you expand on this? I would love to talk more about the "well known" part, I've never run across it before. I am a maintainer (tools guy actually) of a hg repo with about 120 subrepos, and the whole approach with subrepos is something that we're not thrilled about. Oh, and if you want to communicate via email, I'd be up for that too.
Repo is a repository management tool that we built on top
of Git. Repo unifies the many Git repositories when
necessary, does the uploads to our revision control
system, and automates parts of the Android development
workflow. Repo is not meant to replace Git, only to make
it easier to work with Git in the context of Android. The
repo command is an executable Python script that you can
put anywhere in your path. In working with the Android
source files, you will use Repo for across-network
operations. For example, with a single Repo command you
can download files from multiple repositories into your
local working directory.
With approximately 8.5 million lines of code (not
including things like the Linux Kernel!), keeping this all
in one git tree would've been problematic for a few reasons:
* We want to delineate access control based on location in the tree.
* We want to be able to make some components replaceable at a later date.
* We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
* We don't want our most technical people to spend their time as patch monkeys.
The repo tool uses an XML-based manifest file describing
where the upstream repositories are, and how to merge them
into a single working checkout. repo will recurse across
all the git subtrees and handle uploads, pulls, and other
needed items. repo has built-in knowledge of topic
branches and makes working with them an essential part of
Stay away from Repo and Gerrit. I use them at work, and they make my life miserable.
Repo was written years ago, when Git did not have submodules, a feature where you can put repositories inside repositories. Git submodules are far superior to Repo, and allow you to e.g. bisect the history of many repositories.
I'm hoping that Google comes to its senses and starts phasing out Repo in favor of Git submodules in Android development.
The worst part of repo + gerrit is that their default work flow is based on cherry-picking, and they introduce a new concept called Change-Id. The Change-Id is basically yet another unique identifier for changes, stored in the commit message. The intent is that you make a change (a single-commit patch), a post-commit hook adds the Change-Id to the commit message, and then you upload it for review. When you make additions to your change, the previous change gets overwritten. Gerrit tries to maintain some kind of branching (called dependencies), but it messes things up when there's more than one person working on a few changes at the same time.
In comparison with GitHub-style work flow where you make a branch with multiple commits, submit a pull request, get review, add commits, squash and merge, the repo + gerrit model is awfully constraining.
We might be using an old version of repo and/or gerrit and some of the issues I've encountered may be improved. However, I think that repo+gerrit is a mess beyond repair and trying to "fix" it only makes things worse.
Unless you work on Android and are forced to use repo+gerrit because Google does so, stay out of it.
There really isn't another solution out there right now (at least not anything open source) for very large single repositories.
What about Git submodules? They do fundamentally the same thing as Repo, but it's a built-in Git feature and not a bunch of scripts.
Repo can make your life very hard and you have to be a black belt Git ninja to understand what's going on when things don't go as you intended. Git submodules don't depend on having arbitrary GUID strings in your commit messages either (like Repo's Change-Id).
GitHub's reviews can handle Git submodules (but it's not free or open source). If someone knows any open source code review tools that can handle them, please tell us.
Sorry for beating a dead horse, but I really want to save someone from fucking up (or at least re-centralizing) their workflow with repo scripts, when native git is better.
So what's the story here: kernel developers put up with longer git times, the kernel is better organized, the scope of facebook is more massive even than the linux kernel, or there's some inherent design in git that works better for kernel work than web work?
This is on the largish side for a single project, but if Facebook likes to keep all their work in single repo then it isn't too difficult to go way beyond those stats. Think of keeping all GNU projects in a single repo.
As mentioned in this talk on how Facebook worked on visualizing interdependence between modules to drive down coupling at
https://www.facebook.com/note.php?note_id=10150187460703920 , there are at least 10k modules with clear dependency information in a front-end repo, and the situation probably is a lot better now that they have that information-dense representation to work from (I don't work on the PHP/www side of things, I spend most of my time in back-end and operations repos).
Then tags at the super-repository level can record the exact state of all submodules.
It's not about not checking the other modules out; you can make this the standard behavior, sure. Instead it's about having git manage reasonable sized blocks of the code base.
1) Instead of doing one large release every week (which facebook does: http://www.facebook.com/video/video.php?v=10100259101684977) you now have dozens or hundreds of smaller releases, a lot more heterogeneity to test for.
2) If you have inter-dependencies on modules you have to grapple with the "diamond dependency" problem. Say module A depends on module B and C, and suppose that module B also depends on C. However, module B depends on C v2.0 but A depends on C v1.0. If they're all split across repositories it's not possible to update a core API in an atomic commit.
3) Now you rely on changes being merged "up" to the root and then you have to merge it "down" to your project. This is one of the reasons Vista was such a slow motion train wreck: http://moishelettvin.blogspot.com/2006/11/windows-shutdown-c... -- kernel changes had to
be merged up to the root, then down to the UI, requiring months of very slow iterations to get it right.
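The diamond in point 2 can be made concrete with a toy version resolver (hypothetical module names): once A and B pin different versions of C in separate repositories, no single atomic commit can reconcile them:

```python
# A depends on B and C v1.0; B depends on C v2.0. In one repository,
# a single commit can move both A and B to the new C API; split across
# repositories, the pins conflict.
deps = {
    "A": {"B": "1.0", "C": "1.0"},
    "B": {"C": "2.0"},
}

def conflicts(deps):
    pins = {}
    bad = []
    for mod, reqs in deps.items():
        for dep, ver in reqs.items():
            if dep in pins and pins[dep] != ver:
                bad.append(dep)     # two modules pin different versions
            pins.setdefault(dep, ver)
    return bad

print(conflicts(deps))   # -> ['C']
```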
Back-end services have their own release schedules and times, and obviously are made to be highly backward compatible so that they don't need to be done in lock-step with the front-end.
I think you're right about the "diamond dependency" model, but I think the merge-up and merge-down in Vista had more to do with having multiple independent branches in flight at the same time.
There are ALREADY interdependency issues if you're using git. Anyone could have changed anything before you did your last commit. If you pull and then push your changes without running tests, you're already risking breaking the build.
If a diamond dependency conflict came up, it shouldn't ever be committed to the TRUNK. Whoever made the change that causes a conflict should ideally discover that conflict before they submit it to TRUNK. But it's still not relevant to the decision to have one big repo or a hundred smaller ones, since the exact same problem can come up in either case if you fail to test before you submit.
With the proper workflow, having a lot of smaller trees can be functionally equivalent to having one massive tree. Checking in your branch to the TRUNK would require that you have the latest version of the TRUNK, and if you don't, then you'd need to pull the latest, just like how git works now. And updates to the TRUNK ARE atomic: Before you update TRUNK, it's pointing to the previously working version, and after you update, it's pointing to the one you just tested to make sure it still works.
And, just like now, a developer should therefore test to make sure the merge works after they've grabbed the latest.
It sounds like you're assuming that, if I update one of the submodules, it could break TRUNK? That can't happen without someone trying to commit it to TRUNK. Git remembers a specific commit for each submodule, and doesn't move forward to a new commit without being told to.
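The pinning described here can be sketched in a few lines (hypothetical module names): the super-repo is just a mapping from submodule to a specific commit, and updating it is an atomic swap that only happens after tests pass:

```python
# TRUNK pins each submodule to an exact commit, like git's gitlink
# entries. An update replaces the whole mapping at once, so TRUNK
# always points at a set of commits that was tested together.
trunk = {"net": "c0ffee", "ui": "beef01"}

def update(trunk, module, commit, tests_pass):
    if not tests_pass:
        return trunk            # TRUNK still points at the old, working set
    new = dict(trunk)
    new[module] = commit
    return new                  # atomic swap to the newly tested set

trunk2 = update(trunk, "ui", "beef02", tests_pass=True)
print(trunk2["ui"])   # -> beef02
```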
Any time you were pulling the entire repo tree, it could be slow, yes. But assuming people are only working in one or a small number of repos at once, you can imagine a workflow that didn't involve nearly so many operations on the entire tree.
(I almost habitually run "git status" whenever I've task switched away from code for even a few seconds to make sure I know exactly what I've done, which would have to look over the whole super-repo as well.)
Thankfully we're a while away from the times in the synthetic test; it's not something I notice at all, but I probably write less code than most engineers here.
In the open-source world, when you want to change an API, you have to either add the change as a new API (leaving the existing API intact) or break backward compatibility and maintain parallel versions, gradually migrating users off of the old version.
Both of these options are a huge pain, and have a direct cost (larger API surface or parallel maintenance/migration efforts). When your entire repo and all callers are in the same code-base, you have a much more attractive option: change the API and all callers in a single changelist. You've now cleaned up your API without incurring any of the costs of the two open-source options.
This is why it can be nice, even if you have a bunch of nicely structured components, to have all code in a single repository.
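The "change the API and all callers in a single changelist" option can be sketched as one pass over the whole tree (naive textual rename on hypothetical files; real large-scale changes use semantic rewriting tools rather than string replacement):

```python
# With every caller in one repo, an API rename is a single atomic
# change: rewrite the definition and all call sites, commit once.
sources = {
    "lib/api.py":  "def fetch_user(id): ...",
    "app/a.py":    "u = fetch_user(42)",
    "app/b.py":    "print(fetch_user(7))",
}
new = {path: text.replace("fetch_user", "get_user")
       for path, text in sources.items()}
print(new["app/a.py"])   # -> u = get_user(42)
```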
Oh wait, it is.
I'm not sure what the exact situation at Facebook is with this repository, but I'm positive that if they had to start with a clean slate, this repo would easily find itself broken up into at least a dozen different repos.
Not to mention the fact that if _git_ has issues dealing with 1.3M files, I wonder what other (D)VCS they're thinking of as an alternative that would be more performant.
And no, they don't split into multiple repos, they might well have the entire company's source code in a single repository (code sharing is way easier this way).
This is what package systems are for.
Our source repo at work (a C++ compiler with full commit history going back to the early 90s...) is smaller and more componentized!
Enterprise source control can be ugly - particularly if you have non-text resources (Art, Firmware Binaries, tools) that need to be checked in and version managed as well.
With all that said - I don't really understand why all the code is in a single repository. Surely a company of Facebook's size would experience some fairly great benefits from compartmentalization and published service interfaces. I guess I agree with the parent - sounds like a lot of intertwined spaghetti code. :-)
The biggest advantage of a single repository is pretty intangible - it's cultural. When anyone can change anything or can use any code, people feel like the whole company belongs to them, and they're responsible for Google/Facebook's success as a whole. People will spontaneously come together to accomplish some user need, and they can refactor to simplify things across component boundaries, and you don't get the sort of political infighting that tends to plague large organizations where people in subprojects never interact with each other.
I think if it were my company, I'd want the single repository model, but there need to be tools and practices to manage API complexity. I dunno what those tools & practices would look like; there are some very smart people in Google that are grappling with this problem though.
It's even worse than just disk space and performance issues.
I can totally imagine a huge, busy repository where by the time you've pulled and rebased/merged your stuff, the repo has already been committed to again, invalidating your fast-forward commit and forcing you to pull again and again before you have any chance of pushing back your changes.
This is an inherent problem with DVCS that just can't be solved (trivially) when working on huge repositories that span millions of files and involve thousands of developers.
They keep every project in a single repo, mystery solved.
> We already have some of the easily separable projects in separate repositories, like HPHP.
Yeah, because it makes no sense to keep that in there; it's C++. I assume they use PHP for everything else, then. Is there no good build management tool for it?
This kind of "Duh, look what you're doing" response isn't really justified.
Sure, splitting up your repository would make things faster, but having to maintain multiple repositories is a major headache for the end-users of git. If it's possible, why not fix its scalability so that you don't have to worry about it?
You tend to split repositories based on team responsibilities. I doubt that every developer needs access to update all million+ files.
What this comes down to is that they've made certain architecture decisions that ideally would be changed but it's not possible to do so at this time.
Also, as Ævar writes in the first response, "there's only so much you can do about stat-ing 1.3
That's not true:
> It is based on a growth model of two of our current repositories (I.e., it's not a perforce import). We already have some of the easily separable projects in separate repositories, like HPHP. If we could split our largest repos into multiple ones, that would help the scaling
issue. However, the code in those repos is rather interdependent and we believe it'd hurt more than help to split it up, at least for the medium-term future.
They already have multiple repositories; the stats they're showing there are based on "two of [their] current repositories", implying more than two.
Sounds to me like this: http://thedailywtf.com/Articles/Enterprise-Dependency-Big-Ba...
echo 3 | tee /proc/sys/vm/drop_caches
echo 3 > /proc/sys/vm/drop_caches
I'm just wondering if this is an idiom with a deeper meaning that I'm not aware of.
EDIT: I'm guessing that when you run it in a script (without set -x), rather than on the command line, you can see in the log what it is you sent?
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo echo 3 > /proc/sys/vm/drop_caches
sudo sysctl -w vm.drop_caches=3
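For what it's worth, the reason the `sudo echo` form fails is that the `>` redirection is performed by the invoking (unprivileged) shell before `sudo` ever runs, while `tee` opens the file from inside its own process, which is the one running under sudo. A harmless demonstration of who does the write (ordinary temp file, no sudo needed):

```python
import os
import subprocess
import tempfile

# 'sudo echo 3 > f': YOUR shell opens f for writing, then runs
# 'sudo echo' with stdout pointed at it -- only echo is privileged.
# 'echo 3 | sudo tee f': tee itself open()s f, and tee is the process
# under sudo, so the open succeeds even on a root-only file.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "drop_caches")  # stand-in for the /proc file
    subprocess.run(f"echo 3 | tee {target} > /dev/null",
                   shell=True, check=True)
    content = open(target).read().strip()
print(content)   # -> 3
```

The `sysctl -w` form sidesteps the issue entirely, since the single `sysctl` process both runs under sudo and performs the write itself.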
MSFT with Windows codebase that runs out of several labs. Crazy branching and merging infrastructure. They use source-depot, originally a clone of perforce.
Google with all their source code in one Perforce repo.
Facebook will be on perforce before we know it.
The solution is an internal Github, not one giant project.
* Single rooted tree. Separated repositories would make it harder to share code, leading to more duplication.
* Build from head. We build everything from source, statically linked. No need to worry about multiple versions of dependencies, no lag between a bug fix and it being available to any and all binaries that need it, whenever they're next updated.
I don't think that an "internal Github" is going be a magic bullet here. It's more likely it would be a matter of trading one set of hard problems for another, as we all of a sudden need to figure out how to do cross-dependencies sanely, deal with multiple versions of libraries, etc, at scale. You are correct that one monolithic Perforce repo is a bit of a pain point, but that doesn't necessarily mean that the right decision is to shatter our codebase into different pieces - we'd rather make our repo scale better. For reference, we've already got hundreds of millions of lines of code, 20+ changes/minute 6 months ago (so what, 30+ now?), and plans for scaling the next 10x are in motion.
If you're interested, I recommend http://google-engtools.blogspot.com/. It details a number of the problems we've run into, and our solutions for dealing with them at scale.
I'm not convinced that the difference between a singly rooted tree and a multiple-rooted tree is going to make that much difference. I mean, think about it... if you have 100k's or even millions of files, is anybody going to parse through all of that, looking for a reusable function, even if it is on their workstation?
And sure a compiled language would catch naming collisions on functions or whatever, but nothing stops somebody from creating a method
doQuickSort( ... )
and somebody else creating
where they are semantically equivalent (or very nearly so).
It seems to me that the problem of duplicating code, because you don't know that a method already exists to do what you're trying to do, is the same problem regardless of how your tree is laid out; and is ultimately more of a documentation / process / discipline issue. But I'd be curious to hear the counter-argument to that...
Yes, in fact. We have some great tools that give us full search over our entire codebase (think Google Code Search), and you can add a dependency on a piece of code without needing to have it on your workstation already. The magic filesystem our build tools use knows where to get it and can do so on demand. Combined with good code location conventions, an overall attitude that promotes reuse over rewrites* and mandatory code reviews where someone can suggest a better approach, we do a pretty good job. Not everything is eliminated, of course, but I'm pretty happy with the state of things.
To your example, we'd use the STL for most of our sorting needs, but if you were to want, say, case-insensitive string sorting, I can tell you where to find it (ASCII, UTF8, or other). If you want a random number, any RNG you could want is available. Most data structures you could name have been written and tested already. Libraries for controlling how your binaries dump core, how command line flags are parsed, how callbacks are invoked, etc etc are readily available. We really do reuse code as much as possible, and it's wonderful to have ready access to all of this whenever you need it.
*At a method level, anyways...we're famous for writing ever more file systems ;).
I've seen that before myself at other companies, and it's a shame. A healthy codebase is an investment in the future - if you're not taking the time to cultivate it you're sacrificing long term usability for short term gains. The larger the codebase the more difficult the task, of course, but for us that's just an excuse to solve the next hard problem :).
One more good link on the topic: our use of Clang to find and fix bugs in our existing codebase, as we find new classes of 'gotchas'. http://google-engtools.blogspot.com/2011/05/c-at-google-here...
Hopefully the methodology will filter out into the wider world one day... Anyway, thank you for posting it!
I agree btw, the Github mindset is the best one: create a new repo for every project and connect them with build tools. But then again, why not hire 100 SOA consultants? They have enough money now.
Are you sure about that?
It's not wise to suggest that Facebook split the repository. It looks like it's time to improve the tool instead.
Ideally they should have many small, manageable repositories that are well tested and owned by a specific group/person/whatever. At least something small enough a single dev or team can get their head around.
There is no simple answer. There is only optimization for a particular problem-set you are trying to minimize.
I don't see what this has to do with a discussion of one repo vs multiple repos.
You think that in a multi repo world, the engineers aren't as aware of what code exists and where as they are in a single repo world? You think that code duplication and needing to read docs magically doesn't exist in a single repo world?
The number of repositories is just an organizational construct. Communication still must take place no matter what.
> factor into modules, one project per repo
Where I work we have a project with clear module boundaries, but all in the same repo. We have an "app" and some dependencies, including our platform/web framework. None of these are stable; they're all growing together. Commits on the app require changes in the platform, and in code review it is helpful to see things all together. Porting commits across different branches requires porting both the application change and the dependent platform changes. Often a client-specific branch will require severe one-off changes, so the platform may diverge - it is not practical for us (right now) to continually rebase client branches onto the latest platform.
This is just our experience, not Facebook's, but let's face it: real-life software isn't black and white, and discussion that doesn't acknowledge this isn't particularly helpful.
We've got a superproject with our server configs, and sub projects for our background processing, API, and web-frontend respectively.
Often, each project can evolve and be versioned 100% independently. However, often you need to modify multiple projects and (especially with server config changes) coordinate changes via the super project.
It's a little hairy sometimes and often feels like unnecessary overhead, but the mental boundary is extremely valuable on its own. Being able to add a field to the API and check that commit into the superproject for deployment before the front end features are done is nice. The social impact on implementation modularity is valuable. We write better factored code by letting Git establish a more concrete boundary for us.
Git has so many interesting uses at scale, as just a tool that navigates and tracks DAGs over time.
I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.
There are good reasons to keep code in one repository; particularly, git's submodule support has a number of nasty interface tradeoffs; I wouldn't say it breaks git, but you have to keep a clear understanding of all your submodules in your head when you have a lot of them.
OK, it pretty much breaks git to have submodules that are interdependent. I know this because I am currently moving one of my organizations off this exact plan -- it's the opposite of useful and speedy to have to worry about versions across a large number of backend / frontend repositories.
It is MUCH easier and therefore better for developers to put them together, and release together.
"We can build a binary that is more than 1GB (after stripping debug information) in about 15 min, with the help of distcc. Although faster compilation does not directly contribute to run-time efficiency, it helps make the deployment process better."
It's how HPHP works.
Now, if it's going to end up OSS, that's a different question. (I'm not implying it's not - I'm saying that's a decision that could go either way)
A good solution will benefit everyone who uses git. Codebases get larger over time. There is more forking and experimentation. More spoken languages can be supported. More computer languages can be interfaced. The O(n) operations becoming less than that will benefit you in the future as your code grows.
If they provided money, I'd provide the time in order to produce a good result. See the problem?
More to the point: FB is all take and no give, as near as I can tell.
Looks like a fairly long list to me.
Since "status" and "commit" perform fairly well after the OS file cache has been warmed up, that probably can be resolved by having background processes that keep it warm. (Also, how long would it take to just simply stat that number of files? )
The issue of "blame" still taking over 10 minutes: We need to know far back in the repository they're searching. What happen if there's one line that hasn't been changed since the initial commit? Are you being forced to go back to through the whole commit history?
How old is the repository? Years? Months? I'd guess at least years, based on the number of commits (unless the developers are extremely commit-happy).
At a certain point, you're going to be better off taking the tip revisions off a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool you're using (I've been the architect and admin on the implementation of a number of commercial tools). Keep the old repository live for a while and then archive it.
You'll find that while everyone wants to say that they absolutely need the full revision history of every project, you rarely go back very far (aka the last major release or two). And if you do need that history, you can pull it from the archives.
They've reached out to the developers on git, and I guess that's a first step.
Also, it doesn't only include the current file set - it includes files that have been deleted, been split into modular files, been merged, been wholesale rewritten, or been moved into a new hierarchy (some VCS systems handle this better than others).
(I work at Facebook, but not on the team looking into this stuff. I'm a happy user of their systems though. Keep in mind that the 1.3 million file repo is a synthetic test, not reality.)
The follow-up email still mentions a working directory of 9.5GB. I cannot fathom working on a code repository consisting of 9.5GB of text. There must be something else going on here, even considering any peripheral projects like the iOS and Android apps, etc.
(edit: if there are huge generated files intermingled with code, shouldn't those be hosted on a "pre-generated cache" web server instead of git, for example?)
It is useful to keep in mind that Facebook isn't just the front-end (and isn't just code, also images, configuration, and so forth).
Just talking about open source stuff, Facebook also generates code like Cassandra, Hive (data warehousing application), Phabricator (a code review and lifecycle tool), HipHop for PHP (the translator/compiler, the interpreter, and the virtual machine), FlashCache (a kernel driver), Thrift, Scribe, and so forth.
We also have had to build applications to support our operations, so think about what sort of effort goes into building scalable monitoring, configuration management, automatic remediation, logging infrastructure, and so forth.
I don't know the actual lines of code across it all, and wouldn't mention it if I did, but people often underestimate the scale here.
If a project depends on other projects, have it reference the other projects. Where appropriate, include exact version numbers and/or commit hashes. Gemfiles are good examples of this good practice at work.
Yes, git has submodules for this sort of thing, but after investigating that route, I decided against using git submodules. Use something independent of the VCS instead. Then git won't do weird or unexpected things when you switch branches. Also, you might want to mix in projects that use other version control systems. And really, why unnecessarily couple a project to its version control system?
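One VCS-independent way to do this (a hypothetical sketch, not any particular tool; the manifest format and field names are invented) is a plain manifest that pins each dependency to an exact version or commit hash, which a checkout script can consume regardless of whether each project lives in git, Mercurial, or something else.

```python
# Hypothetical VCS-independent dependency manifest: pin each project to an
# exact commit hash or version instead of relying on git submodules.
import json

MANIFEST = """
{
  "deps": [
    {"name": "platform", "vcs": "git",
     "url": "https://example.com/platform.git",
     "pin": "9fceb02d0ae598e95dc970b74767f19372d61af8"},
    {"name": "webfront", "vcs": "hg",
     "url": "https://example.com/webfront",
     "pin": "1.4.2"}
  ]
}
"""

def load_pins(text):
    """Parse the manifest and return {name: (vcs, url, pin)}."""
    data = json.loads(text)
    return {d["name"]: (d["vcs"], d["url"], d["pin"]) for d in data["deps"]}

pins = load_pins(MANIFEST)
print(pins["platform"][2])  # the exact commit a fetch script would check out
```

Because the pins live in an ordinary file under version control, switching branches just changes the file; nothing VCS-specific happens to your working tree, and mixed-VCS dependencies are unremarkable.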
If (when?), even after splitting a megaproject into manageable subprojects, these performance issues creep in, I'd certainly be interested in whatever improvements people are coming up with...
I'm still amazed by the claim that breaking it up would do more harm than good when the code isn't even written yet...
Putting aside the question of whether or not an enormous singular repo can be broken up intelligently into modular projects, is there something about the submodule approach that makes it a uniquely unsuitable way for sharing changes amongst projects?
I can certainly see why you would have the latter propagated instantaneously, or close to it.
There's also the point that if you don't propagate change to everybody at the same time, you'll have dozens of slightly different versions of those projects across your company. The question of submodules vs. large repo is not as easily decided as you think - there are large upsides (and downsides) to both approaches.
Say you have two branches: master and next. Stable work goes in master, unstable work goes in next. When the code is ready for consumption, you merge it into master.
Anyone who is using a project has it set up as a submodule. They add post/pre-commit hooks to update all project submodules. These submodules pull from master.
This way, everyone will get all stable changes on all submodule projects at the time of the next change to their own project.
What's more, my team and I are not the only ones working on libA. Others are too. So keeping changes in 'next' and pushing to master only on occasion doesn't help much. Yes, it keeps non-working patches out - but that's what local branches on your machine are for.
(I'm not even going to mention the issue of merge conflicts. If you work at a massive scale, the longer you stay on a branch, the more likely you are to hit a merge conflict. You can easily end up in a weeks-long merge hell: pull from master, resolve conflicts, run local tests - oh wait, master has already been updated by somebody else.)
Amazon and Google have already solved this problem, and the solution is to reorganize things into smaller manageable packages.