
Facebook hit git performance issue on large repository - kaeso
http://thread.gmane.org/gmane.comp.version-control.git/189776
======
bos
Facebook engineer here, working on this problem with Joshua.

What this comes down to is that git uses a lot of essentially O(n) data
structures, and when n gets big, that can be painful.

A few examples:

* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.

* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O.

An inotify daemon could help, but it's not perfect: it needs a long time to
warm up in the case of a reboot or crash. Also, inotify is an incredibly
tricky interface to use efficiently and reliably. (I wrote the inotify support
in Mercurial, FWIW.)

* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), _and_ the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).

None of these problems is insurmountable, but neither is any of them amenable
to an easy solution. (And no, "split up the tree" is not an easy solution.)
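
If anyone wants to poke at these themselves, here's a rough sketch of how you might observe each cost on a big checkout (standard Linux and git tooling; the path is just a placeholder, and exact numbers will obviously vary):

    
    
      # count the stat-family syscalls "git status" makes (the lstat flood)
      strace -f -c -e trace=lstat,stat git status
      # the index: read in full, and rewritten in full whenever it's touched
      ls -lh .git/index
      # no path->commit index: this walks history to find commits touching one file
      time git log --oneline -- some/deep/path.php | wc -l
    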

~~~
groby_b
_An inotify daemon could help, but it's not perfect: it needs a long time to
warm up in the case of a reboot or crash_

So does, presumably, the cache when you use lstat. (Let's scratch "presumably" -
it does. Bonus points if you can't use Linux and have to use an OS that seems to
drop its caches as soon as possible.)

I hope I'm wrong, but the proper solution to this seems to be a custom file
system - not only would it allow you to more easily obtain a "modified since"
list of files, it would also allow you to fetch local files only "on demand".
(E.g. <http://google-engtools.blogspot.com/2011/06/build-in-cloud-accessing-source-code.html>)

That _still_ doesn't solve the data structure issues in git, but at least it
takes some of the insane amount of I/O off the table.

I'm looking forward to seeing what you guys cook up :)

~~~
caf
You might be able to do the "custom file system" as a pass-through FUSE
filesystem.

~~~
patangay
Yeah, a FUSE filesystem is also being considered as one of the possible
solutions.

------
lbrandy
Wow. I was expecting an interesting discussion. I was disappointed. Apparently
the consensus on hacker news is that there exists a repository size N above
which the benefits of splitting the repo _always_ outweigh the negatives. And,
if that wasn't absurd enough, we've decided that git can already handle N and
the repository in question is clearly above N. And I guess all along we'll
ignore the many massive organizations that cannot and will not use git for
precisely this reason.

So instead of (potentially very enlightening conversation) identifying and
talking about limitations and possible solutions in git, we've decided that
anyone who can't use git because of its perf issues is "doing it wrong".

~~~
kinofcain
Your comment was at the top so I continued to read expecting to find a bunch
of ignorant groupthink about how git is awesome and Facebook is dumb, but
that's not really what's going on down below.

I don't know what facebook's use case is, so I have no idea if their
repositories are optimally structured. However, I've used git on a very large
repository and ran into some of the same performance issues that they did (30+
seconds to run git status), so I don't think it's terribly hard to imagine
they're in a similar situation.

What we did to solve it is exactly what you're excoriating the people below
for suggesting: we split the repos and used other tools to manage multiple git
repos, 'Repo' in some situations, git submodules in others.

However, we moved to that workflow mainly because it had a number of other
advantages, not just because it made day-to-day git operations faster.

I hope git gets faster, some of the performance problems described are things
we saw too, but things are always more complicated and I see nothing below
that looks like the knee-jerk ignorant consensus you're describing.

Sometimes the answer to "it hurts when I do this" is "don't do that... because
there's other ways to solve the same issue that work better for a number of
other reasons and we haven't bothered fixing that particular one because most
of the time the other way works better anyway."

~~~
wisty
On a similar note, I've heard of people who would hit the size limit on
Fortran files, so they put every variable into a function call to the next
file, which itself contained one function and a function call to the next file
after that (if necessary).

Making stuff modular is often a good idea.

~~~
LearnYouALisp
Cellular, modular, and interactive-odular!

------
jrockway
Yes, it's well known that big companies with big continuously integrated
codebases don't manage the entire codebase with Git. It's slow, and splitting
repositories means you can't have company-wide atomic commits. It's convenient
to have a bunch of separate projects that share no state or code, but also
wasteful.

So often, the tool used to manage the central repository, which needs to
cleanly handle a large codebase, is different from the tool developers use for
day-to-day work, which only needs to handle a small subset. At Google,
everything is in Perforce, but since I personally need only four or five
projects from Perforce for my work, I mirror that to git and interact with git
on a day-to-day basis. This model seems to scale fairly well; Google has a big
codebase with a lot of reuse, but all my git operations execute
instantaneously.

Many projects can "shard" their code across repositories, but this is usually
an unhappy compromise.

People always use the Linux kernel as an example of a big project, but even as
open source projects go, it's pretty tiny. Compare the entire CPAN to Linux,
for example. It's nice that I can update CPAN modules one at a time, but it
would be nicer if I could fix a bug in my module and all modules that depend
on it in one commit. But I can't, because CPAN is sharded across many
different developers and repositories. This makes working on one module fast
but working on a subset of modules impossible.

So really, Facebook is not being ridiculous here. Many companies have the same
problem and decide not to handle it at all. Facebook realizes they want great
developer tools _and_ continuous integration across all their projects. And
Git just doesn't work for that.

~~~
Splines
_At Google, everything is in Perforce, but since I personally need only four
or five projects from Perforce for my work, I mirror that to git and interact
with git on a day-to-day basis._

At MS we also use Perforce (aka Source Depot), and I've toyed with the idea of
doing something similar. Have you found any guides for "gotchas" or care to
share what you've learned going this route?

~~~
jrockway
I used git-p4 at my last job, and the only thing that ever got weird was p4
branches. At Google we have an internal tool that's similar to git-p4, and it
always works perfectly for me. Enough developers are using it such that most
of the internal tools understand that a working copy could be a git repository
instead of a p4 client.
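
For anyone curious, the basic git-p4 loop is roughly this (just a sketch - the depot path is a placeholder, and your local conventions will differ):

    
    
      git p4 clone //depot/myproject@all myproject   # one-time import of the Perforce history
      cd myproject
      git p4 sync      # fetch new Perforce changes into refs/remotes/p4/master
      git p4 rebase    # replay your local commits on top of them
      git p4 submit    # turn your local commits back into Perforce changelists
    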

So if you're planning on doing this at your own company, my advice is to write
your own scripts that make whatever conventions you have automatic, and to
move everyone over at the same time. That way, you won't be the weird one
whose stuff is always broken.

I think most people got burned by cvs2svn and git-svn and think that using two
version control systems at once is intrinsically broken. It's not. svn was
just too weird to translate to or from. (People that skipped svn and went
right from cvs to git had almost no problems, I'm told.)

~~~
Robin_Message
Eric Raymond talks about the problems of converting svn repos to git and is
promising a new release of reposurgeon soon that handles svn well.
<http://esr.ibiblio.org/?p=4071>

------
ramanujan
This looks like it could be of assistance:

<http://source.android.com/source/version-control.html>

    
    
      Repo is a repository management tool that we built on top 
      of Git. Repo unifies the many Git repositories when 
      necessary, does the uploads to our revision control 
      system, and automates parts of the Android development 
      workflow. Repo is not meant to replace Git, only to make 
      it easier to work with Git in the context of Android. The 
      repo command is an executable Python script that you can 
      put anywhere in your path. In working with the Android 
      source files, you will use Repo for across-network 
      operations. For example, with a single Repo command you 
      can download files from multiple repositories into your 
      local working directory.
    

<http://google-opensource.blogspot.com/2008/11/gerrit-and-repo-android-source.html>

    
    
      With approximately 8.5 million lines of code (not 
      including things like the Linux Kernel!), keeping this all 
      in one git tree would've been problematic for a few reasons:
    
      * We want to delineate access control based on location in the tree.
      * We want to be able to make some components replaceable at a later date.
      * We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
      * We don't want our most technical people to spend their time as patch monkeys.
    
      The repo tool uses an XML-based manifest file describing 
      where the upstream repositories are, and how to merge them 
      into a single working checkout. repo will recurse across 
      all the git subtrees and handle uploads, pulls, and other 
      needed items. repo has built-in knowledge of topic 
      branches and makes working with them an essential part of 
      the workflow.
    

Looks like it's worth taking a serious look at this repo script, as it's been
used in production for Android. Might allow splitting into multiple git
repositories for performance while still retaining some of the benefits of a
single repository.
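
The day-to-day commands look roughly like this (a sketch; the manifest URL is Android's public one, and the branch name is made up):

    
    
      repo init -u https://android.googlesource.com/platform/manifest
      repo sync -j8            # clone/update every git repository listed in the manifest
      repo start my-topic .    # start a topic branch in the current project
      repo upload              # push the change to Gerrit for review
    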

~~~
exDM69
> Looks like it's worth taking a serious look at this repo script, as it's
> been used in production for Android. Might allow splitting into multiple git
> repositories for performance while still retaining some of the benefits of a
> single repository.

Stay away from Repo and Gerrit. I use them at work, and they make my life
miserable.

Repo was written years ago, when Git did not have submodules, a feature that
lets you put repositories inside repositories. Git submodules are far superior
to Repo, and allow you to e.g. bisect the history across many repositories.

I'm hoping that Google comes to its senses and starts phasing out Repo in
favor of Git submodules in Android development.

~~~
joelthelion
What about gerrit?

~~~
exDM69
Compared to the code review facilities in GitHub, Gerrit is pretty crappy. It
gets the job done, but the UI and the workflow it forces on you are a bit
annoying.

The worst part of repo + gerrit is that their default workflow is based on
cherry-picking, and they introduce a new concept called the Change-Id. The
Change-Id is basically yet another unique identifier for changes, stored in the
commit message. The intent is that you make a change (a single-commit patch), a
commit-msg hook adds the Change-Id to the commit message, and then you upload
it for review. When you make additions to your change, the previous version
gets overwritten. Gerrit tries to maintain some kind of branching (called
dependencies), but it messes things up when there's more than one person
working on a few changes at the same time.
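
For anyone who hasn't seen it, the usual dance looks something like this (a sketch - host, port, and branch are placeholders):

    
    
      # one-time: install the hook that appends the Change-Id to commit messages
      scp -p -P 29418 user@gerrit.example.com:hooks/commit-msg .git/hooks/
      git commit                              # hook adds "Change-Id: I..." to the message
      git push origin HEAD:refs/for/master    # uploads patch set 1 for review
      # address review comments, then overwrite the same change:
      git commit --amend                      # keep the same Change-Id
      git push origin HEAD:refs/for/master    # becomes patch set 2 of that change
    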

In comparison with GitHub-style work flow where you make a branch with
multiple commits, submit a pull request, get review, add commits, squash and
merge, the repo + gerrit model is awfully constraining.

We might be using an old version of repo and/or gerrit and some of the issues
I've encountered may be improved. However, I think that repo+gerrit is a mess
beyond repair and trying to "fix" it only makes things worse.

Unless you work on Android and are forced to use repo+gerrit because Google
does so, stay out of it.

~~~
Tobu
I'm using it, and the workflow isn't so bad. In fact it is similar to the way
kernel patches are iterated and reviewed, except centralised instead of e-mail
based.

------
losvedir
Huh, fascinating. git was initially created for Linux kernel development,
and I haven't heard of any issues there. Offhand I would have said that, as a
codebase, the Linux kernel would be larger and more complex than Facebook, but
I don't have a great sense of everything involved in either case.

So what's the story here: do kernel developers put up with longer git times, is
the kernel better organized, is the scope of Facebook even more massive than
the Linux kernel, or is there some inherent design in git that works better for
kernel work than web work?

~~~
marginalboy
It isn't surprising if Facebook has a large, highly coupled code base. Given
their reputation for tight timelines and maverick-advocacy, I'm continually
surprised the thing works at all.

~~~
nbm
I wouldn't say that a large repository implies that the code is highly
coupled. There are advantages to keeping certain code together in a single
repo. Being able to easily discover users of functions of a library, being
able to "transactionally" commit an update to a library (or set of libraries)
and the code that uses it, being able to do code review over changes of code
in various places, being able to discover if someone else has solved this
problem before, and so forth. If you only have your project and its libraries
checked out, you don't serendipitously discover things in other projects.

As mentioned in this talk on how Facebook worked on visualizing
interdependence between modules to drive down coupling at
<https://www.facebook.com/note.php?note_id=10150187460703920> , there are at
least 10k modules with clear dependency information in a front-end repo, and
the situation probably is a lot better now that they have that information-
dense representation to work from (I don't work on the PHP/www side of things,
I spend most of my time in back-end and operations repos).

~~~
SomeCallMeTim
None of what you mention here precludes breaking up the code into many smaller
repositories, and then having them all linked together in one super-
repository.

Then tags at the super-repository level can record the exact state of all
submodules.

It's not about not checking the other modules out; you can make this the
standard behavior, sure. Instead it's about having git manage reasonably sized
blocks of the code base.
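
A minimal sketch of what I mean (repository URLs and names are hypothetical):

    
    
      # superproject that pins each component repo at an exact commit
      git init super && cd super
      git submodule add git@example.com:frontend.git frontend
      git submodule add git@example.com:backend.git backend
      git commit -m "Add component repos as submodules"
      # a tag here records the exact SHA of every submodule
      git tag -a release-2012-01-09 -m "State of all components for this release"
    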

~~~
twinge
Three big problems with a split-up codebase:

1) Instead of doing one large release every week (which facebook does:
<http://www.facebook.com/video/video.php?v=10100259101684977>) you now have
dozens or hundreds of smaller releases, a lot more heterogeneity to test for.

2) If you have inter-dependencies between modules you have to grapple with the
"diamond dependency" problem. Say module A depends on modules B and C, and
suppose that module B also depends on C. However, module B depends on C v2.0
but A depends on C v1.0. If they're all split across repositories it's not
possible to update a core API in an atomic commit.

3) Now you rely on changes being merged "up" to the root and then you have to
merge them "down" to your project. This is one of the reasons Vista was such a
slow-motion train wreck:
<http://moishelettvin.blogspot.com/2006/11/windows-shutdown-crapfest.html> --
kernel changes had to be merged up to the root, then down to the UI, requiring
months of very slow iterations to get it right.

~~~
nbm
Keep in mind that the talk in question is about the web site (and some
other stuff) going into production; as mentioned in the talk, this is
done more than once a week, and the whole shebang can be pushed in some number
of minutes (I forget the exact figure).

Back-end services have their own release schedules and times, and obviously
are made to be highly backward compatible so that they don't need to be done
in lock-step with the front-end.

I think you're right about the "diamond dependency" problem, but I think the
merge-up and merge-down in Vista had more to do with having multiple
independent branches in flight at the same time.

------
yuvadam
While I'd be interested in seeing this issue further unfold, just the prospect
of a 1.3M-file repo gives me the creeps.

I'm not sure what the exact situation at Facebook is with this repository, but
I'm positive that if they had to start with a clean slate, this repo would
easily find itself broken up into at least a dozen different repos.

Not to mention the fact that if _git_ has issues dealing with 1.3M files, I
wonder what other (D)VCS they're thinking of as an alternative that would be
more performant.

~~~
InclinedPlane
A lot of big companies have repos 10 or 100 times that size. With tens of
millions of files, sometimes up to 100 gigs or more of data under source
control.

~~~
brown9-2
True, but don't most places organize one git repo per project, rather than one
for the entirety of the company's source code?

~~~
luser001
Most people like that use Perforce (e.g., Google).

And no, they don't split into multiple repos, they might well have the entire
company's source code in a _single_ repository (code sharing is way easier
this way).

~~~
zoips
That's a pretty terrible way to share code. Simple example: I work on a
project, write some code. Turns out that code is useful for someone else, so
they reach in and include it, which is easy since it's all the same repo and
all delivered to the build system. Now I make a change and break some customer
I didn't even know existed. Oops.

This is what package systems are for.

------
sek
<http://thread.gmane.org/gmane.comp.version-control.git/189776>

They keep every project in a single repo, mystery solved.

Edit:

> We already have some of the easily separable projects in separate
> repositories, like HPHP.

Yeah, because it makes no sense to keep that in the same repo - it's C++. I
assume they use PHP for everything else, then. Is there no good build
management tool for it?

~~~
fragsworth
> They keep every project in a single repo, mystery solved.

This kind of "Duh, look what you're doing" response isn't really justified.

Sure, splitting up your repository would make things faster, but having to
maintain multiple repositories is a major headache for the end-users of git.
If it's possible, why not fix its scalability so that you don't have to worry
about it?

~~~
lnguyen
I'm pretty sure that cloning a repository of that size can't be all that fun.

You tend to split repositories based on team responsibilities. I doubt that
every developer needs access to update all million+ files.

What this comes down to is that they've made certain architecture decisions
that ideally would be changed but it's not possible to do so at this time.

------
julian37
Somewhat off-topic, could somebody explain why

    
    
      echo 3 | tee /proc/sys/vm/drop_caches
    

rather than just

    
    
      echo 3 > /proc/sys/vm/drop_caches
    

Is it because the output to stdout lets you be extra sure that the right data
was sent to the kernel?

I'm just wondering if this is an idiom with a deeper meaning that I'm not
aware of.

EDIT: I'm guessing that when you run it in a script (without set -x), rather
than on the command line, you can see in the log what it is you sent?

~~~
pdw
Because you can

    
    
        echo 3 | sudo tee /proc/sys/vm/drop_caches
    

but

    
    
        sudo echo 3 > /proc/sys/vm/drop_caches
    

won't work - the redirection is performed by your (unprivileged) shell before
sudo ever runs.
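
(The other common workaround, if you don't want anything echoed to stdout, is to run the redirecting shell itself as root:)

    
    
        sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    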

~~~
julian37
Right, that's what I was missing. That makes sense, and come to think of it,
it, is a very useful addition to my toolbox. Thanks pdw and jochu!

------
dblock
Others have tried this and keep throwing more and more smart people at a
problem they just shouldn't have.

MSFT, with the Windows codebase that runs out of several labs. Crazy branching
and merging infrastructure. They use Source Depot, originally a clone of
Perforce.

Google, with all their source code in one Perforce repo.

Facebook will be on Perforce before we know it.

The solution is an internal GitHub, not one giant project.

~~~
sek
Google has everything in one Perforce repo? You mean just the search engine,
right?

I agree, btw - the GitHub mindset is the best one: create a new repo for every
project and connect them with build tools. But then, why not hire 100 SOA
consultants; they have enough money now.

~~~
mikeocool
No, literally the entire codebase for all of their products is in one Perforce
repo. Ashish Kumar, manager of the Engineering Tools team, mentions it in this
presentation: <http://www.infoq.com/presentations/Development-at-Google>

~~~
rachelbythebay
The kernel? Android? Some other spooky stuff involving the pest control guy
who's holding a big rubber mallet when you fail a unit test?

Are you sure about that?

~~~
nostrademons
Kernel/Android/Chrome/basically anything open-source is different. If the code
is going to be open-sourced, it can't have dependencies on proprietary code
anyway.

~~~
rachelbythebay
Right, so "literally the entire codebase for all their products" is incorrect.
Thanks.

~~~
jrockway
The open-source stuff is a rounding error. Think about all the Google
products; Search, Google+, Gmail, Groups, Translate, Maps, Docs, Calendar,
Checkout, Wallet, Voice, ... those are all in one repository. (Not to mention
all the libraries and internal tools; those are all in there too.)

~~~
nostrademons
To be fair, Android and Chrome are pretty huge projects. I know the numbers
(though I don't think I can share them outside of Google), and while they're
nowhere close to being a big part of the total, they're also big enough to not
be considered a rounding error.

------
gokhan
Large repos bring their own problems and lead to certain design decisions
accordingly. For example, Visual Studio itself is 5M+ files, and this affected
some of the initial design decisions (server-side workspaces, in this case)
when developing TFS 2005 (the first version) [1]. That decision suits MS well,
but not small to medium clients. So they're now supplementing that design with
client-side workspaces.

It's not wise to tell Facebook to split the repository. Looks like it's time
to improve the tool.

[1] <http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-control-model-enhancements-in-tfs-11.aspx>

------
iamleppert
I can believe this, having worked with a former Facebook employee. They do not
believe in separating or distilling anything into separate repos. Why the fuck
would you want to have a 15GB repo?

Ideally they should have many small, manageable repositories that are well
tested and owned by a specific group/person/whatever. At least something small
enough that a single dev or team can get their head around it.

Sheesh.

~~~
SoftwareMaven
And then each of those dev teams can spend 1/2 their time writing code other
people in the company have already written _or_ every team can spend 1/2 their
time publishing and reading documentation about what has been written.

There is no simple answer. There is only optimization for a particular
problem-set you are trying to minimize.

~~~
brown9-2
> And then each of those dev teams can spend 1/2 their time writing code other
> people in the company have already written or every team can spend 1/2 their
> time publishing and reading documentation about what has been written.

I don't see what this has to do with a discussion of one repo vs multiple
repos.

You think that in a multi repo world, the engineers aren't as aware of what
code exists and where as they are in a single repo world? You think that code
duplication and needing to read docs magically doesn't exist in a single repo
world?

The number of repositories is just an organizational construct. Communication
still must take place no matter what.

------
dustingetz
the obvious answer, repeatedly mentioned in comments:

> factor into modules, one project per repo

where i work we have a project with clear module boundaries, but all in the
same repo. we have an "app" and some dependencies including our platform/web
framework. none of these are stable, they're all growing together. Commits on
the app require changes in the platform, and in code review it is helpful to
see things all together. Porting commits across different branches requires
porting both the application change and the dependent platform changes. Often
a client-specific branch will require severe one-off changes so the platform
may diverge -- it is not practical for us (right now) to continually rebase
client branches onto the latest platform.

this is just our experience, not facebook's, but let's face it: real life
software isn't black and white, and discussion that doesn't acknowledge this
isn't particularly helpful.

~~~
snprbob86
We've experienced this.

We've got a superproject with our server configs, and sub projects for our
background processing, API, and web-frontend respectively.

Often, each project can evolve and be versioned 100% independently. However,
often you need to modify multiple projects and (especially with server config
changes) coordinate changes via the super project.

It's a little hairy sometimes and often feels like unnecessary overhead, but
the mental boundary is extremely valuable on its own. Being able to add a
field to the API and check that commit into the superproject for deployment
before the front-end features are done is nice. The social impact on
implementation modularity is valuable. We write better-factored code by
letting Git establish a more concrete boundary for us.

------
djtriptych
I hope these guys do take the route of developing a patch that makes git
performant at this scale.

Git has so many interesting uses at scale, as just a tool that navigates and
tracks DAGs over time.

------
courtewing
This was actually pretty fascinating to me. On one hand, I am astonished at
how long it takes to perform seemingly trivial git operations on repositories
at this scale. On the other hand, I'm utterly mystified that a company like
Facebook has such monolithic repositories. Even back when I was using SVN a
lot, I relied on externals and such to break up large projects into their
smaller service-level components.

I'd be very interested to see some benchmarks on their current VCS solution
for repositories of this scale.

~~~
wmf
Given that Facebook is compiled into a single 1 GB executable, a git repo with
1.3 M files doesn't really surprise me.

~~~
jlarocco
What? Do you have a reference for that?

~~~
huytoan_pc
Here you go: <http://www.facebook.com/note.php?note_id=10150121348198920>

"We can build a binary that is more than 1GB (after stripping debug
information) in about 15 min, with the help of distcc. Although faster
compilation does not directly contribute to run-time efficiency, it helps make
the deployment process better."

------
jpdoctor
$100B company, maybe they can afford to put some people onto solving this for
the open software community (and put the solution into the open), especially
since nobody else in the community seems to have this problem.

~~~
rogerbinns
If you proposed a good solution, I'm sure they'd be happy to provide time and
money and open source the result. But most of the responses don't even propose
a solution - they say to split the repository into smaller pieces and spend
time and money internally having their developers deal with that.

A good solution will benefit everyone who uses git. Codebases get larger over
time. There is more forking and experimentation. More spoken languages can be
supported. More computer languages can be interfaced. The O(n) operations
becoming less than that will benefit you in the future as your code grows.

~~~
jpdoctor
> _If you proposed a good solution I'm sure they'd be happy to provide time
> and money and open source the result._

If they provided money, I'd provide the time in order to produce a good
result. See the problem?

More to the point: FB is all take and no give, as near as I can tell.

~~~
karlshea
<https://developers.facebook.com/opensource/>

Looks like a fairly long list to me.

------
redstone
This is Joshua (who posted the original email). I'm glad to see so much
interest in source control scalability. If there are others who have ever
contemplated investing a bit of time to improving git, it'd be great to
coordinate and see what makes sense to do - even if it turns out that the
right answer is just to make the tools that manage multiple repos so good that
it feels as easy as a single repo.

------
lnguyen
There are two issues: the width of the repository (number of files) and the
depth (the number of commits).

Since "status" and "commit" perform fairly well after the OS file cache has
been warmed up, that can probably be resolved by having background processes
that keep it warm. (Also, how long would it take to simply stat that
number of files?)
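
(A quick way to answer that question, for what it's worth - just a sketch, run from the top of the working copy:)

    
    
      sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # cold-cache case
      time git ls-files -z | xargs -0 stat > /dev/null     # stat every tracked file once
    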

The issue of "blame" still taking over 10 minutes: We need to know far back in
the repository they're searching. What happen if there's one line that hasn't
been changed since the initial commit? Are you being forced to go back to
through the whole commit history?

How old is the repository? Years? Months? I'm guessing it's in the
at-least-years range, based on the number of commits (unless the developers are
extremely commit-happy).

At a certain point, you're going to be better off taking the tip revisions off
a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool
you're using (I've been the architect and admin on the implementation of a
number of commercial tools). Keep the old repository live for a while and then
archive it.

You'll find that while everyone wants to say that they absolutely need the
full revision history of every project, you rarely go back very far (usually no
further than the last major release or two). And if you do need that history, you can pull it
from the archives.

------
pwpwp
Git was designed for the Linux kernel, and it's simply not big: a couple
thousand files, broken up into directories of dozens or hundreds of files.

<http://www.schoenitzer.de/lks/lks_en.html#new_files>

------
teyc
This is an interesting social AND technical problem. The problem for FB is
that it is all too easy for them to just fork git, create the necessary
interfaces, and then hope the git maintainers would accept it (they mightn't)
or release it into the wild (and incur bad karma and the wrath of open-source
developers who'd see this as schism or even heresy).

They've reached out to the git developers, and I guess that's a first step.

------
dpcx
I don't want to imagine the _actual_ kind of code that requires 1.3M files to
run.

~~~
nbm
Keep in mind that your average repository doesn't only contain code that is
compiled and executed (or interpreted), there is also documentation, static
assets such as images (that may be processed), configuration, computed files
(that may make sense to pre-compute once rather than compute on a hundred
people's environments every build), and so forth.

Also, it doesn't only include the current file set - it includes files that
have been deleted, split into modular files, merged, wholesale rewritten, or
moved into a new hierarchy (some VCS systems handle this better than others).

(I work at Facebook, but not on the team looking into this stuff. I'm a happy
user of their systems though. Keep in mind that the 1.3 million file repo is a
synthetic test, not reality.)

~~~
0x0
_Also, it doesn't only include the current file set - it includes files that
have been deleted, split into modular files, merged, wholesale rewritten, or
moved into a new hierarchy (some VCS systems handle this better than others)._

The follow-up email still mentions a working directory of 9.5gb. I cannot
fathom working on a code repository consisting of 9.5gb of text. There must be
something else going on here, even considering any peripheral projects like
the iOS and android apps, etc.

(edit: if there are huge generated files intermingled with code, shouldn't
those be hosted on a "pre-generated cache" web server instead of git, for
example?)

~~~
eropple
Our codebase at my employer currently hovers around 5GB in SVN. Binaries and
other generated code are intermixed for historical reasons. Removing them is a
non-starter due to the amount of time it'd take to do so; the best solution
I've been able to come up with so far is to break out into multiple SVN repos
(one for images, one for generated language files, etc.) and then, hopefully,
get code into Github while externally using the SVN repos for stuff that
shouldn't be versioned in a distributed manner (versioning that stuff is
useful as a convenience - avoiding conflicts, etc.).

------
ctz
I'm surprised Facebook and all its peripheral development has that much
source. I would expect something like 5-10 million lines of code, not the ~100
million lines implied by the example.

~~~
nbm
The example is synthetic, so don't worry too much about the implications.

It is useful to keep in mind that Facebook isn't just the front-end (and isn't
just code, also images, configuration, and so forth).

Just talking about open-source stuff, Facebook has also produced code like
Cassandra, Hive (a data warehousing application), Phabricator (a code review
and lifecycle tool), HipHop for PHP (the translator/compiler, the interpreter,
and the virtual machine), FlashCache (a kernel driver), Thrift, Scribe, and so
forth.

We also have had to build applications to support our operations, so think
about what sort of effort goes into building scalable monitoring,
configuration management, automatic remediation, logging infrastructure, and
so forth.

I don't know the actual lines of code across it all, and wouldn't mention it
if I did, but people often underestimate the scale here.

~~~
moe
And all of that must live in a single repository... because?

~~~
nbm
It doesn't live in a single repository. The commenter I was replying to
mentioned "Facebook and all its peripheral development" and a number of lines
of code. I wanted to give him a little insight into what sort of things all
the peripheral development might include, since it isn't obvious.

------
akg
I don't think Git was designed to perform well with such a large repo. In this
case, the best practice is probably to compartmentalize the code and use Git
submodules. The Git submodule interface is a little unfriendly, but I think
it does work well for such large repos. I've been using submodules
successfully for our development that tracks source files as well as binary
assets.

------
charlieok
I think it's a bad practice to keep a giant code base in one repo. Split the
code base into purpose-specific modules, just as you would split any project
into purpose-specific modules. In fact, those two things might well line up
1:1.

If a project depends on other projects, have it reference the other projects.
Where appropriate, include exact version numbers and/or commit hashes.
Gemfiles are good examples of this good practice at work.

Yes, git has submodules for this sort of thing, but after investigating that
route, I decided against using git submodules. Use something independent of
the VCS instead. Then git won't do weird or unexpected things when you switch
branches. Also, you might want to mix in projects that use other version
control systems. And really, why unnecessarily couple a project to its version
control system?

If (when?), even after splitting a megaproject into manageable subprojects,
these performance issues creep in, I'd certainly be interested in whatever
improvements people are coming up with...

------
loeg
I'm curious what their performance numbers look like if they host the .git
repo on tmpfs -- 15GB isn't unreasonable on a beefy (24-32GB of RAM) machine.
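
Something like this would be the experiment, I suppose (sizes and paths are made up):

    
    
      sudo mount -t tmpfs -o size=20g tmpfs /mnt/ramrepo
      git clone /path/to/huge-repo /mnt/ramrepo/huge-repo
      cd /mnt/ramrepo/huge-repo && time git status
    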

~~~
wmf
Probably the same as the warm cache results, since that's basically what tmpfs
is. I wonder if git does all that stat()ing serially or in parallel, though...
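
(If I remember right, there is a knob for the parallel case - core.preloadindex tells git to lstat the index entries in parallel threads when refreshing. Whether it actually helps is another question:)

    
    
      git config core.preloadindex true
    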

~~~
durin42
I don't have the link handy, but IIRC we did some experiments with that for
Mercurial and found that stat() in parallel didn't really help much unless you
were on NFS or a similarly high-latency network filesystem.

------
railsmax
Hey, do you know of a lot of sites with needs like this? Facebook is the first,
and I could probably count all the sites with such needs on the fingers of one
hand. I don't think this is a git issue - everyone uses the system and is happy
using it. This is more a feature request than a bug.

------
alok-g
Is anything known about how Subversion, Mercurial, Bazaar, and others scale to
such sizes?

------
earino
If your git repo were this crazily large, why wouldn't you make a tool that
took the git repo itself and versioned it? Keep it in repos that can each stay
performant, since most of the time you are working with "time local"
information?

------
MikeOnFire
My first thought, as suggested by some on the list, was modularization.
Redstone's response (that the 1.3 million files are essentially all
interdependent) terrifies me.

------
slashclee
These times are for spinning-platter hard drives. I wonder what the numbers
look like on a modern SSD?

------
djb_hackernews
That's projected growth for two of their projects. Sounds like they have
something brewing...

Still amazed that breaking it up would do more harm than good when the code
isn't even written yet...

~~~
Judson
I read it as them being unable to break up a project, and the repo being a
projection of future commits to a project that can't be split up.

------
DannoHung
Multiple people in this conversation section have asserted that code sharing
is _way_ easier when all the code is in a single _repo_ , but from my
understanding of sub-modules, it would be a fairly simple matter of setting up
your pre/post-commit hooks to update submodules to a branch automatically and
get useful company-wide change atomicity (after all, changes should only
propagate between teams/projects once they have some stability).
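
Roughly the kind of hook I have in mind - a hypothetical sketch, assuming each consumer tracks its dependencies' master branches:

    
    
      #!/bin/sh
      # .git/hooks/pre-commit (sketch): before each commit, move every submodule to
      # the tip of its stable branch and stage the new submodule pointers so they
      # ride along with this commit.
      git submodule foreach --quiet 'git fetch --quiet origin master && git checkout --quiet FETCH_HEAD'
      git config --file .gitmodules --get-regexp '^submodule\..*\.path$' |
        awk '{ print $2 }' |
        while read -r sm; do git add "$sm"; done
    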

Putting aside the question of whether or not an enormous singular repo can be
broken up intelligently into modular projects, is there something about the
submodule approach that makes it a uniquely unsuitable way for sharing changes
amongst projects?

~~~
groby_b
If you limit change propagation, your changes won't propagate as fast. That
goes for bugs _and_ bug fixes.

I can certainly see why you would have the latter propagated instantaneously,
or close to it.

There's also the point that if you _don't_ propagate change to everybody at
the same time, you'll have dozens of slightly different versions of those
projects across your company. The question of submodules vs. large repo is not
as easily decided as you think - there are large upsides (and downsides) to
both approaches.

~~~
DannoHung
No, you're misunderstanding what I'm saying with regards to publishing stable
changes.

Say you have two branches: master and next. Stable work goes in master,
unstable work goes in next. When the code is ready for consumption, you merge
it into master.

Anyone who is using a project has it set up as a submodule. They add post/pre-
commit hooks to update all project submodules. These submodules pull from
master.

This way, everyone will get all stable changes on all submodule projects at
the time of the next change to their own project.

~~~
groby_b
I do understand what you mean just fine. Except it doesn't work that way. If I
have a critical bug fix in libA, I need it rolled out _now_. What's more, I
need it rolled out across all other projects that use libA, immediately. And
no, I don't want to wait until all projects have committed a change of their own.

Even more, I (or my team) are not the only ones working on libA. Others are
too. So keeping changes in 'next' and pushing to master only on occasion
doesn't help much. Yes, it keeps non-working patches out - but that's what
_local_ branches on your machine are for.

(I'm not even going to mention the issue of merge conflicts. If you work on a
massive scale, the longer you stay in a branch, the more likely you are to get
a merge conflict. There's easily the chance to go into a several weeks long
merge hell. Pull from master, resolve conflicts, run local tests - oh wait,
master is already updated by somebody else)

------
xxiao
git is not memory-efficient by design. I once tried to push a commit of about
1GB to the server and it hung forever; I had to abort it and push it in small
chunks instead.

------
r15habh
Does Facebook really believe that because they have the most users, they
should also have the biggest git repo?

Amazon and Google have already solved this problem, and the solution is to
reorganize things into smaller manageable packages.

~~~
r15habh
What I mean is, even if they manage to solve this problem with some tweaks,
they will hit the bottleneck again in a year or two, so they should rethink
their source code management.

