
Facebook's git repo is 54GB - ShaneCurran
https://twitter.com/feross/status/459259593630433280
======
gavinpc
In terms of engineering tradeoffs, this reminds me of a recent talk by Alan
Kay where he says that to build the software of the future, you have to pay
extra to get the hardware of the future today. [1] Joel Spolsky called it
"throwing money at the problem" when, five years ago he got SSD's for
everybody at Fog Creek just to deal with a slow build. [2]

I don't use Facebook, and I'm not suggesting that they're building the
software of the future. But surely someone there is smart enough to know that,
for this decision, time is on their side.

[1]
[https://news.ycombinator.com/item?id=7538063](https://news.ycombinator.com/item?id=7538063)

[2]
[http://www.joelonsoftware.com/items/2009/03/27.html](http://www.joelonsoftware.com/items/2009/03/27.html)

~~~
lnanek2
Facebook tends to throw engineer time at the problem, though. At one Facebook
DevCon I went to, they presented how they wrote their own build system from
scratch because Ant was too slow for them.

~~~
Touche
They built their own build system because once you are dealing with top
engineers, NIH sets in quickly and you write your own everything.

~~~
Joeri
They have their own in-house version of just about every dev tool.
[http://phabricator.org/](http://phabricator.org/)

~~~
SEJeff
FYI, Phabricator originated at Facebook, but the guys who wrote it left and
founded Phacility, which supports it full time now. Facebook did the world a
service by open sourcing Phabricator.

Review Board and Gerrit are both awful in comparison.

~~~
wincent
I've been using Phabricator for a few months now, and I used Gerrit for over 2
years at my last job.

They each have their strengths, but both of them are infinitely preferable to
not doing code review. Neither is awful.

------
rl3
Although this is large for a company that deals mostly in web-based projects,
it's nothing compared to repository sizes in game development.

Usually game assets are in one repository (including compiled binaries) and
code in another. The repository containing the game itself can grow to
hundreds of gigabytes in size due to tracking revision history on art assets
(models, movies, textures, animation data, etc).

I wouldn't doubt that some larger commercial game projects have repository
sizes exceeding 1TB.

~~~
kahoon
But surely they don't use git for that, right? In scenarios like this, a
versioning system that does not track all history locally would be a better
fit.

~~~
Peaker
They could use git-annex[1]?

[1] [https://git-annex.branchable.com/](https://git-annex.branchable.com/)
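
A minimal git-annex workflow, for reference (repo name and file paths are just
illustrative):

    git init game-assets && cd game-assets
    git annex init "artist workstation"
    git annex add textures/hero.psd     # commits a symlink; the content goes into the annex
    git commit -m "add hero texture"
    git annex get textures/hero.psd     # fetch the content from a remote on demand
    git annex drop textures/hero.psd    # free local space; content stays on other remotes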

~~~
yepguy
Strange as it sounds, git-annex doesn't really do file versioning very well.

~~~
asdfaoeu
Are you talking about git-annex assistant or git-annex? git-annex does file
versioning very nicely. Then again, it doesn't work on Windows, so that's
probably not very useful for most game developers.

~~~
eropple
git-annex does file versioning, but it's extremely uncomfortable to use (and I
say this as somebody totally comfortable with git) and I'd never expect an
artist or other only-semi-technical person to use it even if it worked with
Windows. Especially when Subversion or Perforce are _right there_.

------
antimatter
Didn't they switch to Mercurial?

[https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

~~~
joshstrange
Somebody asked that on Twitter and the OP responded with:

>> At least according to the presentation by a Facebook engineer that I just
watched, they're still on git. [0]

[0]
[https://twitter.com/feross/status/459335105853804544](https://twitter.com/feross/status/459335105853804544)

~~~
ableal
You can check out, but you can never leave?

------
VikingCoder
Pay attention to the footnote:

    
    
      *8 GB plus 46 GB .git directory

~~~
danbruc
8 GB is still a lot. It would be interesting to know how much of it is actual
code and how much is just images and so on.
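
If you want to answer that for a repo of your own, here's a rough sketch (the
extension list is just a guess at what counts as "images and so on"):

    git count-objects -vH     # size of the object database
    du -sh .git               # the full .git directory
    # approximate share of the checkout taken up by images
    git ls-files -z | grep -zEi '\.(png|jpe?g|gif|psd)$' | xargs -0 du -ch | tail -1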

~~~
maaaats
The big .git directory probably comes from binary revisions? Is there any good
way around that in git?
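
One partial workaround is a shallow clone, which skips most of the history
(with the usual caveats about operations that need it; the URL is made up):

    git clone --depth 1 git://git.example.com/huge-repo.git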

~~~
spoiler
I'm a bit confused. Whenever I've used git on my projects, I'd make sure the
binaries were excluded using .gitignore.

Don't other people do that, too? What's the benefit of having binaries stored?
I've never needed that; I've never worked on any huge projects, so I might be
missing something crucial.
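
For what it's worth, the usual pattern looks something like this (the globs
are just examples):

    cat >> .gitignore <<'EOF'
    # build output and other binaries
    build/
    *.exe
    *.dll
    *.o
    EOF
    git rm -r --cached build/   # untrack anything already committed by mistake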

~~~
maaaats
Well, it depends. Images, for instance, are binaries where a text diff makes
little sense, so you have a copy of each version of the image ever used. And
many projects use programs whose files are binaries. For instance, I've been
on a project where Flash was used and the files were checked in. Or Photoshop
PSD files, .ai files, etc.

~~~
spoiler
I usually keep a separate repo for PSD/AI files (completely neglected to
consider them as binary files).

As for images, icons, fonts and similar, I just have a build script that
copies/generates them if needed.
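
Such a build step can be as simple as this sketch (paths and script names are
hypothetical):

    #!/bin/sh
    # copy static assets into the build output; regenerate sprites if a generator exists
    rsync -a --delete assets/images/ build/static/images/
    [ -x tools/gen_sprites.sh ] && tools/gen_sprites.sh build/static/sprites/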

I guess I've always been a little bit "obsessed" about the tidiness of my
repositories.

------
general_failure
The worrying point here is the 8GB checkout, as opposed to the history size
itself (46GB). If git is fast enough on an SSD, the history is hardly anything
to worry about.
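
If the 8GB working copy is the concern, git's sparse checkout (around since
1.7) can limit what gets materialized; a sketch, assuming you only need the
www/ subtree:

    git config core.sparseCheckout true
    echo "www/" >> .git/info/sparse-checkout
    git read-tree -mu HEAD    # re-apply HEAD with the sparse rules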

I actually prefer monolithic repos (I realize that the slide posted might be
in jest). I have seen projects struggle with submodules and with splitting up
modules into separate repos. People change something in their module. They
don't test any dependent modules because it's not their problem anymore.
Software in fast-moving companies doesn't work like that. There are always
subtle behavioral dependencies (e.g., one module depends on a bug in another
module, either by mistake or intentionally). I just prefer having all the code
and tests of all modules in one place.

~~~
Touche
How do monolithic repos solve that? Surely people who fix bugs in a library
aren't testing the entirety of Facebook every time (how long would that even
take? Assuming they've even set such a thing up).

~~~
GregorStocks
I used to work at Facebook. They have servers that automatically run a lot of
their test cases on every commit.

------
alayne
FB had previous scaling problems with git, which they discussed in 2012:
[http://comments.gmane.org/gmane.comp.version-
control.git/189...](http://comments.gmane.org/gmane.comp.version-
control.git/189776)

It appears they are now using Mercurial and working on scaling that (also
noted by several others in this discussion):
[https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

------
TheCoreh
I bet most of that size is made up of the various dependencies Facebook
probably has, though I'm still surprised it's that large. I expected the
background worker things, like the facial recognition system for tagging
people and the video re-encoding libs, to be housed in separate repositories.

I also wonder if that size includes a snapshot of a subset of Facebook's
Graph, so that each developer has a "mini-facebook" to work on that's large
enough to be representative of the actual site (so that feed generation and
other functionalities take somewhat the same time to execute.)

~~~
indygreg2
Having all code in a single repository increases developer productivity by
lowering the barrier to change. You can make a single atomic commit in one
repository as opposed to N commits in M repositories. This is much, much
easier than dealing with subrepos, repo sync, etc.

Unified repos scale well up to a certain point before troubles arise. For
example, a fully distributed VCS starts to break down when repositories reach
hundreds of MB and people have slow internet connections. Large projects like
the Linux kernel and Firefox are beyond this point. You also have
implementation details such as Git's repacks and garbage collection that
introduce performance issues. Facebook is an order of magnitude past where
troubles begin. The fact that they control the workstations and can throw fast
disks, CPU, memory, and 1 Gbps+ links at the problem has bought them time.

Facebook made the determination that preserving a unified repository (and thus
preserving developer productivity) was more important than dealing with the
limitation of existing tools. So, they set out to improve one VCS system:
Mercurial ([https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)). They are effectively leveraging the extensibility of Mercurial to
turn it from a fully distributed VCS to one that supports shallow clones
(remotefilelog extension) and can leverage filesystem watching primitives to
make I/O operations fast (hgwatchman) and more. Unlike compiled tools (like
Git), Facebook doesn't have to wait for upstream to accept possibly-
controversial and difficult-to-land enhancements or maintain a forked Git
distribution. They can write Mercurial extensions and monkeypatch the core of
Mercurial (written in Python) to prove out ideas and they can upstream patches
and extensions to benefit everybody. Mercurial is happily accepting their
patches and every Mercurial user is better off because of Facebook.
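
As a concrete illustration, wiring those two extensions in is just
configuration; a sketch, assuming local checkouts of remotefilelog and
hgwatchman (all paths here are placeholders, and the --shallow flag comes from
remotefilelog):

    # enable the extensions in your hgrc
    cat >> ~/.hgrc <<'EOF'
    [extensions]
    remotefilelog = /path/to/remotefilelog
    hgwatchman = /path/to/hgwatchman
    EOF

    # with remotefilelog on the server too, clones can skip file history
    hg clone --shallow ssh://hg.example.com/big-repo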

Furthermore, Mercurial's extensibility makes it a perfect complement to a
tailored and well-oiled development workflow. You can write Mercurial
extensions that provide deep integration with existing tools and systems. See
[http://gregoryszorc.com/blog/2013/11/08/using-mercurial-
to-q...](http://gregoryszorc.com/blog/2013/11/08/using-mercurial-to-query-
mozilla-metadata/). There are many compelling reasons why you would want to
choose Mercurial over other solutions. Those reasons are even more compelling
in corporate environments (such as Facebook) where the network effect of Git +
GitHub (IMO the foremost reason to use Git) doesn't significantly factor into
your decision.

~~~
leccine
Hello there, have you heard of service-oriented architecture? You must be
joking to justify a single repository with "easier to change". Your problem is
that the code base must be tightly coupled if splitting the services out into
different repos is not possible and you need to contribute to multiple
repositories to get something done. I would say the biggest change in Amazon's
architecture was moving over to the service-oriented way, and it was worth the
effort. Developers are forced to separate different functions into separate
services, and they are in charge of each service. If it goes down, they are
the ones getting the alerts. All of the services use 250ms timeouts, so there
is no cascading effect when a service goes down. The web page consists of a
few thousand service calls, and it degrades gracefully. Facebook obviously has
some tech debt that it needs to fix. Using a stupid design justified with some
random crap that does not even make sense is not really acceptable (at least
to me).

~~~
indygreg2
SOA isn't a magic bullet.

What if multiple services are utilizing a shared library? For each service to
be independent in the way I think you are advocating for, you would need
multiple copies of that shared library (either via separate copies in separate
repos or a shared copy via something like subrepos).

Multiple copies lead to copies getting out of sync. You (likely) lose the
ability to perform a single atomic commit. Furthermore, you've increased the
barrier to change (and to move fast) by introducing uncertainty. Are Service X
and Service Y using the latest/greatest version of the library? Why did my
change to this library break Service Z? Oh, it's because Service Z lags 3
versions behind on this library and can't talk with my new version.

Unified repositories help eliminate the sync problem and make a whole class of
problems that are detrimental to productivity and moving fast go away.

Facebook isn't alone in making this decision. I believe Google maintains a
large Perforce repository for the same reasons.

~~~
Crito
> _" What if multiple services are utilizing a shared library? For each
> service to be independent in the way I think you are advocating for, you
> would need multiple copies of that shared library (either via separate
> copies in separate repos or a shared copy via something like subrepos)."_

No, you have a notion of packages in your build system and deployment system.

You want to use the FooWiz framework for your new service BarQuxer? Include
_FooWiz >= 2.0_ as a dependency of your service. The build system will then
fetch a suitable FooWiz package when building your BarQuxer. Another team on
the other side of the company also wants to use FooWiz? They do the exact same
thing. There is never a need for FooWiz to be duplicated; anybody can build
with that package as a dependency.

~~~
indygreg2
I think you are missing the point. Versioning and package management problems
can largely go away when your entire code base is derived from a single repo.
After all, library versioning and packaging are indirections to better solve
common deployment and distribution requirements. These problems don't have to
exist when you control all the endpoints. If you could build and distribute a
1 GB self-contained, statically linked binary, library versioning and packages
become largely irrelevant.

~~~
Crito
I'm telling you how corporations with extremely large codebases and an SOA do
things. The problem you described has been solved as I described.

SOA is beneficial over monolithic development for many other reasons unrelated
to versioning. It just happens to enable saner versioning as one of its
benefits.

------
lightblade
> @readyState would massively enjoy that first clone @feross

The first clone does not have to go over the wire. Part of git's distributed
nature is that you can copy the .git directory to any hard drive and pass it
on to someone else. Then...

> git checkout .
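
git even has first-class support for this; a sketch, with made-up paths:

    # clone from a copy on a local drive (no network; hardlinks objects where possible)
    git clone /mnt/usb/huge.git huge
    # or borrow objects from the local copy while still pointing at the real remote
    git clone --reference /mnt/usb/huge.git ssh://git.example.com/huge.git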

------
slig
Am I missing something, or does this mean a new intern working on a small
feature, for instance, would have access to the entire codebase?

~~~
rl3
It should be possible to restrict each employee's access to specific parts of
the repository. However, I can't really see Facebook doing that.

Everyone having access to everything must be worth the security trade-off. On
the other hand, I suppose it's debatable whether it would be a trade-off at
all.

~~~
ProAm
>Everyone having access to everything must be worth the security trade-off.

I would find this extremely hard to believe, especially at Facebook. At any
software company, your code base is what defines you as a company; there is no
way they'd let the good stuff sneak out like that.

~~~
im3w1l
Let's say you managed to sneak out the code from Facebook. You take the logo,
draw a red cross over it and scribble "ProAmbook" below. You push it live.

Now what? How do you get users? "We are just like Facebook - only your friends
aren't here" probably wouldn't get users excited.

And if you somehow DID manage to get users, don't you think there are
"watermarks" in the code that they could detect and sue you to death with?
~~~
slig
They have anti-spam heuristics, graph heuristics, models for how to serve the
best ad to each user, tons of bugs that can only be discovered by reading the
source, etc.

------
hk__2
Is there a reason why they keep _everything_ in the same repo? Can’t you just
split the code across multiple smaller repos?

~~~
rcxdude
It becomes a lot harder to keep everything in sync, especially if internal
interfaces change frequently. At Facebook scale, though, it's probably a good
idea to define the boundaries between areas of the application better.

~~~
Oompa
You end up with fewer developers having to pull & merge/rebase if you have
things in separate repos.

Individual libraries/dependencies get worked on by themselves, with an API
that other applications use. Then the other apps just bump a version number
and get newer code.

~~~
taeric
The problem with this is that you are assuming the APIs change in some sort of
odd isolation from the parts that use them.

That is, the reason an API changes is because a use site has need of a change.
So, at a minimum, you need to make that change and test it against that site
in a somewhat atomic commit.

Then, if the change has any effect on other uses, you need a good way to test
that change on them at the same time. Otherwise, they will resist pulling this
change until it is fixed.

Add in more than a handful of such use sites, and suddenly things are just
unmanageable in this "manageable" situation.

Not that this is "easy" in a central repo. But at least with the source
dependency, the compiler can flag every place an API change breaks something.

And, true, you can do this with multiple repos, too. But every attempt I have
seen to do that just uses a frighteningly complicated tool to "recreate" what
looks like a single source tree out of many separate ones (jhbuild and
friends).

So, if there is a good tool for doing that, I'd certainly love to hear about
it.
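
For completeness, stock git's answer is submodules, which pin each child repo
to an exact commit (URLs made up); whether that counts as a _good_ tool is
exactly what's in dispute here:

    git submodule add ssh://git.example.com/libfoo.git vendor/libfoo
    git commit -m "pin libfoo"
    # on a fresh clone, materialize the full tree
    git submodule update --init --recursive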

------
com2kid
Meh. I'm working on a comparably small project (~40 developers), and we're
over 16GB.

Mostly because we want a 100% reproducible build environment, so a complete
build environment (compilers + IDE + build system) is all checked into the
repo.

~~~
math0ne
IDE checked into the repo eh? For some reason I kinda like that idea. So
portable... if it works.

------
dnlserrano
Someone recently told me that Facebook had a torrent file that went around the
company that people could use to download the entire codebase with a
BitTorrent client. Is there any truth to this?

I mean, the same guy who told me this also said that the codebase was about 50
times smaller than the one reported in this slide, so it may all be pure
speculation.

~~~
vmarsy
If you're interested in the deployment process at Facebook, look at the link
to a Facebook engineer's paper that I submitted in my other comment in this
thread:
[https://news.ycombinator.com/item?id=7648802](https://news.ycombinator.com/item?id=7648802)

"The deployed executable size is around 1.5 Gbytes, including the Web server
and compiled Facebook application. The code and data propagate to all servers
via BitTorrent, which is configured to minimize global traffic by exploiting
cluster and rack affinity. The time needed to propagate to all the servers is
roughly 20 minutes."

------
NAFV_P

      NAFV_P@DEC-PDP9000:~$ python
      Python 2.7.3 (default, Feb 27 2014, 19:58:35)
      [GCC 4.6.3] on linux2
      Type "help", "copyright", "credits" or "license" for more information
      >>> t=54*2**30
      >>> t
      57982058496
      # let's assume a char is 2mm wide, 500 chars per meter
      >>> t/500.0
      115964116.992 #meters of code
      # assume 80 chars per line, a char is 5mm high, 200 lines per meter
      >>> u=80*200.0
      >>> v=t/u
      >>> v
      3623878.656 # height of code in meters
      # 1000 meters per km
      >>> v/1000.0
      3623.878656 # km of code, it's about 385,000 km from the Earth to the Moon
      >>> from sys import stdout
      >>> stdout.write("that's a hella lotta code\n")

------
ianphughes
I wonder what their branching strategy is like and how merges are gated with a
single codebase of that size?

~~~
indygreg2
They aim for a completely linear history. They may even have a policy of not
allowing merge commits. It is described in various places on the internet. I
like
[https://secure.phabricator.com/book/phabflavor/article/recom...](https://secure.phabricator.com/book/phabflavor/article/recommendations_on_branching/)
because it and its sister articles on code review and revision control are
terrific reads.
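
In plain git terms, that kind of policy is usually enforced with rebases and
fast-forward-only merges; a sketch of the equivalent knobs (not necessarily
what Facebook does):

    git config pull.rebase true      # rebase local work instead of creating merge commits
    git merge --ff-only my-feature   # refuse any merge that isn't a fast-forward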

~~~
natrius
Dear everyone: you should be using Phabricator. It is Facebook's collected
wisdom about software development expressed in software. It has improved my
life substantially. The code review is better than GitHub's, and their linear,
squashed-commit philosophy has worked out much better than the way I used to
do things.

~~~
ianphughes
It looks pretty great. How does it compare to Atlassian products (if you have
used any)?

~~~
andrewjshults
Mixed bag. The code review part is much better than Stash and significantly
better than Crucible. Namely, diff-of-diffs makes reviewing changes based on
comments infinitely easier (especially on large reviews). We installed
Phabricator just for the code review piece initially. Repo browsing is about
on par with Stash, but it doesn't seem to experience the horrific slowdowns
that our Stash server does. We don't use the tasks because a number of
non-engineering roles also use JIRA, and the task functions in Phabricator
don't have nearly the depth of security and workflow options we need.

------
jasonnutter
Relevant: [https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

------
rickr
I thought I had read an article about Facebook switching to Perforce due to
their really large git repo. Were they at least thinking about it?

A quick Google turns up nothing, but I could have SWORN I read that.

~~~
delroth
You are thinking about Mercurial, not Perforce:
[https://code.facebook.com/posts/218678814984400/scaling-
merc...](https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-
facebook/)

------
pekk
Why don't people use multiple git repos for multiple internal projects? One
giant repo seems totally nonsensical and undesirable.

------
korzun
That's actually not that bad for an engineering shop of their size. I would
start archiving metadata at some point.

------
ausjke
Gosh, the last time I checked in 8GB of data I had trouble; git is very memory
hungry when the data set is big and you need to check it all in at once. How
much memory do you need on the server side when you 'git add .' a whole 54GB
repo?

What about a re-index or something; will that take forever?

I worry that at such a size the speed will suffer; I feel git is comfortable
with only a few GBs.

Anyway, it's good to know that 54GB is still usable!

------
negativity
...but 8GB for the actual current version.

How much of it is static resources, like CSS sprite images?

------
Dorian-Marie
They must be storing a lot of images and binary files, I guess.

------
bananas
A company I did a contract for last year has 8MB of (Java) source code and a
52MB SVN repo, and makes £40 million a year out of it...

We're doing something wrong.

------
kevinsf90
I thought they used Mercurial

------
pearjuice
Someone must have forgotten a .gitignore or two.

------
_ak
well, git gc --aggressive --prune=now, duh.

(jk)

------
SnakeDoc
I hope everyone realizes this is not 54GB of code but, in fact, more likely a
very public showing of very poor SCM management. They likely have tons of
binaries in there and many, many full-codebase changes (whitespace, tabs, line
endings, etc). And that's not to mention how much dead code lives in there.

~~~
lnanek2
Honestly, I prefer check-in-everything shops. Otherwise, way too often some
different Java version, or IDE version, or Maven Central being down screws
something up, or you have to wait a long time for a Chef recipe or disk image
to give you the reference version. Half my day today was spent dealing with
someone updating Java on half of our continuous build system's slave computers
and breaking everything, because it didn't have JAVA_HOME and
unlimited-strength encryption all set up properly.

~~~
SnakeDoc
That sounds like either poor sysadmin-ing and/or poor documentation... SCM
should not have "clutter" in it; otherwise you wind up with an all-day
download of 54GB of dead or useless garbage. The kernel's repo is only a few
GBs, and it has MANY more changes and much more history than FB does...

~~~
kyberias
You should realize that the kernel doesn't have images and other assets that
FB might have. And the repo obviously is the right place for them.

~~~
pekk
"The" repo. As if they had no choice but to put everything for every aspect of
the business into ONE giant repo.

Hopefully they actually have some separate sites, separate tools and separate
libraries. Or could understand how to use submodules or something rather than
literally putting everything in one huge repository.

Whether to put images and other assets into git repos is a separate decision.

------
SnakeDoc
Hey Facebook! You're doing it wrong!!!!!

------
coherentpony
So what? This probably means they're versioning data files they shouldn't be.
I feel like this just exists here as a pissing contest.

~~~
pekk
It is supposed to justify the engineering effort they put into switching to
Mercurial and then trying to make it "scale" (rather than just using separate
repositories to begin with, per the design of the tool and best practices).

~~~
SnakeDoc
A.k.a. a pissing match to show "they are too big for any standard industry
tools". Really speaks to the level of (non)expertise employed at FB.

