Facebook's git repo is 54GB (twitter.com)
251 points by ShaneCurran on Apr 25, 2014 | 231 comments



In terms of engineering tradeoffs, this reminds me of a recent talk by Alan Kay where he says that to build the software of the future, you have to pay extra to get the hardware of the future today. [1] Joel Spolsky called it "throwing money at the problem" when, five years ago, he got SSDs for everybody at Fog Creek just to deal with a slow build. [2]

I don't use Facebook, and I'm not suggesting that they're building the software of the future. But surely someone there is smart enough to know that, for this decision, time is on their side.

[1] https://news.ycombinator.com/item?id=7538063

[2] http://www.joelonsoftware.com/items/2009/03/27.html


You should also watch his talk where he questions whether it is really necessary to build large software products from such large codebases that, if printed out, would stack as high as a skyscraper, and then talks about what he is doing to demonstrate that it is not :)

https://news.ycombinator.com/item?id=7538073


Facebook tends to throw engineer time at the problem, though. At one Facebook DevCon I went to, they presented how they completely wrote their own build system because Ant was too slow for them.


They built their own build system because once you are dealing with top engineers NIH sets in quickly and you write your own everything.


They have their own in-house version of just about every dev tool. http://phabricator.org/


FYI phabricator originated at Facebook, but the guys who wrote it left and founded Phacility, which supports it full time now. Facebook did the world a service by open sourcing Phabricator.

Review board and Gerrit are both awful in comparison


I've been using Phabricator for a few months now, and I used Gerrit for over 2 years at my last job.

They each have their strengths, but both of them are infinitely preferable to not doing code review. Neither is awful.


This might be right in spirit, but Phabricator is maintained by Phacility (some ex-FB people) and used by at least 6 other companies outside of FB.


That's one way to keep your talent from being poached :)


To be fair once you are over 1000 engineers, a 1% improvement in productivity is worth a lot of development time.


Somehow Netflix never got that memo and they seem to be doing just fine.


If you are at the point where you are contemplating building your own dev-tool-everythings... then you've gone very wrong. Let the dev-tool guys make dev-tools, they are much better at it (it's all they do). Instead figure out how to make them work in your business environment. Facebook may have one of the largest sites on the web, and perhaps the most users... but their codebase itself is nothing special and does not warrant "special" tools. That's just BS and a measuring contest.


Having used ant as a build system for Android projects, I don't blame them.

In my admittedly limited experience (Windows 7 x64, ant, Android SDK) ant is terribly slow to build projects with multiple source library dependencies and throwing hardware at the problem doesn't speed it up that much.


I don't see how this is an Ant-specific issue. Ant is just calling into javac with a classpath parameter. The actual execution time spent in Ant should be minimal.


With Android, ant makes naive assumptions.

For example, most open source library projects that you include in a project don't change from build to build, but ant dutifully recompiles them each time instead of caching the output until the files in that project are changed or I manually clean the build output.


That surely means Ant is make, only worse.


"Only worse" is probably not fair. Though, is it really surprising that a newer make file degenerates into the same problems as old ones?

Which will then lead down a path of a set of scripts/utilities on top of said system to standardize on a set of targets and deal with known issues. And suddenly we have reinvented another old tool, autotools. We'll probably find ourselves in their troubles soon enough.


There was a link submitted here (I can't find it now) a few weeks ago that talked exactly about that. Most build systems are just reimplementations of make, which makes them worse, because make has been battle-tested for ages.


Amusingly, that story is what convinced me to finally read up on autotools. It has been interesting to see just how much they already covered.

In fact, the only thing the autotools don't do, that every current tool does, is download the dependencies. Which is not surprising, since fast network connections are a relatively new thing.

And I'm actually not sure I care for the download of dependencies thing. It is nice at first, but quickly gets to people pushing another thing to download, instead of making do with what you have. This is plain embarrassing when the downloaded part offers practically nothing on top of what you had.


I'm confused. That doesn't sound like Make's behavior at all.

Make, when used properly, is still a pretty smart tool.
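As a quick illustration of that timestamp-based smartness (a minimal sketch; the file names and the trivial `cp` rule are invented for the example):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A one-rule Makefile: 'out' is rebuilt only when 'in' is newer.
printf 'out: in\n\tcp in out\n' > Makefile
echo hello > in

make        # first run: builds 'out'
make        # second run: nothing to do, 'out' is already up to date
```

The second invocation does no work at all, which is exactly the caching Ant fails to do in the scenario above.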


That's the worst part of Ant.


Doesn't Ant use prebuilt jars for libraries? If not, then maybe you should think about switching to Maven.


At my company, I wrote our own build system, because "make", "waf", "shake" and various others do not give any useful guarantee, and we've debugged cryptic under-specified dependencies way too many times. Make clean on a large repo and lack of automatic work sharing hurt too.

Also, auto detecting inputs rather than being forced to specify them is nice. Especially as virtually all input specs in Makefiles are wrong or incomplete.

Writing a build system is not such a big deal -- and outdoing the existing work is not very hard.


Perhaps that's true for your very specific use case but the same is likely to be true for other people using your build system. autodetection is great when it works and horrid when it fails.


I use file system hooks to autodetect dependencies, so it should always work as long as the file system is the input of the build and not other channels.

Explicit input specifications are virtually never correct. #include scanners, for example, are generally wrong because they do not express the dependency on the nonexistence of the headers in earlier include paths.


What guarantees did you find lacking in shake?


when you're scaling any given variable, the stock solution is almost never good enough. you end up hitting all kinds of limits.


That quote makes me nostalgic for Silicon Graphics workstations.


Although this is large for a company that deals mostly in web-based projects, it's nothing compared to repository sizes in game development.

Usually game assets are in one repository (including compiled binaries) and code in another. The repository containing the game itself can grow to hundreds of gigabytes in size due to tracking revision history on art assets (models, movies, textures, animation data, etc).

I wouldn't doubt there's some larger commercial game projects that have repository sizes exceeding 1TB.


I am working on a game where head is 1TB. On top of code and assets this size includes a few full builds and a full set of game-ready data (the data build process takes something ridiculous like 7 hours, so that's done on a server and it checks the result in). All in the same repository.

1TB is rather a lot. My previous record was 300GB and even that seemed a bit much. But it is very convenient having everything in one place, including all the stuff that you only need occasionally but is handy to have to hand, such as old builds of the game that you can just run in situ and installers for all the tools.

(I don't know what the entire repository size must be like, but many of the larger files have a limited history depth, so it's probably less than 5-10TB. So not exactly unimaginable, though I'm sure buying that much server-grade disk space - doubtless in duplicate - might cost more than I think.)


Interesting. Where do you work if I may ask?


WOW. BTW what game is that you are developing ?


Half-Life 3 I assume.


But they surely don't use git for that, right? In scenarios like this a versioning system that does not track all history locally would be a better fit.


As the other replies say, Perforce is dominant in commercial game development.

However, Perforce does have Git integration now, allowing for either a centralized or distributed version control model. Considering the popularity of Git, I wouldn't doubt smaller Perforce-based game projects are going the DVCS route.

Also, hypothetically speaking, consider if you had a game project that would eventually grow to 1-2TB in repository size. If you spent $100 per developer to augment each of their workstations with a dedicated 3TB hard drive, you would have an awesome level of redundancy using DVCS (plus all the other advantages). I know it's no replacement for cold, off-site backups, but it would still be nice.


Perforce tends to be big in game dev because it does better with repos full of giant blobs, has locking for working with them etc.


Perforce is king in game development. It's also the only place I personally still use Subversion.


I think most use Perforce.


Git does have support for "shallow" clones, that don't pull down the full history but only the recent history.
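For example (a throwaway local repo stands in for a real remote here; in practice you'd point `--depth 1` at your server's URL):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Build a small repo with five commits of history.
git init -q src && cd src
git config user.email you@example.com && git config user.name you
for i in 1 2 3 4 5; do echo "$i" > f && git add f && git commit -qm "commit $i"; done
cd ..

# Shallow clone: only the most recent commit comes down.
git clone -q --depth 1 "file://$tmp/src" shallow
cd shallow
git rev-list --count HEAD    # 1, instead of 5
```

The `file://` form matters for the demo: a plain local path would be hardlinked and `--depth` ignored.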


Shallow clones aren't the only issue for games, though. Another problem is that, unlike code, art assets usually can't be merged. You can't merge two Photoshop texture files or two edits to a 3D character made in Maya.

So, you need some central system to manage the assets so that people know "hey, don't edit this file right now because so-and-so is editing it".

Ideally you'd like to know this BEFORE you start editing. In other words, you don't want to spend 15-60 minutes editing something and only on saving or trying to check in get told "hey, sorry, but someone else was editing this file so you'll have to discard your changes and start over". Some editors are better at this than others.

You could try to write something outside of git to help with this, but why, when P4 already provides it?


> You could try to write something outside of git to help with this, but why, when P4 already provides it?

Maybe because P4 is kind of a PITA? I used it for 10 months on a project (without any noticeable art assets, even; this was just code) and it regularly caused problems. The company had someone whose sole job was to administrate P4, and it was sorely needed.

Of course, it's been many years, and I no longer remember details about the precise problems encountered, just the overall frustration. Although the one thing I do remember is the aggravation caused when a coworker accidentally locked a file they weren't editing and then left for the day.


Even better would be collaborative real-time editing.


It could use git-annex[1]?

[1] https://git-annex.branchable.com/


There are a lot of non developers using source control for game development. Perforce makes it easier for graphic designers and the like to use the VCS and not have to deal with a CLI. Perforce just handles binary blobs well.


Strange as it sounds, git-annex doesn't really do file versioning very well.


Are you talking about git annex assistant or git annex? git annex does file versioning very nicely. Then again, it doesn't work on Windows, so it's probably not very useful for most game developers.


git-annex does file versioning, but it's extremely uncomfortable to use (and I say this as somebody totally comfortable with git) and I'd never expect an artist or other only-semi-technical person to use it even if it worked with Windows. Especially when Subversion or Perforce are right there.


I wonder what the biggest code base in the world is.

Like you say, games include all sorts of binary assets. Any idea how much actual code is there?

The Linux kernel is only 175MB https://git.wiki.kernel.org/index.php/GitBenchmarks#Estimate...

The F22 has some 1.7 million LOC http://en.wikipedia.org/wiki/Lockheed_Martin_F-22_Raptor#cit...

This graph shows some pretty big things http://dailyinfographic.com/wp-content/uploads/2013/10/1276_...


Looking at the plot, I wonder what the heck is up with healthcare.gov... what on earth does the site do that really requires 5e8 LOC?


Support Internet Explorer


Absolutely! BF4 with the expansion packs is > 40GB as a compiled binary and compressed assets. The source (with no history) must be gigantic.


That is mostly data and assets though.


git cannot really deal with binary files well; if you have lots of png/gif/mov/mp3/jpg/etc. files to track, git is not the option for you, I think. git excels at source code, in text format that is.



Somebody asked that on Twitter and the OP responded with:

>> At least according to the presentation by a Facebook engineer that I just watched, they're still on git. [0]

[0] https://twitter.com/feross/status/459335105853804544


You can check out, but you can never leave?


Pay attention to the footnote:

  *8 GB plus 46 GB .git directory


The new depth option [1] on git clone as of 1.9.x should make the .git dir much more reasonable.

> --depth <depth>: Create a shallow clone with a history truncated to the specified number of revisions.

If the OS and applications that are creating and modifying files are all honest about the file dates, it should be possible to only scan dates instead of reading out every file. Or even use inotify-like events to track what changed.

[1]: http://git-scm.com/docs/git-clone

EDIT: As sisk points out below --depth itself is not new, but as of 1.9 the limitations that previously came with shallow clones were lifted. Thanks sisk.
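The date-scanning idea can be sketched with `find -newer` (a minimal illustration; the `stamp` file and paths are invented for the example):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

mkdir -p src
echo 'a' > src/a.txt
echo 'b' > src/b.txt

touch stamp          # record the time of the last scan
sleep 1              # ensure the next write gets a newer mtime
echo 'changed' > src/b.txt

# Only files modified after the stamp need to be re-read.
find src -type f -newer stamp
```

Inotify-style watching (e.g. via inotify-tools) avoids even this scan by reporting changes as they happen, which is essentially what Facebook's hgwatchman does for Mercurial.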


I've wanted this option for so long :-) Are there any downsides to it, apart from the obvious one of 'git log' not going all the way back in time? I mean, do any commands break unexpectedly?


The --depth flag has been around for a long time but, previously, came with a number of limitations (cannot push from, pull from, or clone from a shallow clone). As of 1.9.0, those limitations have been removed.

https://github.com/git/git/commit/82fba2b9d39163a0c9b7a3a2f3...


8 GB is still a lot. Would be interesting to know how much of it is actual code and how much is just images and so on.


I bet they check in (or have checked in at some point) 3rd party libraries, jar files, generated files, etc. I battle this every day at $dayjob and we have a multi GB subversion repository with a separate one for the 3rd party binaries. Svn handles this a bit better than the DVCSes, so just checking out the HEAD is smaller than the full history you get in git/hg, and you can clean up some crud a bit. It just lives on the central server, not in everyone's working copy.


You can clone with a single revision in git. Still ends up being around 2X the size I believe since you have the objects stored for the single revision as well as the working tree.


2X the size of what? The raw files? Assuming little shared data, sure. But svn will make two copies of every file too.


Facebook now has 7000 employees and is 10 years old. If all 7000 had been there the whole time, each employee would have had to write about 1.2 MiB of code total, or roughly 340 bytes (8-9 LOC with 40 (random guess of mine) characters per LOC) every day during those 10 years, to produce 8 GiB of code. And headcount was far lower for most of that decade, so the real per-head figure is higher. (Obviously assuming the repository contains only non-compressed code and only one version of everything and no metadata and...)


If they've got memcached with their own patches, linux with their own patches, Hadoop with their own patches, etc. and tons of translations I can see 8 gigs of text.


Why would they put that all in the same repository? I'm pretty sure this 8 GB repo is just their website code. A frontend dev working on a Timeline feature shouldn't have to check out the Linux kernel.


8GB would be at least 100 million lines of code (a lower bound, assuming at most 80 characters per line). For comparison, Linux has 15+ million lines of code, PHP 2+M.


PHP's repo is around 500M. But I'd say it is probably tens to a hundred times smaller than Facebook should be, especially if you count non-public stuff they must have there. So comes out about right.


Exactly, there's no way they wrote 8GB of code.


I think they did, have heard similar figures from them before, all deployed as one file. Macroservices...


Unlikely. They probably have a very dirty repo with tons of binaries, images, blah blah blah. It's highly unlikely they actually wrote 8GB of code, and the 46GB .git directory will be littered with binary blob changes, etc. This is really just to "impress" two kinds of people: 1) people who love Facebook, and 2) people who don't know anything about version control and/or how to do proper version control (no binaries in the SCM).



I cited this interesting paper from Facebook engineers (2013) in another comment:

https://news.ycombinator.com/item?id=7648802

In 2011 they had 10 million LoC, up to 500 commits a day,

but if we assume the plot keeps going up like this, now in 2014 it can be pretty big.

Their binary was 1.5GB when the paper was written.


1.5GB binary that includes their dependencies (other binaries).


The big .git directory is probably binary revisions? Is there any good way around that in git?


When I can't avoid referring to binary blobs in git, I put them in a separate repo and link them with submodules. It keeps the main repo trim and the whole thing fast while still giving me end-to-end integrity guarantees.

I wrote https://github.com/polydawn/mdm/ to help with this. It goes a step further and puts each binary version in a separate root of history, which means you can pull down only what you need.
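A minimal sketch of the submodule approach (throwaway local repos stand in for real remotes; the `protocol.file.allow` override is only needed because recent git blocks local-path submodules by default):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A separate repo holding only the binary blobs.
git init -q assets && cd assets
git config user.email you@example.com && git config user.name you
head -c 1024 /dev/zero > texture.bin
git add texture.bin && git commit -qm 'add texture'
cd ..

# The main repo stays trim and merely references the assets repo.
git init -q game && cd game
git config user.email you@example.com && git config user.name you
git commit -qm 'initial commit' --allow-empty
git -c protocol.file.allow=always submodule add -q "$tmp/assets" assets
git commit -qm 'link assets as a submodule'

git submodule status    # pinned commit of the assets repo
```

Collaborators who never touch art can then skip `git submodule update` entirely and avoid downloading the blobs.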


You're not going to merge binary files, so git isn't the right tool. The standard way is to use Maven: git handles your sources, and anything binary (libs, resources, etc.) goes in Nexus (where it is versioned centrally) and is referenced in your POMs. Simple and powerful.


git is "the stupid content tracker", not "the stupid merge tool". Even for things you have no intention of branching or merging, it still gives you control over versioning... and there's a huge gap between the level of control a hash tree gives you versus trusting some remote server to consistently give you the same file when you give it the same request.


I'm a bit confused. Whenever I've used git on my projects, I'd make sure the binaries were excluded, using .gitignore

Don't other people do that, too? What's the benefit of having binaries stored? I've never needed that; I've never worked on any huge projects, so I might be missing something crucial.


If there is a small number of rarely changing binaries (like icons, tool configs, etc.) then it may not be worth it to move them. Also if space is much cheaper than tool complexity and build time.


Well, it depends. Images, for instance, are binaries where a text diff makes little sense, so you have a copy of each version of the image ever used. And many projects use programs whose files are binaries. For instance, I've been on a project where Flash was used and the files checked in. Or Photoshop PSD files, .ai files, etc.


I usually keep a separate repo for PSD/AI files (completely neglected to consider them as binary files).

As for images, icons, fonts and similar, I just have a build script that copies/generates them, if it's needed.

I guess I've always been a little bit "obsessed" about the tidiness of my repositories.


If you don't have the source that produced those binaries, your only choices are to have them downloadable from somewhere else (which is a real hassle for the developer) or to just check them into the repo.


Stripping the binaries with filter-branch, or normalising them using smudge filters (for example, making sure zip-based formats use no compression).
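The filter-branch half of that can be sketched like this (a toy repo; the `*.bin` pattern is invented). Keep in mind `git filter-branch` rewrites history, so every collaborator has to re-clone afterwards:

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

git init -q .
git config user.email you@example.com && git config user.name you
head -c 1024 /dev/zero > blob.bin
echo 'int main(void){return 0;}' > main.c
git add . && git commit -qm 'code plus an unwanted binary'

# Rewrite every commit, dropping *.bin from the index each time.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  'git rm --cached --ignore-unmatch -- "*.bin"' -- --all

git ls-tree -r HEAD --name-only   # main.c only; the binary is gone
```

(Newer git recommends the third-party git-filter-repo tool for the same job, but filter-branch ships with git itself.)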


Re-cloning a fresh repo should keep it small. There's also a git gc method which cleans up the repo. I guess another way would be to just archive all the history up to a certain point somewhere.


> Re-cloning a fresh repo should keep it small. There's also a git gc method which cleans up the repo.

There's only so much git gc can do. We've got a 500MB repo (.git, excluding working copy) at work, for 100k revisions. That's with a fresh clone and having tried most combinations of gc and repack we could think of. Considering the size of facebook, I can only expect that their repo is deeper (longer history in revisions count), broader (more files), more complex and probably full of binary stuff.
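For what it's worth, comparing `git count-objects -v` before and after a `git gc` shows exactly what repacking buys you (toy repo for illustration):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

git init -q .
git config user.email you@example.com && git config user.name you
for i in 1 2 3; do echo "revision $i" > file.txt && git add file.txt && git commit -qm "rev $i"; done

git count-objects -v               # loose objects, one per blob/tree/commit

# Repack everything into a single packfile and drop unreachable objects.
git gc -q --aggressive --prune=now

git count-objects -v               # count: 0, everything now in-pack
```

As the parent says, though, gc only deduplicates and delta-compresses what is reachable; it cannot shrink history that legitimately contains large binary revisions.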


Is that all actually code? Each commit would have to add an average of 5KB compressed so maybe 20KB of brand new code.


> Is that all actually code?

Of course not, it's an actual project. There is code, there are data files (text and binary) and there are assets.


maven!


FWIW: Every time you modify a file in git, it adds a completely new copy of the file. 100mb text file, 100mb binary file - makes no difference. Modify one line, it's a new 100mb entry in your git repo.


No, git has delta compression. It only saves the changes.


Only when git repack operations run.


which you can run at any time. Git also runs a minor gc every time you do a git push, which does do some compression.


What about unit tests (and other testing code)? Those could occupy decent size too.


How is 8GB a lot? That's a small thumb drive's worth.


Could it be it contains a DB of all their users?


The worrying point here is the checkout of 8GB as opposed to the history size itself (46GB). If git is fast enough with SSD, this is hardly anything to worry about.

I actually prefer monolithic repos (I realize that the slide posted might be in jest). I have seen projects struggle with submodules and splitting up modules into separate repos. People change something in their module. They don't test any upstream modules because it's not their problem anymore. Software in fast moving companies doesn't work like that. There are always subtle behavior dependencies (re: one module depends on a bug in another module either by mistake or intentionally). I just prefer having all code and tests of all modules in one place.


As much as I hate ClearCase, I have to say it has some "interesting" features to deal with this. When you look at a repo you don't look at one unique version of the whole repo; you can create quite advanced "queries" such that for folder /src you look at your branch, for folder /3dpartylibs you look at version 5, and for folder /tests you look at head (latest). And since your working directory can be network attached (dynamic view) you don't have to "update" manually — head is like a living animal! It's like having subrepos built in.

While this is interesting it also requires a lot of discipline and almost one person dedicated as full time "dba" to not end up with spaghetti. Since there is no unique version number of the repo you have to store these queries and manually keep adding labels to be able to go back to an exact point of time.

It does have some uses like being able to run a new test on old code to find the first version when something broke or being able to change versions of external libraries or blob assets quickly but its hard to say if its worth it since it comes with so many other problems.


Most big tech companies use a service-oriented architecture. The website is composed of many small services which communicate with each other over HTTP or RPC protocols. Each service has its own version control repo and is maintained by a different team. That's generally the best way of scaling up.


Facebook's architecture, for at least the core web app, was a blob of PHP for ages.

They might have since modularized and cleaned it up but it seems unlikely they'd fully SOA-ize the Facebook web app.


That only applies to deployment. You're not building these services from the ground up: they're all going to have common libraries that need to stay up to date.


These are all solved problems. You create a package system that allows you to specify versioned dependencies to other packages. Your build and deployment systems can then build your package even though it depends on code that lives in other repositories owned by different teams. Hell, this even works across different versioning control systems; one team can be lagging along on SVN, another can have packages in P4, and yet another can have theirs in git, but they can all build against each others code.

It works absolutely brilliantly. Division of labor and responsibility becomes clear, repos stay manageable, large scale rewrites can happen safely, in piecemeal, over time... it really is the best way to do it.


As others have noted elsewhere, this "solution" has its own problems if you are moving rapidly, which I don't think anyone can claim Facebook hasn't been doing.

So, yes, if you are able to control growth enough that you can make this happen, it is attractive. If you can't, then this leads to a version of the diamond problem in project dependencies. And is not fun.


Growth is the reason that companies should avoid what Facebook has done. The company I currently work for anticipated the scaling problems that Google later encountered with Perforce (http://www.perforce.com/sites/default/files/still-all-one-se...) and recognized that while Perforce could be scaled further by purchasing increasingly beefy hardware, ultimately you could only reliably scale so far with Perforce.

If you're not growing, then there is no problem. If you have linear growth maybe you can keep pushing it, but who plans on linear growth?

Google is already on multiple Perforce servers because of scaling, and that is not a situation that is going to improve. If you start using multiple centralized version control servers, you are going to want a build/deployment system that has a concept of packages (and package versions) anyway.

> If you can't, then this leads to a version of the diamond problem in project dependencies. And is not fun.

These sorts of dependency resolution conflicts can and do happen, but far less often than you would think. Enforcing semantic versioning goes a long way (and, along with it, specifying versioned dependencies). In practice, the benefits of versioned dependencies (such as avoiding ridiculous workarounds like the one described in this HN comment: https://news.ycombinator.com/item?id=7649374) far outweigh any downsides.

You can even create a system that uses versioned packages as dependencies while using a centralized versioning system. In fact, this is probably the easiest migration strategy. Build the infrastructure that you will eventually need anyway while you are still managing on one massive repository. Then you can 'lazy eval' the migration (pulling more and more packages off the centralized system as the company grows faster and faster, avoiding version control brownouts).


I'm assuming you aren't referring to "succeed" in your first sentence. :)

It is amusing the amount of hubris our industry has. Seriously, you are talking about outsmarting two of the most successful companies out there.

I mean, could they do better? Probably. But it is hard to grok the amount of second guessing any of their decisions get.


But are they successful because of this, or despite it?


Really good question. One that I am not pretending to know the answer to.

I do feel that the main reason they are successful is large manpower. That is, competent (but not necessarily stellar) leadership can accomplish a ton with dedicated hard workers. This shouldn't be used as evidence that what they are doing is absolutely correct. But, it should be considered when what they do is called into question.

If you have/know of any studies into that, I know I would love to read them. I doubt I am alone.


If you are a large company, you can move faster if devs aren't all working on the same repository. If all your code is in one repo and one team makes a breaking change to their part of it, everyone's code fails to build. If the source code is broken up into separate packages, each team can just use versioned releases of the packages they depend on and not have to worry about fixing compatibility issues.


While there is a strong appeal to your argument, Facebook stands as a resounding counter example. As does Google.

The counter argument appears to be this: if one team checks in a change that breaks another team's code, then the focus should be on getting that fixed as soon as possible.

Now, if you are in multiple repositories, it can be easy to shift that responsibility onto the repository that is now broken. Things then get triaged and tasks must be done such that getting in a potentially breaking fix may take time.

Contrasted with the simple rule of "you can't break the build" in a single repository, where the onus is on whoever is checking in the code to make sure all use sites still work.

Granted, an "easy" solution to this is to greatly restrict any changes that could break use site code. The Kernel is a good example of this policy. Overall, I think that is a very good policy to follow.


We use separate repos and it works out well. It's nice having separate Git histories that pertain to different areas of the codebase.

Our workflow covers all the potential problems you named (eg. scripts to keep everything up to date, tests that get run at build or push time after everything is already checked out from the individual repos, etc.).

We've been running this way for over a year with literally zero issues.


To get a log for a specific subdirectory, you just:

  git log -- my-teams-subdirectory


>People change something in their module. They don't test any upstream modules because it's not their problem anymore.

If you use any sort of versioning this shouldn't ever cause a problem.


I had a problem so I decided to use versioning, now I have a combinatorially exploding number of problems.


Yeah, doing it this way they can never make API-incompatible changes without also fixing everything downstream... which effectively means once a library is popular enough, it is locked at its current API forever.


I've seen that happen at Google. At some point it's easier to write a new (differently named) library. Monolithic codebase + renames = poor man's versioning.

BUT this allows you to pay the prices of versioning (downstream burdened with rewriting => they never do => old code lives indefinitely) only in the worst cases. If done right (lots of tests, great CI infrastructure), fixing everything downstream is practical in many cases, and can be a win.

A subtler question is how this interplays with branching. You can't be expected to update downstream on every random/abandoned branch out there, only the head. Which deters people from branching because then it's their responsibility to keep up...


Or you bump an API version. And the fixes are gradual everywhere.


The parent was advocating not versioning.


How do monolithic repos solve that? Surely people who fix bugs in a library aren't testing the entirety of Facebook every time (how long would that even take? Assuming they've even set such a thing up.)


I used to work at Facebook. They have servers that automatically run a lot of their test cases on every commit.


It is at least easier to correlate the changes. When you have X+ modules, you have potentially X+ histories you have to look at to know when a change was seen with another change.


FB had previous scaling problems with git which they discussed in 2012 http://comments.gmane.org/gmane.comp.version-control.git/189...

It appears they are now using Mercurial and working on scaling that (also noted by several others in this discussion): https://code.facebook.com/posts/218678814984400/scaling-merc...


I bet most of that size is made up from the various dependencies Facebook probably has, though I'm still surprised it's that large. I expected the background worker things, like the facial recognition system for tagging people, and the video re-encoding libs, to be housed on separate repositories.

I also wonder if that size includes a snapshot of a subset of Facebook's Graph, so that each developer has a "mini-facebook" to work on that's large enough to be representative of the actual site (so that feed generation and other functionalities take somewhat the same time to execute.)


Having all code in a single repository increases developer productivity by lowering the barrier to change. You can make a single atomic commit in one repository as opposed to N commits in M repositories. This is much, much easier than dealing with subrepos, repo sync, etc.

Unified repos scale well up to a certain point before troubles arise, e.g. fully distributed VCS starts to break down when you have hundreds of MB and people with slow internet connections. Large projects like the Linux kernel and Firefox are beyond this point. You also have implementation details such as Git's repacks and garbage collection that introduce performance issues. Facebook is an order of magnitude past where troubles begin. The fact they control the workstations and can throw fast disks, CPU, memory, and 1 Gbps+ links at the problem has bought them time.

Facebook made the determination that preserving a unified repository (and thus preserving developer productivity) was more important than dealing with the limitation of existing tools. So, they set out to improve one VCS system: Mercurial (https://code.facebook.com/posts/218678814984400/scaling-merc...). They are effectively leveraging the extensibility of Mercurial to turn it from a fully distributed VCS to one that supports shallow clones (remotefilelog extension) and can leverage filesystem watching primitives to make I/O operations fast (hgwatchman) and more. Unlike compiled tools (like Git), Facebook doesn't have to wait for upstream to accept possibly-controversial and difficult-to-land enhancements or maintain a forked Git distribution. They can write Mercurial extensions and monkeypatch the core of Mercurial (written in Python) to prove out ideas and they can upstream patches and extensions to benefit everybody. Mercurial is happily accepting their patches and every Mercurial user is better off because of Facebook.

Furthermore, Mercurial's extensibility makes it a perfect complement to a tailored and well-oiled development workflow. You can write Mercurial extensions that provide deep integration with existing tools and systems. See http://gregoryszorc.com/blog/2013/11/08/using-mercurial-to-q.... There are many compelling reasons why you would want to choose Mercurial over other solutions. Those reasons are even more compelling in corporate environments (such as Facebook) where the network effect of Git + GitHub (IMO the foremost reason to use Git) doesn't significantly factor into your decision.


Hello there, have you heard of service-oriented architecture? You must be joking to justify a single repository with "easier to change". Your problem is that the code base must be tightly coupled if splitting the services out to different repos is not possible and you need to contribute to multiple repositories to get something done. I would say the biggest change in Amazon's architecture was moving over to the service-oriented way, and it was worth the effort. Developers are forced to separate different functions into separate services, and they are in charge of each service: if it goes down, they are the ones getting the alerts. All of the services use 250ms timeouts, so there is no cascading effect when a service goes down. The web page consists of a few thousand service calls and it degrades gracefully. Facebook obviously has some tech debt that they need to fix. Using a stupid design justified with some random crap that does not even make sense is not really acceptable (at least for me).


SOA isn't a magic bullet.

What if multiple services are utilizing a shared library? For each service to be independent in the way I think you are advocating for, you would need multiple copies of that shared library (either via separate copies in separate repos or a shared copy via something like subrepos).

Multiple copies leads to copies getting out of sync. You (likely) lose the ability to perform a single atomic commit. Furthermore, you've increased the barrier to change (and to move fast) by introducing uncertainty. Are Service X and Service Y using the latest/greatest version of the library? Why did my change to this library break Service Z? Oh, it's because Service Z lags 3 versions behind on this library and can't talk with my new version.

Unified repositories help eliminate the sync problem and make a whole class of problems that are detrimental to productivity and moving fast go away.

Facebook isn't alone in making this decision. I believe Google maintains a large Perforce repository for the same reasons.


> "What if multiple services are utilizing a shared library? For each service to be independent in the way I think you are advocating for, you would need multiple copies of that shared library (either via separate copies in separate repos or a shared copy via something like subrepos)."

No, you have a notion of packages in your build system and deployment system.

You want to use the FooWiz framework for your new service BarQuxer? Declare FooWiz >= 2.0 as a dependency of your service. The build system will then fetch a suitable FooWiz package when building BarQuxer. Another team on the other side of the company also wants to use FooWiz? They do the exact same thing. There is never a need for FooWiz to be duplicated; anybody can build with that package as a dependency.
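As a sketch, a hypothetical build-file fragment (Bazel-style syntax; FooWiz, BarQuxer, and the `service` rule are made-up names from this comment, not a real build system) might declare the dependency once and let the build system resolve the single shared copy:

```
# BarQuxer's build definition: depend on the shared FooWiz package.
service(
    name = "bar_quxer",
    srcs = ["bar_quxer.py"],
    deps = ["//third_party/foo_wiz:2.0"],  # resolved by the build system, never duplicated
)
```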


I think you are missing the point. Versioning and package management problems can largely go away when your entire code base is derived from a single repo. After all, library versioning and packaging are indirections to better solve common deployment and distribution requirements. These problems don't have to exist when you control all the endpoints. If you could build and distribute a 1 GB self-contained, statically linked binary, library versioning and packages become largely irrelevant.


I'm telling you how corporations with extremely large codebases and a SOA do things. The problem that you described has been solved as I describe.

SOA is beneficial over monolithic development for many other reasons unrelated to versioning. It just happens to enable saner versioning as one of its benefits.


You're coupling your 3rd-party dependencies too tightly with your app logic; that's why it's so brittle. Start wrapping those functions.


When you are done writing boilerplate to wrap all the third-party functions, don't forget to write equivalent documentation too. And tests. Or are you just going to use the same function names? If the latter, it's a pointless exercise, because it's unlikely you can shoehorn a different vendor library into the exact same API later.


Whilst not arguing against your other points, which I think are interesting, git has the submodule system, which we use for shared libraries.


I never said it was a magic bullet. What I am suggesting is to have separate services with independent software packages. That is all. Btw, you can't break a service with a commit: all the services use versioned APIs, so when you are building API version n+1 you are not ditching version n. Very simple: seamless upgrade, graceful downgrade. FYI, such a system is powering Amazon and it works perfectly at scale.


There's gotta be a ton of binaries in there


Reminds me of a recent project we had. The code we wrote came in at about 150KB; after an npm and bower install it was at 192MB.


Surely a mini-graph would be something that would be loaded by a script, instead of being stored in the repo itself.


> @readyState would massively enjoy that first clone @feross

The first clone does not have to go over the wire. Part of git's distributed nature is that you can copy the .git to any hard drive and pass it on to someone else. Then...

> git checkout .


Am I missing something or this means a new intern working on a small feature, for instance, would have access to entire codebase?


That is a feature not a bug! Discoverability of code helps improve code quality and makes things less fragile.


Discoverability decreases as LOC increase, so that's not true.


Yes, and an intern would be subject to a code review before pushing to master - no way would they have write access to the "master" repo.


Indeed the process of facebook code deployment is pretty complex:

https://www.facebook.com/publications/514128035341603/

And I assume they have extra requirements for interns to push code.


that is a fascinating article, thank you. i like the idea of 'push karma'.


You're welcome! I really liked the 'push karma' too.

I liked this idea too : "all engineers who contributed code must be available online during the push. The release system verifies this by contacting them automatically using a system of IRC bots; if an engineer is unavailable (at least for daily pushes), his or her commit will be reverted."

That way, they are able to react very quickly in case of a problem.


But this isn't a magic bullet...a commit can't always be reverted without conflicts, and even if reverted there's no guarantee that it leaves the codebase in a correct state. (Guess I should read the article you're talking about.)


Yes, but the new intern would be able to read all the source and "secret sauces".

I doubt that an intern on Google would've access to the search codebase. I'd wager that only a handful of trusted employees have access to that codebase.


All Google code is stored in one master repository, for all products. An intern can look at the search code.


Not so. Search is specifically sequestered.


If you don't trust your employees why did you hire them? Besides, the source code isn't where FB's money is at, that would be their users.


> If you don't trust your employees why did you hire them?

At a small company, I agree. But FB has around 5k people. Let's say they have 3k engineers; that's an awful lot of people they're trusting with their source code.


Google is also one source tree I think.


You are correct for the search codebase - it is separate from the main tree, as is a lot of the core ad placement code. If you aren't working on those projects, you don't have access.


So it is like the Coca Cola recipe? Only two people allowed to know it and not allowed to get the same flight? Yet everyone else able to mix and serve the syrup?


It should be possible to restrict each employee's access to specific parts of the repository. However, I can't really see Facebook doing that.

Everyone having access to everything must be worth the security trade-off. On the other hand, I suppose it's debatable whether it would be a trade-off at all.


I wonder, if this is the way a majority of big businesses do things, how come we don't see more leaks of entire codebases? It'd be trivial to put something up on TPB and just share all the code, but I don't see things like that happening. I also doubt that every single employee with access to the code has the moral standards not to do this. There must be something else keeping them from doing it.


I'd wager they are more likely to get leaked to the black market for money where someone will keep it a secret in the hopes of finding some 0 days.


I'd say the combination of some moral standards, the fear of getting caught and - first of all - little incentive to leak should be sufficient, no?


> It should be possible to restrict each employee's access to specific parts of the repository.

Not with git. The hash-chain mechanism requires the entire repo to generate valid chains, so it's all or nothing.
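That chaining can be illustrated with a toy sketch (this is not git's real object format, which hashes typed, zlib-framed objects; only the structure is kept): every tree id covers its subtrees, so stripping any subtree changes every id above it, and a partial checkout can't reproduce the original hashes.

```python
# Toy illustration of why git access control is all-or-nothing.
import hashlib

def h(data):
    return hashlib.sha1(data).hexdigest()

def tree_id(tree):
    # A tree's id is a hash over every entry, including nested subtrees,
    # so it transitively depends on all content below it.
    entries = sorted(tree.items())
    return h("".join(
        name + tree_id(v) if isinstance(v, dict) else name + h(v.encode())
        for name, v in entries
    ).encode())

full = {"public": {"app.py": "print('hi')"}, "secret": {"sauce.py": "x = 1"}}
partial = {"public": {"app.py": "print('hi')"}}

# Hiding the 'secret' subtree changes the root id, breaking the chain.
print(tree_id(full) == tree_id(partial))  # False
```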


>Everyone having access to everything must be worth the security trade-off.

I would find this extremely hard to believe, especially at Facebook. At any software company, your code base is what defines you as a company; there is no way they'd let the good stuff sneak out like that.


Let's say you managed to sneak the code out of Facebook. You take the logo, draw a red cross over it and scribble "ProAmbook" below. You push it live.

Now what? How do you get users? "We are just like facebook - only your friends aren't here" probably wouldn't get users excited.

And if you somehow DID manage to get users, don't you think there are "watermarks" in the code, that they could detect and sue you to death with?


They have anti-spam heuristics, graph heuristics, models on how to serve the best ad for each user, tons of bugs that can only be discovered by reading the source, etc.


Having only the code wouldn't be enough. You need to also replicate comparable maintenance staff & data center capability. Even then you'll just have a clone of the tech, but nobody will actually use your service. There's so much more that makes facebook what it is. I think they may even be able to open source their tech with a net positive effect.


I think a lot of their value comes from the data they collect about users and how they use it to sell ads to vendors. That source code would be worth a lot. I'm not speaking about replicating a Facebook on your own servers; I agree there it would be futile. But how they run their business can probably be derived from their source code.


Facebook has never been defined by its code. They have had source leak before and no one has cared--it's PHP (heavily customized at that) and not of much worth outside of casual curiosity. The trick is to get a billion plus people to give you the details of your life every day.


> The trick is to get a billion plus people to give you the details of your life every day.

How they capture this data and use it would also be in their source code, no? This is absolutely where Facebook gets its worth. I would assume this is what they would want to keep in a limited exposure set? I might be wrong, but this is why they hire the best engineers out there.


Is there a reason why they keep everything in the same repo? Can’t you just split the code across multiple smaller repos?


When you split code across smaller repos, you gain a different problem: version management and potentially dependencies using different versions of the same project.


> When you split code across smaller repos, you gain a different problem: version management and potentially dependencies using different versions of the same project.

True, you do. What you gain is the ability of small pieces to move individually through API changes.

If your entire codebase is in one repo (as appears to be the case here), and you want to change an API, you must either do so in a backwards compatible way, and slowly eradicate any old callers, or change them all in one fell swoop.

By splitting to multiple repos, you can version them independently. Thus, a project can (hopefully temporarily) depend on the old API, which only gets bugfixes, while another project can depend on the new version.

The tricky bit is when you have one "binary" or something equivalent referring to two versions of a dependency. (Usually indirectly, i.e., A depends on B which depends on D v1, and A depends on C which depends on D v2, and D v1 and D v2 are incompatible.) You can't really do much about this, but if you keep your components small enough (think services with well separated interfaces) you should be able to keep the dependencies small enough as well.
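The diamond case described above (A -> B -> D v1 and A -> C -> D v2) can be sketched with a toy dependency walk — a hypothetical graph and a naive resolver, not any real package manager:

```python
# Hypothetical dependency graph: A depends on B and C; B pins D at v1,
# C pins D at v2. A version of None means "no pin".
deps = {
    "A": [("B", None), ("C", None)],
    "B": [("D", 1)],
    "C": [("D", 2)],
}

def pinned_versions(root, graph):
    """Walk the graph and collect which versions of each library are demanded."""
    demanded = {}
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for dep, version in graph.get(node, []):
            if version is not None:
                demanded.setdefault(dep, set()).add(version)
            stack.append(dep)
    return demanded

# Any library demanded at more than one version is a diamond conflict.
conflicts = {lib: vs for lib, vs in pinned_versions("A", deps).items() if len(vs) > 1}
print(conflicts)  # D is demanded at both v1 and v2
```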


It becomes a lot harder to keep everything in sync, especially if internal interfaces change frequently. At Facebook scale, though, it's probably a good idea to define boundaries between areas of the application better.


No, it doesn't. It's actually the complete opposite, because, you know, 'thou shalt separate those things that change frequently from those that don't'.


You end up with fewer developers having to pull & merge/rebase if you have things in separate repos.

Individual libraries/dependencies get worked on by themselves, with an API that other applications use. Then the other apps just bump a version number and get newer code.


The problem with this is that you are assuming the APIs change in some sort of odd isolation from the parts that use them.

That is, the reason an API changes is because a use site has need of a change. So, at a minimum, you need to make that change and test it against that site in a somewhat atomic commit.

Then, if the change has any effect on other uses, you need a good way to test that change on them at the same time. Otherwise, they will resist pulling this change until it is fixed.

Add in more than a handful of such use sites, and suddenly things are just unmanageable in this "manageable" situation.

Not that this is "easy" in a central repo. But at least with the source dependency, you can get a compiler flag at every place an API change breaks something.

And, true, you can do this with multiple repos, too. But every attempt I have seen to do that just uses a frighteningly complicated tool to "recreate" what looks like a single source tree out of many separate ones. (jhbuild, and friends)

So, if there is a good tool for doing that, I'd certainly love to hear about it.



Meh. I'm working on a comparably small project (~40 developers), and we're over 16GB.

Mostly because we want a 100% reproducible build environment, so a complete build environment (compilers + IDE + build system) is all checked into the repo.


IDE checked into the repo eh? For some reason I kinda like that idea. So portable... if it works.


Someone recently told me that Facebook had a torrent file that went around the company that people could use to download the entire codebase using a BitTorrent client. Is there any truth in this?

I mean, the same guy that told me this, also said that the codebase size was about 50 times less than the one reported in this slide, so it may all be pure speculation.


If you're interested in the deployment process at Facebook, look at the link of a Facebook engineers paper I submitted in my other comment in this thread : https://news.ycombinator.com/item?id=7648802

"The deployed executable size is around 1.5 Gbytes, including the Web server and compiled Facebook application. The code and data propagate to all servers via BitTorrent, which is configured to minimize global traffic by exploiting cluster and rack affinity. The time needed to propagate to all the servers is roughly 20 minutes."


Facebook uses BitTorrent to deploy their binaries to their many servers.

So using torrents isn't foreign to them.


That sounds entirely believable. At $PREVIOUS_JOB, the Puppet git repository was large enough (several GB) that cloning it was painful, and new starters were handed a pruned repository via the local network so that they could get something done today.


  NAFV_P@DEC-PDP9000:~$ python
  Python 2.7.3 (default, Feb 27 2014, 19:58:35)
  [GCC 4.6.3] on linux2
  Type "help", "copyright", "credits" or "license" for more information
  >>> t=54*2**30
  >>> t
  57982058496
  # let's assume a char is 2mm wide, 500 chars per meter
  >>> t/500.0
  115964116.992 #meters of code
  # assume 80 chars per line, a char is 5mm high, 200 lines per meter
  >>> u=80*200.0
  >>> v=t/u
  >>> v
  3623878.656 # height of code in meters
  # 1000 meters per km
  >>> v/1000.0
  3623.878656 # km of code, it's about 385,000 km from the Earth to the Moon
  >>> from sys import stdout
  >>> stdout.write("that's a hella lotta code\n")


I wonder what their branching strategy is like and how merges are gated with a single codebase of that size?


They aim for a completely linear history. They may even have a policy of not allowing merge commits. It is described in various places on the internet. I like https://secure.phabricator.com/book/phabflavor/article/recom... because it and its sister articles on code review and revision control are terrific reads.


Dear everyone: you should be using Phabricator. It is Facebook's collected wisdom about software development expressed in software. It has improved my life substantially. The code review is better than Github's, and their linear, squashed commit philosophy has worked out much better than the way I used to do things.


Oh yeah, it's great when one person can break the build and stop all active development. It scales so well. Oh I know, to prevent any issues, let's protect ourselves with feature toggles. Oh and let's build a set of database driven rules to manage those toggles. Oh what about dependencies? Let's build a manager to manage the DAG of feature toggle dependencies.

Need I go on? :-) You've replaced a relatively simple system of merge requests with some pseudo in-code versioning system controlled through boolean global variables.

I'll take feature branches any day of the week over that mess. The github model is far superior IMO.


My team uses Phabricator without any feature toggles in our code. You land code onto master when it's ready to ship. Until then, you have features developed and reviewed on separate branches. I don't get how that's more or less fragile than merges.


I was addressing the idea of committing directly to master, protecting your code with feature toggles so it doesn't break things. Maybe I misunderstood the OP.

I think feature toggles can be extremely useful, but still develop in a branch and merge after review/qa.


We're using Phabricator at my company, and I'm getting to the point of encouraging people to use it and start seeing the benefits. I work on infrastructure, so when people come to me with issues, I find out what problem they're having and then get them to submit a ticket to me. I've also started creating tickets for issues and assigning them to other people to get them to take a look at it, and people seem pretty receptive.

It hasn't become part of everyone's workflow yet, but it's pretty useful.


That might be a silly question but it's not useful for just an individual in a team, correct? We use GitHub at work, but I wouldn't be able to try Phabricator on my own, right?


Mostly correct. You can use the code review tools without needing your repository set up in Phabricator or needing anyone else to have Phabricator accounts. You could comment on a diff and point your coworker to it, but it's unlikely to feel better than Github pull requests in that scenario.


It looks pretty great. How does it compare to Atlassian products (if you have used any)?


Mixed bag. The code review part is much better than Stash and significantly better than Crucible. Namely, diff-of-diffs makes reviewing changes based on comments infinitely easier (especially on large reviews). We installed Phabricator just for the code review piece initially. Repo browsing is about on par with Stash, but it doesn't seem to experience the horrific slowdowns that our Stash server does. We don't use the tasks because a number of non-engineering roles also use JIRA, and the task functions in Phabricator don't have nearly the depth of security and workflow options we need.


I've used JIRA, and I slightly prefer Phabricator's tasks. They're mostly the same.


Seems like you lose most of the value of git if you can't do merges.


You can still merge from master (or whatever your main development branch is) into your feature branches. You just never merge into your main branch. You squash the commits and rebase. The history ends up looking much better, and `git blame` becomes much more useful.


Linear codebase history? Why even use Git then... that's SVN stuff... we use Git nowadays for a reason...


There's a big difference between a linear history produced by actual linear development and one produced by `git rebase -i`. They both have the advantage of being easier to understand later, however.


That's not true, dude. If you're following actual linear development you're likely to see a lot of 'poke build' and 'change css' and stuff like that. Git rebase -i gives you a chance to rename your commits and organize them in a readable way.

So git rebase -i will be more readable, while actual linear history is always gibberish.


Last I knew, the FB mainline codebase was in fact still in SVN with git-svn dcommit (possibly hidden under arc) being how things land in "master" (and the revisions being merged quasi-manually to a release branch immediately before HH compile)

FB doesn't need to branch ... Gatekeeper (their A/B, feature flag system) really takes care of that concern logically



I thought I had read an article about facebook switching to perforce due to their really large git repo. Were they at least thinking about it?

A quick google comes up with nothing but I could have SWORN I read that.


You are thinking about Mercurial, not Perforce: https://code.facebook.com/posts/218678814984400/scaling-merc...


Not like they don't have the money for it, but that would be very expensive for them and sort of anti-open-source, no?


Why don't people use multiple git repos for multiple internal projects? It seems totally nonsensical and undesirable.


That's actually not that bad for a engineering shop of their size. I would start archiving metadata at some point.


Gosh, last time I had trouble checking in 8GB of data; git is very memory hungry when the data set is big and you check it all in at once. How much memory do you need on the server side when you want to 'git add .' a whole 54GB repo?

what about a re-index or something, will that take forever?

I worry that at such a size the speed will suffer; I feel git is comfortable with probably a few GBs only?

anyway it's good to know that 54GB still is usable!


...but 8GB for the actual current version.

How much of it is static resources, like CSS sprite images?


They must be storing a lot of images and binary files, I guess.


Company I did a contract for last year has 8MB of (Java) source code and a 52MB SVN repo and make £40 million a year out of it...

We're doing something wrong.


I thought they used Mercurial


Someone must have forgotten a .gitignore or two.


well, git gc --aggressive --prune=now, duh.

(jk)


I hope everyone realizes this is not 54GB of code but, more likely, a very public showing of very poor SCM management. They likely have tons of binaries in there, and many, many full-codebase changes (whitespace, tabs, line endings, etc.). Also, not to mention how much dead code lives in there.


Honestly, I prefer check-in everything shops. It's way too often otherwise some different Java version, or IDE version, or Maven central being down screws something up, or you have to wait a long time for a Chef recipe or disk image to give you the reference version. Half my day today was dealing with someone updating Java on half our continuous build system slave computers and breaking everything because it didn't have JAVA_HOME and unlimited strength encryption all setup properly.


That sounds like either poor sysadmining and/or poor documentation... SCM should not have "clutter" in it; otherwise you wind up with an all-day download of 54GB of dead or useless garbage. The kernel's repo is only a few GBs and it has MANY more changes and much more history than FB does...


You should realize that the kernel doesn't have images and other assets that FB might have. And the repo obviously is the right place for them.


"The" repo. As if they had no choice but to put everything for every aspect of the business into ONE giant repo.

Hopefully they actually have some separate sites, separate tools and separate libraries. Or could understand how to use submodules or something rather than literally putting everything in one huge repository.

Whether to put images and other assets into git repos is a separate decision.


Hey Facebook! You're doing it wrong!!!!!


So what? This probably means they're versioning data files they shouldn't be. I feel like this just exists here as a pissing contest.


It is supposed to justify the engineering effort they put into switching to Mercurial, then trying to make it "scale." (Rather than just using separate repositories to begin with, according to the design of the tool and best practices)


aka. a pissing match to show "they are too big for any standard industry tools". Really speaks to the level of (non)expertise employed at FB.


[deleted]


Why's it bad to store source code on Git?


It's not bad, it's really nice, but Git has one problem: when your codebase is big, the process takes a long time. Imagine git scanning those 8GB every time you do a commit; that is why Facebook was looking to port all their code to another VCS.


I think it's worth making a distinction between the Git plumbing and the Git porcelain when talking about performance. The core functionality (the plumbing) is very fast regardless of repository size. The slowdowns people describe are almost always related to the porcelain commands, which are poorly optimized. Almost every porcelain-level command will cause Git to lstat() every file in your tree, as well as check for the presence of .gitignore files in all of the directories. It's very wasteful.

The fix for this is pretty simple: use filesystem watch hooks like inotify to update an lstat cache. I wrote something like this for an internal project and the speed difference was night and day. I remember reading that there had been progress on the inotify front on the git dev mailing list a few years ago, don't know what the current status is.
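A minimal sketch of such a cache in Python (the real thing, e.g. Facebook's hgwatchman, invalidates entries from inotify/watchman events; here a hypothetical `invalidate()` method stands in for the watch hook):

```python
# Sketch of an lstat cache of the kind a filesystem-watcher could maintain.
import os
import tempfile

class StatCache:
    def __init__(self):
        self._cache = {}

    def lstat(self, path):
        # Serve from the cache when possible; fall back to the real syscall.
        if path not in self._cache:
            self._cache[path] = os.lstat(path)
        return self._cache[path]

    def invalidate(self, path):
        # A watch hook (inotify, watchman, ...) would call this on each
        # change event, so cached entries never go stale.
        self._cache.pop(path, None)

cache = StatCache()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    name = f.name

st1 = cache.lstat(name)
st2 = cache.lstat(name)   # served from the cache, no second syscall
print(st1 is st2)         # True
os.unlink(name)
```

The win comes from porcelain commands hitting the warm cache instead of issuing one lstat() per tracked file on every invocation.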


There was some testing early last year, but I think it's about time for someone to post another reminder

http://git.661346.n2.nabble.com/inotify-to-minimize-stat-cal...


This is when a centralized VCS with checkin-checkout concepts really shines. The client only has to check the checked-out files on commit.


Funny you should mention Facebook maybe having performance problems with Git: http://git.661346.n2.nabble.com/Git-performance-results-on-a... (2012)


I really can't support this claim. I have a repo of 111GB now and it works ok. Not slow at all (unless you do git gc or something like that).


" when you codebase is big, the process takes a long time"

From my experience, we can bet git is the one that takes the least amount of time

Not hitting the network to check which files changed, for a start


[deleted]


I wouldn't say Facebook and Twitter are competitors. Their models are quite different.

- Facebook's complex privacy vs Twitter's binary "public or private"

- Facebook's real names vs Twitter's @usernames

- Facebook's freeform posting length vs Twitter's 140 character limit


They store their code using the Git version-control software, not on Github as you seem to be thinking.


git is not a website...



