
Pay attention to the footnote:

  *8 GB plus 46 GB .git directory



The new depth option [1] on git clone as of 1.9.x should make the .git dir much more reasonable.

> --depth <depth>: Create a shallow clone with a history truncated to the specified number of revisions.

If the OS and the applications that create and modify files are all honest about file dates, it should be possible to scan only the dates instead of reading every file. Or even use inotify-like events to track what changed.

[1]: http://git-scm.com/docs/git-clone

EDIT: As sisk points out below, --depth itself is not new, but as of 1.9 the limitations that previously came with shallow clones were lifted. Thanks, sisk.
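
For example, a minimal shallow clone looks like this (just a sketch; the URL is a placeholder):

  git clone --depth 1 https://github.com/example/bigrepo.git
  cd bigrepo

  # later, if you decide you want the full history after all:
  git fetch --unshallow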


I've wanted this option for so long :-) Are there any downsides to it, apart from the obvious one of 'git log' not going all the way back in time? I mean, do any commands break unexpectedly?


The --depth flag has been around for a long time but, previously, came with a number of limitations (cannot push from, pull from, or clone from a shallow clone). As of 1.9.0, those limitations have been removed.

https://github.com/git/git/commit/82fba2b9d39163a0c9b7a3a2f3...
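
In practice that means a workflow like this is now fine (sketch only; the URL and branch name are made up):

  git clone --depth 50 https://git.example.com/huge.git
  cd huge
  git push origin my-feature    # pushing from a shallow clone used to be refused
  git fetch --depth=100         # deepen the history later if you need more context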


8 GB is still a lot. Would be interesting to know how much of it is actual code and how much is just images and so on.


I bet they check in (or have checked in at some point) 3rd-party libraries, jar files, generated files, etc. I battle this every day at $dayjob: we have a multi-GB Subversion repository with a separate one for the 3rd-party binaries. Svn handles this a bit better than the DVCSes, since just checking out HEAD is smaller than the full history you get in git/hg, and you can clean up some of the crud; the history lives only on the central server, not in everyone's working copy.


You can clone with a single revision in git. Still ends up being around 2X the size I believe since you have the objects stored for the single revision as well as the working tree.


2X the size of what? The raw files? Assuming little shared data, sure. But svn will make two copies of every file too.


Facebook now has 7000 employees and is 10 years old. To produce 8 GiB of code, each employee would have had to write about 340 bytes of code (roughly 8 LOC at 40 (random guess of mine) characters per LOC) every day during those 10 years. (Obviously assuming the repository contains only non-compressed code and only one version of everything and no metadata and...)
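
For the record, the division works out like this (a quick bash sanity check; 8 GiB spread over 7000 people and a 3650-day decade):

  echo $(( 8 * 1024**3 / 7000 / 3650 ))   # ~336 bytes per employee per day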


If they've got memcached with their own patches, Linux with their own patches, Hadoop with their own patches, etc., and tons of translations, I can see 8 gigs of text.


Why would they put that all in the same repository? I'm pretty sure this 8 GB repo is just their website code. A frontend dev working on a Timeline feature shouldn't have to check out the Linux kernel.


8 GB would be at least 100 million lines of code (taking 80 characters per line as an upper bound). For comparison, Linux has 15+ million lines of code and PHP 2+ million.


PHP's repo is around 500 MB. But PHP is probably tens to a hundred times smaller than Facebook's codebase should be, especially if you count the non-public stuff they must have in there. So it comes out about right.


Exactly, there's no way they wrote 8GB of code.


I think they did; I've heard similar figures from them before, all deployed as one file. Macroservices...


Unlikely. They probably have a very dirty repo with tons of binaries, images, blah blah blah. It's highly unlikely they actually wrote 8 GB of code, and the 46 GB .git directory will be littered with binary blob changes, etc. This is really just to "impress" two kinds of people: 1) people who love Facebook, and 2) people who don't know anything about version control and/or how to do proper version control (no binaries in the SCM).



I cited this interesting paper from Facebook engineers (2013) in another comment:

https://news.ycombinator.com/item?id=7648802

In 2011 they had 10 million LoC and up to 500 commits a day, but if we assume the plots keep going up like that, by 2014 it could be pretty big.

Their binary was 1.5 GB when the paper was written.


1.5GB binary that includes their dependencies (other binaries).


The big .git directory is probably binary revisions? Is there any good way around that in git?


When I can't avoid referring to binary blobs in git, I put them in a separate repo and link them with submodules. It keeps the main repo trim and the whole thing fast while still giving me end-to-end integrity guarantees.

I wrote https://github.com/polydawn/mdm/ to help with this. It goes a step further and puts each binary version in a separate root of history, which means you can pull down only what you need.
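
Without mdm, the plain-git version of the same idea looks roughly like this (the repo URL and path are made up):

  # keep the big blobs in their own repo, referenced from the main one
  git submodule add https://git.example.com/assets.git assets
  git commit -m "track binary assets as a submodule"

  # collaborators then populate it with
  git submodule update --init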


You're not going to merge binary files, so git isn't the right tool. The standard way is to use Maven: git handles your sources, and anything binary (libs, resources, etc.) goes on the Nexus (where it is versioned centrally) and is referenced in your POMs. Simple and powerful.
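
For example, a stray binary gets pushed to the repository manager once and then referenced by coordinates instead of being committed (the coordinates and URL here are invented):

  mvn deploy:deploy-file \
    -Dfile=libs/legacy-parser.jar \
    -DgroupId=com.example -DartifactId=legacy-parser -Dversion=1.0 \
    -Dpackaging=jar \
    -Durl=https://nexus.example.com/repository/releases \
    -DrepositoryId=example-releases

After that the POM just declares an ordinary dependency on com.example:legacy-parser:1.0 and the binary never touches git.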


git is "the stupid content tracker", not "the stupid merge tool". Even for things you have no intention of branching or merging, it still gives you control over versioning... and there's a huge gap between the level of control a hash tree gives you versus trusting some remote server to consistently give you the same file when you give it the same request.


I'm a bit confused. Whenever I've used git on my projects, I've made sure the binaries were excluded using .gitignore.

Don't other people do that, too? What's the benefit of having binaries stored? I've never needed that; I've never worked on any huge projects, so I might be missing something crucial.
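
In practice that's just a few lines at the root of the repo, something like this (the patterns are just examples of whatever the project produces):

  # append the usual binary suspects to .gitignore
  printf '%s\n' '*.jar' '*.zip' '*.psd' 'build/' 'dist/' >> .gitignore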


If there is a small number of rarely changing binaries (like icons, tool configs, etc.) then it may not be worth it to move them. Also if space is much cheaper than tool complexity and build time.


Well, it depends. Images, for instance, are binaries where a text diff makes little sense, so you end up with a copy of every version of the image ever used. And many projects use programs whose files are binaries. For instance, I've been on a project where Flash was used and the files were checked in. Or Photoshop PSD files, .ai files, etc.


I usually keep a separate repo for PSD/AI files (completely neglected to consider them as binary files).

As for images, icons, fonts and similar, I just have a build script that copies/generates them, if it's needed.

I guess I've always been a little bit "obsessed" about the tidiness of my repositories.


If you don't have the source that produced those binaries, your only choices are to make them downloadable from somewhere else (which is a real hassle for the developer) or to just check them into the repo.


Stripping the binaries out of history with filter-branch, or normalising them with clean/smudge filters (for example, making sure zip-based formats are stored uncompressed).
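
Roughly like this (treat it as a sketch; the path and filter name are made up):

  # rewrite history to drop a directory of blobs -- everyone has to re-clone afterwards
  git filter-branch --index-filter \
    'git rm -r --cached --ignore-unmatch assets/blobs' \
    --prune-empty -- --all

  # or normalise on the way in: a clean/smudge filter pair, wired up via .gitattributes
  git config filter.storezip.clean  storezip-clean   # hypothetical script that re-zips with store-only compression
  git config filter.storezip.smudge cat
  echo '*.zip filter=storezip' >> .gitattributes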


Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo. I guess another way would be to just archive all the history up to a certain point somewhere.


> Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo.

There's only so much git gc can do. We've got a 500 MB repo (.git only, excluding the working copy) at work, for 100k revisions. That's with a fresh clone and having tried most combinations of gc and repack we could think of. Considering the size of Facebook, I can only expect that their repo is deeper (longer history in revision count), broader (more files), more complex and probably full of binary stuff.
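
For reference, "most combinations" means invocations along these lines (standard git flags; the window/depth numbers are arbitrary):

  git gc --aggressive --prune=now
  git repack -a -d -f --window=250 --depth=250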


Is that all actually code? Each commit would have to add an average of 5 KB compressed, so maybe 20 KB of brand-new code.


> Is that all actually code?

Of course not, it's an actual project. There is code, there are data files (text and binary) and there are assets.


maven!


FWIW: Every time you modify a file in git, it adds a completely new copy of the file. 100mb text file, 100mb binary file - makes no difference. Modify one line, it's a new 100mb entry in your git repo.


No, git has delta compression. It only saves the changes.


Only when git repack operations run.


Which you can run at any time. Git also delta-compresses the objects it sends every time you do a git push, and runs an automatic gc after some other operations.
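
If you want to see what is actually loose versus delta-compressed, these standard commands help:

  git count-objects -v   # loose objects and their size vs. what's already in packs
  git repack -a -d       # fold everything into delta-compressed packs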


What about unit tests (and other testing code)? Those could take up a decent amount of space too.


How is 8 GB a lot? That's a small thumb drive's worth.


Could it be it contains a DB of all their users?



