
Pay attention to the footnote:

  *8 GB plus 46 GB .git directory



The new depth option [1] on git clone as of 1.9.x should make the .git dir much more reasonable.

> --depth <depth>: Create a shallow clone with a history truncated to the specified number of revisions.

If the OS and the applications that create and modify files are all honest about file dates, it should be possible to scan only the dates instead of reading every file. Or even use inotify-like events to track what changed.

[1]: http://git-scm.com/docs/git-clone

EDIT: As sisk points out below, --depth itself is not new, but as of 1.9 the limitations that previously came with shallow clones were lifted. Thanks, sisk.
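
For example, a minimal shallow clone looks like this (just a sketch; the URL is a placeholder):

  git clone --depth 1 https://github.com/example/bigrepo.git
  cd bigrepo

  # later, if you decide you want the full history after all:
  git fetch --unshallow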


I've wanted this option for so long :-) Are there any downsides to it, apart from the obvious one of 'git log' not going all the way back in time? I mean, do any commands break unexpectedly?


The --depth flag has been around for a long time but, previously, came with a number of limitations (cannot push from, pull from, or clone from a shallow clone). As of 1.9.0, those limitations have been removed.

https://github.com/git/git/commit/82fba2b9d39163a0c9b7a3a2f3...
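
In practice that means a workflow like this is now fine (sketch only; the URL and branch name are made up):

  git clone --depth 50 https://git.example.com/huge.git
  cd huge
  git push origin my-feature    # pushing from a shallow clone used to be refused
  git fetch --depth=100         # deepen the history later if you need more context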


8 GB is still a lot. Would be interesting to know how much of it is actual code and how much is just images and so on.


I bet they check in (or have checked in at some point) 3rd-party libraries, jar files, generated files, etc. I battle this every day at $dayjob: we have a multi-GB Subversion repository with a separate one for the 3rd-party binaries. Svn handles this a bit better than the DVCSes, since just checking out HEAD is smaller than the full history you get in git/hg, and you can clean up some of the crud; the history lives only on the central server, not in everyone's working copy.


You can clone with a single revision in git. Still ends up being around 2X the size I believe since you have the objects stored for the single revision as well as the working tree.


2X the size of what? The raw files? Assuming little shared data, sure. But svn will make two copies of every file too.


Facebook now has 7000 employees and is 10 years old. To produce 8 GiB of code, each employee would have had to write about 340 bytes of code (roughly 8 LOC at 40 (random guess of mine) characters per LOC) every day during those 10 years. (Obviously assuming the repository contains only non-compressed code and only one version of everything and no metadata and...)
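
For the record, the division works out like this (a quick bash sanity check; 8 GiB spread over 7000 people and a 3650-day decade):

  echo $(( 8 * 1024**3 / 7000 / 3650 ))   # ~336 bytes per employee per day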


If they've got memcached with their own patches, Linux with their own patches, Hadoop with their own patches, etc., and tons of translations, I can see 8 gigs of text.


Why would they put that all in the same repository? I'm pretty sure this 8 GB repo is just their website code. A frontend dev working on a Timeline feature shouldn't have to check out the Linux kernel.


8 GB would be at least 100 million lines of code (taking 80 characters per line as an upper bound). For comparison, Linux has 15+ million lines of code and PHP 2+ million.


PHP's repo is around 500 MB. But PHP is probably tens to a hundred times smaller than Facebook's codebase should be, especially if you count the non-public stuff they must have in there. So it comes out about right.


Exactly, there's no way they wrote 8GB of code.


I think they did; I've heard similar figures from them before, all deployed as one file. Macroservices...


Unlikely. They probably have a very dirty repo with tons of binaries, images, blah blah blah. It's highly unlikely they actually wrote 8 GB of code, and the 46 GB .git directory will be littered with binary blob changes, etc. This is really just to "impress" two kinds of people: 1) people who love Facebook, and 2) people who don't know anything about version control and/or how to do proper version control (no binaries in the SCM).



I cited this interesting paper from Facebook engineers (2013) in another comment:

https://news.ycombinator.com/item?id=7648802

In 2011 they had 10 million LoC and up to 500 commits a day, but if we assume the plots keep going up like that, by 2014 it could be pretty big.

Their binary was 1.5 GB when the paper was written.


1.5GB binary that includes their dependencies (other binaries).


The big .git directory is probably binary revisions? Is there any good way around that in git?


When I can't avoid referring to binary blobs in git, I put them in a separate repo and link them with submodules. It keeps the main repo trim and the whole thing fast while still giving me end-to-end integrity guarantees.

I wrote https://github.com/polydawn/mdm/ to help with this. It goes a step further and puts each binary version in a separate root of history, which means you can pull down only what you need.
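
Without mdm, the plain-git version of the same idea looks roughly like this (the repo URL and path are made up):

  # keep the big blobs in their own repo, referenced from the main one
  git submodule add https://git.example.com/assets.git assets
  git commit -m "track binary assets as a submodule"

  # collaborators then populate it with
  git submodule update --init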


You're not going to merge binary files, so git isn't the right tool. The standard way is to use Maven: git handles your sources, and anything binary (libs, resources, etc.) goes on the Nexus (where it is versioned centrally) and is referenced in your POMs. Simple and powerful.
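
For example, a stray binary gets pushed to the repository manager once and then referenced by coordinates instead of being committed (the coordinates and URL here are invented):

  mvn deploy:deploy-file \
    -Dfile=libs/legacy-parser.jar \
    -DgroupId=com.example -DartifactId=legacy-parser -Dversion=1.0 \
    -Dpackaging=jar \
    -Durl=https://nexus.example.com/repository/releases \
    -DrepositoryId=example-releases

After that the POM just declares an ordinary dependency on com.example:legacy-parser:1.0 and the binary never touches git.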


git is "the stupid content tracker", not "the stupid merge tool". Even for things you have no intention of branching or merging, it still gives you control over versioning... and there's a huge gap between the level of control a hash tree gives you versus trusting some remote server to consistently give you the same file when you give it the same request.


I'm a bit confused. Whenever I've used git on my projects, I've made sure the binaries were excluded using .gitignore.

Don't other people do that, too? What's the benefit of having binaries stored? I've never needed that; I've never worked on any huge projects, so I might be missing something crucial.
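
In practice that's just a few lines at the root of the repo, something like this (the patterns are just examples of whatever the project produces):

  # append the usual binary suspects to .gitignore
  printf '%s\n' '*.jar' '*.zip' '*.psd' 'build/' 'dist/' >> .gitignore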


If there is a small number of rarely changing binaries (like icons, tool configs, etc.) then it may not be worth it to move them. Also if space is much cheaper than tool complexity and build time.


Well, it depends. Images, for instance, are binaries where a text diff makes little sense, so you end up with a copy of every version of the image ever used. And many projects use programs whose files are binaries. For instance, I've been on a project where Flash was used and the files were checked in. Or Photoshop PSD files, .ai files, etc.


I usually keep a separate repo for PSD/AI files (completely neglected to consider them as binary files).

As for images, icons, fonts and similar, I just have a build script that copies/generates them, if it's needed.

I guess I've always been a little bit "obsessed" about the tidiness of my repositories.


If you don't have the source that produced those binaries, your only choices are to make them downloadable from somewhere else (which is a real hassle for the developer) or to just check them into the repo.


Stripping the binaries out of history with filter-branch, or normalising them with clean/smudge filters (for example, making sure zip-based formats are stored uncompressed).
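
Roughly like this (treat it as a sketch; the path and filter name are made up):

  # rewrite history to drop a directory of blobs -- everyone has to re-clone afterwards
  git filter-branch --index-filter \
    'git rm -r --cached --ignore-unmatch assets/blobs' \
    --prune-empty -- --all

  # or normalise on the way in: a clean/smudge filter pair, wired up via .gitattributes
  git config filter.storezip.clean  storezip-clean   # hypothetical script that re-zips with store-only compression
  git config filter.storezip.smudge cat
  echo '*.zip filter=storezip' >> .gitattributes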


Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo. I guess another way would be to just archive all the history up to a certain point somewhere.


> Re-cloning a fresh repo should keep it small. There's also the git gc command, which cleans up the repo.

There's only so much git gc can do. We've got a 500 MB repo (.git only, excluding the working copy) at work, for 100k revisions. That's with a fresh clone and having tried most combinations of gc and repack we could think of. Considering the size of Facebook, I can only expect that their repo is deeper (longer history in revision count), broader (more files), more complex and probably full of binary stuff.
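
For reference, "most combinations" means invocations along these lines (standard git flags; the window/depth numbers are arbitrary):

  git gc --aggressive --prune=now
  git repack -a -d -f --window=250 --depth=250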


Is that all actually code? Each commit would have to add an average of 5 KB compressed, so maybe 20 KB of brand-new code.


> Is that all actually code?

Of course not, it's an actual project. There is code, there are data files (text and binary) and there are assets.


maven!


FWIW: Every time you modify a file in git, it adds a completely new copy of the file. 100mb text file, 100mb binary file - makes no difference. Modify one line, it's a new 100mb entry in your git repo.


No, git has delta compression. It only saves the changes.


Only when git repack operations run.


Which you can run at any time. Git also delta-compresses the objects it sends every time you do a git push, and runs an automatic gc after some other operations.
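
If you want to see what is actually loose versus delta-compressed, these standard commands help:

  git count-objects -v   # loose objects and their size vs. what's already in packs
  git repack -a -d       # fold everything into delta-compressed packs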


What about unit tests (and other testing code)? Those could take up a decent amount of space too.


How is 8 GB a lot? That's a small thumb drive's worth.


Could it be it contains a DB of all their users?



