

How to handle big repositories with Git - mmastrac
http://blogs.atlassian.com/2014/05/handle-big-repositories-git/ 

======
theli0nheart
I wrote a Git extension about a year ago that transparently stores data in S3
/ Cloudfiles / etc. and doesn't store any of the actual data in your Git repo.
I've used it with a few projects but I think it could be battle tested a bit
more. It integrates perfectly with GitHub / Bitbucket. Pull requests welcome!

[https://github.com/lionheart/git-bigstore](https://github.com/lionheart/git-bigstore)

------
greggman
Version control alone is not enough for binary files. Binary files need access
control, since changes to them cannot be merged. Older centralized version
control systems often provide access control as well; distributed version
control can't, by its very nature.

~~~
jordigh
Like they say, there's an extension for that:

[http://mercurial.selenic.com/wiki/LockExtension](http://mercurial.selenic.com/wiki/LockExtension)

~~~
greggman
Unfortunately, by definition that's not a solution. Because hg is distributed,
locking the file locally does nothing unless the fact that it's locked is
propagated to everyone else's copy of the repo immediately, and that doesn't
happen.

~~~
jordigh
By definition?

Did you even see what this does?

------
jayvanguard
> Even though the bounds that identify a repository as massive are pretty high
> – for example the latest Linux kernel clocks at 15+ million lines of code

Yeah 15 million lines of code isn't a massive repository, it is medium-large
at best. Any one of the big enterprise software companies has repos an order
of magnitude bigger for each major product they sell.

------
CJefferson
This is very disappointing when it comes to dealing with large files.

It looks like git-bigfiles was abandoned years ago, after making very little
progress, and neither bitbucket nor github seem to usefully support git-annex.

------
jordigh
I have been quite happy with Mercurial's largefile extension, which has been
part of core hg for quite some time:

[http://mercurial.selenic.com/wiki/LargefilesExtension](http://mercurial.selenic.com/wiki/LargefilesExtension)

I think this must be one of the reasons why hg has some popularity in gamedev.
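For reference, since largefiles ships with core hg, enabling it is just a config stanza (a sketch; the filename below is illustrative):

```
[extensions]
largefiles =
```

After that, `hg add --large bigvideo.avi` tracks the file by hash and pulls revisions on demand, instead of every clone carrying every version of every binary.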

------
jonalmeida
Didn't Linus mention this in his talk at Google? Mercurial offers similar
speeds, but Perforce would be the option for large binary files.

I personally haven't come across any need to use anything except shallow
clones in large repos. Most of the time, you want to keep those other topic
branches regardless.
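
A shallow clone is a one-liner. Here's a minimal sketch against a throwaway local repo (paths under /tmp are illustrative); the `file://` prefix matters because plain local-path clones ignore `--depth`:

```shell
set -e
rm -rf /tmp/demo-src /tmp/demo-shallow
# Build a tiny repo with two commits to stand in for a big remote.
git init -q /tmp/demo-src
git -C /tmp/demo-src -c user.email=a@b -c user.name=demo \
    commit -q --allow-empty -m "first"
git -C /tmp/demo-src -c user.email=a@b -c user.name=demo \
    commit -q --allow-empty -m "second"
# --depth 1 fetches only the newest commit, not the full history.
git clone -q --depth 1 file:///tmp/demo-src /tmp/demo-shallow
git -C /tmp/demo-shallow rev-list --count HEAD   # prints 1
```

If you later need the full history, `git fetch --unshallow` converts the clone in place.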

They've linked to their previous post regarding submodules in the post[1], but
it's worth re-mentioning that if you think you need submodules, you should
almost always use subtrees instead.

[1]: [http://blogs.atlassian.com/2013/05/alternatives-to-git-submo...](http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/)

------
jradd
I cannot tell if I am way off topic here, so pardon my attempt to hijack your
thread, but does anybody know of a solid solution in this respect? Or even a
good way to simply sync and manage larger datasets or files over the web.

I am not so concerned about version tracking/management so much as a good way
to sync large amounts of data while having a git-like CLI arsenal, or a
comparable CLI solution, or an API with decent version tracking and solid
support.

I have mounted Google Drive as a Linux volume, which is buggy to say the
least. I have used GitHub and local git repos, but I am wondering if I am
missing out on anything.

For example, BitTorrent has a Sync tool that seems close to what I want.

~~~
montecarl
I don't have an answer, but I have an additional complaint. Git itself is
pretty terrible at dealing with large repositories, if for no other reason
than that "git clone" has no resume feature. If your internet connection is
interrupted at any point during the initial clone, it dies and you have to
start over. This means that on an unreliable connection it is nearly
impossible to clone large repositories.

~~~
jerf
A git clone is not the same as just rsyncing a .git directory, but if you know
Git, the latter can be made to work. You may want to GC the source repo first,
and you'll need to manually add a remote and possibly clean up branches after
you're done, but a quick bash script can finish that up a treat. (It's a good
task to practice bash scripting on, if you're not familiar with it.)
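
Roughly that workflow, sketched with local paths standing in for the remote (over the network you'd use something like `rsync -aP user@host:project/.git/ project/.git/`, which can resume a partial transfer):

```shell
set -e
rm -rf /tmp/upstream /tmp/copy
# Stand-in for the remote repository.
git init -q /tmp/upstream
echo "data" > /tmp/upstream/file.txt
git -C /tmp/upstream add file.txt
git -C /tmp/upstream -c user.email=a@b -c user.name=demo commit -q -m "init"
# GC first so the transfer is a few large packfiles (which rsync can
# resume mid-file) rather than thousands of loose objects.
git -C /tmp/upstream gc --quiet
# Over the network this would be rsync; cp -a simulates it locally.
mkdir /tmp/copy
cp -a /tmp/upstream/.git /tmp/copy/.git
cd /tmp/copy
# Finish by hand: the copy has no remote configured and an empty
# working tree.
git remote add origin /tmp/upstream
git reset --hard -q
```

After this, `git fetch origin` works normally, since a copied .git directory is a fully valid repository.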

~~~
bronson
I've done that; it was a waste of time. There's just no excuse for git being
unable to resume a clone.

------
shoo
some of these problems / symptoms can be framed as a dependency-management
problem, which can be addressed by other tools:
[http://blogs.atlassian.com/2014/04/git-project-dependencies/](http://blogs.atlassian.com/2014/04/git-project-dependencies/)

that said, that approach introduces problems of its own

------
jephir
I've used submodules before to separate game assets and it just ended up being
more trouble than it's worth. The problem is that it separates the history
between your two repositories so it takes more time to figure out which asset
commit relates to which code commit. In the end, we just merged everything
back into one repository.

~~~
eropple
Agreed. I have a bit of a hack that I've started to use - I've been using SVN
for my game assets, with git hooks to pull the correct version. Still not
optimal, but better.

------
caust1c
I'll throw in a word for git-fat[1]. It works great and has a dead-simple
implementation that keeps a real copy of the file in the working copy of the
repository (as opposed to symlinks).

[https://github.com/cyaninc/git-fat](https://github.com/cyaninc/git-fat)
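
As I understand it from git-fat's README, the setup is just two small config files in the repo root (patterns and the rsync host below are illustrative):

```
# .gitattributes -- route matching files through the fat filter
*.psd filter=fat -crlf
*.zip filter=fat -crlf

# .gitfat -- where the real bytes live
[rsync]
remote = storage.example.com:/srv/git-fat-store
```

then `git fat init` once per clone, and `git fat push` / `git fat pull` to move the binaries to and from the rsync store.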

------
hackerpolicy
Would Android's `repo` be a good solution to this?

~~~
fishywang
Yes and no.

Repo was born to solve this kind of problem, but it might not be a "good"
solution. We (the git/gerrit team at Google) are working on bringing
cross-repository atomic submits and other features to git/gerrit, and our
goal is to replace repo with git submodules.

Here are some _very brief_ slides[1] and notes[2] on this topic from this
year's Gerrit User Summit.

[1]
[https://docs.google.com/presentation/d/1qG1eAiDmyozZBiVE6R4A...](https://docs.google.com/presentation/d/1qG1eAiDmyozZBiVE6R4AVqaZQGRjSUldsDkvFW8AMzA/edit#slide=id.g1d6bea5f2_0134)

[2]
[https://docs.google.com/document/d/1a2eFhVr1HUiKOjhaRHn_89mf...](https://docs.google.com/document/d/1a2eFhVr1HUiKOjhaRHn_89mfnDAIsfpEiHd858ks8vs/edit)

~~~
ibrahima
That's very interesting, as someone who occasionally contributes to the odd
Android fork. I kind of feel like repo is a weird semi-black box which does
things I don't expect, like reverting my topic branches back to the remote
branch while leaving the name in place (it does this if your topic branch does
not have a remote tracking branch and you run repo sync). At a minimum, repo
should play nicely with standard git workflows. I should probably just read
the repo source, but I can't imagine why this behavior would be a good idea,
and most of the time I'm left feeling like I would rather just use git
directly.

------
TheJamie
lol @ handling big repositories

