

Ask HN: Large code repository in Git? - ansible

Hello All,

We've currently got a large repository we'd like to maintain in git. By
'large', I mean 45000 files, and 5GB in disk space. The files consist of
source code as well as binaries (compilers, other tools, compiled libraries,
etc.).

I'd prefer to just keep all this in a single git repository if possible.
However, as you may know, git struggles with repositories this large, and
operations that normally take milliseconds instead take seconds (or longer).
Overall it is not a pleasant experience.

I'm aware of other options like Subversion and Perforce that might be a
better fit for this, but I'd like to know what my best options are to stay
with git.

I've used git-submodules and Google's repo [0] before, and found the workflow
with these to be somewhat cumbersome.

So far, I've been looking at git-annex [1], git-fat [2], and git-media [3].
I was wondering what other options I might have. Of these, git-annex seems
the most mature, though it involves a significant workflow change (see the
sketch after the links below). We aren't really distributed in the sense that
git-annex handles, where users may have a bunch of working directories all
over the place (on a desktop, laptop, at home, on a USB drive), each of which
may have a different subset of all the files.

With git-media and git-fat, the user designates certain file extensions to be
handled specially, which may work OK for us. I'm just not sure.

I'd like to hear from anyone who has used any of these tools for development,
to get a better idea of the performance as well as the ups and downs of daily
usage.

Thanks!

[0] https://source.android.com/source/developing.html
[1] https://git-annex.branchable.com/
[2] https://github.com/jedbrown/git-fat
[3] https://github.com/alebedev/git-media
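
For anyone unfamiliar with git-annex, the workflow change I mean looks
roughly like this (a sketch from my reading of its docs; the file name is
just an example):

    git annex init                   # one-time setup per clone
    git annex add tools/arm-gcc.tgz  # file becomes a symlink; content moves into .git/annex
    git commit -m 'add toolchain'
    git annex get tools/arm-gcc.tgz  # fetch the real content from a remote that has it
    git annex copy --to origin tools/arm-gcc.tgz  # push content to a chosen remote

So every clone tracks which remotes hold which content, which is exactly the
distributed-working-copy model we don't really need.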
======
danudey
We have a highly distributed system at our company, with multiple library
packages, daemons, django apps, etc., as well as configuration files, and
several 'media' git repos which are in the 5GB range.

Our solution was to build a simple (well, originally simple) tool to handle
cloning out our code. Each of our repos has a .config file in it which defines
what it depends on, and we have a simple script which 'installs' packages by
running git clone and then creating symlinks into the proper places. Because
we're not using submodules, we don't have any of the standard submodule issues
(e.g. if you do a 'git pull' inside one, you have to commit that to the main
repository and push, or it shows as a change).

The tool also handles scanning through our 'install' directory to determine
what packages are installed, doing a 'git pull', running post-install and
post-update hooks, multi-branch support, showing local/uncommitted changes,
and so on.

The other option for you (since our dependency list is fairly flat and
uncomplicated, and yours may not be) would be to create a yaml/xml/json file
to specify the directory structure and which packages go where, and then have
the tool automatically update them appropriately. Alternately, store a
.config.yml file in each repo telling what other repos should be cloned where,
and let your tool recurse through it.
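
As a rough sketch of that recursive idea (the .config.yml format and field
names here are hypothetical, and it assumes PyYAML is installed):

    # install.py -- minimal sketch of recursive dependency cloning.
    # Each repo may carry a .config.yml like:
    #   deps:
    #     - url: git@example.com:libs/media.git
    #       path: vendor/media
    import os
    import subprocess
    import yaml

    def install(repo_dir):
        cfg_path = os.path.join(repo_dir, ".config.yml")
        if not os.path.exists(cfg_path):
            return
        with open(cfg_path) as f:
            config = yaml.safe_load(f) or {}
        for dep in config.get("deps", []):
            dest = os.path.join(repo_dir, dep["path"])
            if not os.path.isdir(dest):
                subprocess.check_call(["git", "clone", dep["url"], dest])
            install(dest)  # recurse into the dependency's own .config.yml

    install(".")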

The nice thing about a tool like that is you can add functionality such as
batch-commit (commit all changes in all or specified repositories with the
same commit message, create the same tag in each repo, etc.), batch-rollback,
and so on.

Just a thought.

~~~
ansible
_The other option for you (since our dependency list is fairly flat and
uncomplicated, and yours may not be) would be to create a yaml/xml/json file
to specify the directory structure and which packages go where, and then have
the tool automatically update them appropriately. Alternately, store a
.config.yml file in each repo telling what other repos should be cloned where,
and let your tool recurse through it._

_The nice thing about a tool like that is you can add functionality such as
batch-commit (commit all changes in all or specified repositories with the
same commit message, create the same tag in each repo, etc.), batch-rollback,
and so on._

I think you've basically re-invented Google's repo application. :-)

~~~
stevekemp
Or even the "mr" tool, which makes it easy to work with multiple repositories.

I have a few locations where I have a config file listing repositories, their
target destinations, and some arbitrary shell commands (read: making symlinks)
that I distribute to people. Users just run "mr -c config.cfg checkout" to get
all the code in the right place.

[http://myrepos.branchable.com/](http://myrepos.branchable.com/)
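
For illustration, a minimal mr config in that style might look like this
(paths, URLs, and the symlink step are all hypothetical):

    # config.cfg -- each section is a checkout path relative to this file
    [src/liba]
    checkout = git clone git://example.com/liba.git liba
    fixups = mkdir -p ../../install && ln -sfn "$PWD" ../../install/liba

    [src/appb]
    checkout = git clone git://example.com/appb.git appb

mr runs the "checkout" command to create each repo, and "fixups" after every
checkout or update, which is where the symlinking fits in.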

------
anaidioschrono
A 4th option that is comparable to git-fat is git-bin
([https://github.com/seeq12/git-bin/tree/v1.5](https://github.com/seeq12/git-bin/tree/v1.5)).
We at Seeq use it to store large binary files that would otherwise clutter the
repo, much as you describe. While git-bin is a C# project, it runs well under
Mono, and we use it on Windows, OS X, and Ubuntu. It's a small, simple program
that fits seamlessly into the standard Git workflow. We forked it from the
original author and have been actively updating it to improve its robustness
and ease of use.

~~~
ansible
Thanks for sharing that.

------
brudgers
Atlassian, the company behind Bitbucket, discusses on their blog [1] the idea
that repositories grow in two orthogonal directions:

+ Number of files

+ Size on disk due to binaries

along with some approaches to managing this growth. They also appear to do
some consulting.

[1] [http://blogs.atlassian.com/2014/05/handle-big-repositories-git/](http://blogs.atlassian.com/2014/05/handle-big-repositories-git/)

------
caust1c
I'm a bit late to this discussion, but I put a lot of work into this fork of
git-fat: [https://github.com/cyaninc/git-fat](https://github.com/cyaninc/git-fat)

We chose it at the time for the reason you mentioned: git-annex was really
difficult to set up and manage in a team environment, and the fact that it
used a separate branch made things difficult for those new to git (we were
constantly hunting down lost files on people's drives).

The other major plus for us was that the files weren't symlinks. This made our
deployment process easier since there was one less edge case to check.
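
For anyone evaluating it, the basic git-fat setup (as I recall it from the
README; the patterns and host below are just examples) is a filter in
.gitattributes plus an rsync remote in .gitfat:

    # .gitattributes -- route matching files through the fat filter
    *.deb filter=fat -crlf
    *.tar.gz filter=fat -crlf

    # .gitfat -- where the real content lives
    [rsync]
    remote = storage.example.com:/srv/git-fat-store

After that it's "git fat init" once per clone, and "git fat push" / "git fat
pull" to move the actual blobs to and from the store.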

Let me know if you have any questions!

------
akushner1
Do you know how your repo is trending -- how big will it be in 1 year? In 5
years? (In number of commits and number of files.)

45,000 files is roughly the size of the Linux kernel git repo, and is small
compared to repos at large companies. What operations are slow for you? How
many people use your repo on a regular basis?

------
coppolaemilio
If those 45000 files fall into separate chunks, you could try separate repos;
unfortunately I don't see any better way. Are the binaries necessary? Could
you write an automated script to build them instead and keep just the
sources? It might be a good opportunity to clean up that codebase :)

Good luck!

~~~
ansible
If we go with separate repos, then we'll need Google's repo tool,
git-submodules, or some other solution.

The binaries have been created by a third party, and can't necessarily be
built by us. I haven't checked in all cases; the codebase is a twisty maze of
different build systems, not all integrated.

------
curtis
One thing I'm unclear on -- do you already have a really large Git repository
which is getting too slow to use, or are you just contemplating building a
repo of that size and worried that it might be too large?

~~~
ansible
We've been trying to use git for this existing repository, but it is too slow
to use, whether because of its total size, the number of files, or both.

~~~
curtis
I had a very nice write-up about how Git can efficiently handle much larger
repositories than people think, but it sounds like it doesn't apply in your
case.

Something that might be worth trying: if you have a bunch of files that will
very rarely change, you might put them in a tar file and modify your build
system to unpack them on demand. This may sound crazy, but in my experience
it's not raw file size that gives Git problems; it's simply the file count. I
don't know about the latest Git versions, but back in the 1.8 era when I was
doing extensive testing, Git was notorious for doing a file-stat on every
single checked-out file every time you ran "git status". If you've got fewer
files, regardless of size, your performance will likely be better. (The
initial git clone is of course an exception.)
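
As a sketch of that pack-and-unpack idea (hypothetical paths; it only uses
Python's standard tarfile module):

    # vendor_pack.py -- keep a rarely-changing tree as one tarball in Git,
    # so "git status" stats a single file instead of thousands.
    import tarfile

    def pack(src_dir, tar_path):
        with tarfile.open(tar_path, "w:gz") as tar:
            tar.add(src_dir, arcname=".")

    def unpack(tar_path, dest_dir):
        with tarfile.open(tar_path, "r:gz") as tar:
            tar.extractall(dest_dir)

    # e.g. run pack("vendor/toolchain", "vendor/toolchain.tar.gz") once,
    # commit the tarball, and have the build unpack it on demand.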

