Storing large binary files in Git repositories (2015) (deveo.com)
54 points by pmoriarty on March 8, 2016 | 23 comments


This is a good article, but its git-annex information is getting slightly out of date. git-annex has a new repository format available which makes its interface work more like git-lfs. http://joeyh.name/blog/entry/git-annex_v6

Also, it's worth noting that git-lfs stores 2 copies of each large file on local disk, while git-annex stores only 1 copy, which is a pretty big plus the article left out.


Have you considered using the copy-on-write features that some filesystems (btrfs) now support? For example, the lowly cp command now supports CoW via --reflink=always
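
In case it's useful, here's roughly what that looks like (a sketch; the filenames are made up, and this only works on filesystems with reflink support such as btrfs or XFS):

    # the copy shares data extents with the original, so it uses no extra
    # disk space until one of the two files is modified
    cp --reflink=always big-video.mov big-video-copy.mov

    # --reflink=auto falls back to a normal copy where CoW isn't supported
    cp --reflink=auto big-video.mov big-video-copy.mov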

Also, thanks for git-annex! I use it for hundreds of gigabytes of files, and find the one-copy-per-file support essential for how I use it.


Whoa, this looks great!

Sorry if this is off-topic, but I wonder if anyone knows a way to convert/upgrade an older (created a year ago) git-annex repo to a newer v6 one?



It may be sacrilege, but you could store your binaries in hg and, in hg, use a git subrepository for the git repos that were used to build those binaries.

This is what I'm doing after a disastrous excursion into git-fat.

But really, if git's maintainers don't want to store binaries, why shoehorn it in? Use the right tool for the job.


For some added context about handling binaries in Mercurial, see the (in development) Mercurial book: http://hgbook.org/read/scaling.html#handle-large-binaries-wi...


I also went with the hg solution to take advantage of its core large-file support, but ultimately switched to git-annex so I didn't have to support and learn each technology. git-lfs looks good, but I'm still waiting for a small, standalone, production-ready server implementation.


Having a separate git repo for large files and shallow cloning it should be similar to having another repo in hg/svn for large files.
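
A rough sketch of that layout, with hypothetical repo names:

    # full history for the code, only the latest snapshot of the large files
    git clone git@example.com:project/code.git
    git clone --depth 1 git@example.com:project/large-files.git code/large-files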


We made git-fit, which was inspired by git-media: http://github.com/dailymuse/git-fit

On the plus side, it's stupidly simple. On the down side, it's stupidly simple. The readme explains how it works.


It would be nice if git would support large files natively rather than relying on these extensions.


I have a couple of projects with huge binary assets under source control (movie files, 4K images, essentially the whole website). Deployment is via Jenkins and rsync, and it's really fast after the initial checkout.

How? Well, the framework stores all assets inside /assets, which is normally a git submodule. Jenkins works fine with this setup, and for developers there are a couple of self-written PHP shell scripts that execute a shallow clone outside the repository and then do a bit of rsync magic to sync it all.
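
Something along these lines, roughly (a sketch; the remote and paths here are made up, and it assumes the default branch is master):

    # one-off: shallow clone of the assets repo outside the main working copy
    git clone --depth 1 git@example.com:site/assets.git ../assets-cache

    # later runs: refresh the shallow clone and sync it into the project
    git -C ../assets-cache fetch --depth 1 origin
    git -C ../assets-cache reset --hard origin/master
    rsync -a --delete --exclude '.git' ../assets-cache/ assets/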


Isn't it idiotic? If you can't get a textual diff, or if a textual diff is meaningless, what benefit do you even get from having the file in git?

This sounds like a problem for teams blindly following orders from clueless managers.


    > If you can't get a textual diff, or if a textual diff is meaningless, what benefit do you even get from having the file in git?
A nice property is that you can check out the whole repository with a single tool. All assets are downloaded at once, and the proper version of the code always comes with the proper version of the binary assets.


Version control isn't about generating a textual diff. It's about storing all of the assets necessary to be able to generate an application. Some of those assets are code, some are binary. And you want to keep them in sync, in some way.

Using multiple different systems to get myself "up to date" is a massive pain - I should be able to run one command and then have a wholly buildable set of files.


So, you keep your compiler source and binary in your project source tree?

yeah, you don't. :D


I suspect many of the binary-in-git cases could be better handled with just a bit more work on the part of repo maintainers. There are many package managers that work well with binary files, for instance, and part of the build process could be calling out to those. Then you'd have a nice separation between assets that are generated from code and assets that are generated by some fiddly manual process.


Git isn't usually for the assets generated from code, but for the code itself. Code is an asset generated by a fiddly manual process, and both code and other assets go into producing the final results, so is that really a point of separation?


Perhaps I was unclear. git should store code, and the build process should use that code to build whatever assets might be built from code. All other assets should be stored elsewhere and used by the build and/or deployment process in some appropriate manner that doesn't involve git.

[EDIT:] Perhaps we were both unclear! By "code", I mean stuff that can be read and edited by normal humans, which by some automatic mechanism can be used to control computers. E.g., if you had a log of Photoshop actions that could be replayed to produce a given image, that would be code, and it would be suitable for storage in git. In general, the image itself ought to be elsewhere.


My point is that code and other assets aren't really that different. Both are human produced, human editable files which go into producing the final product. So why should the "code" go into git but the "not-code" human-produced files go elsewhere?


Because code and other assets are in fact that different. Both are human produced, yes, but I'd argue that they aren't human-editable in the same way. Code (text files) is editable with any text editor, while non-code is human-editable only with a specific tool that can decode the file. This brings us to the third property: only code/text files are human-readable without needing to be decoded. And since git is text-file-oriented, it is a good tool for storing text files, but not as good at storing binary files. Right tool for the right job, and all that.

To give another example: diff is a pretty common way to see the changes between two specific versions of the code. Now, if I treated both code and non-code assets as equivalent (because they are human produced, etc.), then I'd expect the tool which manages them to also show me diffs between versions of binary files. But a simple byte-by-byte diff* would be pointless, exactly because non-text files need to be decoded before they can be used. So if I treated them as equivalent, I'd expect a graphical binary asset to be shown as a picture with the differences between two versions, I'd expect a video file to show which time points in the recording differ, etc.

* Well, yes, not exactly byte-by-byte, since it needs to be encoding-aware. But still, much simpler than parsing every random binary file format to show a reasonable diff.


Git is not just textual diff. Branching, tagging, commit history are still useful without diff.


You couldn't do a textual diff, but you could do a binary diff, and storing a bunch of binary deltas could save you a ton of space when you have lots of large, frequently changing binaries to store. Further, you'd have all the other benefits of version control, such as versioning, accountability, comments/logging, etc.
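
For what it's worth, git can already emit and apply binary patches even where a textual diff is meaningless. A small sketch, with a hypothetical asset path:

    # produce a binary patch for the last change to an image
    git diff --binary HEAD~1 HEAD -- assets/logo.png > logo.patch

    # the patch can be applied elsewhere like any other patch
    git apply logo.patch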


I have a git repo over my home folder for incremental backups.
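
A minimal version of that kind of setup, in case anyone wants to try it (the ignore patterns are just examples):

    cd ~
    git init
    # keep caches and other regenerable data out of the backup
    printf '.cache/\nDownloads/\n' > .gitignore
    git add -A
    git commit -m "home snapshot $(date +%F)"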



