This is a good article, but its git-annex information is getting slightly out of date. git-annex has a new repository format available which makes its interface work more like git-lfs. http://joeyh.name/blog/entry/git-annex_v6
Also, it's worth noting that git-lfs stores 2 copies of each large file on local disk, while git-annex stores only 1 copy, which is a pretty big plus that the article left out.
Have you considered using the copy-on-write features that some filesystems (e.g. btrfs) now support? For example, the lowly cp command now supports CoW via --reflink=always.
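For what it's worth, a reflink copy looks something like this (the file names are made up; it needs a reflink-capable filesystem such as btrfs or XFS):

    # copy-on-write copy: completes almost instantly and shares blocks with
    # the original until one of the two files is modified
    cp --reflink=always render_final_4k.mov render_working.mov

With --reflink=always the copy fails outright on filesystems without reflink support, whereas --reflink=auto would silently fall back to a normal full copy.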
Also, thanks for git-annex! I use it for hundreds of gigabytes of files, and find the one-copy-per-file support essential for how I use it.
I also went with the hg solution to enjoy its core large-file support, but ultimately switched to git-annex so I didn't have to support and learn each technology. git-lfs looks good, but I'm still waiting for a small, standalone, production-ready server implementation.
I have a couple of projects with huge binary assets under source control (movie files, 4K images, essentially the whole website). Deployment is via Jenkins and rsync; it's really fast after the initial checkout.
How? Well, the framework stores all assets inside /assets, which is normally a git submodule. Jenkins works just fine with this setup, and for developers there are a couple of self-written PHP shell scripts that execute a shallow clone outside the repository and then do a bit of rsync magic to sync it all.
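The developer-side step is roughly along these lines (the real scripts are PHP; the repository URL and paths here are invented):

    # shallow clone of only the latest revision, kept outside the main working copy
    git clone --depth 1 git@example.com:site/assets.git /tmp/assets-shallow

    # the "rsync magic": only files that actually changed get transferred
    rsync -a --delete /tmp/assets-shallow/ /path/to/site/assets/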
> if you can't get a textual diff, or if a textual diff is meaningless, what benefit do you even get having the file on git?
A nice property is being able to check out the whole repository using only a single program. All assets are downloaded at once, and the proper version of the code always comes with the proper version of the binary assets.
Version control isn't about generating a textual-diff. It's about storing all of the necessary assets to be able to generate an application. Some of those assets are code, some are binary. And you want to keep them in sync, in some way.
Using multiple different systems to get myself "up to date" is a massive pain - I should be able to run one command and then have a wholly buildable set of files.
I suspect many of the binary-in-git cases could be better handled with just a bit more work on the part of repo maintainers. There are many package managers that work well with binary files, for instance, and part of the build process could be calling out to those. Then you'd have a nice separation between assets that are generated from code and assets that are generated by some fiddly manual process.
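As a minimal sketch of that idea (the artifact host, file names and versions are all hypothetical), the build would fetch pinned binary assets instead of git carrying them:

    #!/bin/sh
    # fetch-assets.sh - called from the build; only the pinned versions live in git
    set -eu
    ASSET_BASE="https://artifacts.example.com/website-assets"
    ASSET_VERSION="2.4.1"

    mkdir -p assets
    curl -fSL -o assets/intro-4k.mp4 "$ASSET_BASE/$ASSET_VERSION/intro-4k.mp4"
    curl -fSL -o assets/hero.psd "$ASSET_BASE/$ASSET_VERSION/hero.psd"

The repository then versions only this small script and the pinned version numbers, while the assets produced by fiddly manual processes live in a store built for large binaries.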
Git isn't usually for the assets generated from code, but for the code itself. Code is an asset generated by a fiddly manual process, and both code and other assets go into producing the final results, so is that really a point of separation?
Perhaps I was unclear. git should store code, and the build process should use that code to build whatever assets might be built from code. All other assets should be stored elsewhere and used by the build and/or deployment process in some appropriate manner that doesn't involve git.
[EDIT:] Perhaps we were both unclear! By "code", I mean stuff that can be read and edited by normal humans, which by some automatic mechanism can be used to control computers. E.g., if you had a log of photoshop actions that could be replayed to produce a given image, that would be code, and it would be suitable for storage in git. In general, the image itself ought to be elsewhere.
My point is that code and other assets aren't really that different. Both are human produced, human editable files which go into producing the final product. So why should the "code" go into git but the "not-code" human-produced files go elsewhere?
Because code and other assets are in fact that different. Both are human produced, yes, but I'd argue that they aren't exactly human-editable in the same way. Code (text files) can be edited with any text editor, while non-code is human-editable only with a specific tool that can decode the file. This brings us to the third property: only code/text files are human-readable without the need to decode them. And since git is text-file-oriented, it is a good tool for storing text files, but not as good for storing binary files. Right tool for the right job, and all that.
To give another example: diffing is a pretty common way to see the changes between two specific versions of the code. Now if I treated both code and non-code assets as equivalent (because they are human produced, etc.), then I'd expect the tool which manages them to also show me a diff between versions of binary files. But a simple byte-by-byte diff* would be pointless, exactly because non-text files need to be decoded before they can be used. So if I treated them as equivalent, I'd expect a diff of graphical binary assets to show me the picture with the differences between two versions; I'd expect a diff of a binary video file to show me which points in time differ in the recording, etc. etc.
* Well, yes, not exactly byte-by-byte, since it needs to be encoding-aware. But still, that's much simpler than parsing any random binary file format to show a reasonable diff.
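For what it's worth, git can get part of the way there for some formats with a textconv diff driver, which diffs extracted metadata rather than the actual pixels (exiftool is just one possible converter):

    # tell git to run these files through a converter when diffing
    echo '*.png diff=exif' >> .gitattributes
    echo '*.jpg diff=exif' >> .gitattributes
    git config diff.exif.textconv exiftool

That gives a readable diff of dimensions, timestamps and the like, but an actual visual diff of the picture still needs a dedicated tool, which is exactly the point above.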
You couldn't do a textual diff, but you could do a binary diff, and storing a bunch of binary deltas could save you a ton of space when you have lots of large, frequently changing binaries to store. Further, you'd get all the other benefits of version control: history, accountability, commit messages/logging, etc.
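As a rough illustration of the binary-delta point (xdelta3 is just one delta tool; the file names are made up):

    # encode: store only the difference between two versions of a large asset
    xdelta3 -e -s texture_v1.psd texture_v2.psd texture_v1_to_v2.vcdiff

    # decode: rebuild v2 from v1 plus the (usually much smaller) delta
    xdelta3 -d -s texture_v1.psd texture_v1_to_v2.vcdiff texture_v2_rebuilt.psd

Git's packfiles already do delta compression even for binary blobs, though how much it helps depends heavily on the format (already-compressed files delta poorly).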