
Announcing Git Large File Storage - dewski
https://github.com/blog/1986-announcing-git-large-file-storage-lfs
======
theli0nheart
I'm sure GitHub did their due diligence before starting to work on this, but I
can't lie: it bums me out a bit that they didn't find git-bigstore [1] (a
project I wrote about 2 years ago) before they started, since it works in
almost the exact same way. Three-line pointer files, smudge and clean filters,
use of .gitattributes for which files to sync, and remote service integration.

Compare "Git Large File Storage"'s file spec:

    
    
        version https://git-lfs.github.com/spec/v1
        oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
        size 12345
    

And bigstore's:

    
    
        bigstore
        sha256
        96e31e44688cee1b0a56922aff173f7fd900440f
    

Bigstore has the added benefit of keeping track of file upload / download
history _entirely in Git_, using Git notes (an otherwise not-so-useful
feature). Additionally, Bigstore is also _not_ tied to any specific service.
There are built-in hooks to Amazon S3, Google Cloud Storage, and Rackspace.

Congrats to GitHub, but this leaves a sour taste in my mouth. FWIW,
contributions are still welcome! And I hope there is still a future for
bigstore.

[1]: [https://github.com/lionheart/git-bigstore](https://github.com/lionheart/git-bigstore)

~~~
jedbrown
Looks like you wrote git-bigstore a few months after I wrote git-fat (also
Python and a similar design; partially inspired by git-media). It would be
interesting to do some performance comparisons and merge our capabilities,
perhaps with support for each other's stub formats if we can do it in a
compatible way.

~~~
tomphoolery
It's also worth mentioning that git-media was written by one of the founders
of GitHub. So... that may be why? :)

------
joeyh
It's interesting that this uses smudge/clean filters. When I considered using
those for git-annex, I noticed that the smudge and clean filters both had to
consume the entire content of the file from stdin. Which means that eg, git
status will need to feed all the large files in your work tree into git-lfs's
smudge filter.
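
For reference, these filters are wired up through .gitattributes plus a filter
driver in git config, roughly like this (the driver name and commands here are
illustrative, not git-lfs's actual ones):

    # .gitattributes
    *.psd filter=bigfiles
    
    # one-time config; %f is substituted with the path being filtered
    git config filter.bigfiles.clean  "bigfiles-clean %f"
    git config filter.bigfiles.smudge "bigfiles-smudge %f"

Both commands receive the full file content on stdin and must write the
replacement to stdout, which is exactly the scaling concern here.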

I'm interested to see how this scales. My feeling when I looked at it was that
it was not sufficiently scalable without improving the smudge/clean filter
interface. I mentioned this to the git devs at the time and even tried to
develop a patch, but AFAICS, nothing yet.

Details: [https://git-annex.branchable.com/todo/smudge](https://git-annex.branchable.com/todo/smudge)

~~~
jedbrown
As the author of git-fat, I have to say the smudge/clean filter approach is a
hack for large files and the performance is not good for a lot of use cases.
The reality is that it's common to need fine-grained control over what files
are really present in the repository, when they are cached locally, and when
they are fetched over the network. Git-annex does better than the smudge/clean
tools (git-fat, git-media, git-lfs) but at somewhat increased complexity. I
think our tools have stepped over the line of "as simple as possible but no
simpler" and cut ourselves off from a lot of use cases. Unfortunately, it's
hard for people to evaluate whether these tools are a good fit now and in a
couple years.

As for git-lfs relative to git-fat: (1) the Go implementation is probably
sensible because Python startup time is very slow, (2) git-lfs needs server-
side support so administration and security is more complicated, (3) git-lfs
appears to be quite opinionated about when files are transferred and inflated
in the working tree. The last point may severely limit ability to work
offline/on slow networks and may cause interactive response time to be
unacceptable. Some details of the implementation are different and I'd be
curious to see performance comparisons among all of our tools.

~~~
joeyh
Thanks for verifying my somewhat out of date guesses about smudge performance!

Re the python startup time, this is particularly important for smudge/clean
filters because git execs the command once per file that's being checked out
(for example). I suppose even go/haskell would be a little too slow starting
when checking out something like the 100k file repos some git-annex users
have. ;)

~~~
clacke2
Ok, that means I don't have to check out git-lfs, git-fat or git-bigstore. My
annex is 250k symlinks pointing to 250 GiB of data. It's slow enough as it is.

~~~
joeyh
At 250k files in one branch, you are starting to run into other scalability
limits in git too, like the inefficient method it uses to update .git/index
(rewriting the whole thing).

~~~
bicolao
There is the new "split-index" mode to avoid this (see the "git update-index"
man page). The base index will contain the 250k files, but .git/index will
only contain the entries you update, which should be a lot fewer than 250k.
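
A minimal sketch of turning it on and off in a repo:

    git update-index --split-index      # split .git/index into a shared base plus a small delta
    git update-index --no-split-index   # merge it back into a single index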

~~~
weakish
It seems that `--split-index` is only available at 'update-index'. Can it be
used with `add` or via `git config`?

------
vvanders
This looks like it misses the mark a bit.

As anyone who's worked on a project with large binary files (the docs assume
PSDs) knows, you need to be able to lock unmergeable binary assets. Otherwise
you get two people touching the same file and one of them has to throw away
their changes. That never makes anyone happy.

It also remains to be seen how good the disk performance is. These two areas
are the reason why Perforce is still my go-to solution for large binary files.

~~~
infogulch
That's an interesting point I hadn't considered before. Git, as a
_distributed_ vcs, takes the position that everyone can edit all files, and
rare conflicts can be managed easily because changes are relatively small and
diffable.

But these assumptions break down with binary assets. Changes are _not_ small,
they typically change entire files at once. They're also not diffable. As a
result, conflicts are not rare, they're _common_, and impossible to resolve
for both parties. That's why locking or "checking out" certain files is a
needed feature, so changes are ordered strictly linearly -- a graph structure
doesn't work.

~~~
jordigh
> They're also not diffable.

I think that really is a problem with the diff tools, not the format itself.
This is why both Mercurial and git allow you to pick special diff and merge
tools per filename extension.

~~~
infogulch
Most binary assets are compressed. Diff tools can't work with compressed
assets; you'll have to decompress before diffing. Sometimes they also include
checksum information which would invalidate any attempt to merge.

How far do you take the decompression? For raster images, you'll probably have
to decompress all the way to bitmap because the same image could have multiple
completely different binary representations in a format like png.

How do you diff changes? If we have a raster image and one person changes one
thing by a small amount, say increases brightness 1%, this could alter every
pixel of the image! How would you detect that change and interleave it with
something like a contrast adjustment of 1% that could also change every pixel?
Sure, the merger would still have to choose which adjustment goes first if the
changes aren't independent, but how would they know that's what changed? I.e.
how would the diff tool know that the changes are "brightness +1%" and
"contrast +1%" and not some other arbitrary number of adjustments?

~~~
jordigh
I don't think the problem is trivial, but I don't think it's hopeless either.
If we can measure a person's pulse and mood with a camera pointed at their
face, I'm sure we can come up with a tool that can approximate semantically
meaningful diffs of artwork. For image formats like xcf that store parts of
the image or the editing history independently, this problem becomes even more
tractable.

~~~
WallWextra
It's very tractable if you don't insist on the diff reconstructing the target
file byte-for-byte. But this would require pretty invasive changes in the
version control system, no?

If you change the top-left pixel of a PNG, for example, between the intra
prediction and the DEFLATE compression, the new file can be totally different,
and to reconstruct it you either hope the destination is using the exact same
libpng with the exact same settings, or you have to find a space-efficient way
to write down all the arbitrary encoding decisions the format allows.

------
Doji
So basically it's git-annex, but tied to GitHub.
[http://git-annex.branchable.com/](http://git-annex.branchable.com/)

~~~
bhuga
> tied to GitHub.

The protocol is open
([https://github.com/github/git-lfs/blob/master/docs/api.md](https://github.com/github/git-lfs/blob/master/docs/api.md))
and the client additions are open source. There is a reference server
implementation at
[https://github.com/github/lfs-test-server](https://github.com/github/lfs-test-server).

edit: added protocol spec

~~~
chimeracoder
> The protocol is open and the client additions are open source. There is a
> reference server implementation at
> [https://github.com/github/lfs-test-server](https://github.com/github/lfs-test-server).

This isn't about this particular instance (Github's LFS), but in general, a
"reference implementation" isn't the same thing as having an open protocol.

Having a reference implementation without a proper specification means that
any other implementations have to re-implement the existing reference
implementation, _including_ any bugs. The purpose of a specification is to
outline _undefined_ behavior as much as it is to outline defined behavior.
That is, the specification says, "these are the portions of the program which
you may _not_ rely on".

We've seen this happen in some languages in which a particular implementation
is either the _de facto_ or _de jure_ standard. Other compilers or
interpreters end up having to mimic their bugs when it comes to things like
arithmetic overflow/precision errors, because developers have come to rely on
the language behaving one way, in the absence of any clear rules telling them
otherwise[0].

[0] Not that developers won't rely on things a specification explicitly tells
them not to - there are plenty of examples of that too - but at least then
it's possible to determine either that a particular program will run on any
standards-compliant implementation, or that it is implementation-specific.

~~~
bhuga
You're right, the protocol is also required. I didn't link to it in my
comment, but it is also open and well-defined. I've updated my comment.
Thanks!

------
jxf
This looks really interesting. You basically trade the ability to have diffs
(nearly meaningless on binary files anyway) for representing large files as
their SHA-256 equivalent values on a remote server.

What will be interesting is to see whether GitHub's implementation of LFS
allows a "bring your own server" option. Right now the answer seems to be no
-- the server knows about all the SHAs, and GitHub's server only supports
their own storage endpoint. So you couldn't use, say, S3 to host your Git LFS
files.

~~~
icebraining
_You basically trade the ability to have diffs (nearly meaningless on binary
files anyway) for representing large files as their SHA-256 equivalent values
on a remote server._

That's exactly what git-annex does. Except it can host on your own servers, or
S3, or Tahoe-LAFS, or rsync.net, etc. And it's free software. And it supports
multiple servers for the same repo, so you have redundancy.

Adding an S3 remote is just setting the AWS keys and running a single command:
[http://git-annex.branchable.com/tips/using_Amazon_S3/](http://git-annex.branchable.com/tips/using_Amazon_S3/)
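
Roughly (going from memory of that tip page, so check it for the exact
invocation and parameters):

    export AWS_ACCESS_KEY_ID="..."
    export AWS_SECRET_ACCESS_KEY="..."
    git annex initremote cloud type=S3 encryption=shared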

~~~
rakoo
And if you want the simplest solution (ie store blobs with the rest of your
code), gitlab offers git-annex compatibility
([https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-problem-of-versioning-large-binaries-with-git/](https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-problem-of-versioning-large-binaries-with-git/))

~~~
sytse
Thanks for mentioning us, rakoo. Using git-annex on GitLab.com is completely
free. We're thinking about a 5GB per-repo cap, but right now it's unlimited.

------
ot
> Every user and organization on GitHub.com with Git LFS enabled will begin
> with 1 GB of free file storage and a monthly bandwidth quota of 1 GB.

Does this mean that with the free tier I can upload a 1GB file which can be
downloaded at most _once a month_? Even a small 10MB file, which fits
comfortably in a git repo, could be downloaded only 100 times a month. Maybe
they meant 1TB bandwidth?

~~~
mdlowman
I would suspect the point is rather that you have a bunch of megabyte range
files, and you rarely update them and don't have to sync. But for most
workflows this feature seems targeted at, the free tier seems insufficient.

~~~
jandrese
I'm having trouble seeing where a 1GB/month quota in any way meshes with
"large file" support. The free tier is basically "test out the API, don't even
think about using it for real".

~~~
cortesoft
Yes, that is the free tier. If you want to use it seriously, it will cost some
money. OR you can use it and host your own file server, for free.

I don't think these facts are a problem. They create an open source tool,
provide a location to try it out, and a service to pay to use it if you like
it and don't want to host yourself. Seems like a fair offer.

------
jewel
The "filter-by-filetype" approach used here is going to work a lot better for
mixed-content repositories than git-annex, which doesn't have that capability
built-in (to my knowledge).

git-annex has been great for my photo collection (which is strictly binary
files). It lets me keep a partial checkout of photos on my laptop and desktop,
while replicating the backup to multiple hosts around the internet.
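
The partial-checkout part is just getting and dropping content per path,
something like this (paths illustrative):

    git annex get 2015/trip/     # fetch the content for these files locally
    git annex drop 2012/         # free the space; annex checks that enough copies exist elsewhere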

At work we have a bunch of video themes that are partially XML and INI files
and partially JPG and MP4. LFS would work great for us, except we don't use
github (we don't have a need for it.) It looks like this is going to be very
simple for that kind of workflow.

Just yesterday HN user dangero was looking for this exact sort of thing, large
file support in git that didn't add too much complexity to the workflow:
[https://news.ycombinator.com/item?id=9330125](https://news.ycombinator.com/item?id=9330125)

~~~
icebraining
The filter-by-filetype behavior can be replaced by a small script that
augments git:
[http://git-annex.branchable.com/forum/help_running_git-annex_on_top_of_existing_repo/#comment-41049e5e5de08f6cc0b76f9298b8e4c0](http://git-annex.branchable.com/forum/help_running_git-annex_on_top_of_existing_repo/#comment-41049e5e5de08f6cc0b76f9298b8e4c0)

~~~
sytse
It would be nice to replace that with git hooks so you can just use regular
git commands. Any idea whether that's feasible?

~~~
icebraining
I'm not sure, but I doubt a hook would do. An alternative would be to have a
frontend script to git that would shadow the git command (using shell aliases)
and call git-annex when appropriate.

~~~
sytse
Yeah, I don't like shell aliases but it would work. I wonder how the LFS
client works.

------
justinsb
This solves a real problem, but I can't help but feel it is a band-aid hack.

The main fundamental advantage (vs implementation quirks of git) I can see is
that these files are only fetched on a git checkout. But (of course) this
breaks offline support, and it requires additional user action.

Wouldn't it have been fairly easy to build exactly the same functionality into
git itself? "Big" blobs aren't fetched until they are checked-out? This also
has the advantage the definition of "big" could depend on your connectivity /
disk space / whatever, rather than being set per-repo.

~~~
ocdtrekkie
I have a feeling the decision to arrange this as a separate service is likely
meant to feed the monetization component, particularly if it's limited to
using GitHub's storage.

~~~
acveilleux
Specs are open, there's an open server implementation. It might be easiest to
set it up with github, but if it catches on, I expect implementations will be
readily available from all github-like platforms and as stand-alone.

~~~
ocdtrekkie
Awesome. I did notice they called it "Git LFS" instead of "GitHub LFS", which
should be a clue there, though from other comments I figured it might be
GitHub specific.

------
Rondom
Has someone had a closer look and can say how this compares to Git-Annex?

~~~
jefurii
This and git-annex (and git-fat and others) use the same basic architecture of
storing links in Git and schlepping the binaries around separately.

Git-annex renames binaries with their SHA256 hashes, puts them in a
.git/annex/ dir, and replaces files in the working dir with symlinks. Git-LFS
seems to use small metadata pointer files (SHA256 hash, file size, git-lfs
version) instead of symlinks. Not sure whether the files reside in something
like the .git/annex/ dir; I'm guessing they do, or there wouldn't be those
pointer files.
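
Roughly, the difference in a checkout looks like this (paths and hashes
illustrative; the pointer format is the one quoted near the top of the
thread):

    # git-annex: the working-tree file is a symlink into the annex object store
    $ ls -l photo.jpg
    photo.jpg -> .git/annex/objects/.../SHA256-s12345--4d7a2146....jpg
    
    # git-lfs: the blob committed to git is a small text pointer
    $ git show HEAD:photo.psd
    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345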

You can clone a repo without having to download the files.

With git-annex you can sync between non-bare and bare repositories without
having a central server. Git-LFS seems to have a separate server for binaries.
It looks like it may act like a git-annex special remote rather than git-
annex's usage of synced/master branch.

Git-annex repos share information about the locations of annex files, how many
repos contain a given file, etc. You can trust and un-trust repos. It doesn't
look like Git-LFS offers this.

Git-LFS has a REST API. I'm using an old version of git-annex, so I can't say
whether it has one (I think it does).

Git-LFS is written in Go, git-annex is Haskell.

Git-LFS is a GitHub project. GitHub will offer object hosting.

Update: clarity, speling, added a bullet point.

~~~
joeyh
The lack of location tracking looks like the most significant difference to
me. While the git-lfs documentation does mention that different git remotes
can have different LFS endpoints configured, all git-lfs knows about a file is
its SHA256. So how can it tell which remote to download the file from? The
best it could do is try different remotes until it finds one that has the
file.

I hesitate to say this means git-lfs is not distributed at all, but it seems
significantly less distributed than git-annex, which can keep track of files
that might be in Glacier, or on an offline drive, or a repo cloned on a nearby
computer, and so can be used in a more peer-to-peer fashion when storing and
retrieving the large files.

~~~
paulcody
At a minimum, the metafile should include a URI, stating the last known
location of the file, rather than just a bare SHA256.

~~~
joeyh
That doesn't work very well. Consider what happens if two different clones of
a repo update the metadata last-known url at the same time with different
urls. Merge conflicts.

This is why git-annex uses a separate branch for location tracking
information, which it can merge in a conflict-free manner.

~~~
paulcody
Late reply: that's fine, as long as there is some information as to where the
location is, and not GitHub-only.

------
geoffreyirving
After a quick scan, I'm a bit worried that this is too tied to a server in
practice. For example, if I've downloaded everything locally, can I easily
clone the whole download (including all lfs files) into a separate repo? If I
can, can changes to each be swapped back and forth?

------
m0th87
Our solution is likely a lot more duct-tape-y, but we developed a
straightforward tool in Go for managing large assets in git:
[https://github.com/dailymuse/git-fit](https://github.com/dailymuse/git-fit)

There are a number of other open-source solutions out there, some of which are
documented in our readme.

------
callum85
Can someone explain to me what problem this solves in layman's terms? How are
version control systems "impractical" for large files?

Or to put another way, what problems will I run into if I just commit large
media files without using this?

~~~
ndepoel
With distributed version control systems such as Git or Mercurial, when you
clone a repository you get the entire history of that repository (or of a
selected branch). This means that if you place large media files directly in
the repository, then every clone will contain each and every revision of that
file. In time, this will cause an enormous amount of bloat in your repository
and slow work on the repository down to a crawl. Cloning a repository several
dozens of gigabytes in size is no fun, I can tell you.

Centralized version control systems such as Subversion don't have this problem
(or at least, to a lesser extent), because as a user you only download a
single revision of each file when you check out the repository.

Extensions like git-media, git-fat and now git-lfs solve this issue by only
storing references to large media files inside the Git repository, while
storing the actual files elsewhere. With this, you will only download the
revision of the large file that you actually need, when you need it. It's sort
of a hybrid solution in-between centralized and decentralized version control.

------
zmmmmm
Is there any hint on pricing? Slightly annoying to have a section titled
"Pricing" which... doesn't tell you the price. I would much rather use my own
external server for hosting large files; it is going to need to be
price-competitive with other options to be interesting, I would think.

------
saljam
What's stopping git from storing large files using Merkle trees + a rolling
hash?

I'm probably missing something, since there's this, and git-annex, and
git-bigstore, and others...

------
duartetb
Does this mean gamedevs might start dropping Perforce for this? If it's not
too expensive, maybe?

~~~
Tiktaalik
This is certainly the issue that is preventing game devs from adopting Git.

On the other hand game devs at this point are very used to Perforce, and it
looks like Perforce is interested in solving this problem from the other side,
by adding Git features to Perforce Helix and making it distributed.

~~~
dhruvgupta
The Helix Versioning Engine is a native DVCS. This is in addition to the Git
management solution that is part of the same product. Choice of workflow,
combined with efficient handling of large files and large repos -- definitely
interested in solving the problem right instead of applying a band-aid.

------
sytse
I like the ease of use of 'git lfs track "*.psd"' and being able to use normal
git commands after that.
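
A minimal sketch of that flow (file name illustrative):

    git lfs track "*.psd"        # adds a filter rule to .gitattributes
    git add .gitattributes
    git add design.psd           # committed as a small pointer; content goes to the LFS store
    git commit -m "Add design file"
    git push origin master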

Would it be possible to extend git-annex with a command that lets you set one
or more extensions? By using git hooks you can probably ensure that the normal
git commands work reliably.

------
sytse
To celebrate the broader support for git with large files, we just raised the
storage limit of GitLab.com to 10GB:
[https://about.gitlab.com/2015/04/08/gitlab-dot-com-storage-limit-raised-to-10gb-per-repo/](https://about.gitlab.com/2015/04/08/gitlab-dot-com-storage-limit-raised-to-10gb-per-repo/)
Also, we're glad GitHub open-sourced it and didn't call it assman.

~~~
sytse
Someone asked whether this was temporary or permanent; it is permanent, see
[https://news.ycombinator.com/item?id=9344984](https://news.ycombinator.com/item?id=9344984)

------
Poiesis
Has anyone seen what happens for a user who doesn't have this installed when
cloning? I've tried it out but it seems to not affect local clones.

------
Animats
Does Github really do this using git's "smudge" and "clean" filters? That
would mean reprocessing the whole file for each access. That's inefficient.
It's useful only if someone else is paying for the disk bandwidth, and
necessary only if you don't have control of the storage system. Why would
GitHub do that to itself?

------
luckydude
BitKeeper has had a better version of this since around 2007. Better in that
we support a cloud of servers so there is no "close to the server" thing,
everyone is close to the server.

What we don't have is the locking. I agree with the people commenting here
that locking is a requirement because you can't merge. We need to do that.

------
Pirate-of-SV
Can't wait to see what Linus has to say about this. I suppose he has an
arguably better solution to the problem?

~~~
jgrowl
My guess is that Linus _doesn't care_ about large binary files.

~~~
bananaboy
Yeah I've read interviews where he's said that he wrote git for his use case,
which I imagine doesn't include large binary files. I don't think he'd really
care.

------
amelius
Wouldn't it be nicer if we had something like this on the level of the
filesystem, instead of on the level of a version control system? Advantages
would be that git and any other user-space application wouldn't need much
extension, and files could be opened as if they were on the local file system.

------
nodesocket
> Every user and organization on GitHub.com with Git LFS enabled will begin
> with 1 GB of free file storage and a monthly bandwidth quota of 1 GB.

A GB doesn't get you very far if you are working with raw audio and video.

Does it make sense to think about storing virtual machines images (.vmdk) in
git on GitHub with LFS?

------
spb
I still don't get why you wouldn't just check large binaries into a submodule
and host that everywhere you would an annex/LFS.

------
niche
Yes! Bringing us all one step closer to the whiysi (we host it you store it)
dev paradigm. Bravo!

------
patcon
Cool! Can't wait for future integration of content-addressable systems like
ipfs :)

------
markvitals
This is very handy for designers who want to use Photoshop or Illustrator
with Git.

------
dmitrypolushkin
Hopefully bup will implement something like this for backups.

------
jbramble
Does this mean github could become useful for music production?

------
lesplat
So does this mean the large files are actually versioned?

~~~
gabeio
From how I read it, it sounds like a little of yes and no... it's similar to
git's way, but I am not sure whether they are really going to keep versions of
all of the old large files... I guess if they are going to fully support going
backwards in git history, they have to...

------
silon3
Can it link to torrent?

------
mahouse
Ah, cool! At last I will be able to store my database backups in GitHub.

~~~
dmitrypolushkin
Hopefully they will provide such functionality, but I don't think it will be
in the near future.

------
ElectricFeel
My name is Lars & I do projects for LiveIT! This is exciting.

