Announcing Git Large File Storage (github.com)
620 points by dewski on Apr 8, 2015 | 159 comments



I'm sure GitHub did their due diligence before starting to work on this, but I can't lie: it bums me out a bit that they didn't find git-bigstore [1] (a project I wrote about 2 years ago) before they started, since it works in almost the exact same way. Three-line pointer files, smudge and clean filters, use of .gitattributes for which files to sync, and remote service integration.

Compare "Git Large File Storage"'s file spec:

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345
And bigstore's:

    bigstore
    sha256
    96e31e44688cee1b0a56922aff173f7fd900440f
Bigstore has the added benefit of keeping track of file upload / download history _entirely in Git_, using Git notes (an otherwise not-so-useful feature). Additionally, Bigstore is also _not_ tied to any specific service. There are built-in hooks to Amazon S3, Google Cloud Storage, and Rackspace.
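The git-notes ledger idea can be sketched like this (a generic illustration of the mechanism; bigstore's actual note format and ref name may differ):

```shell
# Sketch: using git notes under a dedicated ref as a transfer ledger.
# The ref name "bigstore" and the note text are illustrative only.
cd "$(mktemp -d)" && git init -q .
git config user.email demo@example.com && git config user.name demo
git commit -q --allow-empty -m 'add big asset'
git notes --ref=bigstore add -m 'uploaded to S3 by alice'
git notes --ref=bigstore show HEAD    # prints the ledger entry
```

Because notes live in their own ref, this history syncs with the repo itself instead of living in a side database.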

Congrats to GitHub, but this leaves a sour taste in my mouth. FWIW, contributions are still welcome! And I hope there is still a future for bigstore.

[1]: https://github.com/lionheart/git-bigstore
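For concreteness, a pointer file in the LFS format quoted above can be produced from any file with standard tools (a sketch; `big.bin` is a placeholder name):

```shell
# Build a Git LFS v1 pointer for a file by hand (big.bin is a stand-in).
cd "$(mktemp -d)"
printf 'hello' > big.bin
oid=$(sha256sum big.bin | cut -d' ' -f1)    # content hash, the "oid" line
size=$(wc -c < big.bin | tr -d ' ')         # byte count, the "size" line
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
    "$oid" "$size"
```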


I honestly hadn't seen git-bigstore. Git LFS started out as a Git Media update, actually. We changed the name pretty late in the process so that it didn't clash with existing Git Media repositories. Go was picked so that we could ship static binaries, sparing users from having to install the correct ruby/python runtime and third-party dependencies.

Git LFS isn't tied to any specific service either. You can install our reference server somewhere, and start using it with your GitHub (or any host really) repositories without having to sit in our wait list or pay us a dime. Though our reference server isn't really production ready, so I wouldn't advise that for real work just yet :)


Looks like you wrote git-bigstore a few months after I wrote git-fat (also Python and a similar design; partially inspired by git-media). It would be interesting to do some performance comparisons and merge our capabilities, perhaps with support for each other's stub formats if we can do it in a compatible way.


It's also worth noting that git-media was written by one of the founders of GitHub. So...that may be why? :)


Yep, agreed. That would be awesome.


Looks like you and GitHub are also both duplicating the git-media extension [1].

I haven't settled on one for my own use, but I'll compare features of bigstore and git-media before I do. Thanks for making your project available!

[1] https://github.com/alebedev/git-media


A lot of inspiration did come from the git-media project which goes back to at least 2009: https://github.com/alebedev/git-media/commit/705ea59bd98a3d1...

git-bigstore is awesome too though!


It seems to have been based on git-media initially; it would be interesting to know why they changed that: https://github.com/github/git-lfs/commit/10a8eceefdb081edf61...


Very cool. Maybe this will shine a light on your project. First thing I looked for in their announcement was S3 support...


If you want S3 support, consider using git-annex. GitLab.com and GitLab Enterprise Edition support it out of the box.


It's interesting that this uses smudge/clean filters. When I considered using those for git-annex, I noticed that the smudge and clean filters both have to consume the entire content of the file from stdin. This means that, e.g., git status will need to feed all the large files in your work tree into git-lfs's smudge filter.

I'm interested to see how this scales. My feeling when I looked at it was that it was not sufficiently scalable without improving the smudge/clean filter interface. I mentioned this to the git devs at the time and even tried to develop a patch, but AFAICS, nothing yet.

Details: <https://git-annex.branchable.com/todo/smudge>
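To make the mechanism concrete, here is a toy clean/smudge pair wired up the same way (a sketch, assuming a POSIX shell and git; the filter name `demo` is invented, and a real tool like git-lfs would also stash the blob content out-of-band):

```shell
# Toy demonstration of the smudge/clean plumbing these tools rely on.
# "demo" is a made-up filter; its clean command just hashes stdin.
cd "$(mktemp -d)" && git init -q .
git config filter.demo.clean  'sha256sum | cut -d" " -f1'  # file -> pointer
git config filter.demo.smudge 'cat'         # a real tool would fetch the blob
echo '*.bin filter=demo' > .gitattributes
printf 'pretend this is a huge binary' > asset.bin
git add asset.bin
# The staged blob is now the 64-char hash, not the payload -- and note
# that git streamed the whole file through the clean filter's stdin:
git cat-file -p :asset.bin
```

The scaling concern above is exactly that last point: every filter invocation consumes the full file over a pipe.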


As the author of git-fat, I have to say the smudge/clean filter approach is a hack for large files and the performance is not good for a lot of use cases. The reality is that it's common to need fine-grained control over what files are really present in the repository, when they are cached locally, and when they are fetched over the network. Git-annex does better than the smudge/clean tools (git-fat, git-media, git-lfs) but at somewhat increased complexity. I think our tools have stepped over the line of "as simple as possible but no simpler" and cut ourselves off from a lot of use cases. Unfortunately, it's hard for people to evaluate whether these tools are a good fit now and in a couple years.

As for git-lfs relative to git-fat: (1) the Go implementation is probably sensible because Python startup time is very slow, (2) git-lfs needs server-side support so administration and security is more complicated, (3) git-lfs appears to be quite opinionated about when files are transferred and inflated in the working tree. The last point may severely limit ability to work offline/on slow networks and may cause interactive response time to be unacceptable. Some details of the implementation are different and I'd be curious to see performance comparisons among all of our tools.


Thanks for verifying my somewhat out of date guesses about smudge performance!

Re the python startup time, this is particularly important for smudge/clean filters because git execs the command once per file that's being checked out (for example). I suppose even go/haskell would be a little too slow starting when checking out something like the 100k file repos some git-annex users have. ;)


Yep. It's really a problem that needs to be fixed in git proper. I'm surprised that github of all people didn't realize this and/or invest the time to do it right.

The one major drawback to fixing it in git proper is that it wouldn't be backwards compatible with old clients, though. Doing it in go is probably a good improvement over the existing solutions of git-media and git-fat, but I don't think it's the final one.

Funny enough, although I had thought this since I started working with git-fat, I only recently admitted it[1]. Perhaps if I had admitted it when I first started work on it then there's a chance they would have seen it! :-P

[1] https://github.com/cyaninc/git-fat/issues/41#issuecomment-88...


Ok, that means I don't have to check out git-lfs, git-fat or git-bigstore. My annex is 250k symlinks pointing to 250 GiB of data. It's slow enough as it is.


At 250k files in one branch, you are starting to run into other scalability limits in git too, like the inefficient method it uses to update .git/index (rewriting the whole thing).


There is the new "split-index" mode to avoid this (see the "git update-index" man page). The base index will contain the 250k files, but .git/index will only contain the entries you update, which should be far fewer than 250k.
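For example (a sketch; requires a git recent enough to have split-index, 2.3+):

```shell
# Enable split-index mode: later index writes rewrite only a small delta
# file, while the big base index lives in .git/sharedindex.<hash>.
cd "$(mktemp -d)" && git init -q .
printf a > f1 && printf b > f2
git add f1 f2
git update-index --split-index
ls .git/sharedindex.*    # the shared base index now holds the bulk
```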


It seems that `--split-index` is only available via `update-index`. Can it be used with `add` or via `git config`?


I looked at git-fat as an option for me, but what killed it was rsync as the only backend; I really wanted to send files to S3.

I also looked at git-annex, and I could see using it if it were just me on the project (or as a way of keeping fewer files on my laptop drive), but I was reluctant to add any more complexity to the source control process, since explaining how to use git-annex to the entire team was too big of a barrier.


Thanks for the feedback. There is a PR for S3 support, but it's dormant because it was mixed with other changes that broke compatibility. I haven't personally wanted S3, so haven't made time to rework the PR.


Ah, that is my fault. We're using the fork quite actively, but need to revive that PR and improve the config settings.


So if I have large (1GB+) files in my repo, you recommend against git-fat?

I have been enjoying the simplicity of git-fat, but running git diff and especially git grep makes me think I should switch to something else.


As someone who worked on it a lot, I'd say it'd be worth it to switch to git-lfs. Exactly the same designs with different formats but written in golang.


Seems that git status nowadays does manage to avoid running the smudge filter, unless the file's stat has changed. This overhead does still exist for other operations, like git checkout.


Also, is there any reason Git LFS can't be used as a special remote for git-annex?

It would provide an easy way for people to host their git-annex repos entirely on GitHub.


Yeah, git-annex is very interested in having a special remote for everything and anything. And if someone creates 4 shell commands, I could have a demo working in half an hour. The commands would be:

  lfs-get SHA256 > file
  lfs-store SHA256 < file
  lfs-remove SHA256 (optional)
  lfs-check SHA256 # exit 0 or 1, or some special code if github is not available
Presumably the right way would be to use their http api, but these 4 commands seem generally useful to have anyway.


This looks like it misses the mark a bit.

As anyone who's worked on a project with large binary files knows (the docs assume PSDs), you need to be able to lock unmergeable binary assets. Otherwise you get two people touching the same file, and someone has to destroy their changes. That never makes anyone happy.

It also remains to be seen how good the disk performance is. These two areas are the reason why Perforce is still my go-to solution for large binary files.


Mercurial has largefiles and locking too:

http://mercurial.selenic.com/wiki/LargefilesExtension

http://mercurial.selenic.com/wiki/LockExtension

Like other people have noticed, you can have the good parts of a DVCS and the good parts of a CVCS. It doesn't have to be either-or:

https://blogs.janestreet.com/centralizing-distributed-versio...

http://bitquabit.com/post/unorthodocs-abandon-your-dvcs-and-...


We turned on the largefiles extension almost a year ago and have come to regret that decision. The major pain point is integrations: everything from the Eclipse plugin to most repository management/hosting solutions tends to either have bugs or flat-out not support those repos. With a plain mercurial workflow it works in most situations, but it still rears its ugly head. The maintainers of mercurial classify it as a "feature of last resort" (see http://mercurial.selenic.com/wiki/FeaturesOfLastResort).

Switching off largefiles requires converting the repository, which rewrites its entire history. Orchestrating the migration to a new repository for an engineering department is also painful, which is why we're stuck for the near future (for example, ongoing support for a version built from the largefiles repo alongside ongoing feature work in a non-largefiles repo).

The tooling for mercurial tends to lag behind git's, likely due to git's enormous popularity - so I personally would recommend avoiding the largefiles extension.


OOC, what kinds of problems did you see with editor integrations? I work on hg, and I'm wondering if there's anything we can do to make the situation less painful.


We primarily use Eclipse as our editor developing Java. I think I'm using the correct plugin (MercurialEclipse) but there's another one (HgEclipse, or there used to be) which has often caused confusion.

Some of the issues are likely due to our project, which is >93k commits at around 1GB of repo size. I think we have a commit in the history with a 40kb commit message - I believe the developer mixed up some streams/pipes while committing - and I'm not sure how we got stuck with it in our history, though I suspect it could be the cause of some issues. I can list a few of the issues I still see regularly, but since about a year or so back I've abandoned using the plugin in favor of doing everything manually on the command line, except for viewing history and resolving merge conflicts.

My setup:

- Eclipse 4.4.2 (configured to run with 2GB memory), though I've seen most of these issues since 3.something
- MercurialEclipse: http://mercurialeclipse.eclipselabs.org.codespot.com/hg.wiki...
- OS X (10.10 and 10.9), mercurial installed through homebrew

1. During some actions (I believe cloning/sharing projects, maybe elsewhere) the config for mercurial in eclipse pops up thinking it can't find the hg binary, giving an error about a bad location. Nothing has moved or changed on the system; simply focusing in the hg location field and back out causes it to re-validate. I've seen this one recently, maybe a few days ago.

2. Some files can't retrieve history or show annotations. I suspect because the files have been through numerous edits over the years and the plugin runs out of memory or hits a timeout when trying to load everything.

3. Occasionally refreshing status throws an error and pops up a dialog informing me. When this happens the plugin becomes unusable until eclipse is restarted, but popups continue to show up when trying to do any sort of task (such as refreshing the project).

4. Using mq patches causes the mercurial status of the project to not update regularly (in the Navigator pane). I've since moved away from using mq patches in favor of managing multiple local heads.

5. The merge view/window no longer automatically shows up when performing a merge in eclipse. It has to be manually opened, but used to automatically show up - I liken it to performing a search but the search results not automatically showing up unless it's part of your workspace layout already.

These are what I can recall off the top of my head, but I've also shied away from using the plugin since quite a while ago. In general the speed is sometimes frustrating to deal with when wanting to view the history of a file. It feels like it's gotten slower in the time since we turned on the largefiles extension, though it's unfair to assume so because I'm biased.


FWIW: It seems like none of these issues are related to largefiles. It might be problems with the eclipse plugin or "something else" (configuration issues, bad interaction with other tools, incompatible versions, bugs).

The only exception could be the lack of history. Largefiles plays some tricks, storing the hash of file X in a .hglf/X file, and folding it back into the right namespace is tricky and has had some errors. Annotate and largefiles are conceptually incompatible; largefiles is intended for big, binary files where annotate wouldn't work no matter what.

My experience is that the main problem with largefiles is the problem it is trying to solve. It is not a good idea to store large files in a VCS - especially not in a DVCS. Storing large files in VCS is last resort. Given a situation where you have to / want to do it anyway, largefiles is a fine solution. It works quite well for us and without significant problems.


Maybe I misread durin42's post - I thought they were asking about issues I mentioned with hg + editor plugin. There have been issues with the plugin for several years that I had moved away to primarily just using the command-line by the time it was decided to use largefiles. I'm not aware of any specific largefiles + editor plugin issues, other than a suspicion about things getting slower with regards to looking at history.

One other thing that's been a problem in the past with hg + largefiles (or only started happening since around when we turned on largefiles): Cloning a largefiles repo using "--uncompressed" flag would re-open all the closed named-branches in the newly cloned repo.


That's an interesting point I hadn't considered before. Git, as a distributed vcs, takes the position that everyone can edit all files, and rare conflicts can be managed easily because changes are relatively small and diffable.

But these assumptions break down with binary assets. Changes are not small, they typically change entire files at once. They're also not diffable. As a result, conflicts are not rare, they're common, and impossible to resolve for both parties. That's why locking or "checking out" certain files is a needed feature, so changes are ordered strictly linearly -- a graph structure doesn't work.



Useful for seeing what changed, not so useful for merging unless the images are extremely simple.


> They're also not diffable.

I think that really is a problem with the diff tools, not the format itself. This is why both Mercurial and git allow you to pick special diff and merge tools per filename extension.


Most binary assets are compressed. Diff tools can't work with compressed assets; you'll have to decompress before diffing. Sometimes they also include checksum information which would invalidate any attempt to merge.

How far do you take the decompression? For raster images, you'll probably have to decompress all the way to bitmap because the same image could have multiple completely different binary representations in a format like png.

How do you diff changes? If we have a raster image and one person changes one thing by a small amount, say increases brightness 1%, this could alter every pixel of the image! How would you detect that change and interleave it with something like a contrast adjustment of 1% that could also change every pixel? Sure, the merger would still have to choose which adjustment goes first if the changes aren't independent, but how would they know that's what changed? I.e. how would the diff tool know that the changes are "brightness +1%" and "contrast +1%" and not some other arbitrary number of adjustments?


I don't think the problem is trivial, but I don't think it's hopeless either. If we can measure a person's pulse and mood with a camera pointed at their face, I'm sure we can come up with a tool that can approximate semantically meaningful diffs of artwork. For image formats like xcf that store parts of the image or the editing history independently, this problem becomes even more tractable.


It's very tractable if you don't insist on the diff reconstructing the target file byte-for-byte. But this would require pretty invasive changes in the version control system, no?

If you change the top-left pixel of a PNG, for example, between the intra prediction and the DEFLATE compression, the new file can be totally different, and to reconstruct it you either hope the destination is using the exact same libpng with the exact same settings, or you have to find a space-efficient way to write down all the arbitrary encoding decisions the format allows.


I don't think those are your scm's business. git is the stupid content tracker, it tracks whatever you push into it.

If I were to make a contrived analogy: how do you know how to diff random arrays of bytes? Where do you start, where do you stop? How do you know that "\n" or "\r\n" is some kind of delimiter? You put that knowledge in "diff" and in your editor, and git stores the raw array of bytes. It's the same with binary content: git doesn't care that you don't deal with UTF-8 characters, and it doesn't care that the content isn't bounded by newline characters.

If you take things this way, you start to understand that the "diff" tool you use must be appropriate to the content you have, and it's not the scm's business. Now, how exactly would a diff work for images, I have absolutely no idea.


An image diff is pretty simple. If you represent an image as a vector of values, diffing just means subtracting two vectors abs(A-B) and writing out the result into a new image.

    // Requires <vector>, <cmath>, <algorithm>, <cstddef>.
    template<typename T>
    T diff(const T& a, const T& b)
    {
        auto sz = std::min(a.size(), b.size());
        T img(sz);  // output image, one value per compared element
        for (std::size_t i = 0; i < sz; ++i) {
            img[i] = std::fabs(a[i] - b[i]);  // per-element absolute difference
        }
        return img;
    }
    // usage
    std::vector<float> a, b;
    a.emplace_back(1.0f); b.emplace_back(0.5f);
    a.emplace_back(1.0f); b.emplace_back(0.0f);
    a.emplace_back(0.5f); b.emplace_back(0.5f);

    auto img = diff(a, b);
    write_exr("filename.exr", img);  // write_exr: the author's EXR output routine
The resulting image ends up with 0.0 black in pixels that are identical and non-zero values in the pixels that differ. When you look at it in an image viewer only the portions that differ will be visible.

You often need to crank up the gain when the differences are small.


Showing diffs for binary assets doesn't need to include things like "brightness 1%". GitHub currently supports image diffs, they're simply displayed side by side, or on top of each other.


This isn't about showing diffs, it's about merging diffs from two separate changes. The best github can do for that right now is let you choose which one you want to keep; it doesn't let you stack changes to keep work from both committers. For that you need fine-grained, explanatory, stackable diffs.


A bit-exact differ for, say, jpegs, probably wouldn't work all that well. Even for losslessly compressed formats, it's very complex and I'm not sure how small the diffs would be.


I think the problem here is that people are jumping from "git should be able to store binary assets" to "git should be able to store binary compilation objects."

You would never check a .o file into your SCM; an SCM, as the name implies, is for managing source files, not object files. A PSD or a DOCX is also a source file, despite being binary: they're representations of the work-in-progress itself, containing enough data to let you resume editing the document. A JPEG, on the other hand, is a compiled object—something you export from your image editor, not something you open and edit and save again. (Unless you're intentionally going for that recompressed-shitpost look, I suppose.)

When you pull down a source repo—a thing you edit—you should expect to get source. That applies to both your text/code assets, and your image/binary assets.

If, on the other hand, you need some assets to just sit there and be consumed by your project, then those aren't source, and so don't belong in your source repo. Those are likely dependencies, which can be resolved to (a triggered compilation of) the relevant source repo for those assets, or which can be resolved to a linear(!)-versioned binary package containing the compiled objects for those assets.

Which is all to say: PSDs, given an appropriate diff tool, could go in git. Final, "product" JPEGs, on the other hand? Those should be sitting in a gem (or equivalent), which was the tagged continuous-integration result of pointing a buildbot (exportbot?) at the relevant source repo full of PSDs. When you build your project, that gem gets pulled down (hopefully from your own private CDN) and suddenly you have some JPEGs, just like suddenly you have some native module .so files.


Someone could write an image editor where the project format is a directory with the source jpeg or png assets and a makefile which generates the resulting image with imagemagick (or other) calls. The makefile could be easily diffable and mergeable and people wouldn't touch the binary assets (or at least wouldn't expect to diff or merge them).

Brushes wouldn't be easy though. They could be handled by trivial svg or ps paths which would get rendered in the make process. This way even brush strokes could be diffable. I wonder if it would be possible to convert existing PSD files or Gimp project files into such a format.


> A bit-exact differ for, say, jpegs, probably wouldn't work all that well.

Depends. For some uses (e.g. a change in one corner of the file, encoded by the same tool with the same parameters), the diff of JPEG would be fine, differences will be restricted to the 8x8 pixel blocks that were touched by the change. Other changes (e.g., changing encoding quality, trimming a row of pixels off the edge of an image) would lead to more complex diffs.


What about compressing the same image with two different encoders? Or with different zlib settings? Even if the diff after lossless decompression is small, you have to somehow reconstruct the decisions that the compressor made on the target file.


Anyone who currently stores large files in git already has this problem; git-lfs is strictly better than plain old git in this case.

Format-specific diffing and merging tools that are aware of git-lfs would probably help, and those can come later.


This looks to be competing with AWS CodeCommit too, which will allow files of any size to be committed since it is all just backed by S3.

http://aws.amazon.com/codecommit/


While the use case you describe isn't solved by this, other use cases are. For example, wanting to easily access versioned, compiled binaries for a revision.


So basically it's git-annex, but tied to GitHub. http://git-annex.branchable.com/


> tied to GitHub.

The protocol is open (https://github.com/github/git-lfs/blob/master/docs/api.md) and the client additions are open source. There is a reference server implementation at https://github.com/github/lfs-test-server.

edit: added protocol spec


> The protocol is open and the client additions are open source. There is a reference server implementation at https://github.com/github/lfs-test-server.

This isn't about this particular instance (Github's LFS), but in general, a "reference implementation" isn't the same thing as having an open protocol.

Having a reference implementation without a proper specification means that any other implementations have to re-implement the existing reference implementation, including any bugs. The purpose of a specification is to outline undefined behavior as much as it is to outline defined behavior. That is, the specification says, "these are the portions of the program which you may not rely on".

We've seen this happen in some languages in which a particular implementation is either the de facto or de jure standard. Other compilers or interpreters end up having to mimic their bugs when it comes to things like arithmetic overflow/precision errors, because developers have come to rely on the language behaving one way, in the absence of any clear rules telling them otherwise[0].

[0] Not that developers don't rely on things that a specification explicitly tells them not to - there are plenty of examples of this too - but at least then it's possible to determine either that a particular program will run on any standards-compliant implementation, or that it is implementation-specific.


You're right, the protocol is also required. I didn't link to it in my comment, but it is also open and well-defined. I've updated my comment. Thanks!


Why reinvent the same thing? What were the technical deficiencies of git-annex that necessitated this? Or is this NIH syndrome? I think these are all reasonable questions.


GitLab CEO here. It would have been nice if they had developed this in the open. Joey from git-annex is pretty open minded; maybe we could have prevented another standard.


I would have thought that it would have been particularly important for this to have been open to review and critique from the start given GitHub's status and importance to the wider community. Heck, even Microsoft is developing new .Net components in the open.

Would have been nice to have had debate on existing solutions to weigh the pros/cons.


Thanks Rapzid, I also think that an open process would have led to a better outcome. I hope GitHub follows up with a rationale for the points Joey mentioned in this thread.


To be honest, it doesn't look like they put much additional thought into the problem at all. git filters are going to be the limiting factor for performance here, and depending on your file sizes they might still cause a considerable slowdown. I would have liked to see a solution that patches git and uses `sendfile` myself.

See my other comment here: https://news.ycombinator.com/item?id=9345242


Embrace. Extend. Extinguish.


All GitLab contributors are working on an alternative ending.


As far as I can tell, less flexible (far less) than annex, but it can be made very seamless to the user (no/few special commands, and lfs tracking by filemask).

I would think bridging the gap in annex to track by file pattern would be easy, but a lot of people might prefer not to have to learn how to make annex go. So they're using simplicity as a differentiator.


(Not a git-annex user here). I suppose functionally, these two are similar. But the use case is different. git-annex seems to be more for managing files and making sure they don't disappear on you. GitHub's new thing is for keeping track of larger objects inside your git project efficiently. Basically, yeah, you can use git-annex to store the PSD, the audio samples, the promo video, etc. but wouldn't it be nice to have it all tied in with your normal project workflow?


> (Not a git-annex user here)

You could at least read the examples on the git-annex page[1] before passing judgement that the use cases are at all different (they're not). Instead of using a new 'lfs' command that ties you to GitHub, you use an 'annex' command (along with a few others).

Git-annex does just fine "keeping track of larger objects inside your git project efficiently", and is no more divorced from your normal project workflow than GitHub's lfs.

[1] http://git-annex.branchable.com/git-annex/


I tried git-annex a couple of times to sync my two OS X and linux based computers at my house and play around. It wasn't the easiest thing to get set up and working, and I couldn't get one of the OS X hosts to work at all.

I'm sure this will be much easier to use for the end user like other github products and will "just work" out of the box.


As a counterpoint, I just set up git-annex and sync'd a couple of local servers plus a remote server, with no issues, by following along in the walkthrough. Granted, it's not exactly an out-of-the-box setup, like say, syncthing, but it wasn't anything overly difficult.


If you want to use git-annex without any setup consider using GitLab.com, it is enabled by default and free to use.


Make a Kickstarter to put it in Gitlab CE, I'd propose.


We'd be up for that :)


I did read the use cases and the main pitch. I guess these don't highlight very well how it actually functions. Can I do "git add large.mp4 && git commit" the way I do now?


Yes, you can. That's one other advantage over git-annex (albeit slight).

The documentation lays out the workflow: https://help.github.com/articles/configuring-large-file-stor...

As does the website: https://git-lfs.github.com/ (see: "Getting Started")


Righto. So what is the equivalent thing with git-annex? Would I have to essentially do two commits, one for source code and one for large objects?


I'm still ramping up on git-annex, but basically (once set up):

    git annex add large_file
    git commit
And with mixed (haven't experimented with that yet), I'm pretty sure you could:

    git annex add large_file
    git add small_file
    git commit


Yes, you can mix them just fine.


Yeah, that was exactly my feeling.

"Not invented here" much?


They are trying to make a service. If you are making a product you generally want to be in control of its core parts.


If you use another open source project, that gives you control, right? It would be nice if everyone reused git-annex like they reused git.


This looks really interesting. You basically trade the ability to have diffs (nearly meaningless on binary files anyway) for representing large files as their SHA-256 equivalent values on a remote server.

What will be interesting is to see whether GitHub's implementation of LFS allows a "bring your own server" option. Right now the answer seems to be no -- the server knows about all the SHAs, and GitHub's server only supports their own storage endpoint. So you couldn't use, say, S3 to host your Git LFS files.


> You basically trade the ability to have diffs (nearly meaningless on binary files anyway) for representing large files as their SHA-256 equivalent values on a remote server.

That's exactly what git-annex does. Except it can host on your own servers, or S3, or Tahoe-LAFS, or rsync.net, etc. And it's free software. And it supports multiple servers for the same repo, so you have redundancy.

Adding an S3 remote is just setting the AWS keys and running a single command: http://git-annex.branchable.com/tips/using_Amazon_S3/
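Per that tip page, the setup is essentially (a sketch; the remote name and options follow the example there, credentials elided):

```shell
export AWS_ACCESS_KEY_ID="..."       # your AWS credentials
export AWS_SECRET_ACCESS_KEY="..."
git annex initremote cloud type=S3 encryption=shared
git annex describe cloud "Amazon S3 bucket"
```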


And if you want the simplest solution (ie store blobs with the rest of your code), gitlab offers git-annex compatibility (https://about.gitlab.com/2015/02/17/gitlab-annex-solves-the-...)


Thanks for mentioning us rakoo. Using git-annex on GitLab.com is completely free. We're thinking about a 5GB per repo cap, but right now it's unlimited.


This is also free software, and you can also use your own server.


I think the question is why they rolled their own solution when an open and freely available one already existed. If it wasn't suitable in some way, I would really like to know why.


If git-annex really worked well, and was easy to use, I imagine there'd be much more uptake of it.


"Build it and they will come"

Besides, I'm sure Joey Hess wouldn't refuse the help.


Joey has been most helpful to us at GitLab, I'm sure he would be happy to help out GitHub too.


Indeed, this is similar to git-annex; why doesn't GitHub support that? With GitLab we added support for it recently and it was pretty easy to do. Anyway, interesting that linking to SHAs is becoming the default solution. Looking forward to playing with this and comparing the solutions.


Because they are going to charge for LFS, I suspect. Just as they don't support private repos for free, either.


They could also charge for git-annex, I think. They don't need their own kind of private repo to charge for that. By the way, GitLab.com offers unlimited repos and 5GB per repo for free.


How many users/how long will you be able to provide that for free?


Forever; our business model is to make money with on-premises software (GitLab EE). In the long term we might add a marketplace to GitLab.com where you can subscribe to additional services (like Heroku does).


> You basically trade the ability to have diffs (nearly meaningless on binary files anyway) for representing large files as their SHA-256 equivalent values on a remote server.

That's exactly how Mercurial's largefiles works too:

http://mercurial.selenic.com/wiki/LargefilesExtension#The_lo...

Also, you don't need any kind of special server. Any hg repo can turn into a largefiles store by just flipping the bit in the repo configuration.


There's an example server that was also open sourced today: https://github.com/github/lfs-test-server


If the extension is open-source, you could probably build your own server implementation and fork the extension to change what server it contacts.



Nitpick: they gave four examples of large binary files: audio samples, datasets, graphics, and videos. Diffs would be meaningful in every case.


Surely not diffs offered by the git ecosystem?


> Every user and organization on GitHub.com with Git LFS enabled will begin with 1 GB of free file storage and a monthly bandwidth quota of 1 GB.

Does this mean that with the free tier I can upload a 1GB file which can be downloaded at most once a month? Even a small 10MB file, which fits comfortably in a git repo, could be downloaded only 100 times a month. Maybe they meant 1TB bandwidth?
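The back-of-the-envelope math, assuming the 1 GB quota means 1024 MB of transfer:

```shell
# Quota arithmetic for the free tier (assumed: 1 GB = 1024 MB/month)
quota_mb=1024
file_mb=10
downloads=$((quota_mb / file_mb))
echo "$downloads"   # roughly 100 downloads of a 10 MB file per month
```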


I would suspect the point is rather that you have a bunch of megabyte range files, and you rarely update them and don't have to sync. But for most workflows this feature seems targeted at, the free tier seems insufficient.


I'm having trouble seeing where a 1GB/month quota in any way meshes with "large file" support. The free tier is basically "test out the API, don't even think about using it for real".


Yes, that is the free tier. If you want to use it seriously, it will cost some money. Or you can use it and host your own file server, for free.

I don't think these facts are a problem. They create an open source tool, provide a location to try it out, and a service to pay to use it if you like it and don't want to host yourself. Seems like a fair offer.


GitLab.com offers 5GB per repo support for git-annex (unlimited repos).


The "filter-by-filetype" approach used here is going to work a lot better for mixed-content repositories than git-annex, which doesn't have that capability built-in (to my knowledge).

git-annex has been great for my photo collection (which is strictly binary files). It lets me keep a partial checkout of photos on my laptop and desktop, while replicating the backup to multiple hosts around the internet.

At work we have a bunch of video themes that are partially XML and INI files and partially JPG and MP4. LFS would work great for us, except we don't use github (we don't have a need for it.) It looks like this is going to be very simple for that kind of workflow.

Just yesterday HN user dangero was looking for this exact sort of thing, large file support in git that didn't add too much complexity to the workflow: https://news.ycombinator.com/item?id=9330125


You don't need a script; git-annex has the same capability, although configured differently.

For example:

   git config annex.largefiles "*.mp3 or *.mp4 or *.jpg or largerthan(100kb)"
   git annex add .


The filter-by-filetype can be replaced by a small script that augments git: http://git-annex.branchable.com/forum/help_running_git-annex...


Would be nice to replace that with git hooks so you can just use regular git commands. Any ideas whether that is feasible?


I'm not sure, but I doubt a hook would do. An alternative would be to have a frontend script to git that would shadow the git command (using shell aliases) and call git-annex when appropriate.
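A minimal sketch of such a wrapper (hypothetical; the function name, extension list, and routing rule are all placeholders):

```shell
# Shadow `git` with a shell function so that `git add` routes chosen
# file types through git-annex; everything else passes through to the
# real git via `command`.
git() {
    if [ "$1" = "add" ]; then
        shift
        for f in "$@"; do
            case "$f" in
                *.mp4|*.jpg|*.psd) command git annex add "$f" ;;
                *)                 command git add "$f" ;;
            esac
        done
    else
        command git "$@"
    fi
}
```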


Yeah, I don't like shell aliases but it would work. I wonder how the LFS client works.


This solves a real problem, but I can't help but feel it is a band-aid hack.

The main fundamental advantage (vs implementation quirks of git) I can see is that these files are only fetched on a git checkout. But (of course) this breaks offline support, and it requires additional user action.

Wouldn't it have been fairly easy to build exactly the same functionality into git itself? "Big" blobs aren't fetched until they are checked-out? This also has the advantage the definition of "big" could depend on your connectivity / disk space / whatever, rather than being set per-repo.


I have a feeling the decision to arrange this as a separate thing is likely to feed the monetization component. Particularly if it's limited to using GitHub's storage.


Specs are open, there's an open server implementation. It might be easiest to set it up with github, but if it catches on, I expect implementations will be readily available from all github-like platforms and as stand-alone.


Awesome. I did notice they called it "Git LFS" instead of "GitHub LFS", which should be a clue there, though from other comments I figured it might be GitHub specific.


Has someone had a closer look and can say how this compares to Git-Annex?


This and git-annex (and git-fat and others) use the same basic architecture of storing links in Git and schlepping the binaries around separately.

Git-annex renames binaries with their SHA256 hashes, puts them in a .git/annex/ dir, and replaces files in the working dir with symlinks. Git-LFS seems to use small metadata pointer files (SHA256 hash, file size, git-lfs version) instead of symlinks. Not sure whether the files reside in something like the .git/annex/ dir; I'm guessing they do, or there wouldn't be those pointer files.

You can clone a repo without having to download the files.

With git-annex you can sync between non-bare and bare repositories without having a central server. Git-LFS seems to have a separate server for binaries. It looks like it may act like a git-annex special remote rather than git-annex's usage of synced/master branch.

Git-annex repos share information about the locations of annex files, how many repos contain a given file, etc. You can trust and un-trust repos. It doesn't look like Git-LFS offers this.

Git-LFS has a REST API. I'm using an old version of git-annex, so I can't say whether it has one (I think it does).

Git-LFS is written in Go, git-annex is Haskell.

Git-LFS is a GitHub project. GitHub will offer object hosting.

Update: clarity, speling, added a bullet point.


The lack of location tracking looks like the most significant difference to me. While the git-lfs documentation does mention that different git remotes can have different LFS endpoints configured, all git-lfs knows about a file is its SHA256. So how can it tell which remote to download the file from? The best it could do is try different remotes until it finds one that has the file.

I hesitate to say this means git-lfs is not distributed at all, but it seems significantly less distributed than git-annex, which can keep track of files that might be in Glacier, or on an offline drive, or a repo cloned on a nearby computer, and so can be used in a more peer-to-peer fashion when storing and retrieving the large files.


At a minimum, the metafile should include a URI, stating the last known location of the file, rather than just a bare SHA256.


That doesn't work very well. Consider what happens if two different clones of a repo update the metadata's last-known URL at the same time with different URLs. Merge conflicts.

This is why git-annex uses a separate branch for location tracking information, which it can merge in a conflict-free manner.


Late reply: that's fine, as long as there is some information as to where the location is, and not GitHub-only.


git annex uses symlinks in indirect mode, but can use small files in direct mode (useful for filesystems which don't support symlinks, like many Android sdcards).


I think it is much closer to git-fat [1] (as jefurii mentions).

I've been trying out git-fat on a large repository that has some binaries in it. Since we are already using ssh to access the server (and have the usual ssh-agent setup), it was easy to integrate it with scp.

Unfortunately, the performance for our workload (45K files, 5GB total) with git-fat wasn't much different than plain git.

It seems most of my problems stem from the number of files, rather than the size. If I had a smaller number of very large files, git-fat might be a good solution.

[1] https://github.com/jedbrown/git-fat


After a quick scan, I'm a bit worried that this is too tied to a server in practice. For example, if I've downloaded everything locally, can I easily clone the whole download (including all lfs files) into a separate repo? If I can, can changes to each be swapped back and forth?


Our solution is likely a lot more duct tape-y, but we developed a straight-forward tool in Go for managing large assets in git: https://github.com/dailymuse/git-fit

There's a number of other solutions open source out there, some of which are documented in our readme.


Can someone explain to me what problem this solves, in layman's terms? How are version control systems "impractical" for large files?

Or to put another way, what problems will I run into if I just commit large media files without using this?


With distributed version control systems such as Git or Mercurial, when you clone a repository you get the entire history of that repository (or of a selected branch). This means that if you place large media files directly in the repository, then every clone will contain each and every revision of that file. In time, this will cause an enormous amount of bloat in your repository and slow work on the repository down to a crawl. Cloning a repository several dozens of gigabytes in size is no fun, I can tell you.

Centralized version control systems such as Subversion don't have this problem (or at least, to a lesser extent), because as a user you only download a single revision of each file when you check out the repository.

Extensions like git-media, git-fat and now git-lfs solve this issue by only storing references to large media files inside the Git repository, while storing the actual files elsewhere. With this, you will only download the revision of the large file that you actually need, when you need it. It's sort of a hybrid solution in-between centralized and decentralized version control.
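For illustration, what actually gets committed in the Git LFS case is a tiny pointer file like this (per the spec quoted upthread), while the real bytes live on the LFS server:

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```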


Is there any hint on pricing? Slightly annoying to have a section titled "Pricing" which ... doesn't tell you the price. I would much rather use my own external server for hosting large files; it is going to need to be price-competitive with other options to be interesting, I would think.


What's stopping git from storing large files using Merkle trees + a rolling hash?

I'm probably missing something, since there's this, and git-annex, and git-bigstore, and others...


Does this mean gamedevs might start dropping Perforce for this? If it's not too expensive, maybe?


This is certainly the issue that is preventing game devs from adopting Git.

On the other hand game devs at this point are very used to Perforce, and it looks like Perforce is interested in solving this problem from the other side, by adding Git features to Perforce Helix and making it distributed.


Helix Versioning Engine is a native DVCS, in addition to the Git management solution that is part of the product. Choice of workflow, combined with efficiency in large-file, large-repo handling -- definitely interested in solving the problem right, instead of using a band-aid.


Most developers I know dropped Perforce for git and svn a long time ago. Configure them with the appropriate ignores and it works fine.


This hasn't been my experience working in AAA console games, although I could definitely see it being the case in mobile. What developers do you know?


Agreed, a lot of developers still use Perforce in my experience (my own indie studio included). I don't think this will make anything better for game developers generally. The big stumbling blocks are: Git is not artist/non-programmer friendly; and an inability to lock a file when editing (specifically binary files or files that are difficult to merge, e.g. complex level files like Unity's).


Mobile and mid-tier console developers. Not AAA, but with budgets firmly in the "you spent what, for that?" range.


I like the ease of use of 'git lfs track "*.psd"' and being able to use normal git commands after that.

Would it be possible to extend git-annex with a command that lets you set one or more extensions? By using git hooks you can probably ensure that the normal git commands work reliably.
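For context, `git lfs track "*.psd"` just appends a filter rule to `.gitattributes`, something like:

```
*.psd filter=lfs diff=lfs merge=lfs -text
```

From then on, Git itself routes matching files through the LFS filters on add/checkout, which is why normal commands keep working.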


To celebrate the broader support for large files in Git, we just raised the storage limit of GitLab.com to 10GB: https://about.gitlab.com/2015/04/08/gitlab-dot-com-storage-l... Also, we're glad GitHub open-sourced it and didn't call it assman.


Someone asked if this was temporary or permanent; it is permanent, see https://news.ycombinator.com/item?id=9344984


Has anyone seen what happens for a user who doesn't have this installed when cloning? I've tried it out but it seems to not affect local clones.


Does Github really do this using git's "smudge" and "clean" filters? That would mean reprocessing the whole file for each access. That's inefficient. It's useful only if someone else is paying for the disk bandwidth, and necessary only if you don't have control of the storage system. Why would GitHub do that to itself?
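(For reference, the smudge/clean wiring that `git lfs install` sets up in your Git config looks roughly like this; exact flags may vary by version:)

```
[filter "lfs"]
    clean = git-lfs clean -- %f
    smudge = git-lfs smudge -- %f
    required = true
```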


BitKeeper has had a better version of this since around 2007. Better in that we support a cloud of servers so there is no "close to the server" thing, everyone is close to the server.

What we don't have is the locking. I agree with the people commenting here that locking is a requirement because you can't merge. We need to do that.


Can't wait to see what Linus has got to say about this. I suppose he's got an arguably better solution to the problem?


My guess is that Linus doesn't care about large binary files.


Yeah I've read interviews where he's said that he wrote git for his use case, which I imagine doesn't include large binary files. I don't think he'd really care.


Linus has a family, and so pictures etc. So he should care. ;)


Wouldn't it be nicer if we had something like this on the level of the filesystem, instead of on the level of a version control system? Advantages would be that git and any other user-space application wouldn't need much extension, and files could be opened as if they were on the local file system.


> Every user and organization on GitHub.com with Git LFS enabled will begin with 1 GB of free file storage and a monthly bandwidth quota of 1 GB.

A GB doesn't get you very far if you are working with raw audio and video.

Does it make sense to think about storing virtual machines images (.vmdk) in git on GitHub with LFS?


I still don't get why you wouldn't just check large binaries into a submodule and host that everywhere you would an annex/LFS.


Yes! Bringing us all one step closer to the whiysi (we host it you store it) dev paradigm. Bravo!


Cool! Can't wait for future integration of content-addressable systems like ipfs :)


This is very handy for designers who want to use Photoshop or Illustrator with Git.


Hopefully bup will implement something like this for backups.


Does this mean github could become useful for music production?


So does this mean the large files are actually versioned?


From how I read it, it sounds like a little of yes and no. It's similar to Git's model, but I'm not sure whether they will really keep all versions of the old large files. I guess if they want to remain fully backward compatible -- being able to go backwards in Git history -- they have to...


Can it link to torrent?


Ah, cool! At last I will be able to store my database backups in GitHub.


Hopefully they will provide such functionality, but I don't think it will be in the near future.


My name is Lars and I do projects for LiveIT! This is exciting.



