Avoid Git LFS if possible (gregoryszorc.com)
143 points by reimbar on May 12, 2021 | 140 comments



Something missing from the list of problems: Git LFS is an HTTP(S) protocol, so it's problematic at best when you are using Git over SSH[1].

The git-lfs devs obviously don't use ssh, so you get the feeling they are a bit exasperated by this call to support an industry standard protocol which is widely used as part of ecosystems and workflows involving Git.

[1] https://github.com/git-lfs/git-lfs/issues/1044


That issue has come a long way though, already a draft PR! This seems like it could actually happen.


Did he really just try to make the argument that we shouldn’t use LFS because Git will have large file support at some unspecified point in the future?

LFS has existed for several years, and as far as I know Git still doesn’t have support for large files. At this point I’m not holding out much hope.


git supports large files; it just can't track changes in binary files efficiently, and if they're large you check in a new blob with every modification.

If they're just sitting around it's fine, but then why would you have them in VC


It tracks the changes fine. It’s just that it doesn’t make sense to track changes in a binary.


It does make sense, and there are forms of delta compression particularly suited to various binary formats, which, if combined with an unpacker for compressed files, work very well. However, git does not have an efficient binary diff implemented yet.

LRzip happens to have such a format preprocessor that would make for exceedingly efficient binary history, at the cost of being more similar to a git pack file than to incremental versions.

Then again, GitHub in particular sets a very low limit on binary size in version control.


In embedded, almost everybody uses efficient binary delta diffs and patching for DFOTA (delta firmware over-the-air updates). Jojodiff exists in GPL and MIT variants.

http://jojodiff.sourceforge.net/

https://github.com/janjongboom/janpatch

Rsync is also very popular, even if not that efficient. xdelta, bsdiff, BDelta, bdiff are all crap.


So I decided to check this out. I used dd if=/dev/random to create a 100MB file, checked that in, used dd again to modify 10MB of that file, checked that in, and the result was two 98MB objects.

Tracking changes of binaries makes a lot of sense if you use that to only store incremental changes to the file. Git stores each modification of a binary file as a separate blob since it doesn't know how to track its changes.

This is mitigated in large part by the compression applied in git gc: after packing, the objects went from 196MB to 108MB.
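
Roughly what that experiment looks like end to end, if anyone wants to reproduce it (sizes and filenames are arbitrary):

    git init lfs-blob-test && cd lfs-blob-test
    dd if=/dev/random of=blob.bin bs=1M count=100               # ~100MB of incompressible data
    git add blob.bin && git commit -m "add blob"
    dd if=/dev/random of=blob.bin bs=1M count=10 conv=notrunc   # overwrite the first 10MB in place
    git add blob.bin && git commit -m "modify blob"
    git count-objects -vH   # two loose objects, each roughly the full file size
    git gc                  # repacking/recompression shrinks the total considerably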


Git LFS has the advantage of not pulling all versions of a large file, too. Instead, it only pulls the version it's checking out.

In our project it helped dramatically, as you only pull X MB instead of X * Y MB when CI or a developer clones the (already big) repo.


This is true. Git-LFS can dramatically increase the size of the repository on-disk (e.g. in our GitLab cluster), but dramatically decrease the size of the clone a user must perform to get to work.

Note that this can now be accomplished with Git directly, by using --filter=blob:none when you clone; this will cause Git to basically lazy-load blobs (i.e. file contents), only downloading blobs from the server when necessary (i.e. when checking out, when doing a diff, etc.).
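
For example, a minimal sketch (the URL and branch name are placeholders):

    # blobless clone: commits and trees come down, file contents are fetched lazily
    git clone --filter=blob:none https://example.com/big-repo.git
    cd big-repo
    git log --oneline             # works offline: the commit history is fully local
    git checkout feature-branch   # hits the network: blobs for this commit are downloaded now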


I prefer keeping large files out of source control and thus far I've not encountered a problem where their introduction has been required.


While I share the sentiment of keeping large files out of source control, one use-case I believe warrants having large files in source control is game development.


that's about the only conceivable niche I can think of, and even then I'm skeptical about turning the VCS into an asset manager

You can't diff them, and I'm not convinced the VCS should carry the burden of version-controlling assets. Seems better to have a separate, dedicated system for such purposes

Then again I don't do game development so I'm not familiar with the requirements of such projects


I despise LFS.

I’m sure that if you know how to use it... maybe... you can figure it out.

That said; here’s my battle story:

Estimate the time it’ll take to move all our repositories from a to b they said.

Us: with all branches?

Them: just main and develop.

Us: you just clone and push to the new origin, it’s not zero but it’s trivial.

Weeks later...

Yeah. LFS is now banned.

LFS is not a distributed version control system; once you use it, a clone is no longer “as good” as the original, because it refers to a LFS server that is independent of your clone.

...also, actually cloning all the LFS content from GitLab is both slow and occasionally broken in a way that requires you to restart the clone.

:(


I would rather maintain a handwritten journal of 1s and 0s than use git LFS again


honestly had no idea what git LFS was before seeing this, but okay then


You can easily mirror LFS objects by just using

git lfs fetch --all
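
A full move to a new host then looks roughly like this (URLs are placeholders, and your hosting setup may need different flags):

    git clone --mirror https://old-host.example/repo.git
    cd repo.git
    git lfs fetch --all                            # grab every LFS object for every ref
    git remote add new-origin https://new-host.example/repo.git
    git push --mirror new-origin                   # push all branches and tags
    git lfs push --all new-origin                  # push all LFS objects to the new endpoint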

Every modern git hosting server now has support for LFS directly inside the git server (GitLab, GitHub, Gerrit to my knowledge).

This solves the authentication issue nicely and makes it easy for developers.

git lfs is starting to be adopted by vendors and is becoming usable. It solves a real problem when you are tired of having to double your git server CPUs every 6 months because your git upload-packs spend huge amounts of time trying to recompress those big files over and over.


> It solves a real problem when you are tired of having to double your git server CPUs every 6 months

Putting blobs into an SCM was always a bad idea, but in git it's particularly bad because the whole tree is always checked out at once. (I think last year a major change was added that makes blob handling slightly more efficient, though.)

Still, I think there is no independently developed Git LFS server. I know most people don't run git servers themselves, but in the LFS use case it actually makes sense. There is also git-annex, which is completely open and free, but adoption is very poor and the handling is even more obscure.


What is your alternative then? Version control binary files and have your repos grow gigabytes?


If you're rewriting files and need the version history, yes.

If you're not rewriting the files, also yes.

If you don't need the history, put them on a normal web server.


If you need history then still put them on a web server and increment the filenames. Storing large files in a git repo is a misappropriation of the tool. It wasn't designed for that use case.


I mean, for me I'd like the convenience of having it all together. Oftentimes it would suffice if I could just store the current version of a binary efficiently in git, marking it so that git forgets the previous versions.

Currently I'm using an artifactory for this, but it would be much nicer if this could be integrated.


The latest version of git has a feature called "partial clones" that is very similar to what the author describes for Mercurial. All the data is still in your history, no extra tools are needed, but you only fetch the blobs from the server for the commits you check out. So just like with LFS, larger blobs not on master are effectively free, but you still grab all the blobs for your current commit.

You need server side support, which GitHub and GitLab have, and then a special clone command:

    git clone --filter=blob:none 
Some background about the feature is here: https://github.blog/2020-12-21-get-up-to-speed-with-partial-...


This looks too aggressive. The nice thing about git-lfs is that only the binary file type(s) you care about are run through git-lfs. All other ordinary diffable text is treated normally.

The blobless clone is going to be ensaddening the next time that I'm examining the history of some source code when I'm hacking away without a network connection.


You can mitigate a bit of this by only ignoring blobs over a certain size like "--filter=blob:limit=256k" which should allow most ordinary text files through.

In the end it's the same as LFS though, in that examining old commits without a network is a bummer. No free lunch here besides something a bit more complex like git-annex.


Closer, but we're still relying on a proxy for the developer's intent.

The gitattributes file provides a version-controlled and review-controlled mechanism to decide exactly which objects get special treatment and which ones don't. Since it's a part of the repository itself, you don't have to remind new developers to specify some unusual arguments to git at clone time to avoid a performance disaster.
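
For example, a .gitattributes that routes only the heavyweight asset types through LFS (the extensions are just an illustration; these are the lines `git lfs track` writes):

    *.psd filter=lfs diff=lfs merge=lfs -text
    *.fbx filter=lfs diff=lfs merge=lfs -text
    *.wav filter=lfs diff=lfs merge=lfs -text
Everything that doesn't match stays plain git.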

> In the end it's the same as LFS though, in that examining old commits without a network is a bummer.

Except for a crucial detail: The tools I use to examine history are the log, diff[tool], and blame. All of those tools continue to function normally on an LFS-enabled offline clone. IIUC, `--filter=blob:none` doesn't work at all, and `--filter=blob:limit=256k` is a proxy that almost, but doesn't quite, work.


All three points are really just the same point repeated three times: That it isn't part of core/official GIT ("stop gap" until official, irreversible to later official solution, and adds complexity that an official version would lack due to extra/third party tooling).

I'm frankly surprised GIT hasn't made LFS an official part by now. It fixes the problem, the problem is common and real, and GIT hasn't offered a better alternative.

If LFS was made official it would solve this critique, since that is really the only critique here.


> All three points are really just the same point repeated three times

Absolutely not. Having worked with Mercurial LFS and Git LFS, the differences seem subtle but they are there. Basically,

In Mercurial, LFS is (to an extent) an implementation detail of how you check out a repository. It doesn't mean altering the repository content itself (the data); it just means altering how you get that data. Contrast with Git LFS, where the data itself must be altered in order to become LFS data, and the "LFS flag" is recorded in history.

This is not something that you would solve by upstreaming LFS. You would need to redesign LFS.


LFS is definitely outside the scope of git.


It shouldn't be though.


Why do you think it should be within the scope of git?


Because it's something that lots of people want to use Git for, and because it would work better if it were a core part of Git.


okay, but why?

Lots of people want to use document databases as if they were relational, lots of people want to use their RDBMS as a file server and lots of people use spreadsheets for just about everything.

Lots of people wanting to use a product in a certain way doesn't mean it's a good idea, nor that someone else should make that work for them that way


Ok but those are unreasonable things to do. Versioning large binary files is clearly not an unreasonable thing to want to do.


No but that doesn't mean a decentralized VCS is the best tool for the job. A file server may be sufficient and more efficient


Maybe I want to use vim as an interactive WYSIWYG modeling program but that doesn't mean it's in the scope of the project.

LFS belongs in your build scripts; the model the git extension uses doesn't even match the git VCS model.


Version control of large files is clearly a reasonable thing for a version control system to do.

I think if you introspect you'll see you are defending a flaw in something you like with spurious technical objections because you don't want to admit it isn't perfect. That's understandable. Happens a lot.


I didn't say versioning large files isn't a reasonable thing to do, I said it doesn't make sense in git.

How is it supposed to work? The LFS way where you're storing a pointer to an external http resource? Just put a script in your repo to fetch that then.

Or maybe stick the data in git's merkle tree and have a really slow repo? Why bother with LFS then?


> How is it supposed to work?

One obvious improvement would be for Git to use the hash of the object rather than the pointer file when calculating tree hashes. That makes the storage method for the actual files independent of the commit hashes.

People have mentioned several other VCSs in this thread that do it already.


git-annex is an interesting alternative if the HTTP-first nature of Git LFS and the one-way door bother you.

You can remove it after the fact if you don't like it, it supports a ton of protocols, and it's distributed just like git is (you can share the files managed by git-annex among different repos or even among different non-git backends such as S3).

The main issue that git-annex does not solve is that, like Git LFS, it's not a part of git proper and it shows in its occasionally clunky integration. By virtue of having more knobs and dials it also potentially has more to learn than Git LFS.


Despite loving the idea of git-annex and having tried it multiple times in my workflows, it was really too complex to wrap my head around for simple use cases. Also I never got it to run on Cygwin, which is essential to me, because it heavily uses symlinks (haven't checked if the Windows version finally supports native symlinks). The examples are all about non-technical creative people, but I never managed to explain it to anyone who just wanted to check large amounts of graphics into a git repo...


I don't know about the windows problem, but git annex can be configured to be pretty simple to use. See:

https://bryan-murdock.blogspot.com/2020/03/git-annex-is-grea...


The last time we used git-annex was a few years ago, and it was too decentralized: the "sync" command that we used to download the remote content would also upload information about the current state.

This means there were no read-only operations: you just want some files, and that throwaway clone or CI machine gets recorded into the global repo state. If you are not careful, it will be propagated forever and will appear in the various reports.


That's the whole premise of git-annex: not distributing content but distributing what machine has the content. If you just want to get the content you have to hack git-annex, probably by reading the manifest, to get the url and download content in a third party process


With git-annex you have the same "one-way door" behavior: it replaces large files with a pointer to the content (in git-annex, it's a relative symbolic link which by default encodes the real file's size and hash), which is stored in git-annex's own database.


Sort of. The way that the author talks about Mercurial as not having this problem makes me think they're talking about something related but subtly different. In particular, AFAICT, Mercurial requires the exact same thing as what you're pointing out. If you want to completely disable use of largefiles then you still have to run `hg lfconvert` at some point. That also changes your revision history.

The "one-way door" as I understand the article to be describing is talking about the additional layer of centralization that Git LFS brings. In particular it's pretty annoying to have to always spin up a full HTTPS server just to be able to have access to your files. There is now always a source of truth that is inconvenient to work around when you might still have the files lying around on a bunch of different hard drives or USB drives.

Whereas with git-annex, it is true that without rewriting history, even if you disable git-annex moving forward, you'll still have symlinks in your git history. However, as long as you still have your exact binary files sitting around somewhere, you can always import them back on the fly, so e.g. to move away from git-annex you can just commit the binary files directly to your git directory and then just copy them out to a separate folder whenever you go back to an old commit and re-import them.

But perhaps I'm interpreting the author incorrectly, in which case it's hard for me to see how any solution for large files in git would allow you to move back without rewriting history to an ordinary git repository without large file support.


> so e.g. to move away from git-annex you can just commit the binary files directly to your git directory and then just copy them out to a separate folder whenever you go back to an old commit and re-import them.

Exactly. Here's an (anonymized) example of a git-annex symlink from one of my repos:

    ../../.git/annex/objects/AA/BB/SHA256-s123456--abcdf...1234/SHA256-s8968192--abcdf...1234
It's just a link to a file with a SHA256 hash in the name and path. The simplest way to reconstruct that in the future is to just check-in the whole `objects` directory into the repo, and copy/symlink it back to `.git/annex` when needed. You definitely don't need the git-annex software itself to view the data in the future.

I personally have hundreds of gigabytes of data in git-annex repos. It works great!


I don't think it's clear, but Mercurial has two solutions for large file support: the original "largefiles", which has all the same design decisions and issues as Git LFS that they bring up in the blog post, and "lfs", which is newer.

I've used largefiles and ran into these issues, and ended up having to turn it off after a few years because it's so problematic with the tooling, since it modifies the underlying Mercurial commit structure like Git LFS does.

However it sounds like mercurial lfs is different in that it only modifies the transport layer, though I’m not totally clear on the details and have been meaning to look into it further.


To preface: though I've read a fair amount about Mercurial, I can count on my fingers the number of times I've actually used a Mercurial repo and I've used largefiles only ever as a toy, so I am very much a Mercurial newbie. So there is a chance I may get something wrong here.

However, my impression is that in fact largefiles is basically the only game in town and Mercurial LFS if anything is meant to be even more like Git LFS to the point of being compatible with it.

The thing I'm more curious about is I don't immediately see how large file support in git (or mercurial), whether implemented as a separate tool or natively, could ever feasibly be "transparently erasable," that is rewindable back to be absolutely identical to a repository with no large files support without rewriting revision history.

It doesn't seem impossible (e.g. maybe you could somehow maintain a duplicate shadow revision history and transparently intercept syscalls?), but the approaches I can think of all have pretty hefty downsides and feel even more like hacks than the current crop of tools.


That content can easily be moved in bulk though. It is true that you have to use git-annex command to do so, but this is different from LFS where the complete set of historical files is only stored on the server and can't be moved at all.

edit: The article claims it's a "one-way door" because you can't move to an altogether different system without rewriting history, which is true of git-annex. My bad.


I've been using Git LFS with several large Unity projects in the past several years. Never really had any problems. It was always just "enable and forget" kind of thing.


Yeah, unless you can avoid large files entirely, or are okay with a separate tool (this pisses off a lot of devs IME), then just don't use git? I don't like that option. I think this is sensational. At this point LFS is a really big deal. I don't think it's going anywhere or that LFS users will be shafted.


This is my experience as well so far. It took 15-20 minutes to learn about it, install it, and set up configs. Since then I haven't had to think about it once.


A side topic: is there a concrete reason why github's LFS solution has to be so expensive?

IIRC, it's $5 per 50GB per month? That's really a deal breaker to me, and I'm wondering whether people who actually use LFS at volume avoid LFS-over-GitHub.


$60 / year for a decent fraction of a hard disk and the associated backup resources, seems pretty fair to me. What price would you expect?


> $60 / year for a decent fraction of a hard disk and the associated backup resources, seems pretty fair to me.

What you describe sounds fair to me. The problem is that 50GB is not a decent fraction of a hard disk.


I don't know if you remember back when service providers would put customers' entire dataset on a single set of spinning platters. The drive would of course die and the customer would get super pissed off when the provider would say "You were supposed to keep a backup..." So now providers like GitHub and GitLab are basically super-redundant storage, network, and application providers, who also happen to run Git.

If you store 50GB in AWS S3 (US-East-2), download 1000GB, do 100 PUT operations and 1000 GET operations, the cost is $89.68 per month.

Considering that GitHub isn't just providing you with storage, but a complete Git LFS solution plus storage, plus traffic that you can just use and not think about, I think it's worth the expense. But then again I probably wouldn't store binary blobs in Git.


The number you're citing is basically entirely bandwidth, which has two main problems.

One is that amazon has an enormous markup on bandwidth, compared to their other products.

The other is that GitHub does not actually let you download each file 20x in a month and "not think about" it. 50GB of space for a month only gets you 50GB of bandwidth.

If Amazon didn't explicitly ban people from using Lightsail bandwidth with other services, you could put together an all-AWS package that has 150GB of high quality S3 storage and enough bandwidth to download it 2-3x per $5 (minimum order quantity 2). For a service like B2 you could store 250GB twice (each copy having its own cross-server RAID) and download it once for $5. At digitalocean $5 will get you 250GB of probably-redundant data with 1TB of bandwidth, though it eventually tapers off toward 167GB/$5.


Just be thankful you are not on Gitlab. We pay $60/year for 10GB.

I need the data for my team, so we pay, but I use it as an excuse to force the team to clean up data every so often. As a game company we need large repos. It was either Gitlab or Azure DevOps.


$60/year is $5/month. You said “game” and “company” — how is that not a tiny fraction of your total costs?


Nowadays game assets are just crazily huge. AAA games tend to ship at around 50GB. Now think about the raw asset sizes and saving all the intermediate revisions.

It's quite easy to burn 1000GB of storage on GitHub/GitLab (again, don't forget all the revisions). That puts just the storage cost at $6000/year. At this price point, it's really worth hosting on your own.

As Danieru mentioned, they are forcing people to do manual cleanups; that probably indicates the storage costs are even higher, high enough that manual intervention is worth it.


Can I afford it? Yes. Do I need it? Yes.

Am I happy paying 6 dollars a year for a single gigabyte? I'd rather pay less.

Edit: I pay attention to costs, and repo size is the sort of thing which has a habit of growing. All else being equal I'd prefer my team receive the money. A dollar I can give to my team/employees feels good, a dollar paying for overpriced storage feels bad.


Doesn't that also apply to, say, GitLab paying their team? The pricing for a highly-available service isn't as cheap as it could be but it doesn't seem terribly far off of S3's pricing.

I know that a small game studio must have a tight budget — and yours looks like a really interesting project (just subscribed) — but it seemed like an awfully strong objection to what I would have assumed would be a small fraction of your total expenses.


It's good for GitLab to pay their teams, but that stops being a justification to pay them more once the profit margin for a service gets high enough.

> it doesn't seem terribly far off of S3's pricing

GitHub's offering is close to S3 but only because AWS charges so much for bandwidth. The storage portion is less than a quarter of the equivalent bill.

And then GitLab is charging 5x as much as GitHub.


The problem is that it's per ten gi-- actually, looking it up, I found a page saying there is a 10GB cap!


It's a very large markup on the small-user retail cost of the basic thing they're providing (web-accessible, access-controlled file storage—see, for example, BackBlaze B2) but that's utterly typical of services that can get away with charging you a "convenience fee" for that sort of thing once you're on their SaaS. 2-3x markup isn't unusual, and that's about what this is, and that's above typical retail—even if GH's not managing the storage and such themselves, they're likely getting an even better (bulk) rate.


Yes, I would estimate that the markup is more like 5-10x for GitHub LFS.


Exactly. For us we are dealing with lots of large gaming assets and those are burning through those $50 data packs like butters :)


Yeah, I actually wrote my own file-chunking, general git-lfs-like backend for this exact reason. I liked Git LFS, but GitHub's pricing felt insane for my indie dev work. For my needs I could back up onto a local server, network drive, or whatever, at an insanely cheaper price.

Hell, even uploading to an S3-compatible API was insanely cheaper than GitHub.

That, and I really hated the feeling that Git LFS was being designed for a server architecture. I didn't have an easy way to locally dump the written files without running an HTTP server.

There are a couple of Git LFS servers that upload to, say, S3 - but I really just wanted a dumb FS or SSH dump of my LFS files. Running a localhost server feels so... anti-Git to me.


Do you have a repo for it?


I do, but I'm not comfortable sharing it. It's an experimental project, very new, and not polished enough for general use. A POC if you will. Plus this is an anonymous account :P

While I was writing it I found the basic process of large file storage to be insanely simple with Git. I debated doing the same thing but backed by a seasoned backup solution, like Borg/Bup/etc.


Consider S3 storage and egress costs. You’re paying a flat rate to store and then pull that 50GB data (edit: removed an incorrect statement here).


Just for clarity, it is 50 GB of storage and 50 GB of bandwidth.

So definitely not "as much as you want." If you pull it too many times you may get charged another $5.


For S3, that's $1.15 storage ($0.023/GB) + $4.50 transfer ($0.09/GB) so it's actually very comparable.


Yikes. Being comparable to amazon in price, purely because of how much amazon overcharges for bandwidth, is not exactly a flattering comparison.

If you were to add the $1.15 storage cost onto a reasonable bandwidth number like $0.01/GB you'd just about reach a third of $5.


I can't argue with that. I didn't expect the prices to align when I researched it.


Thanks for the correction!


Just this past week, git lfs was throwing smudge errors for me. Not really sure what the issue was, I followed the recommendations to disable, pull, and re-enable. And got them again. So I disabled. And left it disabled.

Not a solution.

This said, the whole git-lfs bit feels like a (bad) afterthought the way it's implemented. I'd love to see some significant reduction of complexity (you shouldn't need to do 'git lfs install', it should be done automatically), and increases in resiliency (sharding into FEC'ed blocks with distributed checksums, etc.) so we don't have to deal with 'bad' files.

I was a fan of mercurial before I switched to git ... it was IMO an easier/better system at the time (early 2010s). Not likely to switch now though.


I would say that if you care about good LFS support, that is a sufficient reason to use Mercurial. Harder to find Mercurial hosting these days, though, but I'm not worried that the Mercurial project will die off (since both Facebook and Google use it, in some manner).


Is Facebook still using mercurial? It seems that there was a blog post about it in 2014, but their repo[0] just seems to say that their codebase was originally based on/evolved from mercurial.

[0] https://github.com/facebookexperimental/eden


I think this is just really the demarcation problem. At what point does it stop being “Mercurial”?


Losing Facebook as a contributor and community member is the larger loss, regardless of how much their internal vcs resembles or integrates with mercurial.


This is really overstating the cost of a one-time setup step. History rewriting is only necessary for preexisting projects and you can use things like GitLab’s push rules to ensure that it’s never necessary in the future.

I get that a mercurial developer has different preferences but I don’t think that this is an especially effective form of advocacy.


Okay, so I should avoid it. What is the alternative?

I see so many git repos with READMEs saying download this huge pretrained weights file from {Dropbox link, Google drive link, Baidu link, ...} and I don't think that's a very good user experience compared to LFS.

LFS itself sucks and should be transparent without having to install it, but it's slightly better than downloading stuff from Dropbox or Google Drive.


According to the article you should use mercurial or PlasticSCM because otherwise you might have to rewrite your history to get to some hypothetical git solution that isn't even on the roadmap.

I think I'll stick to LFS.


Some combination of the following two features:

Partial clones (https://docs.gitlab.com/ee/topics/git/partial_clone.html)

Shallow clones (see the --depth argument: https://linux.die.net/man/1/git-clone)
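
Side by side, the two look roughly like this (the URL is a placeholder):

    # partial clone: full history, but file contents are fetched on demand
    git clone --filter=blob:none https://example.com/repo.git

    # shallow clone: only the most recent commit(s), no deep history at all
    git clone --depth 1 https://example.com/repo.git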

The problem with large files is not so much that putting a single 1GB file in Git is an issue. If you just have one revision of it, you get a 1GB repo, and things run at a reasonable speed. The problem is when you have 10 revisions of the 1GB file and you end up dealing with 10GB of data when you only want one, because the default git clone model is to give you the full history of everything since the beginning of time. This is fine for (compressible) text files, less fine for large binary blobs.

Git-lfs is a hack and it has caused me pain every time I've used it, despite GitLab having good support for it. Some of this is more an implementation detail - the command-line UI has some weirdness to it, and there's no clear error if someone doesn't have git-lfs when cloning, so something in your build process down the line breaks with a weird error because you've got a marker file instead of the expected binary blob. Some of it is inherent though - the hardest problem is that we now can't easily mirror the git repo from our internal GitLab to the client's GitLab, because the config has to hold the HTTP server address where the blobs live. We have workarounds but they're not fun.

The solution is to get over the 'always have the whole repository' thing. This is also useful for massive monorepos because you can clone and checkout just the subfolder you need and not all of everything.

I say this, but I haven't yet used partial clones in anger (unlike git-lfs). I have high hopes though, and it's a feature still in its early days.


I found using git-lfs only in a subrepo worked well, since subrepos by default are checked out shallow.


DVC [0] is great for data science applications, but I don't see why you couldn't use it as a general-purpose LFS replacement.

It doesn't fix all of the problems with LFS, but it helps a lot with some of them (and happens to also be a decent Make replacement in certain situations).

[0]: https://dvc.org/


If you are like most people, you use systems that speak git (from Microsoft, JetBrains, GitHub, Atlassian…) but rarely, or less fluently, anything else. So the problem I'm trying to solve isn't "which VCS lets me work well with large files" but rather "I'm stuck with git, so what do I do with my large files".

Your options are basically Git LFS, possibly also VFSForGit, or putting your large files in separate storage.


It's easy enough to script the download of external files, I'm not sure I see what the big deal is here.

To me, most cases of large files in VCS seem like using a hammer as a screwdriver.


I'm honestly super content with LFS. Wrote our own little API server to hook it up to Azure Blob Storage, never have issues with it. I don't recognize the issues mentioned in the article at all. Our whole team relies on it for years, and it delivers. No problems. Keep up the great work, git-lfs maintainers! Much love.


If you have a hundred images in git, and one cannot be downloaded for any reason, git smudge will not be able to run, and you won't be able to git pull at all.

We had an image on AWS go bad, still not sure how. Our devs lost the ability to pull. Disabling LFS could not be done (because of rewriting history). "disable smudge" is not an official option, and none of the hacks work reliably. We finally excluded all images from smudge, and downloaded them with SFTP. Git status shows all the images as having changed, and we are downright unhappy...

I would be happy to hear that I just don't know how to use LFS - but even if so, that means the docs are woefully not useful.

I want to: 1) Tell LFS to get whatever files it can, and just throw a warning on issues. 2) If an image is restored without using LFS, have git still know the file has not been modified (by comparing the checksum or whatever smudge would do).
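
For reference, the general shape of those hacks (the env var and exclude path here are illustrative):

    GIT_LFS_SKIP_SMUDGE=1 git clone https://example.com/repo.git   # checkout succeeds with pointer files only
    cd repo
    git lfs pull --exclude="assets/broken-image.png"               # try to fetch what you can, skipping the bad object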


As much as Git LFS is a bit of a pain, on recent projects I've resorted to committing my node_modules with Yarn 2 to Git using LFS and it works really well.

Note that with Yarn 2 you're committing .tar.gz's of packages rather than the JS files themselves, so it lends itself quite well to LFS as there are a smaller number of large files.

https://yarnpkg.com/features/zero-installs#how-do-you-reach-... https://yarnpkg.com/features/zero-installs#is-it-different-f...


Does yarn2 recommend also using LFS? Do you see any performance improvements when using LFS?


why are you committing packages?


Because why not? It's recommended in Yarn 2 and I don't see there being any downsides with Git LFS, as the files stored in Git are essentially pointers.


I would assume to prevent situations like the left-pad incident.


PMs are made for managing and hosting packages; VCSs are made for versioning source code. If you're checking packages into VC, you're going against the designs of both your PM and your VCS. It's a bad idea. Don't.

If you for some reason require redundancy of a package repo, then host your own.


The reason the author provides is in my opinion weak compared to both his alternatives.

Sure, lfs contaminates a repository, but so do large files, sensitive-data removal, and references to packages and package managers that might become obsolete or non-existent in the future. The chances of your project compiling after 15 years (the age of git, by the way) are very slim, and the chance that an entirely compilable history will be useful is even slimmer.

And I think the author's statement about setting up lfs being hard is exaggerated. It's a handful of command lines that should be in the "welcome to our company" manual anyway.

I've used lfs in the past, and while it can be misused, as with all other tools, it does the job without too many headaches compared to submodules and ignored tracked files.


My practice for storing large files with Git is to include the metadata for the large file in a tiny file(s):

1. Type information. Enough to synthesize a fake example.

2. A simple preview. This can be a thumb or video snippet, for example.

3. Checksum and URL of the big file.

This way your code can work at compile/test time using the snippet or synthesized data, and you can fetch the actual big data at ship time.

You can then also use the best version control tool for the job for the particular big files in question.
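
A minimal sketch of the pattern, with every name, URL, and hash below made up:

    # big-asset.meta -- the tiny file that actually gets committed
    #   type: mp4 video, 1920x1080, ~2.1 GB; preview: big-asset.thumb.jpg (40 KB, also committed)
    BIG_ASSET_URL=https://assets.example.com/big-asset-v3.mp4
    BIG_ASSET_SHA256=6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b

    # fetch step, run at ship time, verified against the committed checksum
    . ./big-asset.meta
    curl -fL -o big-asset.mp4 "$BIG_ASSET_URL"
    echo "$BIG_ASSET_SHA256  big-asset.mp4" | sha256sum -c -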


Is this just a manual equivalent of git LFS, or is there some advantage here?


This is pretty superior to git LFS in many aspects:

- You have file type and preview that you can use without getting the full thing

- You have a custom metadata for each file enforced by your scripts -- for example for archives, you may store the list of files inside. This will allow your CI tests to validate the references into the files without having to download the whole huge thing.

- You fully control remote fetch logic. Multiple servers? Migration rules for old revisions? That weird auth scheme that your IT insists on? It is all supported with a bit of code.

- You fully control local storage. Do you want a computer-wide shared CAS cache between multiple users? What if you have NAS that most users mount? Or maybe s3fs is your thing? Adding support is easy.

The main downside is that you get to do all the tooling and documentation, so I would not recommend this for the smaller teams. Nor would I recommend this for open-source projects.

But if your infra team is big enough to support this, you'll definitely have a better experience than with generic Git LFS.


It's a design pattern that ensures testability of the system without any dependencies on the big files.


Tools like git-annex or dvc support similar strategies.


Is git-annex still alive? Last time I tried to use it, it was very rough, and the official wiki (that serves as doc + bug tracker) gives database errors trying to create an account.

Details: I wanted to have a remote I can push to but anonymous users can only pull from, couldn't piece it together.


I've always liked the simplicity of git-fat [0]:

* Initial setup includes git filter rules so that "git add" automatically uses git-fat for matching files (no need to remember to invoke git-fat when adding/changing files).

* It works by rsync'ing to/from the remote. The setup for this is in a single ".gitfat" file, separate from the filter rules.

* You do need to run "git fat push" and "git fat pull"; this can probably be automated with hooks.

So just offhand without even trying to think about the "right" way to do what you want, the committed ".gitfat" could be to a read-only remote, then you can swap it with your own un-committed file for a push that has an rsync-writeable remote.

Also, the whole thing is a single 628-line python file, so worst case it would be easy to tweak it to read something like ".gitfat-push" and not have to manually swap it.

[0] https://github.com/jedbrown/git-fat
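
For reference, the moving parts look roughly like this (the patterns, host, and path are placeholders; the README above has the exact syntax):

    # .gitattributes -- route matching files through the fat filter
    *.png filter=fat -crlf
    *.iso filter=fat -crlf

    # .gitfat -- where the blobs actually live
    [rsync]
    remote = storage.example.com:/share/fat-store

    # then, per clone:
    git fat init   # installs the filter into your local git config
    git fat push   # upload blobs referenced by your commits
    git fat pull   # download blobs for your checkout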


Thanks I didn't know about this one! It seems to only support rsync though, so using it for public repositories would be difficult.


FWIW, rsync.net is currently deploying LFS support such that operations like:

  ssh user@rsync.net git clone blah
... will properly handle LFS assets, etc.

This is in response to several requests we have had for this feature...


This opinion only lists issues, not solutions. Sure, they advertise mercurial, but migrating from git to mercurial is unrealistic for many cases.

I'd title it: "Why Mercurial is better than git+LFS"


The author is detailing the problems wrt git-lfs, why they are problems and how those problems are overcome in a similar technical solution in a similar VCS. I think the original title is fine


Here's another fun one: https://github.com/git-lfs/git-lfs/issues/2434

> Git on Windows client corrupts files > 4Gb

It's apparently an upstream issue with Git on Windows, but if you depend on something, you inherit its issues.


Pushing Github past the 100mb limit has to be the most requested feature. Ridiculous that we have to use the fudge that is GitLFS.

It just adds complication for a limit that shouldn't be there anyway.


You can use GitLab instead, which has a limit of a few GB.


You can also self-host with Gogs/Gitea with no limit, but getting my company to move from GitHub would be a large undertaking. It's not worth it for Git LFS on its own.


The solution I've tended to use in classes (where there'll always be some student who hasn't installed LFS) is to store the large files in Artifactory, so they are pulled in at build-time in the same way as libraries.

This seemed to me a sensible approach, as Artifactory is a repository for binaries (usually, the compiled output of a project). It also seemed to me that the decisions on which versions to retain, when an update to a binary is expected, and when that resource is frozen so that a replacement would be a new version, are similar to the decision on whether a build is a snapshot vs. a release.
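
The build-time pull can then be as simple as a checksummed download in the build script (the coordinates and URL here are invented):

    # fetch the pinned asset version from Artifactory, like any other dependency
    curl -fL -u "$ARTIFACTORY_USER:$ARTIFACTORY_TOKEN" \
      -o assets/textures-1.4.2.zip \
      "https://artifactory.example.com/artifactory/course-assets/textures/1.4.2/textures-1.4.2.zip"
    sha256sum -c assets/textures-1.4.2.zip.sha256   # the expected checksum lives in the git repo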


If you just don't jump on random tech without good reasons, you already naturally apply this advice. Especially since once you really need it and also want Git, there is not much alternative (as the author recognizes). In this context, just waiting for a potential "better support for handling of large files" in official Git makes little sense; plus I make the wild prediction that what will actually happen is that it's Git LFS that will (continue to) be improved and used by most people (and maybe even get integrated into "official Git"?)


This is what I was thinking too. There's really nothing about Git LFS that should come as a surprise. Yes it rewrites history, but how else are you going to cut bloat from the repo after it's been stuffed in there? And the fact that the file is stored on something completely outside of git is clearly and concisely explained as the main text, directly above the download button, on https://git-lfs.github.com/

> Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.


> Git LFS is a Stop Gap Solution

Build the real thing then..


The author of the article is a Mercurial maintainer, and Mercurial has the "real thing" implemented already (and it has been part of Mercurial since at least 2012, at least in some form). So it's already done, just not for Git.


He does.

> Since I'm a maintainer of the Mercurial version control tool,


You don't need to rewrite history unless you weren't using LFS or accidentally committed large files to the repository. Nothing about LFS "requires" rewriting history.

Not to mention, many users are paying for a service that provides LFS, and hosting an LFS service isn't crazy hard. It's a file server with a custom API; it's mostly doable using S3 as a backend. It's not like this is crazy complicated stuff.


other stuff might require rewriting history


Yep. All of this. I tried using Git LFS for a project and reverted back to links to a cloud server for the large binary blobs, plus hashes of those blobs.


I keep hearing the mantra that "svn is better for large files than git" but never really understood why. To me a large file is a large file; if you make changes, worst case scenario you add the entire new file to the commit, best case you add some sort of binary diff. Does git do the former and svn the latter by any chance?


For a sizeable project, or one with lots of binary commits, eg. 1-10 GB sized items per day... that type of thing, you have to use LFS or SVN.

Your repo size could easily balloon to terabytes, for every clone. Additionally, I think there are other performance issues, but I don't allow this to happen, so I'm not sure.

SVN happily handles terabytes, due to the client-server interface. As does LFS. My biggest gripe with LFS is that it turns your distributed tool into a client-server one. I kinda wish they had an easy "skip lfs" type of option.


An svn working copy has one version stored locally. A git clone has all versions stored locally. All versions of a large file takes up lots of space.


I see. So the idea is not that svn's handling of large files at the repo-level is somehow better than that of a git repo per se, but that it's fine for the (possibly remote) svn repo to take the 'large file' hit, since the (presumably local) wc is disjoint from it, and thus unaffected in terms of local storage. Ok, that makes sense...

I've been using a decentralised svn workflow at work for so long, I didn't even think of this :)


I suppose it depends how much disk you have :P


The thing that always rubbed me the wrong way about git-lfs was that they cloned the git-scm.com site design. It's not part of git!

[1]https://git-lfs.github.com/

[2]https://git-scm.com/


I’m using Git+LFS because my issue tracker, CI/CD etc natively speaks it. Not because it’s in any way superior or even on par with the large file handling of Mercurial (or even SVN to be honest).


Is rewriting the history for large repos really that difficult besides coordinating with other contributors? My understanding is that it shouldn't be that much worse than "git gc --aggressive". Yes it is expensive, but it is the sort of thing you can schedule to do overnight or on a weekend.


The issue is breaking external references.

Do you include git SHAs in your bug tracking system? Or perhaps your department wiki links to a specific commit to document lessons learned? Maybe you're using Sentry and find including the git SHA of the build to be invaluable for troubleshooting?

For some organizations, rewriting history would be a non-event and for others it would be a major disruption.


Yeah, git is really not a mature or well-designed VCS. The fact that you can trivially lose the supposed permanent reference -- and that it's encouraged as part of several common workflows at that -- should be more than enough to demonstrate this. If you care about history, use a VCS like Fossil.


What is encouraged by Github is not always exactly the same as what is encouraged by the maintainers of git.


the SHA is permanent, you're responsible for backup


The problem I see is that things like commit hashes which are etched in history in bug reports, version tags etc, instantly lose meaning. Whether or not that’s a problem depends on how much of that you have.


git gc doesn't rewrite history; it packs the objects in your local repository into a pack file


Maybe I am missing the point. What is the alternative this article proposes, then?... Also, Git is not centralized, so how can you ever integrate large file support without a separate server?


Good points, but it seems optimistic to assume that git will have good, native, large file support anytime soon. I‘ve been waiting quite a while for git submodules to improve..


The main argument here seems to be that we shouldn't use LFS because Git will have large file support at some unspecified point in the future? Similarly, you could argue that we shouldn't use a Covid vaccine because we'll develop a cure in the future... why vaccinate billions of people when we can just treat the 1% of people who get ill? Clearly that argument doesn't work. People need a solution now. Ironically, we had to stop using Mercurial because it didn't have an LFS alternative, even though I prefer it. LFS is definitely not ideal, but as a solution to a real-world problem, it works. There may be issues around cloning repos and losing history in the future, but those are one-off issues where you have to accept the pain, rather than living in pain every day.



