
GitHub’s Large File Storage is no panacea for Open Source - narner
https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91#.310ve89hz
======
jedbrown
In the interest of not propagating this common misconception:

"The main problem with Git is that binary files are stored “as is” in the
history of the project, so that every single revision of a new binary file
(even if just a single byte has changed) is stored in full. [...] On the other
hand, source files being mostly text, they are more intelligently handled and
typically only differences between revisions are stored in the commits."

This is false. Git stores the full version of each file in "loose" format and
uses compressed incremental diffs (originally based on xdiff) in packfiles
(after "git gc") without distinguishing text vs binary in either case. The
issue is that binary files are often compressed themselves (so a one-byte
semantic change has nonlocal effect) or have positional references (like jump
targets in an executable, causing small changes to cascade).
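
Both behaviors are easy to see in a scratch repository (a sketch; file and directory names are illustrative):

```shell
# Sketch: every revision is stored as a full, zlib-compressed "loose"
# object, whether the file is text or binary. Scratch repo for illustration.
git init -q demo && cd demo
git config user.email "you@example.com"
git config user.name "You"

echo "first version" > notes.txt
git add notes.txt && git commit -qm "v1"
echo "second version" > notes.txt
git add notes.txt && git commit -qm "v2"

# One loose object per blob/tree/commit; each blob holds the whole file:
find .git/objects -type f
```

Before any `git gc`, the two revisions of `notes.txt` exist as two independent full blobs; only packing introduces deltas.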

These factors explain the inefficient handling of binary files, but improving
efficiency requires changing the semantics. LFS follows in the path of a few
other tools (based on smudge/clean filters) that try to hide the semantic
difference from the casual user, though that difference seems to bite people
more frequently than we'd like.

~~~
paulddraper
This. Unlike many other systems, compression in git has nothing to do with
commit order or file types or really anything VCS-related. The way delta
chains work in git is ingenious and transparent.

The problem is that "binaries" are large amounts of data with high entropy.

------
pbiggar
How uncharitable can a single blog post be! The entire post is discredited by
the author repeatedly projecting his unfounded opinions onto GitHub, such as

"My guess is that some high-level greedy marketing dickwad, completely unaware
of the asinine implications of his brilliant idea, signed off on this dumb-as-
a-bag-of-rocks pricing model."

"All the marketing material pimping GitHub’s LFS support [...]. I do not
believe this is unintentional."

"This is completely batshit. The side effect of this pernicious, greedy
pricing model is to [...]"

"I honestly couldn’t believe that GitHub would be willing to do something that
shortsighted, visibly motivated by greed from the cash they thought they could
extract from some of their users".

Charitable explanation for forks not working: they haven't yet written the
code to make this work with forks, and it's better to ship something working
early than to make it work in all cases.

Charitable explanation for charging for bandwidth: bandwidth costs money. (I
believe this is a real problem for Dropbox, which doesn't charge for bandwidth
but must still pay for it.) Also, all CDNs, and AWS too, charge for bandwidth.

Overall, while GitHub may be able to support its OSS folks better by changing
the pricing on some parts of its product, this post is incredibly
uncharitable. I hope the OP will consider removing the unfounded narrative
that he's projecting onto GitHub (esp. the "marketing dickwad" thing - wtf) and
focus on the facts.

[Disclaimer: my company partners with GitHub on lots of stuff]

------
paulddraper
> Case in point: if a very popular Github repository (such as the one for the
> Linux kernel) decided to start using LFS for some of their files, they would
> instantly alienate all of their users. They would no longer be able to
> properly fork the project, or even clone it to get its binary files stored
> via LFS. Nobody would be able to send a pull request to Linus as a result
> without considerable effort.

Odd example. Linux doesn't use GitHub pull requests.

~~~
Jasper_
They still use pull requests, though, just in the form of email-based ones.

[https://git-scm.com/docs/git-request-pull](https://git-scm.com/docs/git-
request-pull)

~~~
jumpwah
Except that's not a "pull request".

~~~
Vendan
Except it is. Just because github does pull requests differently doesn't mean
you can't do pull requests in pure git. Remember, git came before github.

~~~
jumpwah
> Remember, git came before github.

That's exactly my point. Using a "request-pull" instead of a "pull request"
probably means you can use this git-lfs thing with it. My understanding was
that it does not work with "forked" repositories, which you need in order to
make a "pull request". To make a "request-pull" (or to just send a patch
file), you don't need to "fork". ;)

------
dantiberian
There are a lot of assumptions here about GitHub being greedy. I've got no idea
how much money it costs GitHub to support Open Source projects, but it must
easily be in the millions. I think that by this point GitHub deserves the
benefit of the doubt before launching into vicious accusations.

~~~
SwellJoe
While I don't think github is deserving of "vicious accusations", I do believe
it is foolish to assume that the github we know today will be the github of
tomorrow.

SourceForge.net was once an excellent and trustworthy steward of Open Source
software projects. It was predicted by some folks in the free software
community that it would not always be the case, and alternatives like Savannah
were maintained in order to act as a hedge against that concern. I believe it
is more than reasonable to assume that github will change, and it would be
downright dangerous to assume that we can rely on a profit-motivated
corporation (even one as cool as github currently is) to remain a trustworthy
repository forever.

So, sure, say nice things about github; I also think github is a good product,
and I appreciate their free hosting for OSS projects. And, sure, you should
use github if it provides value for you and you're willing to accept the
price. But, don't ask me to trust they'll never change, because history
indicates they will. It's probably also unfair to suggest that someone
criticizing some valid concerns about github's current behavior, based on
their own experience with Open Source projects hosted at github, is making
"vicious accusations".

~~~
pbiggar
Really? These aren't vicious?

"My guess is that some high-level greedy marketing dickwad, completely unaware
of the asinine implications of his brilliant idea, signed off on this dumb-as-
a-bag-of-rocks pricing model."

"All the marketing material pimping GitHub’s LFS support [...]. I do not
believe this is unintentional."

"This is completely batshit. The side effect of this pernicious, greedy
pricing model is to [...]"

"I honestly couldn’t believe that GitHub would be willing to do something that
shortsighted, visibly motivated by greed from the cash they thought they could
extract from some of their users"

~~~
SwellJoe
OK, those are maybe a little vicious, and probably not entirely fair. Still,
the implication of not being able to fork a whole project from github if you
use LFS is a pretty big deal.

~~~
pbiggar
GitHub has added features and made their product better for their customers,
but has not yet made it usable everywhere on their platform.

Perhaps that is a big deal (I don't use LFS and until recently neither did
anyone else), but it's a far cry from what you said in your original comment,
such as "I do believe it is foolish to assume that the github we know today
will be the github of tomorrow."

They implement a feature in a restricted manner and all of a sudden they're
evil?

~~~
SwellJoe
No, of course not. I have nowhere said github is evil. I've said it would be
foolish to believe they never will be, because we have seen on a number of
occasions that good companies turn bad (with varying definitions of "good" and
"bad") when given sufficient monetary motivation to do so.

SourceForge is the best past analog for github, and I think it's worth
learning from history. SF.net didn't start out evil and untrustworthy; they
started out good. Who's to say github won't do the same?

~~~
pbiggar
It's not that I believe they'll be good forever, but rather that I think it
(it being your first comment) was a weird way to take the conversation. Like,
why did that even occur to you in this context?

~~~
SwellJoe
I left some of the early parts of my thought process out of the conversation.

My thought process went something like this: "This is a feature that is
currently, probably accidentally, causing vendor lock-in for github users, as
there is no easy way to take a project in its entirety back out of github, if
it has enabled this feature. That kind of lock-in has been used in the past,
by vendors across a wide spectrum, for evil purposes. Github, were it ever to
become evil, would find this the kind of thing that would screw users and
produce profit."

I don't believe github _today_ has evil intent (though the author of the
article seemingly does believe that), but I reserve the right to be skeptical
of what the company's intentions will be in the future. Just as I should have
been more skeptical of the future intentions of SourceForge in the past. I
think my position on this is entirely fair to github (which is a company and
product I like), but I'm also trying not to be a total sucker and make the
same mistakes over and over again.

~~~
pbiggar
I see. Yes, lock in has been used for evil in the past.

I thought the OP noted that MS also offers git+LFS (and for free)!

------
alkonaut
The most interesting takeaway for me was that Microsoft seems to provide the
only(?) free git hosting that includes LFS?

Does anyone know if their repos support forking in combination with LFS too?

------
nkurz
This seems like an odd problem, but I'm not as familiar with Git as I should
be. Is there not a reasonable way to download only the most recent version of
these large binary files on the initial request, and then download the
historical versions only in the (likely very rare) case that the user actually
wants to use them? This would seem more useful in this case than hoping that
binary diffs keep the repository small enough.

~~~
icebraining
If you're talking about binary files merged into Git itself (not Git LFS,
which is a separate mechanism), you can use "git clone --depth <n>" to get
only the latest <n> revisions of the tree, and then use "git pull --unshallow"
if you need to fetch the rest of the history.
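
A sketch of that workflow (the URL is an example placeholder):

```shell
# Fetch only the most recent revision of everything, including binaries:
git clone --depth 1 https://example.com/project.git
cd project

# Only the latest commit is present locally:
git log --oneline

# Later, if the historical revisions are actually needed:
git pull --unshallow
```

The shallow clone skips all historical blobs, so old versions of large binaries are never downloaded unless you unshallow.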

~~~
alkonaut
Can I do that automatically so that only binaries are fetched shallow, and
text is fetched deep? Otherwise it's not very useful.

~~~
icebraining
My best guess would be to keep the binaries in a submodule, then after fully
cloning the main repo, you would fetch the binaries with "git submodule update
--init --depth 1".
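
That might look like this (a sketch; "assets" is a hypothetical submodule holding the binaries, and the URL is a placeholder):

```shell
# Full history for the code, shallow history for the binaries, which
# live in a hypothetical "assets" submodule:
git clone https://example.com/project.git
cd project
git submodule update --init --depth 1 assets
```

Only the latest revision of the submodule is fetched; the main repository keeps its complete history.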

------
lemevi
Edit: I was wrong, however I learned from the conversation so I am leaving it
here! Thanks to those who corrected me.

> On the other hand, source files being mostly text, they are more
> intelligently handled and typically only differences between revisions are
> stored in the commits.

This is completely incorrect: git stores whole blobs from one commit to the
next.

svn stored patches, but git does not. Every version of a file is stored in its
entirety in your git tree since the beginning of the repository's existence.
This is one of the reasons why git is so fast. You can go through your objects
in your .git directory and verify this for yourself[0]:

    $ find .git/objects -type f
    .git/objects/ff/a5d733354ae6f8bdc67764d58d87c9a3161f66
    .git/objects/ff/deb08f4856bd6eb5b31d7f800b3e480ae3e2e0
    $ git cat-file -p ffa5d733354ae6f8bdc67764d58d87c9a3161f66
    ...file contents appear...

[0] [https://git-scm.com/book/en/v2/Git-Internals-Git-Objects](https://git-
scm.com/book/en/v2/Git-Internals-Git-Objects)

~~~
saurik
This is only true for recent commits: as you accumulate commits, garbage
collection is performed on the loose objects, and the survivors are stored in
a pack file, carefully ordered by similarity and stored using delta encoding.
For more information, this chapter from one of the popular online books about
git might suffice.

[https://git-scm.com/book/en/v2/Git-Internals-Packfiles](https://git-
scm.com/book/en/v2/Git-Internals-Packfiles)

(edit: After I started responding to your comment, you edited your comment to
link to the same book! I recommend you continue reading the later chapters:
"you'll never believe how it works" ;P.)
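
The packing behavior can be seen in a scratch repository (a sketch; names are illustrative):

```shell
# Sketch: "git gc" packs loose objects and delta-encodes similar ones,
# with no text-vs-binary distinction. Scratch repo for illustration.
git init -q demo && cd demo
git config user.email "you@example.com"
git config user.name "You"

# Two revisions of a 1 MiB binary that differ by one appended byte:
head -c 1048576 /dev/urandom > blob.bin
git add blob.bin && git commit -qm "v1"
printf 'x' >> blob.bin
git add blob.bin && git commit -qm "v2"

git gc --quiet

# Each line shows: sha type size size-in-pack offset [depth base-sha].
# One blob is stored whole; the other becomes a small delta against it.
git verify-pack -v .git/objects/pack/pack-*.idx
```

The pack ends up barely larger than a single copy of the file, rather than the two full copies held as loose objects before the gc.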

------
bhuga
I suspect setting up the free LFS reference/test server[1] that GitHub
provides would have taken less time than writing this post complaining that
GitHub isn't free enough.

1: [https://github.com/github/lfs-test-server](https://github.com/github/lfs-
test-server)

~~~
nmc
This being a test server implementation probably indicates that it is not
meant to be run in a production environment.

------
sytse
At GitLab we're working to support LFS. Initial support might or might not
work with forks. As with our current Git Annex support, storage will be free
with a soft limit of 10GB of disk space per project (this includes Git, Git
Annex and Git LFS data), and there is no bandwidth limit. It will work with
public and private projects (both are free).

------
eshamow
I'm not sure I understand why artifacts can't be stored in a different service
- even an S3 bucket, if not a real repository service - and fetched
dynamically via a build process.

Is there a reason why binary blobs need to be stored directly next to code in
order to be versioned?

~~~
megastep
The way I understand it, it should be possible, though it's more complicated
since they seem to infer the LFS URL from the repo URL by default. So if you
wanted to, say, keep your repo on GitHub and store your LFS files on S3, you'd
need to explicitly tell git where to write the files. There are configuration
values for that.

Also you'd need the necessary LFS server piece on Amazon's side.
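
For reference, the override he describes can live in a committed `.lfsconfig` file; the endpoint below is a hypothetical example, and it would need to implement the LFS server API (a plain S3 bucket alone doesn't speak it):

```ini
[lfs]
	url = https://lfs.example.com/my-org/my-repo
```

Clients then resolve LFS objects against that URL instead of deriving one from the repository's remote.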

~~~
eshamow
I'm thinking that rather than using git for versioning the binary artifacts as
well, you tag and version your git repo, then tag and name/label your
artifacts in another storage service. You then allow a build tool to assemble
from both locations.

------
forrestthewoods
I wonder if Perforce Cloud will be able to fill this role at all. Probably
not. Open Source isn't their target audience. But it could be a consideration.

Has anyone tried the new Perforce/Git stuff? Is it any good? We're still on an
older pre-Helix version.

------
bsimpson
The post seems hyperbolic. I'd love to hear GitHub's rebuttal.

~~~
megastep
I am the author and yes this was very much unapologetically hyperbolic. At
least it got the conversation started.

~~~
ansiton
I don't think you needed paragraphs like "My guess is that some high-level
greedy marketing dickwad, completely unaware of the asinine implications of
his brilliant idea, signed off on this dumb-as-a-bag-of-rocks pricing model.
He then directed the grunts to somehow implement his grand vision on GitHub’s
servers. That’s when shit started to hit the fan." ... to get the conversation
started. Your other points were sensible and lucid. This was a distraction and
had the paradoxical effect of making me more sympathetic to github. The same
looks to be true of other posters in this thread.

If you were consciously choosing to take a hyperbolic tone, can I ask if you
might reconsider that decision in future posts? Or at least concretely test
your idea that calling people "dickwads" and "grunts" gets you more traction.

I appreciated you raising the bandwidth question, and comparing it with other
services. You made a good argument. Thank you!

~~~
nacs
It's his personal blog not some corporate blog or newspaper.

I'm not sure it's within your rights to ask him to change the way he writes
within his own bubble because you don't like his word choice.

~~~
bsimpson
It's just as much within his rights to call someone out for it as it is for
someone to write as he likes.

Clearly the post was written for an audience. Being needlessly inflammatory
could certainly turn the audience off, and/or undercut the author's
credibility. Ansiton's advice was both helpful and valid.

------
cwyers
> I honestly couldn’t believe that GitHub would be willing to do something
> that shortsighted, visibly motivated by greed from the cash they thought
> they could extract from some of their users.

They're a business. C'mon here.

~~~
megastep
My argument was that this was actually bad for their business. This is not a
customer-friendly move.

~~~
i386
People who make products make these sorts of tradeoffs all the time, and the
tradeoff is rarely something that's permanent. For the vast majority of people
who need this kind of functionality (LFS), breaking forking is an inconvenience
compared to the value being added.

~~~
megastep
I would completely agree with you if that was the way GitHub presented that
feature, being open about the consequences of adopting it. They haven't done
that, and as a result their customers are not properly informed on the trade-
offs they are making.

------
i386
Shock and horror: commercial company has a paid value add.

GitHub is not a charity.

~~~
facetube
I wouldn't necessarily describe a paid feature that breaks all forking for an
entire repository as a "value add".

~~~
megastep
Yes, my criticism was not that they're trying to make money from this, but
rather that they are crippling their core product in an effort to monetize it
some more.

~~~
i386
No, they are trying to solve a problem for their customers - storing large
files. They charge money for this feature. It's disingenuous to think that
they are breaking their product on purpose to extract money from you.

~~~
megastep
It is not disingenuous if it is exactly what they are doing. The entire cause
of this problem, and the reason it breaks so many things, is that they are
trying to monetize bandwidth usage. This model is unsustainable because free
users only get a paltry 1GB/month, and their attempt to enforce that is what
breaks forks.

If they would just stop trying to do that, then we would have nothing to talk
about.

It is not at all uncalled for to criticize the way they are trying to do
business, especially when it affects you as an existing customer.

