
Git partial clone lets you fetch only the large file you need - moyer
https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/
======
beagle3
There is one more piece to the puzzle to make git perfect for every use case I
can think of: store large files as a list of blobs broken down by some rolling
hash a la rsync/borg/bup.

That would e.g. make it reasonable to check in virtual machine images or iso
images into a repository. Extra storage (and by extension, network bandwidth)
would be proportional to change size.

git has delta compression for text as an optimization, but it's not used on big
binary files and isn't even online (it only happens when making a pack). This
would provide it online for large files.

Junio posted a patch that did that ages ago, but it was pushed back until
after the sha1->sha256 extension.
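
bup already does roughly this on top of git's object store; a rough sketch of
the workflow (file names are made up):

    bup init
    bup split -n vm-image disk.img     # chunk the file with a rolling hash, store the chunks as blobs
    # ...modify a few blocks of disk.img...
    bup split -n vm-image disk.img     # only the chunks that actually changed get stored again
    bup join vm-image > restored.img   # reassemble the latest version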

~~~
pas
Do ISOs and other large blob types support only partial (block) modification?
Wouldn't all subsequent blocks change too?

~~~
beagle3
Sometimes they do - e.g. if you replace a file in the ISO that is the same
size up to block alignment, which is common when e.g. editing a text file or
recompiling an executable with a minor change. They almost always do when it's
a VM image representing a disk - only some blocks change every write.

However, with self-synchronizing hashes of the kind used by rsync, bup, and
borg, it doesn't matter - you could have a 1TB file, delete a single byte at
position 100, and you only need to store or transfer one new block (with
average size 8KB for rsync, configurable for borg) if you already have a copy
of the version before the change.

It's somewhat comparable with diff/patch but not exactly; it's worse in that
change granularity is only specified on average; it's better in that it works
well on binary files, does not require a specific reference diff (it can
reference all previous history), and efficiently supports reordering as well as
small changes - if you divide a 4000-line text file into four 1000-line sections
and reorder them 1,2,3,4 -> 3,1,4,2 you will find the diff/patch to be as long
as a new copy, whereas a self-synchronizing hash decomposition will hardly
take any space for the reordered file given the original.
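
A rough way to see this locally with rsync (it normally skips the delta
algorithm for local copies, hence --no-whole-file; sizes are arbitrary):

    dd if=/dev/urandom of=a.bin bs=1M count=1024
    { head -c 100 a.bin; tail -c +102 a.bin; } > b.bin   # b = a with one byte deleted near offset 100
    rsync --no-whole-file --stats b.bin a.bin            # "Literal data" stays tiny compared to 1GB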

~~~
pas
Oh, I've used rsync many times, but I thought it simply retransmits changed
files. (Oh, it needs the --checksum argument to do this, okay.)

So how do these self-synchronizing hashes work? Like a Merkle Tree? (Ah, okay
[https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_...](https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_of_a_file_have_changed)
)

So rsync uses an 8KB chunk size, so a 1GB file has about 125,000 chunks. (And
if every chunk needs 16 bytes of hash data to send, that's about 2MB, pretty
darn efficient, especially if it can spot reorders.) Though according to
Wikipedia it only does this if the target file has the same size, so adding
new files to ISOs might not work in rsync's case, but still, the possibility
is there for diff algos and version control systems.

~~~
beagle3
No, the target doesn't have to be the same size. As an optimization, if the
size and datetime are the same, rsync will assume no change and will not hash
at all (though you can force it to).

But it will definitely use hashes when the size differs (unless forced to copy
whole files, or when copying between local file systems).

------
derefr
Has anyone used Git submodules to isolate large binary assets into their own
repos? Seems like the obvious solution to me. You already get fine-grained
control over which submodules you initialize. And, unlike Git LFS, it might be
something you’re already using for other reasons.
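
For reference, the fine-grained part looks something like this (paths are made
up):

    git clone https://example.com/game.git && cd game
    git submodule update --init tools src/engine   # fetch only the submodules you need
    # leave e.g. assets/raw-video uninitialized until someone actually needs it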

~~~
matheusmoreira
The problem with git submodules is they can't be used like a hyperlink to
another repository. Updating the submodule requires updating the superproject
as well. The new commits are invisible to the superproject until that is done.

It'd be great if they worked like Python's editable package installations.

~~~
sjburt
Then the state of the superproject would depend on when the checkout occurred.
That would be disastrous for consistency, you’d be unable to replicate a
checkout later or elsewhere. The state of a repo after a checkout should only
depend on the commit that was checked out.

~~~
derefr
It’s interesting that we’ve never developed the equivalent for Git of what
every programming-language ecosystem has: keeping two parallel listings of
dependencies, one in terms of version constraints to satisfy, and the other in
terms of exact refs.

I could totally see a .gitmodules.reqs file specified in terms of semver specs
against tags, or just listing a branch to check out the HEAD of; resolving to
the same .gitmodules file we already have. Not even a breaking change!
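
Something like this entirely hypothetical file (none of these keys exist in
git today):

    # .gitmodules.reqs -- hypothetical; resolved into the pinned .gitmodules we already have
    [submodule "libfoo"]
        url = https://example.com/libfoo.git
        constraint = ^2.3      # match the newest semver-style tag >= 2.3.0, < 3
    [submodule "docs-theme"]
        url = https://example.com/docs-theme.git
        branch = main          # resolve to the current HEAD of main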

~~~
sjburt
It would mean attaching a semantic meaning to tags, but git doesn't do that,
ever, for any reference. You don't even have to have a master branch, much
less tags that follow semver. Linux doesn't even use semver!

~~~
damnyou
Correct, this feature should be built on top of the source control system, not
as part of it.

------
vvanders
Also known as workspace views in P4.

It's interesting to see the wheel reinvented. We used to run a 500GB art
sync/200GB code sync with a ~2TB back-end repo back when I was in gamedev. P4
also has proper locking; it really is the right tool if you've got large
assets that need to be coordinated and versioned.

Only downside of course is that it isn't free.

~~~
jasondclinton
This kind of comment isn't helpful. Of course, there have been ways to copy
large files around since there were networks. What's new in this protocol
enhancement is that this works within the context of a Merkle tree-based
technology (upon which all DVCS's are based). To use your analogy, yes this is
a wheel but it's built with rubber instead of wood and iron.

~~~
vvanders
I guess I should have expanded more.

DVCS is in direct opposition to workflows that include binary files (yes, I'm
aware that git lfs has locking; it's also centrally orchestrated) because you
can't merge most binary formats.

We were using P4 ~15 years ago for these workflows, and rather than
understanding what made them work, people are just rediscovering the same
problems that have already been solved.

My guess is that we'll next see a solution that dynamically caches most
downloaded files in a geographic friendly way, heck we may even call it
"P4Proxy".

I've seen so much FUD around how git is the "one true workflow" because other
solutions "don't scale" when they don't understand the constraints that
certain workflows impose. Git/DVCS is great for a lot of things but sometimes
you should use the right tool for the job rather than hack something together.

[Edit] These reasons are exactly why you see Unreal supporting P4/SVN out of
the box[1] and no mention of git.

[1] [https://docs.unrealengine.com/en-
US/Engine/UI/SourceControl/...](https://docs.unrealengine.com/en-
US/Engine/UI/SourceControl/index.html)

~~~
dagmx
I think this is a very P4-centric view of the world.

Locking helps with preventing collisions, but honestly the issue is still
always communication. Why are people even touching files they shouldn't be
touching?

Meanwhile perforce is a pain for code-heavy projects and requires a central
perforce server. Git works great there.

The issue is that neither is a silver bullet for the other's workflow and needs,
and they both suck horribly for mixed code and binary asset workflows.

That's not even considering cost.

Meanwhile film studios generally prefer keeping the considerations separate
and using symlinks or URIs to their data store, and that works really well. But
that doesn't work great for remote workflows.

So again, I think you're applying a very p4, game centric view to this. There
are lots of different use cases and team structures that none of these version
control systems are able to address in their entirety.

~~~
maccard
> but honestly the issue is still always communication. Why are people even
> touching files they shouldn't be touching?

Because there's 300+ people working on a project, and it's not feasible to
know what every other person is working on or planning on working on.

The file lock (code can be merged too, just like git, so this is only really
for binary assets) is a crude communication tool saying "hey I'm using this
file".

> Meanwhile perforce is a pain for code heavy projects and requiring a central
> perforce server. Git works great there.

I think you're applying a git-biased view here. I don't think perforce is
unsuitable for code-heavy projects at all (see workspace views as a prime
example), and for the majority of people, having a central server isn't an
issue. Most people treat github/gitlab as a centralised server anyway. I've
_never_ in a decade of programming heard someone suggest adding an extra
remote to git so I can share your changes; it's always been "push it as a
separate branch and I'll merge it". If you don't have an internet connection,
you're likely not able to share code _anyway_, and with p4 you can always
reconcile offline work when you're back online.

~~~
dagmx
Meanwhile film studios get by with 300+ people without everyone colliding on
the same asset files, and without needing locking.

I think locking is a fine utility to have, but I think a lot of workflows use
it to work around poor communication.

And I think you're constraining your views of git to just your workflow.

I've worked in a lot of scenarios where you need multiple remotes such as
having an internal repo and an external one.

And similarly there are lots of scenarios where having a decentralized copy of
the repo is very useful for being able to work in offline scenarios and
compare multiple branches - things like commuting on a plane or being in
low-connectivity areas.

I don't see how my view is git-centric. I'm saying each VCS has very useful
areas and equally big rough spots. The problem is that each VCS group believes
theirs is the only right system.

------
scarecrow112
This is interesting and could be a savior for Machine Learning (ML) engineering
teams. In a typical ML workflow, there are three main entities to be managed:
1. code, 2. data, 3. models. Systems like Data Version Control (DVC) [1] are
useful for versioning 2 & 3. DVC improves on usability by residing inside the
project's main git repo while maintaining versions of the data/models on a
remote. With Git partial clone, it seems like the gap between 1 and 2/3 could
be reduced further.

[1] - [https://dvc.org/](https://dvc.org/)

------
itroot
Also --reference (or --shared) is a good option to speed up cloning (for
builds, for example) if you have the repository cached in some other place. I
was using it a long time ago when I was working on a system that required
cloning 20-40 repos to build. This approach decreased clone times by an order
of magnitude.
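
For anyone who hasn't used it, it looks something like this (the cache path is
just an example):

    git clone --reference /srv/git-cache/myrepo.git https://example.com/myrepo.git
    # add --dissociate if the new clone shouldn't keep borrowing objects from the cache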

~~~
mikepurvis
Do you actually need clones in that scenario? I worked on a build system that
grabbed source from several hundred repos at the starting point, and it turned
out to be way faster to just grab it all as tarballs with aria2c.

~~~
madsbuch
Grabbing the tarball from where? To the best of my knowledge, tarball export is
not part of git, but something git hosts provide.

Git is a distributed VCS, and we should support keeping it that way.

~~~
mikepurvis
Almost any project you work on will have an authoritative copy of the repo in
some kind of web-accessible tool, most of which provide a tarball-download
function.

And GitHub's scheme is pretty much a de-facto standard at this point—GitLab's
implementation is an exact copy of it, for example:

    https://<host>/<org/project>/archive/<ref/branch/tag>.tar.gz

Edit to add: Also, git-archive --remote is actually most of the way there, but
it's not an HTTP download, of course. :(

~~~
saagarjha
GitHub doing something one way and GitLab copying it doesn't make a standard.

------
microtherion
That seems quite useful, though Git LFS mostly does the job.

One of my biggest remaining pain points is resumable clone/fetch. I find it
near impossible to clone large repos (or fetch if there were lots of new
commits) over a slow, unstable link, so almost always I end up cloning a copy
to a machine closer to the repo, and rsyncing it over to my machine.
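
In case it helps anyone in the same situation, the workaround looks roughly
like this (hostnames and URLs are placeholders):

    ssh nearby-host git clone --mirror https://example.com/huge.git huge.git
    rsync -avP nearby-host:huge.git/ huge.git/   # -P (--partial) lets retries resume
    git clone huge.git work                      # then repoint origin at the real remote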

~~~
hinkley
What’s your take on this line?

> Partial Clone is a new feature of Git that replaces Git LFS and makes
> working with very large repositories better by teaching Git how to work
> without downloading every file.

~~~
microtherion
I believe partial clone makes the situation a little better, but it's not
nearly as good as resumable cloning, because you have to partition your repo
in advance.

------
shaklee3
This is great. We use git lfs extensively, and one of the biggest complaints
we have is that users have to clone 7GB of data just to get the source files.
There's a workaround in that you can just not enter your username and password
for the lfs repo and let it time out, but that's a kludge.

~~~
elephantum
There’s an option for that: GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY
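
And when you later need some of the LFS content, you can pull just the paths
you care about, something along the lines of:

    git lfs pull --include="assets/**"   # fetch only the LFS objects under assets/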

------
danbolt
In the AAA games industry git has been a bit slower on the uptake (although
that’s changing quickly) as large warehouses of data are often required (eg:
version history of video files, 3D audio, music, etc.). It’s nice to see git
have more options for this sort of thing.

~~~
Keverw
Surprised this new idea doesn’t support object storage. Sounds like Git LFS
would still be the right way to go for repos with assets for games like
meshes, sounds, etc.

I've heard many studios use Perforce instead. Not being open source is a
downside to some, but I don't really know too much about it personally.

Then if working with a lot of non-code files, it sounds like some solutions
have locking. I guess two people couldn't edit the same Blender or PSD file at
the same time and then merge them later on.

Kinda wouldn’t surprise me if some companies actually run multiple versioning
control systems. Code on one system, game assets on another.

~~~
danbolt
I think in terms of game production, software licensing usually isn’t the
largest cost center for a project. Proprietary software isn’t a concern as
much, given that games traditionally are "shipped" and then completed. (Note
that this is changing as games become more online and service-based with live
operations, rather than having a specific release date and a "final" copy sent
for production; the internet has changed things a lot.)

You’re more right than you think about multiple versioning systems, although
keeping synchronized becomes an issue. Perforce is a bit of a boon for
management, as they get a GUI for versioning across a multidisciplinary team.

------
jniedrauer
This could actually be a really good solution to the Go module maximum size
problem. If you place a go.mod in the root of your repo, then every file
in the repo becomes part of the module. There's also a hardcoded maximum size
for a module: 500M. Problem is, I've got 1G+ of vendored assets in one of my
repos. I had to trick Go into thinking that the vendored assets were a
different Go module[0]. Go would have to add support for this, but it would be
a pretty elegant solution to the problem.

[0]:
[https://github.com/golang/go/issues/37724](https://github.com/golang/go/issues/37724)
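
(For the curious, the "trick" is just dropping a stub go.mod into the assets
directory so Go treats that subtree as its own module and leaves it out of the
parent module; the module path below is made up.)

    // assets/go.mod
    module example.com/myrepo/assets

    go 1.14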

~~~
lima
That does sound like a "you're holding it wrong" issue. As one of the Go team
members pointed out, defining a separate module is not a hack, but the
intended way of doing it.

How would a partial checkout help?

~~~
jniedrauer
Go modules are built around git, unlike many other languages' package systems.
That means you don't get to pick and choose what goes into them. Imagine if
you had to put an empty package.json in every (non-node) directory of your git
repo to exclude it from an NPM package, or an install.py in every (non-python)
directory to exclude it from a PyPI package. Multi-language repos would get
ridiculous pretty quickly.

~~~
yencabulator
> Go modules are built around git

Not really. Modules are specced based on zip files and metadata in text files.
There's just support for extracting that data from git repos transparently.

Here's a slightly out of date write-up: [https://research.swtch.com/vgo-
module](https://research.swtch.com/vgo-module)

------
krupan
I started a project recently and for the first time ever I've wanted to keep
large files in my repo. I looked into git LFS and was disappointed to learn
that it requires either third party hosting or setting up a git LFS server
myself. I looked into git annex and it seems decent. This, once it is ready
for prime time, will hopefully be even better.

------
nikivi
Is it possible given a git repo (hosted on say GitHub) to only 'clone'
(download) certain files from it? Without `.git`

~~~
fizixer
I believe you're looking for the 'working tree' only. You could do the
following:

git archive --remote=<your-URL> HEAD <path> | tar -x

source:
[https://stackoverflow.com/questions/3946538](https://stackoverflow.com/questions/3946538)
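
Note that GitHub doesn't support git archive --remote as far as I know, so for
a GitHub-hosted repo the tarball endpoint mentioned elsewhere in this thread is
the pragmatic route (GNU tar shown, org/repo/ref/paths are placeholders):

    curl -L https://github.com/<org>/<repo>/archive/<ref>.tar.gz \
      | tar -xz --wildcards '*/docs/*'   # extract only the paths you want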

------
vicosity
I'm still unconvinced. Will this provide a user-friendly approach to managing
design assets?

~~~
madsbuch
My impression is that it will give you the normal git experience for managing
design assets, i.e. with this there should be no need for additional tooling.
If it works, that would be so great!

------
piliberto
> One reason projects with large binary files don't use Git is because, when a
> Git repository is cloned, Git will download every version of every file in
> the repository.

Wrong? There's a --depth option for the git fetch command which allows the
user to specify how many commits they want to fetch from the repository.
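
For reference, a shallow clone versus the blob filter the article is about:

    git clone --depth 1 https://example.com/big-repo.git           # shallow: truncate history
    git clone --filter=blob:none https://example.com/big-repo.git  # partial clone: fetch blobs on demand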

~~~
sewer_bird
Yes, but 95% of devs, even fairly talented ones, don't really know how to use
Git.

~~~
colonwqbang
Author seems to be a manager, not necessarily a dev.

------
smitty1e
In AWS, it's worth considering putting those large files in an S3 bucket.

