FYI: LLVM-project repo has exceeded GitHub upload size limit (2022) (llvm.org)
147 points by hippich on Jan 30, 2023 | 78 comments



On a philosophical note: git, being a distributed source control system, has no limit on repo size, but GitHub, being a centralized "hub", suffers scaling problems. It's fascinating to watch these tradeoffs play out. Maybe IPFS will come to the rescue one day.


I don't understand where git's distributed nature comes into play here. Perforce is fully centralized, and it also has no limit on repo size: as long as you have enough disk, P4 will handle it.

In fact, Git's distributed nature makes it infamously bad at scaling with repo size, since it requires every "participant" in a repo to have a copy of the entire repo before they can start any work on it.


This is where something like VFSForGit [0] helps out. Instead of cloning the entire repo, it creates a virtual file system and fetches objects on demand. MSFT uses it internally for the Windows source tree (which now exceeds 300GB).

[0]: https://github.com/microsoft/VFSForGit


VFSForGit seems to be in maintenance mode now - do you know what it was replaced by, and if there's something that works cross-platform?


It was replaced by Scalar and is now merged into Git:

Introduced: https://devblogs.microsoft.com/devops/introducing-scalar/

Integration into Git: https://github.blog/2022-10-13-the-story-of-scalar/


Shallow cloning of Git repos is a thing. Basically, you get a fake head commit (that includes all the files) and none of the real history. Useful if you only intend to build, or make changes locally. If you want to push, you have to unshallow first.
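
For reference, a minimal sketch of both steps (using the llvm-project URL from the article purely as an example):

    # shallow clone: only the tip commit, none of the older history
    git clone --depth 1 https://github.com/llvm/llvm-project.git

    # later, fetch the rest of the history if you need it
    git fetch --unshallow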


would be cool to be able to commit without unshallowing first


Not possible, due to the way git works: it's a Merkle tree of commits, where each commit points to a file tree (content-addressed by its hash) and to the previous commit.


No, creating a git commit merely requires knowing the current state of all the files and the hash of the previous commit. You don't need the actual contents of the previous commits.

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

Pushing from a shallow clone to a remote is more complex, but is supported in modern git versions.

https://stackoverflow.com/a/6900428
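
As a rough sketch of why this works, git's plumbing builds a commit from just the index and the parent's hash; assuming you're in a shallow clone with some local changes:

    git add -A                     # stage the working tree
    tree=$(git write-tree)         # write the index as a tree object
    parent=$(git rev-parse HEAD)   # the tip commit object exists even in a shallow clone
    commit=$(echo "change made in a shallow clone" | git commit-tree "$tree" -p "$parent")
    git update-ref HEAD "$commit"  # move the current branch to the new commit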


I believe the problem here is related to LLVM using a CI model, where all changes eventually go to a single tree and tests are run on that single tree. Hosting that CI on a site other than GitHub wouldn't really change anything. Maybe the limit would be different, but all such CI hosts have a limit, for the same reason that all search engines have a size limit for the web pages they scrape.


The problem here is unbounded growth of a git repo. In this specific case a size limit was triggered; in other circumstances it would show up as transfers taking too long or storage running out.

Anyway, the problem is that git stores all changes, forever. A better approach would be to clean out old commits, or somehow merge them into snapshots over fixed timespans (say, anything older than a year gets compressed into monthly changesets).


This is not a "problem", this is why source control exists. You should never rewrite published history.


A more conservative approach would be some sort of layered storage/archiving, I guess. The older the commit, the less likely it is to be used, so it could be archived in different storage optimized for the long term. This way you keep the "hot" history small, while the full history stays available.


That's generally how git packs are used at large organizations that host their own repositories. I'm sure GitHub does something similar.


I don't think I can agree with that.

Accidentally publish secrets/credentials? Rotate them, yes, but also remove them from the published history.

Accidentally publish a binary for a build tool without the proper license? Definitely remove it (and add it to your .gitignore so it doesn't happen again!)

You discover a major flaw that bricks certain systems or causes data loss? Retroactively replace the Makefile/configure script/whatever to print out a warning instead of building the bad build.

I'm sure there are others.


AFAIK, copyright problems are fixed by just another commit, without a history rewrite. There's also no need to care about outdated credentials. Should bugs be fixed by deleting all history too? That code was buggy, bad, bad, must delete? A VCS just becomes glorified FTP this way.


I don't think that your points are actually in conflict.

If this is my source code, I want the whole history. I want that 10-year old commit that isn't used in any current branch. A build machine may not need any history: it just wants to check out a particular branch as it is right now, and that works too.

But there is an intermediate case: Let's say that I have an issue with a dependency. I might check out that code and want some history to know what has changed recently, but I don't need that huge zip file that was accidentally checked in and then removed 4 years ago. If it were a consistent problem, perhaps you'd invent some sort of 'shallow' or 'partial' clone, something like this:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
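
For what it's worth, the blobless/treeless clones from that post look roughly like this:

    # blobless clone: full commit and tree history, file contents fetched on demand
    git clone --filter=blob:none https://github.com/llvm/llvm-project.git

    # treeless clone: even less up front, at the cost of slower history operations
    git clone --filter=tree:0 https://github.com/llvm/llvm-project.git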


True, though shallow clones have performance issues: https://github.com/Homebrew/discussions/discussions/225


The value of a commit approaches zero as it gets older. After a certain threshold, no one will ever look at it. Never say never, sure, but is there any reason we should keep the deadweight around?


As long as a line of code is in use, there is value in knowing who authored it and when.

If a 10-year-old vulnerability is found in OpenSSL, it would be nice to be able to investigate whether it was an accident or an act of espionage.


Your premise is incorrect. The other day I was looking around a repository that's been through many migrations, and found a commit from 2004 that was relevant to my interests.


Git allows you to rewrite history, so you can squash old commits to reduce the size of the history as needed.


It would be great to have some kind of way to do this while still maintaining the merkle-tree.


Isn’t that what packs are for? The raw, content addressable object store has no inherent optimization for reducing repo size. Any changed file is completely copied until a higher level does something to compress that down.


Sure, but rewriting it manually is a tedious process. It should be automated on GitHub's side, to keep repo size approximately constant over time.


I think this is what `git filter-branch` is supposed to be for: https://git-scm.com/docs/git-filter-branch

I've never used it before, but from what I understand, it's very powerful but also very confusing and easy to mess up, and of course with a sort of vague ambiguous name that makes it hard to discover; in other words, it's quintessentially git.


That would be silent data loss, so absolutely should not be automated.


GitHub can't rewrite the refs on their own without breaking users' stuff. They can only repack the existing objects; the squashing needs to be done by the developers. Also, it's a non-fast-forward change, so it needs to be coordinated between the git users anyway.


Tangentially related, but after installing LLVM on Windows I notice that the clang, clang++, clang-cl, and clang-cpp binaries are all identical. That's four copies of the same 116MB file. The linkers (ld.lld, ld64.lld, lld, lld-link, wasm-ld) are likewise the same 84MB binary repeated 5 times, and similarly llvm-ar, llvm-lib and more are identical.

Overall looks like ~750MB of unnecessary file duplication for one install of LLVM.

I get that Windows doesn't have the same niceties when it comes to symbolic links as macOS and Linux, but that seems really suboptimal and wasteful.


If not symlinks, might those be three or four hard links to the same file? I’m not sure how to check on Windows.


NTFS has had symbolic links since Windows Vista. You enable them in Git with the `core.symlinks` setting. There may be other reasons to avoid them, though; for example, if some tooling like VC++ can't deal with calling symlinked executables. I don't use Windows so this is all conjecture.
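
If the repo did ship them as symlinks, enabling that on the git side is a one-liner (a sketch; it assumes Windows is set up to allow unprivileged symlink creation):

    # per clone
    git clone -c core.symlinks=true <repo-url>

    # or globally
    git config --global core.symlinks true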


You still need to enable developer mode so that applications can create symlinks without elevation.

https://blogs.windows.com/windowsdeveloper/2016/12/02/symlin...


Would git prune ( https://git-scm.com/docs/git-prune ) help here? Or is the entire repo of reachable objects more than 2GB?


Consider this from the thread, a response from GitHub to their repack request:

> As it turns out, we only run repacks on a repository network level, which means that repacks need to consider objects from all forks of a given repository.

> Repacking entire repository networks will always lead to less optimal pack sizes compared to repacking just objects from a single fork. For GitHub, disk space is not the only thing we optimize for, but also performance across forks and client performance.

Repacking the repo locally more than halves the size:

> I tried locally right now to run git repack -a -d -f --depth=250 --window=250 and the size of the .git folder went from 2417MB to 943MB…

But GitHub repacking the repo network reduces the size by only 20%. So presumably GitHub can't aggressively remove things from a repo without affecting how forks work.

This blog post is very old (2015) but there's a section that describes their use of Git alternates to facilitate forks: https://github.blog/2015-09-22-counting-objects/#your-very-o...


The latter. It's >20 years of history, and if you want a straw test as to whether the history is relevant: I personally needed to look at >20-year-old history this winter.


Why did you need to look at >20 year old history?


Not OP, but I've worked on lots of 20-50 year old CNC machines, and have done controls for press brakes and resistance welders that are more than 100 years in age (granted, the PLCs and PCs were added later to streamline the relay logic created around World War 1, but the cast iron and motion works date to that era). Two weeks ago I fixed up an RS232 DNC pipeline for a 1985 mill, the source code manuals were heavily yellowed but I eventually figured out the now-esoteric RS232 configuration. One of my coworkers worked on this equipment a bit more than a decade ago, but the newest part - a Windows 7 PC - was the first thing to break.

The adage "if it ain't broke, don't fix it" comes with the corollary that if you build it right it won't break...until it does. And then someone has to do some archaeology to get a process back online that hasn't been documented since the days of dot-matrix printers and typewriters, much less git development. I'm also building stuff and making decisions today for machines that have 10-, 20-, or 30-year expected lifetimes, and the core components should last indefinitely as long as maintenance is performed.

Many businesses aren't built for and aren't compatible with a 2- or 3-year obsolescence cycle.


Because code written >20 years ago still runs today. I wanted to understand some odd behaviour, ran git blame to find the relevant commits, then looked at the commit messages. Some of the lines involved were old.


A lot of big applications have features that are coded and go years with minimal changes.

My applications definitely have modules that only get touched when we do a major UI refresh (about once every 10 years.)


I built LLVM about a year ago, and not only is the source large, the build artifacts filled my disk to the brim.


There are some workarounds to shrink the size. I think the most noticeable ones are building only certain backends (-DLLVM_TARGETS_TO_BUILD="X86;..."), building as shared libraries (-DBUILD_SHARED_LIBS=ON), and using split (DWARF) debug info for debug builds (-DLLVM_USE_SPLIT_DWARF=ON). Plus, I usually just build the tools I need instead of firing off `ninja` or `ninja all`.
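
Put together, a sketch of a slimmer configure line using those same flags (adjust targets and projects to taste):

    cmake -S llvm -B build -G Ninja \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DLLVM_ENABLE_PROJECTS="clang" \
        -DLLVM_TARGETS_TO_BUILD="X86" \
        -DBUILD_SHARED_LIBS=ON \
        -DLLVM_USE_SPLIT_DWARF=ON

    # build only the tools you need
    ninja -C build clang llvm-ar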


Only tangentially related, but I've found that git-sizer [1] is handy for getting a sense of repository metrics.

[1]: https://github.com/github/git-sizer
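
Typical usage is just running it inside a clone:

    git-sizer --verbose    # print all metrics, not only the ones over the warning thresholds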


I cloned

    git clone https://github.com/llvm/llvm-project
When entering the directory my fancy prompt timed out

    [WARN] - (starship::utils): Executing command "git" timed out.
But it seemed OKish after `git status` (disk cache?)

    time git status
    ...
     real 0m0.287s
     user 0m0.189s
     sys 0m0.392s
The size of `.git/objects` after the clone was 2.5GB.

Output of `git-sizer`: https://pastebin.com/T5HRMfg9


It's interesting that the post doesn't deal with the basic issue that if your repo is over 2GB, the problem may be on your side.


How so? You realize there can be more than just code in a repo? And even then, what if there is a lot of code?


What large data assets does a compiler have? Is there an HD video for you to watch while you compile?


The LLVM repo is that large because a lot of people have worked on a lot of code.

It's not so difficult. A team of tens of programmers can reach that size in a couple of decades, just by writing code and textual documentation. No graphic assets required; all it takes is a generation of steady work by a mid-sized team.


The latest checkout of the main branch of the LLVM project is 1.3GB, basically entirely text. Of that, ~900MB is tests, largely consisting of reference test vectors (this apparently isn't even the entirety of LLVM's tests, just the core ones). LLVM code itself weighs about 100MB, consisting of about 2 million lines of code (not including comments). clang (also in the same repo) is another 50MB, and about 800k lines of code. The remainder is other utilities and documentation.


The test suites of a compiler typically involve compiling very large projects; SPEC cpu2017 is I believe a gigabyte or so. Of course, these tests aren't in the llvm source repo, they're in a separate llvm-test-suite repo (although SPEC itself isn't distributed there because it's a benchmark you have to buy).


LLVM is much more than a compiler. It's a set of projects under the umbrella of the LLVM project. It includes not only LLVM and clang, but a linker, a collection of runtimes to support many features (like various sanitizers), the MLIR project, a fortran front-end, c/c++ libraries, a debugger, openmp support, a polyhedral optimization framework, and all the tests that every feature in every one of those projects has.

Is that too much under one umbrella? Probably. But it's not just a compiler. It's a monorepo.


I work on video-related applications and plug-ins, and yes we do sometimes need to include HD, 4K and larger assets in our code base. That's probably not the case for LLVM, but it's not at all out of the question for other types of projects.


> Is there an HD video for you to watch while you compile?

This made me laugh


He said 'may be'.

Are you really suggesting it's impossible for bad practices to bloat a git repo?


> You realize there can be more than just code in a repo?

Yes, and for large assets there are extra solutions like Git LFS (which GitHub has support for).
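
A minimal sketch of what that looks like (the `*.mp4` pattern and filename are just examples):

    git lfs install                    # one-time setup per machine
    git lfs track "*.mp4"              # adds the pattern to .gitattributes
    git add .gitattributes intro.mp4   # the video is stored as an LFS pointer from now on
    git commit -m "Track videos via Git LFS"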


Maybe at this point it shouldn’t be hosted or even mirrored on GitHub.

CocoaPods and Homebrew both hit similar issues in the past with huge work trees by hosting all their specs on GitHub, resulting in breaking workflows for their users.

Those projects made changes to their workflow to mitigate this. Since the source here is around 6 months old, have LLVM done something similar?


Wasn't the problem with brew that they were (maybe still are) using the git repo as a content delivery network? As I recall, the problem wasn't the repo size itself; the problem was that millions of users had to do a "git clone" to update Homebrew, and it overwhelmed GitHub's infrastructure.


No they still use GitHub, and there are plans to switch code reviews to use GitHub PRs as well.


Hi, I'm curious here: would git-lfs (large file storage) work, along with rewriting the history to remove the larger files and artifacts?

    git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch path/to/file' HEAD

https://stackoverflow.com/questions/43762338/how-to-remove-f...

Or is git-lfs generally avoided for good reason? Or is it all the changes, and not necessarily large files, that cause this repo to exceed the limit?
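
For what it's worth, git-lfs also ships a `migrate` subcommand that combines the history rewrite with the LFS conversion; a sketch (the pattern is illustrative, and it is still a history rewrite with all the usual force-push caveats):

    # rewrite all refs so matching files become LFS pointers
    git lfs migrate import --include="*.zip" --everything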


Then you're rewriting history. And this should be avoided at all cost, as git hashes are part of the cryptographic history of a repo.

Who are you going to trust to do this? Someone will need to execute this and then do a force push. Are you going to compare it? Are you going through all the history of files removed to see if there is no change to the source code? And what about the files you're actually going to move to git-lfs? How can you prove that they haven't changed in the migration?

Provenance is a thing. https://slsa.dev/provenance/v0.2


I'll admit I've done this early on in the history of a pet project repo, scrubbing a credential :/ (not implementing git-lfs, though).

I knew that rewrites compromise the history. It is low stakes and I didn't want to allocate a new repo to start anew, just learned my lesson there.

I'm mostly curious about git-lfs's viability for large files, or whether it should be avoided and large build artifacts planned for hosting elsewhere.


> As it turns out, we only run repacks on a repository network level, which means that repacks need to consider objects from all forks of a given repository.

> Repacking entire repository networks will always lead to less optimal pack sizes compared to repacking just objects from a single fork. For GitHub, disk space is not the only thing we optimize for, but also performance across forks and client performance.

So the lesson here is you can DoS existing open source projects somehow by forking them and increasing the forked repo size >2GB?


I don't think that's the case.

The issue is that GH doesn't accept packs that are too big. git by default packs everything into a single pack. The maximum pack size can be specified either in config or as an argument to repack. The way I read the error message, a user can push a huge repo by making sure it's packed into a few packs under the 2GB limit.

It doesn't seem like there's an easy way to turn this into a DoS. GH would repack the fork network on its own schedule, and a user probably cannot trigger repacks. The repacks on GH's side would probably be smaller than their limit, too. Packs are probably scoped to a fork, and the server is an active participant, so it most likely wouldn't return objects from other forks. I don't think it would be easy to DoS GH just by pushing big packs (either under or over the limit).
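
The knobs being referred to are roughly these (whether splitting packs actually helps against GitHub's receive-side check is the open question):

    # split local packs at a maximum size during repack
    git repack -a -d --max-pack-size=1g

    # or set it persistently
    git config pack.packSizeLimit 1g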


How would the DoS work? As I understand it the issue here only occurs if you try to push the entire project to a _new_ repo. It doesn't affect existing repos.


I'm almost certain it affects all repos. It's just more unusual to push huge packs into existing repos. If for whatever reason you happen to have a huge branch clocking in at over 2GB, you'd get the same error.


True, for github.


Funny, I just encountered something similar last week. My org runs a GitLab instance and I created a repo with lecture slides etc. The slides are HTML files and some of them have generated videos (for various reasons I wanted to upload the slides, not the source scripts). Turns out that my org set a 150MB upload limit and one of the commits exceeded it, so my push was failing. The fix was non-trivial, but git-sizer was definitely helpful.


I ran into this at work trying to push a P4 clone that sat at 20GB in my local .git folder. Needless to say, I didn't have room on my local machine to actually check out a workspace.

A shell loop can push N commits at a time, using git's handy syntax for that:

`git push origin p4/master~${x}:master`
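
The full loop looks something like this (the commit counts are illustrative, and it assumes the remote branch can be created and then fast-forwarded at each step):

    # push progressively newer slices of history, oldest first,
    # so each individual push stays under the server's limit
    for x in 40000 30000 20000 10000 0; do
        git push origin "p4/master~${x}:refs/heads/master"
    done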


As LLVM is a monorepo of a number of different (giant) projects that kind of feel separate (and in fact used to be separate), I often wish I could have a submodule pointing directly at a subfolder so I don't have to do a checkout of all of LLVM.


I'm not sure what the current status of this issue is, based on the linked Discourse thread.

Is it still a fundamental limitation, but with a known client-side workaround?


It only happens if you push the entire repo at once. A script could split the upload into multiple smaller pushes and solve this.


Sounds like Git CLI could do this automatically, if it can detect these errors (I think it can't).


This is a GitHub-specific issue, so I don't think Git should do anything about it.


What is taking up so much space, though? Why not just use LFS? It can't possibly just be the text files.


June 2022


I think GitHub has some whitelist for large popular projects. Chromium is hundreds of GB, for example, but you can still fork it on GitHub, make any changes you like, and push them.


If I understand correctly, the limitation is on the size of a push.

> If you happened to push the entire llvm-project to another new (personal) GitHub repo recently, you might encounter the following error message before the whole process bails out:

> The crux here is that we tried to push the entire repo, which has exceeded GitHub’s upload limit (2GB) at once. This is not a problem for majority of the developers who already had a copy of llvm-project in their separated GitHub repos

If you fork the repo on the GitHub website, they manage the fork server-side, and you never have to push up the entire repo.


They seriously wrote 2gb of C++?


The total size of the files in the repo is 1.35 GiB. The total size of all the blobs is 84.9 GiB.

https://pastebin.com/T5HRMfg9



