FYI: LLVM-project repo has exceeded GitHub upload size limit (2022) (llvm.org)
147 points by hippich on Jan 30, 2023 | 78 comments



On a philosophical note: git, being a distributed source control system, has no limit on repo size, but GitHub, being a centralized "hub", suffers scaling problems. It's fascinating to watch these tradeoffs play out. Maybe IPFS will come to the rescue one day.


I don't understand where git's distributed nature comes into play here. Perforce is fully centralized, and it also has no limit on repo size: as long as you have enough disk, P4 will handle it.

In fact, Git's distributed nature makes it infamously bad at scaling with repo size, since it requires every "participant" in a repo to have a copy of the entire repo before they can start any work on it.


This is where something like VFSForGit [0] helps out. Instead of cloning the entire repo, it creates a virtual file system and fetches objects on demand. MSFT uses it internally for the Windows source tree (which now exceeds 300GB).

[0]: https://github.com/microsoft/VFSForGit


VFSForGit seems to be in maintenance mode now - do you know what it was replaced by, and if there's something that works cross-platform?


It was replaced by Scalar and is now merged into Git:

Introduced: https://devblogs.microsoft.com/devops/introducing-scalar/

Integration into Git: https://github.blog/2022-10-13-the-story-of-scalar/


Shallow cloning of Git repos is a thing. Basically, you get a fake head commit (that includes all the files) and none of the real history. Useful if you only intend to build, or make changes locally. If you want to push, you have to unshallow first.
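
For reference, a minimal sketch of both steps (using the llvm-project URL from the article purely as an example):

    # shallow clone: only the tip commit, none of the older history
    git clone --depth 1 https://github.com/llvm/llvm-project.git

    # later, fetch the rest of the history if you need it
    git fetch --unshallow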


would be cool to be able to commit without unshallowing first


Not possible, due to the way git works: it's a Merkle tree of commits, where each commit points to a file tree (content-addressed by its hash) and to the previous commit.


No, creating a git commit merely requires knowing the current state of all the files and the hash of the previous commit. You don't need the actual contents of the previous commits.

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

Pushing from a shallow clone to a remote is more complex, but is supported in modern git versions.

https://stackoverflow.com/a/6900428
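
As a rough sketch of why this works, git's plumbing builds a commit from just the index and the parent's hash; assuming you're in a shallow clone with some local changes:

    git add -A                     # stage the working tree
    tree=$(git write-tree)         # write the index as a tree object
    parent=$(git rev-parse HEAD)   # the tip commit object exists even in a shallow clone
    commit=$(echo "change made in a shallow clone" | git commit-tree "$tree" -p "$parent")
    git update-ref HEAD "$commit"  # move the current branch to the new commit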


I believe the problem here is related to LLVM using a CI model, where all changes eventually go to a single tree and tests are run on that single tree. Hosting that CI on a site other than GitHub wouldn't really change anything. Maybe the limit would be different, but all such CI hosts have a limit, for the same reason that all search engines have a size limit for the web pages they scrape.


The problem here is unbounded growth of a git repo. In this specific case a size limit was triggered; in other circumstances it would show up as transfers taking too long or storage running out.

Anyway, the problem is that git stores all changes, forever. A better approach would be to clean out old commits, or somehow merge them into snapshots over fixed timespans (say, anything older than a year gets compressed into monthly changesets).


This is not a "problem", this is why source control exists. You should never rewrite published history.


A more conservative approach would be some sort of layered storage/archiving, I guess. The older the commit, the less likely it is to be used, so it could be archived in different storage optimized for the long term. This way you keep the "hot" history small, while the full history stays available.


That's generally how git packs are used at large organizations that host their own repositories. I'm sure GitHub does something similar.


I don't think I can agree with that.

Accidentally publish secrets/credentials? Rotate them, yes, but also remove them from the published history.

Accidentally publish a binary for a build tool without the proper license? Definitely remove it (and add it to your .gitignore so it doesn't happen again!)

You discover a major flaw that bricks certain systems or causes data loss? Retroactively replace the Makefile/configure script/whatever to print out a warning instead of building the bad build.

I'm sure there are others.


AFAIK, copyright problems are fixed by just another commit, without a history rewrite. There's also no need to care about outdated credentials. Should bugs be fixed by deleting all history too? That code was buggy, bad, bad, must delete? A VCS just becomes glorified FTP this way.


I don't think that your points are actually in conflict.

If this is my source code, I want the whole history. I want that 10-year old commit that isn't used in any current branch. A build machine may not need any history: it just wants to check out a particular branch as it is right now, and that works too.

But there is an intermediate case: Let's say that I have an issue with a dependency. I might check out that code and want some history to know what has changed recently, but I don't need that huge zip file that was accidentally checked in and then removed 4 years ago. If it were a consistent problem, perhaps you'd invent some sort of 'shallow' or 'partial' clone, something like this:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
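
For what it's worth, the blobless/treeless clones from that post look roughly like this:

    # blobless clone: full commit and tree history, file contents fetched on demand
    git clone --filter=blob:none https://github.com/llvm/llvm-project.git

    # treeless clone: even less up front, at the cost of slower history operations
    git clone --filter=tree:0 https://github.com/llvm/llvm-project.git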


True, though shallow clones have performance issues: https://github.com/Homebrew/discussions/discussions/225


The value of a commit approaches zero as it gets older. After a certain threshold, no one will ever look at it. Never say never, sure, but is there any reason we should keep the deadweight around?


As long as a line of code is in use, there is value in knowing who authored it and when.

If a 10-year-old vulnerability is found in OpenSSL, it would be nice to be able to investigate whether it was an accident or an act of espionage.


Your premise is incorrect. The other day I was looking around a repository that's been through many migrations, and found a commit from 2004 that was relevant to my interests.


Git allows you to rewrite history, so you can squash old commits to reduce the size of the history as needed.


It would be great to have some kind of way to do this while still maintaining the merkle-tree.


Isn’t that what packs are for? The raw, content addressable object store has no inherent optimization for reducing repo size. Any changed file is completely copied until a higher level does something to compress that down.


Sure, but rewriting it manually is a tedious process. It should be automated on GitHub's side, to keep repo size approximately constant over time.


I think this is what `git filter-branch` is supposed to be for: https://git-scm.com/docs/git-filter-branch

I've never used it before, but from what I understand, it's very powerful but also very confusing and easy to mess up, and of course with a sort of vague ambiguous name that makes it hard to discover; in other words, it's quintessentially git.


That would be silent data loss, so absolutely should not be automated.


GitHub can't rewrite the refs on their own without breaking users' stuff. They can only repack the existing objects; the squashing needs to be done by the developers. Also, it's a non-fast-forward change, so it needs to be coordinated between the git users anyway.


Tangentially related, but after installing LLVM on Windows I notice that the clang, clang++, clang-cl, and clang-cpp binaries are all identical. That's four copies of the same 116MB file. The linkers (ld.lld, ld64.lld, lld, lld-link, wasm-ld) are likewise the same 84MB binary repeated 5 times, and similarly llvm-ar, llvm-lib and more are identical.

Overall looks like ~750MB of unnecessary file duplication for one install of LLVM.

I get that Windows doesn't have the same niceties when it comes to symbolic links as macOS and Linux, but that seems really suboptimal and wasteful.


If not symlinks, might those be three or four hard links to the same file? I’m not sure how to check on Windows.


NTFS has had symbolic links since Windows Vista. You enable them in Git with the `core.symlinks` setting. There may be other reasons to avoid them, though; for example, if some tooling like VC++ can't deal with calling symlinked executables. I don't use Windows so this is all conjecture.
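
If the repo did ship them as symlinks, enabling that on the git side is a one-liner (a sketch; it assumes Windows is set up to allow unprivileged symlink creation):

    # per clone
    git clone -c core.symlinks=true <repo-url>

    # or globally
    git config --global core.symlinks true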


You still need to enable developer mode so that applications can create symlinks without elevation.

https://blogs.windows.com/windowsdeveloper/2016/12/02/symlin...


Would git prune ( https://git-scm.com/docs/git-prune ) help here? Or is the entire repo of reachable objects more than 2GB?


Consider this from the thread, a response from GitHub to their repack request:

> As it turns out, we only run repacks on a repository network level, which means that repacks need to consider objects from all forks of a given repository.

> Repacking entire repository networks will always lead to less optimal pack sizes compared to repacking just objects from a single fork. For GitHub, disk space is not the only thing we optimize for, but also performance across forks and client performance.

Repacking the repo locally more than halves the size:

> I tried locally right now to run git repack -a -d -f --depth=250 --window=250 and the size of the .git folder went from 2417MB to 943MB…

But GitHub repacking the repo network reduces the size by only 20%. So presumably GitHub can't aggressively remove things from a repo without affecting how forks work.

This blog post is very old (2015) but there's a section that describes their use of Git alternates to facilitate forks: https://github.blog/2015-09-22-counting-objects/#your-very-o...


The latter. It's >20 years of history, and if you want a straw test as to whether the history is relevant: I personally needed to look at >20-year-old history this winter.


Why did you need to look at >20 year old history?


Not OP, but I've worked on lots of 20-50 year old CNC machines, and have done controls for press brakes and resistance welders that are more than 100 years in age (granted, the PLCs and PCs were added later to streamline the relay logic created around World War 1, but the cast iron and motion works date to that era). Two weeks ago I fixed up an RS232 DNC pipeline for a 1985 mill, the source code manuals were heavily yellowed but I eventually figured out the now-esoteric RS232 configuration. One of my coworkers worked on this equipment a bit more than a decade ago, but the newest part - a Windows 7 PC - was the first thing to break.

The adage "if it ain't broke, don't fix it" comes with the corollary that if you build it right it won't break...until it does. And then someone has to do some archaeology to get a process back online that hasn't been documented since the days of dot-matrix printers and typewriters, much less git development. I'm also building stuff and making decisions today for machines that have 10-, 20-, or 30-year expected lifetimes, and the core components should last indefinitely as long as maintenance is performed.

Many businesses aren't built for and aren't compatible with a 2- or 3-year obsolescence cycle.


Because code written >20 years ago still runs today. I wanted to understand some odd behaviour, ran git blame to find the relevant commits, then looked at the commit messages. Some of the lines involved were old.


A lot of big applications have features that are coded and go years with minimal changes.

My applications definitely have modules that only get touched when we do a major UI refresh (about once every 10 years.)


I built LLVM about a year ago, and not only is the source large, the build artifacts filled my disk to the brim.


There are some workarounds to shrink the size. I think the most noticeable ones are building only certain backends (-DLLVM_TARGETS_TO_BUILD="X86;..."), building as shared libraries (-DBUILD_SHARED_LIBS=ON), and using split (DWARF) debug info for debug builds (-DLLVM_USE_SPLIT_DWARF=ON). Plus, I usually just build the tools I need instead of firing off `ninja` or `ninja all`.
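
Put together, a sketch of a slimmer configure line using those same flags (adjust targets and projects to taste):

    cmake -S llvm -B build -G Ninja \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DLLVM_ENABLE_PROJECTS="clang" \
        -DLLVM_TARGETS_TO_BUILD="X86" \
        -DBUILD_SHARED_LIBS=ON \
        -DLLVM_USE_SPLIT_DWARF=ON

    # build only the tools you need
    ninja -C build clang llvm-ar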


Only tangentially related, but I've found that git-sizer [1] is handy for getting a sense of repository metrics.

[1]: https://github.com/github/git-sizer
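
Typical usage is just running it inside a clone:

    git-sizer --verbose    # print all metrics, not only the ones over the warning thresholds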


I cloned

    git clone https://github.com/llvm/llvm-project
When entering the directory my fancy prompt timed out

    [WARN] - (starship::utils): Executing command "git" timed out.
But it seemed OKish after `git status` (disk cache?)

    time git status
    ...
     real 0m0.287s
     user 0m0.189s
     sys 0m0.392s
The size of `.git/objects` after the clone was 2.5GB.

Output of `git-sizer`: https://pastebin.com/T5HRMfg9


It's interesting that the post doesn't deal with the basic issue that if your repo is over 2GB, the problem may be on your side.


How so? You realize there can be more than just code in a repo? And even then, what if there is a lot of code?


What large data assets does a compiler have? Is there an HD video for you to watch while you compile?


The LLVM repo is that large because a lot of people have worked on a lot of code.

It's not so difficult. A team of tens of programmers can reach that size in a couple of decades, just by writing code and textual documentation. No graphic assets required; all it takes is a generation of steady work by a mid-sized team.


The latest checkout of the main branch of the LLVM project is 1.3GB, basically entirely text. Of that, ~900MB is tests, largely consisting of reference test vectors (this apparently isn't even the entirety of LLVM's tests, just the core ones). LLVM code itself weighs about 100MB, consisting of about 2 million lines of code (not including comments). clang (also in the same repo) is another 50MB, and about 800k lines of code. The remainder is other utilities and documentation.


The test suites of a compiler typically involve compiling very large projects; SPEC cpu2017 is I believe a gigabyte or so. Of course, these tests aren't in the llvm source repo, they're in a separate llvm-test-suite repo (although SPEC itself isn't distributed there because it's a benchmark you have to buy).


LLVM is much more than a compiler. It's a set of projects under the umbrella of the LLVM project. It includes not only LLVM and clang, but a linker, a collection of runtimes to support many features (like various sanitizers), the MLIR project, a fortran front-end, c/c++ libraries, a debugger, openmp support, a polyhedral optimization framework, and all the tests that every feature in every one of those projects has.

Is that too much under one umbrella? Probably. But it's not just a compiler. It's a monorepo.


I work on video-related applications and plug-ins, and yes we do sometimes need to include HD, 4K and larger assets in our code base. That's probably not the case for LLVM, but it's not at all out of the question for other types of projects.


> Is there an HD video for you to watch while you compile?

This made me laugh


He said 'may be'.

Are you really suggesting it's impossible for bad practices to bloat a git repo?


> You realize there can be more than just code in a repo?

Yes, and for large assets there are extra solutions like Git LFS (which GitHub has support for).
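
A minimal sketch of what that looks like (the `*.mp4` pattern and filename are just examples):

    git lfs install                    # one-time setup per machine
    git lfs track "*.mp4"              # adds the pattern to .gitattributes
    git add .gitattributes intro.mp4   # the video is stored as an LFS pointer from now on
    git commit -m "Track videos via Git LFS"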


Maybe at this point it shouldn’t be hosted or even mirrored on GitHub.

CocoaPods and Homebrew both hit similar issues in the past with huge work trees by hosting all their specs on GitHub, resulting in breaking workflows for their users.

Those projects made changes to their workflow to mitigate this. Since the source here is around 6 months old, have LLVM done something similar?


Wasn't the problem with brew that they were (maybe still are) using the git repo as a content delivery network? As I recall, the problem wasn't the repo size itself; the problem was that millions of users had to do a "git clone" to update Homebrew, and it overwhelmed GitHub's infrastructure.


No they still use GitHub, and there are plans to switch code reviews to use GitHub PRs as well.


Hi, I'm curious here: would git-lfs (large file storage) work, along with rewriting the history to remove the larger files and artifacts?

    git filter-branch --index-filter 'git rm -rf --cached --ignore-unmatch path/to/file' HEAD

https://stackoverflow.com/questions/43762338/how-to-remove-f...

Or is git-lfs generally avoided for good reason? Or is it all the changes, and not necessarily large files, that cause this repo to exceed the limit?
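
For what it's worth, git-lfs also ships a `migrate` subcommand that combines the history rewrite with the LFS conversion; a sketch (the pattern is illustrative, and it is still a history rewrite with all the usual force-push caveats):

    # rewrite all refs so matching files become LFS pointers
    git lfs migrate import --include="*.zip" --everything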


Then you're rewriting history. And this should be avoided at all cost, as git hashes are part of the cryptographic history of a repo.

Who are you going to trust to do this? Someone will need to execute this and then do a force push. Are you going to compare it? Are you going through all the history of files removed to see if there is no change to the source code? And what about the files you're actually going to move to git-lfs? How can you prove that they haven't changed in the migration?

Provenance is a thing. https://slsa.dev/provenance/v0.2


I'll admit I've done this early on in the history of a pet project repo, scrubbing a credential :/ (not implementing git-lfs, though).

I knew that rewrites compromise the history. It is low stakes and I didn't want to allocate a new repo to start anew, just learned my lesson there.

I'm mostly curious about git-lfs's viability for large files, or whether it should be avoided and large build artifacts planned for hosting elsewhere.


> As it turns out, we only run repacks on a repository network level, which means that repacks need to consider objects from all forks of a given repository.

> Repacking entire repository networks will always lead to less optimal pack sizes compared to repacking just objects from a single fork. For GitHub, disk space is not the only thing we optimize for, but also performance across forks and client performance.

So the lesson here is you can DoS existing open source projects somehow by forking them and increasing the forked repo size >2GB?


I don't think that's the case.

The issue is that GH doesn't accept packs that are too big. git by default packs everything into a single pack. The maximum pack size can be specified either in config or as an argument to repack. The way I read the error message, a user can push a huge repo by making sure it's packed into a few packs under the 2GB limit.

It doesn't seem like there's an easy way to turn this into a DoS. GH would repack the fork network on its own schedule, and a user probably cannot trigger repacks. The repacks on GH's side would probably be smaller than their limit, too. Packs are probably scoped to a fork, and the server is an active participant, so it most likely wouldn't return objects from other forks. I don't think it would be easy to DoS GH just by pushing big packs (either under or over the limit).
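
The knobs being referred to are roughly these (whether splitting packs actually helps against GitHub's receive-side check is the open question):

    # split local packs at a maximum size during repack
    git repack -a -d --max-pack-size=1g

    # or set it persistently
    git config pack.packSizeLimit 1g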


How would the DoS work? As I understand it the issue here only occurs if you try to push the entire project to a _new_ repo. It doesn't affect existing repos.


I'm almost certain it affects all repos. It's just more unusual to push huge packs into existing repos. If for whatever reason you happen to have a huge branch clocking in at over 2GB, you'd get the same error.


True, for github.


Funny, I just encountered something similar last week. My org runs a GitLab instance and I created a repo with lecture slides etc. The slides are HTML files and some of them have generated videos (for various reasons I wanted to upload the slides, not the source scripts). Turns out that my org set a 150MB upload limit and one of the commits exceeded it, so my push was failing. The fix was non-trivial, but git-sizer was definitely helpful.


I ran into this at work trying to push a P4 clone that sat at 20GB in my local .git folder. Needless to say, I didn't have room on my local machine to actually check out a workspace.

A shell loop can push N commits at a time, using git's handy syntax for that:

`git push origin p4/master~${x}:master`
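
The full loop looks something like this (the commit counts are illustrative, and it assumes the remote branch can be created and then fast-forwarded at each step):

    # push progressively newer slices of history, oldest first,
    # so each individual push stays under the server's limit
    for x in 40000 30000 20000 10000 0; do
        git push origin "p4/master~${x}:refs/heads/master"
    done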


As LLVM is a monorepo of a number of different (giant) projects that kind of feel separate (and in fact used to be separate), I often wish I could have a submodule pointing directly at a subfolder so I don't have to do a checkout of all of LLVM.


I'm not sure what the current status of this issue is, based on the linked Discourse thread.

Is it still a fundamental limitation, but with a known client-side workaround?


It only happens if you push the entire repo at once. A script could split the upload into multiple smaller pushes and solve this.


Sounds like Git CLI could do this automatically, if it can detect these errors (I think it can't).


This is a GitHub-specific issue, so I don't think Git should do anything about it.


What is taking up so much space, though? Why not just use LFS? It can't possibly just be the text files.


June 2022


I think GitHub has some whitelist for large popular projects. Chromium is hundreds of GB, for example, but you can still fork it on GitHub, make any changes you like, and push them.


If I understand correctly, the limitation is on the size of a push.

> If you happened to push the entire llvm-project to another new (personal) GitHub repo recently, you might encounter the following error message before the whole process bails out:

> The crux here is that we tried to push the entire repo, which has exceeded GitHub’s upload limit (2GB) at once. This is not a problem for majority of the developers who already had a copy of llvm-project in their separated GitHub repos

If you fork the repo on the GitHub website, they manage the fork server-side, and you never have to push up the entire repo.


They seriously wrote 2gb of C++?


The total size of the files in the repo is 1.35 GiB. The total size of all the blobs is 84.9 GiB.

https://pastebin.com/T5HRMfg9



