
The problem here is the unbounded growth of a git repo. In this specific case, a size limit was triggered; in other circumstances, the repo would have taken too long to transfer or exhausted the available storage.

Anyway, the problem is that git stores all changes, forever. A better approach would be to clean up old commits, or somehow merge them into snapshots covering fixed timespans (say, anything older than a year gets compressed into monthly changesets).
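Git has no built-in notion of compacting old commits into periodic snapshots, but for the transfer-time part of the problem a shallow clone gets close. A rough sketch (the repository URL is just a placeholder):

    # Only fetch commits from the last year; older history is never transferred
    git clone --shallow-since="1 year ago" https://github.com/example/repo.git

    # Deepen the local history later if it turns out to be needed
    git fetch --deepen=1000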




This is not a "problem", this is why source control exists. You should never rewrite published history.


A more conservative approach would be some sort of layered storage/archiving, I guess. The older the commit, the less likely it is to be used, so it could be archived in different storage optimized for long-term retention. This way you keep the "hot" history small while keeping the full history available.


That's generally how git packs are used at large organizations that host their own repositories. I'm sure GitHub does something similar.


I don't think I can agree with that.

Accidentally publish secrets/credentials? Rotate them, yes, but also remove them from the published history.

Accidentally publish a binary for a build tool without the proper license? Definitely remove it (and add it to your .gitignore so it doesn't happen again!)

You discover a major flaw that bricks certain systems or causes data loss? Retroactively replace the Makefile/configure script/whatever so it prints a warning instead of producing the broken build.

I'm sure there are others.
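For the secrets and stray-binary cases, one common route is the third-party git-filter-repo tool; a hedged sketch, with the file name and branch as placeholders:

    # git-filter-repo is a separate install, not part of core git
    # Drop an accidentally committed file from every commit
    git filter-repo --invert-paths --path config/secrets.env

    # Or drop every blob above a size limit (stray binaries, zip files, ...)
    git filter-repo --strip-blobs-bigger-than 10M

    # filter-repo removes 'origin' as a safety measure; re-add it, then
    # force-push the rewritten history
    git push --force origin main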


AFAIK, copyright problems are fixed with just another commit, without rewriting history. There's also no need to care about credentials once they've been rotated. Should bugs be fixed by deleting all history too? That code was buggy, bad, bad, must delete? A VCS just becomes a glorified FTP server this way.


I don't think that your points are actually in conflict.

If this is my source code, I want the whole history. I want that 10-year-old commit that isn't used in any current branch. A build machine may not need any history: it just wants to check out a particular branch as it is right now, and that works too.

But there is an intermediate case: Let's say that I have an issue with a dependency. I might check out that code and want some history to know what has changed recently, but I don't need that huge zip file that was accidentally checked in and then removed 4 years ago. If it were a consistent problem, perhaps you'd invent some sort of 'shallow' or 'partial' clone, something like this:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
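For reference, the two flavors look roughly like this (repository URL is a placeholder):

    # Partial clone: full commit history, but file contents are fetched lazily
    git clone --filter=blob:none https://github.com/example/repo.git

    # Shallow clone: only the most recent commit, no history at all
    git clone --depth=1 https://github.com/example/repo.git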


True, though shallow clones have performance issues: https://github.com/Homebrew/discussions/discussions/225


The value of a commit approaches zero as it gets older. After a certain threshold, no one will ever look at it again. Never say never, sure, but why should we keep that deadweight around?


As long as a line of code is in use, there is value in knowing who authored it and when.

If a 10-year-old vulnerability is found in OpenSSL, it would be nice to be able to investigate whether it was an accident or an act of espionage.
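That kind of archaeology is exactly what the history is for; roughly (paths and line numbers are hypothetical):

    # Who last touched each line of a file, and in which commit
    git blame path/to/file.c

    # Every commit that ever changed a given line range
    git log -L 100,120:path/to/file.c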


Your premise is incorrect. The other day I was looking around a repository that's been through many migrations, and found a commit from 2004 that was relevant to my interests.


Git allows you to rewrite history, so you can "squash" old commits to reduce the size of the history as needed.
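One way to do that, assuming <cutoff> and main are placeholders for a real commit and branch, is to collapse everything before a cutoff into a single snapshot and replay the rest on top. Hashes change, so this needs a coordinated force push:

    # Create a new root commit containing the tree as of <cutoff>
    git checkout --orphan truncated <cutoff>
    git commit -m "Snapshot of history up to <cutoff>"

    # Replay the newer commits on top of the snapshot
    git rebase --onto truncated <cutoff> main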


It would be great to have some kind of way to do this while still maintaining the Merkle tree.


Isn’t that what packs are for? The raw, content-addressable object store has no inherent optimization for reducing repo size; any changed file is stored as a complete copy until a higher layer does something to compress that down.
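Roughly, packing is what turns those full copies into deltas; for example:

    # Pack loose objects into delta-compressed packfiles, drop redundant ones
    git repack -a -d

    # Or run garbage collection with a more aggressive delta search
    git gc --aggressive --prune=now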


Sure, but rewriting it manually is a tedious process. It should be automated on GitHub's side, to keep the repo size approximately constant over time.


I think this is what `git filter-branch` is supposed to be for: https://git-scm.com/docs/git-filter-branch

I've never used it before, but from what I understand, it's very powerful but also very confusing and easy to mess up, and of course it has a vague, ambiguous name that makes it hard to discover; in other words, it's quintessentially git.
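For what it's worth, a typical invocation looks something like this (the file path is a placeholder, and the git docs themselves now steer people toward git-filter-repo instead):

    # Drop a file from every commit on every ref, pruning commits left empty
    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/huge-file.zip' \
      --prune-empty -- --all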


That would be silent data loss, so it absolutely should not be automated.


GitHub can't rewrite the refs on its own without breaking users' stuff. It can only repack the existing objects; the squashing needs to be done by the developers. Also, it's a non-fast-forward change, so it needs to be coordinated among the git users anyway.



