
The problem here is the unbounded growth of a git repo. In this specific case, a size limit was triggered; in other circumstances, the repo would have taken too long to transfer or exhausted the available storage.

Anyway, the problem is that git stores all changes, forever. A better approach would be to clean up old commits, or somehow merge them into snapshots covering fixed timespans (say, anything older than a year gets compressed into monthly changesets).
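Git has no built-in notion of compacting old commits into periodic snapshots, but for the transfer-time part of the problem a shallow clone gets close. A rough sketch (the repository URL is just a placeholder):

    # Only fetch commits from the last year; older history is never transferred
    git clone --shallow-since="1 year ago" https://github.com/example/repo.git

    # Deepen the local history later if it turns out to be needed
    git fetch --deepen=1000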




This is not a "problem", this is why source control exists. You should never rewrite published history.


A more conservative approach would be some sort of layered storage/archiving, I guess. The older the commit, the less likely it is to be used, so it could be archived in different storage optimized for long-term retention. This way you keep the "hot" history small while keeping the full history available.


That's generally how git packs are used at large organizations that host their own repositories. I'm sure GitHub does something similar.


I don't think I can agree with that.

Accidentally publish secrets/credentials? Rotate them, yes, but also remove them from the published history.

Accidentally publish a binary for a build tool without the proper license? Definitely remove it (and add it to your .gitignore so it doesn't happen again!)

You discover a major flaw that bricks certain systems or causes data loss? Retroactively replace the Makefile/configure script/whatever so it prints a warning instead of producing the broken build.

I'm sure there are others.
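For the secrets and stray-binary cases, one common route is the third-party git-filter-repo tool; a hedged sketch, with the file name and branch as placeholders:

    # git-filter-repo is a separate install, not part of core git
    # Drop an accidentally committed file from every commit
    git filter-repo --invert-paths --path config/secrets.env

    # Or drop every blob above a size limit (stray binaries, zip files, ...)
    git filter-repo --strip-blobs-bigger-than 10M

    # filter-repo removes 'origin' as a safety measure; re-add it, then
    # force-push the rewritten history
    git push --force origin main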


AFAIK, copyright problems are fixed with just another commit, without rewriting history. There's also no need to care about credentials once they've been rotated. Should bugs be fixed by deleting all history too? That code was buggy, bad, bad, must delete? A VCS just becomes a glorified FTP server this way.


I don't think that your points are actually in conflict.

If this is my source code, I want the whole history. I want that 10-year-old commit that isn't used in any current branch. A build machine may not need any history: it just wants to check out a particular branch as it is right now, and that works too.

But there is an intermediate case: Let's say that I have an issue with a dependency. I might check out that code and want some history to know what has changed recently, but I don't need that huge zip file that was accidentally checked in and then removed 4 years ago. If it were a consistent problem, perhaps you'd invent some sort of 'shallow' or 'partial' clone, something like this:

https://github.blog/2020-12-21-get-up-to-speed-with-partial-...
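For reference, the two flavors look roughly like this (repository URL is a placeholder):

    # Partial clone: full commit history, but file contents are fetched lazily
    git clone --filter=blob:none https://github.com/example/repo.git

    # Shallow clone: only the most recent commit, no history at all
    git clone --depth=1 https://github.com/example/repo.git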


True, though shallow clones have performance issues: https://github.com/Homebrew/discussions/discussions/225


The value of a commit approaches zero as it gets older. After a certain threshold, no one will ever look at it again. Never say never, sure, but why should we keep that deadweight around?


As long as a line of code is in use, there is value in knowing who authored it and when.

If a 10-year-old vulnerability is found in OpenSSL, it would be nice to be able to investigate whether it was an accident or an act of espionage.
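That kind of archaeology is exactly what the history is for; roughly (paths and line numbers are hypothetical):

    # Who last touched each line of a file, and in which commit
    git blame path/to/file.c

    # Every commit that ever changed a given line range
    git log -L 100,120:path/to/file.c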


Your premise is incorrect. The other day I was looking around a repository that's been through many migrations, and found a commit from 2004 that was relevant to my interests.


Git allows you to rewrite history, so you can "squash" old commits to reduce the size of the history as needed.
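One way to do that, assuming <cutoff> and main are placeholders for a real commit and branch, is to collapse everything before a cutoff into a single snapshot and replay the rest on top. Hashes change, so this needs a coordinated force push:

    # Create a new root commit containing the tree as of <cutoff>
    git checkout --orphan truncated <cutoff>
    git commit -m "Snapshot of history up to <cutoff>"

    # Replay the newer commits on top of the snapshot
    git rebase --onto truncated <cutoff> main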


It would be great to have some kind of way to do this while still maintaining the Merkle tree.


Isn’t that what packs are for? The raw, content-addressable object store has no inherent optimization for reducing repo size; any changed file is stored as a complete copy until a higher layer does something to compress that down.
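Roughly, packing is what turns those full copies into deltas; for example:

    # Pack loose objects into delta-compressed packfiles, drop redundant ones
    git repack -a -d

    # Or run garbage collection with a more aggressive delta search
    git gc --aggressive --prune=now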


Sure, but rewriting it manually is a tedious process. It should be automated on GitHub's side, to keep the repo size approximately constant over time.


I think this is what `git filter-branch` is supposed to be for: https://git-scm.com/docs/git-filter-branch

I've never used it before, but from what I understand, it's very powerful but also very confusing and easy to mess up, and of course it has a vague, ambiguous name that makes it hard to discover; in other words, it's quintessentially git.
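For what it's worth, a typical invocation looks something like this (the file path is a placeholder, and the git docs themselves now steer people toward git-filter-repo instead):

    # Drop a file from every commit on every ref, pruning commits left empty
    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/huge-file.zip' \
      --prune-empty -- --all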


That would be silent data loss, so it absolutely should not be automated.


GitHub can't rewrite the refs on its own without breaking users' stuff. It can only repack the existing objects; the squashing needs to be done by the developers. Also, it's a non-fast-forward change, so it needs to be coordinated among the git users anyway.



