
Measuring the many sizes of a Git repository - edmorley
https://blog.github.com/2018-03-05-measuring-the-many-sizes-of-a-git-repository/
======
bluedino
>> What we find is that many of the repositories that tax our servers the most
are not unusually big. The most challenging repositories to host are often
those that have an unusual internal layout that Git is not optimized for.

Like CocoaPods!

[https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...](https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomment-193772935)

~~~
liquid_room
GitHub is pretty awesome :)

------
colemannugent
I chuckled a bit when their _git-sizer_ tool pointed out a high level of
concern for the 66-parent octopus merge in the Linux kernel.

See [https://www.destroyallsoftware.com/blog/2017/the-biggest-and...](https://www.destroyallsoftware.com/blog/2017/the-biggest-and-weirdest-commits-in-linux-kernel-git-history)
for the story behind the Cthulhu commit.

~~~
LukeShu
From the article:

> _Update: it was [an accident][1], which Linus responded to in his usual
> fashion._
>
> _[1]: [http://lkml.iu.edu/hypermail/linux/kernel/1603.2/01926.html](http://lkml.iu.edu/hypermail/linux/kernel/1603.2/01926.html)_

Those who only know Linus from his rants might be surprised that here "his
usual fashion" means:

\- Acknowledging that the root cause was GitHub's documentation being
misleading.

\- Not blaming the contributor for being misled by GitHub: "I can see why
that documentation would make you think it's the right thing to do."

\- Admitting that the ease with which the accident happened is a deficiency in
Git's UI.

\- CCing the Git maintainer to discuss improving Git to make it harder to do
this by accident. (This eventually led to the --allow-unrelated-histories flag
being needed to do this kind of merge.)

------
pandem
_The Linux kernel has been developed over 25 years by thousands of
contributors, so it is not at all alarming that it has grown to 1.5 GB. But if
your weekend class assignment is already 1.5 GB, that’s probably a strong hint
that you could be using Git more effectively!_

Git is only 12 years old, so how does Linux have 25 years of history there? As
far as I know, Linux used patches on mailing lists before git, are those also
somehow transferred to the repo?

~~~
mdaniel
_patches on mailing lists before git, are those also somehow transferred to
the repo_

Given that conceptually git is just a linked list of patches, I can't imagine
why they wouldn't have that history.

~~~
pedrocr
Actually, that's what VCSs before git used to be and what git changed. Git
doesn't keep patches; it keeps full states of the repository in a
content-addressable fashion. It's one of its key insights. Instead of needing
an always-correct way to encode deltas, just encode the state itself and leave
it to the tools to figure out what the diff should be. That way you're not
encoding in your disk format something that can be done better in a later
version of the tool.
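
To make the "content addressable" part concrete, here is a minimal sketch in Python. The blob ID calculation follows Git's SHA-1 object format (a `blob <size>\0` header followed by the content). Everything else is a toy for illustration: `SnapshotStore` and `diff_on_demand` are hypothetical names, not anything from Git itself, and real Git also stores trees and commits, which this skips.

```python
import difflib
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git's SHA-1 object format gives a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class SnapshotStore:
    """Toy content-addressable store (hypothetical): the key is the content's hash."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self.objects[oid] = content  # identical content lands on the same key
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]


def diff_on_demand(store: SnapshotStore, old: str, new: str) -> str:
    """Compute a diff from two stored states; no diff is ever written to the store."""
    a = store.get(old).decode().splitlines(keepends=True)
    b = store.get(new).decode().splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, "old", "new"))


store = SnapshotStore()
v1 = store.put(b"hello\n")
v2 = store.put(b"hello\nworld\n")
again = store.put(b"hello\n")  # same content, so no new object is created

print(v1 == again, len(store.objects))
print(diff_on_demand(store, v1, v2))
```

The store only ever holds whole states. The diff is produced by whatever tool you run at read time, so a better diff algorithm later needs no change to what is on disk.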

~~~
simcop2387
That said, git doesn't just store direct copies either. It bundles objects
into what it calls packfiles, applying compression and various encodings to
reduce disk space and make it quicker to find a given version of a file.

[https://git-scm.com/book/en/v2/Git-Internals-Packfiles](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)
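
As a rough illustration of the idea (not Git's actual packfile format), a delta can be expressed as "copy this range from a base object" and "insert these literal bytes" instructions. The sketch below builds such a delta with Python's `difflib.SequenceMatcher`; `make_delta` and `apply_delta` are hypothetical helper names, and the instruction layout is an assumption made for readability.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not found in base
        # "delete" needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target object from the base and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n" * 20
target = base + b"line four\n"

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")

print(apply_delta(base, delta) == target)
print(len(target), literal_bytes)
```

Only the bytes that are new in the second version have to be stored literally. The rest is a reference back into the base, which is where the space saving comes from.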

~~~
pedrocr
Git the tool does packfiles but that's an implementation detail. Git the VCS
can work with any object storage backend.

------
ktpsns
I noticed just today that GitHub has a number of counter-measures for absurd
git repositories built in when you try to push something. For instance, I
imported a huge (3GB, mostly due to large frequently updated files in the
history) Subversion repository to git and got failures due to individual
commits exceeding 100MB. This was quite helpful in bringing the size of my
repository to a reasonable state. Tools like
[https://rtyley.github.io/bfg-repo-cleaner/](https://rtyley.github.io/bfg-repo-cleaner/)
are indispensable for doing this kind of filtering without headache.
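
For finding the offenders before pushing, one common approach is to list every object in history with its size and filter for big blobs. The sketch below only does the filtering step in Python. It assumes its input came from `git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'`. The 100MB figure is taken from the comment above and the exact byte threshold is an assumption. `find_large_blobs` is a hypothetical helper, and the object IDs and paths in the sample are made up.

```python
def find_large_blobs(batch_check_output: str, limit_bytes: int = 100 * 1024 * 1024):
    """Return (size, object_id, path) for blobs over the limit, largest first.

    Expects lines of the form "<type> <object id> <size> <path>".
    """
    large = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        if len(parts) < 3 or parts[0] != "blob":
            continue  # skip commits, trees, tags, and malformed lines
        size = int(parts[2])
        path = parts[3] if len(parts) == 4 else ""
        if size > limit_bytes:
            large.append((size, parts[1], path))
    return sorted(large, reverse=True)


# Made-up sample output; the IDs and paths are placeholders, not real objects.
sample = "\n".join([
    "commit 1111111111111111111111111111111111111111 250 ",
    "tree 2222222222222222222222222222222222222222 80 data",
    "blob 3333333333333333333333333333333333333333 157286400 data/dump.sql",
    "blob 4444444444444444444444444444444444444444 1024 README",
])

for size, oid, path in find_large_blobs(sample):
    print(f"{size / (1024 * 1024):.0f} MiB  {oid[:10]}  {path}")
```

This only tells you which paths to target. Actually rewriting the history is still a job for a tool like BFG.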

~~~
nerdponx
BFG is a lifesaver. If only I could convince my employer to donate! (Yes, I
already donated privately.)

------
eridius
This looks great. Surprised to see no package manager support for it though.
I'd love to see MacPorts or Nix support for this.

~~~
deadbunny
While they're at it, why not: dpkg, docker, entropy, flatpak, guix, ipkg,
netpkg, opkg, pkgng, pacman, rpm, snappy?

It's better to leave packaging to each distro's maintainers, or to super keen
folks who want to do it specifically for your project, rather than spending
80% of your time preparing the release packaging for every single package
manager there is. Even then, those folks will only be super keen about one or
two platforms.

~~~
eridius
Two things:

1\. Releasing a new binary tool without any package manager support just sucks
for your users in general, because it means they're required to manually
install it and most of them will probably end up with a horribly-outdated
version of your tool installed for a long time because their package manager
can't ever tell them that it's out of date.

2\. macOS isn't a distro, so you can't just say "let your distro maintainers
do it". If you don't submit to MacPorts, the only way you'll get in there is
if someone else steps up to submit on your behalf, but that kinda sucks
because your package will likely end up out-of-date in MacPorts unless the
volunteer maintainer is super diligent about noticing new releases and
updating the Portfile.

Nix is a more general-purpose packaging system, but it also suffers from this
problem. In fact, in my experience, Nix packages do tend to be out of date for
a while before someone notices and fixes it.

FWIW I don't really expect people to actually submit their own tools to Nix
anyway, because there's a fairly steep learning curve there, but it would be
really awesome if people did. But submitting to MacPorts is more
straightforward.

~~~
deadbunny
> 2\. macOS isn't a distro, so you can't just say "let your distro maintainers
> do it". If you don't submit to MacPorts, the only way you'll get in there is
> if someone else steps up to submit on your behalf, but that kinda sucks
> because your package will likely end up out-of-date in MacPorts unless the
> volunteer maintainer is super diligent about noticing new releases and
> updating the Portfile.

So by that logic I have to support mac/nix over every other system by default
as they don't have maintainers? That sounds like a mac/nixos problem, not a
developer problem.

~~~
eridius
If you want your tool to actually get used, you should put in at least a
little effort towards trying to get it in package managers. I don't know why
you're acting so surprised about that.

------
harshbutfair
This seems like a very useful tool, and it provides useful information on our
repository.

But on RHEL7 it gives this error:

error: couldn't open Git repository: git rev-parse failed: Unknown option: -C

I assume it requires a later version of git.

~~~
jwilk
git-sizer uses the -C option, which was added in git v1.8.5:

[https://github.com/git/git/commit/44e1e4d67d5148c245db362cc4...](https://github.com/git/git/commit/44e1e4d67d5148c245db362cc48c3cc6c2ec82ca)

RHEL7 ships with git v1.8.3.1.

There might be other compatibility issues, e.g.:
[https://github.com/github/git-sizer/issues/18](https://github.com/github/git-sizer/issues/18)
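
If you want to check for this up front, a small sketch along these lines could compare the installed version against 1.8.5. It parses the text that `git --version` prints. The function names are hypothetical, and this is not how git-sizer itself handles it.

```python
import re

# -C was added in git v1.8.5, per the commit linked above.
MIN_VERSION_FOR_DASH_C = (1, 8, 5)


def parse_git_version(version_output: str):
    """Extract the numeric version from text such as 'git version 1.8.3.1'."""
    match = re.search(r"(\d+(?:\.\d+)+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def supports_dash_c(version_output: str) -> bool:
    """True if this git version is new enough to understand the -C option."""
    return parse_git_version(version_output) >= MIN_VERSION_FOR_DASH_C


for text in ("git version 1.8.3.1", "git version 1.8.5", "git version 2.16.2"):
    print(text, "->", supports_dash_c(text))
```

To run it against the local install, something like `subprocess.run(["git", "--version"], capture_output=True, text=True).stdout` would supply the input string.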

~~~
deadbunny
If you need modern (but stable and maintained by Rackspace) packages in CentOS
check out [https://ius.io/](https://ius.io/)

------
zspitzer
I've always wondered why GitHub doesn't display the size of files at the
folder level. The only way on the website is to drill down to the individual
file.

