
Supercharging the Git Commit Graph IV: Bloom Filters - matthewwarren
https://blogs.msdn.microsoft.com/devops/2018/07/16/super-charging-the-git-commit-graph-iv-bloom-filters/
======
jaytaylor
This is interesting work.

I've never noticed the referenced git operation as being slow. In my
experience,

    git log -- <path/to/file/or/dir>

has always seemed instantaneous or nearly so. It's quick even on code bases
that are quite large (years of work from large teams, tens of thousands of
commits).

Maybe data and performance information on before vs. after bloom filters would
help clarify the specific design goals.

I love seeing bloom filters in use in widely-used software in real life, in
any case!

~~~
dingo_bat
> I've never noticed the referenced git operation as being slow

On our monorepo, a simple 'git log -- random_file.cpp' can take anywhere from
10s to a couple of minutes.

Edit: This is on VMware instances with 16 Xeon cores and 32 gigs of memory
with SSD backed storage. My puny laptop would probably struggle to even clone
the repo.

~~~
mwcampbell
Sounds like your team would benefit from GVFS [1].

[1]: [https://github.com/Microsoft/GVFS](https://github.com/Microsoft/GVFS)

------
avar
This thread on the Git mailing list from back in May has some more technical
details:
[https://public-inbox.org/git/86zi1fus3t.fsf@gmail.com/](https://public-inbox.org/git/86zi1fus3t.fsf@gmail.com/)

Derrick Stolee, the author of this post, chimes in with some more details
about how this works under the hood.

~~~
gilleain
Huh I wondered why his github had so much on git lately.

I know his work from generating graphs (eg
[https://arxiv.org/abs/1104.5261](https://arxiv.org/abs/1104.5261))

------
ythn
Coming from a game programming background - before clicking the link I was
expecting a VFX "bloom filter" effect on a visual commit graph where "light
sources" are exaggerated and seem to "glow".

~~~
Heliophite
That was my reaction as well. "Oh, neat, visualization!" ".. oh, the óther
bloom."

------
lozenge
Does it work when I'm looking for all commits changing a directory? I.e. Do
directories get added to the bloom filter or just filenames?
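
To make the question concrete, here is a minimal Python sketch of a per-commit Bloom filter over changed paths. It is an illustration only, not the implementation from the post: the `BloomFilter` class, its size and hash count, and the `filter_for_commit` helper are all hypothetical. The sketch assumes one possible answer, namely that every leading directory of a changed path is inserted as its own key, so a directory query can be answered by the same filter.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; the sizes are arbitrary illustration values."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not added"; True means "possibly added".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def filter_for_commit(changed_paths):
    """Assumption: add each changed file and every leading directory."""
    bf = BloomFilter()
    for path in changed_paths:
        parts = path.split("/")
        for n in range(1, len(parts) + 1):
            bf.add("/".join(parts[:n]))
    return bf


bf = filter_for_commit(["src/core/graph.c", "README.md"])
print(bf.might_contain("src/core"))   # a key that was added is never missed
```

Under that assumption, a query for a directory costs the same as a query for a file. A filter that stored only full file paths could not answer it without knowing which files the directory contains.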

~~~
tempay
I thought git didn't have a concept of directories, just blobs that have a
path? If so, I presume "all commits changing a directory" is found by
individually checking each file contained by the directory.

~~~
masklinn
Git very much has directories, it calls them "trees": a commit links to a tree,
which links to a number of trees or blobs, associating each with a name:
[https://git-scm.com/book/gr/v2/Git-Internals-Git-Objects](https://git-scm.com/book/gr/v2/Git-Internals-Git-Objects)

A blob is a file (ish, it can also be used for some other things since it's not
intrinsically named), a tree is a directory.

However, Git doesn't treat filesystem directories themselves as first-class,
e.g. you can't version an empty directory. A tree only exists in order to
contain sub-items (ultimately blobs).
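
The commit, tree and blob relationship described above can be sketched in a few lines of Python. This is a simplified model for illustration: real Git objects are content-addressed by hash and trees store modes alongside names, and the `Blob`, `Tree`, `Commit` and `list_paths` names here are hypothetical rather than any Git API.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Blob:
    data: bytes  # file contents only; the name lives in the parent tree


@dataclass
class Tree:
    # Maps an entry name to either a sub-tree (directory) or a blob (file).
    entries: Dict[str, Union["Tree", Blob]] = field(default_factory=dict)


@dataclass
class Commit:
    tree: Tree  # the root directory snapshot
    message: str


def list_paths(tree, prefix=""):
    """Return every file path reachable from a tree, in sorted order."""
    paths = []
    for name, obj in sorted(tree.entries.items()):
        full = prefix + name
        if isinstance(obj, Tree):
            paths.extend(list_paths(obj, full + "/"))
        else:
            paths.append(full)
    return paths


root = Tree({
    "README": Blob(b"hello"),
    "src": Tree({"main.c": Blob(b"int main(void) { return 0; }")}),
    "empty": Tree(),  # contributes no paths, much like an empty directory
})
commit = Commit(tree=root, message="initial")
print(list_paths(commit.tree))  # ['README', 'src/main.c']
```

In this model a directory has no identity apart from the entries it holds, which mirrors the point that a tree exists only to contain sub-items.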

------
crb002
Glad MS poached Stolee from Iowa State.

------
kanox
Can you make `$ git branch` fast? Because it takes 5-10 seconds for me on
linux.

~~~
stolee
Do you mean `git branch --contains` or `git branch -vv`? There are many
options to `git branch` that cause Git to be very slow.

The commit-graph feature in general will make these faster by reducing time
spent parsing commits. You can compute a commit-graph right now if you have
Git 2.18 installed:
[https://blogs.msdn.microsoft.com/devops/2018/06/25/superchar...](https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/)

Generation numbers will make these operations much faster in Git 2.19. Here
are the related commits:

`git branch --contains`:
[https://github.com/git/git/commit/f9b8908b85247ef60001a683c2...](https://github.com/git/git/commit/f9b8908b85247ef60001a683c281af0080e9ee77)

`git tag --contains`:
[https://github.com/git/git/commit/819807b33f820dc17d96f04374...](https://github.com/git/git/commit/819807b33f820dc17d96f043747daf18c5e38516)
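
As a rough illustration of why generation numbers help `--contains` style queries, here is a small Python sketch. It assumes the usual definition (a commit's generation is one more than the largest generation among its parents, with root commits at 1) and is not taken from the linked commits; the `parents` dictionary and both function names are hypothetical.

```python
def generation_numbers(parents):
    """Compute generation numbers for a DAG given as {commit: [parents]}."""
    gen = {}

    def visit(commit):
        if commit not in gen:
            gen[commit] = 1 + max((visit(p) for p in parents[commit]), default=0)
        return gen[commit]

    for commit in parents:
        visit(commit)
    return gen


def contains(parents, gen, tip, target):
    """Is target reachable from tip? Prune using generation numbers."""
    stack, seen = [tip], set()
    while stack:
        commit = stack.pop()
        if commit == target:
            return True
        # An ancestor always has a strictly smaller generation number, so a
        # commit at or below the target's generation cannot lead to it.
        if commit in seen or gen[commit] <= gen[target]:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return False


# Two branches forking from A: A-B-C-D and A-X-Y.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "X": ["A"], "Y": ["X"]}
gen = generation_numbers(parents)
print(contains(parents, gen, "D", "B"))  # True
print(contains(parents, gen, "Y", "D"))  # False, decided without walking to A
```

The second query is rejected at the tip itself, because `Y` has a lower generation number than `D`. Without the numbers, the walk would have to reach the root before giving up.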

~~~
kanox
A long time ago I made an alias that always passes -v to git branch, and this
is the problem: on large repos it can take a long time to compute the "ahead X,
behind Y" information. It can be fixed by aliasing to some git branch --format
without %(upstream:track).

Those numbers are not really useful interactively when very large; for example,
it doesn't help to print that one of my branches is "behind 132132". Maybe git
could print "ahead 7, behind 1000+" for old stale branches? This way it
would limit the number of commits examined.

~~~
stolee
Earlier this year, we considered similar changes to the ahead/behind
calculation in 'git status'. We concluded that there is no good way to provide
partial information. Specifically, you can't say "10 ahead" without walking
everything you are behind to find all merge-bases. That "10" you found could
become smaller by walking more of the other set. We instead opted to not
calculate that information as often.
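
A small Python sketch of the point about partial answers, under the assumption that "ahead" means commits reachable from the local branch but not from upstream, and "behind" means the reverse. The commit names and helper functions are hypothetical and the graph is a toy.

```python
def reachable(parents, tip):
    """All commits reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def ahead_behind(parents, local, upstream):
    mine = reachable(parents, local)
    theirs = reachable(parents, upstream)
    return len(mine - theirs), len(theirs - mine)


# Local branch: R-L1-L2. Upstream: R-U1-U2-U3, where U1 is a merge that
# already brought in L1.
parents = {
    "R": [],
    "L1": ["R"],
    "L2": ["L1"],
    "U1": ["R", "L1"],
    "U2": ["U1"],
    "U3": ["U2"],
}
print(ahead_behind(parents, "L2", "U3"))  # (1, 3)
```

If the upstream walk stopped early, say after `U3` and `U2`, both `L1` and `L2` would still look like local-only commits and the answer would read "ahead 2". Only after reaching the merge `U1` does `L1` turn out to be shared, dropping the true count to 1. This is the sense in which the "ahead" number can shrink as more of the other side is walked.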

Here is my reply to the thread on-list that summarizes why I think this
direction is futile:
[https://public-inbox.org/git/20180108154822.54829-1-git@jeff...](https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/t/#m976950a5c61dd2c9f12bdcd7c433344da473698f)

This ahead/behind calculation is in a lot of places, including 'git fetch'
where it checks if each ref update was a forced update (checks if the new ref
value has the old ref value in its history). For our version of Git that ships
with GVFS, we added an option to skip this check, providing a significant
speedup to users' fetch times:
[https://github.com/Microsoft/git/commit/9616c7da3141f539a425...](https://github.com/Microsoft/git/commit/9616c7da3141f539a4254748dedaf3db9d39687b)

