
Supercharging the Git Commit Graph - ethomson
https://blogs.msdn.microsoft.com/devops/2018/06/25/supercharging-the-git-commit-graph/
======
wscott
"Before I joined Microsoft, I was a mathematician working in computational
graph theory. I spent years thinking about graphs every day, so it was a habit
that was hard to break. "

As a former BitKeeper developer, this is a key person for Microsoft to have on
hand to improve git. BitKeeper got about 10X faster after it was used on the
Linux kernel and many of the key performance wins were due to better graph
traversal algorithms. Rick was a wizard at that sort of thing. The other key
win was memory layout optimizations and caching (my contribution).

So Rick made sure we walked the graph no more than necessary, and I made sure
the graph had an extremely compact representation containing as little
information as possible; then, once a target commit is found, it is looked up
in another store.

However, this quote was disturbing. "There are a few Git features that don’t
work well with the commit-graph, such as shallow clones, replace-objects, and
commit grafts. If you never use any of those features, then you should have no
problems!" The joy of not having to write commercial software! Reading between
the lines it appears they changed the default output order for 'git log' or
some internal API and then didn't bother to fix the cases that depended on
the old order.

~~~
stolee
Sorry for the worrying note about the experimental feature. One issue when
working in open source is that contributors don't have control over the
release cycle, and review requires smaller series than having the feature be
delivered all at once.

These interactions with grafts, replace-objects, and shallow clones are one
reason 2.18 does not create and manage this file automatically. The commit-
graph file works by representing the commit relationships in a new file, and
if that file exists, we treat that as the truth. The commit grafts, replace-
objects, and shallow clones use another set of special files to track commits
whose parents have been modified in special ways. If you would like to see our
progress on integrating these features together, please see this thread on the
Git mailing list: [https://public-
inbox.org/git/20180531174024.124488-1-dstolee...](https://public-
inbox.org/git/20180531174024.124488-1-dstolee@microsoft.com/T/#u)

~~~
wscott
> One issue when working in open source is that contributors don't have
> control over the release cycle, and review requires smaller series than
> having the feature be delivered all at once.

Yes, my reply was unnecessarily disparaging. Overall this looks like a cool
feature. Perhaps a stopgap solution is for those commands to just delete your
cache. But then your repository will get mysteriously slower. I just need to
think of this as a technology demonstration.

~~~
stolee
The good news is that if you are using shallow clones, then you probably don’t
have enough commits locally to need the commit-graph feature!

------
yebyen
One of the best things I ever did for my git usage was installing this
ridiculous thing in my .gitconfig aliases:

    
    
        [alias]
            l = log --date-order --date=iso --graph --full-history --all --pretty=format:'%x08%x09%C(red)%h %C(cyan)%ad%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08 %C(bold blue)%aN%C(reset)%C(bold yellow)%d %C(reset)%s'
    

It lets me type 'git l' and get this kind of commit graph:

    
    
        *       8264cf2a 2018-06-28 yebyen (origin/dev, dev) ...
        |\
        | *     df6e4c49 blabla commit messages
        | *     195d8924 commit messages
        |/
        | *     96e74a5c (branch-label) commit message
        | |\
        | | *   b2786c07 blabla
        | | *   85721288 blabla
    

This is only semi-related, and not nearly as interesting from a technical
perspective. But a built-in view like this is one of the missing features in
Git: it visually helps people understand why it's helpful to rebase
occasionally and keep your commit history clean.

I think I found it here[1]

[1]:
[https://stackoverflow.com/a/16735971/661659](https://stackoverflow.com/a/16735971/661659)

~~~
WorldMaker
I had a similar alias for a while at a past job, but realized I was just using
`gitk` anyway, and `gitk` is installed by default everywhere and doesn't need
an alias set up, so it's also easier to teach to junior developers.

Also, I don't encourage rebasing, especially to junior developers. I realize
everyone has different preferences, but a messy graph is useful and there are
other tools like `git log --first-parent` for getting "clean" baselines from
the graph. The great thing about using a graph in the first place is there are
a lot of traversal options and you don't have to micromanage where the trees
are to still see the forest.
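For anyone who hasn't tried it, a quick self-contained sketch of what `--first-parent` buys you (a throwaway demo repo; all names here are made up):

```shell
# Build a throwaway repo with one merged feature branch.
cd "$(mktemp -d)"
git init -q .
git config user.email demo@example.com
git config user.name demo
echo base > base.txt && git add base.txt && git commit -q -m "base"
git checkout -q -b feature
echo f1 > f1.txt && git add f1.txt && git commit -q -m "feature work 1"
echo f2 > f2.txt && git add f2.txt && git commit -q -m "feature work 2"
git checkout -q -
git merge -q --no-ff -m "merge feature (the PR merge)" feature

# Full graph: every commit on every merged branch.
git log --graph --oneline

# First-parent view: just the baseline and the PR merges, as a linear list.
git log --first-parent --oneline
```

The first-parent view here shows two commits (base plus the merge), while the full log shows all four.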

~~~
yebyen
Yes! I agree with you mostly, except that we try to keep from having junior
developers for too long. Nobody is really a junior developer, and everyone
should get a turn as release manager. The main benefit of this is that people
can see directly how their work style impacts the release engineering process,
if they do know exactly what that process involves and actually get a turn at
it. (You have to stub your own toe in order to know how bad it hurts.)

Our team is actually really small, and we like to make sure everyone knows
about the hassles involved in putting together a release and doing a complete
code review when it's needed. The main reason I encourage rebasing is because
it helps avoid a messy graph, and a messy graph makes it immediately much
harder to do a rebase across any number of merges, or any non-trivial span of
time.

So in other words, I like to selfishly expect the other developers on my team
to do rebases at the appropriate times, in order to preserve my own capability
to easily do rebases when needed. (We also learned last week that git rebase
has a --preserve-merges option that I feel foolish to not have known about
sooner. We've wiped out so many merge commits unnecessarily.)

We treat the master branch as carved in stone like anyone should, but other
branches should clearly flow their merges only one way (features into
releases, or features into environments), and if a branch hasn't actually been
included in a release tag, it's considered fair game for rebasing. It helps us
to prevent our three developers' sometimes too many concurrent trains of
thought from resulting in equally too many confusing merges, or an intractable
number of HEADs to manage and organize into releases.

One of the places we struggle is that we're not all-the-way onboard with CI/CD
processes, but we do the best we can so that whenever support for that kind of
thing materializes, we will mostly not need to change our processes at all and
can obtain the benefits of that kind of tooling as quickly as possible.

~~~
WorldMaker
> other branches should clearly flow their merges only one way

I tend to disagree on this as well. The best person to merge a conflict is the
developer creating the conflict in the first place, as soon or nearly as soon
as they create the conflict, because they are most likely to know why the
conflict exists in the first place. "Merge early, merge often."

Delaying "reverse" merges until the last possible second means you often don't
have the integration expertise needed without research involved; even if it
was "you" that introduced the conflict, you may have moved on to other
problems since and not recall why you did something one way or another, or
which bits are important to integrate.

Delaying those merges as rebases I feel is even worse, because not only do you
not have the resources to know why an integration needs to be made, you also
aren't recording a history of the merge conflicts you saw in the rebase such
that you can easily revisit the integration if you made an integration mistake
(which does happen, because we are, after all, only human).

My advice tends to be to use a good PR system of some sort that makes your
proper, code reviewed "forward" merges clear and obvious, and then don't worry
about a mess of other merges "underneath" that top-level of strong PR merges.
(Hence the key suggestion is that `git log --first-parent` is your best friend
if you want a "clean" view of the DAG. It gives you a linear list of just your
PR merges in master, or branch work and PR merges to that branch in any other
branch.) Also, yes, CI/CD are really good ideas.

~~~
yebyen
> you also aren't recording a history of the merge conflicts you saw in the
> rebase

This is a good point. Mistakes are made during merge conflict resolution. But
if you are tracking your upstream when you develop a long-lived feature
branch, and rebasing when there are changes to the base, and actually
comparing your rebased feature branches to the version that you had before you
force push over the old remote version, then the only thing you really need to
track is "does my diff still look like the change I intended to make."

The most valuable skill we are learning from Git is how to avoid merge
conflicts altogether, and it tends to be more a human problem than a
technological issue. (You don't avoid merge conflicts with some special commit
strategy, you do it by ensuring that two people are not actively changing the
same part of the codebase unless it's absolutely necessary.)

~~~
WorldMaker
I still think you are better off preserving every merge point-in-time when
they were made than ever rebasing feature branches. Because yes, every rebase
is an opportunity for mistakes to go unnoticed, and there's no "rebase log" to
try to unwind a mistake weeks or months later.

Also, every rebased branch is its own likely source of merge conflicts for
people working on "differently rebased" versions of the same code.

> The most valuable skill we are learning from Git is how to avoid merge
> conflicts altogether, and it tends to be more a human problem than a
> technological issue. (You don't avoid merge conflicts with some special
> commit strategy, you do it by ensuring that two people are not actively
> changing the same part of the codebase unless it's absolutely necessary.)

Definitely. Communication is key. It's also why I think "merge early, merge
often" works better, because it forces communication as early and often as
possible ("I'm seeing a merge conflict with this work you are doing, can you
explain it to me?"). Keep branches as short-lived as possible, and try to
avoid "separate but equal" work where you can't commingle features and have to
intentionally fence your branches from integrating with each other. Merge
feature branches between each other, even, to keep communication up. Find
tools like feature flags that allow you to "ship" to master "unfinished" work
faster to save yourself from trying to integrate long running branches after
the fact. (If you are going to let "marketing" choose which features "ship" in
a version, it is much nicer to do so by flipping flags in a dashboard
somewhere, maybe even one Marketing can use themselves, than to try to
furiously merge long-running feature branches at release time.)
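A feature flag can start as small as a conditional on a config value; a minimal sketch (the flag name and the environment-variable mechanism are made up, and a real setup might consult a dashboard or config service instead):

```shell
#!/bin/sh
# Unfinished work ships to master but stays dormant until the flag flips.
if [ "${FEATURE_NEW_REPORT:-off}" = "on" ]; then
    echo "running new report pipeline"
else
    echo "running legacy report pipeline"
fi
```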

~~~
yebyen
+1 for feature flags. We have struggled to manage long-lived changes (our
current iteration has been going on for nearly a year.)

The amount of uhh "pucker" I'll feel when we release next month, and certain
pieces of code are touching prod for the first time, is much higher than I'd
like.

The only reason I can sleep at night while spending a whole year with only
hotfixes and minor feature addons for the last release going to prod, is
because we've spent much too much time at all layers of the testing pyramid,
and every time I've made a change that should probably break some tests, it
actually breaks a few more tests than I expected (so I know that our coverage
is pretty good.) Every time we write a brand new feature, it absolutely gets a
matching Cucumber feature test in plain English that we run before and after
every merge. And that's not even to mention unit testing.

So we can be reasonably confident that everything we've ever promised, is
still true. Our test suite takes over an hour to run at this point, which is
way longer than it should be. But it helps us catch bugs well before they
start to pile up.

I'd feel marginally better if we knew what code doesn't run in prod because we
could turn it on and off, and see those feature tests failing (or simply
exclude them, since they would also be tagged with the feature label.)

It's interesting that some people feel feature flags are complicated enough to
make "Feature flags as-a-Service" businesses a thing, like LaunchDarkly. I
looked at what they are offering and thought it would serve us well, but we
haven't actually started using feature flags for anything.

------
yoklov
Dang, this made `git log` inside the repo I use at work (which is enormous
with a stupid number of commits) nearly instantaneous. Great work.

`git status` still takes over a second for me though, oh well...

~~~
fanf2
Have you tried the fsmonitor hook feature that was added in git 2.17?
[https://blog.github.com/2018-04-05-git-217-released/](https://blog.github.com/2018-04-05-git-217-released/)

~~~
yoklov
Yep, as well as the untracked cache. I also was using a split index for a
while, but it didn’t play nicely with some tools...

------
claytonjy
I wonder if any of these folks will have a hand in improving the Github commit
graph, now that Microsoft owns it?

As great as Github is, I've never understood why they still have such a
hard-to-read, horizontal commit graph while competitors like
Stash/Bitbucket/GitLab have all had beautiful vertical graphs (like the one
shown in this article) for as long as I can remember. I think this is
especially valuable for newbies who
are less inclined to get similar viz at the command line, but still useful for
vets when they (inevitably) end up in weird branch situations.

------
farresito
To save people some time, an alias for your .gitconfig

create-graph = "!f() { git show-ref -s | git commit-graph write --stdin-commits ; } ; f"

~~~
masklinn
How does it save time, given this command will rarely be run, and the next
version will automatically run it on GC? Instead of copy/pasting a command you
now have to copy-paste that same command with more gunk added and run it
separately.

~~~
farresito
You have to run it on every repository you want this in, don't you? At least,
that's what I understood.

~~~
masklinn
This is mostly for large repositories where building the revision graph takes
a long time (i.e. commit counts in the five-plus figures). I have two of those
from $dayjob; most of the stuff I work with/on/for doesn't come even remotely
close. Running this on a repo with 5 commits is all but useless.

And even then you _still_ really only need to run it once per repository; you
can just cd/paste/return; cd/paste/return; … Hell, you'd probably have an
easier
time writing a script which looks for all git repositories and runs the
command versus having to manually visit each and do so.
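That script is short enough to sketch here (the default search root `~/src` is an assumption, and writing the commit-graph requires Git 2.18):

```shell
#!/bin/sh
# Find every git repository under a root directory, enable core.commitGraph,
# and (re)write its commit-graph file.
root="${1:-$HOME/src}"
find "$root" -type d -name .git -prune | while read -r gitdir; do
    repo="${gitdir%/.git}"
    git -C "$repo" config core.commitGraph true
    git -C "$repo" show-ref -s |
        git -C "$repo" commit-graph write --stdin-commits
done
```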

------
xvilka
Would be nice to have support for it in tig too.

------
slobotron
Interesting tidbit:

> The developers making Microsoft Windows use Git

~~~
01100011
Kinda surprising. I know everyone seems to love git these days, but I find
it's really better suited towards distributed and/or smaller projects. After
feeling the pain of a megarepo system at work, I'm pushing to switch to a
monorepo(well, like 4 repos instead of 200). git sort of sucks for monorepos.

Also, even after learning a fair amount of git, I still find I spend a
noticeable amount of time dealing with it. I don't remember spending a lot of
time on any of the VCS systems I've used in the past. They just stayed out of
my way and let me do my thing.

~~~
pdpi
> git sort of sucks for monorepos.

Microsoft works around several of the issues there by using GVFS. Also, at
Microsoft scale, everything "sort of sucks"; there's just no silver bullet.
You take one of the least bad options and put all the effort you can towards
making it work as well as you can.

~~~
a-dub
No. Everything only "sort of sucks" at "Microsoft scale" if you're willing to
blame it on "scale."

At "Microsoft scale," you have the resources to purpose build anything you
need from scratch for any scale you're working at... therefore if anything
"sort of sucks" it's because:

a) It's not worth the money/resources. (The "let it suck" approach)

or

b) Nobody cares (The "acceptance that we suck, at 'scale'" approach)

~~~
a-dub
... and I'm going to belabor this, because I think it's important.

There's a fair bit of excitement around the new, exciting, open source
friendly Microsoft with built-in Linux kernel emulation, an embrace of git and
a huge release of open source tools...

Honestly though, I'm not buying it, the issue I've had with Microsoft over the
years doesn't just stem from the shitty software or aggressive business
practices... it's the deep rooted culture of mediocrity it promotes.

People who belong to that church believe it's ok to build shitty software
because doing it right is too hard. Why root-cause an issue when you can just
script reboots? It's too hard anyway, we're at "Microsoft scale."

The rise of the internet and the companies that grew around it showed us that
if you have a culture of giving a shit, you can build really complex things
"at scale" that aren't complete shit.

I worry that the new Microsoft will be a different kind of trojan horse for
the OSS world. It won't be "embrace, extend, extinguish" it will be more like
a social media psyops campaign that beats it into everyone's heads that now
we're at "Microsoft scale" it's ok for everything to "kinda suck", and if
we're not careful... everything will.

------
bluebluetimes
To enable the commit-graph feature in your repository, run git config
core.commitGraph true. Then, you can update your commit-graph file by running

'git show-ref -s | git commit-graph write --stdin-commits'

how do you automatically update this?

~~~
rakoo
It's not perfect but doing it in a pre-push hook can be useful. Unfortunately
there's no post-receive hook on client-side, which could have been useful in
this situation...

~~~
masklinn
> It's not perfect but doing it in a pre-push hook can be useful.

That's completely unnecessary and way too frequent. Somewhere else (reddit, I
think?) the authors noted that they'd like to have it run alongside GCs in the
next version. So running it as pre-auto-gc is a better idea.
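A sketch of that, as a `pre-auto-gc` hook (demoed in a throwaway repo; in practice you'd run the `cat` and `chmod` from the root of your real repository; assumes Git 2.18's commit-graph support):

```shell
# Throwaway repo to demonstrate the hook.
cd "$(mktemp -d)"
git init -q .
git config user.email demo@example.com
git config user.name demo
git config core.commitGraph true
git commit -q --allow-empty -m "init"

cat > .git/hooks/pre-auto-gc <<'EOF'
#!/bin/sh
# Refresh the commit-graph cache right before any automatic gc runs.
# Must exit 0, or the hook aborts the gc itself.
git show-ref -s | git commit-graph write --stdin-commits
exit 0
EOF
chmod +x .git/hooks/pre-auto-gc

# Invoke the hook directly to show its effect:
.git/hooks/pre-auto-gc
ls .git/objects/info/commit-graph
```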

~~~
rakoo
I was thinking that you need to do that every time refs change, but it's just
boost and I presume you can have some refs in the commit-graph and some not in
there and it will still work, so you're probably right.

------
rakoo
Yet another file in the .git directory. The work is impressive and certainly
helpful, but I can already hear Fossil proponents say "just use SQLite", which
is getting more and more true.

~~~
superflyguy
Use SQLite instead of git? Or git should use SQLite? If the latter then one
problem is you'd need to keep your own fork forever as they don't accept
patches. I'm not sure if that's a price worth paying to reduce the number of
files git uses. Why is this a problem for you, anyway?

~~~
rakoo
It's not a problem for me, because all I see is the different commands that
_use_ the underlying infrastructure. It's more about the design that was
chosen: if you want to speed up things with git you have to implement specific
logic in application code that will write a file and will need to update it
periodically to keep it up-to-date, instead of using a querying engine made
specifically for this purpose.

I don't see git ever changing its file format, but I do see another tool that
imports everything from git and gives you a read-only sqlite db where you can
do whatever you want, including displaying a graph quickly as the post
advertises.

~~~
kyberias
I think you don't fully understand what you are proposing. The storage engine
(file system or SQLite) has little to do with git graph algorithm performance.
SQLite doesn't magically "display a graph quickly".

~~~
rakoo
What I'm saying is, the iteration step from "existing dataset" (what we have
today) to "faster data traversal" (what the article proposes) is a custom file
with a custom format on one side, and the appropriate query/index on the other
side; one is definitely more understandable, portable and maintainable than
the other.

~~~
WorldMaker
Except that SQL has never had great DAG data structures, queries, nor indexes.
You can model a DAG in a relational database, and you can use non-standard SQL
extensions to get some decent but not great recursive queries to do some okay
semi-poorly indexed graph work, but having maintained databases like that at
various times that all gets to be just as much a "custom file with a custom
format" as dependent on database version and business logic as anything git is
doing here.

If there was a stronger graph database store and graph query language for
consideration than SQL you might be on to something. SQL isn't a great fit
here either.
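For the curious, here is the kind of recursive query under discussion, using SQLite's `WITH RECURSIVE` (the schema is invented for illustration, and the sqlite3 CLI is assumed to be installed):

```shell
# A parents(child, parent) table models the commit DAG; the recursive CTE
# walks every ancestor of commit 'c3', merge parents included.
sqlite3 :memory: <<'SQL'
CREATE TABLE parents(child TEXT, parent TEXT);
INSERT INTO parents VALUES
    ('c3','c2'), ('c3','m1'),   -- c3 is a merge with two parents
    ('c2','c1'), ('m1','c1');
WITH RECURSIVE ancestors(id) AS (
    VALUES('c3')
    UNION
    SELECT parent FROM parents JOIN ancestors ON child = ancestors.id
)
SELECT id FROM ancestors;
SQL
```

Running it prints the four commits reachable from c3 (itself plus its three ancestors).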

~~~
rakoo
Fossil itself is stored entirely inside a SQLite db and uses it for
everything it needs; if Fossil can do it, any VCS can do it. In fact, there is
a whole section on that point in the official SQLite page
([https://www.sqlite.org/lang_with.html#rcex2](https://www.sqlite.org/lang_with.html#rcex2)).

I'm not saying SQL is the best way to store and query DAGs; any graph database
would be better. All I'm saying is that SQL is probably better at designing
and maintaining a solution than what git does with its custom file format and
custom code.

I'm only comparing the pile-of-files that git currently is with a
full-fledged SQL database. Neither is perfect, but one feels overall easier
than the other.

~~~
WorldMaker
But you are also almost intentionally conflating the SQL standard here in
your comment with the SQLite implementation (a de facto standard, of a sort,
but not a standard recognized by any body of peers to my knowledge) and with
SQLite's particular binary format (which does change between versions). That
is a custom file format with custom code. Certainly it is very portable custom
code, as SQLite is open source and ported to a large number of systems, but
just because it is related to the SQL standards doesn't gift it the benefit of
being an SQL standard in and of itself.

The SQL standards define a query language, not a storage format. There are SQL
databases that themselves optimize their internal storage structures into
"piles of files". In fact, most have at one point or another. SQLite is an
intentional outlier here; it's part of why SQLite exists.

There's nothing stopping anyone from building an SQL query engine that
executes over a git database, for what that is worth. The fact that you can't
execute SQL queries against it today doesn't really say anything at all about
whether git's database storage format is insufficient.

All of that is also before you even start to get into the weeds about
standards compliance in the SQL query language itself and how very little is
truly compliant between database engines, as they all have slightly different
dialects due to historic oddities. Or the weeds that there's never been a good
interchange format between SQL database storage formats other than overly
verbose DDL and INSERT statement dumps. Those again are sometimes subject to
compatibility failures if trying to migrate between database engines, due to
dialectal differences. Including what should be incredibly fundamental things
like making sure that foreign key relationships import and index correctly,
without data loss or data security issues, because even some of that is
dialectal and varies between engines (drop keys, ignore keys, read keys, make
sure everything is atomically transacted to the strongest transaction level
available in that particular engine, etc).

Git's current pile of files may not be better than "a full-fledged SQL
database", that's a long and difficult academic study to undertake, but a "a
full-fledged SQL database" isn't necessarily the best solution just because it
has a mostly standard query language, either.

~~~
SQLite
> SQLite [is] not a recognized standard by any body of peers to my
> knowledge...

Well, there is this:
[https://www.loc.gov/preservation/resources/rfs/data.html](https://www.loc.gov/preservation/resources/rfs/data.html)

Also, the on-disk format for SQLite has been extended, but has not
fundamentally changed since version 3.0.0 was released on 2004-06-18. SQLite
version 3.0.0 can still read and write database files created by the latest
release, as long as the database does not use any of the newer features. And,
of course, the latest release of SQLite can read/write any database. There are
over a trillion SQLite databases in active use in the wild, and so it is
important to maintain backwards compatibility. We do test for that.

The on-disk format is well-documented
([https://sqlite.org/fileformat2.html](https://sqlite.org/fileformat2.html))
and multiple third parties have used that document to independently create
software that both reads and writes SQLite database files. (We know this
because they have brought ambiguities and omissions to our attention - all of
which have now been fixed.)

------
erikb
Everybody who seriously uses git will also want to see the graph somehow.
Here's how I do it: [https://github.com/erikbgithub/dot-
files/blob/master/.gitcon...](https://github.com/erikbgithub/dot-
files/blob/master/.gitconfig#L38)

------
tomfloyer
I wish I had those amazing algorithm skills.

------
s2g
This is very cool.

Makes me sad I'll never do anything like this.

------
pmarin
This is basically what Fossil calls a timeline, or am I missing something?

[https://www.fossil-scm.org/index.html/timeline](https://www.fossil-
scm.org/index.html/timeline)

~~~
masklinn
It's not a UI feature; the UI feature has existed forever (it's the log and
log graph). This is a cache for the graph so that it does not have to be
rebuilt every time it's displayed.

