
How different are different diff algorithms in Git? - aw1621107
https://link.springer.com/article/10.1007%2Fs10664-019-09772-z
======
wcarss
From the section about how the qualitative difference between the two
algorithms was found:

"In the second step, we conducted a manual comparison between two diff outputs
produced by Myers and Histogram algorithms from all files in the sample. The
first two authors of this paper were involved to independently annotate the
diff outputs that makes the result is expected to be more reliable. ... The
comparison results between two authors from 377 files were subsequently
computed to find the kappa agreement. We obtained 70.82%, which is
categorized into ‘substantial agreement’ (Viera and Garrett 2005). This means,
the statistic result of our manual study is acceptable."

Even though I'm inclined to agree with the example given in the paper and a
lot of work clearly went into the qualitative evaluation, this feels like a
very weak way to perform a qualitative analysis. Specifically:

- this is a sample size of two academic authors who chose to write a paper
together about the quality of different diffing algorithms, i.e., a very
skewed and small sample.

- there is no mention of any blinding in the labeling process, so any
preconceptions about the quality of the different diff algorithms may have
been present in the qualitative grading -- or they may not have! We don't even
know.

- there does not seem to be a clear mention of how the representative sample
was chosen, or of what factors were taken into consideration for determining a
representative sample of changes, so that reviewers/other researchers could
potentially make different choices in the future and draw informed comparisons
with this work.

To sum up: in my admittedly not-at-all-authoritative opinion, this portion of
the paper cannot conclude more than something like, "further study is
warranted on this topic, with a far better controlled and far larger sample
size, and clearer explication of the methodological choices".

Regardless of that, it was an interesting read and not something previously on
my radar as worth experimenting with at all! Kudos to the authors for drawing
attention to it and for the other more quantitative aspects of the paper
(which I examined less and charitably assume are top notch).

------
hyperpallium
Usability of diffs, to do actual useful tasks, is an excellent research topic.
But a very difficult area.

To add my subjective experiment: I just compared myers vs histogram on my
latest commit.

- _myers_ presented a function I'd 90% gutted as the same function, edited
(the rest was moved in the file, so no LCS algo could find it), like word-diff
often does. I thought this was clever.

- _histogram_ presented it as one function deleted, and a completely new
function added. This was cleaner.

But I'm not even sure which is more usable. Might even vary with the specific
task, e.g. function evolution vs function readability. Difficult area!

~~~
sumosudo
> Usability of diffs, to do actual useful tasks, is an excellent research
> topic. But a very difficult area.

I use the linux command line `diff` to implement a poor man's database.

An online file store has files added daily that I need to download but not
duplicate, so I grab the list off the site, save it locally, and run `diff
local_downloaded_files current_list_on_site | grep '>' | tr -d '>'`. The '>'
symbol is an indication of files I haven't downloaded yet; the greater-than is
the operation I need to perform to make my local_downloaded_files file the
same as the list on the site. So I download all the files that are indicated
by the diff, add their names to the local_downloaded_files file and voila, up-
to-date downloads.
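A runnable sketch of the idea with made-up file names (I've used grep/sed here
rather than tr, so the leading "> " marker is stripped cleanly):

```shell
# Toy stand-ins for the real lists (the file names are invented).
printf 'a.tar\nb.tar\n' > local_downloaded_files
printf 'a.tar\nb.tar\nc.tar\n' > current_list_on_site

# diff prefixes right-only lines with '> '; strip that prefix to get the
# names of files not yet downloaded.
diff local_downloaded_files current_list_on_site | grep '^>' | sed 's/^> //'
# prints: c.tar
```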

~~~
heavenlyblue
Which part is here a database?

~~~
nieve
I think they mean a database in the general sense and you mean an RDBMS.

------
zaptheimpaler
Diff is reverse-engineering a many-to-one function - many possible
insert/delete sequences applied to string X map to the same string Y.

What would it look like to store files natively as insert/delete sequences
instead? So instead of filesystems and diffs on top, we could have DIFFsystems
and files on top. Kind of like a WAL. Files would be checkpoints in the WAL
for efficiency, and diffs would be 100% accurate between two checkpoints.
Probably takes a hell of a lot more space & CPU though..
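As a toy illustration of the many-to-one point (my own sketch, not from the
paper): two different edit scripts can turn the same source string into the
same target, so any diff algorithm has to pick one of them.

```python
# Two different insert/delete scripts that transform the same source
# string into the same target string.

def apply_script(s, script):
    """Apply a list of ('del', index) / ('ins', index, char) ops in order."""
    chars = list(s)
    for op in script:
        if op[0] == 'del':
            del chars[op[1]]
        else:  # 'ins'
            chars.insert(op[1], op[2])
    return ''.join(chars)

src = "x=1"
script_a = [('del', 2), ('ins', 2, '2')]   # delete the '1', then insert '2'
script_b = [('ins', 3, '2'), ('del', 2)]   # insert '2' first, then delete '1'

print(apply_script(src, script_a))  # prints: x=2
print(apply_script(src, script_b))  # prints: x=2
```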

~~~
shoo
i get your point. maybe clearer to say "inverting a many to one function",
i.e. there is not a unique well-defined patch that corresponds to the change;
there can be many such patches that produce the same change. cf. the notion of
an ill-posed inverse problem: [https://en.wikipedia.org/wiki/Well-posed_problem](https://en.wikipedia.org/wiki/Well-posed_problem)

but there's even more subtlety: some changes also aren't truly insert/delete
edits to strings. e.g. suppose the old state was the string "x=1" and the new
state is the string "x=2". If your system tracks string edits, maybe it
encodes this change as "delete char at index 2, append char '2' after index
1". But perhaps the user intent behind this change is actually <increment the
value stored in x by 1>. So reasoning about the patch at the string level
misses the point completely.

This difference doesn't really matter if you are just diffing changes from a
single user, but it could matter an awful lot when trying to merge overlapping
changes from multiple users! E.g. perhaps user A replaced "x=1" with "x=2" and
user B also replaced "x=1" with "x=2". Depending on what the intent of both
users is, perhaps the correct merged result is "x=2" or "x=3" or something
entirely different!

~~~
zaptheimpaler
That sounds cool but also a muuuch more complex problem that merge isn't
expected to solve. If merging two feature branches generated actual code that
understood intent and semantically combined both features, it would be close
to an AI that knows how to program. I think we know diff/merge are expected to
operate on character sequences not code semantics. Merging x=1 and x=1 just
gets you x=1 in any VCS. And they work just as well on files that aren't code
and might not have any semantics at all.

Interestingly, what you want breaks an invariant I would expect traditional
merge/diff to have:

diff(X, Y) = 0 => merge(X, Y) = X = Y

So what you're proposing would be a different pair of ops altogether, say
sdiff and smerge.
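A quick stdlib sanity check of that invariant (my own sketch; `difflib` stands
in for diff, and the merge of identical inputs is trivially either side):

```python
# difflib standing in for "diff": identical inputs produce an empty diff,
# and the only possible merge of identical inputs is the common content.
import difflib

def diff_is_empty(x, y):
    # unified_diff yields no lines at all when the inputs are identical
    return not any(difflib.unified_diff(x, y))

x = ["a\n", "b\n"]
y = ["a\n", "b\n"]
assert diff_is_empty(x, y)  # diff(X, Y) = 0 ...
merged = x                  # ... forces merge(X, Y) = X = Y
assert merged == x == y
```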

~~~
shoo
> If merging two feature branches generated actual code that understood intent
> and semantically combined both features, it would be close to an AI that
> knows how to program.

agree that understanding user intent is not really feasible, unless there is a
way to capture the intent of the user in some clear, machine-computable way.

but, less ambitiously, taking the example of merging code for a popular
programming language X: the language should have some well-defined grammar,
etc. maybe there's even an implementation of X's grammar in a language server
we can call. it shouldn't be a blue-sky AI R&D project to build a custom merge
tool that is able to do a three-way merge of code in language X, using extra
information about X's grammar or feedback from a language server. we can use
the extra info to reject proposed merges that obviously don't correspond to
anything valid in X's grammar -- just prune the search space down to the space
of possible merges that make some kind of basic sense from an X language
perspective. often there's more than one possible merge that is also valid in
an X language context; we can show all these valid merges to the user and ask
the user to pick the one they intended. but we can reject the other 99% of
possible merges that are valid text-level merges yet break obvious,
mechanically testable invariants of X's grammar.
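a minimal sketch of the "prune syntactically invalid merges" step, using
python itself as language X and the stdlib `ast` module as the grammar check
(the candidate strings are made up):

```python
# Use Python's own grammar (via the stdlib ast module) to reject candidate
# text-level merges that don't even parse.
import ast

def syntactically_valid(source):
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

candidates = [
    "x = 1\ny = 2\n",    # a plausible merge
    "x = 1\ny = = 2\n",  # a text-level merge that breaks the grammar
]
valid = [c for c in candidates if syntactically_valid(c)]
print(valid)  # only the first candidate survives
```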

------
shoo
in practice, suppose you want to do a merge of files that have more structure
than plain text, e.g. code belonging to a specific programming language, json,
some data format for an application, etc. this all has a lot more structure
than just plain text. you can do a better job of diffing and merging if your
difftool and mergetool are aware of the structure of your file. the text-based
tooling that git offers is to some extent a lowest common denominator; in
principle you can build custom tooling for your format / context that does a
better job.

one extension point that git offers for this is the "merge driver". you can
define an external script/application that will be called by git whenever a
merge conflict needs to be resolved for some particular file (based on path or
a pattern).
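a sketch of the wiring (the `leveldata` name and `level-merge.sh` script are
made up; the `%O`/`%A`/`%B` placeholders are git's documented
ancestor/ours/theirs temp files, with `%A` doubling as the output file):

```
# .gitattributes -- route merges of matching files to a named driver
*.leveldata merge=leveldata

# .git/config -- define the driver itself
[merge "leveldata"]
    name = custom merge driver for level data files
    driver = ./level-merge.sh %O %A %B
```

the driver script is expected to write the merged result to the `%A` file and
exit non-zero if it couldn't resolve the merge.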

Here's an older blog post describing a custom git merge driver for merging
data files in a game engine: [http://bitsquid.blogspot.com/2010/06/avoiding-content-locks-and-conflicts-3.html](http://bitsquid.blogspot.com/2010/06/avoiding-content-locks-and-conflicts-3.html)

In this gamedev context of merging game data files, it was less important to
produce the "correct" merge result than it was to produce some result that had
a valid file format. Less technical users could then fix up bad mergetool
decisions in an editor with a UI instead of trying to resolve the merge
conflicts at the level of the raw serialisation format itself (which could
corrupt the data file and make it not possible to load into the editor).

In other situations where there is a large cost to automatically producing the
wrong merge result, it would be a better tradeoff to "halt the line" if there
is ambiguity about how a merge should be resolved, and escalate to a human to
decide what to do.

Further reading:

How to wire a custom merge driver in to git: [https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver](https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver)

What values you can pass in to a merge driver on the command line: [https://github.com/git/git/blob/f1d4a28250629ae469fc5dd59ab843cb2fd68e12/ll-merge.c](https://github.com/git/git/blob/f1d4a28250629ae469fc5dd59ab843cb2fd68e12/ll-merge.c)

Simple example of the plumbing to wire in a merge driver, with a trivial dumb
driver script: [https://github.com/Praqma/git-merge-driver](https://github.com/Praqma/git-merge-driver)

------
DaiPlusPlus
I use git GUIs a lot, though (mostly GitKraken and VS’ built-in Git UI) - they
aren't all as customizable :/

~~~
leipert
Most GUI tools execute git under the hood, so if the repo config is set, they
might pick it up.

~~~
DaiPlusPlus
Yes, but they sometimes have their own private copy of the git executables
that can be out of date or have their own peculiarities, such as using their
own git-config that the default command-line git in your PATH won't be using.

------
Buetol
I also researched the best diff algorithm. Google's diff-match-patch [1]
library produces very good diffs, for example. But I found that the best diffs
are produced by wikidiff2 [2], the MediaWiki diff engine. Both engines produce
word-by-word diffs.

[1]: [https://github.com/google/diff-match-patch](https://github.com/google/diff-match-patch)

[2]: [https://www.mediawiki.org/wiki/Wikidiff2](https://www.mediawiki.org/wiki/Wikidiff2)
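Neither library is in the stdlib, but as a rough sketch of what word-level
diffing buys you, `difflib` can diff at word granularity by splitting lines
into tokens first (this is cruder than what either engine actually does):

```python
# Stdlib-only sketch of word-level diffing: split lines into word tokens,
# then diff the token sequences instead of whole lines.
import difflib

old = "the quick brown fox".split()
new = "the quick red fox".split()

sm = difflib.SequenceMatcher(a=old, b=new)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != 'equal':
        print(tag, old[i1:i2], new[j1:j2])  # prints: replace ['brown'] ['red']
```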

~~~
steerablesafe
I wonder if these could be used as custom git diff drivers.

~~~
Buetol
Maybe. I've already used another diff engine, WikiWho [1], to do my own
implementation of git blame [2] (shameless self-promotion).

[1]: [https://github.com/wikiwho/WikiWho](https://github.com/wikiwho/WikiWho)

[2]: [https://pypi.org/project/git-word-blame/](https://pypi.org/project/git-word-blame/)

------
_pastel
Are there any other articles answering this question? I would like to compare
them in my upcoming paper:

"How different are different 'How different are different diff algorithms in
Git' articles?"

~~~
amenghra
I hope you submit your paper to Sigbovik!

------
Hendrikto
This article feels very drawn out. Did they have some kind of length
requirement they needed to fill?

------
modeless
The diff algorithms seem like a hacky collection of heuristics; the type
that's begging to be replaced by a machine learning system. Something that
understands the content could do much better than either of these algorithms.
I'm not sure what the training data would look like though.

~~~
Hendrikto
Some "smart" AI trying to guess my intentions and reason about the code is the
last thing I want. This will inevitably go wrong, and I will be left with a
nonsensical patch, which may not be immediately obvious, because the AI
thought that it had figured out what I wanted.

I'd much rather encounter an error and solve it myself. Linus got it right:
keep the diffing very simple and obvious, then defer to a human if there is
ambiguity.

~~~
modeless
> I will be left with a nonsensical patch

This is the current state of things, though. Heuristic diff algorithms
generate nonsensical patches all the time. They may be technically correct but
completely obscure the meaning of the diff.

The simplicity of the diff algorithm does not help me understand a bad diff.
And these diff algorithms are already not that simple. A machine learning
approach could still guarantee technical correctness of the diff, but give
patches that make sense more often.

------
hervature
TL;DR: Use the Histogram diff algorithm in Git.
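For anyone wanting to act on that, it's a one-line config change:

```
# ~/.gitconfig -- make histogram the default diff algorithm
[diff]
    algorithm = histogram
```

You can also compare per invocation with `git diff --diff-algorithm=histogram`
(the other values are `myers`, `minimal`, and `patience`).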

~~~
fwip
I prefer using 'patience' most of the time, but it is slow on very large
files.

~~~
hinkley
Seems histogram is a fork of patience. What do you prefer about it?

~~~
fwip
Interesting! I'll have to revisit the comparison I made a while ago; it's
possible I read the results wrong and came away with the wrong conclusion.
Thanks!

