Hacker News
How different are different diff algorithms in Git? (springer.com)
152 points by aw1621107 13 days ago | 38 comments

From the section about how the qualitative difference between the two algorithms was found:

"In the second step, we conducted a manual comparison between two diff outputs produced by Myers and Histogram algorithms from all files in the sample. The first two authors of this paper were involved to independently annotate the diff outputs that makes the result is expected to be more reliable. ... The comparison results between two authors from 377 files were subsequently computed to find the kappa agreement. We obtained 70.82%, which is categorized into ‘substantial agreement’ (Viera and Garrett 2005). This means, the statistic result of our manual study is acceptable."

Even though I'm inclined to agree with the example given in the paper and a lot of work clearly went into the qualitative evaluation, this feels like a very weak way to perform a qualitative analysis. Specifically:

- this is a sample size of two academic authors who chose to write a paper together about the quality of different diffing algorithms, i.e., a very skewed and small sample.

- there is no mention of any blinding in the labeling process, so preconceptions about the quality of the different diff algorithms may have influenced the qualitative grading -- or they may not have! We don't even know.

- there does not seem to be a clear mention of how the representative sample was chosen, or of what factors were taken into consideration for determining a representative sample of changes, so that reviewers/other researchers could potentially make different choices in the future and draw informed comparisons with this work.

To sum up: in my admittedly not at all authoritative opinion this portion of the paper cannot conclude more than something like, "further study is warranted on this topic, with a far better controlled and far larger sample size, and clearer explications of the methodological choices".

Regardless of that, it was an interesting read and not something previously on my radar as worth experimenting with at all! Kudos to the authors for drawing attention to it and for the other more quantitative aspects of the paper (which I examined less and charitably assume are top notch).

Usability of diffs, to do actual useful tasks, is an excellent research topic. But a very difficult area.

To add my subjective experiment: I just compared Myers vs. histogram on my latest commit.

- Myers presented a function I'd 90% gutted as the same function, edited (the rest was moved in the file, so no LCS algorithm could find it), much like word-diff often does. I thought this was clever.

- histogram presented it as one function deleted, and a completely new function added. This was cleaner.

But I'm not even sure which is more usable. Might even vary with the specific task, e.g. function evolution vs function readability. Difficult area!

> Usability of diffs, to do actual useful tasks, is an excellent research topic. But a very difficult area.

I use the Linux command-line `diff` to implement a poor man's database.

An online file store has files added daily that I need to download but not duplicate, so I grab the list off the site, save it locally, and run `diff local_downloaded_files current_list_on_site | grep '>' | tr -d '>'`. The '>' symbol indicates files I haven't downloaded yet; the greater-than is also the operation I need to perform to make my local_downloaded_files the same as the list on the site. So I download all the files indicated by the diff, add their names to the local_downloaded_files file, and voila, up-to-date downloads.
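A minimal sketch of that workflow, with hypothetical file names and a `sed` instead of `tr` so the leading "> " marker is stripped cleanly:

```shell
# Sketch of the "poor man's database": diff two sorted lists of file
# names to find what hasn't been downloaded yet.
printf 'a.txt\nb.txt\n' > local_downloaded_files        # already fetched
printf 'a.txt\nb.txt\nc.txt\n' > current_list_on_site   # latest site list

# Lines prefixed with '>' appear only in the site's list, i.e. files
# not yet downloaded; strip the marker to get bare names.
diff local_downloaded_files current_list_on_site | grep '^>' | sed 's/^> //'
```

After downloading, appending the new names to local_downloaded_files brings the two lists back in sync.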

Which part is here a database?

I think they mean a database in the general sense and you mean an RDBMS.

I wrote https://github.com/amb007/persistent-buffer hoping to use the actual editing history for precise diffing. I know that projectional/structural editors do such things, not sure if for diffing.

Diff is reverse engineering a many-to-one function - many possible insert/delete sequences applied to string X map to the same string Y.

What would it look like to store files natively as insert/delete sequences instead? So instead of filesystems and diffs on top, we could have DIFFsystems and files on top. Kind of like a WAL. Files would be checkpoints in the WAL for efficiency, and diffs would be 100% accurate between two checkpoints. Probably takes a hell of a lot more space & CPU though.
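A toy sketch of the idea (not a real system - all names here are invented for illustration): keep a full-text checkpoint plus a log of insert/delete edits, and materialize any version by replaying the log.

```python
# Illustrative sketch: a file stored as a checkpoint plus a "WAL" of
# insert/delete edits. Any past version is reconstructed exactly by
# replaying a prefix of the log, so diffs between versions are 100%
# accurate rather than reverse-engineered.

def apply_op(text, op):
    """Apply a single edit op to a string."""
    kind, pos, payload = op
    if kind == "insert":            # insert string `payload` at index `pos`
        return text[:pos] + payload + text[pos:]
    elif kind == "delete":          # delete `payload` chars starting at `pos`
        return text[:pos] + text[pos + payload:]
    raise ValueError(kind)

class EditLog:
    def __init__(self, checkpoint=""):
        self.checkpoint = checkpoint   # full text at version 0
        self.ops = []                  # edits recorded since the checkpoint

    def record(self, op):
        self.ops.append(op)

    def materialize(self, version=None):
        """Replay ops up to `version` on top of the checkpoint."""
        text = self.checkpoint
        for op in self.ops[:version]:
            text = apply_op(text, op)
        return text

log = EditLog("x=1")
log.record(("delete", 2, 1))    # remove the '1'
log.record(("insert", 2, "2"))  # write '2' in its place
assert log.materialize() == "x=2"
assert log.materialize(0) == "x=1"   # any historical version, exactly
```

The space/CPU concern shows up immediately: every version costs a replay from the nearest checkpoint, which is exactly the checkpoint-spacing tradeoff a WAL makes.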

This reminds me of codeq, a clojure+datomic project that intended to move source control from lines of text to functions and expressions. From their introduction [0]:


  Programmer Sally: "So, what are you going to do today Bob?"
  Programmer Bob: "I'm not happy with the file baz.clj residing in my/ns. So I'm going to go to line 96 and change 2 to 42. I've been thinking about deleting line 124. If I have time, I'm also going to insert some text I've been working on at line 64."
  Programmer Sally: (what's wrong with Bob?)

  Short Story

  codeq ( 'co-deck') is a little application that imports your Git repositories into a Datomic database, then performs language-aware analysis on them, extending the Git model down from the file to the code quantum (codeq) level, and up across repos. By doing so, codeq allows you to:
  - Track change at the program unit level (e.g. function and method definitions)
  - Query your programs and libraries declaratively, with the same cognitive units and names you use while programming
  - Query across repos
I never got to use it, though, and it seems the repo [1] has had no updates in the last six years.

[0] https://blog.datomic.com/2012/10/codeq.html [1] https://github.com/Datomic/codeq

I get your point. Maybe clearer to say "inverting a many-to-one function", i.e. there is not a unique well-defined patch that corresponds to the change; there can be many such patches that produce the same change. Cf. the ill-posed inverse problem: https://en.wikipedia.org/wiki/Well-posed_problem

But, even more subtly: some changes also aren't truly insert/delete edits to strings. E.g. suppose the old state was the string "x=1" and the new state is the string "x=2". If your system tracks string edits, maybe it encodes this change as "delete char at index 2, append char '2' after index 1". But perhaps the user intent behind this change is actually <increment the value stored in x by 1>. So reasoning about the patch at the string level misses the point completely.

This difference doesn't really matter if you are just diffing changes from a single user, but it could matter an awful lot when trying to merge overlapping changes from multiple users! E.g. perhaps user A replaced "x=1" with "x=2" and user B also replaced "x=1" with "x=2". Depending on what the intent of both users is, perhaps the correct merged result is "x=2" or "x=3" or something entirely different!

That sounds cool but also a muuuch more complex problem that merge isn't expected to solve. If merging two feature branches generated actual code that understood intent and semantically combined both features, it would be close to an AI that knows how to program. I think we know diff/merge are expected to operate on character sequences not code semantics. Merging x=1 and x=1 just gets you x=1 in any VCS. And they work just as well on files that aren't code and might not have any semantics at all.

Interestingly, what you want breaks an invariant I would expect traditional merge/diff to have:

diff(X, Y) = 0 => merge(X, Y) = X = Y

So what you're proposing would be a different pair of ops altogether, say sdiff and smerge.

> If merging two feature branches generated actual code that understood intent and semantically combined both features, it would be close to an AI that knows how to program.

Agree that understanding user intent is not really feasible, unless there is a way to capture the intent of the user in some clear, machine-computable way.

But, less ambitiously, take the example of merging code for a popular programming language X: the language should have some well-defined grammar, etc. Maybe there's even an implementation of X's grammar in a language server we can call. It shouldn't be a blue-sky AI R&D project to build a custom merge tool that can do a three-way merge of code in language X using the extra information of X's grammar, or feedback from a language server. We can use the extra info to reject proposed merges that obviously don't even correspond to anything valid in X's grammar -- just prune the search space down to the space of possible merges that make some kind of basic sense from an X-language perspective. Often there's likely more than one possible merge that is also valid in an X-language context; we can show all these valid merges to the user and ask the user to pick the one they intended. But we can reject the other 99% of possible merges that are valid text-level merges yet break obvious, mechanically testable invariants of the X programming language / X's grammar.
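The pruning step is small enough to sketch. Here Python's own grammar stands in for "language X" (via the stdlib `ast` module), and the candidate list is a made-up example of what a text-level merge might propose:

```python
# Sketch of grammar-based pruning of candidate merges: keep only
# candidates that parse under the language's grammar, and let a human
# pick among the survivors.
import ast

def syntactically_valid(source):
    """True if `source` parses under Python's grammar."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Two users edited the same line; a dumb text-level merge might
# propose any of these (hypothetical) candidates:
candidates = [
    "x = 2\n",           # user A's version
    "x = 3\n",           # user B's version
    "x = 2\nx = 3\n",    # both lines kept
    "x = 2x = 3\n",      # lines fused together: garbage
]

valid = [c for c in candidates if syntactically_valid(c)]
# The fused candidate fails to parse and is pruned automatically;
# the remaining three are escalated to the user to choose from.
assert "x = 2x = 3\n" not in valid
assert len(valid) == 3
```

This doesn't recover intent, but it mechanically removes the merges that can't possibly be what anyone meant.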

In practice, suppose you want to do a merge of files that have more structure than plain text, e.g. code belonging to a specific programming language, JSON, some data format for an application, etc. This all has a lot more structure than just plain text. You can do a better job of diffing and merging if your difftool and mergetool are aware of the structure of your file. The text-based tooling that git offers is to some extent a lowest common denominator; in principle you can build custom tooling for your format / context that does a better job.

One extension point that git offers for this is the "merge driver": you can define an external script/application that will be called by git whenever a merge conflict needs to be resolved for some particular file (based on path or a pattern).

Here's an older blog post describing a custom git merge driver for merging data files in a game engine: http://bitsquid.blogspot.com/2010/06/avoiding-content-locks-... In this gamedev context of merging game data files, it was less important to produce the "correct" merge result than it was to produce some result that had a valid file format. Less technical users could then fix up bad mergetool decisions in an editor with a UI instead of trying to resolve the merge conflicts at the level of the raw serialisation format itself (which could corrupt the data file and make it not possible to load into the editor).

In other situations where there is a large cost to automatically producing the wrong merge result, it would be a better tradeoff to "halt the line" if there is ambiguity about how a merge should be resolved, and escalate to a human to decide what to do.

Further reading:

- How to wire a custom merge driver into git: https://git-scm.com/docs/gitattributes#_defining_a_custom_me...

- What values you can pass to a merge driver on the command line: https://github.com/git/git/blob/f1d4a28250629ae469fc5dd59ab8...

- Simple example of the plumbing to wire in a merge driver, with a trivial dumb driver script: https://github.com/Praqma/git-merge-driver
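The wiring itself is just two pieces of config, per the gitattributes docs linked above. The driver and tool names here are placeholders; run inside a repo:

```shell
# Minimal wiring for a custom merge driver (names are placeholders).

# 1. Declare the driver. In the command git runs, %O is the ancestor
#    version, %A is the current version (and the file the result must
#    be written to), and %B is the other branch's version.
git config merge.mydriver.name "example custom merge driver"
git config merge.mydriver.driver "my-merge-tool %O %A %B"

# 2. Route files to it in .gitattributes, by path or pattern:
echo "*.gamedata merge=mydriver" >> .gitattributes
```

From then on, any conflicting merge touching `*.gamedata` files invokes `my-merge-tool` instead of git's built-in text merge.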

I use git GUIs a lot though (mostly GitKraken and VS’ built-in Git UI) - they aren’t all as-customizable though :/

Most GUI tools execute git under the hood, so if the repo config is set, they might pick it up.

Yes, but they sometimes have their own private copy of the git executables that can be out of date or with their own peculiarities, such as using their own git-config that the default command-line git in your PATH won't be using.

Not all? I'd find it scary to use a tool that implements its own Git operations instead of relying on Git itself.

I also researched the best diff algorithm. Google's diff-match-patch [1] library produces very good diffs, for example. But I found that the best diffs are produced by wikidiff2 [2], the MediaWiki diff engine. Both engines produce word-by-word diffs.

[1]: https://github.com/google/diff-match-patch

[2]: https://www.mediawiki.org/wiki/Wikidiff2
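Neither of those engines is in the standard library, but the word-by-word idea itself is easy to illustrate with Python's `difflib` (a different algorithm from both - this only shows what word granularity, as opposed to line granularity, looks like):

```python
# Word-level diffing sketch: split into words first, then diff the
# word sequences. Not diff-match-patch or wikidiff2 themselves.
import difflib

def word_diff(a, b):
    """Return (marker, words) pairs from a word-granularity diff."""
    aw, bw = a.split(), b.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, aw, bw).get_opcodes():
        if op == "equal":
            out.append(("=", aw[i1:i2]))
        elif op in ("replace", "delete"):
            out.append(("-", aw[i1:i2]))
        if op in ("replace", "insert"):
            out.append(("+", bw[j1:j2]))
    return out

old = "the quick brown fox"
new = "the slow brown fox"
assert word_diff(old, new) == [("=", ["the"]), ("-", ["quick"]),
                               ("+", ["slow"]), ("=", ["brown", "fox"])]
```

A line-level diff would report the whole line as replaced; the word-level view isolates the single changed word.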

I wonder if these could be used as custom git diff drivers.

Maybe. I've already used another diff engine, WikiWho [1], to do my own implementation of git blame [2] (shameless self-promotion).

[1]: https://github.com/wikiwho/WikiWho

[2]: https://pypi.org/project/git-word-blame/

Are there any other articles answering this question? I would like to compare them in my upcoming paper:

"How different are different 'How different are different diff algorithms in Git' articles?"

I hope you submit your paper to Sigbovik!

And as always, there is a relevant xkcd: https://xkcd.com/1447/

This article feels very drawn out. Did they have some kind of length requirement they needed to fill?

The diff algorithms seem like a hacky collection of heuristics; the type that's begging to be replaced by a machine learning system. Something that understands the content could do much better than either of these algorithms. I'm not sure what the training data would look like though.

Some "smart" AI trying to guess my intentions and reason about the code is the last thing I want. This will inevitably go wrong, and I will be left with a nonsensical patch, which may not be immediately obvious, because the AI thought that it had figured out what I wanted.

I'd much rather encounter an error and solve it myself. Linus got it right: keep the diffing very simple and obvious, then defer to a human if there is ambiguity.

> I will be left with a nonsensical patch

This is the current state of things, though. Heuristic diff algorithms generate nonsensical patches all the time. They may be technically correct but completely obscure the meaning of the diff.

The simplicity of the diff algorithm does not help me understand a bad diff. And these diff algorithms are already not that simple. A machine learning approach could still guarantee technical correctness of the diff, but give patches that make sense more often.

Even if a state of the art machine learning system is going to work 99.999% of the time - with millions of diffs run daily - that's still too many failures.

So basically with diff we had 1 problem, with ML-diff we'll have two.

Just because a system uses ML doesn't mean it has to be less reliable, especially when it replaces a heuristic that's already unreliable. The correctness of the patch can be easily verified, and the reasonableness of the patch can also be ensured with some heuristics that would be much easier to make than a whole diff algorithm.

There seems to be a serious bias against ML in systems programming that is unwarranted. There are plenty of problems using bad heuristics today that could be improved with ML in a way that doesn't reduce reliability. Scheduling, compiler optimization, stuff like that.

TL;DR Use the Histogram diff algorithm in Git

I prefer using 'patience' most of the time, but it is slow on very large files.

Seems histogram is a fork of patience. What do you prefer about it?

Interesting! I'll have to revisit the comparison I made a while ago; it's possible I read the results wrong and came away with the wrong conclusion. Thanks!

Does someone know if it's possible to activate it in GitLab's web diff of a merge request?

Mhm. I would need to dive into our implementation. I don’t know whether we reimplement diffing algorithms in GitLab or rely on git internals here. [0]

I am on my phone, so I cannot check, but my two assumptions would be: we might use the default, so that people see a match to what they'd get diffing on their own computers.

And if it is configurable, probably just on an instance level, so you'd need to run GitLab yourself and not use the dot-com.

Again just assumptions, will double check tomorrow.

[0]: https://docs.gitlab.com/ee/administration/gitaly/

Checked in with the Gitaly team and we are using the default. I have created an issue about exploring support for different algorithms:


Thanks a lot for the issue.


`git config --global diff.algorithm histogram`
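If you'd rather not change the global default, git also accepts the algorithm per invocation (run inside any repo):

```shell
# Per-invocation alternatives to the global config setting.
git diff --histogram                     # shorthand flag
git diff --diff-algorithm=patience       # explicit form; also accepts
                                         # myers, minimal, histogram
git log -p --diff-algorithm=histogram    # works for any diff-producing command
```

This makes it easy to compare the algorithms' output on the same commit side by side.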

With pretty compelling examples showing the Myers (default) behavior that I've encountered many times. Nice.

Yeah, a fairly accurate TL;DR.
