Hacker News new | past | comments | ask | show | jobs | submit login
The Myers diff algorithm that is used in Git (2017) (jcoglan.com)
122 points by dgudkov on May 31, 2020 | hide | past | favorite | 15 comments



By the way, git also comes with other diff algorithms which may give cleaner diffs than the default one

https://gist.github.com/roryokane/6f9061d3a60c1ba41237

Those alternative algorithms can be requested with command line flags like "git diff --histogram" or "git diff --patience". The default algorithm can also be changed via the git config file.


Patience in this example looks a lot better for what I consider a good diff to do a code review with. Is there any settings you could toggle in GitHub Enterprise you could enable this diffing in the UI?


Note that the same user also has a gist showing a case where myers gives a better result: https://gist.github.com/roryokane/79b8ebcb3813ebd934c4


Patience diff cares about lines that only appear once in each file. The idea behind that heuristic is that lines that only appear once are more likely to represent the parts of the document that don't belong to the diff. Conversely, if a line appears more than once in the file then it is more likely to be something like a blank line or a " }", where a match in the two documents might be just a coincidence. That particular example is one where the heuristic doesn't work well. The big block of abcs doesn't have any unique lines so patience diff doesn't see much cost in including it as part of the diff.

OP's blog has another series of posts about patience diff. They are also an interesting read: https://blog.jcoglan.com/2017/09/19/the-patience-diff-algori...


Feels like there ought to be a cute scoring algorithm to determine which to use. It is a pretty important UX to get right IMO.

I reckon it'd be nice to include which to use in the commit too.


It'd be even nicer if git could figure it out.

From those two examples only, looks like fewer hunks are better, but so are smaller hunks, at least when the number of hunks are equal.


I hazard a guess that there doesn't exist a simple heuristic (such as "number of chunks") that always does the right thing. I think that to achieve better diffs, the diff tool should have some understanding over the structural properties of the language – but that would, of course, mean that the diff tool should be different for each language. (Or more like, there should be a parsing frontend that parses a "labeled tree" structure, and then a generic diffing algorithm over those kinds of trees.)


git's diff already has language-specific configurability, for example xfuncname sets a "function declaration" or similar header (e.g. markdown heading, org-mode section) that is shown in a diff.

Similarly, one could imagine a language-specific regexp that gives "anchor points" such as declaration start, where one would prefer that a single diff hunk doesn't mutate things across that anchor, if it doesn't have to.


The one thing I really miss from perforce/source depot was seeing the original code block when performing a merge. Its so much more difficult to merge just seeing both new sections.


For diffs that include the common ancestor, add this to .gitconfig:

  [merge]
  conflictStyle = diff3


Git can call a whole bunch of merge visualization and conflict resolution tools. See `git mergetool --tool-help`.


I'm currently going through his book, Building Git, and doing it with Rust. It's a lot of effort but it's very good.


for diff algorithims in Rust, see https://nest.pijul.com/pijul_org/pijul:master/6afda19ba69307...

for a Rust diff library crates, see https://github.com/changeutils


hey I got this 'edit distance' question in my FB phone screen last month

https://leetcode.com/problems/edit-distance/

graph search + dynamic programming to pass the phone screen.

Ofcourse the consideration here just the smallest number of edits.


I always wonder how it was done.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: