
Making a good diff algorithm - austincheney
http://prettydiff.com/guide/unrelated_diff.xhtml
======
kccqzy
I've always wanted to explore the topic of diff in depth, but so far haven't
been able to. I have some links that I've stashed; hopefully they will be
useful.

I believe the classic for typical textual diff is this article by Myers, whose
algorithm is still the default in git:
[http://link.springer.com/article/10.1007/BF01840446](http://link.springer.com/article/10.1007/BF01840446)

Git has two other diff algorithms, patience and histogram:
[http://alfedenzo.livejournal.com/170301.html](http://alfedenzo.livejournal.com/170301.html)
[https://github.com/git/git/commit/8c912eea94a2138e8bc608f7c3...](https://github.com/git/git/commit/8c912eea94a2138e8bc608f7c390eb0b313effb0)

For binary/executable code, I believe Colin Percival's bsdiff is the best:
[http://www.daemonology.net/bsdiff/](http://www.daemonology.net/bsdiff/)
although he hints that his thesis contains a better algorithm.

For just executables, however, I think Google Chrome uses Courgette, which
actually performs disassembly first:
[https://www.chromium.org/developers/design-documents/softwar...](https://www.chromium.org/developers/design-documents/software-updates-courgette)

Also useful is libxdiff, which is a C library offering various diff utilities:
[http://www.xmailserver.org/xdiff-lib.html](http://www.xmailserver.org/xdiff-lib.html)

~~~
jessriedel
Does anyone know much about prose diff? Very few diff apps work well with
prose, and I wonder to what extent this is driven at the level of the
algorithm.

~~~
nolemurs
> I wonder to what extent this is driven at the level of the algorithm.

Almost entirely.

Most programming diff algorithms operate on a line-by-line basis. That means
that a single character change in a line marks the whole line as changed. For
prose, a single 'line' is usually a paragraph, so it's pretty obvious why they
don't work well. You might check out GNU `wdiff` for what a word-by-word diff
could look like. I haven't really looked into the area deeply, so I don't know
what the state of the art is.
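
For a concrete picture, a word-by-word diff can be sketched by tokenizing on whitespace and running the usual LCS dynamic program over words instead of lines. This is an illustration only (with wdiff-style markup), not GNU wdiff's actual implementation:

```javascript
// Illustrative sketch only (not GNU wdiff's actual implementation):
// tokenize into words, then run the standard LCS dynamic program over
// word tokens instead of whole lines.
function wordDiff(oldText, newText) {
    const a = oldText.split(/\s+/).filter(Boolean);
    const b = newText.split(/\s+/).filter(Boolean);
    const m = a.length, n = b.length;
    // lcs[i][j] = length of the LCS of a[i..] and b[j..]
    const lcs = Array.from({length: m + 1}, () => new Array(n + 1).fill(0));
    for (let i = m - 1; i >= 0; i--) {
        for (let j = n - 1; j >= 0; j--) {
            lcs[i][j] = a[i] === b[j]
                ? lcs[i + 1][j + 1] + 1
                : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
        }
    }
    // Walk the table, emitting wdiff-style [-deleted-] / {+inserted+} markup.
    const out = [];
    let i = 0, j = 0;
    while (i < m && j < n) {
        if (a[i] === b[j]) { out.push(a[i]); i += 1; j += 1; }
        else if (lcs[i + 1][j] >= lcs[i][j + 1]) { out.push(`[-${a[i]}-]`); i += 1; }
        else { out.push(`{+${b[j]}+}`); j += 1; }
    }
    while (i < m) { out.push(`[-${a[i]}-]`); i += 1; }
    while (j < n) { out.push(`{+${b[j]}+}`); j += 1; }
    return out.join(' ');
}
```

Because a one-word change only marks that word, a paragraph-per-line prose file no longer lights up wholesale.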

~~~
rwmj
This makes me wonder if anyone has implemented diff by comparing the Abstract
Syntax Trees of the two code-bases. I guess it's probably easier than regular
diff (assuming you've already got the ASTs from somewhere).

~~~
nolemurs
I would think diffing trees would be necessarily more complex than diffing
lines - you can think of the line-by-line description of a file as a tree with
at most one child per node, so any algorithm for diffing trees should also
automatically be able to diff lines.

Tree diff algorithms definitely exist (I know React.js uses diffs on a virtual
DOM to minimize operations), but I'm not sure what the state of the art is for
those.

~~~
Robin_Message
Tree diff varies between O(N^2) and O(N^4) for simple solutions, depending on
the complexity of matches allowed, with complex fully general algorithms
coming in at O(N^2 log^2 N).

Tree diff is harder because the range of operations is bigger.

React.js cheats by insisting on keys for arrays so matching is much easier.
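
A toy sketch of why keys make the matching cheap (my illustration, not React's actual reconciler): with a unique key per child, matching old children to new ones is a couple of map/set lookups instead of a tree-edit computation.

```javascript
// Toy sketch, not React's actual reconciler: with a unique key per child,
// matching old children to new ones is a couple of map/set lookups
// instead of a tree-edit computation.
function keyedDiff(oldChildren, newChildren) {
    const oldByKey = new Map(oldChildren.map(c => [c.key, c]));
    const newKeys = new Set(newChildren.map(c => c.key));
    const ops = [];
    for (const c of oldChildren) {
        if (!newKeys.has(c.key)) ops.push({op: 'remove', key: c.key});
    }
    newChildren.forEach((c, index) => {
        const old = oldByKey.get(c.key);
        if (!old) {
            ops.push({op: 'insert', key: c.key, index});
        } else if (old.value !== c.value) {
            ops.push({op: 'update', key: c.key, index});
        }
        // A real reconciler would also emit moves for reordered keys.
    });
    return ops;
}
```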

------
paulmd
For most reasonable quantities of data, diff speed is Good Enough as it is,
and it's certainly not worth trading off accuracy for speed. In fact if
anything we should be going the other direction - Patience Diff is not quite
as fast as regular diff, but the output is much higher-quality in my
experience, particularly with non-trivial divergence between
codebases.

[http://git.661346.n2.nabble.com/Bram-Cohen-speaks-up-about-p...](http://git.661346.n2.nabble.com/Bram-Cohen-speaks-up-about-patience-diff-td2277041.html)

~~~
unhammer
What if you have unreasonable quantities of data? I've as yet not come across
a really good program that lets me do `bigdiff <(xzcat bigresult-old.xz)
<(xzcat bigresult-new.xz)|less` (where the files are gigabytes of text with
fairly few differences) in a reasonable amount of time/memory. I've used hacks
that only work on a line-by-line basis (or use some hardcoded marker in the
input) to try to read both files in parallel and run a real diff on a
subsection when seeing a difference between the markers, but it's far from
trivial getting it to work well (and I unfortunately don't have time to shave
that yak :/)

~~~
Arkanosis
I always have an old version of the source code of Solaris' bdiff with me
([https://github.com/Arkanosis/Arkonf/blob/master/tools-src/bd...](https://github.com/Arkanosis/Arkonf/blob/master/tools-src/bdiff.c)),
just in case. It might have changed in the meantime in
OpenIndiana / Illumos.

It was a very significant improvement in speed a few years ago — though with
time I've gotten more RAM faster than bigger files to run diff on, and I
haven't had any difficulty with the regular Linux diff for a long time.

~~~
unhammer
Wow, zero memory usage and immediate output on files where GNU diff just sits
there eating memory until everything is read! Thanks, that's fantastic.

------
jfoutz
I might be wildly misreading the code, but it seems like [1,2,3,a,b,c] diffed
with [a,b,c,1,2,3] is going to appear as six lines of replacement.

The table will be {a :(1,1), b :(1,1), ...} because each line appears in both
files exactly once.

    
    
    do {
        c = a;
        d = b;
        // first pass a=0, b=0
        // the lines aren't equal, they're "1" and "a" so we pass this case.
        if (one[a] === two[b]) {
            equality();
        } else {
            //one[0] is "1", matching 1:(1,1)
            //two[0] is "a", matching a:(1,1)
            if (table[one[a]].two < 1 && table[two[b]].one < 1) {
                //can't get here
                replaceUniques();
            } else if (table[one[a]].two < 1 && one[a + 1] !== two[b + 2]) {
                // 1:(1,1) cdr not less than 1, so this is out
                deletion();
            } else if (table[two[b]].one < 1 && one[a + 2] !== two[b + 1]) {
                // a:(1,1) car not less than 1, so this is out
                insertion();
            } else if (table[one[a]].one - table[one[a]].two === 1 && one[a + 1] !== two[b + 2]) {
                // 0 === one[a+1] !== two[b+2]
                //   === "2" !== "c"
                //   === true
                deletionStatic();
            } else if (table[two[b]].two - table[two[b]].one === 1 && one[a + 2] !== two[b + 1]) {
                // 0 === one[a+2] !== two[b+1]
                //   === "3" !== "b"
                //   === true
                insertionStatic();
            } else {
                // so we're stuck replacing.
                replacement();
            }
        }
        a += 1;
        b += 1;
    } while (a < lena && b < lenb);
    
    

it's very pretty, it just seems like it's not doing enough lookahead to find
those long distance relationships.

(although i didn't run it, i could have screwed up my interpretation)

in my limited experience, diff is very hard.

 _edit_

fixed a couple typos in the comments.

yeah, three deletes and three inserts, showing at least 3 lines of equality,
is preferable. ideally it'd show a single edit, a move of 3 lines, but that's
really hard to do.

~~~
austincheney
> I might be wildly misreading the code, but it seems like [1,2,3,a,b,c]
> diffed with [a,b,c,1,2,3] is going to appear as six lines of replacement.

A bit of a pickle.

The challenge I experienced with this is that either there are 6 changes (line
for line comparison) or there are 3 lines the same and three lines moved, but
three lines moved means a deletion of 3 lines from the first sample and 3
lines of insertion later in the second sample, which is still 6 differences.
The output reads very differently, but the number of differences is identical,
which is no change in precision.

~~~
zzazzdsa
Unless my understanding of things is wrong, I'm pretty sure that unless the
strong exponential time hypothesis (essentially, n variable CNF-SAT takes
O(2^n) time in the worst case) is false, no diff algorithm can run in
subquadratic time. If you could, then you would be able to solve longest
common subsequence in subquadratic time, and that in turn implies SETH being
false by [https://arxiv.org/abs/1501.07053](https://arxiv.org/abs/1501.07053).

I'm pretty sure exact diff in subquadratic time is therefore impossible. It's
still a nice heuristic though.

~~~
starikovskaya
Minor remark: If the edit distance between two n-length strings is
(relatively) small, it is possible to find it in subquadratic time. More
precisely, if the edit distance is at most d, you can solve the problem in
O(nd) time. The upper bound d does not have to be known in advance. The
conditional lower bound you mentioned holds only for large values of edit
distance.
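
The banding idea behind that O(nd) bound can be sketched as follows (a rough illustration in the spirit of Ukkonen's algorithm, not a faithful reproduction): if the distance is at most d, every cell on an optimal path satisfies |i - j| <= d, so only that diagonal band needs to be filled, and doubling d until the answer fits inside the band removes the need to know d in advance.

```javascript
// Rough illustration of the banding idea (in the spirit of Ukkonen 1985,
// not a faithful reproduction): if the edit distance is at most d, every
// cell on an optimal path satisfies |i - j| <= d, so only that diagonal
// band needs to be filled. Doubling d until the answer fits gives O(n*d).
function bandedDistance(a, b, d) {
    const INF = Number.MAX_SAFE_INTEGER;
    let prev = null;
    for (let i = 0; i <= a.length; i++) {
        const cur = new Array(b.length + 1).fill(INF);
        const lo = Math.max(0, i - d), hi = Math.min(b.length, i + d);
        for (let j = lo; j <= hi; j++) {
            if (i === 0) cur[j] = j;
            else if (j === 0) cur[j] = i;
            else if (a[i - 1] === b[j - 1]) cur[j] = prev[j - 1];
            else cur[j] = 1 + Math.min(prev[j - 1], prev[j], cur[j - 1]);
        }
        prev = cur;
    }
    return prev[b.length]; // stays INF if the end lies outside the band
}

function editDistanceSmall(a, b) {
    // The banded result is exact whenever the true distance is <= d.
    for (let d = 1; ; d *= 2) {
        const dist = bandedDistance(a, b, d);
        if (dist <= d) return dist;
    }
}
```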

------
amluto
Hmm, interesting. The classic dynamic programming edit distance algorithm uses
linear space (with a far better constant factor than this algorithm) and
quadratic time. You build a table c where c[i][j] is the edit distance from
the first i characters of file 1 to the first j characters of file 2. (You
don't need to store the whole table.) There's a cute trick to get the actual
diff out without using more space (by rerunning the algorithm a few times on
portions of the input). To get below quadratic time on sensible input, IIRC,
the standard approach is to think of the problem a bit like Dijkstra's
algorithm and fill in the table in order by _value_, smallest edits first.

I imagine you get to linear time without much loss by heuristically throwing
out chunks of the table where i is far from j.
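
The basic table described above, with the two-row linear-space trick, looks roughly like this (a minimal sketch of Wagner-Fischer; the diff-recovery rerun is omitted):

```javascript
// Minimal sketch of the classic table (Wagner-Fischer) with the two-row
// linear-space trick; the diff-recovery rerun described above is omitted.
// After row i, prev[j] is the edit distance from the first i characters
// of `one` to the first j characters of `two`.
function editDistance(one, two) {
    let prev = Array.from({length: two.length + 1}, (_, j) => j);
    for (let i = 1; i <= one.length; i++) {
        const cur = [i]; // distance from first i chars of `one` to ""
        for (let j = 1; j <= two.length; j++) {
            cur[j] = one[i - 1] === two[j - 1]
                ? prev[j - 1]                  // match: carry the diagonal
                : 1 + Math.min(prev[j - 1],    // substitution
                               prev[j],        // deletion from `one`
                               cur[j - 1]);    // insertion into `one`
        }
        prev = cur;
    }
    return prev[two.length];
}
```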

The article raises an interesting question: can you improve this by taking
advantage of the fact that lines of code have a decent chance of being unique
or nearly unique within a file?

~~~
hairtuq
Yes, you can. There's an algorithm running in O((r+l) log l) time, where l is
the number of lines and r is the total number of ordered pairs of line numbers
at which the two files match [1]. So if all lines are unique, this runs in O(l
log l).

[1] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest
common subsequences. Communications of the ACM, 20(5):350–353, 1977.

~~~
DannyBee
As the paper itself says, this is still N^2 log N worst case :)

Note that these are all variants of the general algorithm I posted, and more
generally, variants of tricks used in boyer-moore (though hunt's work predates
boyer-moore, that's the easiest way to describe it), which means they try to
skip parts of the text they can prove can't match.

Because they can't _always_ do so, they don't change the worst case time
bound, only various _other_ time bounds.

------
DannyBee
It's nice to look at, but it doesn't actually work that well, because it has
no real way of stream alignment other than simple equality (stream alignment
== figuring out the points where two streams become equal. This is what the
edit distance calculation gives most diff algorithms)

So various forms of lines that have been moved will always be shown as
replacements, for example (I see someone else discovered this).

Most of the cost and complexity of existing diff algorithms is essentially
stream alignment (that's what the dynamic programming problem they solve
tells them), so yes, removing stream alignment will in fact make your diff
algorithm "simple" :)

The fix function is also just a hack for no real way to align streams on a
per-character basis, so it has no way of, for example, maximally extending
matches until it's done.

You could avoid a lot of what it does by not using lines as your basis, but
characters instead.

The direct translation here would be:

1\. build segments by starting at the beginning of both files, and
incrementing one file until they match, and then incrementing the other until
they stop matching (this gives you one or two segments: the mismatch segment,
which may be empty but represents the part where they don't match, and the
matched segment, which will be maximal instead of line-split).

This is O(min of the two file sizes), just like the current hash building.

You do have to keep track of the line number, but it doesn't change the time
bounds.

2\. hash/index the segments the same way you are hashing/indexing lines.

3\. do rest of described algorithm.

the only upside/downside is you end up with partially changed lines (i.e. there
may be more than one segment per line), but here you can detect that
immediately (you can even just mark where this has happened while you build
segments) and transform to replacements if that's what you want instead of
trying to notice later that this is what happened.
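
One possible reading of step 1 (my interpretation of the description, not the author's code) walks both inputs in lockstep, emitting alternating mismatch and maximal match segments. Note this naive version only resyncs when the mismatched runs happen to be equal length, which is exactly the stream-alignment problem this comment is about:

```javascript
// One possible reading of step 1 above (my interpretation, not the
// author's code): walk both inputs in lockstep, emitting a (possibly
// empty) mismatch segment followed by a maximal match segment. This
// naive version only resyncs when the mismatched runs are equal length,
// which is precisely the stream-alignment problem discussed above.
function segments(one, two) {
    const out = [];
    let a = 0, b = 0;
    while (a < one.length && b < two.length) {
        // Advance through the mismatch until the current characters agree.
        const startA = a, startB = b;
        while (a < one.length && b < two.length && one[a] !== two[b]) {
            a += 1;
            b += 1;
        }
        if (a > startA) {
            out.push({type: 'mismatch',
                      one: one.slice(startA, a),
                      two: two.slice(startB, b)});
        }
        // Extend the match maximally rather than stopping at a line break.
        const matchStart = a;
        while (a < one.length && b < two.length && one[a] === two[b]) {
            a += 1;
            b += 1;
        }
        if (a > matchStart) {
            out.push({type: 'match', text: one.slice(matchStart, a)});
        }
    }
    if (a < one.length || b < two.length) {
        out.push({type: 'mismatch', one: one.slice(a), two: two.slice(b)});
    }
    return out;
}
```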

IF you wanted super simple, but more accurate than this, you could do:
[https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm](https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)

Split file2 into x sized blocks. Use it as patterns you search against file1.
Simple, and same worst case as naive edit distance.

Note that if you had a good enough rolling hash, you could do the same thing
in O(N) time, by not doing the equality check in the hashtable, and instead
just issuing replacement if it turned out you were wrong :)
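
The block-matching idea can be sketched like this (assumptions: a fixed block size x and a toy polynomial rolling hash; hash hits are verified by direct comparison to guard against collisions, which is the equality check the last paragraph says you could drop):

```javascript
// Sketch of the block-matching idea (assumptions: a fixed block size x
// and a toy polynomial rolling hash). Hash every x-sized block of file2,
// slide a window across file1, and verify hash hits by direct comparison
// to guard against collisions.
function findBlocks(file1, file2, x) {
    const BASE = 257n, MOD = (1n << 61n) - 1n;
    const hash = (s) => {
        let h = 0n;
        for (const ch of s) h = (h * BASE + BigInt(ch.charCodeAt(0))) % MOD;
        return h;
    };
    // Index every x-sized block of file2 by its hash.
    const blocks = new Map();
    for (let i = 0; i + x <= file2.length; i += x) {
        blocks.set(hash(file2.slice(i, i + x)).toString(), i);
    }
    // BASE^(x-1) mod MOD, for dropping the leading character of the window.
    let pow = 1n;
    for (let k = 1; k < x; k++) pow = (pow * BASE) % MOD;
    const matches = [];
    let h = hash(file1.slice(0, x));
    for (let i = 0; i + x <= file1.length; i++) {
        const key = h.toString();
        if (blocks.has(key)) {
            const j = blocks.get(key);
            if (file1.slice(i, i + x) === file2.slice(j, j + x)) {
                matches.push({inFile1: i, inFile2: j});
            }
        }
        if (i + x < file1.length) {
            // Roll the window: drop file1[i], append file1[i + x].
            const drop = (BigInt(file1.charCodeAt(i)) * pow) % MOD;
            h = ((h - drop + MOD) % MOD * BASE
                 + BigInt(file1.charCodeAt(i + x))) % MOD;
        }
    }
    return matches;
}
```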

~~~
marijn
> It's nice to look at

I don't think it is -- it's full of the kind of features (complicated
condition expressions, long chains of elses, hard-coded numbers) that, when I
find myself writing them, suggest that I'm approaching the problem in the
wrong way, because previous experiences make me associate that kind of code
with lack of generality and corner-case bugs.

~~~
jayajay
I thought this too, at a first glance. There must be a more elegant method.

~~~
Robin_Message
There is:
[https://en.wikipedia.org/wiki/Longest_common_subsequence_pro...](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem)

which is what most diff tools use. It is a simple and clear algorithm with no
special cases once you get your head around it. It works in O(NxM) time, which
seems like a reasonable lower bound (you need to compare everything to
everything else to have a chance of getting the best alignment) although there
are ways to do better with constraints.
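
The dynamic program in question, as a minimal sketch over lines (returning the common lines; everything outside the returned subsequence becomes the diff's deletions and insertions):

```javascript
// Minimal sketch of the O(N*M) LCS dynamic program over lines:
// table[i][j] holds the LCS length of a[i..] and b[j..]; a forward walk
// then recovers the common lines. Everything outside the returned
// subsequence becomes the diff's deletions and insertions.
function lcsLines(a, b) {
    const m = a.length, n = b.length;
    const table = Array.from({length: m + 1}, () => new Array(n + 1).fill(0));
    for (let i = m - 1; i >= 0; i--) {
        for (let j = n - 1; j >= 0; j--) {
            table[i][j] = a[i] === b[j]
                ? table[i + 1][j + 1] + 1
                : Math.max(table[i + 1][j], table[i][j + 1]);
        }
    }
    const common = [];
    let i = 0, j = 0;
    while (i < m && j < n) {
        if (a[i] === b[j]) { common.push(a[i]); i += 1; j += 1; }
        else if (table[i + 1][j] >= table[i][j + 1]) i += 1;
        else j += 1;
    }
    return common;
}
```

On the [1,2,3,a,b,c] vs [a,b,c,1,2,3] example from elsewhere in the thread, this keeps one run of three lines and reports the rest as a delete plus an insert, rather than six replacements.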

(I remember one for gene alignment which broke N and M in two, recurse 4 times
on each pairing, and then had a quick-ish way to put those together again.
Can't remember the details though!)

------
kccqzy
The standard diff only requires one pass through the data. Am I missing
something? I can't understand the value/utility of this, given this is likely
slower and more ad-hoc (likely incorrect) than the standard diff and its
cousins.

~~~
austincheney
The minimum number of data passes is two, one for each submitted sample. The
Pretty Diff algorithm, in the subject of this thread, uses three complete
passes through data without any repetition. Therefore the number of iterations
is simply the sum of the total lines of the first sample, the total lines of
the second sample, and the total lines of the smaller sample.

A more optimized approach is 2 complete passes and a partial third pass
without repetition. Achieving that optimization requires more logic up front
to populate a smaller central analysis store and a different means of
iterating through that container.

Most approaches, as I have seen them, are not fully optimized. While they may
not have complete passes through the data (after the required initial two),
they tend to have numerous smaller passes in order to
derive edit distances. This could be more efficient if these smaller passes
never pass through the same data indexes/keys more than once and achieve a
reduced total number of iterations. These approaches seem less
straightforward to me and are misleading in terms of total statements of
execution/iterations.

The only way to guarantee greater execution efficiency is to run through a
checklist like this and compare clock times in similar execution contexts:

* total number of loop iterations

* fewer instructions

* instruction optimizations (compiler/interpreter dependent considerations)

------
pebblexe
I remember seeing this posted here and I was really impressed by it:
[https://github.com/yinwang0/ydiff](https://github.com/yinwang0/ydiff)

------
dbkaplun
Another useful quality of a diff algorithm is being able to understand and
detect moves. This is not commonly implemented but one real-world example is
[https://www.semanticmerge.com/](https://www.semanticmerge.com/).

------
pjtr
In the first code snippet, while (b > lenb) should be while (b < lenb)?

~~~
austincheney
Typo converting "<" to "&lt;" in preparation for the HTML. Fixed, thanks!

------
dwenzek
This has to be compared to Myers' diff algorithm:

[http://xmailserver.org/diff2.pdf](http://xmailserver.org/diff2.pdf)

------
kinow
If anybody feels like writing some Java code for diff, Apache Commons' new
component, Text, would definitely accept issues/pull requests:

[https://commons.apache.org/proper/commons-text/](https://commons.apache.org/proper/commons-text/)

------
donretag
On the subject of diff, can anyone recommend a resource to visualize the
difference between two ordered lists? This website does visualize things
nicely, but I would love to see a similar article written solely on the visual
aspect.

~~~
austincheney
For a long term solution you may be better served with a domain specific tool
that fits your needs directly. In the meantime the Pretty Diff tool,
[http://prettydiff.com/](http://prettydiff.com/), is language aware and offers
a fair number of features that may provide what you are looking for. To see
the options GUI just scroll down. You can also enter options and values as URI
parameters. Just see the documentation for the options and their requirements.

If you are looking for something that isn't available please open a GitHub
issue and I will triage it.

~~~
donretag
Thanks for the reply. I was thinking more generally about how to display
differences, not so much for code, although that definitely does help. More of
a Tufte treatise on difference.

Working on a tool to calculate search relevancy, and it would be nice to
visually display to users the difference between two queries.

------
vaibhavsagar
I wouldn't describe the logic here as 'minimal'.

------
smoothgrammer
GitHub needs to read this article. I've seen some pretty awful diffs on
GitHub and have had to resort to using CLI diffs, which tend to be superior.
For a company whose bread and butter is code, VCS, commits, PRs, etc., you
would think their diffs wouldn't be scraping the bottom of the barrel.

