
Ask HN: I just wrote an O(N) diffing algorithm – what am I missing? - keithwhor
Hey folks,

I've been building a rendering engine for a code editor the past couple of days. Rendering huge chunks of highlighted syntax can get laggy. It's not worth switching to React at this stage, so I wanted to just write a quick diff algorithm that would selectively update only changed lines.

I found this article:

https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/

With a link to this paper, the basis of Git's original diff implementation:

http://www.xmailserver.org/diff2.pdf

I couldn't find the PDF to start with, but read "edit graph" and immediately thought: why don't I just use a hashtable to store lines from LEFT_TEXT and references to where they are, then iterate over RIGHT_TEXT and return matches one by one, also making sure I keep track of the last match to prevent jumbling?

The algorithm I produced is only a few lines and seems accurate. It's O(N) time complexity, whereas the paper above gives O(ND), where D is the minimum edit distance.

    function lineDiff (left, right) {
      left = left.split('\n');
      right = right.split('\n');
      let lookup = {};
      // Store line numbers from LEFT in a lookup table
      left.forEach(function (line, i) {
        lookup[line] = lookup[line] || [];
        lookup[line].push(i);
      });
      // Last line we matched
      let minLine = -1;
      return right.map(function (line) {
        lookup[line] = lookup[line] || [];
        let lineNumber = -1;
        if (lookup[line].length) {
          lineNumber = lookup[line].shift();
          // Make sure we're looking ahead
          if (lineNumber > minLine) {
            minLine = lineNumber;
          } else {
            lineNumber = -1;
          }
        }
        return {
          value: line,
          from: lineNumber
        };
      });
    }

RunKit link:

https://runkit.com/keithwhor/line-diff

What am I missing? I can't find other references to doing diffing like this. Everything just links back to that one paper.
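For reference, here's what the function returns for a simple one-line edit (traced by hand; `from` is the matched line number in LEFT, or -1 when the line needs re-rendering):

    const result = lineDiff('alpha\nbeta\ngamma', 'alpha\nBETA\ngamma');
    // [ { value: 'alpha', from: 0 },
    //   { value: 'BETA',  from: -1 },  // changed line: no match in LEFT
    //   { value: 'gamma', from: 2 } ]
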
======
ridiculous_fish
Diffing means computing the longest common subsequence (LCS); the edit script
is everything that's not part of the LCS. The proposed algorithm greedily
takes the first matching element, but sometimes it's better to skip that
element and not incorporate it into your sequence.

Consider the following sequences (here each letter represents a line):

    AAAB
    BAAA

Where the first one is "left" and the second one is "right".

The LCS is AAA. The minimal edit script from left to right is:

1. Insert B at beginning.

2. Delete B at end.

The proposed O(N) algorithm finds B as the first matching line, and thereby
exhausts `left`. So it finds the common subsequence of just "B", and its edit
script is:

1. Delete 3 As at beginning

2. Insert 3 As at end

which is longer.
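
You can check this with a condensed, self-contained version of the same greedy matcher (a sketch, not the code from the post verbatim):

    // Same greedy strategy as the posted lineDiff, on arrays of lines.
    function greedyMatch(left, right) {
      const lookup = {};
      left.forEach((line, i) => (lookup[line] = lookup[line] || []).push(i));
      let minLine = -1;
      return right.map(line => {
        const positions = lookup[line] || [];
        let from = positions.length ? positions.shift() : -1;
        if (from > minLine) minLine = from; else from = -1;
        return { value: line, from };
      });
    }

    console.log(greedyMatch(['A', 'A', 'A', 'B'], ['B', 'A', 'A', 'A']));
    // B matches left index 3 and advances minLine past every A,
    // so all three As come back with from: -1.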

~~~
leshow
Is it subsequence or substring? Don't you want to find the longest common
substring, i.e. contiguous block? I always understood subsequence to be non-
contiguous.

~~~
ruds
It's subsequence. For instance, in

ABXCDYEFGZ QARBCSTDEFG

You want to find [Q]A[R]B(X)C[ST]D(Y)EFG(Z)

ABCDEFG is the longest common subsequence, QRST are the additions and XYZ are
the deletions.
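
For comparison, the textbook O(NM) dynamic-programming LCS recovers exactly that (a minimal sketch, working character by character):

    // Classic LCS via an (N+1) x (M+1) length table, then a walk back.
    function lcs(a, b) {
      const dp = Array.from({ length: a.length + 1 }, () =>
        new Array(b.length + 1).fill(0));
      for (let i = 1; i <= a.length; i++)
        for (let j = 1; j <= b.length; j++)
          dp[i][j] = a[i - 1] === b[j - 1]
            ? dp[i - 1][j - 1] + 1
            : Math.max(dp[i - 1][j], dp[i][j - 1]);
      // Walk back through the table to recover the subsequence.
      let i = a.length, j = b.length;
      const out = [];
      while (i > 0 && j > 0) {
        if (a[i - 1] === b[j - 1]) { out.push(a[--i]); j--; }
        else if (dp[i - 1][j] >= dp[i][j - 1]) i--;
        else j--;
      }
      return out.reverse().join('');
    }

    console.log(lcs('ABXCDYEFGZ', 'QARBCSTDEFG')); // "ABCDEFG"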

------
robertelder
I spent a bit of time learning the Myers Diff algorithm (I made this page:
https://blog.robertelder.org/diff-algorithm/)

Like someone else said, I think the key is that you're not getting the
shortest edit script, you're just getting _an_ edit script.

Interestingly, even git's implementation of diff (which uses a variant of the
Myers Diff algorithm) doesn't actually find the shortest edit script by
default. It has a hard-coded heuristic that just quits early if it finds a
'good enough' solution:

https://github.com/git/git/blob/master/xdiff/xdiffi.c#L143

Same thing for unix diff (unless you specify the command-line flag to find the
shortest):

https://github.com/Distrotech/diffutils/blob/distrotech-diffutils/lib/diffseq.h#L255

Having said that, if you're still confident that you've made an improvement,
you can port the function to python and try it out as a test case in my unit
test file that I wrote when I developed my variant of the Myers Diff
algorithm:

https://github.com/RobertElderSoftware/roberteldersoftwarediff/blob/master/myers_diff_and_variations.py

On that topic, I'm pretty sure my variant of the Myers Diff algorithm is
actually slightly faster/uses less memory than the original (I'd be happy to
be proved wrong). At least, I've had my blog post up for a few years and no
one has corrected me, but then again, I think the original paper has a few
mistakes/obvious flaws and it still has hundreds of citations and no one has
written much about improvements.

~~~
ridiculous_fish
Nice work on your visualization! Thank you for rebutting the Fraser article; I
came to the same conclusion as you (that Myers algorithm is in fact correct)
but had never written it up.

I wrote a vectorized Myers implementation for diffing binary files. Perhaps
the most interesting aspect is the progress bar: it's not obvious how to
create a monotonic progress estimate for Myers diff.

https://github.com/ridiculousfish/HexFiend/blob/master/framework/sources/HFByteArrayEditScript.m

~~~
robertelder
Cool. You've got a fairly impressive following on that tool. I wish more
people took an interest in studying these sorts of algorithms. Most people
assume that some genius figured out the optimal way to do things years ago,
but more often than not, if you spend a few weeks focusing on a specific
paper/algorithm, you'll come up with ideas for improvement (although it might
not always be in the way you initially expected). Sometimes the improvement is
just figuring out a better way to explain it and that in itself can lead other
people to create breakthroughs.

I think there are still a few improvements that could be made to the Myers
algorithm. One specifically: I'm pretty sure it's tail recursive. You
won't get any big O improvements off of that, but less stack space is still an
improvement. Plus better cache locality. I'd work on some demos, but too many
things to do these days...

------
im3w1l
Since the problem has already been resolved, I'll go on a slight tangent and
mention there are some very interesting probabilistic diffing algorithms
invented by the bio-informatics people.

E.g. https://en.wikipedia.org/wiki/BLAST_(biotechnology)

------
terrelln
Slightly off topic, but one great way to diff is to compress the new file
using the old file as a dictionary. Both zlib and zstd support this (but zlib
is limited to a 32KB file).

You can tune how much CPU you want to spend on getting a better diff by
adjusting the compression level; the compressor is well tuned for speed and is
generally more efficient than a diffing algorithm, it can find repetitions in
the new content, and you get entropy coding.
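
In Node, for example, this is just the `dictionary` option on zlib (a minimal sketch; note zlib's 32KB dictionary window, while zstd allows far larger ones):

    const zlib = require('zlib');

    const oldFile = Buffer.from('the quick brown fox jumps over the lazy dog\n');
    const newFile = Buffer.from('the quick brown fox jumps over the lazy cat\n');

    // Compress the new file against the old one; shared content is nearly
    // free, so the output acts as a delta.
    const delta = zlib.deflateSync(newFile, { dictionary: oldFile });

    // "Applying the patch" is just decompressing with the same dictionary.
    const restored = zlib.inflateSync(delta, { dictionary: oldFile });
    console.log(restored.equals(newFile)); // true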

~~~
andai
That's really neat, do you maybe have any links on this?

------
anonytrary
You're most likely not getting a minimal diff, but it's highly likely that
your diff is good enough for your use-case. React and most other similar VDOM
libraries also fail to achieve minimal diffs, however, that's not necessarily
a bad thing. Just because a diff isn't minimal doesn't mean it's wrong.

The ideal trade-off (diff-size vs space-time characteristics) depends on the
use-case. It might be more important for a VDOM library to obtain a good-
enough diff in O(N) time, as opposed to an LCS in O(NM) space and time. Linear
diffing for VDOMs works well, especially with heuristics and things like
explicit and implicit key matching. Sometimes the asymptotic characteristics
of a program are irrelevant if your system is bounded.

~~~
devit
React doesn't compute diffs, and it would be wrong for it to do so since
components are stateful, and so it matters how they are matched (the idea that
VDOM diffing is just an optimization is a very common misconception).

Instead, you just specify keys for elements (either explicitly or by index in
the sequence) and it matches elements with the same key in the old and new
trees.
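
A sketch of that key matching, which is a single map build plus probe (names here are illustrative, not React's internals):

    // Pair up old and new children by key: matched keys update in place,
    // new keys mount, leftover old keys unmount.
    function matchByKey(oldChildren, newChildren) {
      const byKey = new Map(oldChildren.map(c => [c.key, c]));
      const ops = newChildren.map(child => {
        const prev = byKey.get(child.key);
        byKey.delete(child.key);
        return { key: child.key, op: prev ? 'update' : 'mount' };
      });
      for (const leftover of byKey.values())
        ops.push({ key: leftover.key, op: 'unmount' });
      return ops;
    }

    console.log(matchByKey(
      [{ key: 'a' }, { key: 'b' }],
      [{ key: 'b' }, { key: 'c' }]
    ));
    // [ { key: 'b', op: 'update' }, { key: 'c', op: 'mount' },
    //   { key: 'a', op: 'unmount' } ]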

~~~
anonytrary
> React doesn't compute diffs

I think this is a bit pedantic. React computes what is essentially an edit
path from one tree to another, so yes, it is computing diffs for all intents
and purposes. That said, you're right that implicit and explicit key matching
are not _only_ for optimizations; they're for persisting state as well.

------
kazinator
I just found an O(M + N) diffing algorithm:

    $ my-diff yours mine
    --- yours 2019-10-28 15:11:07.150118090 -0700
    +++ mine  2019-10-28 16:46:42.093923066 -0700
    @@ -1,50 +1,75 @@
    - yours
    - yours
    [...]
    - yours
    + mine
    + mine
    [...]
    + mine

Spit out the diff and hunk header, then dump yours preceded by "- " and then
mine preceded by "+ ".

~~~
jerf
There's a non-trivial sense in which you'd be justified calling this O(1),
inasmuch as the edit script "delete yours, insert mine" can be yielded up
without even reading "yours" or "mine", simply by referencing them. That's a
more abstract sense of edit script than something "diff" could produce.
However, with at most a slight change to "diff" (and I'm not sure it's even
that; it may have this already, I'm just not checking for an HN post), you can
yield an O(1) "delete yours" and then insert mine, so it's O(N) without the
M [1].

In UNIX terms you can just spec out "cp mine yours" as the edit script, which
is O(1) from the perspective of the diff algorithm. Applying that script is
O(N), of course, but generating it is constant.

1: Super pedants will note that O(N + M) isn't terribly meaningful here, since
adding two linear terms means you can reduce that to O(N) under the big-O
rules. However, I've always liked keeping information like that in the O
notation, in some sense precisely _because_ O(N+M) is-equivalent-to O(N) in
big-O terms: if you have additional knowledge (say that M is, for some reason,
exponential in a term you care about), you can generate the new big-O term
that you care about. In graph theory, my professor tended to want to express
O(edges) as O(nodes^2), but I always liked retaining the O(edges) because
O(edges) yields useful information if you know you have a sparse graph, for
instance, whereas O(nodes^2) removes that. To his credit, he never marked me
down for it, which puts him ahead of a lot of professors and their personal
preferences...

~~~
syrak
O(N + M) is not equivalent to O(N) if you make no assumptions about the
relative growths of N and M; that's why it's actually meaningful to keep both
terms around. You can only reduce it to O(N) when N dominates M (M = O(N)).

~~~
jerf
In the case of file size inputs to a diff algorithm, it would not normally
make sense to discuss how one file is going to be "exponentially" larger than
another or something. They just are what they are, the bare inputs to the
function in question.

By contrast, graphs can have significantly different relationships between
nodes and edges, for instance, with sparse vs. dense graphs.

One could construct a situation in which file size inputs vary relative to
each other in some O(...)-significant way, but it would be very strained,
especially in the context of a _diff_ algorithm (why are you trying to diff
things significantly different in size at all?). You can fiddle with almost
any O(...) pronouncement that way, e.g., "adding two integers isn't O(1)".
True enough, but generally more detail than we are looking for.

~~~
lorenzhs
You’re wrong: the sizes of both input files are commonly considered when
analysing algorithms that work on two files. Consider a problem that could be
solved by sorting one of the files and then stepping through both files item
by item (not saying that any meaningful problem with this property exists,
it’s just an example). Then you would have a complexity of O(n log n + m)
assuming n is the size of the file being sorted.

There is valuable information in running time being linear in the size of both
files. You could define _n_ as the combined size if you wanted to, but that
could be confusing.

------
ysleepy
Does this algorithm find the _shortest_ edit script?

Finding any diff is not really the problem.

~~~
dllthomas
I think shortest is probably a good proxy if we can't find something better,
but in an editor "easiest to understand" is probably what's actually
important.

Not that I have any particular expectation that this algorithm does
unexpectedly well at that.

~~~
eru
Yes. That reminds me of lossy image compression: the real goal is to find a
representation that looks as close as possible to the input image to the human
eye. In practice, we use proxies.

Not all proxies are equally good. Eg just going by Hamming distance or L2 norm
distance would be awful. Proxies that take more features of our visual system
into account are going to fare better.

------
jacquesm
This doesn't answer your question but you really should read up on Needleman-
Wunsch for matching up strings.

Love this post by the way, wish we would see more of these on HN.

------
bodeadly
Ah, no, the Myers algorithm is also known as the "Shortest Edit Script"
algorithm, and your algo does not follow the shortest path or produce a script
(just -1 for mismatches).

More specifically, it's not just something at the end appearing at the
beginning that causes skipping, as others have pointed out: the skipping can
also happen in the middle. Consider the two sequences A B B C and B C A. Your
algo will match the first Bs and then the Cs and skip the common B C together.

Even if you could somehow make it work, it would be O(N), but note that with
all of the split() and map() and array indexing, it starts to be more like
O(N + NX) where X is a potentially significant overhead.

Don't let me pooh-pooh you out of trying, but if you do give in and use the
Myers algo, here's my C diff implementation:

    http://www.ioplex.com/~miallen/libmba/dl/src/diff.c

This is actually pretty compact as diff implementations go and would probably
be a good reference for implementing diff in JS.

~~~
keithwhor
Thank you: I'll definitely reference this if (and when) the bottleneck becomes
our diffing implementation, or if as part of the editing experience we want to
be able to show diffs between files that aren't stored in git.

Keep in mind that right now the implementation is for an in-browser code
editor. The importance of the diff algorithm is primarily to quickly determine
which HTML elements to re-render on a line-by-line basis. The changes a user
makes between frames are at most single-grouping insertions / deletions (cut,
paste, delete, insert). For a 2k LOC file, the render step (pre-diff) was on
the order of 200ms: I was repopulating thousands of DOM elements every time a
line was changed. Now it's at like 3ms, so under the 16.67ms frame cutoff.

I'm floored by the responses here though, this has been super helpful! I knew
I was missing something academic here. :)

------
victorNicollet
Question: why do you ensure that you are only moving forward? Does it matter,
for line highlighting, that a line has moved up, or now appears twice?

------
pratchett
Your approach is called a hash join in databases.

Not sure about the diffing research though.
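
For anyone unfamiliar: a hash join builds a hash table on one relation and probes it with the other, the same build-then-probe shape as the posted diff (a toy sketch with made-up table and column names):

    // Build a hash table on one input, probe with the other: O(N + M).
    function hashJoin(users, orders) {
      const byId = new Map(users.map(u => [u.id, u]));
      return orders
        .filter(o => byId.has(o.userId))
        .map(o => ({ ...byId.get(o.userId), ...o }));
    }

    console.log(hashJoin(
      [{ id: 1, name: 'ada' }],
      [{ userId: 1, total: 42 }]
    ));
    // [ { id: 1, name: 'ada', userId: 1, total: 42 } ]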

------
vince14
React has absolutely nothing to do with that.

------
hhmc
Hash map worst-case insertion/search is O(N), so this is potentially worst
case O(N^2). IIRC a typical implementation would be more likely to be
O(N log N), with a high log factor in the limit, but still worse than O(ND).
And this is before we account for the fact that hashing a line probably has an
implicit factor of D in it.

~~~
heavenlyblue
I have just been interviewing at several companies that have highly technical
interviews in London, and about 80% of the interviewers who asked questions
about hash maps expected me to assume that hashmaps imply constant-time
lookups.

People tend to dismiss the statistics related to hash collisions because they
think they are absolutely irrelevant to “real-life scenarios”.

~~~
Shoop
This is because hash tables resize when they are filled with too many
elements. Assuming a suitably good hash function, this gives O(1) lookup
w.h.p. and insertion stays O(1) amortized. The mathematics of hash collision
probabilities is factored into the resizing thresholds.

Many standard library implementations of hash tables (such as Rust's) also
include special hash functions which are salted with random values to prevent
DoS attacks which create large numbers of hash collisions.
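
The resizing logic is just a load-factor check on insert; a toy separate-chaining sketch (the 0.75 threshold is arbitrary but typical):

    // Doubles the bucket count when load factor exceeds 0.75, keeping
    // expected chain length (and thus expected lookup cost) constant.
    class Table {
      constructor() {
        this.buckets = Array.from({ length: 8 }, () => []);
        this.size = 0;
      }
      hash(key) { // FNV-1a over the key's characters
        let h = 2166136261;
        for (const ch of String(key)) {
          h ^= ch.codePointAt(0);
          h = Math.imul(h, 16777619);
        }
        return (h >>> 0) % this.buckets.length;
      }
      set(key, value) {
        if ((this.size + 1) / this.buckets.length > 0.75) this.grow();
        const bucket = this.buckets[this.hash(key)];
        const hit = bucket.find(e => e.key === key);
        if (hit) hit.value = value;
        else { bucket.push({ key, value }); this.size++; }
      }
      get(key) {
        const hit = this.buckets[this.hash(key)].find(e => e.key === key);
        return hit && hit.value;
      }
      grow() { // rehash every entry into twice as many buckets
        const entries = this.buckets.flat();
        this.buckets = Array.from({ length: this.buckets.length * 2 }, () => []);
        entries.forEach(e => this.buckets[this.hash(e.key)].push(e));
      }
    }

    const t = new Table();
    for (let i = 0; i < 100; i++) t.set('k' + i, i);
    console.log(t.get('k42')); // 42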

~~~
eru
If you are really paranoid, you can even use a salted cryptographic hash
function. But the constant factors for those are usually worse than for
simpler salted hashes, so they are not worth it for hashtables.

Your fallback for collisions could also be something with O(log
size_of_bucket) runtime, instead of linked lists. But again, when you don't
have many collisions, that's going to be slower than something simpler.

(I half remember a result that if you tune your parameters right, you can get
away with a hashtable that only has enough size for storing one element in
each slot; if a collision happens, you replace that element with a tombstone
and put the elements involved in a single global linked list.

Basically, for that to work you need to keep your total number of collisions
smaller than a constant, with high probability.)

~~~
LgWoodenBadger
Once your hashes collide, you have no other mechanism to use besides equality.
How can you use something other than a list in order to find the actual value?
This is for the general case, where items don't have an "order/rank/sort"
property between them.

~~~
eru
Most of the time, when we talk about collisions we mean that two different
keys get hashed to the same bucket.

What mechanisms you have available depends on your implementation.

For example, Cuckoo Hashing
(https://en.wikipedia.org/wiki/Cuckoo_hashing)
relies on having two different hash functions available.

And yes, for having something like a balanced tree, having a comparison would
be useful.

In general, most hash functions work by consuming something like a stream of
bits. So for your implementation it makes a lot of sense for the datatypes you
want to use as keys to export a way to convert themselves into that stream,
and to leave the actual hashing down to a fixed size as a detail of your
hashtable.

That way you can e.g. do fall-back comparisons directly on the stream of bits
(including for equality). Or you can transparently support multiple hash
functions.

Even in languages where the hashable interface works by giving you a method
that spits out e.g. a 64-bit number only, you still have to map that number to
one of your buckets. So for your fall-back, you can choose a different
mapping.

~~~
LgWoodenBadger
My point was "what else can you use besides a linked list to store hash-
colliding values?"

If you have a second (or a third, or a fourth...) hashing algorithm, then make
it hash tables all the way down. At the end, you still need some data
structure to store hash-colliding values. And if so, what other structure
could you possibly use besides a list (linked-, array-, etc.) ?

~~~
eru
> If you have a second (or a third, or a fourth...) hashing algorithm, then
> make it hash tables all the way down. At the end, you still need some data
> structure to store hash-colliding values.

Why? You can have just two layers: the primary hash table, with buckets each
made up of one small secondary hashtable. If there's a collision in the bucket
hashtable, pick a new random hash function for that bucket and re-hash
everything in the bucket.

If that fails after trying a few times, pick a new random hash for the primary
hash table and consider resizing.

I bet you can make that scheme workable in O(1) expected amortised time for
inserts and lookups.

Cuckoo hashing
(https://en.wikipedia.org/wiki/Cuckoo_hashing)
is a related idea: you just have two hash functions. If you ever have three
elements that collide under both hash functions, you just re-hash your entire
table.

(From Wikipedia:)

> When a new key is inserted, and one of its two cells is empty, it may be
> placed in that cell. However, when both cells are already full, it will be
> necessary to move other keys to their second locations (or back to their
> first locations) to make room for the new key. A greedy algorithm is used:
> The new key is inserted in one of its two possible locations, "kicking out",
> that is, displacing, any key that might already reside in this location.
> This displaced key is then inserted in its alternative location, again
> kicking out any key that might reside there. The process continues in the
> same way until an empty position is found, completing the algorithm.
> However, it is possible for this insertion process to fail, by entering an
> infinite loop or by finding a very long chain (longer than a preset
> threshold that is logarithmic in the table size). In this case, the hash
> table is rebuilt in-place using new hash functions: [...]

You say:

> And if so, what other structure could you possibly use besides a list
> (linked-, array-, etc.) ?

Any datastructure you feel like. You can also use a balanced search tree, if
you want to. See eg
https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures
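
A sketch of the displacement loop the quote describes (toy hash functions; a real implementation rebuilds with new hash functions rather than just reporting failure):

    // Cuckoo insert: every key has two candidate slots. Kick out whatever
    // occupies the chosen slot and re-insert it at its alternative slot.
    function cuckooInsert(table, key, h1, h2, maxKicks = 32) {
      let slot = h1(key);
      for (let i = 0; i < maxKicks; i++) {
        const evicted = table[slot];
        table[slot] = key;
        if (evicted === undefined) return true;
        key = evicted;
        // The displaced key moves to whichever of its two slots it wasn't in.
        slot = h1(key) === slot ? h2(key) : h1(key);
      }
      return false; // caller should rebuild the table with new hash functions
    }

    const table = new Array(8);
    const h1 = k => k % 8;
    const h2 = k => Math.floor(k / 8) % 8;
    [3, 11, 19].forEach(k => console.log(cuckooInsert(table, k, h1, h2)));
    // true, true, true (11 and 19 each displace an earlier key into its
    // alternative slot)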

