
Designing a Tree Diff Algorithm Using Dynamic Programming and A* - yminsky
http://thume.ca/2017/06/17/tree-diffing/
======
toomim
I believe this article may have produced the same algorithm as [1] and [2] but
applied to tree diffs [3] instead of string diffs.

In particular, Myers' algorithm [2] is generally considered the best general-
purpose string diff today, and I bet he was doing A* search without realizing
it. I would be very curious to find out if you could describe his algorithm as
an A* search.

A* search brought his algorithm's runtime down from O(N^2) to O(N•diffsize).
Tree diffs are often O(N^3) or O(N^2•log(N)), and this A* search seems to have
brought that down to O(N•log(N)•diffsize). The A* search seems to reduce the
extra O(N or N^2) to O(diffsize)!

I'm excited by the general use of A* to prune the dynamic programming search
space. We might be onto a more general understanding of diffing.

[1]
[https://www.google.com/patents/US7313555](https://www.google.com/patents/US7313555)

[2]
[http://www.xmailserver.org/diff2.pdf](http://www.xmailserver.org/diff2.pdf)

[3]
[http://www.sciencedirect.com/science/article/pii/S0304397505...](http://www.sciencedirect.com/science/article/pii/S0304397505000174)

------
robertkrahn01
Interesting problem, and useful for creating semantically meaningful diffs,
i.e. using entities of the programming language. A different approach to that
is implemented in gumtree [1,2].

[1] Fine-grained and Accurate Source Code Differencing [https://hal.archives-
ouvertes.fr/hal-01054552/document](https://hal.archives-
ouvertes.fr/hal-01054552/document)

[2]
[https://github.com/GumTreeDiff/gumtree](https://github.com/GumTreeDiff/gumtree)

------
dahart
Pretty nice writeup, congrats. Tree diffing is super fun to code! (I've used
it to animate between two different shader programs...) Applying A* is a great
idea.

Not sure if this will help anyone interested, but edit operations on trees
have a couple more categories besides inserts & deletes that I think makes it
hard to think about as a grid the same way you can with strings. I spent a lot
of time boiling down the math notation in a paper on tree edit distance to a
single diagram that visualizes the edit operations:
[https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?d...](https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?dl=0)

This is a visualization of "T. Jiang, L. Wang, and K. Zhang. Alignment of
trees- an alternative to tree edit. Theoretical Computer Science,
143(1):137–148, 1995."

The interesting part is the right side, comparing a forest of children to
another forest of children, which happens when a node changes type, but the
subnodes are identical.

Another critical piece in tree diff is whether the nodes are ordered or
unordered. If you have an AST for a math expression, a "+" operation is
unordered, and you may have to diff two plus nodes twice, once with the
arguments swapped - which is represented on the left in my diagram.

ASTs with 3+ argument nodes also confound visualizing the edit search space on
a grid.

After having implemented tree diff using careful dynamic programming with
backtracking
([https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/func...](https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/function/align.cpp))
I came to the conclusion that memoizing would be a lot easier and have the
same complexity. And you get A* for free-ish if you add the cost function and
skip comparing two sub-trees that are further away than the best answer you've
found so far.

------
how_gauche
Similar dynamic programming idea from Zhang + Shasha 1992:
[http://www.cs.nyu.edu/shasha/papers/treebook3.pdf](http://www.cs.nyu.edu/shasha/papers/treebook3.pdf)

~~~
dahart
I think this is the same Zhang with a later paper. This is the most direct
mapping of edit distance from strings onto trees that I know of, and uses
dynamic programming:
[http://www.sciencedirect.com/science/article/pii/03043975958...](http://www.sciencedirect.com/science/article/pii/0304397595800299)

------
bhrgunatha
> When I first heard the problem, it sounded pretty easy and I thought I’d be
> done within a few days, but I discovered a number of catches after starting
> and it ended up taking 5 weeks

It's really refreshing to hear about this type of real world estimation error
that reflects so well the problem in software of making estimates with so many
unknowns.

------
daww
Hey you are the guy who made syntect! Great job and thanks

~~~
trishume
Wow, this is the first time I've been recognized for my obscure Rust syntax
highlighting library. Glad you like it. It is one of the projects I've put the
most work into, but in the rare occasions I'm recognized it's normally for
contributing to Spacemacs.

------
kuwze
There's also ydiff in case anyone is interested (it works well with
lisp/scheme):
[https://github.com/JexCheng/ydiff](https://github.com/JexCheng/ydiff)

~~~
trishume
Neat, it looks similar in spirit to the original dynamic programming diff
algorithm I used, but it doesn't have the A* acceleration so I imagine it
doesn't scale to large files. Probably nice for most cases though.

------
kruhft
Some of you in here reading this might like this new game:

[https://github.com/BusFactor1Inc/explain](https://github.com/BusFactor1Inc/explain)

It's about exploring language spaces. Syntax free.

And metaprogramming.

------
nerdponx
Great post, thank you!

But why is the config represented as a sexpr and not, say, YAML?

~~~
trishume
Expanding on the other comment, the answer is that Jane Street uses OCaml,
where they have to build many of the libraries themselves. They've built good
libraries for s-expressions but not for YAML.

They've also built an ecosystem of tools for s-expressions and switching to
YAML would discard that ecosystem without significant benefits. This is self-
reinforcing, for example my tool uses s-expressions because everyone does, and
it added to the ecosystem that makes s-expressions the right choice. Almost
everyone there uses Emacs so s-expressions are equal to or easier to edit than
YAML.

Kinda the same reasons that for config files Ruby people mostly use YAML, JS
people mostly use JSON and Rust people mostly use TOML.

------
senatorobama
What's the state of the art of progarm-diffing (i.e based on LLVM IR)?

~~~
yorwba
What kind of differences are you looking for? And equally important, what kind
of differences do you want to ignore?

Is

    
    
      if(condition) { a(); } else { b(); }
    

meaningfully different from

    
    
      if(!condition) { b(); } else { a(); }
    

or not?

Depending on your needs, you could just diff a textual representation using a
string diff algorithm, or diff the control flow graph using a graph diff
algorithm (but that might require solving graph isomorphism problems, which
are hard (potentially NP-hard)).

~~~
senatorobama
Can't I check whether every instruction in every BasicBlock of the CFG is
equal?

~~~
yorwba
Sure, but how do you determine which basic blocks correspond to each other,
when there have been changes to parts of the graph?

~~~
senatorobama
Damn. Is this an unsolved problem in CS?

~~~
raphlinus
Graph isomorphism is a known NP-hard problem, but the vast majority of
subgraphs in programs are reducible, so I expect that aligning two control
flow graphs admits a good heuristic solution.

~~~
yorwba
Actually, the complexity of Graph Isomorphism in general is still unknown, and
any definite classification would have interesting consequences.
[https://en.wikipedia.org/wiki/Graph_isomorphism_problem](https://en.wikipedia.org/wiki/Graph_isomorphism_problem)

I agree though that control flow graphs are probably easier to align than
completely random graphs, although there might be pathological cases e.g. if
you have lots of identical subgraphs like _if err != nil { return nil, err }_.

