Hacker News new | past | comments | ask | show | jobs | submit login
Designing a Tree Diff Algorithm Using Dynamic Programming and A* (thume.ca)
230 points by yminsky on Aug 25, 2017 | hide | past | web | favorite | 21 comments

I believe this article may have produced the same algorithm as [1] and [2] but applied to tree diffs [3] instead of string diffs.

In particular, Myers' algorithm [2] is generally considered the best general-purpose string diff today, and I bet he was doing A* search without realizing it. I would be very curious to find out if you could describe his algorithm as an A* search.

A* search brought his algorithm's runtime down from O(N^2) to O(N•diffsize). Tree diffs are often O(N^3) or O(N^2•log(N)), and this A* search seems to have brought that down to O(N•log(N)•diffsize). The A* search seems to reduce the extra O(N or N^2) to O(diffsize)!

I'm excited by the general use of A* to prune the dynamic programming search space. We might be onto a more general understanding of diffing.

[1] https://www.google.com/patents/US7313555

[2] http://www.xmailserver.org/diff2.pdf

[3] http://www.sciencedirect.com/science/article/pii/S0304397505...

Interesting problem, and useful for creating semantically meaningful diffs, i.e. using entities of the programming language. A different approach to that is implemented in gumtree [1,2].

[1] Fine-grained and Accurate Source Code Differencing https://hal.archives-ouvertes.fr/hal-01054552/document

[2] https://github.com/GumTreeDiff/gumtree

Pretty nice writeup, congrats. Tree diffing is super fun to code! (I've used it to animate between two different shader programs...) Applying A* is a great idea.

Not sure if this will help anyone interested, but edit operations on trees have a couple more categories besides inserts & deletes that I think makes it hard to think about as a grid the same way you can with strings. I spent a lot of time boiling down the math notation in a paper on tree edit distance to a single diagram that visualizes the edit operations: https://www.dropbox.com/s/iubwg113mm7ey6q/treeEditDist.png?d...

This is a visualization of "T. Jiang, L. Wang, and K. Zhang. Alignment of trees- an alternative to tree edit. Theoretical Computer Science, 143(1):137–148, 1995."

The interesting part is the right side, comparing a forest of children to another forest of children, which happens when a node changes type, but the subnodes are identical.

Another critical piece in tree diff is whether the nodes are ordered or unordered. If you have an AST for a math expression, a "+" operation is unordered, and you may have to diff two plus nodes twice, once with the arguments swapped - which is represented on the left in my diagram.

ASTs with 3+ argument nodes also confound visualizing the edit search space on a grid.

After having implemented tree diff using careful dynamic programming with backtracking (https://sourceforge.net/p/blot/code/HEAD/tree/trunk/lib/func...) I came to the conclusion that memoizing would be a lot easier and have the same complexity. And you get A* for free-ish if you add the cost function and skip comparing two sub-trees that are further away than the best answer you've found so far.

Similar dynamic programming idea from Zhang + Shasha 1992: http://www.cs.nyu.edu/shasha/papers/treebook3.pdf

I think this is the same Zhang with a later paper. This is the most direct mapping of edit distance from strings onto trees that I know of, and uses dynamic programming: http://www.sciencedirect.com/science/article/pii/03043975958...

> When I first heard the problem, it sounded pretty easy and I thought I’d be done within a few days, but I discovered a number of catches after starting and it ended up taking 5 weeks

It's really refreshing to hear about this type of real world estimation error that reflects so well the problem in software of making estimates with so many unknowns.

There's also ydiff in case anyone is interested (it works well with lisp/scheme): https://github.com/JexCheng/ydiff

Neat, it looks similar in spirit to the original dynamic programming diff algorithm I used, but it doesn't have the A* acceleration so I imagine it doesn't scale to large files. Probably nice for most cases though.

Hey you are the guy who made syntect! Great job and thanks

Wow, this is the first time I've been recognized for my obscure Rust syntax highlighting library. Glad you like it. It is one of the projects I've put the most work into, but in the rare occasions I'm recognized it's normally for contributing to Spacemacs.

Some of you in here reading this might like this new game:


It's about exploring language spaces. Syntax free.

And metaprogramming.

Great post, thank you!

But why is the config represented as a sexpr and not, say, YAML?

Expanding on the other comment, the answer is that Jane Street uses OCaml, where they have to build many of the libraries themselves. They've built good libraries for s-expressions but not for YAML.

They've also built an ecosystem of tools for s-expressions and switching to YAML would discard that ecosystem without significant benefits. This is self-reinforcing, for example my tool uses s-expressions because everyone does, and it added to the ecosystem that makes s-expressions the right choice. Almost everyone there uses Emacs so s-expressions are equal to or easier to edit than YAML.

Kinda the same reasons that for config files Ruby people mostly use YAML, JS people mostly use JSON and Rust people mostly use TOML.

From the looks of it, their config files are close to (if not actually) executable code. It also looks like S-expressions are a common data serialization format for OCAML, which Jane Street is known for using.

Turning the question around, what would the advantages of using YAML instead of S-expressions?

What's the state of the art of progarm-diffing (i.e based on LLVM IR)?

What kind of differences are you looking for? And equally important, what kind of differences do you want to ignore?


  if(condition) { a(); } else { b(); }
meaningfully different from

  if(!condition) { b(); } else { a(); }
or not?

Depending on your needs, you could just diff a textual representation using a string diff algorithm, or diff the control flow graph using a graph diff algorithm (but that might require solving graph isomorphism problems, which are hard (potentially NP-hard)).

Can't I check whether every instruction in every BasicBlock of the CFG is equal?

Sure, but how do you determine which basic blocks correspond to each other, when there have been changes to parts of the graph?

Damn. Is this an unsolved problem in CS?

Graph isomorphism is a known NP-hard problem, but the vast majority of subgraphs in programs are reducible, so I expect that aligning two control flow graphs admits a good heuristic solution.

Actually, the complexity of Graph Isomorphism in general is still unknown, and any definite classification would have interesting consequences. https://en.wikipedia.org/wiki/Graph_isomorphism_problem

I agree though that control flow graphs are probably easier to align than completely random graphs, although there might be pathological cases e.g. if you have lots of identical subgraphs like if err != nil { return nil, err }.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact