
Ydiff: Structural Comparison of Programs - roquin
http://yinwang0.wordpress.com/2012/01/03/ydiff/
======
barakm
This is pretty damn cool, but I can think of a few things that I'm curious how
you're going to address:

1) Your C++ example, we have "namespace 'v8'" -- which has been removed and
reinserted. That made me scratch my head a bit. Has it been refactored enough
that this is kind of a rewrite? If so, there might be a third or fourth color
here for "hey, this part didn't /change/ per se, but everything under it did"

2) Your Python example: right at the top we have a new class, PairIterator,
with "class" rightly green. However I'm fighting against years of reading
textual diffs that suggest the whole block is really the new thing. Could the
block highlight the class as-is (where it changed) and subtly color the rest
of the tree (what else changed with it)?

3) I get your whole in-the-future-we-store-ASTs argument, but that's certainly
not the case today. Today we store text, and I don't see that changing. Could
there also be a diff, perhaps even another mode that, counter to just dealing
with structure, deals with formatting? ie, find the ranges that contribute
nothing to the AST and then diff those textually.

I like the ideas and the paradigm shifting -- and the real answer is likely
somewhere in between, because, at the end of the day, programmers are still
editing things in text, even if they are manipulating greater structures.

As a diff tool, I'd introduce this to my workflow _in a heartbeat_ if I wasn't
working so hard to interpret what the diff means.

~~~
shurcooL
> 3) I get your whole in-the-future-we-store-ASTs argument, but that's
> certainly not the case today. Today we store text, and I don't see that
> changing. Could there also be a diff, perhaps even another mode that,
> counter to just dealing with structure, deals with formatting? ie, find the
> ranges that contribute nothing to the AST and then diff those textually.

Agree. I think it's likely code will be stored as text, but parsed into ASTs
easily. Consider Go standard library has a full language parser built right
in, so to get an AST from a .go text source file is about 3 lines of code.

Because you can get the code back from AST easily (while maintaining spacing),
it really doesn't matter what you save. You can use tools to edit the text
form or AST form without any duplication.

~~~
pshc
I've actually been trying to make 3) happen with a long-standing project of
mine, because serializing AST to text discards information. In a (proper) AST,
human-readable names are never used as pointer values.

When you bind something (reference a variable), you just point at that
variable's node, rather than mentioning it by string. This removes a huge
class of artificial problems brought on by plaintext source code (symbol name
clashes, namespaces, overeager imports, var name typos, shadowing, and other
programmer-compiler miscommunications), but introduces some editor UI concerns
(e.g. indicating shadowing).

There are a whole host of other advantages of saving as ASTs directly. One of
them is granular, semantic diffs like in the OP. I'm convinced it's the
future... there are a lot of UI problems to solve in a solid, practical editor
for it though.

~~~
shurcooL
I don't see why a variable can't have a "name" property, even if it's not used
as pointer value. I don't see why ASTs should have less information than
source code. IMO they're equivalent, and just different forms of
representation.

ASTs are easier to manipulate with code, source code is easier to manipulate
with text editors. We don't really have good tools for manipulating ASTs yet.

~~~
pshc
The same amount of information exists, but is normalized. The variable
_definition_ absolutely will have a "name" annotation, but copying that name
to each binding site is brittle. To render a binding, just dereference the
pointer to the original definition and use its name.

I'm working on it.

~~~
shurcooL
Me too. :)

------
RBerenguel
Neat, love new approaches to old problems. At first sight I thought it would
be an awesome implementation of an idea I had (and probably tons of other
people before and after :) a few years ago to find cheating among C
programming assignments: convert programs to PostScript "drawings": ifs, fors
and certain other functions gave rise to different "movements" of the cursor,
eventually drawing paths for different expressions. Since it didn't take into
account variable names, malloc-calloc-free order or any other extraneous
thing, it could pinpoint cheating with a relatively good accuracy (and indeed,
we caught 3 cheaters among ~90 submissions, and one of them would have been
pretty hard to spot without the tool.) If anyone's interested, it was created
with lex and basically parsed the expressions I was interested. I think I
wrote about it once in my blog, but I'm not sure if I ever published the code
(it was hacky as hell!)

~~~
jacques_chester
I've seen cheating-detection programs that use tree diff scores to compare
code from different students.

The theory goes that students who share solutions will probably change the
variable names, do some reformatting etc, which would fool a text diff. But
the actual structure of the AST will be the same or similar.

So if you find close matches, you inspect them more closely.

Some quick Googling reveals that plagiarism detection using tree comparisons
is a common idea.

~~~
RBerenguel
Yup, when I thought "this looks neat!" I didn't think it was very novel, the
idea is very clear once you are actually looking at cheated code (IIRC my
brother in law told me a few years ago - later an I wrote my code, which was
written before meeting him - he had used a similar method years earlier in
another university). But I didn't bother much with looking for existing
solutions: this was just soo fun! I love reinventing wheels for fun/learning
:)

------
tikhonj
I was working on a very similar project myself recently [1], heavily inspired
by ydiff. Unfortunately, I didn't implement the tree diff algorithm entirely
correctly, and it's been on hiatus for over a year :(.

The one interesting addition my project had was merging. The neat trick was
that we reused the same tree diff algorithm to find conflicts :P. With a bit
of work, we would have some very neat features, including the ability to
resolve certain conflicts which physically overlap.

[1]: <http://jelv.is/cow>

~~~
ay
I thought of a variant of this approach for distributed code storage/reuse.
Store each function as AST with normalized variable names. Then, a
cryptographic hash of this would uniquely identify algorithm. So, caching
becomes quite easy and very scalable. Callers refer to specific version by
hash also. Thus your entire program can be identified by a single hash. Nice
to distribute and cache at web scale. What do you think ?

~~~
pshc
It's what we need. This became clear shortly after learning git--it's the
right data structure for the problem.

The hurdle is that hashes aren't user-friendly in a text-based code editor. We
need an editor that lets us view and work at the right level of abstraction.

------
goldfeld
Lovely to see Racket being used. Given the pure meta-ness of Racket hopefully
ydiff will not only evolve into a full-fledged version control system but a
completely configurable one at that, akin to the power of Emacs and it's Lisp.
Having something like a .yinrc where I could quickly throw around and test out
custom functionality would be awesome. I haven't gotten much into git's
plumbing yet, but I don't see a culture of hacking it like Emacs and Vim, and
I'm not sure why that couldn't be. There's definitely a need, since you have
things like git-flow. With structural versioning I think the possibilities of
customization and automation would be greatly expanded, and I'd love to be
able to use an .*rc file like I use my .vimrc daily, as a scratchpad for small
scripts or for fleshing out rough ideas into later externalized plugins.

~~~
reubenmorais
That's why I love Mercurial. The API is great, people build great things with
it, and it has a policy of "default to sane, simple and working; let users
tweak stuff easily".

------
taliesinb
We use a similar thing internally at Wolfram to diff Wolfram Language
expressions, though I believe it doesn't have the subexpression matcher. But
the diffs are much clearer when you have full m-expressions instead of
s-expressions.

The coder who wrote it (@wmacura) now works at Tumblr.

------
n72
Relatedly, see Rich Hickey's <http://blog.datomic.com/2012/10/codeq.html>

------
frozenport
Wonderful, although I would rather have the AST integrated with GIT and not as
some other kind of version control.

~~~
jared314
As long as you can convert the AST into a git Tree Object[0] you should be
good to go.

[0] [http://git-scm.com/book/en/Git-Internals-Git-Objects#Tree-
Ob...](http://git-scm.com/book/en/Git-Internals-Git-Objects#Tree-Objects)

------
rmc
Really neat. Shall have to play with it.

Related to this, I wish there was a way to navigate a programme based on a
tree, not just line and character navigation.

~~~
gngeal
Have you tried Paredit?

------
aethertap
This is great. I've wanted to build a similar tool for a couple of years, the
distinction being that I want a structural grep utility. I'd pass in a pattern
in the form of a code snippet (with pattern operators), and have the thing
search for matching code structures. I suspect that this tool does 95% of what
would be needed for that, very nice.

~~~
JadeNB
It's not what you want, but, as perhaps a first step, there's an SGML sgrep
(<http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>). Perhaps one might combine
it with programming based on Oleg's SXML
(<http://okmij.org/ftp/Scheme/xml.html>) and similar.

EDIT (fixing formatting, and …): I knew something else was nagging at me. The
idea of a structural grep makes me think of Rob Pike's structural regular
expressions (<http://doc.cat-v.org/bell_labs/structural_regexps>). The Sam
editor (<http://sam.cat-v.org>) is based on this, and I thought I remembered a
more modern version called something like 'e'; but, of course, such a name is
impossible to Google.

~~~
aethertap
Awesome, thanks for the links.

------
jongraehl
Cool, but not perfect:

at the end of void ArmDebugger::Stop(Instruction* instr) (rhs of
[http://www.cs.indiana.edu/~yw21/demos/simulator-mips-
simulat...](http://www.cs.indiana.edu/~yw21/demos/simulator-mips-simulator-
arm.html) )

we have an alignment of ' 2 * Instruction::kInstrSize' to some random code at
the end

------
Scaevolus
Semantic Merge [1] is a product based around semantic diffs for C#/VB.NET.
Their demo is quite impressive, and they're working on Java and C++ next.

[1]: <http://www.semanticmerge.com/>

------
mcmire
Worth noting this was written up a year ago and there hasn't been a whole lot
of progress since. Would love to see this idea progressed further, this seems
really neat.

------
thesebas
interesting thing in similar topic
[http://programmers.stackexchange.com/questions/119095/why-
do...](http://programmers.stackexchange.com/questions/119095/why-dont-we-
store-the-syntax-tree-instead-of-the-source-code)

------
pagekicker
can this be pointed at bash scripts?

------
blacksqr
Name too close to "yiff" to take seriously.

