Hacker News new | past | comments | ask | show | jobs | submit login
Ydiff: Structural Comparison of Programs (yinwang0.wordpress.com)
176 points by roquin on May 25, 2013 | hide | past | favorite | 33 comments

This is pretty damn cool, but I can think of a few things that I'm curious how you're going to address:

1) Your C++ example, we have "namespace 'v8'" -- which has been removed and reinserted. That made me scratch my head a bit. Has it been refactored enough that this is kind of a rewrite? If so, there might be a third or fourth color here for "hey, this part didn't /change/ per se, but everything under it did"

2) Your Python example: right at the top we have a new class, PairIterator, with "class" rightly green. However I'm fighting against years of reading textual diffs that suggest the whole block is really the new thing. Could the block highlight the class as-is (where it changed) and subtly color the rest of the tree (what else changed with it)?

3) I get your whole in-the-future-we-store-ASTs argument, but that's certainly not the case today. Today we store text, and I don't see that changing. Could there also be a diff, perhaps even another mode that, counter to just dealing with structure, deals with formatting? ie, find the ranges that contribute nothing to the AST and then diff those textually.

I like the ideas and the paradigm shifting -- and the real answer is likely somewhere in between, because, at the end of the day, programmers are still editing things in text, even if they are manipulating greater structures.

As a diff tool, I'd introduce this to my workflow in a heartbeat if I wasn't working so hard to interpret what the diff means.

> 3) I get your whole in-the-future-we-store-ASTs argument, but that's certainly not the case today. Today we store text, and I don't see that changing. Could there also be a diff, perhaps even another mode that, counter to just dealing with structure, deals with formatting? ie, find the ranges that contribute nothing to the AST and then diff those textually.

Agree. I think it's likely code will be stored as text, but parsed into ASTs easily. Consider Go standard library has a full language parser built right in, so to get an AST from a .go text source file is about 3 lines of code.

Because you can get the code back from AST easily (while maintaining spacing), it really doesn't matter what you save. You can use tools to edit the text form or AST form without any duplication.

I've actually been trying to make 3) happen with a long-standing project of mine, because serializing AST to text discards information. In a (proper) AST, human-readable names are never used as pointer values.

When you bind something (reference a variable), you just point at that variable's node, rather than mentioning it by string. This removes a huge class of artificial problems brought on by plaintext source code (symbol name clashes, namespaces, overeager imports, var name typos, shadowing, and other programmer-compiler miscommunications), but introduces some editor UI concerns (e.g. indicating shadowing).

There are a whole host of other advantages of saving as ASTs directly. One of them is granular, semantic diffs like in the OP. I'm convinced it's the future... there are a lot of UI problems to solve in a solid, practical editor for it though.

I don't see why a variable can't have a "name" property, even if it's not used as pointer value. I don't see why ASTs should have less information than source code. IMO they're equivalent, and just different forms of representation.

ASTs are easier to manipulate with code, source code is easier to manipulate with text editors. We don't really have good tools for manipulating ASTs yet.

The same amount of information exists, but is normalized. The variable _definition_ absolutely will have a "name" annotation, but copying that name to each binding site is brittle. To render a binding, just dereference the pointer to the original definition and use its name.

I'm working on it.

Me too. :)

While there are certainly advantages to storing code in an AST format, I don't think any of these are significant enough to overcome the hugely important advantage of legacy support. Maybe in the long term it will make sense to replace text based code with a new AST format, but for now a slightly crufty but ultimately isomorphic text format seems entirely sufficient.

Neat, love new approaches to old problems. At first sight I thought it would be an awesome implementation of an idea I had (and probably tons of other people before and after :) a few years ago to find cheating among C programming assignments: convert programs to PostScript "drawings": ifs, fors and certain other functions gave rise to different "movements" of the cursor, eventually drawing paths for different expressions. Since it didn't take into account variable names, malloc-calloc-free order or any other extraneous thing, it could pinpoint cheating with a relatively good accuracy (and indeed, we caught 3 cheaters among ~90 submissions, and one of them would have been pretty hard to spot without the tool.) If anyone's interested, it was created with lex and basically parsed the expressions I was interested. I think I wrote about it once in my blog, but I'm not sure if I ever published the code (it was hacky as hell!)

I've seen cheating-detection programs that use tree diff scores to compare code from different students.

The theory goes that students who share solutions will probably change the variable names, do some reformatting etc, which would fool a text diff. But the actual structure of the AST will be the same or similar.

So if you find close matches, you inspect them more closely.

Some quick Googling reveals that plagiarism detection using tree comparisons is a common idea.

Yup, when I thought "this looks neat!" I didn't think it was very novel, the idea is very clear once you are actually looking at cheated code (IIRC my brother in law told me a few years ago - later an I wrote my code, which was written before meeting him - he had used a similar method years earlier in another university). But I didn't bother much with looking for existing solutions: this was just soo fun! I love reinventing wheels for fun/learning :)

Am I correct in guessing, though, that it would only work for relatively sophisticated assignments that can be done in lots of different ways? Or is there always enough variety in different people's structure to tell when things are being copied?

It was more like a quick way to check without having to remember 90 assignments. Visual inspection was far, far better (almost like a game of match 2). I taught numerical analysis, and usually there was more than one way to structure the code, in fact most assignments were clearly very different. You'd be surprised how many ways our students found to write the same Gauss inversion algorithm, for example (my personal experience is that I write always the exact same code with very minor variations, but of course I alwas refer to Golub-Van Loan for the pseudocode...) Other assignments were harder to check, there are not that many ways to write a Runge-Kutta integrator with variable stepsizes (it's basically a set of for statements in a precise order, except for the variable step integrator part), but the "fingerprints" pinpointed some odd stuff that could be checked manually.

Basically it was a fun project that got me a little into Lex, the kind of odd stuff I do in Saturday afternoons

I was working on a very similar project myself recently [1], heavily inspired by ydiff. Unfortunately, I didn't implement the tree diff algorithm entirely correctly, and it's been on hiatus for over a year :(.

The one interesting addition my project had was merging. The neat trick was that we reused the same tree diff algorithm to find conflicts :P. With a bit of work, we would have some very neat features, including the ability to resolve certain conflicts which physically overlap.

[1]: http://jelv.is/cow

I thought of a variant of this approach for distributed code storage/reuse. Store each function as AST with normalized variable names. Then, a cryptographic hash of this would uniquely identify algorithm. So, caching becomes quite easy and very scalable. Callers refer to specific version by hash also. Thus your entire program can be identified by a single hash. Nice to distribute and cache at web scale. What do you think ?

It's what we need. This became clear shortly after learning git--it's the right data structure for the problem.

The hurdle is that hashes aren't user-friendly in a text-based code editor. We need an editor that lets us view and work at the right level of abstraction.

Lovely to see Racket being used. Given the pure meta-ness of Racket hopefully ydiff will not only evolve into a full-fledged version control system but a completely configurable one at that, akin to the power of Emacs and it's Lisp. Having something like a .yinrc where I could quickly throw around and test out custom functionality would be awesome. I haven't gotten much into git's plumbing yet, but I don't see a culture of hacking it like Emacs and Vim, and I'm not sure why that couldn't be. There's definitely a need, since you have things like git-flow. With structural versioning I think the possibilities of customization and automation would be greatly expanded, and I'd love to be able to use an .*rc file like I use my .vimrc daily, as a scratchpad for small scripts or for fleshing out rough ideas into later externalized plugins.

That's why I love Mercurial. The API is great, people build great things with it, and it has a policy of "default to sane, simple and working; let users tweak stuff easily".

In my experience, git has that culture of hacking, but it isn't as visible. Things get added to the "contrib" folder pretty quickly; but few ever get the polish to graduate out. On my box, contrib is installed at `/usr/share/git/`.

Just create an executable named git-WHATEVER and put it somewhere in $PATH (or $GIT_EXEC_PATH).

We use a similar thing internally at Wolfram to diff Wolfram Language expressions, though I believe it doesn't have the subexpression matcher. But the diffs are much clearer when you have full m-expressions instead of s-expressions.

The coder who wrote it (@wmacura) now works at Tumblr.

Relatedly, see Rich Hickey's http://blog.datomic.com/2012/10/codeq.html

Wonderful, although I would rather have the AST integrated with GIT and not as some other kind of version control.

As long as you can convert the AST into a git Tree Object[0] you should be good to go.

[0] http://git-scm.com/book/en/Git-Internals-Git-Objects#Tree-Ob...

Really neat. Shall have to play with it.

Related to this, I wish there was a way to navigate a programme based on a tree, not just line and character navigation.

Have you tried Paredit?

This is great. I've wanted to build a similar tool for a couple of years, the distinction being that I want a structural grep utility. I'd pass in a pattern in the form of a code snippet (with pattern operators), and have the thing search for matching code structures. I suspect that this tool does 95% of what would be needed for that, very nice.

It's not what you want, but, as perhaps a first step, there's an SGML sgrep (http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html). Perhaps one might combine it with programming based on Oleg's SXML (http://okmij.org/ftp/Scheme/xml.html) and similar.

EDIT (fixing formatting, and …): I knew something else was nagging at me. The idea of a structural grep makes me think of Rob Pike's structural regular expressions (http://doc.cat-v.org/bell_labs/structural_regexps). The Sam editor (http://sam.cat-v.org) is based on this, and I thought I remembered a more modern version called something like 'e'; but, of course, such a name is impossible to Google.

Awesome, thanks for the links.

Cool, but not perfect:

at the end of void ArmDebugger::Stop(Instruction* instr) (rhs of http://www.cs.indiana.edu/~yw21/demos/simulator-mips-simulat... )

we have an alignment of ' 2 * Instruction::kInstrSize' to some random code at the end

Semantic Merge [1] is a product based around semantic diffs for C#/VB.NET. Their demo is quite impressive, and they're working on Java and C++ next.

[1]: http://www.semanticmerge.com/

Worth noting this was written up a year ago and there hasn't been a whole lot of progress since. Would love to see this idea progressed further, this seems really neat.

can this be pointed at bash scripts?

Name too close to "yiff" to take seriously.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact