Since this is built on top of tree-sitter, it can be extended[0] to work with other languages as well, no matter how obscure, as long as a tree-sitter grammar exists. This IMO really highlights the power of having an ecosystem built around tools like tree-sitter because they allow for powerful dev UX tools to be more democratized. Excellent syntax highlighting, error recovery, linting, now tree-sensitive diffing can be provided to languages big and small.
As impressive as the tree-sitter ecosystem is, there is a certain danger in building tooling on top of it. Unless tree-sitter is used as the parser in the actual language compiler, it is almost always going to be an approximation to the language grammar. This is somewhat risky if it is important that any kind of semantic analysis, including diffing, is correct. In the case of difftastic, what worries me is that it might be possible to inject malicious code into a repo by exploiting a bug in the tree-sitter grammar/parsing implementation such that the malicious code is hidden from difftastic because, for example, it is treated as whitespace when it actually isn't. This of course wouldn't be a problem if PRs are being handled by github, but what if down the road github starts offering difftastic (or something similar) as a diff option for PRs?
Admittedly, this is a somewhat far fetched scenario that I have posited but I am somewhat concerned that the proliferation of incorrect tree-sitter grammars is going to lead to the worst kind of problem down the road: tools that work great _most_ of the time.
Isn't this the case for any tool that does semantic analysis, including any interesting feature of any IDE? Even in the realm of basic text editors, there's this classic exploit: if your editor and language disagree on what characters are allowed to end a line, you can smuggle lines of code into a program that appear to be commented out but are actually executed (or the inverse, lines that appear to be executed but are actually commented out).
in theory a compiler could have a mode which gives any tool results it can display as errors or whatever anyone needs. I think this was one of the original goals of the language server protocol, which I still do not understand the need for at all.
I don't know if any languages do this.
in reality damn near all languages are parsed easily enough that there is no difference.
> This IMO really highlights the power of having an ecosystem built around tools like tree-sitter because they allow for powerful dev UX tools to be more democratized
If microsoft, tomorrow, pulled the plug on the pylance LSP and removed all access, and the only thing left is the opensource version, the world is still at a better position than if they hadn't invested any money or opensourced anything.
This is strictly different from google owning chrome, because with that ownership, they have the ability to dictate protocols for the web.
The LSP is a protocol, which now is quite widespread, and will continue to survive and be enhanced, regardless of microsoft's involvement or not. The fact that they can choose to make some language servers close-sourced sucks, but being able to access a closed source component is _still_ strictly better than having zero access in the first place.
Embrace, extend, and extinguish [1]. This is the extend phase.
Microsoft has the resources and ability to make better language servers which could be closed source now, so people end up switching to those tools, owned by Microsoft, with better language servers, killing other editors if they are not able to catch up.
E.g.: Why would a C# programmer use Neovim if the language server is worse than the one that VSCode has? (Which is now proprietary and closed-source). The difference right now is insignificant, but the future tools and features that they are adding to the new proprietary C# language server will not be available for other editors.
they can only extinguish by forcefully bankrupting alternatives. Unfortunately for microsoft, there's no such thing as bankrupting an opensource project.
If neovim's feature set can't keep up with the might of microsoft's development budget for vscode, and many users switch over, then it's a net-win for those users. If microsoft then removes the C# LSP afterwards (or charges money for it), neovim is still around. Those users who don't want to pay can move back to neovim (and may be someone would then start contributing to it again, and over time, the features of neovim will improve again).
Microsoft killed netscape by depriving it of a business model. But you cannot deprive opensource of a business model, because that model is not profit oriented.
Unless microsoft patents the LSP protocols and stops other people from using it (and even then, other editors do not die, unlike netscape), there's no real way for microsoft to EEE.
Google can deprive firefox of a business model (directly by stopping their payment to mozilla, and indirectly, by making the web protocols so restrictive that firefox cannot implement them properly, such as DRM). Therefore, google is in a position to EEE other browsers.
> But you cannot deprive opensource of a business model, because that model is not profit oriented.
Lots of open source does, in fact, come out of a profit oriented business model. Open source is harder, but not impossible, to kill by killing the business model of whoever runs the current project because open source provides rights to other people to pick it up even if the original sponsor’s interest becomes to end it, so as long as there is someone or some group with the resources and interest, it can continue.
tree-sitter is the wrong choice here IMHO, because it requires that most languages reimplement their parser in tree-sitter’s DSL, rather than reusing a compiler's existing parser.
A better interface would be a simple binary that produces JSON output from AST, and another binary that produces AST from the same JSON output. Then you'd do a diff on the JSON and converting it to AST before displaying. Then you’d only need to modify the compiler to print out the AST as JSON (and read it in as JSON), instead of reimplementing the parser.
This is a really cool example of tree diffing via path finding. I noticed that this was the approach I used when I did tree diffing, and sure enough looks like this was inspired by autochrome which was inspired by my post (https://thume.ca/2017/06/17/tree-diffing/).
I'm curious exactly why A* failed here. It worked great for me, as long as you design a good heuristic. I imagine it might have been complicated to design a good heuristic with an expanded move set. I see autochrome had to abandon A* and has an explanation of why, but that explanation shouldn't apply to difftastic I think.
I think (maybe I’m wrong) that your graph searches correspond to diffing single lists and you can have an expensive diagonal step to recurse into two sublists whereas the tool in this post has extra nodes for every token and extra edges for inserting/deleting delimiters. That seems to be the biggest difference to me and I guess is what you mean by it being complicated to design a good heuristic for the expanded move set. I agree it sounds complicated. I think that my guess was that bigger graphs would make things harder but that isn’t a reason for A* to fail.
Although, I do not have much to add; using `difftastic`[0] & `delta` [1] is a very cool combo to make _git_ a little more approachable for newbies like me.
I use delta as my daily driver but sometimes when I want the contextual info, switching to `env GIT_EXTERNAL_DIFF=difft git log -p --ext-diff` gives a better picture.
I feel the visual eye-candy provided by delta far more appealing than what difft does.
Also, delta is much more configurable and has nice binding for moving through hunks.
That's probably the only reason, otherwise, you are correct. I would want the context always.
My dream would be to have a three-way merge tool that worked like this at a semantic level. It feels like merges almost always have the information needed to automatically resolve, but our line-based tools are too simple to see it.
Geez, I've been looking at diff programs and the personal licenses for them are so expensive. I get they're incredibly useful when you need them but I personally don't need a 3rd party differ very often. Kaleidoscope is $250ish IIRC.
The VSCode extension "Diff & Merge" give you the right/left arrows to merge lines if anyone is looking for a tool that does that. I haven't needed another one since I found that.
It's still around and getting periodic updates. I originally bought it mainly to give Sublime more money (after using the Sublime Text trial for most of college), but it ended up being useful and I use it daily
Visually easier on my eyes, easier to scroll, I can filter out `package-lock.json` file from the diff (with a command line argument). In the end it's a preference thing, not an "objectively better" thing.
It looks like difftastic can support way more languages. I haven't used semantic merge (or even heard about it before this) so idk if its language support is somehow better though.
This post reminded me of the Trail Of Bits post about Graphtage: https://blog.trailofbits.com/2020/08/28/graphtage/ Graphtage is written in Python and can be used as a library. But it seems Difftastic can be used as a git diff tool directly.
IIRC the configurable diff tool is just to show diffs to the user, not for internal storage. So everyone on your project can use a different diff tool without problems.
An analogy would be like of the editor. Think of this tool as the editor that one uses. It does not matter to git what editor one uses; similarly, it does not matter what diff pager one uses :)
Yes, you can turn it on without having any side-effects for others.
You can use difftastic as your default git diff tool, but you can also use it as an opt-in diffing tool. I recommend using it as an opt-in, but defining a git alias so you can do 'git difft'.
I like that this seems to have ways to address two common problems with tree diffs:
1. The nesting/unnesting/merging/splitting problem that many tree diff algorithms have a hard time with are handled here by allowing the diff to see insertion/deletion of delimiters. I guess one way to see it is that the tool is doing a text level diff augmented with the tree structures to calculate the cost of diffs and choose the cheapest one it can find.
2. The problem of syntax errors. I think this just depends a lot on how well tree sitter copes with syntax errors (or weird syntax that’s hard to parse or eg a committed merge conflict) but my understanding is that it is designed to cope ok with syntax errors.
I basically felt that tree diffing was not very viable because of these issues but seeing this project I think I’ve changed my mind. I guess it remains to be seen how good performance is though (maybe this isn’t good enough yet if it can sometimes be very slow).
Dealing with syntax errors in parsers is an interesting subject. To do it really well I think you “just” have to make the error into an AST node similar to the other nodes in the tree, so that you always have some kind of AST to work with.
The problem is that you want parsing to be tolerant of errors in the sense that a single syntax error doesn’t cascade to incorrect parses far away. “Just” having an error mode means your grammar is very ambiguous and so you want a parser that will stably choose “reasonable” parses, which I think is hard. (But maybe it’s harder in the abstract and easier with heuristics on indentation and suchlike)
Though, to be clear, I think you put the word in quotes because you think the problem isn’t trivial.
No, it’s not trivial and it does require thinking about your grammar before you decide to lock down your language design. A lot of languages have some pretty obvious synchronization points though, like semicolons and curly braces. If all else fails you can always read up to the next semicolon or curly brace and dump whatever you get into a syntax error node, then keep going. This is especially true if you are implementing an LSP server and so you expect to regularly see incomplete lines simply because the user hasn’t finished typing them in yet.
Ultimately what you really want is many different kinds of specialized error (and warning) nodes for different mistakes users might make. For example, in C++ the statement `auto x = 1,000,000;` is technically correct but probably not in the way the user wanted. So you put a warning node there to record that they might want to use `1'000'000` instead. If you see something like `auto x = 42\n if (y) {…`, then you might guess that there is a missing semicolon before the newline. If a speculative parse of the next line succeeds, then you put a MissingSemicolon node into the AST (instead of one saying that the “if” was unexpected), followed by whatever the speculative parse found.
Now repeat that process for five years as you polish your compiler into a sparking user–friendly gem.
tree-sitter has first class support for parsing errors as ERROR nodes in the output tree. I treat these as just another atom in the s-expression.
In practise I haven't noticed any issues yet. I suspect that difftastic doesn't see many syntax errors because users have fixed most of them by the time they run the diff. When you look at diffs of committed code you hopefully have no parse errors at all.
The tree-sitter parsers could still reject valid code I suppose. I worry slightly more about the parsers getting precedence/associativity wrong, but it would be hard to construct an example that produce identical parse trees due to incorrect precedence.
This article loses me when it gets to the "calculating the diff" section.
>Autochrome and difftastic represent diffing as a shortest path problem on a directed acyclic graph. A vertex represents a pair of positions: the position in the left-hand side s-expression (before), and the position in the right-hand side s-expression (after).
>The goal is to find the shortest route from the start vertex (where both positions are before the first item in the programs) to the end vertex (where both positions are after the last item in the program).
I don't understand what this means at all? What is a "position" here? Position of what? If it's a graph, what do the edges represent? The diagrams afterwards aren't very helpful either, I can't make head nor tail of them. They talk about a "start" vertex and an "end" vertex, but before that it said that a vertex is a pair of start-end positions ... I'm totally lost.
Every graph vertex represents a pair of pointers (or positions) to AST nodes. So in the example, the start program is `A` and the end program is `X A`. The positions point to AST nodes in these programs.
I try to use 'vertex' consistently for graph vertices, to avoid confusion with AST/s-expression nodes. If you have any suggestions for better terminology I'd be very interested too :)
I got lost on the last example: `(foo (bar))` -> `(foo (novel) (bar))`. The diff that adds `novel` and a pair of parentheses seemed impossible -- wouldn't you also have to delete and re-add `bar`?
Writing this comment it occurs to me that the structural diff doesn't translate to plaintext very well, and thus is not accessible to folks with red-green colorblindness.
Thanks for the feedback! I'll look to clarify the wording here. The failed example was a real output that difftastic gave in early versions.
I agree that the classic red/green colour scheme of diffs isn't great for colourblind users. I've asked a few colourblind developers and they were happy with terminal ANSI colours (which is what difftastic uses), because they can configure each colour individually.
I think that example is an error. What they possibly meant was you can perform the change by inserting “novel) (”, which may structurally translate to inserting “novel” in the bar list and then splitting that list.
Advanced diff tools usually allow configuring the colors, so that if the red/green is a problem, you can change it to red/blue or blue/yellow or whatever.
I don’t think it was an error, given how the article describes its handling of delimiters. In the example:
(foo (bar))
(foo (novel) (bar))
The balanced parentheses around bar “haven’t changed”, but the symbol between them “has”, and the balanced parentheses around bar were “added”. It’s a perfectly valid diff given the approach, even if it’s not ideal, as the author found.
The only way you could arrive at “novel) (” given the algorithm, I think, would be something like
(foo (bar))
(foo novel) ((bar))
(This might be off too, hard to write code on mobile)
It is interesting to see the use of the A* path finding algorithm for finding optimal matchings of nodes.
The approach from Chawathe et. al splits nodes from the before/after trees into chains by their label in the syntax grammar, and then runs myers’ longest common subsequence on each pair of chains. Some parameters t, f are used to have an approximate ‘equals’ method for subtrees.
I’d be curious to see if this approach handles re-ordering of nodes better. The ‘fastMatch’ algorithm described above will typically miss matching cases where a node that is not order sensitive (i.e a function in a namespace can be moved somewhere else in that namespace).
Most tree diffing papers that I've seen focus on either (1) providing a minimal diff and accepting the performance cost or (2) providing a relatively minimal diff and focusing on the performance.
I've generally found that you need a minimal diff to get a good result, so papers in (2) are less applicable. I've also found several cases where there are several possible minimal diffs, but there's a clear 'correct' answer from the user's perspective.
Difftastic doesn't handle moves: the edit set is add, remove, or replace similar comment. If you reorder functions, it will take the largest unchanged subset. Moves are hard to model in a diffing algorithm, but they're also very hard to display coherently in the UI.
I know a few code forge websites (e.g. Phabricator) show moves in a fairly comprehensible way, although they're all based on line-based diffs.
Smarter diffs are great, but what is especially nice is smarter merges. I would love to see this extended to be able to do the same kinds of auto-merges that SemanticMerge could. I would pay for that functionality (just like I paid for SemanticMerge for years before they took it away to instead entice people to pay for their stupid VCS).
Nope. They integrated the functionality with Plastic SCM. And what's especially annoying is that they don't even grandfather existing customers into being able to continue using SemanticMerge. Once your current license expires, that's it, no more SemanticMerge.
This makes sense to me on a high level, but I would love to see more practical examples so I can quickly wrap my head around use cases without having to download and use the tool first.
Try the diffoscope-minimal variant and disable recommends when installing. You'll lose support for a lot of different formats. You can also try it online using the service instead of locally:
I see some Clojure examples in the article. We use Clojure almost exclusively and this would be a great replacement for line-based Git diffs, providing much more insight in actual changes.
Is there any chance such an alternative differ could be used in Git (and adjacent tools like GitLab), or are we stuck with line-based forever?
Nice! Will def check it out. What I don’t understand is why it sees “value”->target as a diff.
Also inside the if statement, target was semantically unchanged
The simple answer is that renaming a variable (or function parameter) is a semantic change - even if it doesn’t, say, change the compiled code, it affects how a human would read and understand the code. Plus, in many languages, function parameter names are indeed semantically meaningful and can be accessed via features like reflection or keyword arguments.
AST is first and foremost a representation of a parse, i.e. it shows how a serialized representation of _something_ was understood by the machine.
Due to this constraint, it is unlikely to be the most convenient way for a human to input that something.
Take arithmetic expressions, for example. The usual infix syntax as edited in a conventional text editor is very good, probably the best option that humans have for editing arithmetic expressions. Having to edit an AST structure explicitly would almost certainly be a loss in terms of convenience and productivity.
Or take a (raster or vector) image editor. They provide interfaces to editing image data in a much more convenient way than editing AST or other internal representations directly.
it's not free, but Beyond Compare 4 has a clipboard comparison tool you can install as a task-bar icon in windows (I think that also works in linux, but IIRC it's harder to setup - been a while)
I knocked this together because that sounds like a kind of handy thing to have generally. The UX is _pretty rough_ to say the least at the moment.
https://github.com/Tacklebox/cursh
You can use it like `cursh difftastic {{1}} {{2}}` and it will prompt you for each {{#}} to hit enter once you have the thing you want in your clipboard and it passes the content to the command as files
Not sure if I understand correctly, but wouldn't one of the two "operands" need to be a file in any case (the linked tools would require you to paste two times as well)?
In that case, assuming that one file is OK and only the other text needs to be pasted, you can diff against stdin. On macOS there is `pbpaste`, which prints the contents of the pasteboard to stdout. This allows you to do `pbpaste | diff the-file -`, with `-` being the commonly used "filename" for stdin.
I also work this way a lot; in fact, it's probably the majority of my non-VCS-related diff tool use at work. It's very common for me to have data only on the clipboard: watch window output, TTY output, similar-looking snippets posted in Slack, that sort of thing.
I use Araxis Merge at work, and Meld at home, both of which can work this way. (Araxis is much better, but I can't justify the cost personally. Meld is fine, just a bit slow.)
[0] https://difftastic.wilfred.me.uk/adding_a_parser.html