
Although slightly off-topic, I find the mentioned idea of "syntax trivia" very compelling. This should not only be part of code parsers, but of any parsers.

I thought about something similar for binary parsers. Here, too, there are often many different ways to express a certain structure. Most libraries (such as Exiv2 for PNG/JPEG/EPS/etc. metadata) perform the following procedure: First, they parse out all the information they are interested in. Then the application modifies that structure. Then it has to be embedded back into the binary format (here, the image). That last step is a nasty type of merging. They go over the file, try to find places where they can "safely" embed or replace parts, then serialize their stuff and try to put it there.

However, if the binary parser were lossless, all changes to the parsed structure would preserve the "trivia", so serializing is straightforward. And any potential issues could be handled at the application level in the parsed structure, rather than by guesswork and heuristics during the merge phase.

The only problem could be large BLOBs, but those could easily be represented as offset & length in the original file, rather than as the actual binary data in memory. (This assumes the file will not change in the meantime, but in that case the "merging" approach is very dangerous, too.)
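To make that concrete, here is a rough sketch in Python. It assumes a made-up container format (4-byte tag, 4-byte big-endian length, payload), not a real image format and not how Exiv2 works. `Chunk`, `parse`, `serialize` and `make` are hypothetical names. Payloads are only referenced by offset and length, and an untouched chunk is copied back verbatim on write.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional

# Toy chunk header (an assumption for this sketch): 4-byte tag, 4-byte big-endian length.
HEADER = struct.Struct(">4sI")

@dataclass
class Chunk:
    tag: bytes
    offset: int                          # where the payload starts in the original buffer
    length: int                          # payload length in the original buffer
    replacement: Optional[bytes] = None  # set only when the application edits this chunk

def parse(data: bytes) -> List[Chunk]:
    """Record every chunk, known or not, without copying any payload."""
    chunks = []
    pos = 0
    while pos < len(data):
        tag, length = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        if pos + length > len(data):
            raise ValueError("truncated chunk")
        chunks.append(Chunk(tag, pos, length))
        pos += length
    return chunks

def serialize(data: bytes, chunks: List[Chunk]) -> bytes:
    """Write chunks in their original order; unedited payloads come from the original bytes."""
    out = bytearray()
    for c in chunks:
        if c.replacement is not None:
            payload = c.replacement
        else:
            payload = data[c.offset:c.offset + c.length]
        out += HEADER.pack(c.tag, len(payload)) + payload
    return bytes(out)

def make(tag: bytes, payload: bytes) -> bytes:
    """Helper to build one chunk for the demo below."""
    return HEADER.pack(tag, len(payload)) + payload

original = (make(b"META", b"author=alice")
            + make(b"BLOB", bytes(1000))
            + make(b"XTRA", b"unknown to us"))

chunks = parse(original)
assert serialize(original, chunks) == original  # lossless round trip

chunks[0].replacement = b"author=bob"            # edit one chunk only
edited = serialize(original, chunks)
assert edited.endswith(make(b"XTRA", b"unknown to us"))  # unknown chunk survives as-is
```

A real format would also need checksums, padding and ordering rules, but those can live in the parsed structure instead of in a merge heuristic.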

Syntax trivia is a very important component and a pretty natural fit once you start thinking about the feature that we decided was absolutely necessary: full-fidelity round-tripping between parsed syntax trees and the original source.

I think this is one of the most common mistakes compiler writers make -- they don't spend enough time thinking about incorrect code. Let me be clear: most code is wrong. If your code is right, you compile it once. If your code is wrong, you compile it many times until you get it right. More importantly, if you run your IDE off your compiler, almost every character you type is a wrong program.

Having a full-fidelity syntax tree is essential for having great experiences with wrong code. In addition, it easily solves the problem of having to serialize your trees -- the source text is the serialization.
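As a toy illustration of "the source text is the serialization" (a sketch only, not the actual compiler API; `Token`, `lex` and `to_source` are hypothetical names): every token carries the whitespace and comments in front of it, the lexer never rejects input, and joining the tokens gives back the exact text, even for a program that doesn't parse.

```python
import re
from dataclasses import dataclass
from typing import List

# Trivia in this sketch: runs of whitespace and '#' line comments.
TRIVIA = re.compile(r"(?:\s+|#[^\n]*)*")
# Tokens: identifiers, integers, or any single character, so no input is ever rejected.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|.", re.DOTALL)

@dataclass
class Token:
    leading_trivia: str  # whitespace and comments that precede the token
    text: str            # the token itself; empty for the end-of-file token

def lex(source: str) -> List[Token]:
    tokens = []
    pos = 0
    while True:
        trivia = TRIVIA.match(source, pos).group()
        pos += len(trivia)
        if pos == len(source):
            # The end-of-file token owns whatever trivia is left over.
            tokens.append(Token(trivia, ""))
            return tokens
        text = TOKEN.match(source, pos).group()
        tokens.append(Token(trivia, text))
        pos += len(text)

def to_source(tokens: List[Token]) -> str:
    """Serialization is just concatenation."""
    return "".join(t.leading_trivia + t.text for t in tokens)

broken = "x = (1 +   # oops, unfinished\n  foo("
assert to_source(lex(broken)) == broken  # round-trips even though it is not a valid program
```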

This feeds into API considerations as well. A number of people have repeatedly talked about how cool it would be to have a lot of other tools understand your AST (like if Git could support checking in ASTs instead of source). This is the wrong way of looking at it. If you're dealing with a raw AST you often have to have domain-specific knowledge of the language itself. Instead, what you want is to take the thing with the most domain-specific knowledge, the compiler, and allow it to answer questions, i.e. have an API. By having round-trippable source, all source is essentially given a transparent API that can be used just as if you were interacting with the source code itself.

Anyway, this is going off the rails, but it's one of the numerous things I'd point to for many production compilers and say, "you are the past, this is the future."

The idea of "checking in ASTs" becomes a lot more compelling when you think about representing a diff of an automated refactoring. For example, a simple rename refactoring, performed by an IDE for a statically typed language, is completely safe, yet it could generate a 1000-line diff. This is hard to review, causes merge conflicts, does not rebase correctly, etc.

Instead, if you could somehow check in the refactoring action, all those problems would go away. You could rebase the code by undoing and redoing the rename, taking into account new usages of the renamed item, etc.

I think it's a nice idea, but seems difficult in practice.

First, I think embedding language knowledge into the VCS is fraught with peril. For one, does that mean that you need to rev your VCS version every time your language changes? What about when your language revs its AST, but not the language itself? Is your VCS version now no longer backwards compatible with old versions?

Second, I think there's a significant amount of overhead and new technology here. Most DVCSs currently use hash-based filesystems for storing history. If you replace simple data diffs with semantic transformations then you have to find some portable way of encoding that. If you don't want the implementation to be language-specific then you have to find some language-agnostic encoding system that can also recognize that the textual diff and the alpha-rename are identical commits.

IMHO, I would rather have metadata on commits. That way you can always fall back to plain text and all the old tools (like an ancient vi) remain usable, but more advanced language-specific tooling could recognize these things and provide a simple view to the user.

I really like how syntax trivia allows you to completely ignore comments and whitespace, if you want, and focus exclusively on your problem domain (source code). But it's also available for when you do want to inspect or modify it.
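Roughly what I mean, as a small Python sketch. This is a toy splitter with hypothetical names (`pieces`, `significant`, `rename`), not Roslyn and not Wasabi, and it doesn't understand string literals. Trivia is kept as separate pieces here, so you can filter it out when you only care about the code, or leave it alone while you edit.

```python
import re

# Trivia (whitespace, '#' comments) is tried first; anything else becomes a token.
PIECE = re.compile(
    r"(?P<trivia>\s+|#[^\n]*)|(?P<token>[A-Za-z_]\w*|\d+|.)",
    re.DOTALL,
)

def pieces(source):
    """Split source into (kind, text) pairs; joining the texts restores the source."""
    return [(m.lastgroup, m.group()) for m in PIECE.finditer(source)]

def significant(ps):
    """The view you use when you only care about the code itself."""
    return [text for kind, text in ps if kind == "token"]

def rename(ps, old, new):
    """Rename an identifier; comments and whitespace pass through untouched."""
    return [(kind, new if kind == "token" and text == old else text)
            for kind, text in ps]

source = "total = count  # count of items\nprint( total )\n"
ps = pieces(source)

assert "".join(text for _, text in ps) == source
assert significant(ps) == ["total", "=", "count", "print", "(", "total", ")"]

renamed = "".join(text for _, text in rename(ps, "count", "n"))
assert renamed == "total = n  # count of items\nprint( total )\n"
```

Note that the word "count" inside the comment is left alone, because the rename only looks at tokens.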

Because Wasabi doesn't implement syntax trivia, there's no way to losslessly round-trip Wasabi code. Whitespace is not preserved, and comments are always assumed to be on a line of their own.

Did your team invent the idea, or is there prior research I could read?

It's novel, as far as I know. I wouldn't be surprised to see inspirations from a bunch of places (Microsoft's programming language history is old and storied[1] and we've had many teams working on many languages for many years with many compilers), but we do actually have a patent on this structure[2] (which is now open because we published under the Apache 2 license, which has a patent grant).

[1] Microsoft's first product was a compiler for Altair BASIC, written by Gates himself

[2] https://www.google.com/patents/US20130152061

> but we do actually have a patent on this structure[2] (which is now open because we published under the Apache 2 license, which has a patent grant).

Not meant to be snippy, but ...

If you publish this under patent grant, why did you patent it in the first place?

Apart from that, I must say that I'm very glad to live in a jurisdiction where these types of patents are void. I could have written down this idea long before 2011, but I never thought this would be worth patenting. (Don't get me wrong, the concept is really great, but to me it doesn't make any sense to prevent others from implementing the same idea. It's not like Microsoft had to invest thousands of dollars into research to develop this idea, and aims to refinance that development effort via patent licenses.)

> If you publish this under patent grant, why did you patent it in the first place?

It could be part of a defensive patent strategy. If they successfully received a patent for that, then it makes it that much harder for a malicious third party to troll them. And if you ask me, turning around and publishing the patent under the APL demonstrates good faith to the /libre/ software community (it's a GPLv3-compatible license, no less), so that's a strategic win, too.

Wow, this is an interesting strategy. To summarize, just to make sure I got this right:

    1. They make use of the patent protection clauses via APL (and also GPLv3)
    2. They use them to protect it from their own patent

Result: Their patent ensures that all Free Software implementations must use a strong Copyleft license - APL, GPLv3, AGPLv3 or similar. (Or they have to try to get a patent license from MS.)

What's still unclear to me: Does it affect other implementations? Assume that Clang/LLVM wants to implement "syntax trivia". Will they have to use exactly their implementation, and thus have to switch from their BSD-style license to APL or (A)GPL?
