I thought about something similar for binary parsers. Here, too, there are often many different ways to express a certain structure. Most libraries (such as Exiv2 for PNG/JPEG/EPS/etc. metadata) perform the following procedure: first, they parse out all the information they are interested in. Then the application modifies that structure. Then it has to be embedded back into the binary format (here, the image). That last step is a nasty kind of merging: they go over the file, try to find places where they can "safely" embed or replace parts, then serialize their stuff and try to put it there.
However, if the binary parser were lossless, all changes to the parsed structure would preserve the "trivia", so serializing is straightforward. And any potential issues could be handled at the application level, in the parsed structure, rather than by guesswork and heuristics during the merge phase.
The only problem could be large BLOBs, but those could easily be represented as offset & length in the original file, rather than as the actual binary data in memory. (This assumes the file will not change in the meantime, but in that case the "merging" approach is very dangerous, too.)
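A rough sketch of that idea (the names here are mine, not from any particular library): instead of holding the blob's bytes, a parsed node keeps only where they live in the original file, and serialization streams them back out on demand.

```python
class BlobRef:
    """Placeholder for a large BLOB: stores its location in the
    original file instead of the bytes themselves."""

    def __init__(self, path, offset, length):
        self.path = path
        self.offset = offset
        self.length = length

    def read(self):
        # Materialize the bytes only when actually needed.
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            return f.read(self.length)


def write_back(out_path, parts):
    # Untouched blobs are streamed from the original file, so
    # serialization never has to load them into memory.
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read() if isinstance(part, BlobRef) else part)
```

The danger mentioned above shows up here directly: if the original file changes between parsing and `read()`, the `BlobRef` silently returns the wrong bytes.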
I think this is one of the most common mistakes compiler writers make -- they don't spend enough time thinking about incorrect code. Let me be clear: most code is wrong. If your code is right, you compile it once. If your code is wrong, you compile it many times until you get it right. More importantly, if you run your IDE off your compiler, almost every character you type is a wrong program.
Having a full-fidelity syntax tree is essential for having great experiences with wrong code. In addition, it easily solves the problem of having to serialize your trees -- the source text is the serialization.
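To make "the source text is the serialization" concrete, here's a minimal toy sketch (my own, not the actual API of Roslyn or any real compiler): every token carries the leading and trailing trivia around it, so flattening the tree reproduces the file byte-for-byte.

```python
class Token:
    """A token that remembers the 'trivia' (whitespace, comments)
    around it, so nothing from the source text is lost."""

    def __init__(self, leading, text, trailing):
        self.leading = leading    # e.g. "  " or "# a comment\n"
        self.text = text          # the token itself, e.g. "x"
        self.trailing = trailing

    def to_source(self):
        return self.leading + self.text + self.trailing


class Node:
    """A tree node just concatenates its children, so round-tripping
    is trivial and exact, even for wrong or half-typed code."""

    def __init__(self, children):
        self.children = children

    def to_source(self):
        return "".join(c.to_source() for c in self.children)
```

Parsing something like `x  = 1  # init` keeps the extra spaces and the comment attached to the tokens, so `to_source()` returns the input unchanged.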
This feeds into API considerations as well. A number of people have repeatedly talked about how cool it would be to have a lot of other tools understand your AST (like if Git could support checking in ASTs instead of source). This is the wrong way of looking at it. If you're dealing with a raw AST you often have to have domain-specific knowledge of the language itself. Instead, what you want is to take the thing with the most domain specific knowledge, the compiler, and allow it to answer questions, i.e. have an API. By having round-trippable source, all source is essentially given a transparent API that can be used just as if you were interacting with the source code itself.
Anyway, this is going off the rails, but it's one of the numerous things I'd point to for many production compilers and say, "you are the past, this is the future."
Instead, if you could somehow check in the refactoring action, all those problems would go away. You could rebase the code by undoing and redoing the rename, taking into account new usages of the renamed item, etc.
First, I think embedding language knowledge into the VCS is fraught with peril. For one, does that mean that you need to rev your VCS version every time your language changes? What about when your language revs its AST, but not the language itself? Is your VCS version now no longer backwards compatible with old versions?
Second, I think there's a significant amount of overhead and new technology here. Most DVCSs currently use hash-based filesystems for storing history. If you replace simple data diffs with semantic transformations, then you have to find some portable way of encoding that. If you don't want the implementation to be language-specific, then you have to find some language-agnostic encoding system that can also recognize that the textual diff and the alpha-rename are identical commits.
IMHO, I would rather have metadata on commits. That way you can always fall back to plain text and all the old tools (like an ancient vi) continue to still be usable, but more advanced language-specific tooling could recognize these things and provide a simple view to the user.
Because Wasabi doesn't implement syntax trivia, there's no way to losslessly round-trip Wasabi code. Whitespace is not preserved, and comments are always assumed to be on a line of their own.
Did your team invent the idea, or is there prior research I could read?
Microsoft's first product was an interpreter for Altair BASIC, largely written by Gates himself (together with Paul Allen and Monte Davidoff)
Not meant to be snippy, but ...
If you publish this under patent grant, why did you patent it in the first place?
Apart from that, I must say that I'm very glad to live in a jurisdiction where these types of patents are void. I could have written down this idea long before 2011, but I never thought this would be worth patenting. (Don't get me wrong, the concept is really great, but to me it doesn't make any sense to prevent others from implementing the same idea. It's not like Microsoft had to invest thousands of dollars into research to develop this idea, and aims to refinance that development effort via patent licenses.)
It could be part of a defensive patent strategy. If they successfully received a patent for that, then it makes it that much harder for a malicious third party to troll them. And if you ask me, turning around and publishing the patent under the APL demonstrates good faith to the /libre/ software community (it's a GPLv3-compatible license, no less), so that's a strategic win, too.
1. They make use of the patent-protection clauses via the APL (and also GPLv3)
2. They use it to protect users from their own patent
What's still unclear to me: does it affect other implementations? Assume that Clang/LLVM wants to implement "syntax trivia". Will they have to use Microsoft's implementation exactly, and thus have to switch from their BSD-style license to the APL or (A)GPL?
Thanks!
I have also played with transpilers written for 'foreign' languages and then you do end up in a situation where you can only use a (often badly defined) subset of the source language, and debugging is a pain since the output is often very difficult to read. Still they're handy when you need them and can (sometimes) save you time compared to porting the code by hand.
More broadly, and sidestepping a definitional debate, our day to day life is "Take a program state described in Language X then shoot it down the pipe in three other languages and pray the virtual machine knows what it is doing."
Random note: you can totally write a compiler. Writing a really compelling one is a social and technological challenge, but writing a limited one for pedagogical or personal/business use is roughly "an undergrad lab assignment" in terms of difficulty level. I think many engineers get scared by words like "compiler" and "emulator" and "virtual machine" and "programming language" and "operating system" and forget that these are the kind of things that talented-but-very-much-mortal solo developers can ship in weeks for fun.
I doubt a transpiler has any limitations that don't apply to compilers in general, though of course any particular one might.
: Look, I'll be honest, I consider the term "transpiler" completely unnecessary. It's a compiler. Feel free to just keep using transpiler in your reply and I won't say anything again. I've had this argument, I bow to the trend, but this old fogey doesn't intend to change (and still considers it a serious impediment to understanding to believe they are two different things, despite using the exact same techniques, architecture, tools, mental models... grumble grumble).
: My recent previous comment goes into more detail on that claim: https://news.ycombinator.com/item?id=9800231
> Look, I'll be honest, I consider the term "transpiler"
> completely unnecessary.
I do think it is important to make a distinction between a transpiler and a compiler. If you only consider the runtime behavior of the resulting code, sure, the two are roughly equivalent.
But, that totally ignores the human factors involved.
1. You may need to inspect the ___piler output to find bugs in it.
2. A user may read the ___piler output to better understand the semantics of the source language. "Oh, ___ in CoffeeScript means ___ in JS!".
3. The user will need to debug their program. Unless you have very sophisticated debugging infrastructure (hint: source maps are not good enough), then that means they will be stepping through the ___piler output, not the original source.
4. A user may end up discarding the original source and hand-maintaining the ___piled output as the new source of truth. (This use case is exactly what the blog post is about.) Even if a user doesn't actually do this, they may want the psychological security of knowing that they could before they are willing to adopt the language.
I consider a compiler's job to be to generate the most efficient representation of the source program's semantics in terms of the target language. "Efficient" here may mean "fast" or "small" (since download speed matters), or some combination of both.
A transpiler's job is to generate the most similar representation of the source program's semantics in terms of the target. This means preserving function structure, control flow constructs, variable names, and comments whenever possible. The goal is to have as many pieces of the source program recognizably appear in the target.
A transpiler is much better at addressing the human factors above. However, a good one can be more difficult to write than a compiler, and in many cases you lose some runtime efficiency.
Also, knowing you want to transpile often constrains the design of your source language. (For example, this is one reason CoffeeScript is so similar to JS.)
This raises a minor design concern, the serialization of the final AST, to some sort of big deal... but that's no different than optimizing your compiler for space vs. time vs. compile time vs. correctness vs. any of the dozens of other dimensions you may have to optimize on. We don't run around giving them all names. It would be crazy. For instance, you're always worried about the mapping of the source to the target language... that's hardly special to CoffeeScript. How could you possibly create a good compiler without thinking about that?
In the meantime, the cognitive damage of people thinking these are somehow different skillsets is bad. Compilers are mystical enough to people without making up a brand new category of thing to confuse them about.
Go learn compilers, people. You'll find you've automatically learned how to write transpilers too, without a single additional lesson. I can't think of a much better proof than that that they aren't actually different.
> 1. You may need to inspect the ___piler output to find bugs in it.
The set of people who are debugging a C++ compiler by looking at its machine code output is vanishingly small compared to the number of regular working C++ programmers.
> 2. A user may read the ___piler output to better understand the semantics of the source language. "Oh, ___ in CoffeeScript means ___ in JS!".
I'm not aware of anyone who learned C++'s high level semantics by seeing what machine code they are compiled to. Sure, there are a couple of corner cases where you might want to dig into the details of how something like vtables or calling conventions are implemented. But no one I know says, "Hmm, what does 'protected' mean? Let me look at the ASM and see."
But that is exactly how people learn CoffeeScript. Look right on the page: http://coffeescript.org/
> 3. The user will need to debug their program. Unless you have very sophisticated debugging infrastructure (hint: source maps are not good enough), then that means they will be stepping through the ___piler output, not the original source.
Fortunately, we do have very sophisticated debuggers for C++. A handful of real pros may also step through the assembly on rare occasions, but most working C++ programmers never do and never need to.
This is simply not the case for other transpiled languages. There is no such thing as a "CoffeeScript debugger".
> 4. A user may end up discarding the original source and hand-maintaining the ___piled output as the new source of truth.
Heaven help you if you have to do this with the machine code of your C++ compiler. I've heard horror stories of teams that had to do this after losing the original source, but there's a reason we consider those horror stories. I've never heard of anyone doing this deliberately or considering it a feature.
> We don't run around giving them all names.
"Link time optimization", "dead code elimination", "global value numbering", "constant folding", "common subexpression elimination", "static single assignment form", "continuation-passing style", ...
> For instance, you're always worried about the mapping of the source to the target language... that's hardly special to Coffeescript. How could you possibly create a good compiler without thinking about that?
Your C++ compiler writer is never thinking, "How can I make this for() loop look like a for loop in assembly?" Or "How can I maintain this local variable name?"
In fact, they are often doing the opposite: "How can I lower this for loop to a more primitive control flow graph so I can optimize it?" Or "How can I eliminate this local variable entirely if it's never read?"
But that kind of stuff is exactly what a transpiler writer is trying to maintain.
> In the meantime, the cognitive damage of people thinking these are somehow different skillsets is bad.
I don't think having a different term for transpilers causes any cognitive damage. If anything, the name is associated with "lightweight compilers" like CoffeeScript which are more approachable to newcomers than "real" compilers like you learn about in the dragon book.
> You'll find you've automatically learned how to write transpilers too, without a single additional lesson.
I've written both, and I think there is actually quite a bit of difference between the two. Sure, the parsing is the same. But the way you approach code gen is very different between a compiler and a transpiler.
In fact, that often even bleeds forward into the front end. For a compiler, you're in this state of actively discarding information. Comments? Don't even lex them. Variable names? Just de Bruijn index them.
With a transpiler, all of that is precious data that you have to carefully pipe through to make the output code as readable as possible.
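A toy illustration of that difference (entirely hypothetical code, just to show the contrast): the same source, lexed two ways. The compiler-style front end never even records the trivia; the transpiler-style one keeps it, in order, so it can reappear in the output.

```python
import re

# Alternatives, tried in order: line comment, whitespace, real token.
TOKEN_RE = re.compile(r"(#[^\n]*)|(\s+)|(\w+|[^\w\s])")

def lex(src, keep_trivia):
    """Split source into tokens; comments and whitespace are trivia."""
    tokens = []
    for comment, space, tok in TOKEN_RE.findall(src):
        if tok:
            tokens.append(tok)
        elif keep_trivia and (comment or space):
            # A transpiler-style front end keeps this precious data...
            tokens.append(comment or space)
        # ...a compiler-style front end simply drops it here.
    return tokens

src = "total = 0  # running sum\n"
compiler_view = lex(src, keep_trivia=False)    # structure only
transpiler_view = lex(src, keep_trivia=True)   # everything, in order
```

Concatenating `transpiler_view` gives back the original source exactly; `compiler_view` has already thrown that information away, which is fine for a compiler and fatal for a transpiler.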
I don't think any of this is rocket science, and I definitely encourage more people to learn how to do this. If nothing else, it's tons of fun. But transpilers are different from compilers.
They share a lot of techniques, but they have different goals, scope, and requirements. Having a different word for that doesn't seem bad to me.
Honestly, if you want to talk about confusing PL terms, how about "interpreter" and "virtual machine". Now those are ones that cause real confusion.
FYI, for anyone thinking about writing a production compiler front-end, don't do this, it's a terrible idea.
But it seems to me that these should be considered different kinds of optimization, rather than fundamentally different things, e.g. `xpiler -Oreadability` versus `xpiler -Oefficiency`.
Plus, writing a transpiler means you can lean on better-funded compiler makers' optimizations. In our case, the .NET JIT team is way more hardcore at x86 optimization than I have the resources to be. In your case, you get to avoid duplicating the V8 team's work.
It's interesting how many people are out there trying to switch from one language to another!
Whole careers would be ruined by new developers closing one eye and peeking at the last close brace.