Killing Off Wasabi – Part 2 (fogcreek.com)
90 points by GarethX on July 16, 2015 | 37 comments



Although slightly off-topic, I find the mentioned idea of "syntax trivia" very compelling. It should be part not only of code parsers, but of parsers in general.

I thought about something similar for binary parsers. Here, too, there are often many different ways to express a certain structure. Most libraries (such as Exiv2 for PNG/JPEG/EPS/etc. metadata) perform the following procedure: First, they parse out all the information they are interested in. Then the application modifies that structure. Then it has to be embedded back into the binary format (here the image). The latter step is a nasty type of merging. They go over the file, try to find places where they can "safely" embed or replace parts, then serialize their stuff and try to put it there.

However, if the binary parser were lossless, all changes to the parsed structure would preserve the "trivia", so serializing is straightforward. And any potential issues could be handled at the application level in the parsed structure, rather than by guesswork and heuristics during the merge phase.

The only problem could be large BLOBs, but those could easily be represented as offset & length in the original file, rather than as the actual binary data in memory. (This assumes the file will not change in the meantime, but in that case the "merging" approach is very dangerous, too.)
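
To make the idea concrete, here is a minimal sketch (all class names are hypothetical, not tied to Exiv2 or any real format library) of how parsed chunks could carry enough information to be written back losslessly, with large BLOBs stored only as an offset and length into the original file:

    // Hypothetical lossless chunk model for a PNG-like container.
    // Untouched chunks are written back byte-for-byte; large payloads
    // are kept as (offset, length) references into the original file.
    using System;
    using System.IO;

    abstract class Chunk
    {
        public abstract void WriteTo(Stream output, Stream original);
    }

    class RawChunk : Chunk            // small chunk, bytes held in memory
    {
        public byte[] Bytes = Array.Empty<byte>();
        public override void WriteTo(Stream output, Stream original) =>
            output.Write(Bytes, 0, Bytes.Length);
    }

    class BlobChunk : Chunk           // large BLOB, referenced by position
    {
        public long Offset;
        public long Length;
        public override void WriteTo(Stream output, Stream original)
        {
            original.Seek(Offset, SeekOrigin.Begin);
            var buffer = new byte[81920];
            long remaining = Length;
            while (remaining > 0)
            {
                int read = original.Read(buffer, 0,
                    (int)Math.Min(buffer.Length, remaining));
                if (read <= 0) break;
                output.Write(buffer, 0, read);
                remaining -= read;
            }
        }
    }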


Syntax trivia is a very important component and a pretty natural fit once you start thinking about the feature that we decided was absolutely necessary: full-fidelity round-tripping between parsed syntax trees and the original source.

I think this is one of the most common mistakes compiler writers make -- they don't spend enough time thinking about incorrect code. Let me be clear: most code is wrong. If your code is right, you compile it once. If your code is wrong, you compile it many times until you get it right. More importantly, if you run your IDE off your compiler, almost every character you type is a wrong program.

Having a full-fidelity syntax tree is essential for having great experiences with wrong code. In addition, it easily solves the problem of having to serialize your trees -- the source text is the serialization.
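
As a minimal illustration of "the source text is the serialization" (a sketch against the open-source Roslyn API, assuming the Microsoft.CodeAnalysis.CSharp package is referenced): ToFullString() on the root node gives back the original text, trivia and all.

    using System;
    using Microsoft.CodeAnalysis.CSharp;

    class RoundTripDemo
    {
        static void Main()
        {
            var source = "class C { /* keep me */   int    x ;  // trailing\n}";
            var tree = CSharpSyntaxTree.ParseText(source);

            // Full fidelity: serializing the tree reproduces the original
            // text exactly, including comments and the odd spacing.
            Console.WriteLine(tree.GetRoot().ToFullString() == source);  // True
        }
    }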

This feeds into API considerations as well. A number of people have repeatedly talked about how cool it would be to have a lot of other tools understand your AST (like if Git could support checking in ASTs instead of source). This is the wrong way of looking at it. If you're dealing with a raw AST you often have to have domain-specific knowledge of the language itself. Instead, what you want is to take the thing with the most domain specific knowledge, the compiler, and allow it to answer questions, i.e. have an API. By having round-trippable source, all source is essentially given a transparent API that can be used just as if you were interacting with the source code itself.

Anyway, this is going off the rails, but it's one of the numerous things I'd point to for many production compilers and say, "you are the past, this is the future."


The idea of "checking in ASTs" becomes a lot more compelling when you think about representing a diff of an automated refactoring. For example, a simple rename refactoring, performed by an IDE for a statically typed language, is completely safe, yet it could generate a 1000 line diff. This is hard to review, causes merge conflicts, does not rebase correctly, etc.

Instead, if you could somehow check in the refactoring action, all those problems would go away. You could rebase the code by undoing and redoing the rename, taking into account new usages of the renamed item, etc.


I think it's a nice idea, but seems difficult in practice.

First, I think embedding language knowledge into the VCS is fraught with peril. For one, does that mean that you need to rev your VCS version every time your language changes? What about when your language revs its AST, but not the language itself? Is your VCS version now no longer backwards compatible with old versions?

Second, I think there's a significant amount of overhead and new technology here. Most DVCSs currently use hash-based filesystems for storing history. If you replace simple data diffs with semantic transformations, then you have to find some portable way of encoding that. If you don't want the implementation to be language-specific, then you have to find some language-agnostic encoding system that can also recognize that the textual diff and the alpha-rename are identical commits.

IMHO, I would rather have metadata on commits. That way you can always fall back to plain text and all the old tools (like an ancient vi) continue to be usable, but more advanced language-specific tooling could recognize these things and provide a simple view to the user.


I really like how syntax trivia allows you to completely ignore comments and whitespace, if you want, and focus exclusively on your problem domain (source code). But it's also available for when you do want to inspect or modify it.

Because Wasabi doesn't implement syntax trivia, there's no way to losslessly round-trip Wasabi code. Whitespace is not preserved, and comments are always assumed to be on a line of their own.

Did your team invent the idea, or is there prior research I could read?


It's novel, as far as I know. I wouldn't be surprised to see inspirations from a bunch of places (Microsoft's programming language history is old and storied[1] and we've had many teams working on many languages for many years with many compilers), but we do actually have a patent on this structure[2] (which is now open because we published under the Apache 2 license, which has a patent grant).

[1] Microsoft's first product was Altair BASIC, an interpreter written by Gates, Allen, and Davidoff themselves

[2] https://www.google.com/patents/US20130152061


> but we do actually have a patent on this structure[2] (which is now open because we published under the Apache 2 license, which has a patent grant).

Not meant to be snippy, but ...

If you publish this under patent grant, why did you patent it in the first place?

Apart from that, I must say that I'm very glad to live in a jurisdiction where these types of patents are void. I could have written down this idea long before 2011, but I never thought this would be worth patenting. (Don't get me wrong, the concept is really great, but to me it doesn't make any sense to prevent others from implementing the same idea. It's not like Microsoft had to invest thousands of dollars into research to develop this idea, and aims to refinance that development effort via patent licenses.)


> If you publish this under patent grant, why did you patent it in the first place?

It could be part of a defensive patent strategy. If they successfully received a patent for that, then it makes it that much harder for a malicious third party to troll them. And if you ask me, turning around and publishing the patent under the APL demonstrates good faith to the /libre/ software community (it's a GPLv3-compatible license, no less), so that's a strategic win, too.


Wow, this is an interesting strategy. To summarize, just to make sure I got this right:

    1. They make use of the patent protection clauses via APL (and also GPLv3)
    2. They use those clauses to protect it from their own patent
Result: Their patent ensures that all Free Software implementations must use a license with a patent grant - APL, GPLv3, AGPLv3 or similar. (Or they have to try to get a patent license from MS.)

What's still unclear to me: Does it affect other implementations? Assume that Clang/LLVM wants to implement "syntax trivia". Will they have to use exactly their implementation, and thus have to switch from their BSD-style license to APL or (A)GPL?


For those who missed it, discussion of part 1 is at [1].

[1] https://news.ycombinator.com/item?id=9777829


[1]

[1] Thank [2]!

[2] you


Honestly, I found the way he used [1] to be easier to parse even though it's the only link and at the end of the sentence. I appreciated it.


Meta question, do transpilers actually work well in practice? My intuition is that they would have all sorts of limitations, but I've never had to use one. Could anyone point me to a resource describing what those limitations might be? In what circumstances would they work best?


I use two transpilers in my day-to-day work, Cython and CoffeeScript. In both cases they work really well and save me a bunch of time and effort. However, in both cases the transpiler is developed in parallel with the source language, which always makes things a lot easier, and the transpiler obviously supports 100% of the source language.

I have also played with transpilers written for 'foreign' languages, and then you do end up in a situation where you can only use an (often badly defined) subset of the source language, and debugging is a pain since the output is often very difficult to read. Still, they're handy when you need them and can (sometimes) save you time compared to porting the code by hand.


The chief difficulty is more social than technical: the description of a language is a Schelling Point (https://en.wikipedia.org/wiki/Focal_point_(game_theory)). Satisfying computers on the other end of the transformation doesn't necessarily help you satisfy people on both ends of the transformation.

Are you in web development? If so, I rate it as "Extraordinarily likely" you will use transpilers in the future or have done so already. Common ones which people would generally agree with me "Yep, that's a transpiler" include Sass, CoffeeScript, and Babel (an ES6 to the-Javascript-modern-browsers-actually-run transpiler -- which is, by the way, quite nice).

More broadly, and sidestepping a definitional debate, our day to day life is "Take a program state described in Language X then shoot it down the pipe in three other languages and pray the virtual machine knows what it is doing."

Random note: you can totally write a compiler. Writing a really compelling one is a social and technological challenge, but writing a limited one for pedagogical or personal/business use is roughly "an undergrad lab assignment" in terms of difficulty level. I think many engineers get scared by words like "compiler" and "emulator" and "virtual machine" and "programming language" and "operating system" and forget that these are the kind of things that talented-but-very-much-mortal solo developers can ship in weeks for fun.


"Transpiler" is just a new word for "compiler" - the major difference is that the target language is a high-level one, but that just means it is a lot easier to write the compiler (no optimizing register assignment!).

I doubt a transpiler has any limitations that don't apply to compilers in general, though of course any particular one might.


One of the most popular transpilers is Haxe (http://haxe.org) - it compiles down to many different languages, and it works pretty well.


In the end, a "transpiler" is just a compiler and subject to the same limitations. There is no generic set of limitations that apply to all compilers. Some are very good because they do something relatively easy, like Coffeescript -> JavaScript. I don't mean that writing a great Coffeescript compiler [1] is necessarily "easy", but the Coffeescript language was tightly designed to map directly to Javascript constructs, so there's hardly any "limitations" at all; very nearly the entire target language is available to you, and to the extent that there may be JS you can't generate, it's probably JS the community generally agrees is a bad idea anyhow. Others, like gcc or LLVM, are actually bridging a fairly large gulf now between the source and target languages [2], but do a great job because of massive piles of effort by incredibly smart people.

But if you did have to make generalizations, the two big ones are that generally you can have a performance problem if you're forced to compile something onto a target with a very different paradigm, and the target just doesn't let you have a low-enough-level view of things to implement the source language's abstractions efficiently. For instance, consider the languages targeting the JVM before invokedynamic was added to the JVM... there was a fundamental paradigm mismatch that bled away a certain unavoidable amount of performance. The other problem you can get is that you may not get slick access to the low-level details of the target language; for instance, take a look at the piles of things that compile to JavaScript and compare how they allow you to use jQuery. It's anything from "What's the problem?" in CoffeeScript to being a full-on Foreign Function Interface-type call in GHCJS, a Haskell->JavaScript compiler (even if you can get it pre-wrapped for you [3]). Both are really two aspects of the same "impedance mismatch" problem.

[1]: Look, I'll be honest, I consider the term "transpiler" completely unnecessary. It's a compiler. Feel free to just keep using transpiler in your reply and I won't say anything again. I've had this argument, I bow to the trend, but this old fogey doesn't intend to change (and still considers it a serious impediment to understanding to believe they are two different things, despite using the exact same techniques, architecture, tools, mental models... grumble grumble).

[2]: My recent previous comment goes into more detail on that claim: https://news.ycombinator.com/item?id=9800231

[3] https://github.com/ghcjs/ghcjs-jquery


    > Look, I'll be honest, I consider the term "transpiler"
    > completely unnecessary.
I'm part of the Dart team, so transpiling is my bread and butter.

I do think it is important to make a distinction between a transpiler and a compiler. If you only consider the runtime behavior of the resulting code, sure, the two are roughly equivalent.

But, that totally ignores the human factors involved.

1. You may need to inspect the ___piler output to find bugs in it.

2. A user may read the ___piler output to better understand the semantics of the source language. "Oh, ___ in CoffeeScript means ___ in JS!".

3. The user will need to debug their program. Unless you have very sophisticated debugging infrastructure (hint: source maps are not good enough), then that means they will be stepping through the ___piler output, not the original source.

4. A user may end up discarding the original source and hand-maintaining the ___piled output as the new source of truth. (This use case is exactly what the blog post is about.) Even if a user doesn't actually do this, they may want the psychological security of knowing that they could before they are willing to adopt the language.

I consider a compiler's job to be to generate the most efficient representation of the source program's semantics in terms of the target language. "Efficient" here may mean "fast" or "small" (since download speed matters), or some combination of both.

A transpiler's job is to generate the most similar representation of the source program's semantics in terms of the target. This means preserving function structure, control flow constructs, variable names, and comments whenever possible. The goal is to have as many pieces of the source program recognizably appear in the target.

A transpiler is much better at addressing the human factors above. However, a good one can be more difficult to write than a compiler, and in many cases you lose some runtime efficiency.

Also, knowing you want to transpile often constrains the design of your source language. (For example, this is one reason CoffeeScript is so similar to JS.)


Every last one of your points COMPLETELY applies to C++ -> ASM. It's not even like I'm stretching, except maybe on 4, but even then it's only because the tradeoffs are so poor when the compiler is big and complicated. I read about 3 blog posts a month that dig into the actual ASM generated by some C or C++ code (security analysis, mostly, but sometimes compiler bug/CPU bug/surprising behavior discussions). Plenty of other compiler output gets picked up as the final maintained output.

This elevates a minor design concern (how the final AST gets serialized) into some sort of big deal... but that's no different than optimizing your compile for space vs. time vs. compile time vs. correctness vs. any of the dozens of other dimensions you may have to optimize on. We don't run around giving them all names. It would be crazy. For instance, you're always worried about the mapping of the source to the target language... that's hardly special to Coffeescript. How could you possibly create a good compiler without thinking about that?

In the meantime, the cognitive damage of people thinking these are somehow different skillsets is bad. Compilers are mystical enough to people without making up a brand new category of thing to confuse them about.

Go learn compilers, people. You'll find you've automatically learned how to write transpilers too, without a single additional lesson. I can't think of a much better proof than that that they aren't actually different.


> Every last one of your points COMPLETELY applies to C++ -> ASM.

Really?

> 1. You may need to inspect the ___piler output to find bugs in it.

The set of people who are debugging a C++ compiler by looking at its machine code output is vanishingly small compared to the number of regular working C++ programmers.

> 2. A user may read the ___piler output to better understand the semantics of the source language. "Oh, ___ in CoffeeScript means ___ in JS!".

I'm not aware of anyone who learned C++'s high level semantics by seeing what machine code they are compiled to. Sure, there are a couple of corner cases where you might want to dig into the details of how something like vtables or calling conventions are implemented. But no one I know says, "Hmm, what does 'protected' mean? Let me look at the ASM and see."

But that is exactly how people learn CoffeeScript. Look right on the page: http://coffeescript.org/

> 3. The user will need to debug their program. Unless you have very sophisticated debugging infrastructure (hint: source maps are not good enough), then that means they will be stepping through the ___piler output, not the original source.

Fortunately, we do have very sophisticated debuggers for C++. A handful of real pros may also step through the assembly on rare occasions, but most working C++ programmers never do and never need to.

This is simply not the case for other transpiled languages. There is no such thing as a "CoffeeScript debugger".

> 4. A user may end up discarding the original source and hand-maintaining the ___piled output as the new source of truth.

Heaven help you if you have to do this with the machine code of your C++ compiler. I've heard horror stories of teams that had to do this after losing the original source, but there's a reason we consider those horror stories. I've never heard of anyone doing this deliberately or considering it a feature.

> We don't run around giving them all names.

"Link time optimization", "dead code elimination", "global value numbering", "constant folding", "common subexpression elimination", "static single assignment form", "continuation-passing style", ...

> For instance, you're always worried about the mapping of the source to the target language... that's hardly special to Coffeescript. How could you possibly create a good compiler without thinking about that?

Your C++ compiler writer is never thinking, "How can I make this for() loop look like a for loop in assembly?" Or "How can I maintain this local variable name?"

In fact, they are often doing the opposite: "How can I lower this for loop to a more primitive control flow graph so I can optimize it?" Or "How can I eliminate this local variable entirely if it's never read?"

But that kind of stuff is exactly what a transpiler writer is trying to maintain.

> In the meantime, the cognitive damage of people thinking these are somehow different skillsets is bad.

I don't think having a different term for transpilers causes any cognitive damage. If anything, the name is associated with "lightweight compilers" like CoffeeScript which are more approachable to newcomers than "real" compilers like you learn about in the dragon book.

> You'll find you've automatically learned how to write transpilers too, without a single additional lesson.

I've written both, and I think there is actually quite a bit of difference between the two. Sure, the parsing is the same. But the way you approach code gen is very different between a compiler and a transpiler.

In fact, that often even bleeds forward into the front end. For a compiler, you're in this state of actively discarding information. Comments? Don't even lex them. Variable names? Just de Bruijn index them.

With a transpiler, all of that is precious data that you have to carefully pipe through to make the output code as readable as possible.
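
To make the "de Bruijn index them" remark concrete, here is a toy sketch (not from any real compiler; Expr, Core, and Lowering are made-up names) of a lambda-calculus front end lowering a named surface AST into a core IR where variable names are replaced by binder distances and then simply dropped:

    using System;
    using System.Collections.Generic;

    // Surface AST: variables carry names.
    abstract record Expr;
    record Var(string Name) : Expr;
    record Lam(string Param, Expr Body) : Expr;
    record App(Expr Fn, Expr Arg) : Expr;

    // Core IR: a variable is just its 0-based distance to the binder
    // that introduced it; the name is gone.
    abstract record Core;
    record Idx(int Distance) : Core;
    record CLam(Core Body) : Core;
    record CApp(Core Fn, Core Arg) : Core;

    static class Lowering
    {
        // binders[0] is the innermost enclosing binder's name.
        public static Core ToDeBruijn(Expr e, List<string> binders) => e switch
        {
            Var v => new Idx(binders.IndexOf(v.Name)),  // -1 would mean a free variable
            Lam l => new CLam(ToDeBruijn(l.Body, Push(l.Param, binders))),
            App a => new CApp(ToDeBruijn(a.Fn, binders),
                              ToDeBruijn(a.Arg, binders)),
            _ => throw new ArgumentOutOfRangeException(nameof(e)),
        };

        static List<string> Push(string name, List<string> binders)
        {
            var extended = new List<string> { name };
            extended.AddRange(binders);
            return extended;
        }
    }

    // Example: "λx. λy. x" lowers to "λ. λ. 1" -- the name "x" no longer exists.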

I don't think any of this is rocket science, and I definitely encourage more people to learn how to do this. If nothing else, it's tons of fun. But transpilers are different from compilers.

They share a lot of techniques, but they have different goals, scope, and requirements. Having a different word for that doesn't seem bad to me.

Honestly, if you want to talk about confusing PL terms, how about "interpreter" and "virtual machine". Now those are ones that cause real confusion.


> In fact, that often even bleeds forward into the front end. For a compiler, you're in this state of actively discarding information. Comments? Don't even lex them. Variable names? Just de Bruijn index them.

FYI, for anyone thinking about writing a production compiler front-end, don't do this, it's a terrible idea.


I think the important bit is to not throw that information away until after you've reported any compile errors. Once you know the code is valid, you can discard that stuff (unless you need it to generate debugging information).


Oh, yes, that's fine. Although by that point you've kind of already spent the time, so it's not going to gain you much by throwing it out.


Because your error messages might be less-than-useful. Not that Wasabi ever had an issue with that. Cough.


This is a really interesting discussion, and I quite like the distinction you make between "most efficient" and "most similar" representation in the target language. From that perspective, it makes sense to me that CoffeeScript is very much a transpiler, while Closure is very much a compiler, which meshes with common terminology.

But it seems to me that these should be considered different kinds of optimization, rather than fundamentally different things, eg. `xpiler -Oreadability` versus `xpiler -Oefficiency`.


Agree completely -- all of these human factors affected Wasabi. #1 was the main reason we chose to write a C# transpiler for v3 instead of a CIL emitter. #2 and #3 definitely helped other developers figure out what their code was actually doing. And like you said, #4 is what ended up making my Kill Wasabi project feasible.

Plus, writing a transpiler means you can lean on better-funded compiler makers' optimizations. In our case, the .NET JIT team is way more hardcore at x86 optimization than I have the resources to be. In your case, you get to avoid duplicating the V8 team's work.


Can't point to any resource... but my experience is that the main problems are a) parsing the input language in the same way as other tools do, and b) giving useful diagnostics and making the next-step compiler's and/or debugger's diagnostics useful. Far from trivial, but not insurmountable.


One minor correction: trivia are part of tokens, not nodes, which allows representing things like the whitespace between the if keyword and the opening parenthesis. There is a convenience method to get leading and trailing trivia of nodes, but all it does is take the first token and grab trivia from there.
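
For anyone who wants to see that shape in practice, a minimal sketch against the Roslyn API (assuming the Microsoft.CodeAnalysis.CSharp package is referenced) showing that whitespace and comments hang off SyntaxTokens as leading/trailing trivia:

    using System;
    using Microsoft.CodeAnalysis;
    using Microsoft.CodeAnalysis.CSharp;

    class TriviaDemo
    {
        static void Main()
        {
            var tree = CSharpSyntaxTree.ParseText(
                "class C {\n    // a comment\n    void M ( ) { }\n}");

            foreach (var token in tree.GetRoot().DescendantTokens())
            {
                // Whitespace, comments, and newlines live on tokens, so the
                // space between "M" and "(" is trailing trivia of the "M" token.
                foreach (var trivia in token.LeadingTrivia)
                    if (trivia.IsKind(SyntaxKind.SingleLineCommentTrivia))
                        Console.WriteLine($"comment before '{token.Text}': {trivia}");
            }
        }
    }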

Having worked with Roslyn for the past seven months, I'm actually very impressed by the overall architecture. We're using it to convert C# to Java and JavaScript, so we essentially convert one AST into another. I took a few design decisions from Roslyn and applied them to our own AST.


Thanks, I appreciate the correction. My experience with Roslyn was very ad-hoc: just enough to do the job, see the sights, and get out of the compiler business :)

It's interesting how many people are out there trying to switch from one language to another!


Why not open-source the whole compiler? Probably nobody else would use the whole thing as is, but there are surely other interesting bits, like the global type inference.


The type inference code would make the world a worse place.

Whole careers would be ruined by new developers closing one eye and peeking at the last close brace.


Perhaps I misunderstand your question, but doesn't the article say the source is on github? https://github.com/FogCreek/RoslynGenerator


On GitHub it states "The CLR importer, lexer, parser, interpreter, type checker, language runtime, JavaScript generator, and other components of Wasabi are missing."


For a split second I thought it was going to be about Nullsoft's Wasabi https://en.wikipedia.org/wiki/Wasabi_%28software%29


I wonder why you have been downvoted. I thought the same thing. Upvoted.


Thanks. Someone might have `misclicked'.



