
Killing Off Wasabi – Part 2 - GarethX
http://blog.fogcreek.com/killing-off-wasabi-part-2/
======
vog
Although slightly off-topic, I find the mentioned idea of "syntax trivia" very
compelling. This should not only be part of code parsers, but of any parser.

I thought about something similar for binary parsers. Here, too, there are
often many different ways to express a certain structure. Most libraries (such
as Exiv2 for PNG/JPEG/EPS/etc. metadata) perform the following procedure:
First, they parse out all information they are interested in. Then the
application modifies that structure. Then it has to be embedded back into the
binary format (here the image). The latter step is a nasty type of merging.
They go over the file, try to find places where to "safely" embed or replace
parts, then serialize their stuff and try to put it there.

However, if the binary parser were lossless, all changes to the parsed
structure would preserve the "trivia", so serializing would be straightforward.
And any potential issues could be handled at the application level in the
parsed structure, rather than by guesswork and heuristics during the merge
phase.

The only problem could be large BLOBs, but those could easily be represented
as offset & length into the original file, rather than as the actual binary
data in memory. (This assumes the file will not change in the meantime, but in
that case the "merging" approach is very dangerous, too.)
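The lossless approach can be sketched in a few lines. This is a toy
tag-length-value format, not any real metadata standard; the `Record` layout
and field names are invented for illustration, and large values are tracked as
offset & length into the original buffer rather than copied into memory:

```python
# A minimal sketch of a "lossless" binary parser over a hypothetical
# tag-length-value (TLV) format. Values are never copied out of the
# original buffer; they are referenced by offset and length.
import struct
from dataclasses import dataclass

@dataclass
class Record:
    tag: int
    offset: int   # position of the value in the original buffer
    length: int   # size of the value; large BLOBs stay in the file

def parse(buf: bytes) -> list[Record]:
    records, pos = [], 0
    while pos < len(buf):
        tag, length = struct.unpack_from(">BI", buf, pos)
        records.append(Record(tag, pos + 5, length))
        pos += 5 + length
    return records

def serialize(buf: bytes, records: list[Record],
              replacements: dict[int, bytes]) -> bytes:
    # Untouched records are copied verbatim from the original buffer,
    # so every byte we did not explicitly change survives the round trip.
    out = bytearray()
    for i, r in enumerate(records):
        value = replacements.get(i, buf[r.offset:r.offset + r.length])
        out += struct.pack(">BI", r.tag, len(value)) + value
    return bytes(out)
```

With no replacements the round trip is byte-identical, which is exactly the
property the merge-phase heuristics are trying (and failing) to approximate.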

~~~
Locke1689
Syntax trivia is a very important component and a pretty natural fit once you
start thinking about the feature that we decided was absolutely necessary:
full-fidelity round-tripping between parsed syntax trees and the original
source.

I think this is one of the most common mistakes compiler writers make -- they
don't spend enough time thinking about incorrect code. Let me be clear: most
code is wrong. If your code is right, you compile it once. If your code is
wrong, you compile it many times until you get it right. More importantly, if
you run your IDE off your compiler, almost every character you type is a wrong
program.

Having a full-fidelity syntax tree is essential for having great experiences
with wrong code. In addition, it easily solves the problem of having to
serialize your trees -- the source text _is_ the serialization.
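A toy version of the idea, with whitespace and comments kept as leading trivia
on each token so that concatenating the tokens reproduces the source
byte-for-byte. The names here are illustrative, not Roslyn's actual API:

```python
# A full-fidelity tokenizer sketch: nothing is thrown away, so the
# token stream *is* the source and round-trips exactly.
import re
from dataclasses import dataclass

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")
TRIVIA = re.compile(r"(?:\s|#[^\n]*)*")   # whitespace and line comments

@dataclass
class Token:
    leading: str   # trivia before the token (spaces, comments)
    text: str      # the token itself; "" for the end-of-file token

def tokenize(src: str) -> list[Token]:
    tokens, pos = [], 0
    while pos < len(src):
        lead = TRIVIA.match(src, pos).group()
        pos += len(lead)
        m = TOKEN.match(src, pos)
        if not m:                      # trailing trivia at end of file
            tokens.append(Token(lead, ""))
            break
        tokens.append(Token(lead, m.group()))
        pos = m.end()
    return tokens

def to_source(tokens: list[Token]) -> str:
    return "".join(t.leading + t.text for t in tokens)
```

Because `to_source(tokenize(src)) == src` holds for any input, even garbage,
the tree never has to guess how to print itself back out.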

This feeds into API considerations as well. A number of people have repeatedly
talked about how cool it would be to have a lot of other tools understand your
AST (like if Git could support checking in ASTs instead of source). This is
the wrong way of looking at it. If you're dealing with a raw AST you often
have to have domain-specific knowledge of the language itself. Instead, what
you want is to take the thing with the most domain specific knowledge, the
compiler, and allow it to answer questions, i.e. have an API. By having round-
trippable source, all source is essentially given a transparent API that can
be used just as if you were interacting with the source code itself.

Anyway, this is going off the rails, but it's one of the numerous things I'd
point to for many production compilers and say, "you are the past, this is the
future."

~~~
electrum
The idea of "checking in ASTs" becomes a lot more compelling when you think
about representing a diff of an automated refactoring. For example, a simple
rename refactoring, performed by an IDE for a statically typed language, is
completely safe, yet it could generate a 1000-line diff. This is hard to
review, causes merge conflicts, does not rebase correctly, etc.

Instead, if you could somehow check in the refactoring action, all those
problems would go away. You could rebase the code by undoing and redoing the
rename, taking into account new usages of the renamed item, etc.
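A minimal sketch of what "checking in the refactoring action" might look like,
assuming a hypothetical `Rename` record that can be replayed over rebased
text. A real tool would resolve usages through the compiler's symbol table; a
word-boundary regex stands in for that here:

```python
# Record a refactoring as data, not as a textual diff. Replaying the
# action on a rebased file also catches usages added after the
# refactoring was recorded.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rename:
    old: str
    new: str

    def replay(self, source: str) -> str:
        # \b keeps us from touching identifiers that merely contain `old`.
        return re.sub(rf"\b{re.escape(self.old)}\b", self.new, source)
```

Checking in `Rename("fetch_user", "load_user")` is a few bytes; replaying it
on either side of a rebase gives a conflict-free result by construction.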

~~~
Locke1689
I think it's a nice idea, but seems difficult in practice.

First, I think embedding language knowledge into the VCS is fraught with
peril. For one, does that mean that you need to rev your VCS version every
time your language changes? What about when your language revs its AST, but
not the language itself? Is your VCS version now no longer backwards
compatible with old versions?

Second, I think there's a significant amount of overhead and new technology
here. Most DVCS's currently use hash-based filesystems for storing history. If
you replace simple data diffs with semantic transformations then you have to
find some portable way of encoding that. If you don't want the implementation
to be language-specific, then you have to find some language-agnostic encoding
system that can also recognize that the textual diff and the alpha-rename are
identical commits.

IMHO, I would rather have metadata on commits. That way you can always fall
back to plain text and all the old tools (like an ancient vi) continue to be
usable, but more advanced language-specific tooling could recognize these
things and provide a simpler view to the user.

------
andyjohnson0
For those who missed it, discussion of part 1 is at [1].

[1]
[https://news.ycombinator.com/item?id=9777829](https://news.ycombinator.com/item?id=9777829)

~~~
unstabilo
[1]

[1] Thank [2]!

[2] you

~~~
savanaly
Honestly, I found the way he used [1] to be easier to parse even though it's
the only link and at the end of the sentence. I appreciated it.

------
anonymousDan
Meta question, do transpilers actually work well in practice? My intuition is
that they would have all sorts of limitations, but I've never had to use one.
Could anyone point me to a resource describing what those limitations might
be? In what circumstances would they work best?

~~~
jerf
In the end, a "transpiler" is just a compiler and subject to the same
limitations. There is no generic set of limitations that apply to all
compilers. Some are very good because they do something relatively easy, like
Coffeescript -> JavaScript. I don't mean that writing a great Coffeescript
compiler [1] is necessarily "easy", but the Coffeescript language was tightly
designed to map directly to Javascript constructs, so there's hardly any
"limitations" at all; very nearly the entire target language is available to
you, and to the extent that there may be JS you can't generate, it's probably
JS the community generally agrees is a bad idea anyhow. Others, like gcc or
LLVM, are actually bridging a fairly large gulf now between the source and
target languages [2], but do a great job because of _massive piles of effort_
by incredibly smart people.

But if you did have to make generalizations, the two big ones are that
generally you can have a performance problem if you're forced to compile
something onto a target with a very different paradigm, and the target just
doesn't let you have a low-enough level view of things to implement the source
language's abstractions efficiently. For instance, consider the languages
targeting the JVM _before_ invokedynamic was added to the JVM... there was a
fundamental paradigm mismatch that bled away a certain unavoidable amount of
performance. The other problem you can get is that you may not get slick
access to the low-level details of the target language; for instance, take a
look at the piles of things that compile to JavaScript and compare how they
allow you to use jQuery. It's anything from "What's the problem?" in
CoffeeScript to being a full-on Foreign Function Interface-type call in GHCJS
(even if you can get it pre-wrapped for you [3]), a Haskell->Javascript
compiler. Both are really two aspects of the same "impedance mismatch"
problem.

[1]: Look, I'll be honest, I consider the term "transpiler" completely
unnecessary. It's a compiler. Feel free to just keep using transpiler in your
reply and I won't say anything again. I've had this argument, I bow to the
trend, but this old fogey doesn't intend to change (and still considers it a
serious impediment to understanding to believe they are two different things,
despite using the exact same techniques, architecture, tools, mental models...
grumble grumble).

[2]: My recent previous comment goes into more detail on that claim:
[https://news.ycombinator.com/item?id=9800231](https://news.ycombinator.com/item?id=9800231)

[3] [https://github.com/ghcjs/ghcjs-jquery](https://github.com/ghcjs/ghcjs-jquery)

~~~
munificent

        > Look, I'll be honest, I consider the term "transpiler"
        > completely unnecessary.
    

I'm part of the Dart team, so transpiling is my bread and butter.

I do think it is important to make a distinction between a transpiler and a
compiler. If you only consider the runtime behavior of the resulting code,
sure, the two are roughly equivalent.

But, that totally ignores the human factors involved.

1\. You may need to inspect the ___piler output to find bugs in it.

2\. A user may read the ___piler output to better understand the semantics of
the source language. "Oh, ___ in CoffeeScript means ___ in JS!".

3\. The user _will_ need to debug their program. Unless you have very
sophisticated debugging infrastructure (hint: source maps are not good
enough), then that means they will be stepping through the ___piler output,
not the original source.

4\. A user may end up discarding the original source and hand-maintaining the
___piled output as the new source of truth. (This use case is exactly what the
blog post is about.) Even if a user doesn't actually do this, they may want
the psychological security of knowing that they _could_ before they are
willing to adopt the language.

I consider a compiler's job to be to generate the most _efficient_
representation of the source program's semantics in terms of the target
language. "Efficient" here may mean "fast" or "small" (since download speed
matters), or some combination of both.

A transpiler's job is to generate the most _similar_ representation of the
source program's semantics in terms of the target. This means preserving
function structure, control flow constructs, variable names, and comments
whenever possible. The goal is to have as many pieces of the source program
recognizably appear in the target.

A transpiler is much better at addressing the human factors above. However, a
good one can be more difficult to write than a compiler, and in many cases you
lose some runtime efficiency.
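The distinction can be made concrete with a toy single-assignment node; the
function names and output formats below are invented for illustration:

```python
# Same input node, two code-gen philosophies. The "compiler" path chases
# efficiency; the "transpiler" path chases similarity to the source.
from dataclasses import dataclass

@dataclass
class Assign:
    name: str      # the variable name the programmer wrote
    left: int
    right: int
    comment: str   # trailing comment from the source

def compile_opt(node: Assign) -> str:
    # Compiler: fold the constant, drop the comment, replace the name
    # with an anonymous slot. Smallest, fastest output wins.
    return f"v0={node.left + node.right}"

def transpile(node: Assign) -> str:
    # Transpiler: keep the name, the expression structure, and the
    # comment, so the output is recognizably the same program.
    return f"{node.name} = {node.left} + {node.right}  # {node.comment}"
```

Both outputs have identical runtime semantics; only the human factors differ.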

Also, knowing you want to transpile often constrains the design of your source
language. (For example, this is one reason CoffeeScript is so similar to JS.)

~~~
jerf
Every last one of your points _COMPLETELY_ applies to C++ -> ASM. It's not
even like I'm stretching, except maybe on 4, but even then it's only because
the tradeoffs are so poor when the compiler is big and complicated. I read
about 3 blog posts a month that dig into the actual ASM generated by some C or
C++ code (security analysis, mostly, but sometimes compiler bug/CPU
bug/surprising behavior discussions). Plenty of other compiler output gets
picked up as the final maintained output.

This elevates a minor design concern, how to serialize the final AST, into
some sort of big deal... but that's no different than optimizing your compile
for space vs. time vs. compile time vs. correctness vs. any of the dozens of
other dimensions you may have to optimize on. We don't run around giving them
all names. It would be crazy. For instance, you're _always_ worried about the
mapping of the source to the target language... that's hardly special to
Coffeescript. How could you possibly create a _good_ compiler without thinking
about that?

In the meantime, the cognitive damage of people thinking these are somehow
different skillsets is bad. Compilers are mystical enough to people without
making up a brand new category of thing to confuse them about.

Go learn compilers, people. You'll find you've automatically learned how to
write transpilers too, without a single additional lesson. I can't think of a
much better proof than that that they aren't actually different.

~~~
munificent
> Every last one of your points COMPLETELY applies to C++ -> ASM.

Really?

> 1\. You may need to inspect the ___piler output to find bugs in it.

The set of people who are debugging a C++ compiler by looking at its machine
code output is vanishingly small compared to the number of regular working C++
programmers.

> 2\. A user may read the ___piler output to better understand the semantics
> of the source language. "Oh, ___ in CoffeeScript means ___ in JS!".

I'm not aware of anyone who learned C++'s high level semantics by seeing what
machine code they are compiled to. Sure, there are a couple of corner cases
where you might want to dig into the details of how something like vtables or
calling conventions are implemented. But no one I know says, "Hmm, what does
'protected' mean? Let me look at the ASM and see."

But that is _exactly_ how people learn CoffeeScript. Look right on the page:
[http://coffeescript.org/](http://coffeescript.org/)

> 3\. The user will need to debug their program. Unless you have very
> sophisticated debugging infrastructure (hint: source maps are not good
> enough), then that means they will be stepping through the ___piler output,
> not the original source.

Fortunately, we do have very sophisticated debuggers for C++. A handful of
real pros may also step through the assembly on rare occasions, but most
working C++ programmers never do and never need to.

This is simply not the case for other transpiled languages. There is no such
thing as a "CoffeeScript debugger".

> 4\. A user may end up discarding the original source and hand-maintaining
> the ___piled output as the new source of truth.

Heaven help you if you have to do this with the machine code of your C++
compiler. I've heard horror stories of teams that had to do this after
_losing_ the original source, but there's a reason we consider those horror
stories. I've never heard of anyone doing this deliberately or considering it
a _feature_.

> We don't run around giving them all names.

"Link time optimization", "dead code elimination", "global value numbering",
"constant folding", "common subexpression elimination", "static single
assignment form", "continuation-passing style", ...

> For instance, you're always worried about the mapping of the source to the
> target language... that's hardly special to Coffeescript. How could you
> possibly create a good compiler without thinking about that?

Your C++ compiler writer is never thinking, "How can I make this for() loop
look like a for loop in assembly?" Or "How can I maintain this local variable
name?"

In fact, they are often doing the _opposite_ : "How can I lower this for loop
to a more primitive control flow graph so I can optimize it?" Or "How can I
eliminate this local variable entirely if it's never read?"

But that kind of stuff is exactly what a transpiler writer is trying to
maintain.
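The lowering described above might look like this in miniature, with a
structured loop rewritten into the label/branch form an optimizer works on.
The label names and the goto-style pseudocode are illustrative:

```python
# Lower a structured `for var in range(limit)` loop into primitive
# label/branch pseudocode, the shape a compiler's optimizer prefers.
def lower_for(var: str, limit: str, body: str) -> str:
    return "\n".join([
        f"    {var} = 0",
        "L_head:",
        f"    if {var} >= {limit}: goto L_exit",
        f"    {body}",
        f"    {var} = {var} + 1",
        "    goto L_head",
        "L_exit:",
    ])
```

A transpiler would do the opposite: recognize this branch pattern and print a
`for` loop the reader can recognize.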

> In the meantime, the cognitive damage of people thinking these are somehow
> different skillsets is bad.

I don't think having a different term for transpilers causes any cognitive
damage. If anything, the name is associated with "lightweight compilers" like
CoffeeScript which are more approachable to newcomers than "real" compilers
like you learn about in the dragon book.

> You'll find you've automatically learned how to write transpilers too,
> without a single additional lesson.

I've written both, and I think there is actually quite a bit of difference
between the two. Sure, the parsing is the same. But the way you approach code
gen is very different between a compiler and a transpiler.

In fact, that often even bleeds forward into the front end. For a compiler,
you're in this state of actively discarding information. Comments? Don't even
lex them. Variable names? Just de Bruijn index them.

With a transpiler, all of that is precious data that you have to carefully
pipe through to make the output code as readable as possible.
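De Bruijn indexing, the name-discarding step mentioned above, fits in a few
lines; the tuple encoding of lambda terms here is an illustrative choice:

```python
# Replace variable names in lambda terms with de Bruijn indices: each
# variable becomes the number of binders between it and the lambda that
# bound it. Terms are tuples: ("var", x), ("lam", x, body), ("app", f, a).
def de_bruijn(term, env=()):
    kind = term[0]
    if kind == "var":
        return ("var", env.index(term[1]))    # distance to the binder
    if kind == "lam":
        return ("lam", de_bruijn(term[2], (term[1],) + env))
    if kind == "app":
        return ("app", de_bruijn(term[1], env), de_bruijn(term[2], env))
    raise ValueError(f"unknown term kind: {kind}")
```

After this pass, `λx.λy.x` and `λa.λb.a` are literally the same term, which is
exactly why a transpiler that wants readable output must not do it.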

I don't think any of this is rocket science, and I definitely encourage more
people to learn how to do this. If nothing else, it's tons of fun. But
transpilers are different from compilers.

They share a lot of techniques, but they have different goals, scope, and
requirements. Having a different word for that doesn't seem bad to me.

Honestly, if you want to talk about confusing PL terms, how about
"interpreter" and "virtual machine". Now _those_ are ones that cause real
confusion.

~~~
Locke1689
_In fact, that often even bleeds forward into the front end. For a compiler,
you're in this state of actively discarding information. Comments? Don't even
lex them. Variable names? Just de Bruijn index them._

FYI, for anyone thinking about writing a production compiler front-end, don't
do this, it's a terrible idea.

~~~
munificent
I think the important bit is to not throw that information away until _after_
you've reported any compile errors. Once you know the code is valid, you can
discard that stuff (unless you need it to generate debugging information).

~~~
Locke1689
Oh, yes, that's fine. Although by that point you've kind of already spent the
time, so it's not going to gain you much by throwing it out.

------
ygra
One minor correction: Trivia are part of _tokens_, not nodes, which makes it
possible to represent things like the white space between the if keyword and
the opening parenthesis. There is a convenience method to get the leading and
trailing trivia of a node, but all it does is take the node's first or last
token and grab the trivia from there.

Having worked with Roslyn for the past seven months, I'm actually very
impressed by the overall architecture. We're using it to convert C# to Java
and JavaScript, so we essentially convert one AST into another. I took a few
design decisions from Roslyn and applied them to our own AST.

~~~
krallja
Thanks, I appreciate the correction. My experience with Roslyn was very ad-
hoc: just enough to do the job, see the sights, and get out of the compiler
business :)

It's interesting how many people are out there trying to switch from one
language to another!

------
mwcampbell
Why not open-source the whole compiler? Probably nobody else would use the
whole thing as is, but there are surely other interesting bits, like the
global type inference.

~~~
taprun
Perhaps I misunderstand your question, but doesn't the article say the source
is on github?
[https://github.com/FogCreek/RoslynGenerator](https://github.com/FogCreek/RoslynGenerator)

~~~
pweissbrod
In github it states "The CLR importer, lexer, parser, interpreter, type
checker, language runtime, JavaScript generator, and other components of
Wasabi are missing."

------
johnchristopher
For a split second I thought it was going to be about Nullsoft's Wasabi
[https://en.wikipedia.org/wiki/Wasabi_%28software%29](https://en.wikipedia.org/wiki/Wasabi_%28software%29)

~~~
the-dude
I wonder why you have been downvoted. I thought of the same. Upvoted.

~~~
johnchristopher
Thanks. Someone might have 'misclicked'.

