
From AST to Lossless Syntax Tree - gbrown_
http://www.oilshell.org/blog/2017/02/11.html
======
brabram
Funny, I actually did the same thing for Python years ago and also talked
about a "Lossless Syntax Tree", then started calling it a Full Syntax Tree since
CST doesn't really fit; I wasn't aware of the existence of lib2to3 at that
time.

My goal was different from yours: I wanted to make writing custom refactoring
code (that is, code that modifies source code) and code that works on source
code a doable task.

I ended up making some design decisions that I haven't found elsewhere (but
this field is hard to explore):

- producing JSON, because a plain data structure doesn't lie to you, and for
potential interoperability

- nodes are responsible for the formatting within themselves, as opposed to
lib2to3, where a node is responsible for the formatting before itself (or
after, I'm not sure anymore)

- the tree is designed for the human brain instead of an interpreter/compiler
(for example, having lists instead of recursive structures)
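To make the second point concrete, here is a minimal Python sketch of a tree made of plain dicts where each node stores the whitespace that occurs inside it. The node layout and the `first_formatting` / `second_formatting` keys are assumptions chosen for illustration; this is not claimed to be Baron's actual schema.

```python
import json

# Hypothetical node layout, not Baron's real output: every node is a plain
# dict, and whitespace inside a node is stored on that node.
assignment = {
    "type": "assignment",
    "target": {"type": "name", "value": "x"},
    "first_formatting": [{"type": "space", "value": " "}],    # before "="
    "second_formatting": [{"type": "space", "value": "  "}],  # after "="
    "value": {"type": "int", "value": "1"},
}


def dumps(node):
    """Rebuild the exact source text of a node, whitespace included."""
    kind = node["type"]
    if kind in ("name", "int", "space"):
        return node["value"]
    if kind == "assignment":
        return (
            dumps(node["target"])
            + "".join(dumps(f) for f in node["first_formatting"])
            + "="
            + "".join(dumps(f) for f in node["second_formatting"])
            + dumps(node["value"])
        )
    raise ValueError(f"unknown node type: {kind}")


source = dumps(assignment)

# The tree is plain data, so it survives a JSON round trip unchanged.
assert json.loads(json.dumps(assignment)) == assignment
```

Because the two spaces after `=` live on the assignment node, a refactoring that swaps out `value` leaves the original spacing untouched.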

The project is called Baron
[https://github.com/pycqa/baron](https://github.com/pycqa/baron) and was
actually a means for me to work on what really interests me: the abstraction
that attempts to make writing custom refactoring a doable task
[https://github.com/pycqa/redbaron](https://github.com/pycqa/redbaron)

Good luck with your project :)

~~~
shakna
I've used RedBaron before when building some toy languages.

Thanks for a great project!

~~~
brabram
Oh, can you give more details/links please? I'm really curious about that!

~~~
shakna
The code never became public, but I can give some details.

The idea was to create an easier C, using Python as an intermediary language.

RedBaron provided the AST and AST->Python layers.

Unfortunately, the project had to pivot, because optimising Python to be fast
enough was overly difficult/unpredictable.

------
zaptheimpaler
Cool I didn't know this was a common pattern. I recently saw the same approach
implemented in scala.meta [1] - it allows you to view the code as both parsed
tokens with all syntax intact as well as more abstracted ASTs which only carry
semantic meaning. Someone even built a code formatter called scalafmt[2] like
the author mentions.

It's a really cool approach because I think we need to pay much more attention
to making more/better structured data from the compiler available to tools.

[1] [http://scalameta.org/tutorial/](http://scalameta.org/tutorial/) [2]
[https://github.com/olafurpg/scalafmt](https://github.com/olafurpg/scalafmt)

~~~
densh
Back when we started scala.meta, token-level granularity wasn't available in
most metaprogramming frameworks. Clang [1] and Roslyn [2] seem to be the first
two major industry-grade compilers that use this approach and re-use the
compiler as the foundation for extensible tooling APIs.

[1] [https://clang.llvm.org/](https://clang.llvm.org/)

[2] [https://roslyn.codeplex.com/](https://roslyn.codeplex.com/)

~~~
chubot
(author here)

Cool! Yes this is what I was getting at. I mentioned Clang and Microsoft's
IDEs in the "related work" section. I haven't seen them describe their data
structures very much.

Clang has some documentation, but it's not very clear. For example, this doc
has a FIXME(describe design) on it.

[http://clang.llvm.org/docs/LibFormat.html](http://clang.llvm.org/docs/LibFormat.html)

I also think the Clang AST is absolutely enormous and thus hard to document.

~~~
densh
We've not completely solved the problem of language complexity creeping into
the metaprogramming toolkit, but we do have quasiquotes [1] to partially
address the pain of AST constructors and destructors.

E.g. for the code snippet "class C(x: Int)", the Scala compiler creates a tree
that roughly resembles the following code:

    
    
        class C extends scala.AnyRef {
          <paramaccessor> private[this] val x: Int = _;
          def <init>(x: Int) = {
            super.<init>();
            ()
          }
        }
    

This tree looks even more terrifying under the hood in terms of AST
constructors. Instead of making people figure out how that works, we support
nice high-level sugar instead:

    
    
        q"class C(x: Int)"
    

Where q is a magic string interpolator that is compiled to generate an
equivalent AST. It can be used both for constructing new AST nodes based on
older ones (we use $name syntax to substitute things in) and for
deconstructing existing ones into smaller pieces via pattern matching ($name
extracts parts out).

[1] [http://docs.scala-lang.org/overviews/quasiquotes/intro.html](http://docs.scala-lang.org/overviews/quasiquotes/intro.html)
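For readers who don't use Scala, the construction half of this idea can be approximated in Python with the standard `ast` module. This is only a rough analogue, not how scala.meta works: the `quasiquote` helper is a hypothetical name, it substitutes by keyword argument rather than `$name` syntax, and it does not cover the pattern-matching (deconstruction) side.

```python
import ast


def quasiquote(template, **substitutions):
    """Parse `template` and splice AST nodes in place of matching names."""
    tree = ast.parse(template, mode="eval")

    class Substitute(ast.NodeTransformer):
        def visit_Name(self, node):
            # A name that matches a keyword argument is replaced by the
            # supplied node; every other name is left alone.
            if node.id in substitutions:
                return substitutions[node.id]
            return node

    return ast.fix_missing_locations(Substitute().visit(tree))


# Build the tree for "1 + 2", then splice it into a larger expression
# written as ordinary source text instead of nested AST constructors.
inner = ast.parse("1 + 2", mode="eval").body
tree = quasiquote("x * 10", x=inner)
result = eval(compile(tree, "<quasiquote>", "eval"))
```

The template `"x * 10"` stands in for what would otherwise be a hand-built `ast.BinOp(...)` call, which is the pain point quasiquotes are meant to remove.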

~~~
chubot
OK interesting, Julia metaprogramming looks very similar to this, and I hope
to take inspiration from it for oil:

[http://docs.julialang.org/en/stable/manual/metaprogramming/](http://docs.julialang.org/en/stable/manual/metaprogramming/)

[https://en.wikibooks.org/wiki/Introducing_Julia/Metaprogramm...](https://en.wikibooks.org/wiki/Introducing_Julia/Metaprogramming)

They use :(expr) or quote/end for quotation, and $var for interpolation.
Elixir metaprogramming uses quote/end for quotation, and unquote for
interpolation. (In fact the entire Elixir language appears to be done with AST
metaprogramming, since it's on top of Erlang.)

I basically think of these schemes as "Lisp-like AST metaprogramming, but with
Syntax". Thanks for the pointer on Scala.

I saw a video where people asked why Clang source tools generate textual
changes rather than AST changes... and this is a good example. People for some
reason think that ASTs are "cleaner" or more usable, but they can be a pain.

------
edsrzf
Go's standard library AST package[1] similarly stores comments and
whitespace information. It has to, since it's used by gofmt, and you wouldn't
be very happy if you lost comments when formatting your code.

[1] [https://golang.org/pkg/go/ast/](https://golang.org/pkg/go/ast/)
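The failure mode is easy to reproduce in a language whose standard AST is lossy. A small sketch using Python (not Go) and its `ast` module, assuming Python 3.9+ for `ast.unparse`:

```python
import ast

source = "x = 1  # the answer\n"

# Round-trip through Python's standard AST, which keeps no comments.
regenerated = ast.unparse(ast.parse(source))

# The statement survives, the comment does not.
comment_survived = "# the answer" in regenerated
```

A formatter built on a tree like this would silently delete comments, which is why a formatter's tree has to carry them.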

~~~
nemo1618
Unfortunately, most people agree that the ast package is a mess. It isn't used
by the actual compiler. There is some talk of eventually deprecating it and
publishing the internal ast/parser/type-checker packages.

~~~
edsrzf
I think calling it a "mess" is hyperbole, but yes, it's true that it has
shortcomings and may be deprecated and replaced.

------
electrum
IDEs like IntelliJ need to do this as they support all sorts of
transformations (refactorings) while preserving the original source code as
much as possible. Their syntax trees must also handle errors, as they need to
work robustly in the editor while the user is typing or otherwise has syntax
errors. IntelliJ can even perform refactorings while the code has syntax or
semantic errors.

See this for more details:
[http://www.jetbrains.org/intellij/sdk/docs/basics/architectu...](http://www.jetbrains.org/intellij/sdk/docs/basics/architectural_overview.html)

------
krallja
The Roslyn compiler (C#) stores "syntax trivia" as properties of the AST nodes
they appear near. Whitespace and comments are considered trivia.
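A rough Python sketch of the trivia idea, under stated assumptions: the `Token` class and `lex` function are hypothetical, the lexer is a toy that only knows words, single punctuation characters and `//` comments, and the attachment rule used here (trailing trivia runs to the end of the line, the rest becomes leading trivia of the next token) is one common convention rather than a description of Roslyn's implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    leading: str   # whitespace and comments before the token
    text: str      # the token itself
    trailing: str  # trivia after the token, up to the end of its line

    def full_text(self):
        return self.leading + self.text + self.trailing


LEADING = re.compile(r"(?:[ \t\n]+|//[^\n]*)*")
TOKEN = re.compile(r"\w+|[^\s\w]")
TRAILING = re.compile(r"(?:[ \t]+|//[^\n]*)*\n?")


def lex(source):
    tokens, pos = [], 0
    while True:
        lead = LEADING.match(source, pos)
        tok = TOKEN.match(source, lead.end())
        if tok is None:
            # End of input (or a character this toy lexer does not know):
            # keep the rest as leading trivia of an empty end-of-file token.
            tokens.append(Token(source[pos:], "", ""))
            return tokens
        trail = TRAILING.match(source, tok.end())
        tokens.append(Token(lead.group(), tok.group(), trail.group()))
        pos = trail.end()


source = "x = 1 // one\n\n// note\ny = 2\n"
tokens = lex(source)
rebuilt = "".join(t.full_text() for t in tokens)
```

Since every character ends up either in a token or in a token's trivia, concatenating `full_text()` over all tokens gives back the input.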

~~~
m0sa
[https://github.com/dotnet/roslyn/wiki/Roslyn-Overview#syntax...](https://github.com/dotnet/roslyn/wiki/Roslyn-Overview#syntax-trivia)

~~~
chubot
Thank you, this kind of design doc is exactly what I was looking for. I'm not
really familiar with Microsoft's ecosystem, but I mentioned it in the blog
post because I suspected that they had the most advanced technology in this
domain.

From that doc, which I plan to read thoroughly:

 _This enables the second attribute of syntax trees. A syntax tree obtained
from the parser is completely round-trippable back to the text it was parsed
from. From any syntax node, it is possible to get the text representation of
the sub-tree rooted at that node._

This is true with my representation too, but I don't actually attach "trivia"
to trees. Instead I just have every node store a bunch of span IDs. And then
if you want to reconstruct the text, then you just take min(span_ids of node)
and max(span_ids of node) and then concatenate those spans.
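A minimal Python sketch of that reconstruction, with several assumptions: the `Node` class and helper names are made up for illustration, spans are stored as plain strings in a list indexed by span ID, and the min/max is taken over a node and all of its descendants. The real oil data structures may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    span_ids: list = field(default_factory=list)  # spans this node owns
    children: list = field(default_factory=list)


def all_span_ids(node):
    """Collect the span IDs of a node and everything below it."""
    ids = list(node.span_ids)
    for child in node.children:
        ids.extend(all_span_ids(child))
    return ids


def node_text(node, spans):
    """Concatenate every span from the lowest to the highest ID."""
    ids = all_span_ids(node)
    return "".join(spans[min(ids):max(ids) + 1])


# The lexer is assumed to cut the whole file into consecutive spans,
# including whitespace and comments that no node refers to directly.
source = "echo hi  # greet\nls\n"
spans = ["echo", " ", "hi", "  ", "# greet", "\n", "ls", "\n"]

command = Node("command", span_ids=[0], children=[Node("word", span_ids=[2])])
text = node_text(command, spans)
```

The space at span 1 is not attached to any node, but it falls between the minimum and maximum IDs, so it is recovered anyway.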

I also think the name "lossless syntax tree" makes sense, because they are
describing very specific properties that ASTs and CSTs / parse trees don't
have.

They also have an immutable property, which is cool. I recall that Hejlsberg
had a video on this:

[https://news.ycombinator.com/item?id=11685317](https://news.ycombinator.com/item?id=11685317)

[https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg...](https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg-on-Modern-Compiler-Construction)

------
goerz
Are there any parsers (or parser generators) that can generate a
"Lossless Syntax Tree" for an arbitrary grammar? Specifically, a parser that
tags every parsed token with a tuple (line_id, col, length) that tells me
where it was in the original parsed file. I've been thinking for ages about
writing some refactoring tools for Fortran that would need something like
that.
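The tagging part of this is straightforward to sketch by hand. Below is a toy Python lexer, not a real Fortran lexer and not the output of any parser generator: the `Span` tuple and the three token classes in `TOKEN_RE` are assumptions chosen so that every character of a line belongs to exactly one token.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    line_id: int
    col: int
    length: int


# Whitespace runs, word runs, or any other single character. Together they
# cover every character, so nothing is dropped.
TOKEN_RE = re.compile(r"\s+|\w+|.")


def lex(lines):
    """Yield a (line_id, col, length) span for every token in `lines`."""
    for line_id, line in enumerate(lines):
        for match in TOKEN_RE.finditer(line):
            yield Span(line_id, match.start(), len(match.group()))


source = "x = 1\nprint *, x\n"
lines = source.splitlines(keepends=True)
spans = list(lex(lines))

# The spans alone are enough to rebuild the file from its lines.
rebuilt = "".join(
    lines[s.line_id][s.col:s.col + s.length] for s in spans
)
```

A refactoring tool can then rewrite the text behind one span and re-emit everything else byte for byte.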

~~~
maxbrunsfeld
This is a parser generator that I'm working on:

[https://github.com/tree-sitter/tree-sitter](https://github.com/tree-sitter/tree-sitter)

It produces concrete syntax trees that can be queried by line or character
index. The library is specifically focused on _incremental_ parsing, for use
in a text editor, but it also works fine for normal 'batch' parsing workloads.
It has a simple C API that you should be able to use from any language.
Currently, there are bindings to JavaScript and Haskell. Here are some example
grammars:

* C - [https://github.com/tree-sitter/tree-sitter-c](https://github.com/tree-sitter/tree-sitter-c)

* JavaScript - [https://github.com/tree-sitter/tree-sitter-javascript](https://github.com/tree-sitter/tree-sitter-javascript)

* Go - [https://github.com/tree-sitter/tree-sitter-go](https://github.com/tree-sitter/tree-sitter-go)

* Ruby - [https://github.com/tree-sitter/tree-sitter-ruby](https://github.com/tree-sitter/tree-sitter-ruby)

* Python - [https://github.com/tree-sitter/tree-sitter-python](https://github.com/tree-sitter/tree-sitter-python)

~~~
goerz
That looks extremely interesting!

------
hzoo
For JavaScript we have
[https://github.com/estree/estree](https://github.com/estree/estree) as our JS
AST spec for parsers like esprima, acorn, babylon (under a config flag), etc.

JSCS ([https://github.com/jscs-dev/node-jscs](https://github.com/jscs-dev/node-jscs))
(now merged with ESLint) started a CST project as well,
[https://github.com/cst/cst](https://github.com/cst/cst), to help deal with
autoformatting by adding whitespace-type nodes.

Currently the community has a lot of interest in
[https://github.com/jlongster/prettier](https://github.com/jlongster/prettier)
which just simply reprints the file from scratch in a consistent way (posted
in
[https://news.ycombinator.com/item?id=13365470](https://news.ycombinator.com/item?id=13365470)).

We also have
[https://github.com/benjamn/recast](https://github.com/benjamn/recast) to help
with source-to-source transformations, and
[https://github.com/facebook/jscodeshift](https://github.com/facebook/jscodeshift).

------
rav
Using the line/column information in Python's ast objects, you can sort of,
almost recreate the concrete syntax, and I've tried to do so in my toy project
[https://github.com/Mortal/fstrings](https://github.com/Mortal/fstrings) which
converts old %-style formatting to the new Python 3.6 f-string syntax. It's
not feature complete, but it mostly works.
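As an illustration of the position information involved (not the fstrings project's actual code), this Python sketch finds `%`-formatting expressions and slices their exact source text back out. It assumes Python 3.8+, where nodes carry end positions and `ast.get_source_segment` is available.

```python
import ast

source = 'name = "world"\nprint("hello %s" % name)\n'
tree = ast.parse(source)

found = []
for node in ast.walk(tree):
    # "%" applied to a string literal is old-style formatting.
    if (
        isinstance(node, ast.BinOp)
        and isinstance(node.op, ast.Mod)
        and isinstance(node.left, ast.Constant)
        and isinstance(node.left.value, str)
    ):
        # get_source_segment slices the original text using the node's
        # lineno, col_offset, end_lineno and end_col_offset.
        segment = ast.get_source_segment(source, node)
        found.append((node.lineno, node.col_offset, segment))
```

This recovers the text of individual expressions, but comments and the whitespace between nodes are still absent from the tree, hence "sort of, almost".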

~~~
icebraining
There's redbaron, which provides a full syntax tree, and is designed for
making such changes:
[https://github.com/PyCQA/redbaron](https://github.com/PyCQA/redbaron)

~~~
rav
Unfortunately, redbaron targets Python 2, which doesn't have the f-strings
added in Python 3.6. The project looks quite nifty otherwise.

------
lloydde
This is the first I've read about the oil shell. I've tried a few times to
move to fish shell, but I'm too simple a shell user not to be tripped up
regularly when I paste bash in, so I end up back at zsh for interactive and
bash for scripts.

~~~
ycmbntrthrwaway
oil shell is not ready to be used as a login shell, it is a prototype. The
most interesting part of it currently is its blog at
[http://www.oilshell.org/blog/](http://www.oilshell.org/blog/)

~~~
lloydde
I'm very much enjoying reading "Translating Shell to Oil"
[http://www.oilshell.org/blog/2017/02/05.html](http://www.oilshell.org/blog/2017/02/05.html)

------
MarcosDione
By chance, do you have or plan to have an RSS feed? One that is some 60 (!)
posts long, so I can read everything you've published in my RSS reader (I'm
still using one!).

------
ycmbntrthrwaway
The author never commits with his real email except for the initial commit
(which points to [https://github.com/andychu](https://github.com/andychu)), and
the project is owned by an organization.

I wonder if it is possible to run anonymous projects by creating organization
and committing as "Anonymous <anonymous@example.com>". Of course you should
just create another account using Tor in this case, but just from a
theoretical point of view, can an outsider (not working at github) find out
who the author is?

~~~
chriswarbo
> I wonder if it is possible to run anonymous projects by creating
> organization and committing as "Anonymous <anonymous@example.com>". Of
> course you should just create another account using Tor in this case, but
> just from a theoretical point of view, can an outsider (not working at
> github) find out who the author is?

Should be trivially possible; GitHub identifies people via their SSH key, not
via a "name" or email address.

According to their terms of service
([https://help.github.com/articles/github-terms-of-service](https://help.github.com/articles/github-terms-of-service)):

> You must provide your name, a valid email address, and any other information
> requested in order to complete the signup process.

Throwaway email addresses (e.g. mailinator.com ) are "valid", but not
providing "your name" is a violation of the TOS, and hence:

> Violation of any of the terms below will result in the termination of your
> Account.

Of course, that's not saying much when their terms also state:

> GitHub, in its sole discretion, has the right to suspend or terminate your
> account and refuse any and all current or future use of the Service, or any
> other GitHub service, for any reason at any time. Such termination of the
> Service will result in the deactivation or deletion of your Account or your
> access to your Account, and the forfeiture and relinquishment of all Content
> in your Account. GitHub reserves the right to refuse service to anyone for
> any reason at any time.

Hence anybody who actually cares about privacy, anonymity, publishing code
without it being taken down, etc. should probably avoid GitHub (other than for
mirroring). It's a single point of failure which can be closed down or sent a
subpoena.

The whole point of git is to be decentralised, so just use `git send-email`
and keep it that way. No need to give any control to a centralised entity,
especially a proprietary one!

~~~
skissane
> > You must provide your name

It never says "full legal name", just "your name". Suppose you decide to use
an alias? (Especially if you use one that sounds like a legitimate name?) Does
that violate the TOS?

I am not a lawyer so I am not going to even try to interpret GitHub's TOS. But
I do think GitHub ought to clarify this point–either explicitly say that
aliases/pseudonyms are permitted, or alternatively if they insist on your
legal name, say that. Although demanding one's legal name, if taken literally,
can produce some odd results – if everyone calls you "Peggy Smith" but your
legal name is "Margaret Helen Smith", should it be a TOS violation to create
an account as "Peggy Smith"? Or, to give another example, a woman who marries
may decide to adopt her husband's surname as her legal surname (or in some
countries that might even happen automatically by law without her having any
say in it), but still use her maiden name for professional purposes–that's
what my mother did at least.

~~~
chriswarbo
I imagine many GitHub users don't have a "full legal name"; e.g. I'm in the UK
and AFAIK there's no concept of a distinguished "legal name"; a name is
anything you're known by, i.e. I could give "chriswarbo" as my name in an
official setting, since that's one of the names I'm known by (although it
would probably cause me headaches, since that name doesn't appear in any of my
other official forms/documents).

Besides which, as I quoted above, GitHub can terminate any account at any time
for any reason, so it's really just nitpicking.

------
kazinator
HN feature request: "hide all from domain matching pattern".

