
Tree-sitter – a new parsing system for programming tools [video] - matt_d
https://thestrangeloop.com/2018/tree-sitter---a-new-parsing-system-for-programming-tools.html
======
maxbrunsfeld
Hey, author of Tree-sitter here. Thanks for sharing this! I'd be happy to
answer any questions people have about the project.

~~~
phodge
First up, thanks for dedicating a big chunk of your life to building
tree-sitter! I had a go making a similar parser last year (fast, dynamic-grammar,
self-correcting) and had to give up when I started to realize what a
ludicrously complex undertaking this is. The fact that tree-sitter works at
all is nothing short of amazing.

Question: do you have any plans to integrate tree-sitter with the
language-server project(s)? If fast, accurate parsing of any programming language is
now easily implemented in language servers via tree-sitter, it seems to make
sense for LSP to expand its protocol to include syntax highlighting as well.

~~~
maxbrunsfeld
Thanks for the kind words!

I haven't specifically pursued integration with LSP, but Tree-sitter has been
used to build a couple of language servers which work with both VSCode and
Atom (and probably other editors):

* Bash - [https://github.com/mads-hartmann/bash-language-server](https://github.com/mads-hartmann/bash-language-server)
* Ruby - [https://github.com/rubyide/vscode-ruby/tree/master/server](https://github.com/rubyide/vscode-ruby/tree/master/server)

------
DannyBee
It's interesting that this appears based on Wagner and Graham's papers/work.

A bunch of that was eventually open sourced. For years it was the state of
the art in incremental parsing but, imho, it languished because all the
implementations were closed[1]

[1] ensemble, pan, etc.

------
transreal
The project website is here: [http://tree-sitter.github.io/tree-sitter/](http://tree-sitter.github.io/tree-sitter/),
and the GitHub repo is here: [https://github.com/tree-sitter/tree-sitter](https://github.com/tree-sitter/tree-sitter).

------
mncharity
Here's the js grammar[1].

Despite the worrying talk of tokenization, it looks scannerless. Yay.

One big grammar. JSX is blended in. And ECMAScript versions aren't segregated.
Which I didn't expect... but the motivation wasn't writing picky compilers.

[1] [https://github.com/tree-sitter/tree-sitter-javascript/blob/m...](https://github.com/tree-sitter/tree-sitter-javascript/blob/master/grammar.js)

~~~
maxbrunsfeld
It's not actually implemented as a scannerless parser, but the grammar API
mostly abstracts away the parser/lexer distinction.

We have to handle JSX and all language versions combined because for our use
case, users need to be able to open up `.js` files and have them Just Work.

It's a similar story with Python and Ruby - we have to handle the union of all
language versions.

~~~
mncharity
> It's not actually implemented as a scannerless parser, but the grammar API
> mostly abstracts away the parser/lexer distinction.

Any aspects of the 'not' part of 'mostly' which one should bear in mind?

~~~
maxbrunsfeld
Yeah, the grammar author does fully control the division between the parser
and lexer: every literal (string or regex) in the grammar corresponds to a
token. There's also a `token()` function that you can use to specify that an
arbitrary rule should be handled by the lexer as a single token.

In most cases, you don't have to think about it; the obvious way to write
something is the right way. There are cases where it's helpful to have a mental
model of how lexing works.
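
To make the literal-versus-`token()` distinction concrete, here is a rough sketch of a toy grammar. The language (`let name = number;` statements) and all rule names are made up for illustration. The `seq`, `repeat`, `token` and `grammar` definitions at the top are simplified stand-ins so the snippet runs under plain Node; they are not Tree-sitter's implementation.

```javascript
// Minimal stand-ins (assumptions, NOT Tree-sitter's implementation) so the
// sketch runs on its own. In a real grammar.js the Tree-sitter CLI supplies
// grammar(), seq(), repeat() and token() as globals.
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = (content) => ({ type: 'REPEAT', content });
const token = (content) => ({ type: 'TOKEN', content });
const grammar = (options) => options;

// Hypothetical toy language: a list of `let name = number;` statements.
const toyGrammar = grammar({
  name: 'toy',

  rules: {
    source_file: $ => repeat($.statement),

    // Each string literal here ('let', '=', ';') is its own token.
    statement: $ => seq('let', $.identifier, '=', $.number, ';'),

    // A regex literal is also a single token.
    identifier: $ => /[a-z_]+/,

    // token() asks for the whole rule to be handled by the lexer as ONE
    // token, even though it is written as a sequence of smaller pieces.
    number: $ => token(seq(/[0-9]+/, '.', /[0-9]+/)),
  },
});
```

In an actual `grammar.js` you would drop the stand-ins and export the result with `module.exports = grammar({...})`. Without the `token()` wrapper, the three pieces of `number` would each be separate tokens.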

~~~
mncharity
> grammar author does fully control the division between the parser and lexer

Nifty. So explicit GLR forks at rule level (`conflicts:`), non-forking token
"conflict" resolution[1], no token regex backtracking pressure across tokens,
and token resolution at a single code position can(?) differ across
conflicting rules? I'm uncertain on that last bit.

> In most cases, you don't have to think about it

Well yes, but, some of us crave _much_ more syntactically flexible languages.
:)

[1] [https://tree-sitter.github.io/tree-sitter/creating-parsers#c...](https://tree-sitter.github.io/tree-sitter/creating-parsers#conflicting-tokens)

------
specialp
The Strange Loop videos are still in the process of being uploaded. There are
a lot that deserve a look. I was there at this talk as well. This is one of
the most diverse and high-level conferences out there.

[https://www.youtube.com/channel/UC_QIfHvN9auy2CoOdSfMWDw/vid...](https://www.youtube.com/channel/UC_QIfHvN9auy2CoOdSfMWDw/videos)

------
lioeters
Fascinating work!

This talk is a part of an archive not accessible from the site's menus:
[https://thestrangeloop.com/2018/sessions.html](https://thestrangeloop.com/2018/sessions.html)

------
crb002
TIL Tim Wagner did a thesis on IDEs before the serverless thing. The man has a
passion for developer user experience.

------
Avi-D-coder
This is great. Has anyone used Tree-sitter in another editor?
How does the performance of Atom compare to VS Code these days?

------
dwenzek
Pretty cool! I was impressed by the way Max refactors a bunch of code using
extend selection (at 16:45).
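
For anyone wondering how a syntax-aware "extend selection" could work, here is a minimal sketch of one plausible approach: grow the selection to the smallest syntax node that strictly contains it. This is not Atom's actual code. The node shape (`type`, `start`, `end`, `children`) and the hand-built tree for the text `foo(a, b)` are assumptions for illustration, not Tree-sitter's node API.

```javascript
// Toy syntax tree for the text `foo(a, b)`; offsets are [start, end).
// The node shape is hypothetical, chosen only to keep the sketch small.
const tree = {
  type: 'call', start: 0, end: 9,
  children: [
    { type: 'identifier', start: 0, end: 3, children: [] },
    {
      type: 'arguments', start: 3, end: 9,
      children: [
        { type: 'identifier', start: 4, end: 5, children: [] },
        { type: 'identifier', start: 7, end: 8, children: [] },
      ],
    },
  ],
};

// Return the smallest node that contains [start, end) and is strictly
// larger than it, or null if no such node exists (for example, when the
// selection already covers the root).
function extendSelection(node, start, end) {
  const contains = node.start <= start && end <= node.end;
  if (!contains) return null;
  // Prefer a deeper (smaller) match from the children first.
  for (const child of node.children) {
    const found = extendSelection(child, start, end);
    if (found) return found;
  }
  const strictlyLarger = node.start < start || end < node.end;
  return strictlyLarger ? node : null;
}
```

Calling `extendSelection` repeatedly, feeding each result's `start` and `end` back in, walks outward from a cursor inside `a` to the argument list and then to the whole call.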

------
habitue
Ok, now I just need to get this working in emacs.

