Tree-sitter: an incremental parsing system for programming tools (github.com/tree-sitter)
476 points by sbt567 14 days ago | 133 comments



Tree Sitter is amazing. The parsing is fast enough to run on every keystroke. The parse tree is extremely concise and readable. It resembles an AST more than a parse tree (ie no 11 levels of binary op precedence rules in the tree). The parse tree emits specific ERROR nodes, so you can get a semi-functional tree even with broken syntax.

I can't wait for the tools to get built with this. Paredit for TypeScript. Syntax-tree based highlighting (vs regex highlighting). A command to "add an arg to current function" which works across languages. A command to add a CSS class to the nearest JSX node, or to walk up the tree at the className="| ..." position, adding a new className if it doesn't exist.

There's a nicely documented Emacs package for this [1]. The documentation is at [2]. The parse trees work great. There's syntax highlighting support and tree-walking APIs. There's a bit of confusion about TSX vs typescript langs but it's fixable with some config change [3].

[1]: https://github.com/ubolonton/emacs-tree-sitter [2]: https://ubolonton.github.io/emacs-tree-sitter/ [3]: https://github.com/ubolonton/emacs-tree-sitter/issues/66#iss...


Worth calling out that the syntax highlighting support is used to highlight several languages in github.com. (Linguist is still used for the long tail of languages, but we plan to migrate more and more over to tree-sitter-based highlighting over time.)

The query language is also what's used to drive the fuzzy/ctags-like Code Navigation feature. Both of those are powered by tree-sitter query files defined in each language's repo, like these for Go: https://github.com/tree-sitter/tree-sitter-go/tree/master/qu...


Awesome to hear that amazing tech like tree-sitter lives on even though Atom, the product it was built for, is pretty much on life support at this point.

Curious if there are any efforts to bring tree-sitter to VSCode? Exposing tree-sitter to extensions could open up so many possibilities like OP mentioned.



A friend of mine started working on an experimental Emacs mode to provide structural navigation of code based on tree-sitter: https://cs.tvl.fyi/depot/-/tree/users/Profpatsch/emacs-tree-...

The potential for this is essentially something like Paredit, but for all languages.


Can someone point to some examples of what `paredit` for other languages would provide? I do various Lisp programming occasionally but have not used `paredit` yet.


Check out this video for a quick demo: http://emacsrocks.com/e14.html

If you know a Lisp I recommend just giving paredit a spin for a few minutes, it's an interesting experience.


Looks like it's mainly tree/code manipulation. Typing code on the keyboard is probably the least taxing thing when it comes to software development. But I guess it will be nice once it has become a "reflex" rather than a conscious key-combo.


It's not so much about reducing the number of characters typed as about moving the way you think about code from the character level to a more structural level.

Calling it a "reflex" is an interesting phrase! Tools like magit let me encode complicated processes into muscle memory, in a way where retrieval doesn't have to go through remembering and typing a string. Structural editing is similar.


It's not about easier typing.

It's about typing code, as opposed to typing text, with all the structural, highlighting, auto-formatting, auto-completion, error-detection, etc advantages this brings.


I only started using it a few months ago. It's such a natural way to edit code, it only took me about a day for it to become reflexive.

Now it just feels vaguely annoying to work without it. It's fine, it's just one of those ergonomic changes that nags at you a bit. Kind of like the opposite of that feeling of taking off uncomfortable business clothes at the end of the day. Or what I imagine people who are better at vim than me keep talking about.


It's not just saving keystrokes. It eliminates a whole class of errors. I recently did ~4-500 lines of Clojure in CodeMirror and wanted to kill myself by the end of it.


Can someone recommend the best material for learning Emacs and Lisp at the same time?


Neovim nightly already has some tools available as plugins. I'm using tree-sitter for syntax highlighting, text objects, and folding right now. Pretty satisfied so far.


The official release of built-in treesitter comes with neovim 0.5. Which looks like it'll be out pretty soon. I've been watching a fairly steady march toward release here: https://github.com/neovim/neovim/milestone/19


Tooting my own horn, Emacs’ csharp-mode[1] is undergoing a rewrite to be 100% based on tree-sitter rather than regexps.

The new code runs way faster and is so much nicer to work with.

Once all the kinks are gone, I can’t imagine going back.

[1] https://github.com/emacs-csharp/csharp-mode/blob/master/csha...


This looks awesome! Do you mind elaborating on any kinks which you found?


A few:

- indentation may be fine for a final doc, but not always while editing. Especially for new lines starting new code-blocks.

- adding new syntax not already known by tree-sitter requires upstreaming to at least 2 repos before we can use it in a released version of our package. This can feel slower and less hands-on than working in a single repo where you have full control.

No super-biggies yet though.


When tree-sitter reaches 1.0 [0], it may be possible to eliminate the tree-sitter-langs upstream, or both.

[0]: https://github.com/tree-sitter/tree-sitter/issues/930


I'm so excited for this to become built-in in more places! I think once non-lisp users can experience the Power of Structural Editing they'll say, "Hey, I understand now why you all feel so passionate about your parentheses!"

And I can stop feeling like my fingers have all lost a knuckle when I'm writing Typescript :)


Maybe I can finally have this syntax highlighting style: https://youtu.be/b0EF0VTs9Dc?t=900


There is an emacs package for this (maybe beta). I can't remember the name of it and Google is failing me.

EDIT: finally found it https://github.com/alphapapa/prism.el


Rainbow delimiters mode kind of does this, but doesn't maintain the scope color of referenced variables.


The idea is pretty awesome, but my eyes nearly rolled out of my head from the needless condescension at the beginning.


it's just how old people talk. Rob Pike speaks of syntax colouring in the same way - he quotes the Bible.

I don't see it as condescension though. I think it's just a way of speaking.


I know plenty of "old people" who don't talk like this. In fact, the "old people" that I respect tend to be a lot more open-minded.

It's one thing to joke about it a little, but this is just arrogance on display with obvious derision for us children who find that traditional syntax highlighting is beneficial.

I recall reading once that Vim tabs were a crutch for people who "can't remember what they're working on". It's the same kind of arrogance and presumptiveness.


I'm pretty sure you can implement this in Neovim with Tree Sitter now.

This would be a wonderful idea in any programming language.

But I agree with the other commenter that this speaker is really condescending. Not a person I'd want to work with.


nvim-treesitter is pretty promising, do you know if it supports this type of highlighting already?


if your mute is on, accompanying audio states:

"You've all seen syntax coloring, right? That's something we put in our text editors to make it easier for kindergardeners to do programming"

:)


> "Paredit for TypeScript"

Is there a list of ideas for Structural Editing in C-like languages?

I can think of `extend-selection`, `move to parent block`, `add arg to function`


Don't have a list, but Dark is doing some really cool things with an editor that depends pretty much exclusively on structural editing (i.e. you can't even make a syntax error if you tried): https://darklang.com/


I tried to use this to ease the front end work load of students in a compiler project (building a C compiler) for a University course, so that the project could be focused on the more interesting middle and back end parts of the compiler. However, reported bugs in the C grammar that saw no activity at all [1] made this impossible. From this small sample of experiences, I was left with the impression that Tree Sitter is great for things like syntax highlighting, where wrong results are annoying but not dramatic, but not so suitable for tools that need a really correct syntax tree.

--- [1] https://github.com/tree-sitter/tree-sitter-c/issues/51


Hi there! You're right that the C grammar in particular is one that could use some love. C is not one of the languages that we're syntax highlighting with tree-sitter yet, nor is it one of the languages that we support Code Navigation for. That means that my team has had to prioritize their work in other places, and no community members have stepped up to take over or help out with maintenance of the C grammar. Not a satisfying answer, I realize, but an honest one.

I'm an engineer on the code intelligence team at Sourcegraph.

We've been busy building out true precise code intelligence/navigation support, but we also have a mode for zero-configuration code navigation based on text search, universal-ctags, and hand-rolled regular expressions (which works surprisingly well!). Tree-sitter would definitely give better results than our current ctags-based approach. It's been catching our attention more and more lately, and we have plans to use it to upgrade our out-of-the-box, instant code navigation experience.

It's not the exact right fit for our primary goals though, since it's designed around being extremely fast while editing and robust against errors. Sourcegraph is only used for navigating committed code, so we're leveraging formats like LSIF to generate complete semantic graphs of codebases and their entire dependency tree. That'll enable a lot of features that are out of reach for tree-sitter, but is a lot harder to get working out of the box and it's a much bigger technical investment.

It's very interesting to see the topological space that houses these solutions fill out. Every tool has its own set of unique trade-offs and falls somewhere on these spectrums:

- fast vs slow

- precise vs imprecise

- zero-configuration vs configuration required

We've visited a few islands in this space but are still very curious to see what other islands can be discovered. We're especially excited about tools and formats like tree-sitter and LSIF around which a large and supportive community can grow so that all the products we love and rely on as developers can all make forward progress.


What are those features out of reach of tree-sitter? I can see that you theoretically want something that's optimized for parsing well-formed code all at once, rather than potentially malformed code incrementally, but what trade-offs does tree-sitter make in practice that limit its potential for your use case? On the face of it, it seems to me like tree-sitter could serve as a perfectly fine building block for generating LSIF or whatever from a code file.


> On the face of it, it seems to me like tree-sitter could serve as a perfectly fine building block for generating LSIF or whatever from a code file.

It does seem this way. Another reply [1] to this post makes the same point with a nice proof-of-concept as well.

[1]: https://news.ycombinator.com/item?id=26230900


I wish there was a more universal format for parsers, but I just don't think there are enough people who know their stuff.

Take PHP, a language that a lot of people use: the tree-sitter-php extension doesn't support features added in 2019, let alone features added towards the end of 2020.

If you want an up-to-date PHP parser, there's really only one open-source parser[0] that's accurate enough to be used on PHP codebases old and new, and it's written in PHP. Then if you want to parse in a robust fashion you have to adopt a number of hacks to get everything working.

I hadn't encountered LSIF before – can GitHub be configured to use those maps?

[0] https://github.com/nikic/PHP-Parser


We've looked at LSIF before, and decided against it for a few reasons, mostly around COGS, operational overhead, and indexing latency. I gave a talk at last year's FOSDEM [1] going into some of the details. (Caveat that that talk was from when we were using a different open-source library, Semantic, to power fuzzy Code Nav. It's much easier to support new languages using the now-current tree-sitter query approach!)

[1] https://dcreager.net/talks/2020-fosdem/


Could you compare Sourcegraph to something like Moose, FAMIX, GToolkit?

https://github.com/moosetechnology/Moose


Hey, Tree-sitter author here. Thanks for posting! Let me know if you have questions about the project.


There's been some recent discussion as to whether tree-sitter grammars can be used to parse markdown with some hacks or not (currently it's being done by working around all the tree-sitter machinery, resulting in a lot of problems), with no consensus among plugin authors:

https://github.com/nvim-treesitter/nvim-treesitter/issues/87...

Could you possibly chime into that discussion and help them with any possible insights you might have on that? That would be really awesome! TIA <3


I'm curious if tree-sitter can handle C/C++. I think it's super difficult with metaprogramming. Without the preprocessor, I think it is not possible to parse C++ correctly.

I’ve been using tree-sitter via FFI from Common Lisp, but what I’d really like would be a way to write my own code generator so that the generated parser could be “native” lisp code. Otherwise, it’s an amazing tool: my only other complaint would be the lack of a grammar for objective-c which would be useful for a lisp/objective-c bridge I’ve been working on.


I think that it'd be pretty easy to generate parser code in other languages besides C, but it would be a lot of work to do to port the core library itself[1] to those other languages.

[1] https://github.com/tree-sitter/tree-sitter/tree/master/lib/s...

I agree about the Objective-C grammar! Although it looks like somebody's started work on it:

https://github.com/merico-dev/tree-sitter-objc


I've done two grammars for my own use in the last few months (well, one isn't quite complete yet) and it's been quite an enjoyable (learning) experience. Thanks for sharing this tool!


That's great to hear. Thanks!


There's an architecture for compilers that I've been wanting for years where a keystroke change to the source code results in an incremental change to the AST, and then the compiler can consume that AST delta to generate a binary patch to the compiled executable.

Would tree-sitter be able to be used for that? (What I want is to feed tree-sitter a stream of keystroke changes and get out a stream of minimal AST changes as a result).


You don't get the AST _diff_ as the result (you get a new tree whose structure is shared with the old tree), but tree-sitter is specifically designed to support this kind of incremental edit use case: https://tree-sitter.github.io/tree-sitter/using-parsers#edit...
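To make "a new tree whose structure is shared with the old tree" concrete, here is a toy sketch in Python. It is not tree-sitter's API: the `Node` class, `replace_at`, and `changed_paths` are hypothetical names invented for illustration. It only shows how an edit can rebuild the path from the root to the edited node while reusing everything else, and how a consumer could then recover a small delta by skipping subtrees that are the same object in both trees.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Immutable toy syntax node (hypothetical, not a tree-sitter type)."""
    kind: str
    children: Tuple["Node", ...] = ()
    text: str = ""


def replace_at(root, path, new_node):
    """Return a new tree with the node at `path` replaced.

    Only nodes along `path` are rebuilt; all other subtrees are reused as-is.
    """
    if not path:
        return new_node
    kids = list(root.children)
    kids[path[0]] = replace_at(kids[path[0]], path[1:], new_node)
    return Node(root.kind, tuple(kids), root.text)


def changed_paths(old, new, path=()):
    """Yield paths of the smallest differing subtrees, skipping shared ones."""
    if old is new:
        return  # same object in both trees: nothing changed underneath
    if (old.kind != new.kind
            or len(old.children) != len(new.children)
            or not old.children):
        yield path
        return
    for i, (a, b) in enumerate(zip(old.children, new.children)):
        yield from changed_paths(a, b, path + (i,))


# `x = 14; x` edited to `x = 15; x`
assignment = Node("assignment", (Node("identifier", text="x"),
                                 Node("integer", text="14")))
program = Node("program", (assignment, Node("identifier", text="x")))
edited = replace_at(program, (0, 1), Node("integer", text="15"))
delta = list(changed_paths(program, edited))
```

Here `delta` contains a single path pointing at the integer literal, and the trailing `identifier` node is literally the same object in both trees. A compiler front end that wants a stream of AST deltas would have to do something along these lines on top of whatever the parser returns.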


When I played around with tree-sitter a bit, I noticed there were situations where AST elements didn't contain exactly what I'd expect them to. For example: comments are represented in the AST, but unfortunately they don't have the contents of the comment parsed out following the language's conventions.

I was wondering if this is a case I could open an issue about? Is this for the main tree sitter repo or should I open one language-by-language?

I was looking into automating some stuff across all languages with tree-sitter, but handling all of the languages' comment syntaxes made it very hard.


Most tree-sitter grammars just parse comments as a single token. Can you give an example of what you mean when you say "contents of the comment parsed out"?

Are you talking about conventions like JSDoc, for putting structured data inside of comments? On GitHub, we handle that by parsing JSDoc comments in a separate pass, using a separate parser. We do it this way because JSDoc isn't really part of the JavaScript language, not all projects use JSDoc, and not all applications are interested in parsing the text inside of comments.


My guess is that they meant parsing code that has been "commented out".


I interpreted it to mean, "Remove the *s from code like this:"

    /* This comment
     * Should just be alphanumeric.
     */


Yep, this is exactly what I meant. Turning

    /* Something */

or

    { Something }

into:

    " Something "

Or, even better, into:

    "Something"

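Since most grammars hand back the whole comment as a single token, this kind of normalization has to happen in a separate step. A minimal sketch in Python, assuming you already have the raw comment text: the delimiter tables and the `comment_text` function are made up for illustration and cover only a few comment styles, and nothing here is part of tree-sitter.

```python
import re

# Hypothetical delimiter tables; a real tool would need one per language.
BLOCK_DELIMS = [("/*", "*/"), ("(*", "*)"), ("{", "}")]
LINE_PREFIXES = ["//", "#", "--"]


def comment_text(raw: str) -> str:
    """Strip comment delimiters and leading '*' gutters from a comment token."""
    s = raw.strip()
    for open_, close in BLOCK_DELIMS:
        long_enough = len(s) >= len(open_) + len(close)
        if long_enough and s.startswith(open_) and s.endswith(close):
            body = s[len(open_):len(s) - len(close)]
            # Drop the " * " gutter that block comments often carry per line.
            lines = [re.sub(r"^\s*\*\s?", "", line) for line in body.splitlines()]
            return "\n".join(line.strip() for line in lines).strip()
    for prefix in LINE_PREFIXES:
        if s.startswith(prefix):
            return s[len(prefix):].strip()
    return s


one_liner = comment_text("/* Something */")
braces = comment_text("{ Something }")
multi = comment_text("/* This comment\n * Should just be alphanumeric.\n */")
```

Both `one_liner` and `braces` come out as `Something`, and `multi` keeps its two lines without the stars. The obvious weakness is that a comment line whose real content starts with `*` loses that character, which is one reason this is hard to do generically.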

Are there any plans to support modifying the grammar on the fly or without recompiling?


I don't think you can do this without recompiling, since the grammars get translated into C code before use. But the built-in command line tools (‘tree-sitter parse’, etc) all support a mode where they will detect local changes to a checked-out grammar definition, and recompile on the fly if needed. (This happens each time the CLI program is started up; it doesn't happen during a long-running process.)


The obvious answer is to embed TCC or another C compiler and either generate a dynamic library or generate wasm and load it directly into the process.

exec_wasm(generate_wasm(generate_c(grammar)))

Now if you can make that whole function chain incremental, as a delta_grammar -> delta_c -> delta_wasm -> delta_recomputed_wasm_call stack, the deltas will propagate down to exec_wasm, and you could dynamically execute the generated code as the grammar changes.


One day, I would love to generalize the web-based playground so that you could edit the grammars. But it's complicated, because we use C as our output language, so you would always need to recompile the C after changing the grammar.

So, I would say that it's not on our near-term roadmap.


Is it possible to use tree-sitter to generate parsers in languages other than C? How hard would it be to modify it to create parsers in e.g. Java?

Edit: sorry, I just saw that you had answered that below.


Thanks for building this. I had not heard of it before, but it looks great. Are there more tutorials elsewhere on the Internet you would recommend, besides what is in the documentation?


Not that I know of, right now :(.

In the near future, we'll create some more GitHub-specific documentation that walks you through how to add advanced language support for any programming language on GitHub, by writing a Tree-sitter grammar, and then by writing the tree queries that are used for syntax highlighting, simple code navigation, and someday soon... precise code navigation.


To me, the most impressive use of tree-sitter was an iOS text editor that uses it to parse huge JSON files / mixed language files and highlight them in a very robust way. [0][1] I’m hoping tree-sitter becomes more common like LSP and Emacs can get exact highlighting and other tools with it…

[0]: https://twitter.com/simonbs/status/1352697855845273600

[1]: https://twitter.com/simonbs/status/1362492842141171720?s=21


Emacs does have a package to use tree-sitter [0]. I think emacs-lsp is aware of this highlighting backend and performs pretty well.

(semantic highlighting is pretty slow for C++ with font-lock, with tree-sitter it's a breeze :))

[0]: https://github.com/ubolonton/emacs-tree-sitter


Wait what? You mean to tell me that Emacs LSP already uses tree-sitter?

Yeah but I don't think LSP specs contain syntax-highlighting or semantic highlighting.


LSP supports semantic highlighting: https://microsoft.github.io/language-server-protocol/specifi...

Though AIUI the basic syntax highlighting is done by the editor (e.g. VSCode uses Textmate grammar support).


FYI there is tree-sitter.el for Emacs.


You can watch a good Strangeloop presentation on Tree Sitter. https://www.youtube.com/watch?v=Jes3bD6P0To


Tree-sitter is unfathomable to me. This is the grammar for Ruby:

https://github.com/tree-sitter/tree-sitter-ruby/blob/master/...

I find it absolutely amazing that a grammar for something as complicated as Ruby can be so concise. Less than a thousand lines. The corresponding Bison grammar is 13k lines. And I think the tree-sitter one is scannerless so also includes the lexer?! How do they do it?


No, the Ruby grammar is actually an outlier from what I've seen; it has one of the largest/most complex external scanners: https://github.com/tree-sitter/tree-sitter-ruby/blob/master/...

Precisely because the language is complicated and less amenable to LR parsing.


Not a ruby developer here: that sounds terrifying! Does it make it harder to have a proper mental model of the language (note: not the libraries) or is this mainly because of flexibility (too many ways to skin one cat)?


I don't write Ruby regularly either, but I wouldn't say that syntactic complexity is necessarily equivalent to semantic complexity. And the syntax is the only part that's relevant to Tree-sitter: it's not an interpreter/compiler.

Note also that (as I alluded to above) the parsing technique that Tree-sitter uses, "LR parsing", makes some things more difficult to parse than they'd be with another kind of parser. This is a deliberate trade-off, because LR parsing makes certain features of Tree-sitter, like fast re-parsing in response to input changes, much much easier.


So, a syntactic tree is a list of elements, grouped by their ordering, which are to be parsed from their arguments, as they appeared in the input. Or a grammar tree, which is a set of elements. There's many things we can do to make Tree-sitter simpler to read and write. Perhaps, like in Perl, there are syntactic categories of types that make it much easier to find things like nodes in a tree, since they're the ones that come in the input. Or I'd be willing to say that maybe, like in Haskell, certain aspects of the language, are syntactic categories, like the parser. So some things that might not be obvious in code, like what the syntax for a class of names is, might be obvious in theory, too. Or, at least they might be obvious in a particular way. Or some aspects of the compiler are really special, and we can infer those in terms of what the compiler does. Or, of course, we can do all these other things, too. We can rewrite the parser, or the compiler, to try to do more or less anything that the parser does. Or maybe we can make Tree-sitter a lot simpler in general. Which I think is probably what you've been thinking about.


> Not a ruby developer here: that sounds terrifying! Does it make it harder to have a proper mental model of the language

It is a little terrifying in the sense that I'd not want to write language level tools (eg: syntax highlighter).

But if you have Scheme on one end and natural language on the other, Ruby leans a bit towards natural language - but in a good way. In some ways Ruby isn't that different from Smalltalk - but it has a lot of conveniences (sometimes I think too many, sometimes not).

Parentheses and brackets are largely optional "where it makes sense". Conditionals support postfix, eg these are equivalent:

  if should_send?() 
    send_mail({to: 'u@x.com'}) 
  end

  send_mail to: 'u@x.com' if should_send?


My mental model of Ruby is one of the simplest of any of the languages I've worked with, but it's also the hardest to put into any words. JS actually does beat it out, and then Scala and Python come after.

Everything is kind-of-but-not-really an object, a reference, and a function, all at the same time - which sounds complicated but in my head... turns out to be pretty simple. Everything's just kind of different flavors of the same thing. `attr_accessor` is a good place to see this in action.

The flexibility comes more from the variety of available core language options (procs, blocks, and lambdas) and core libraries (map/each/collect, for example), not from a variety of underlying concepts.


Flexibility. “Too many” is debatable: most organizations wind up settling on a subset of the idioms that Ruby provides, and some of the more esoteric constructs see infrequent use anywhere.

There has been, however, discussion about the need to clean up some of the lesser-used language features, but obviously doing so carries risks.


It's mostly to make it work in a way that's less surprising to the programmer, AFAIR. Probably most of the complexity comes from having to differentiate local variables and methods depending on whether the symbol had an assignment earlier in the scope.


This is more a function of Ruby than of tree-sitter. The tree-sitter grammars for other languages are hopefully less inscrutable. For Ruby, we basically just ported whitequark's parser [1] over to tree-sitter's grammar DSL and scanner API.

[1] https://github.com/whitequark/parser


I didn't mean the tree-sitter grammar was not understandable - it's very understandable - I just can't work out how you managed to find such a concise way to express grammars. Even compared to Whitequark it's 1/3 the size. What's the unique thing you do that makes it so concise?

It also seems somehow to be completely declarative? How have you managed to transform Ruby parsing to be context-free? For example where's the set of what's currently a local variable so you can distinguish from method calls?


Ahh my mistake! :-)

To be fair, we're cheating a little bit because the Ruby grammar relies so heavily on an external scanner, which is just under 1,000 lines of C++: https://github.com/tree-sitter/tree-sitter-ruby/blob/master/...


But for example, how do you parse the difference between `x = 14; x` and `y = 14; x`? In the latter case `x` is a method call, and in the former it's a local variable read. I can't see where the parser maintains a set of local variables and where it queries this set. Is it somehow done declaratively? If so, that's a huge achievement; I don't think that's really been done before in a parser generator.

I really want to try tree-sitter for using in an actual Ruby implementation because it's so beautiful!


[EDITED to make the example actually line up with OP's test]

There's no symbol table in the parser, so at parse time, we don't distinguish those cases:

  $ cat test.rb
  module Test
    def test1
      x = 14; x
    end

    def test2
      y = 14; x
    end
  end
  $ tree-sitter parse test.rb
  (program [0, 0] - [9, 0]
    (module [0, 0] - [8, 3]
      name: (constant [0, 7] - [0, 11])
      (method [1, 2] - [3, 5]
        name: (identifier [1, 6] - [1, 11])
        (assignment [2, 4] - [2, 10]
          left: (identifier [2, 4] - [2, 5])
          right: (integer [2, 8] - [2, 10]))
        (identifier [2, 12] - [2, 13]))
      (method [5, 2] - [7, 5]
        name: (identifier [5, 6] - [5, 11])
        (assignment [6, 4] - [6, 10]
          left: (identifier [6, 4] - [6, 5])
          right: (integer [6, 8] - [6, 10]))
        (identifier [6, 12] - [6, 13]))))

In both cases the bit after the semicolon just parses as (identifier).

For some use cases (e.g. syntax highlighting, depending on your colorization rules) it doesn't matter, and so we don't want to pay the cost. If it does matter (like in an actual implementation), then you'd have to implement this yourself and drive it by the parse tree you get from tree-sitter.
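As a rough illustration of what "implement this yourself and drive it by the parse tree" could look like, here is a toy pass in Python. The `Node` class is a hypothetical stand-in for whatever tree type your bindings give you, and the walk handles only plain assignments and bare identifiers (no parameters, blocks, or nested scopes), so treat it as a sketch of the idea rather than a Ruby-accurate resolver.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node."""
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def classify_identifiers(method: Node):
    """Label each bare identifier in a method body as a local or a method call.

    Statements are visited in source order; a name counts as a local only
    once an assignment to it has been seen earlier in the same method.
    """
    assigned = set()
    labels = []

    def visit(node):
        if node.kind == "assignment":
            target, value = node.children  # assumes a plain `name = expr`
            assigned.add(target.text)      # record the left-hand name first
            visit(value)
        elif node.kind == "identifier":
            kind = "local" if node.text in assigned else "method_call"
            labels.append((node.text, kind))
        else:
            for child in node.children:
                visit(child)

    for statement in method.children:
        visit(statement)
    return labels


def body(assigned_name):
    """Build the toy tree for `<assigned_name> = 14; x`."""
    return [
        Node("assignment", children=[Node("identifier", assigned_name),
                                     Node("integer", "14")]),
        Node("identifier", "x"),
    ]


test1 = classify_identifiers(Node("method", "test1", body("x")))
test2 = classify_identifiers(Node("method", "test2", body("y")))
```

With these inputs `test1` labels the trailing `x` as a local and `test2` labels it as a method call, which is the distinction the parser itself leaves out.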


Right you could just have a phase to fix-it-up after parsing. Much better than trying to shoe-horn an imperative action into a nice more-pure parser. Great idea!


So what's cool is that while we don't handle that during parsing, you can use another set of tree-sitter features to do tree queries to achieve this. Here's the query for detecting Ruby locals: https://github.com/tree-sitter/tree-sitter-ruby/blob/32cd5a0... and here's some better documentation for how the query language works: https://tree-sitter.github.io/tree-sitter/syntax-highlightin....


The code is obviously much simpler than its syntax - most importantly, its syntactical simplicity makes it way easier to deal with. So when you write the code to parse it you don't have to try to parse it in one fell swoop like you do in Whitequark.

So you can't read anything from a method call! I can make it so, if you're doing a class method (of any kind) you have to invoke the constructor, as described in "What is a method?" There's also a few new techniques like "new_class_method", which requires creating an object (of some kind) for that class... but what about that? It's not "I've just fixed Tree-sitter's problem"; it's that Tree-sitter hasn't yet resolved the problem yet - there are other parsing problems besides Tree-sitter in Ruby itself like those of classes (and classes are not part of Tree-sitter) and things that are known as "type-traits" and so on - so as it's not quite enough it can be done by other things. The reason for using LR grammar is that when it comes to this - what do I want from that grammar?

The point I'm making here is that LR doesn't give a reason for what you're doing. As a programmer you are trying to write code that is portable because - if it works in a domain you don't understand (such as Ruby) - then you don't know what you're doing is wrong. There can be a domain (as in any language) that's a lot more complex than this - but since we've got that, how can I be sure it won't mess up the code I'm writing?


> The code is obviously much simpler than its syntax

What code? The parser? How can it be simpler than its syntax? It has syntax and semantics, which is strictly more than the syntax.

> The point I'm making here is that LR doesn't give a reason for what you're doing.

What do you mean 'what you're doing'?


Hey thanks! I'm one of the primary developers of this grammar along with @maxbrunsfeld. It was the driving force for supporting an external scanner and while there are still some Ruby edge cases, I'm pretty happy with how it came out. I will say we spent a lot of time on this and I read both the bison Ruby grammar and whitequark's ruby parser (which is excellent) in great detail to understand how to deal with certain parts of the language.

One thing I love about tree-sitter is how both the grammar and the resulting ASTs are so readable. I can come back to this project after months of not contributing and pick up right where I left off.



No, the JSON file there is generated (I believe?) from the JavaScript I linked, while the Bison file is hand-written.

With tree-sitter you're hand-writing a 1k file. With Bison you're hand-writing a 13k file.


@chrisseaton you are correct, the JSON file is generated. The handwritten parts are:

- https://github.com/tree-sitter/tree-sitter-ruby/blob/32cd5a0...

- https://github.com/tree-sitter/tree-sitter-ruby/blob/32cd5a0...

So about 2k loc.

The trickiest (and most verbose) parts of the external scanner have to do with heredocs and the various ways to declare literals (strings, symbols, regexes, etc).


If curious, past threads:

Tree-sitter: new incremental parsing system for programming tools (2018) [video] - https://news.ycombinator.com/item?id=21675113 - Dec 2019 (28 comments)

Tree-sitter – a new parsing system for programming tools [video] - https://news.ycombinator.com/item?id=18213022 - Oct 2018 (25 comments)

Others?


One more that I know of:

Atom understands your code better than ever before - https://news.ycombinator.com/item?id=18349013 - Oct 2018


I recently used this to put together a unified PL classification model. It's nice because any language treesitter grows to support we'll support pretty effortlessly and treesitter captures more than enough nuance per language to derive high quality classifications.

It's fair to say we can classify a snippet of code based on either single or multiple AST paths produced by treesitter. Right now we're only doing the programming language, but extending it to function classification or description etc. isn't out of the question; we just don't need it right now.


I'm curious to see if Tree-sitter can be used to provide fast and rich code navigation. I was able to implement simple goto definition/references [1], but I'm not sure whether it can be used for more advanced navigation features in a language-agnostic way.

If you're interested, GitHub is already using it [2] for that purpose and Sourcegraph is experimenting with it [3].

[1] https://github.com/alidn/lsif-os [2] https://github.com/github/semantic [3] https://github.com/sourcegraph/sourcegraph/issues/17378


At GitHub, we're in the process of building a more precise code navigation system on top of Tree-sitter, that models language-specific name-resolution rules in detail.

Our currently-available code navigation system also uses Tree-sitter, but it is pretty simple; it just matches up references and definitions by their name.
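
As a rough illustration of what "matches up references and definitions by their name" means, here is a minimal sketch. It is not GitHub's implementation; the `Symbol` record, the file names, and the functions `index_definitions` and `resolve` are made up for the example. In practice the definitions and references would be extracted from parse trees.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Symbol(NamedTuple):
    """Hypothetical record for a definition or reference found in a file."""
    name: str
    path: str
    line: int


def index_definitions(definitions: List[Symbol]) -> Dict[str, List[Symbol]]:
    """Group definitions by bare name, ignoring scopes, imports and types."""
    index: Dict[str, List[Symbol]] = defaultdict(list)
    for definition in definitions:
        index[definition.name].append(definition)
    return dict(index)


def resolve(reference: Symbol, index: Dict[str, List[Symbol]]) -> List[Symbol]:
    """Return every definition whose name equals the reference's name."""
    return index.get(reference.name, [])


definitions = [
    Symbol("parse", "a.go", 10),
    Symbol("parse", "b.go", 3),
    Symbol("render", "c.go", 7),
]
index = index_definitions(definitions)
candidates = resolve(Symbol("parse", "main.go", 42), index)
```

Here `candidates` contains both `parse` definitions, because nothing in a name-only match can tell which one the reference really points to. That ambiguity is what modelling name-resolution rules is meant to remove.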


Is this the same thing neovim uses for syntax highlighting?

Is there a chance of it getting integrated into vim? Last I checked, vim used a regex method which was slow and faulty.


Yup, neovim 0.5+ will be using treesitter for any supported languages, with the current Regex highlighting as a fallback.


Follow the nvim 0.5 release here: https://github.com/neovim/neovim/milestone/19


So far it's an amazing tool and we are happy to use it in our projects. The only two complaints I have are the dependency on JavaScript[1] and the missing Rust runtime option[2].

[1] https://github.com/tree-sitter/tree-sitter/issues/465

[2] https://github.com/tree-sitter/tree-sitter/issues/465


We've been using tree-sitter for Semgrep and it's nothing short of incredible. Amazing work by Max and team.


We really appreciate the contributions you all have sent back to tree-sitter and the various language grammars, too!


Here's what it looks like to call it from Rust: https://github.com/tree-sitter/tree-sitter/tree/master/lib/b...

Seems like this would make it much easier to bootstrap a performant language-server. Very cool; maybe that will be my next project.


We also have several of the language grammars published as crates: https://crates.io/search?q=tree-sitter (And doing the same for other grammars is a fairly painless process.)

So if you're writing a tool for a single language (like a language server), it should be as easy as adding tree-sitter and tree-sitter-blah to your cargo manifest.


Awesome! Though my thinking was that it would have an especially large impact for languages that aren't popular enough to have their own LSP yet; you no longer have to be an expert in writing interactive compilers to set up a respectable LSP for a niche language, or even a home-grown one


Yes! This is a great point. It's similar to what I mentioned over on this thread [1] about how we're working on a more precise version of Code Navigation based on tree-sitter. The tl;dr is that you'd write something like tree-sitter queries [2], just like you do for the current fuzzy Code Nav, but the query DSL would be a bit more sophisticated, allowing you to specify the actual name resolution rules of your language. One of the things we're using to test this is an LSP shim that lets us test our rules in VS Code (or any other LSP-compliant editor).

[1] https://news.ycombinator.com/item?id=26227476 [2] https://tree-sitter.github.io/tree-sitter/using-parsers#patt...


> the query DSL would be a bit more sophisticated, allowing you to specify the actual name resolution rules of your language.

This sounds very interesting. Will the query DSL (spec) be available to the public?


That's the current plan! In particular, because we want to allow language communities to implement support for their own languages, and not have to be blocked on my team finding the time to do it. (Just like they can do now with the parser and syntax highlighting / fuzzy code nav rules.) Linguist is our role model here — it currently includes language detection and (regex-based) syntax highlighting rules for 500+ languages. Most of those are contributed by the community. There's no way that my team can migrate all of those in any reasonable amount of time, especially while having to balance that with other feature development and operational responsibilities.

Is the use case for this mainly IDEs or is it intended to replace traditional lexer and parser generators too?


I have used tree-sitter, but only for a very simple use case. The main shortcoming I am aware of is error messages, see here:

https://github.com/tree-sitter/tree-sitter/issues/255

Tree sitter will basically always generate a parse tree, even for malformed input, in which case it will add ERROR nodes for the bits it doesn't like (it will also inform you that there were problems with the parse by setting a boolean attribute). So you have some information you can use to construct a useful error message yourself, but some parser generators will handle this better (although it has to be said that the difficulty of obtaining good error messages from a parser generator is still one of the main reasons production parsers are mostly written by hand).
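
A first pass at "construct a useful error message yourself" could look like the sketch below. It deliberately does not call tree-sitter: `Node` is a hypothetical stand-in with just a type, a zero-based (row, column) start position and children, and `error_nodes` and `error_messages` are names invented for the example. With a real parse tree you would walk the actual nodes the same way.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node (not the tree-sitter API)."""
    type: str
    start_point: Tuple[int, int]  # (row, column), both zero-based
    children: List["Node"] = field(default_factory=list)


def error_nodes(node: Node) -> Iterator[Node]:
    """Yield every ERROR node, depth-first, parents before children."""
    if node.type == "ERROR":
        yield node
    for child in node.children:
        yield from error_nodes(child)


def error_messages(root: Node, filename: str) -> List[str]:
    """Turn each ERROR node into a one-line, compiler-style diagnostic."""
    return [
        f"{filename}:{n.start_point[0] + 1}:{n.start_point[1] + 1}: syntax error"
        for n in error_nodes(root)
    ]


# Hand-built tree standing in for the result of parsing broken input.
tree = Node("program", (0, 0), [
    Node("expression_statement", (0, 0)),
    Node("ERROR", (1, 4), [Node("identifier", (1, 4))]),
])
messages = error_messages(tree, "example.rb")
```

This only reports where the parser gave up. Saying what was expected at that position would take more work, which is the gap the linked issue is about.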


Ah I see, so the repair isn't avoidable for now? That doesn't seem very appropriate for compilers then.


Why would it not be appropriate? The only annoyance I see is that currently you will have to generate a good error message from it yourself, but a first pass at the problem shouldn't be too onerous.


Ok, I misunderstood. I thought it repaired without error sometimes but I see that you were clear that that isn't the case.


We are also using this to power a lot of the program analysis features on github.com. We use it to generate the symbol list for Code Navigation, as an example, and are starting to look at extracting more semantic information about some languages using tree-sitter parse trees as intermediaries.


This is really cool, I 100% agree that as programmers we’re editing and thinking in terms of ASTs. It just happens that text is a high density way to represent those ASTs.

I’m going to play with this and see if I can make a generic language server for vscode that works across languages. Unless someone has already done that.

What would be really cool is tree-sitter (or a sister package) providing incremental formatting primitives across languages.

The closest language-agnostic formatter that comes to mind is prettier.js with its extensions.

incremental parser -> language server -> formatter across languages would be super rad.


While we're in this discussion: Say I want to implement "SQL" for my app (if you've used Jira, I want to make my own JQL). Is this the tool for that? I'm looking for something much simpler than ANTLR.


Should be able to create your own grammar.js, it’s actually much simpler than ANTLR.

Wrote tree-sitter-svelte. Was a good experience. I am also writing a programming language of my own similar to TypeScript and I am using tree-sitter for the same. It's a delight to work with. Removes a lot of the worries.


Do you have your code/demo on GitHub? Would love to play with it.

Not yet, man. It's in progress.

I half-wrote a tree-sitter grammar for a niche DSL (the PRISM probabilistic model checking language). It was a very nice experience. It's part of another half-written side project to create a language server for PRISM; I still haven't gotten around to making the whole end-to-end pipeline work.

With its syntax tree query frontend, I wonder whether tree-sitter would make a good interpreter frontend for some niche languages, or whether you'd need something more powerful.


Does GitHub currently use tree-sitter for syntax highlighting? If yes, are the libraries open-source? Thanks :)

> Does GitHub currently use tree-sitter for syntax highlighting?

For some languages, yes. https://news.ycombinator.com/item?id=26227214

> If yes, are the libraries open-source?

They are! tree-sitter itself is open-source [1], as are all of the language parsers we've listed on the homepage [2]. The syntax highlighting support is documented here [3].

[1] https://github.com/tree-sitter/tree-sitter

[2] https://tree-sitter.github.io/tree-sitter/#available-parsers

[3] https://tree-sitter.github.io/tree-sitter/syntax-highlightin...


I tried looking through the docs, and couldn't find any mention of which algorithm you are using. It seems like some LR grammar, but which kind? LALR? GLR? It seems like a very important bit of information that's suspiciously missing.


It generates LR(1) parsers, and can use GLR on an opt-in basis, for handling specific conflicts.

Here is the "tracking issue" for JetBrains IDEs https://youtrack.jetbrains.com/issue/KT-45087 Upvote the issue if you wanna bump the priority


Are there any benchmarks available versus TextMate regexes?


Next steps: incrementally resolve symbols and type-check?


We're currently working on a more precise version of the Code Nav that's shipped on github.com, which is very similar in spirit to this!


Hi, can you consider adding Kotlin to the list of supported languages? Since the feature launched there is now a Kotlin tree-sitter implementation https://github.com/fwcd/tree-sitter-kotlin (but maybe it needs some improvements)


That’s great, thanks for the link to the grammar repo! I can’t commit to a specific timeline but I’ll definitely put this on the list.


Thanks, that would be really great :)


Where is the SQL parser? Any specific reason why it is missing (not even started)?

My team is only writing tree-sitter parsers as part of working on GitHub developer productivity features like Code Navigation. So the short version is that we (i.e., my team at GitHub) haven't written a tree-sitter parser for SQL because we haven't targeted SQL for Code Nav support yet.

That said, this is exactly why we've released tree-sitter as an open-source project. That way there's no need for anyone to be blocked on my team finding the time to work on an SQL parser. Most extant tree-sitter parsers [1] have been developed by external language communities, and not by the core tree-sitter maintainers.

(Also note that SQL is a particularly wrinkly language, since there are so many different dialects. Are you looking for an ANSI SQL parser? A MySQL SQL parser? One that covers all of them to some degree?)

[1] https://tree-sitter.github.io/tree-sitter/#available-parsers



