Hacker News new | past | comments | ask | show | jobs | submit login
How to write a linter using tree-sitter in an hour (siraben.dev)
270 points by siraben on March 27, 2022 | hide | past | favorite | 11 comments

This is a great article and I think tree-sitter's design choices are creating a budding ecosystem of new program analysis tooling. This article discusses the speed advantages, but I think the fact that tree-sitter is dependency-free (which the author's previous article did mention) is worth highlighting again.

For context, some of my teammates maintain the OCaml tree-sitter bindings and often contribute to grammars as part of our work on Semgrep (Semgrep uses tree-sitter for searching code and parsing queries that are code snippets themselves into AST matchers).

Often when writing a linter, you need to bring along the runtime of the language you're targeting. E.g., in python if you're writing a parser using the builtin `ast` module, you need to match the language version & features. So you can't parse Python 3 code with Pylint running on Python 2.7, for instance. This ends up being more obnoxious than you'd think at first, especially if you're targeting multiple languages.

Before tree-sitter, using a language's built-in AST tooling was often the best approach because it is guaranteed to keep up with the latest syntax. IMO the genius of tree-sitter is that it's made it way easier than with traditional grammars to keep the language parsers updated.

Highly recommend Max Brunsfield's strange loop talk if you want to learn more about the design choices behind tree-sitter: https://www.thestrangeloop.com/2018/tree-sitter---a-new-pars...

Many folks have often (correctly!) pointed out that each language's existing parsing packages will be more guaranteed to keep up with language changes. Our hope is that if we can get a critical mass of useful tools that all use tree-sitter under the covers, there will _also_ be incentives to ensure that the tree-sitter parser for each language stays up-to-date. And because of how tree-sitter grammars typically live in dedicated repos, that work might be done by external volunteers, and not by the core language developers.

Another benefit to our approach is that it should be much easier to adapt OP's linter to work with other languages, since in a sense it's “parameterized” by the language grammar and queries. If you use Python's `ast` module, you can't easily adapt that code to work on Go programs, for instance.

This is great! I also found I was able to become productive with tree-sitter very quickly, even for tasks it’s not directly designed for. I wrote a (naive) type-stripping TypeScript “compiler” over a weekend using tree-sitter to identify the type annotation nodes to strip. Day 1 it performed on par with ESBuild (which admittedly is doing a lot more) for small modules. Day 2 it was close to ESBuild for a real world 10k LOC module.

This was all a spike to see how I could leverage tree-sitter for real use cases (I really do not want to build yet another TypeScript compiler, I just know the problem space well enough that it was a good starting point).

In the past week I’ve been writing a grammar for XPath. I have a bunch of compelling work related needs which could benefit from that, and maybe surprisingly no grammar for XPath already exists. It’s mostly going smoothly for a lot of my test cases (damn do I have a lot of test cases to reuse, gratefully), but I will say this has been the most challenging part of using tree-sitter so far. Not enough to deter, but worth mentioning if anyone is approaching the library from a similar perspective.

I’m mostly translating a bunch of BNF-ish grammar definitions into tree-sitter’s grammar definition DSL. It’s a good programmatic API for this, but conflict resolution is surprisingly hard to get an instinct for. And the s-expression tree representation + runtime API doesn’t give much insight into nodes which just appear as `(ERROR …stuff)` with no access to details about what caused the error (which doesn’t throw, it just silently builds an invalid tree).

None of this is meant to dissuade anyone from using tree-sitter! It’s an incredible library with a zillion applicable use cases if you’re looking for something with its capabilities. Just cautioning that developing a grammar has been one of the rough edges for me.

Writing grammars presents an interesting set of challenges, especially dealing with ambiguity and precedence as you mention. I find that the `tree-sitter playground` command is convenient for interactively trying out different possibilities. There's also a debug flag "-D" for `tree-sitter parse` which gives detailed information about the shift/reduce points in the parser state and different branches it parsed simultaneously (because it's a GLR parser).

Thanks! I found out about the debug flag yesterday, but I was oh my phone after I called it quits so I haven’t tried it yet. Gonna give it a try today.

I think I saw this post a week already and tried it, because I would need it for many languages.

despite the assurance it would be easy, I could not manage at all. and I gave up eventually. will try again next month probably. contributing to llvm, gcc, gdb, valgrind or qemu was definitely easier.

tree-sitter sounds really powerful, but FFI sounds a bit cumbersome, especially for cross-platform development... Is there any way to avoid it, never mind the performance loss?

The generated parsers are implemented in C, and so currently the only way to use them is via FFI. (Unless you're writing your analysis tool itself in C!)

The `tree-sitter generate` command also produces a full description of the grammar itself (typically written to src/grammar.json). In theory you could use this to generate each grammar implementation in other languages, as well. No one has tackled this yet, though, because in practice it has worked well enough to FFI in the C parser implementations.

Also, depending on what language you're trying to consume tree-sitter parsers in, we're trying to leverage the language's package ecosystem as much as we can to simplify this process. If you're using Rust, for instance, we publish crates for tree-sitter itself [1] and for various language grammars [2,3,4]. So you can often just add some dependencies to your Cargo.toml and the Rust build system takes care of everything for you. (Note that these crates also provide a more Rust-idiomatic API, instead of making you call the FFI'ed C functions directly.)

[1] https://crates.io/crates/tree-sitter

[2] https://crates.io/crates/tree-sitter-python

[3] https://crates.io/crates/tree-sitter-javascript

[4] https://crates.io/crates/tree-sitter-ruby

> No one has tackled this yet, though, because in practice it has worked well enough to FFI in the C parser implementations.

The only issue I’ve come up with so far is a very distinct lack of ARM(64) support in precompiled binaries for FOSS-projects using tree-sitter.

One risks having to manually set up rust and build it yourself on your ARM-host (if your host is capable of that and your host-environment allows it), because currently there’s no easy way to cross-compile it, despite being largely rust-based:


It’s literally the only thing I have not working on in my otherwise very portable Emacs-setup when deployed on an ARM host, so for me it kinda stands out.

Using Nix and home-manager I'm able to have this setup with Emacs work across Linux and macOS (x86 and arm64) because it compiles the grammars from Nixpkgs itself (the grammar list can also be augmented declaratively with other grammars).

[0] https://github.com/siraben/dotfiles/blob/a0f49781ecb05693908...

That’s a bit above my head, but if you could contribute with a “from scratch” ARM64 cross-build recipe for tree-sitter, in the official repo, that would be great!

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact