
Even if tree-sitter had no bugs, it could never really be trusted, because there is always a divergence between the language's official parser and the tree-sitter grammar -- unless the language actually uses tree-sitter as its parser, which no popular language does, and it seems unlikely a future popular language will.

For syntax highlighting and code folding, what I said above is probably fine in general -- except that the grammar will be wrong from time to time, and then downstream users will waste hours of their lives trying to fix the broken highlighting instead of actually working on their real work.

On the implementation side, the whole thing is full of bizarre choices. It has created a culture in which checking in enormous generated files is common. It delegates common and important features of popular languages, like significant whitespace, to error-prone external scanners. The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation). It bizarrely uses Rust and Node. The JS DSL for grammars is needlessly verbose. The codegen is fairly slow in spite of using Rust, and the generated C files are insanely huge (probably a consequence of the GLR algorithm), which leads to horrific compile times, which in turn makes incremental development a nightmare. It also uses (unless it's been updated in the last six months) a terrible algorithm for Unicode lexing.

In my opinion, the whole premise of tree-sitter is kind of wrong. What would be better is a uniform DSL for describing an AST (which can easily be represented with C structs, or classes in any OO language). Trying to make a universal GLR parser for all languages is a bad idea. It would generally be much easier, more accurate, and faster to describe the AST in the DSL that external tools recognize and then handwrite the parser that produces that AST. In the best case, the compiler already exposes an API to get the AST for some source code, and you just need to write a translation program from the compiler's format to the DSL format. Worst case, you rip out the parser from the compiler and have it generate the DSL's AST format. Most handwritten parsers are only a few thousand lines of code (maybe 10k at the high end).
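To make that concrete, here is a rough sketch of what I mean (the names and layout are made up, purely illustrative): the DSL would describe node kinds and their fields, and the natural representation on the C side is just tagged structs that a handwritten parser fills in.

    /* Hypothetical AST shape an AST-description DSL could map to. */
    #include <stddef.h>

    typedef enum {
        NODE_BINARY_EXPR,
        NODE_IDENTIFIER,
        NODE_NUMBER_LITERAL,
    } NodeKind;

    typedef struct Node Node;
    struct Node {
        NodeKind kind;
        size_t start_byte;  /* source span, so tools can map nodes back to text */
        size_t end_byte;
        union {
            struct { Node *left; Node *right; char op; } binary_expr;
            struct { const char *name; } identifier;
            struct { double value; } number_literal;
        } data;
    };

    /* A handwritten recursive-descent parser then just returns these,
       e.g. Node *parse_expression(Parser *p); */

External tools would only need to understand the DSL's description of these shapes, not the parser that produces them.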




> It has created a culture in which checking in enormous generated files is common.

You're not required to do that, and never have been. We're going to move the grammars away from checking the generated `parser.c` in, but early in the project, it was a pretty pragmatic solution, given that many consumers of the grammars weren't using a particular package registry like NPM or crates.io.

> It delegates common and important features of popular languages, like significant whitespace, to error-prone external scanners.

External scanners are the right tool for this. Nobody has proposed a better solution that actually works, and I don't think there is one, because every programming language with significant whitespace implements it differently.
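For anyone who hasn't looked at one: an external scanner is just a small piece of C that the generated parser calls for the things a context-free grammar can't express. Here is a heavily simplified sketch of an indentation scanner for a hypothetical language called "mylang" (a real one, like the Python grammar's, also has to deal with EOF dedents, tabs, comments inside blocks, and so on):

    // Heavily simplified indent/dedent scanner sketch, not production code.
    #include "tree_sitter/parser.h"
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    enum TokenType { INDENT, DEDENT, NEWLINE };

    typedef struct {
        uint32_t levels[32];  // stack of indentation widths
        uint32_t count;
    } Scanner;

    void *tree_sitter_mylang_external_scanner_create(void) {
        Scanner *s = calloc(1, sizeof(Scanner));
        s->count = 1;  // implicit indentation level 0
        return s;
    }

    void tree_sitter_mylang_external_scanner_destroy(void *p) { free(p); }

    unsigned tree_sitter_mylang_external_scanner_serialize(void *p, char *buf) {
        Scanner *s = p;
        unsigned n = s->count * sizeof(uint32_t);
        memcpy(buf, s->levels, n);
        return n;
    }

    void tree_sitter_mylang_external_scanner_deserialize(void *p, const char *buf, unsigned n) {
        Scanner *s = p;
        s->count = n / sizeof(uint32_t);
        if (s->count == 0) { s->count = 1; s->levels[0] = 0; }
        else memcpy(s->levels, buf, n);
    }

    bool tree_sitter_mylang_external_scanner_scan(void *payload, TSLexer *lexer,
                                                  const bool *valid_symbols) {
        Scanner *s = payload;
        if (lexer->lookahead != '\n') return false;
        lexer->advance(lexer, false);            // consume the newline
        uint32_t indent = 0;
        while (lexer->lookahead == ' ') {        // measure leading spaces
            indent++;
            lexer->advance(lexer, true);
        }
        uint32_t current = s->levels[s->count - 1];
        if (indent > current && valid_symbols[INDENT] && s->count < 32) {
            s->levels[s->count++] = indent;
            lexer->result_symbol = INDENT;
            return true;
        }
        if (indent < current && valid_symbols[DEDENT]) {
            s->count--;
            lexer->result_symbol = DEDENT;
            return true;
        }
        if (valid_symbols[NEWLINE]) {
            lexer->result_symbol = NEWLINE;
            return true;
        }
        return false;
    }

The point is that the language-specific policy (what counts as a level, how tabs are treated, what happens inside brackets) lives in this code, which is exactly why it can't be baked into the generator itself.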

> The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation)

It sounds like you personally have had trouble debugging a grammar you've tried to develop, but the GLR algorithm solves a real problem.

> It bizarrely uses Rust and Node.

Tree-sitter isn't very coupled to Node. We just shell out to a JS engine because we use JS (the most popular programming language, period) as the grammar DSL, rather than inventing our own language; Node just happens to be the most common engine.

> the generated C files are insanely huge (probably a consequence of the GLR algorithm)

No, it's not because of GLR. A lot of the reason for the C code size is that we want to generate nice, readable, multi-line C expressions to represent data structures like parse tables, rather than encoding them in some cryptic, unreadable way. Also, incremental parsing requires a bit more data in the parse table than a batch-only parser would need. What really matters is the code size of the final binaries. The generated `.so` and `.wasm` files for a Python parser are 503k and 465k, respectively; the wasm gzips down to 69k. I would love to make it smaller, but there are pretty good reasons why it occupies the size that it does, and it's currently pretty manageable.
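For a sense of what "readable, multi-line" means in practice, the emitted tables are in the spirit of the snippet below (simplified, with made-up symbol names, and not the literal output of the codegen): each parse state gets its own designated-initializer block, which is easy to read and diff but adds up to a lot of lines of C.

    // Simplified illustration of the table style only; not the actual generated code.
    #include <stdint.h>

    #define STATE_COUNT 3
    #define SYMBOL_COUNT 8
    #define STATE(id) id
    enum { sym_identifier = 1, sym_number, sym_expression, anon_sym_LPAREN };

    static const uint16_t parse_table[STATE_COUNT][SYMBOL_COUNT] = {
      [1] = {
        [sym_identifier]  = STATE(12),
        [sym_number]      = STATE(13),
        [anon_sym_LPAREN] = STATE(4),
      },
      [2] = {
        [sym_expression] = STATE(7),
      },
      // ...one block like this per parse state, which adds up quickly
    };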

> It also uses (unless it's been updated in the last six months) a terrible algorithm for Unicode lexing.

I'm not sure what you're referring to here. UTF-8 and UTF-16 are decoded into code points using standard routines from `libicu`.


I don't understand how you can say the code size has nothing to do with GLR and then explain that it is due to the formatting of the parse tables. The point is that table-driven parsers are almost always huge compared to their recursive-descent alternatives (and, yes, recursive descent has its own problems). The code size of the final binary is not all that matters, because the huge C files cause insanely slow compile times that make development incredibly painful. So that statement only holds if the time of grammar writers doesn't matter.

I was too vague on Unicode. I meant Unicode character class matching. Last I checked, tree-sitter seemed to use a binary search on the code point to check whether a character was in a Unicode character class like \p{Ll}, rather than a table lookup. This led to a noticeable constant-factor slowdown at runtime (roughly 3x, if I recall correctly) for the grammar I was working on, compared to an ASCII set like [a-z].
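For what it's worth, the difference I'm describing looks roughly like this (hypothetical names and a truncated range table, not tree-sitter's actual code): the binary-search version does a handful of comparisons for every character, while the table version answers the common ASCII case with a single array index.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical excerpt of the code point ranges making up a class like \p{Ll}. */
    typedef struct { uint32_t lo, hi; } Range;
    static const Range LOWER_RANGES[] = {
        {0x61, 0x7A}, {0xB5, 0xB5}, {0xDF, 0xF6}, {0xF8, 0xFF}, /* ... */
    };
    #define N_RANGES (sizeof LOWER_RANGES / sizeof LOWER_RANGES[0])

    /* Binary search over sorted ranges: O(log n) comparisons per character. */
    static bool is_lower_bsearch(uint32_t cp) {
        size_t lo = 0, hi = N_RANGES;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (cp < LOWER_RANGES[mid].lo)      hi = mid;
            else if (cp > LOWER_RANGES[mid].hi) lo = mid + 1;
            else return true;
        }
        return false;
    }

    /* Table lookup for the hot ASCII path: one memory access. */
    static bool IS_LOWER_ASCII[128];
    static void init_lower_ascii(void) {
        for (int c = 'a'; c <= 'z'; c++) IS_LOWER_ASCII[c] = true;
    }
    static bool is_lower_fast(uint32_t cp) {
        if (cp < 128) return IS_LOWER_ASCII[cp];
        return is_lower_bsearch(cp);  /* fall back for the rare non-ASCII case */
    }

The point is just that a per-character binary search pays its cost even for plain ASCII identifiers, which is where the constant-factor slowdown I saw came from.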


Thanks! That's super helpful!



