
Even if tree-sitter had no bugs, it could never really be trusted, because there is always a divergence between the language's official parser and the tree-sitter grammar -- unless the language actually uses tree-sitter as its parser, which no popular language does, and it seems unlikely a future popular language will.

For syntax highlighting and code folding, what I said above is probably fine in general -- except that the grammar will be wrong from time to time, and then downstream users will waste hours of their lives trying to fix the broken highlighting instead of actually working on their real work.

On the implementation side, the whole thing is full of bizarre choices. It has created a culture in which checking in enormous generated files is common. It delegates common and important features of popular languages, like significant whitespace, to error-prone external scanners. The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation). It bizarrely uses Rust and Node. The JS DSL for grammars is needlessly verbose. The codegen is fairly slow in spite of using Rust, and the generated C files are insanely huge (probably a consequence of the GLR algorithm), which leads to horrific compile times, which in turn makes incremental development a nightmare. It also uses (unless it's been updated in the last six months) a terrible algorithm for Unicode lexing.

In my opinion, the whole premise of tree-sitter is kind of wrong. What would be better is a uniform DSL for describing an AST (which can easily be represented with C structs, or classes in any OO language). Trying to make a universal GLR parser for all languages is a bad idea. It would generally be much easier, more accurate, and faster to describe the AST in the DSL that external tools recognize and then handwrite the parser that produces that AST. In the best case, the compiler already exposes an API to get the AST for some source code, and you just need to write a translation program from the compiler's format to the DSL format. Worst case, you rip out the parser from the compiler and have it generate the DSL's AST format. Most handwritten parsers are only a few thousand lines of code (maybe 10k at the high end).
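To make that concrete, here is a rough sketch of what I mean (the names and layout are made up, purely illustrative): the DSL would describe node kinds and their fields, and the natural representation on the C side is just tagged structs that a handwritten parser fills in.

    /* Hypothetical AST shape an AST-description DSL could map to. */
    #include <stddef.h>

    typedef enum {
        NODE_BINARY_EXPR,
        NODE_IDENTIFIER,
        NODE_NUMBER_LITERAL,
    } NodeKind;

    typedef struct Node Node;
    struct Node {
        NodeKind kind;
        size_t start_byte;  /* source span, so tools can map nodes back to text */
        size_t end_byte;
        union {
            struct { Node *left; Node *right; char op; } binary_expr;
            struct { const char *name; } identifier;
            struct { double value; } number_literal;
        } data;
    };

    /* A handwritten recursive-descent parser then just returns these,
       e.g. Node *parse_expression(Parser *p); */

External tools would only need to understand the DSL's description of these shapes, not the parser that produces them.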




> It has created a culture in which checking in enormous generated files is common.

You're not required to do that, and never have been. We're going to move the grammars away from checking the generated `parser.c` in, but early in the project, it was a pretty pragmatic solution, given that many consumers of the grammars weren't using a particular package registry like NPM or crates.io.

> It delegates common and important features of popular languages, like significant whitespace, to error-prone external scanners.

External scanners are the right tool for this. Nobody has proposed a better solution that actually works, and I don't think there is one, because every programming language with significant whitespace implements it differently.
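For anyone who hasn't looked at one: an external scanner is just a small piece of C that the generated parser calls for the things a context-free grammar can't express. Here is a heavily simplified sketch of an indentation scanner for a hypothetical language called "mylang" (a real one, like the Python grammar's, also has to deal with EOF dedents, tabs, comments inside blocks, and so on):

    // Heavily simplified indent/dedent scanner sketch, not production code.
    #include "tree_sitter/parser.h"
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    enum TokenType { INDENT, DEDENT, NEWLINE };

    typedef struct {
        uint32_t levels[32];  // stack of indentation widths
        uint32_t count;
    } Scanner;

    void *tree_sitter_mylang_external_scanner_create(void) {
        Scanner *s = calloc(1, sizeof(Scanner));
        s->count = 1;  // implicit indentation level 0
        return s;
    }

    void tree_sitter_mylang_external_scanner_destroy(void *p) { free(p); }

    unsigned tree_sitter_mylang_external_scanner_serialize(void *p, char *buf) {
        Scanner *s = p;
        unsigned n = s->count * sizeof(uint32_t);
        memcpy(buf, s->levels, n);
        return n;
    }

    void tree_sitter_mylang_external_scanner_deserialize(void *p, const char *buf, unsigned n) {
        Scanner *s = p;
        s->count = n / sizeof(uint32_t);
        if (s->count == 0) { s->count = 1; s->levels[0] = 0; }
        else memcpy(s->levels, buf, n);
    }

    bool tree_sitter_mylang_external_scanner_scan(void *payload, TSLexer *lexer,
                                                  const bool *valid_symbols) {
        Scanner *s = payload;
        if (lexer->lookahead != '\n') return false;
        lexer->advance(lexer, false);            // consume the newline
        uint32_t indent = 0;
        while (lexer->lookahead == ' ') {        // measure leading spaces
            indent++;
            lexer->advance(lexer, true);
        }
        uint32_t current = s->levels[s->count - 1];
        if (indent > current && valid_symbols[INDENT] && s->count < 32) {
            s->levels[s->count++] = indent;
            lexer->result_symbol = INDENT;
            return true;
        }
        if (indent < current && valid_symbols[DEDENT]) {
            s->count--;
            lexer->result_symbol = DEDENT;
            return true;
        }
        if (valid_symbols[NEWLINE]) {
            lexer->result_symbol = NEWLINE;
            return true;
        }
        return false;
    }

The point is that the language-specific policy (what counts as a level, how tabs are treated, what happens inside brackets) lives in this code, which is exactly why it can't be baked into the generator itself.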

> The GLR algorithm sometimes fails inexplicably and is essentially impossible to debug (not sure if the bugs are in the algorithm or the implementation)

It sounds like you personally have had trouble debugging a grammar you've tried to develop, but the GLR algorithm solves a real problem.

> It bizarrely uses Rust and Node.

Tree-sitter isn't very coupled to Node. We just shell out to a JS engine because we use JS (the most popular programming language, period) as the grammar DSL, rather than inventing our own language; Node just happens to be the most common engine.

> the generated C files are insanely huge (probably a consequence of the GLR algorithm)

No, it's not because of GLR. A lot of the reason for the C code size is that we want to generate nice, readable, multi-line C expressions to represent data structures like parse tables, rather than encoding them in some cryptic, unreadable way. Also, incremental parsing requires a bit more data in the parse table than a batch-only parser would need. What really matters is the code size of the final binaries. The generated `.so` and `.wasm` files for a Python parser are 503k and 465k, respectively; the wasm gzips down to 69k. I would love to make it smaller, but there are pretty good reasons why it occupies the size that it does, and it's currently pretty manageable.
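For a sense of what "readable, multi-line" means in practice, the emitted tables are in the spirit of the snippet below (simplified, with made-up symbol names, and not the literal output of the codegen): each parse state gets its own designated-initializer block, which is easy to read and diff but adds up to a lot of lines of C.

    // Simplified illustration of the table style only; not the actual generated code.
    #include <stdint.h>

    #define STATE_COUNT 3
    #define SYMBOL_COUNT 8
    #define STATE(id) id
    enum { sym_identifier = 1, sym_number, sym_expression, anon_sym_LPAREN };

    static const uint16_t parse_table[STATE_COUNT][SYMBOL_COUNT] = {
      [1] = {
        [sym_identifier]  = STATE(12),
        [sym_number]      = STATE(13),
        [anon_sym_LPAREN] = STATE(4),
      },
      [2] = {
        [sym_expression] = STATE(7),
      },
      // ...one block like this per parse state, which adds up quickly
    };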

> It also uses (unless it's been updated in the last six months) a terrible algorithm for Unicode lexing.

I'm not sure what you're referring to here. UTF-8 and UTF-16 are decoded into code points using standard routines from `libicu`.


I don't understand how you can say the code size has nothing to do with GLR and then explain that it is due to the formatting of the parse tables. The point is that table-driven parsers are almost always huge compared to their recursive-descent alternatives (and, yes, recursive descent has its own problems). The code size of the final binary is not all that matters, because the huge C files cause insanely slow compile times that make development incredibly painful. So that statement only holds if the time of grammar writers doesn't matter.

I was too vague on Unicode. I meant Unicode character class matching. Last I checked, tree-sitter seemed to use a binary search on the code point to check whether a character was in a Unicode character class like \p{Ll}, rather than a table lookup. This led to a noticeable constant-factor slowdown at runtime (roughly 3x, if I recall correctly) for the grammar I was working on, compared to an ASCII set like [a-z].
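For what it's worth, the difference I'm describing looks roughly like this (hypothetical names and a truncated range table, not tree-sitter's actual code): the binary-search version does a handful of comparisons for every character, while the table version answers the common ASCII case with a single array index.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical excerpt of the code point ranges making up a class like \p{Ll}. */
    typedef struct { uint32_t lo, hi; } Range;
    static const Range LOWER_RANGES[] = {
        {0x61, 0x7A}, {0xB5, 0xB5}, {0xDF, 0xF6}, {0xF8, 0xFF}, /* ... */
    };
    #define N_RANGES (sizeof LOWER_RANGES / sizeof LOWER_RANGES[0])

    /* Binary search over sorted ranges: O(log n) comparisons per character. */
    static bool is_lower_bsearch(uint32_t cp) {
        size_t lo = 0, hi = N_RANGES;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (cp < LOWER_RANGES[mid].lo)      hi = mid;
            else if (cp > LOWER_RANGES[mid].hi) lo = mid + 1;
            else return true;
        }
        return false;
    }

    /* Table lookup for the hot ASCII path: one memory access. */
    static bool IS_LOWER_ASCII[128];
    static void init_lower_ascii(void) {
        for (int c = 'a'; c <= 'z'; c++) IS_LOWER_ASCII[c] = true;
    }
    static bool is_lower_fast(uint32_t cp) {
        if (cp < 128) return IS_LOWER_ASCII[cp];
        return is_lower_bsearch(cp);  /* fall back for the rare non-ASCII case */
    }

The point is just that a per-character binary search pays its cost even for plain ASCII identifiers, which is where the constant-factor slowdown I saw came from.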


Thanks! That's super helpful!



