Emacs: Feature/tree-sitter merged into master (lists.gnu.org)
305 points by signa11 on Nov 23, 2022 | 89 comments



For those unfamiliar with it, tree-sitter (https://emacs-tree-sitter.github.io/) aims to be a foundational package that understands code structurally (think abstract syntax trees). Previously this was done with regexes, which have their limitations.

This talk: https://www.thestrangeloop.com/2018/tree-sitter---a-new-pars... by the author is quite instructive as well.


This HN post though is about a (new) core tree-sitter implementation in Emacs itself, which is not the same as the third party package[1] you linked. To give credit where credit is due though, it was obviously inspired by this work and what it allowed in community-maintained packages.

The new implementation has been authored by Yuan Fu in close collaboration with the core Emacs maintainers and the rest of the community. It has been an ongoing effort for many, many months.

This is great news. It means the language modes shipped as part of Emacs itself will now be able to use tree-sitter based parsers, which would not have happened if they had to depend on a third-party package for those bindings.

I've been somewhat involved in the process, although not as a major player, but needless to say I'm very excited about this news and can't wait to see what sort of improvements it enables across the board once people start using it.

[1] https://github.com/emacs-tree-sitter/elisp-tree-sitter


So we still have to wait for each major mode's maintainers to update their code in order to benefit from these changes? In that case, how big would the change be for a "typical" mode? Is it going to happen for C/Python/TypeScript/etc. anytime soon?


If you follow the emacs-devel mailing list, you will see that many of the built-in modes are adding support for tree-sitter to various degrees. Lots of languages are already on the list (C, Python, JavaScript, Bash, JSON and CSS).

It also includes some new language modes that have never been part of Emacs before (like TypeScript).

I'd love to see C# on the list, but that might depend on me having the time to land a production-grade major mode, so that might end up happening later rather than sooner.

Anyway, from what I understand what has been merged so far should all be available as part of Emacs 29 once released.


One thing I'll add, because I think it's an interesting insight about the priorities of code parsing for text editors: Tree-sitter is specifically designed to be very effective at parsing code that's in an invalid state. E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.


> E.g., think about adding a new line to a program, the new line you're adding is typically invalid for the majority of time until you've finished typing it out.

That doesn't really bother me. The line of code you're currently typing is generally so invalid that there's not much point trying to color it.

But what does bother me is the code below where you're typing changing its color as the IDE tries to make your partial line fit.


The point is trying to make sure that the parser doesn't choke on the entire file just because one line is incorrect.


That and very fast incremental parsing as you type.

Slow parsing is why emacs used regexps for syntax highlighting instead of parsers in the first place.


Unfortunately this is a really hard problem, and I’ve used projects with tree-sitter and it actually chokes up on invalid syntax lines.

Granted, maybe the projects weren’t using tree-sitter correctly. But regex parsing is surprisingly practical, so despite being really amazing at its goals, tree-sitter may not have a definitive advantage.


Is there a chance that this is going to make the parsing of large org mode files faster?


It's not related to tree-sitter, but recent work on using text properties instead of overlays for folded regions in org has improved performance opening org files with folded regions from O(n^2) to O(n log n). See https://blog.tecosaur.com/tmio/2022-05-31-folding.html. It's a big improvement in practice.


Sounds promising, thanks. The most annoyingly slow thing is the agenda buffer which takes ages to build. (Have to admit that the files from which it is sourcing entries are humongous after 15 years of managing my life with org mode.)


Incremental parsing of incorrect code is one of those things that is literally impossible in the general case, but tree-sitter has found a lot of good ways to do it that are not just possible for a large fraction of reality, but also performant. It's hard to understate how impressive a piece of engineering this is.


I think you mean "hard to overstate" :)


Indeed. Too late to edit now though.


If you're wondering what Tree-Sitter is and why Emacs would want it, I wrote about it a while ago:

https://www.masteringemacs.org/article/tree-sitter-complicat...


There is not an Emacs topic I'd like to know more about that you haven't already covered on your website.

Thanks, your articles and your book are the best guides into the world of Emacs.


Thank you :) I'm glad you like my site and my book!


That point you make about syntax highlighting being slow while using eglot/LSP-mode is a great one. I've been a bit underwhelmed with eglot, and I think that must be the reason: it feels like I'm programming in a bowl of oatmeal with every keystroke.

Do you have any tips or guides for using treesitter for syntax highlighting/structural editing and eglot/LSP-mode for everything else?


AFAIK eglot/lsp-mode don't do syntax highlighting. The article's just explaining why that is (i.e. because it would be too slow).

If you don't have tree-sitter your syntax highlighting will be done by the regex based font-lock-mode. I don't think eglot/lsp-mode make that slower, and I believe tree-sitter should speed it up (and make it more correct) without affecting them. I haven't tried it yet, though.


It must be a matter of configuration. At work, I use buffer re-fontification as an indicator that clangd correctly processed the C++ source file I just opened. That's with an Emacs built from source ~half a year ago + LSP mode.


Oh, interesting, it does appear lsp-mode now does "semantic highlighting" if the server supports it. I switched to eglot a while back (before it was added), which doesn't.

I don't think I'd want that. Syntax highlighting and indentation are things I want instant feedback from.

That affects the answer to the question. I assume you'd need to persuade lsp-mode not to do this and leave it to tree-sitter, but I don't know how to do that.


Last I checked, "semantic tokens" were opt-in; I think it's still the case:

https://emacs-lsp.github.io/lsp-mode/page/settings/semantic-...


The tree-sitter Emacs package provides tree-sitter-hl-mode.

What lang do you use?


I'm really impressed with the strides Emacs has made recently: native compilation, project.el, eglot, and now tree-sitter?

As a user who hadn't kept up with development news until recently, I'd always mentally sorted Emacs into the same taxonomy as stuff like `find`: old, powerful, with a clunky interface and a stodgy resistance to updating how it does things (though not without reason).

I'm increasingly feeling like that's an unfair classification on my part--I'm genuinely super excited to see where Emacs is in 5 years.


Yes, it feels like there has been a lot of momentum recently.

Both neovim and Emacs are being improved at breakneck pace, and it is quite incredible for such an old piece of software with, dare I say, a quirky contribution model. The maintainers are working really hard on keeping it current and competitive.


I'm really hoping that Emacs becomes multithreaded somehow. Or at least improves some operations so that they're non-blocking.

I've been using Emacs primarily for org-mode/roam/babel for a few years now. I'm very glad for its existence, I really think I've become a more effective DevOps person because of it.


I'll be entirely satisfied with a process/event queue/loop that we can submit tasks to, like JavaScript's. There is already a command loop in Emacs; we just can't use it for anything other than input events and commands. Once we have a good event loop, we can build a state machine like Redux on it, then we can start rebuilding the display machinery, then we can start deleting all those hooks that constantly interfere with each other...


It already has a process/event queue like JS since Emacs 26. What did you mean in particular?


What did YOU mean in particular?


I meant exactly that, both core JS and ELisp share the same model of a global interpreter where blocks of code can be interleaved and yielded (both manually and in response to external events). I wanted to know what in particular about JS you were missing.


You have been able to emulate cooperative multitasking using callbacks since at least the 70s with Lisp. There's always been some form of concurrency in any Lisp. The problem is that the event listener/emitter pattern is pretty much not a thing in Emacs, so you can't fire a custom event and hope it will drop into the event loop and get responded to by whatever is listening, because it's not an event loop, it's a command loop. The types of events you can respond to are fixed.
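
To make the listener/emitter pattern concrete, here is a minimal sketch in Python rather than Emacs Lisp. It is a toy, not anything Emacs provides: `EventLoop`, `on`, `emit` and the event names are all hypothetical, chosen only to show custom events being queued and dispatched to whoever subscribed.

```python
from collections import defaultdict, deque


class EventLoop:
    """Toy event loop: handlers subscribe to named events; emit() queues them."""

    def __init__(self):
        self.handlers = defaultdict(list)  # event name -> list of callbacks
        self.queue = deque()               # pending (name, payload) pairs

    def on(self, name, handler):
        """Register a callback for a custom event name."""
        self.handlers[name].append(handler)

    def emit(self, name, payload=None):
        """Queue an event; nothing runs until the loop reaches it."""
        self.queue.append((name, payload))

    def run(self):
        """Drain the queue; handlers may emit further events while running."""
        while self.queue:
            name, payload = self.queue.popleft()
            for handler in self.handlers[name]:
                handler(payload)


loop = EventLoop()
log = []
loop.on("saved", lambda path: log.append(f"reindex {path}"))
loop.on("saved", lambda path: loop.emit("indexed", path))
loop.on("indexed", lambda path: log.append(f"refresh UI for {path}"))

loop.emit("saved", "notes.org")
loop.run()
print(log)
```

The point of the sketch is that the set of event names is open: any code can invent "saved" or "indexed" and any other code can listen for it, which is what a fixed set of command-loop events does not allow.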


I don't think it needs to become multithreaded, it just needs better support for async/event loop style concurrency!

Right now we can run subprocesses without blocking anything with "make-process", but interacting with the process is pretty clunky, and you have to use the process sentinel and filter to perform callbacks when the process changes state or exits. There is quite a lot of boilerplate to set up for all of this and the control flow is pretty confusing IMO.

A nice "async/await" style interface to these things could really go a long way I think!
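
To illustrate the kind of interface being asked for, here is a minimal sketch in Python rather than Emacs Lisp, using Python's asyncio as a stand-in. It is not an Emacs API; `run_and_capture` and `main` are hypothetical names chosen for illustration, and the child command is just the Python interpreter printing a word.

```python
import asyncio
import sys


async def run_and_capture(*argv):
    """Start a subprocess and wait for its output without blocking the event loop."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()  # suspends only this coroutine
    return proc.returncode, stdout.decode().strip()


async def main():
    # Two children run concurrently; the control flow still reads top to bottom,
    # with no separate "on output" and "on exit" callbacks to wire up.
    child = [sys.executable, "-c", "print('done')"]
    return await asyncio.gather(run_and_capture(*child), run_and_capture(*child))


results = asyncio.run(main())
print(results)
```

Compared with a sentinel plus a filter, the exit status and the collected output arrive as an ordinary return value at the point where the caller awaited them.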


Indeed, I'm using Emacs for Code, reading/writing documents and emails, as well as consuming RSS feeds. The ecosystem and values that underpin Emacs are fantastic - in my personal case the only downside to heavy use of Emacs is that it can struggle to utilise my hardware. This tends to be particularly noticeable when using TRAMP and Eglot, or producing large org tables.


Running Emacs in tmux over SSH can be faster than TRAMP. TRAMP gets very buggy when you have a lot of third-party elisp extensions.



I probably didn't use the right terminology. I mean that if I list-packages then U, then x to start updating, I should be able to go back to my editor and continue working.


There was a package[1] that did exactly that, so it should be technically possible, unfortunately it has been unmaintained for a while. In any case I/O asynchronicity is achievable without actual multithreading (there are also IRC/telegram/matrix/mastodon clients that don't freeze the UI).

[1] https://github.com/Malabarba/paradox


I think a lot of packages are not yet using threads. And to be honest, I'm a bit scared of packages starting to use threads because there are a million ways in which you can mess up with threads especially given Emacs’ architecture. What if two threads start manipulating the same buffer? Emacs wasn’t built with these scenarios in mind. But perhaps I'm too pessimistic and there are good answers for that.


I was sad the day I saw Emacs implemented threads before a proper async event loop / futures / etc. Do those first, see what kinds of concurrent code people actually want to write, then write a multithreaded scheduler for that.

Instead it’s backwards, now we have hard-to-use concurrency primitives and still shitty UIs.


I want to see good interactive tools for working with and introspecting threads / async / other concurrency models first. In general, because I don't know of any, and in Emacs in particular.

My current experience with Emacs concurrency is mostly negative - occasionally, an async-heavy package (like e.g. Magit-style UI for Docker) will break, and I find it hard to figure out why. Futures-heavy code I've seen tends to keep critical data local (lexically let-bound), which is the opposite of what you want in a malleable system like Emacs. For example, I'd like to have a way to list all unresolved futures everywhere in Emacs, the way I can with e.g. external processes. But it seems that at least the async library used (aio, IIRC) is not designed for that.


> For example, I'd like to have a way to list all unresolved futures everywhere in Emacs, the way I can with e.g. external processes.

I think you could get this done by advising promise creation/resolution functions, aio-promise and aio-resolve. The async/await macros are wrappers around generators-over-promises in this library.

But yes, in general Emacs concurrency sucks. The least bad option I found was a promise implementation (chuntaro/emacs-promise) that uses `cl-defgeneric` for `promise-then`, plus (obviously) moving as much processing to a subprocess as possible. The former allows you to make any type "thenable" by implementing the method for it, which is nice for bundling the state around async operations. cl-defstructs are nice for the purpose.


Emacs threads are not parallel and are co-operative rather than preemptive. The co-operative parts mean that you can guarantee atomic updates very easily when updating global variables or buffers so you don't run into some of the more nasty issues that can happen with preemptive threading.
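
A minimal sketch of why cooperative scheduling makes this easy, written in Python with generators standing in for threads. This is an analogy, not Emacs's thread implementation; `worker`, `run_all` and `counter` are hypothetical names.

```python
from collections import deque

counter = {"value": 0}  # shared state, standing in for a global variable


def worker(steps):
    """Cooperative task: runs until it explicitly yields control."""
    for _ in range(steps):
        # Read-modify-write with no yield in between: no other task can
        # run here, so the update cannot be interleaved with another one.
        current = counter["value"]
        counter["value"] = current + 1
        yield  # the only point where a task switch can occur


def run_all(tasks):
    """Round-robin scheduler: resume each task until all are exhausted."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
            ready.append(task)
        except StopIteration:
            pass  # task finished; drop it


run_all([worker(1000), worker(1000), worker(1000)])
print(counter["value"])  # 3000: no increments are lost
```

With preemptive threads the same read-modify-write would need a lock, because a switch could happen between the read and the write.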


Like the way Python has threads lol. Emacs has generators too, and there are promises implemented on top of them, but they aren't very useful in the elisp ecosystem because at some point you are still going to have to poll due to the lack of a JS-like event loop that users can submit tasks to.


Yeah the extra micro-waits introduced by some IDE-like features were annoying last time I used it.



This is excellent!


I have the same feeling.

There is one more, possibly gigantic, thing though: Better handling of very long strings. I know the data structures for strings have various tradeoffs, but properly abstracted, it should be possible to even give a choice, no? So users could choose the data structure, based on their use cases. But I know little about the internals and maybe that is all too low level to be something a user could choose from the user interface or configuration.

I hope the string data structure is properly abstracted, so that it is exchangeable for another data structure, but I have my doubts. I would like to be surprised here by anyone credibly telling me that the string data structure in Emacs has an abstraction barrier around it and is actually exchangeable, by implementing basic string functions like "get nth character" or "get substring" in terms of another data structure.

If it is not properly abstracted, then of course it could be a nightmare to change the data structure.


This was also something that was enhanced recently and will be in Emacs 29: https://github.com/emacs-mirror/emacs/blob/21b387c39bd9cf07c...

> Emacs is now capable of editing files with very long lines.

> The display of long lines has been optimized, and Emacs should no

> longer choke when a buffer on display contains long lines.

> ...


So looking forward to Emacs 29 being available in standard package repositories!


Sweet! When did that merge?


I believe it was over the summer.


I use IntelliJ products but still prefer Emacs as an editor. I moved off it for code because of IDE features: even when I managed to get some of that convenience in Emacs, it ran synchronously, which meant the experience could be pretty laggy at times, versus "at worst the popup with extra info will be delayed" in IDEA.


Check out the emacs-devel@gnu.org list sometime. It's incredibly well run and is in my opinion the secret sauce that keeps the project running.


I have a huge belief in tree-sitter. I think it's going to continue to grow and become an important tool, especially in security/code tooling contexts.


The main innovation of tree-sitter, even more than incremental parsing, as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing. To me, this is problematic for tooling because it is always possible for a tree-sitter based tool to be flat out wrong relative to the actual language. For syntax highlighting, this is generally not a huge deal (and tree-sitter does generally work well, though there are exceptions), but I'd be more cautious with security tools based on tree-sitter.

If all languages changed their reference parsers to tree-sitter, this would be moot, but that seems unlikely. Language parsers are often optimized beyond what is possible in a general purpose parser generator like tree-sitter and/or have ambiguities that cannot be resolved with the tree-sitter dsl.

What feels perhaps likely in the future is that a standard parse tree api emerges, analogous to lsp, and then language parsers could emit trees traversable by this api. Maybe it's just the tree-sitter c api with an alternate front end? Hard to say, but I suspect either something better than (but likely at least partially inspired by) tree-sitter will emerge or we will get stuck in a local minimum with tooling based on slightly incorrect language parsers.


> as I see it is that it provides a uniform api for traversing a parse tree, which makes it relatively straightforward to onboard a new language to a tool with tree-sitter support. The problem though is that the tree-sitter grammar is nearly always going to be an approximation to the actual language grammar, unless the language compiler/interpreter uses tree-sitter for parsing.

Author of DiffLens (https://marketplace.visualstudio.com/items?itemName=DiffLens...) here. A uniform API for traversing a parse tree for all languages would be amazing for DiffLens! However, I fear languages are different enough that this ideal may never be reached :) Or maybe there would be a core set of APIs and extensions for the idiosyncrasies of each language. For DiffLens though, we try to use the language's official parser/compiler if it exposes an AST


I think your skepticism is justified, but there are two tools that I have seen that use tree-sitter for diffing:

https://github.com/wilfred/difftastic

https://github.com/afnanenayet/diffsitter

(I have not tried either personally.)


> unless the language compiler/interpreter uses tree-sitter for parsing

Doubtful. Last time I tried, tree-sitter would parse invalid inputs without even tagging any errors in the parse tree. For example, it would silently accept extra tokens, or keywords in the place of identifiers. Replacing the built-in lexer and then validating the parse tree for correctness would be close to writing the grammar twice.

And accepting partially correct inputs within the compiler toolchain isn't too hard, so I don't really see the advantage of agreeing on tree-sitter and not just on a parse tree representation that editors can then query, as you then suggested. If the big deal is having it execute client-side or being sandboxed, I feel that's orthogonal to parsing algorithms.


tree-sitter is a bit better than regexps, but it is not an actual parser of grammars. A fast, actual parser for every language is the future of syntax coloring, I think; tree-sitter is a pragmatic middle ground while we wait for the prime solution.


What's the "explain it like I'm 5 years old" (ELI5) for tree-sitter? Why should I, an emacs user but not lisp hacker, care about it?


tree-sitter creates parsers, e.g. for programming languages, config formats, etc.

Emacs modes can use those parsers on buffer contents, e.g. for syntax colouring/highlighting, finding matching delimiters (e.g. moving the cursor over an `if`, and having all the corresponding clauses (e.g. else/elif/fi) highlighted), for contextual editing (e.g. escaping " when inside a string), etc.

This can be remarkably tricky to get right; e.g. consider languages which can splice expressions inside strings (which can themselves contain strings, containing spliced expressions, etc.)

Using tree-sitter should make this easier and more robust (i.e. less time spent implementing parsers; more time spent implementing features!). I think it would also allow grammars to be re-used across different tools, which should improve support for obscure/niche languages.


Does this mean that every Emacs language package will automatically make use of this once it is built in? Or will it rather make it possible to write or rewrite programming language modes so they use tree-sitter, because they can assume it is available in the default Emacs install from then on?


It needs to be explicitly used. As far as I'm aware it doesn't slot in behind an existing API and magically make things better.


Got it. Are there any beginner guides yet on how to write an emacs (language) package while making use of it?


Unknown if this qualifies as "beginner guide" but the in-tree document is titled "STARTER GUIDE ON WRITING MAJOR MODE WITH TREE-SITTER": https://git.savannah.gnu.org/cgit/emacs.git/tree/admin/notes...


In hindsight I should have left out the "beginner" part. Thanks for pointing me in the right direction, exactly what I was looking for.


Another useful feature is that it makes it easier to support mixing languages in the same file.

Think highlighting for html/JS/CSS in a single file or fully featured highlighting inside markdown code snippets.


You know how emacs typically has the worst syntax highlighting of all mainstream editors for a given language? This makes it better.


Congratulations, Emacs! I hope it will be as much of a success story as it has been in Neovim. If more systems use it, the question "should my programming language provide a Tree-Sitter parser" becomes a no-brainer.


I use tree-sitter in neovim and the syntax highlighting is on par with VSCode


Interesting. Current syntax highlighting in emacs is mostly fine, except for how it occasionally blows up - an unterminated quote, in some languages, can run out and match the entire tail as a string, potentially freezing on a large file. Paredit(♡) avoids this by not even letting you do that unless you ask very nicely. I wonder if tree-sitter helps there.
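
A minimal sketch of that failure mode, in Python rather than Emacs Lisp. The pattern below is a deliberately naive stand-in, not the rule any Emacs mode actually uses, and `STRING` and `string_spans` are hypothetical names; it only shows how one missing quote can extend the "string" region to the end of the buffer.

```python
import re

# Naive "string literal" rule: an opening quote, then everything up to the
# next quote or, failing that, the end of the buffer.
STRING = re.compile(r'"[^"]*(?:"|\Z)')


def string_spans(source):
    """Return (start, end) offsets the rule would colour as strings."""
    return [m.span() for m in STRING.finditer(source)]


complete = 'x = "ok"\ny = 1\nz = 2\n'
unterminated = 'x = "ok\ny = 1\nz = 2\n'

print(string_spans(complete))      # only the literal itself
print(string_spans(unterminated))  # one span running to the end of the text
```

A parser with error recovery can instead confine the damage to the statement containing the stray quote, which is the behaviour people are hoping tree-sitter brings here.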


Maybe you and I are using different major modes, but I see the warts of regex highlighting in Emacs. It’s quite bad in AUCTeX.


Has anyone been able to replicate vscode/coc.nvim’s instant LSP autocompletion on each keystroke with Emacs when eglot/company are enabled? I tried configuring company to lower the idle delay to 0 and set the minimum prefix to 0, but the GUI locks up on every keystroke.


So if I have a fairly unremarkable setup with LSP to give me completions, what do I get by fooling around with tree-sitter? It seems like this is more geared toward building an AST, so I'm not sure how it would present itself to the end user currently.


You are correct. tree-sitter is not in competition with LSP; LSP is project-wide (different files), so it will do, say, code completion. tree-sitter analyzes the current buffer and handles things like highlighting, bracket matching, etc.

LSP can do some of these things as well, but sending the entire buffer over to the LSP server every time you want to update the buffer is an expensive operation. tree-sitter does it locally.


I think at this point it’s a new building primitive mostly aimed at major-mode authors.

That said, tree-sitter should make it possible to create paredit-like implementations for languages other than Lisp, and other stuff like that, which IMO could turn out to be really neat.

As a change, this is quite significant, but not directly aimed at end-users.



Faster and more correct syntax highlighting is the main benefit atm, as I understand it.

In general it's for functionality that needs to understand syntax but doesn't need a full compilation-level understanding of the code, that can benefit from much faster responses than an LSP server can provide and that people may want working out of the box for many languages without having to install and configure language servers, generate compilation DBs, etc.


Also: indentation, beginning-of-defun/end-of-defun navigation, and probably Imenu later.

All the basics of major modes that up until now have been implemented using bespoke algorithms, rewritten ad-hoc for each language.


For easier thread reading: Imenu is a way to jump around in a file. M-x imenu gives a menu of different types of locations (e.g. functions); you select a type, then select the specific function to jump to.


To save others from needing to research what this is:

tree-sitter is an Emacs binding for Tree-sitter, an incremental parsing system.

It aims to be the foundation for a new breed of Emacs packages that understand code structurally.


Awesome! Pretty soon vanilla emacs will be more than enough, won't need melpa packages.


I'm so impressed by this


But have they made Emacs multithreaded yet?


Have you seriously tried using Emacs yet? I always find these sorts of text editor comparisons pointless.

And anyway, emacs is hardly a "text editor". It's a text manipulation engine, likely the only one.


I have. Most long-running commands of any kind block until they are done.


What sort of long-running commands are you using?

There's an `async` call for system functions in elisp.



