The goal of the extension is great, as it allows developers to use WASI easily when creating new extensions. Mad props to the team that worked on it.
But it is unfortunate that they literally used the name of the most famous WebAssembly extension for VS Code (vscode-wasm), which has already been around for a few years, as it might confuse people about which is which:
> But it is unfortunate that they literally used the name of the most famous WebAssembly extension for VS Code (vscode-wasm), which has already been around for a few years, as it might confuse people about which is which:
At this point, when Microsoft frequently ends up naming something identically to something that already exists (and is sometimes quite popular), I'm done believing it's "unfortunate" and not on purpose. Once or twice, maybe it's some unfortunate accident, but at this frequency?
> WASI stands for WebAssembly System Interface. It's an API designed by the Wasmtime project that provides access to several operating-system-like features, including files and filesystems, Berkeley sockets, clocks, and random numbers, that we'll be proposing for standardization.
I'm not an expert, but here's the understanding I've picked up:
When you use JavaScript through Node, your JS gets the ability to open files and make network sockets and so on, because there are a set of APIs implemented by Node and exposed to your JS code, like so:
JS <-> native APIs <-> Node implementation of platform-specific operations
WASI applies to WebAssembly instead of JS. So again, if we're on Node:

Wasm <-> WASI APIs <-> Node implementation of platform-specific operations
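A toy sketch of that layering (not real WASI, and not this extension's code): the wasm module below is hand-assembled bytecode that imports one host function, `env.log` (a made-up name, not an actual WASI import), and the host decides what that call actually does, the same way Node's WASI layer decides what a guest's file or clock call actually does.

```javascript
// Toy sketch of the host/guest layering (not real WASI): the guest module
// imports `env.log` and calls it from an exported `run` function. The host
// supplies the implementation, just as a WASI host supplies the system calls.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // "\0asm" magic + version 1
  0x01, 0x04, 0x01, 0x60, 0x00, 0x00,                    // type section: () -> ()
  0x02, 0x0b, 0x01, 0x03, 0x65, 0x6e, 0x76,              // import section: module "env"
  0x03, 0x6c, 0x6f, 0x67, 0x00, 0x00,                    //   name "log", func of type 0
  0x03, 0x02, 0x01, 0x00,                                // function section: one func, type 0
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,  // export section: "run" = func 1
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x10, 0x00, 0x0b,        // code section: body is "call 0; end"
]);

let calls = 0;
// The "host layer": what Node (or a browser, or wasmtime) would provide.
const hostImports = { env: { log: () => { calls += 1; } } };

const instance = new WebAssembly.Instance(new WebAssembly.Module(bytes), hostImports);
instance.exports.run(); // guest code runs, and its "system" call lands in the host
```

Swap the body of `log` and the guest's behavior changes without recompiling it; WASI standardizes exactly this seam for files, sockets, clocks, and so on.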
Getting a FS going on the web has been such a brutal slog for me. There are some pretty decent existing projects like lightning-fs, but it feels like the web APIs have been changing & changing & I keep wanting something newer.
Very cool to see this nicely tackled. Still poking around, but it seems to integrate with WASI, which hopefully grows closer & closer to good, fast web APIs. There's been a ton of criticism, I think justly so, that neither the WASI nor the web FS teams have spent much time trying to figure out how to work together, & so there's real peril possible too, but I think WASI is a much better, much less hacked-out model & is probably the "right" real target. Having built-in object-capability sandbox security puts it lightyears beyond what the web spec folk have been up to, and the rest of its FS model is also much, much better built out.
Alas, it just means that running on the web, the target is going to be primarily in-memory for a long time, because the web spec folk seem so uninterested in doing a good actual filesystem impl. Origin Private File System seems to be the one place there's interest in being fast at all, and they seem overwhelmingly concerned with making SQLite in Emscripten fast & not a ton else. What we would do for real WASI on the web!
Already a very solid start building out coreutils. There are a bunch of similar-ish efforts out there already building JS impls of core utils, for example posix-tools: https://github.com/shama/posix-tools
> running on the web, the target is going to be primarily in-memory for a long time
What do you mean by this? FS implementations I've seen (including lightning-fs) use IndexedDB for the backing store, which is not "in-memory". I authored VS Code's web file system driver, and it also uses IndexedDB; I never saw evidence of it being memory-constrained. Furthermore, there's the File System Access API for the web, which is also not in-memory (I also authored VS Code's driver for that...)
Color me impressed. I have been aware of IndexedDB drivers, but I didn't investigate VS Code's driver (I looked briefly through the PR but was short on time). I don't know why I don't credit that route enough.
The File System Access API just feels way too limited. There's so much missing stuff. No symlinks, no hard links, no xattrs. I worry about what the user experience is going to be when the web API has so many limitations.
> There's so much missing stuff. No symlinks, no hard links, no xattr. I worry about what the user experience is going to be
Personally, I've never met a "normal" user who cares about any of those. You can try vscode.dev => explorer => "Open Folder" to see what the experience is like for yourself. I'd be interested to hear of the pain points, though I moved on from the project a while back.
> Alas, it just means that running on the web, the target is going to be primarily in-memory for a long time, because the web spec folk seem so uninterested in doing a good actual filesystem impl. Origin Private File System seems to be the one place there's interest in being fast at all, and they seem overwhelmingly concerned with making SQLite in Emscripten fast & not a ton else. What we would do for real WASI on the web!
Despite the constraints of the browser sandbox and the performance penalties, when you consider all the environments where people have actually used coreutils to great effect, it's undeniable that people were getting real work done in the '90s and early 2000s on strictly worse systems. If you tried to simulate many of those experiences entirely in-browser today, you would still wind up at an advantage (even with modest hardware, like, say, a 5-year-old Chromebook) compared to what people were actually getting by with at the time.
The biggest stopping force is really the premature fatalism that follows from the comparison to programs that run with the system linker/loader, have unmitigated access to the host, the freedom ("pleasure"?) of running memory-unsafe code, and so on. I'll further stipulate that programs written in such a way that they require any one of these things are fundamentally broken (at least for most values of x).
Can anyone explain what this is? All of the comments seem to be talking about different things, and when I look at the code, it looks to me like they're implementing a JS-based terminal?
Tree-sitter grammars take more effort than textmate ones for the most basic highlighting, but provide less accurate parse trees than a compiler could use for more advanced features and highlighting. I understand what they were going for but I don't "get" the trade-off.
At least as long as the parse trees are inherently worse: Even ignoring context-dependent grammars, last time I checked tree-sitter parsers had many edge cases in which they'd accept invalid syntax but not mark the parse trees as bad, which makes them unsuitable for compiler use.
Using compilers for highlighting basically doesn't happen because they are too slow. They are used for progressive improvement as a result (depending on how fast the compiler is, it can take at least 5-10 seconds post-update to get a result; for slower languages, minutes).
I'm not sure what "more effort" means: for lots of languages, it's less effort than TextMate, because people spend lots of time on hacks trying to get TextMate to highlight even slightly complex languages right. Its nesting mechanism is complicated, etc.
Tree-sitter is faster than TextMate on the initial parse in lots of cases, and updates a lot faster than TextMate (since tree-sitter parses can be incrementally updated and TextMate's can't).
They are not meant as compiler replacements.
Again, compilers are slow, and rarely incremental.
Using tree sitter for folding/highlighting/etc, and compilers for additional progressive type info, is the right path.
Textmate is worse than useless for anything but super simple languages.
I say this as a person who has written about 15 vscode language servers for all sorts of varied languages, from CNC g-code to you name it.
I even implemented incremental parsing/lexing for ANTLR's typescript port (and then java port), as a comparison to tree-sitter ;)
It can be made as fast, but tree-sitter is clearly winning the current popularity contest ;)
I'd say for those languages, textmate work was the worst part of all of them, and getting textmate to highlight properly required more work than writing proper lexers/parsers. I still fix more bugs in textmate grammars than anywhere else.
Part of the reason is that it requires fairly complex regexps. Part of it is getting it to stop triggering on things it shouldn't, because you have to emulate a lot of state.
Again, you start to spend a lot of time working around it, rather than working with it.
Meanwhile, yes, the upfront work of a lexer in tree-sitter is sometimes more. But it is much easier to get it to work, to rely on it working, etc.
In the end, all my language servers were easier to implement in tree-sitter, and some don't even have compilers (like g-code), but have control flow/blocks/etc. tree-sitter handled all language duties there.
Hopefully this helps you "get" the tradeoff.
Anyone who has written enough vscode language servers would tell you the same thing. I have yet to meet someone in the community who writes language servers who is excited or happy with textmate vs tree-sitter.
First, I specifically said that textmate takes less effort for "the most basic highlighting", that would be keywords, literals, comments, etc. Trying to encode a CFG in it would obviously be painful, but why would you? Use an actual parser at that point, but you might as well use a correct one.
Second, there's this idea that compiler parsers are inherently slow and require a full build to use or something. As if tree-sitter parsers were using some magic tech unavailable to compilers, and most new ones weren't modular or didn't offer APIs. It's not even hypothetical: increasingly, modern languages have semantic highlighting in real time as part of the toolchain.
And in fact, I actually wanted to use that tech in a compiler, I said I tried to use tree-sitter for one, but that's where my big issue with tree-sitter came in: It would return parse trees that blatantly didn't match the grammar I wrote, and not mark them as bad syntax in any way. That's not error recovery, that's swallowing errors. If it does that, I'd rather work on making toolchain parsers better than fighting tree-sitter.
Even the "run the parser on the editor" part is solvable, since tree-sitter parsers are being shipped as WASM. A WASM interface for running possibly incremental parsers (that tree-sitter would also use) sounds more interesting, really.
There's a lot here, and I'm not sure where to start.
You are complaining about a lot of things simultaneously, and it's hard to tell exactly which point you want to tackle.
For correctness: gonna leave this alone, I don't know of issues here, and it seems fairly orthogonal to speed, which is where we started.
I've done parsing and lexing for decades now, including multiple compilers. So I'll try again: for magic tech, tree-sitter does actually have "magic tech" unavailable to most compilers, and I explained exactly what it was, but you simply want to deride it for some reason that's not obvious to me.
The magic tech is that:
1. Tree-sitter's AST nodes/lex tokens (for purposes of incremental lexing/parsing) are compiler/language-agnostic; it does not care. This is not true of most compilers.
2. It allows incremental lexing/parsing when you keep those AST nodes/tokens under the text of your editor. It only needs to be informed of what edit was made (an add/change/remove of a range of characters is fine) and to have random access to the text. This is not true of most compilers.
3. The incremental lexing/parsing is extremely fast. It is optimal in the amount that has to be re-lexed/re-parsed, which for both is bounded by the maximum lookahead actually used by the grammar, which for most languages is really small (again, not the theoretical lookahead, but the lookahead actually used during lexing/parsing). Not true of most compilers.
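The bounded-re-lexing claim in point 3 can be sketched with a deliberately dumb toy (this is not tree-sitter's algorithm, just the bookkeeping idea): reuse tokens that end before the edit, re-lex forward, and stop as soon as the output re-synchronizes with the old tokens shifted by the edit's length delta.

```javascript
// Toy incremental lexer (illustration only, not tree-sitter's implementation).
// Work done is proportional to the edit plus a small resync window, not file size.
function lex(text) {
  // Trivial lexer: words and single non-space characters, with start offsets.
  const tokens = [];
  const re = /\w+|\S/g;
  let m;
  while ((m = re.exec(text)) !== null) tokens.push({ start: m.index, text: m[0] });
  return tokens;
}

function incrementalLex(oldTokens, oldText, edit) {
  // edit = { start, removed, inserted }: chars [start, start+removed) replaced by `inserted`.
  const newText = oldText.slice(0, edit.start) + edit.inserted + oldText.slice(edit.start + edit.removed);
  const delta = edit.inserted.length - edit.removed;

  // Reuse every token that ends strictly before the edit point.
  const prefix = oldTokens.filter(t => t.start + t.text.length < edit.start);
  const last = prefix[prefix.length - 1];
  const relexFrom = last ? last.start + last.text.length : 0;

  // Re-lex from there; old tokens after the removed range, shifted by delta,
  // are candidates for reuse once we re-sync.
  const fresh = lex(newText.slice(relexFrom)).map(t => ({ start: t.start + relexFrom, text: t.text }));
  const oldShifted = new Map(
    oldTokens.filter(t => t.start >= edit.start + edit.removed)
             .map(t => [t.start + delta, t.text])
  );

  const result = [...prefix];
  let relexed = 0;
  for (const t of fresh) {
    if (t.start > edit.start + edit.inserted.length && oldShifted.get(t.start) === t.text) {
      // Re-synced: reuse all remaining old tokens verbatim (shifted).
      for (const [start, text] of oldShifted) if (start >= t.start) result.push({ start, text });
      break;
    }
    result.push(t);
    relexed += 1;
  }
  return { newText, tokens: result, relexed };
}
```

For `"alpha beta gamma delta"` with `beta` replaced by `BETAX`, only the one changed token gets re-lexed before the lexer re-syncs; everything after is reused with shifted offsets.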
Can you design compilers that can do this? Sure, I was at IBM when we did VisualAge, which could do this in other ways, though not optimally. As mentioned, I've even implemented the optimal incremental lexing/parsing algorithms in other parser/lexer generators, and even some compilers.
Is it common? Not a chance.
Your claim that increasingly more modern languages have semantic highlighting in real time "built in" is simply false: most cannot real-time semantic highlight even a 100k file on every keystroke. There are a very small number which can, and it's mostly by luck; they fall down on larger files or other complex cases because they have no incrementality.
clangd is about the closest you get.
rust-analyzer definitely can't.
etc
Lexing has always been really fast, and doable in a single pass, that has never been the problem.
Meanwhile, what I just described (updated highlighting on every keystroke) is trivial with tree-sitter for all languages because of its optimality and agnosticism. If you want to see your claim in action, turn on semantic highlighting for VS Code and type fast; most of the time you will lose syntax highlighting because things can't keep up. See, e.g., https://github.com/microsoft/vscode-extension-samples/issues...
Again, yes, you could make a compiler for every language which supports a mode that does what tree-sitter does. And people could carefully implement parsers/lexers in each of their favorite compilers and languages that support optimal incremental parsing. And then pay the cost of integrating 50 language-specific ways of transforming these to work with the editor, basically reinventing what tree-sitter already did right. (As per above, the current semantic token and highlighting support does not resolve this, and the interface it provides can't support it, last I looked.)
You seem to really otherwise be complaining that you believe it does always generate correct parsers. Like I said, that seems totally orthogonal to anything about the speed issue, and if that's your real concern, have at it.
> You are complaining about a lot of things simultaneously, and it's hard to tell exactly which point you want to tackle.
Thanks, that's exactly how I felt about your first reply.
> tree-sitter does actually have "magic tech" unavailable to most compilers
No, not using it doesn't mean it's unavailable; at best it means the toolchain is up for an upgrade.
> the magic tech is that
I too have written compilers and language servers, even tested novel parsing techniques. I don't need tree-sitter explained to me, I understand the theory behind it perfectly well and, as I already said, understand what they're going for in terms of ecosystem. I just don't think it's the sweet-spot you think it is.
In any case, making incremental edits is not novel, it's easy even with good old recursive descent by simply traversing the ast, skipping the prefix of the change, reparsing the inner node and checking that the edit was well-bounded (hopefully your language doesn't have multiline strings, or you'll have to accept that some incremental changes consume the whole file in the worst case anyway). I've done this, and again, if some toolchain doesn't then it's up for an upgrade.
The actually interesting parts of tree-sitter, to me, are the error recovery and the shared parse tree interface. The latter is orthogonal to the parsing technique, yet with tree-sitter it's all or nothing. The former would be great if it actually marked the error in the parse tree, which it didn't when I checked and it wasn't deemed important.
> most cannot real time semantic highlight even a 100k file on every keystroke
Even you concede that the bottleneck here is not usually parsing (and even then, I just said it can easily be made incremental), but all the other work the compiler chooses to do before answering back to the editor. When this work is unnecessary (at this step, anyway), then sure, you can speed past them by doing the work yourself on the tree-sitter parse tree.
> you could make a compiler for every language which supports a mode that does what tree-sitter does
Yes, we could. Maintaining and testing tree-sitter grammars also has a cost, maybe the difference is that compiler maintainers are externalising this cost right now.
> and then pay the cost of integrating 50 language specific ways of transforming these to work with the editor
No! We went through this with LSP. We can have any number of parsers with a shared driver interface, maybe like tree-sitter's, but we don't need to push every grammar through a single parser toolkit that isn't good enough to replace the existing parser in the toolchain.
In fact, tree-sitter grammars are shipped precompiled to editors, so that's kinda the situation already, but their current interface AFAIK expects a particular implementation of the parser (theirs).
> You seem to really otherwise be complaining that you believe it does always generate correct parsers
I guess you mean doesn't. Then yes, but that's not an "otherwise", that's a big deal, because if tree-sitter parse trees were actually sound I could just maintain a single tree-sitter grammar for the entire toolchain, compiler and editor, and this debate would be redundant.
Getting a syntax tree from clangd is faster than you might think because it just needs to hook into the lexing/parsing portion of the compiler, which is actually pretty fast and much, much faster than actually compiling code. And generating fixes and warnings using clang-tidy (which can be built as a standalone binary, but is typically just linked into clangd) requires parsing and lexing anyway, so the parse information is basically "free" since it's work clangd needs to do regardless. And at least for C/C++ you can't really do accurate parsing without the compiler anyway due to the fact that preprocessor macros can do nearly anything. Even if your preprocessor macros are generally doing sane things and could be parsed without the compiler, most people want to see things like #if/#else/#endif branches highlighted so they can see which code paths the compiler is actually going to select when it compiles code, and again that requires actually running a full preprocessor with the same flags as the code will use when it is compiled.
That being said, I believe that most editors do a hybrid approach to lexing/parsing because getting syntax information entirely from clangd is too much to ask. So basically the editor will do a best-effort parse in real time of your code using some built-in syntax highlighting engine (possibly tree-sitter, possibly something else) and then once clangd sends syntax information they'll update the document as necessary.
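That hybrid scheme can be sketched roughly like this (all names here are hypothetical, not VS Code's or clangd's actual API): edits bump a document version and apply cheap local decorations immediately, and slower semantic results replace them only if they match the current version, so a stale server response can't clobber newer local highlighting.

```javascript
// Sketch of a two-tier highlighter (hypothetical names, not a real editor API).
class HybridHighlighter {
  constructor() {
    this.version = 0;
    this.decorations = null; // currently displayed tokens
    this.source = null;      // "local" or "semantic"
  }

  onEdit(text) {
    this.version += 1;
    // Tier 1: cheap, synchronous, always up to date (a stand-in for
    // tree-sitter or any fast local engine).
    this.decorations = text.split(/\s+/).filter(Boolean).map(w => ({ word: w, kind: "plain" }));
    this.source = "local";
    return this.version; // caller sends this version to the language server
  }

  onSemanticTokens(version, tokens) {
    // Tier 2: richer but asynchronous; drop results for anything but the
    // latest edit so stale responses never overwrite fresher local passes.
    if (version !== this.version) return false;
    this.decorations = tokens;
    this.source = "semantic";
    return true;
  }
}
```

The version check is the important design choice: without it, a slow semantic response for an old edit would repaint the document with out-of-date tokens.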
I mean, it depends on how much code it has to lex/parse, but yes, it's pretty fast but not realtime.
That is usually the goal for editors, or at least what they try to achieve.
It can't be done on every keystroke; it can't even be done close to every keystroke. For clangd, it's dependent on the size of the includes/etc., though it does a reasonable job of caching :)
But it's not hard to out-type it.
"That being said, I believe that most editors do a hybrid approach to lexing/parsing because getting syntax information entirely from clangd is too much to ask. So basically the editor will do a best-effort parse in real time of your code using some built-in syntax highlighting engine (possibly tree-sitter, possibly something else) and then once clangd sends syntax information they'll update the document as necessary."
Which is precisely what I said most of them should do.
It's just that starting with "textmate" basically makes no sense when things like tree-sitter are available.
Tree-sitter grammars generate a complete/concrete syntax tree with incremental updates; they necessarily have to be able to represent invalid source text (because users will type it!).
If you're using a tree-sitter grammar as the input to a compiler (which is not what it's designed for), you need to write a pass to convert from CST to AST, as you normally would.
No, I specifically said "accept invalid syntax but not mark the parse trees as bad". Tree-sitter is supposed to accept invalid syntax, yes, but also tag the nodes as such. And at the time it seemed upstream didn't intend to fix the edge cases where it fails to do so, it just gave syntax trees that didn't match the grammar.
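The distinction being argued, in toy form (a made-up one-rule grammar, nothing to do with tree-sitter's actual recovery machinery): an error-recovering parser should still build a tree for bad input, but the bad region must be tagged, e.g. with an ERROR node and an error flag on the ancestors.

```javascript
// Toy error-recovering parser for a made-up grammar: a comma-separated list
// of numbers. Invalid items still get a node (recovery), but are tagged
// ERROR, and the root carries a hasError flag, so nothing is silently swallowed.
function parseNumberList(src) {
  const children = src.split(",").map(s => {
    const t = s.trim();
    return /^\d+$/.test(t)
      ? { type: "number", text: t }
      : { type: "ERROR", text: t }; // recovered, but explicitly marked bad
  });
  return { type: "list", children, hasError: children.some(c => c.type === "ERROR") };
}
```

With that flag, a downstream consumer can trust trees where `hasError` is false; the complaint above is about edge cases where neither a grammar-conforming tree nor the error marking shows up.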
https://marketplace.visualstudio.com/items?itemName=dtsvet.v... (113k installations)
Repo here: https://github.com/wasmerio/vscode-wasm