This evening I'm even trying to port it to a pure Rust regex engine that should eliminate non-Rust code and make it substantially faster.
It also implements the sublime-syntax format, which is a superset of tmLanguage that allows even nicer highlighting.
For context, Sublime Text 3 takes 2 seconds with the same grammar and the same file on the same computer, thanks to a better custom regex engine written specifically for highlighting.
Given what alexdima mentioned in a different comment about spending most of the time in the regex engine, I'm not sure that my engine would be substantially faster under exactly identical conditions since I'm also bottlenecked by Oniguruma.
However, maybe after I port my engine to https://github.com/google/fancy-regex I'll be substantially faster. And if I am, it's likely they could also benefit from a fancy-regex port.
Just eyeballing the cited numbers, they take 3939ms to handle a 1.18MB input on "a somewhat powerful desktop machine". Assuming that that means a chip running at 2GHz, we're talking about over 6300 cycles per byte!
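Making the back-of-envelope arithmetic explicit (using my assumed 2GHz clock and reading 1.18MB as 1.18×10^6 bytes; the figure is closer to 6,400 if it means MiB, but well over 6,300 either way):

$$
\frac{3.939\,\mathrm{s} \times 2\times 10^{9}\,\mathrm{cycles/s}}{1.18\times 10^{6}\,\mathrm{bytes}} \approx 6{,}700\ \mathrm{cycles\ per\ byte}
$$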
That's quite frankly ridiculous. An improvement by at least one order of magnitude should be possible. Where's the ambition?
(Yes, there's always a trade-off with these things. But I feel someone has to point this out when the OP is explicitly about getting kudos for performance work.)
I've looked in the past for optimization opportunities in C land (mostly through better caching), which yielded quite nice results. I would love it if you'd take a look too.
At this point, in tokenization, 90% of the time is spent in C, matching regular expressions in Oniguruma. More precisely, regular expressions are executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is with some very good caching in node-oniguruma, and it speaks to the inefficiency of the TM grammars' regex-based design more than anything else.
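To give a sense of where those millions of executions come from, here is a rough sketch of the kind of inner loop a TM-grammar tokenizer runs (in Rust, with the fancy-regex crate as a stand-in for Oniguruma; this is illustrative, not vscode-textmate's actual code): at every position, every pattern of the current context is tried and the earliest match wins.

```rust
use fancy_regex::Regex;

/// Illustrative only: a TM-style tokenizer tries every pattern of the current
/// context at each position and keeps the earliest match, so the regex engine
/// is invoked roughly (patterns x positions) times in the worst case.
fn tokenize_line(line: &str, patterns: &[(Regex, &str)]) -> Vec<(usize, usize, String)> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < line.len() {
        // One regex execution per pattern -- this is the hot spot.
        let best = patterns
            .iter()
            .filter_map(|(re, scope)| {
                re.find(&line[pos..])
                    .ok()
                    .flatten()
                    .map(|m| (m.start(), m.end(), *scope))
            })
            .min_by_key(|&(start, _, _)| start);
        match best {
            Some((start, end, scope)) => {
                tokens.push((pos + start, pos + end, scope.to_string()));
                // Advance past the match (or by one char on a zero-length match).
                pos += if end > 0 {
                    end
                } else {
                    line[pos..].chars().next().map_or(1, |c| c.len_utf8())
                };
            }
            None => break,
        }
    }
    tokens
}

fn main() -> Result<(), fancy_regex::Error> {
    let patterns = vec![
        (Regex::new(r"\b(let|fn|return)\b")?, "keyword"),
        (Regex::new(r"\b\d+\b")?, "constant.numeric"),
        (Regex::new(r"[A-Za-z_]\w*")?, "identifier"),
    ];
    for (start, end, scope) in tokenize_line("let answer = 42;", &patterns) {
        println!("{start:>2}..{end:<2} {scope}");
    }
    Ok(())
}
```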
It is definitely possible to write faster tokenizers, especially when writing them by hand (even in JS), see for example the Monaco Editor where we use the TypeScript compiler's lexer as a tokenizer.
At least in this case, inefficiencies are not caused by our runtime.
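For contrast, a hand-written tokenizer is just a single linear pass over the bytes with no regex engine involved at all. A rough, ASCII-only sketch (not Monaco's or the TypeScript compiler's actual lexer):

```rust
// Illustrative sketch of a hand-rolled lexer: one pass, a match on the
// current byte, and no regular expressions anywhere.
fn lex(src: &str) -> Vec<(&str, &'static str)> {
    let bytes = src.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let kind = match bytes[i] {
            b if b.is_ascii_whitespace() => {
                while i < bytes.len() && bytes[i].is_ascii_whitespace() { i += 1; }
                "whitespace"
            }
            b'0'..=b'9' => {
                while i < bytes.len() && bytes[i].is_ascii_digit() { i += 1; }
                "number"
            }
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => {
                while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') { i += 1; }
                "identifier"
            }
            _ => { i += 1; "punctuation" }
        };
        tokens.push((&src[start..i], kind));
    }
    tokens
}

fn main() {
    for (text, kind) in lex("let answer = 42;") {
        println!("{kind:12} {text:?}");
    }
}
```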
It's non-trivial because TextMate grammars seem like they're just a little bit too general to be convenient. So there's definitely a trade-off. But if I wanted to really get as fast as possible, I would try to see if I can get there.
I could get way better performance by rewriting all those grammars using compiled parsers in Rust (like Xi has as an option https://github.com/google/xi-editor/blob/master/rust/lang/Ca...) but it would take an absurd amount of effort.
The speed of highlighting tmLanguage files is limited mostly by the regex library: my code spends 50% of its time in Oniguruma. You can improve that a bit by using a fancier regex engine, which is how Sublime is faster than my engine, but this evening I'm going to try to port my library to a faster engine based on Rust's regex crate. That library (https://github.com/google/fancy-regex) is brand new and almost untested, but it's the only open source library that is faster than Oniguruma while supporting all the right features. In the future VSCode may be able to gain some speed by switching to it, but they have much more overhead on top of the regex library to eat away first than I do.
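To show what fancy-regex buys over the plain regex crate, here's a minimal sketch (the patterns are made up for illustration): it accepts the backreferences and lookaround that TM grammars rely on, backtracking only where needed while delegating simpler patterns to the fast regex crate underneath.

```rust
use fancy_regex::Regex;

fn main() -> Result<(), fancy_regex::Error> {
    // Backreference: matches a string delimited by whichever quote opened it.
    // The plain regex crate rejects this; fancy-regex handles it by backtracking.
    let quoted = Regex::new(r#"(['"]).*?\1"#)?;
    assert!(quoted.is_match(r#"let s = "hello";"#)?);

    // Lookahead, another construct tmLanguage patterns use all over the place.
    let fn_kw = Regex::new(r"\bfn(?=\s+\w+)")?;
    assert!(fn_kw.is_match("fn tokenize_line(line: &str)")?);

    println!("both patterns matched");
    Ok(())
}
```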
We had to implement getuid as an executable in C, which just looks a name up in /etc/passwd and returns a number.
By using specialised trie-like data structures and, in the end, also custom memory alignment instead of malloc, we squeezed the whole algorithm runtime down to an average of 45 cycles or so.
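To give an idea of the shape of it, here is a hypothetical Rust sketch (not the original C code; the names and layout are my illustration): all nodes sit in one pre-allocated, contiguous arena instead of per-node malloc calls, so a lookup is just a few array indexing steps per byte of the name.

```rust
// Hypothetical illustration: a byte-indexed trie stored in a single
// contiguous arena, mapping user names to numeric ids.
struct Node {
    next: [u32; 256], // child index per input byte; 0 means "no child"
    uid: i32,         // >= 0 where a user name ends, else -1
}

struct PasswdTrie {
    arena: Vec<Node>, // one allocation for every node; no per-node malloc
}

impl PasswdTrie {
    fn new() -> Self {
        PasswdTrie { arena: vec![Node { next: [0; 256], uid: -1 }] }
    }

    fn insert(&mut self, name: &str, uid: u32) {
        let mut cur = 0;
        for &b in name.as_bytes() {
            if self.arena[cur].next[b as usize] == 0 {
                self.arena.push(Node { next: [0; 256], uid: -1 });
                let new_idx = (self.arena.len() - 1) as u32;
                self.arena[cur].next[b as usize] = new_idx;
            }
            cur = self.arena[cur].next[b as usize] as usize;
        }
        self.arena[cur].uid = uid as i32;
    }

    // The hot path: one array index per byte of the name, no hashing,
    // no pointer chasing outside the arena.
    fn lookup(&self, name: &str) -> Option<u32> {
        let mut cur = 0;
        for &b in name.as_bytes() {
            match self.arena[cur].next[b as usize] {
                0 => return None,
                n => cur = n as usize,
            }
        }
        (self.arena[cur].uid >= 0).then(|| self.arena[cur].uid as u32)
    }
}

fn main() {
    let mut trie = PasswdTrie::new();
    trie.insert("alice", 1000);
    trie.insert("alicia", 1001);
    assert_eq!(trie.lookup("alice"), Some(1000));
    assert_eq!(trie.lookup("bob"), None);
    println!("lookups ok");
}
```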
And, do you have an issue for the regex port? Would love to see the benchmark.
I have yet to do the regex engine port; I'm doing that later this evening. I'm porting it from Oniguruma to https://github.com/google/fancy-regex, which accelerates common types of regexes using the awesome (and super fast) Rust regex crate.
I think it's their focus on performance and good architecture that makes vscode stand out.
Sure, Sublime also has great engineering behind it, but being able to contribute and look under the hood of the tools we developers use is very exciting. It feels like a very democratic process.
I remember filing the mini-map bug in Monaco; I'm so glad to see they are working on implementing it in a performant way, even though it will require large rewrites of their editor rendering code.
That doesn't sound right... but then again I don't know enough about TextMate grammars to argue.
- the only way to interpret the grammars and get anywhere near original fidelity is to use the exact same regular expression library (with its custom syntax constructs)
- in the Monaco Editor, we are constrained to a browser environment where we cannot do anything similar
- we have experimented with Emscripten to compile the C library to asm.js, but performance was very poor even in Firefox (10x slower) and extremely poor in Chrome (100x slower).
- we can revisit this once WebAssembly gets traction in the major browsers, but we will still need to consider the browser matrix we support. i.e. if we support IE11 and only Edge will add WebAssembly support, what will the experience be in IE11, etc.
The article says "It was shipped in the form of the Monaco Editor in various Microsoft projects, including Internet Explorer's F12 tools"; presumably some of those projects also embed it into webpages.
Maybe more things; I stopped reading the docs at that point to comment here. They have other features absent from JS that can be more or less polyfilled, like possessive quantifiers and sticky matching (they do the latter with an escape sequence, though, so it can only be polyfilled using this trick if the escape applies to the whole regexp rather than part of it).
I'm surprised I've never seen that typo for regex before. It's wonderful.
Can anyone explain why this is the case? It's not only in VSCode, I remember seeing something about TextMate grammars also in other editors.
It's harder than it sounds if you want to support many languages. The Sublime syntaxes repo I use has 34,000 lines of grammars, whereas my engine is only 3000 lines of code. If you count all the tmLanguage files for nice languages available online, it's probably hundreds of thousands of lines, and that's in a pretty dense format. The whole point of using tmLanguage files is that people don't care how fast highlighting is for other languages if there is no highlighting for their own language.
I could get way better performance by rewriting all those grammars using compiled parsers in Rust (like Xi has as an option https://github.com/google/xi-editor) but it would take an absurd amount of effort.
Why don't we use other text editors' grammars that are simpler/quicker to parse in JS? I have no idea about the technicalities, but, for instance, Vim or Emacs grammars instead?
Maybe they meant that browsers don't usually have access to the file system, but that's changing and also not applicable since they're using Electron and have NodeJS at their disposal.
Also, all that video tests for is the presence of an optimization where the on-screen colours are updated as soon as that part of the file is done, instead of after the entire file is done. It tells us nothing about the underlying speed of the highlighting engines. Perhaps an important optimization, but there's not much information here.
In the end, we simply could not write tokenizers for all languages by hand. And our users wanted to take their themes with them when switching to VS Code. That's why we added support for TM grammars and TM themes, and in hindsight I still consider it to be a very smart decision.
I can see that reusing .tmLanguage files saves a lot of work, but that format is atrocious -- hard to both read and write. (I once wrote a parser/highlighter for it in ObjC; it was not a lot of fun.)
Syntax highlighting is eye candy. Automatic indentation is what turns a text editor into a code editor.