
Optimizations in Syntax Highlighting - riejo
https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations
======
trishume
Shameless plug: my implementation of Sublime's syntax highlighting engine in
Rust has similar optimizations and more. I'm not at my computer to benchmark
on the same files but it should be >2x as fast as their "after" numbers just
based on lines/second for JS-like files.

This evening I'm even trying to port it to a pure Rust regex engine that
should eliminate non-Rust code and make it substantially faster.

It also implements the sublime-syntax format, which is a superset of
tmLanguage that allows even nicer highlighting.

[https://github.com/trishume/syntect](https://github.com/trishume/syntect)

~~~
nhaehnle
Thanks for this. Clearly the original post describes good work, but I can't
help feeling the JS community is slacking off when it comes to performance.

Just eyeballing the cited numbers, they take 3939ms to handle a 1.18MB input
on "a somewhat powerful desktop machine". Assuming that that means a chip
running at 2GHz, we're talking about over 6300 cycles per byte!
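
Spelled out, the estimate looks like this. It is only a back-of-the-envelope sketch: the 2GHz clock is the assumption stated above, the article does not say whether "MB" means 10^6 or 2^20 bytes, so both are computed, and `cyclesPerByte` is a name made up for illustration.

```typescript
// Hypothetical helper: elapsed time and clock rate to cycles spent per input byte.
function cyclesPerByte(elapsedMs: number, clockHz: number, bytes: number): number {
  const cycles = (elapsedMs / 1000) * clockHz;
  return cycles / bytes;
}

const elapsedMs = 3939;                  // time cited above
const clockHz = 2e9;                     // assumed 2GHz clock
const decimalBytes = 1.18 * 1e6;         // if 1 MB = 10^6 bytes
const binaryBytes = 1.18 * 1024 * 1024;  // if 1 MB = 2^20 bytes

const perByteDecimal = cyclesPerByte(elapsedMs, clockHz, decimalBytes); // about 6,700
const perByteBinary = cyclesPerByte(elapsedMs, clockHz, binaryBytes);   // about 6,400
```

Either reading of "MB" lands above 6300 cycles per byte.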

That's quite frankly ridiculous. An improvement by at least one order of
magnitude should be possible. Where's the ambition?

(Yes, there's always a trade-off with these things. But I feel _someone_ has
to point this out when the OP is explicitly about getting kudos for
performance work.)

~~~
alexdima
I hate slowness and inefficiency too, that's why I try to make the editor as
fast as possible :), but at least in this case, it is the nature of TM grammars
that is to blame, not the dynamic nature of JS. TM grammars consist of rules
that have regular expressions, which need to be constantly evaluated; and in
order to implement a correct TM grammar interpreter, you must evaluate them.
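
To make that cost concrete, here is a minimal sketch of a rule-driven tokenizer in the TextMate spirit. It is not the vscode-textmate implementation, and it ignores begin/end rules, captures, and the rule stack; the `Rule` type, the rule list, and `tokenizeLine` are invented for illustration. What it shows is that at every position each candidate rule's regex has to be searched to find the earliest match, so regex executions grow roughly with rules times tokens.

```typescript
// Hypothetical, heavily simplified stand-in for a TM grammar: flat "match" rules only.
interface Rule { scope: string; regex: RegExp; }
interface Token { start: number; end: number; scope: string; }

const rules: Rule[] = [
  { scope: "keyword", regex: /\b(?:const|let|return)\b/g },
  { scope: "number", regex: /\d+/g },
  { scope: "string", regex: /"[^"]*"/g },
];

function tokenizeLine(line: string): { tokens: Token[]; regexRuns: number } {
  const tokens: Token[] = [];
  let regexRuns = 0;
  let pos = 0;
  while (pos < line.length) {
    // Every rule is searched from the current position; the earliest match wins,
    // with ties going to the rule listed first.
    let best: { rule: Rule; start: number; end: number } | null = null;
    for (const rule of rules) {
      rule.regex.lastIndex = pos;
      const m = rule.regex.exec(line);
      regexRuns++;
      if (m !== null && m[0].length > 0 && (best === null || m.index < best.start)) {
        best = { rule: rule, start: m.index, end: m.index + m[0].length };
      }
    }
    if (best === null) break; // nothing else matches on this line
    tokens.push({ start: best.start, end: best.end, scope: best.rule.scope });
    pos = best.end;
  }
  return { tokens: tokens, regexRuns: regexRuns };
}

const result = tokenizeLine('const x = 42; return "ok";');
```

For this 26-character line and three rules, the sketch runs 15 regex searches to produce 4 tokens. Real grammars have far more rules per state.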

I've looked in the past for optimization opportunities in C land (mostly
through better caching), which yielded quite nice results [1][2]. I would love
it if you took a look too.

At this point, in tokenization, 90% of the time is spent in C, matching
regular expressions in oniguruma. More precisely, regular expressions are
executed 3,933,859 times to tokenize checker.ts -- the 1.18MB file. That is
with some very good caching in node-oniguruma, and it speaks to the
inefficiency of the regex-based design of TM grammars more than anything else.

It is definitely possible to write faster tokenizers, especially when writing
them by hand (even in JS), see for example the Monaco Editor[3] where we use
the TypeScript compiler's lexer as a tokenizer.
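
For contrast, a hand-written tokenizer can classify characters with plain comparisons on character codes and never touch a regex engine. The following is a toy sketch, not the TypeScript compiler's scanner; the token kinds and the `scan` function are invented for illustration.

```typescript
// Toy hand-written scanner; kinds and function names are illustrative only.
type Kind = "identifier" | "number" | "whitespace" | "other";
interface Tok { kind: Kind; start: number; end: number; }

function isDigit(c: number): boolean {
  return c >= 48 && c <= 57; // 0-9
}
function isIdentStart(c: number): boolean {
  // A-Z, a-z, underscore, dollar sign
  return (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || c === 95 || c === 36;
}
function isSpace(c: number): boolean {
  return c === 32 || c === 9; // space, tab
}

function scan(text: string): Tok[] {
  const out: Tok[] = [];
  let i = 0;
  while (i < text.length) {
    const start = i;
    const c = text.charCodeAt(i);
    let kind: Kind;
    i++; // every branch consumes at least one character
    if (isIdentStart(c)) {
      while (i < text.length && (isIdentStart(text.charCodeAt(i)) || isDigit(text.charCodeAt(i)))) i++;
      kind = "identifier";
    } else if (isDigit(c)) {
      while (i < text.length && isDigit(text.charCodeAt(i))) i++;
      kind = "number";
    } else if (isSpace(c)) {
      while (i < text.length && isSpace(text.charCodeAt(i))) i++;
      kind = "whitespace";
    } else {
      kind = "other"; // any other single character
    }
    out.push({ kind: kind, start: start, end: i });
  }
  return out;
}

const toks = scan("foo1 = 42");
```

Each character is inspected a small, fixed number of times, which is the kind of tight loop a JS engine can optimize well.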

At least in this case, inefficiencies are not caused by our runtime.

[1] [https://github.com/atom/node-oniguruma/pull/40](https://github.com/atom/node-oniguruma/pull/40)

[2] [https://github.com/atom/node-oniguruma/pull/46](https://github.com/atom/node-oniguruma/pull/46)

[3] [https://microsoft.github.io/monaco-editor/](https://microsoft.github.io/monaco-editor/)

~~~
nhaehnle
Do you pre-process the regular expressions into a common DFA, or does
oniguruma do that for you? That would seem like the natural design for this.

It's non-trivial because TextMate grammars seem to be just a little too
general for that to be convenient. So there's definitely a trade-off. But if I
really wanted to get as fast as possible, I would try to see whether I could
get there.
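
A rough way to picture the "one common matcher" idea in JS terms is to merge the rule patterns into a single alternation and let the engine scan once per token. This is only an analogy under loud assumptions: JavaScript's RegExp is a backtracking engine, not a DFA, the three patterns and the `tokenizeCombined` name are invented, and real TM grammars (begin/end rules, backreferences, oniguruma-specific syntax) would not merge this easily.

```typescript
// Hypothetical flat rule set; each source must contain no capturing groups of its own.
const parts: { scope: string; source: string }[] = [
  { scope: "keyword", source: "\\b(?:const|let|return)\\b" },
  { scope: "number", source: "\\d+" },
  { scope: "string", source: '"[^"]*"' },
];

// One capturing group per rule: (keyword)|(number)|(string)
const combined = new RegExp(parts.map(function (p) { return "(" + p.source + ")"; }).join("|"), "g");

function tokenizeCombined(line: string): { scope: string; text: string }[] {
  const out: { scope: string; text: string }[] = [];
  combined.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = combined.exec(line)) !== null) {
    // The group that participated in the match tells us which rule fired.
    for (let g = 0; g < parts.length; g++) {
      if (m[g + 1] !== undefined) {
        out.push({ scope: parts[g].scope, text: m[0] });
        break;
      }
    }
  }
  return out;
}

const merged = tokenizeCombined('const x = 42; return "ok";');
```

Here the engine is invoked once per token (plus one final failed search) instead of once per rule per token.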

------
nojvek
I once dabbled in the vscode tokenizer code. There is a lot going on, but
putting tokens in a buffer took me by surprise. It was a very smart
implementation.
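
As a rough sketch of what "tokens in a buffer" can look like: instead of one object per token, each token becomes a couple of 32-bit integers in a typed array. The bit layout and all function names below are invented for illustration and are not VS Code's actual encoding.

```typescript
// Assumed layout for illustration: two 32-bit slots per token,
// [startOffset, metadata], with
// metadata = (foregroundId << 8) | (fontStyle << 4) | tokenType.
// tokenType and fontStyle are assumed to fit in 4 bits each.
const SLOTS_PER_TOKEN = 2;

interface PlainToken { start: number; tokenType: number; fontStyle: number; foregroundId: number; }

function packMetadata(tokenType: number, fontStyle: number, foregroundId: number): number {
  return ((foregroundId << 8) | (fontStyle << 4) | tokenType) >>> 0;
}

function encodeTokens(tokens: PlainToken[]): Uint32Array {
  const buf = new Uint32Array(tokens.length * SLOTS_PER_TOKEN);
  for (let i = 0; i < tokens.length; i++) {
    const t = tokens[i];
    buf[i * SLOTS_PER_TOKEN] = t.start;
    buf[i * SLOTS_PER_TOKEN + 1] = packMetadata(t.tokenType, t.fontStyle, t.foregroundId);
  }
  return buf;
}

// Readers decode fields with shifts and masks; no per-token objects are allocated.
function startAt(buf: Uint32Array, index: number): number {
  return buf[index * SLOTS_PER_TOKEN];
}
function foregroundAt(buf: Uint32Array, index: number): number {
  return buf[index * SLOTS_PER_TOKEN + 1] >>> 8;
}
function fontStyleAt(buf: Uint32Array, index: number): number {
  return (buf[index * SLOTS_PER_TOKEN + 1] >>> 4) & 0xf;
}
function tokenTypeAt(buf: Uint32Array, index: number): number {
  return buf[index * SLOTS_PER_TOKEN + 1] & 0xf;
}

const encoded = encodeTokens([
  { start: 0, tokenType: 1, fontStyle: 2, foregroundId: 300 },
  { start: 5, tokenType: 3, fontStyle: 0, foregroundId: 7 },
]);
```

The appeal of this shape is that a line's tokens live in one compact allocation rather than many small objects.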

I think it's their focus on performance and good architecture that makes
vscode stand out.

Sure, Sublime also has great engineering behind it, but being able to
contribute to and look under the hood of the tools we developers use is very
exciting. It feels like a very democratic process.

I remember filing the mini-map bug in Monaco. I'm so glad to see they are
working on implementing it in a performant way even though it will require
large rewrites of their editor rendering code.

------
ex3ndr
Does VS Code support highlighting for "non-regexp" cases? For example, in code
I can reference a class by its name, but the same name could also be a
function. How can you distinguish one from the other with just a regexp when
the language doesn't use capitalization to mark class names? In some languages
(Scala, Kotlin...) a class is sometimes an object; how is this case handled?

------
mrgalaxy
> there is no feasible way to interpret TextMate grammars in the browser even
> today

That doesn't sound right... but then again I don't know enough about TextMate
grammars to argue.

~~~
alexdima
- All the regular expressions in TM grammars are based on oniguruma, a
regular expression library written in C.

- The only way to interpret the grammars and get anywhere near original
fidelity is to use the exact same regular expression library (with its custom
syntax constructs). In VSCode, our runtime is Node.js, and we can use a Node
native module that exposes the library to JavaScript.

- In the Monaco Editor, we are constrained to a browser environment where we
cannot do anything similar.

- We have experimented with Emscripten to compile the C library to asm.js,
but performance was very poor even in Firefox (10x slower) and extremely poor
in Chrome (100x slower).

- We can revisit this once WebAssembly gets traction in the major browsers,
but we will still need to consider the browser matrix we support. For example,
if we support IE11 and only Edge adds WebAssembly support, what will the
experience be in IE11?

~~~
alimbada
Sorry if I'm missing something, but why do you care about supporting other
browsers? Isn't VSCode built on Electron, which is a self-contained
server/browser environment?

~~~
wolfgang42
Monaco is a standalone component which also works in the browser:
[https://microsoft.github.io/monaco-editor/](https://microsoft.github.io/monaco-editor/)

The article says "It was shipped in the form of the Monaco Editor in various
Microsoft projects, including Internet Explorer's F12 tools"; presumably some
of those projects also embed it into webpages.

------
iveqy
So VSCode is great in many ways, and the article might be interesting. But I
would never call it fast. It's still really, really slow. Just see this
comparison:
[https://www.youtube.com/watch?v=nDRBxtEUOFE](https://www.youtube.com/watch?v=nDRBxtEUOFE)

~~~
alexdima
When we started the project, we did write tokenizers by hand. I mention that
in the blog post. You can write some very fast tokenizers by hand, even in
JavaScript. Of course they won't be as fast as hand-written tokenizers in C,
but you'd be surprised how well the code of a hand-written tokenizer in
JavaScript can be optimized by a JS engine, at least I was :). IR Hydra 2 is a
great tool to visualise V8's IR for JS code [1]. It is a shame it is not built
into the Chrome Dev Tools.

In the end, we simply could not write tokenizers for all languages by hand.
And our users wanted to take their themes with them when switching to VS Code.
That's why we added support for TM grammars and TM themes, and in hindsight I
still consider it to be a very smart decision.

[1] [http://mrale.ph/irhydra/2/](http://mrale.ph/irhydra/2/)

~~~
atombender
What about a parser generator that takes something like a BNF-type language
and generates optimal JS/TS code on the fly, similar to Lex/Yacc? (The BNF
would be portable, the generated code would be a cache.)

I can see that reusing .tmLanguage files saves a lot of work, but that format
is atrocious -- hard to both read and write. (I once wrote a
parser/highlighter for it in ObjC; it was not a lot of fun.)

------
eduren
So with the comparisons at the end of the article, does this mean that there
were a lot of edge cases where the theming wasn't being correctly applied
prior to 1.9? Were there themes that incorporated fewer stylistic choices
because of the limitations?

~~~
alexdima
Yes, those comparisons at the end show differences in rendering caused by the
"approximations" used prior to VS Code 1.9. They were all caused by the
difference between the ranking rules of CSS selectors and the ranking rules of
TM scope selectors.
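
To illustrate the TM side of that difference, here is a deliberately simplified sketch of scope selector ranking: among theme rules whose selector is a dot-delimited prefix of the token's scope, the longest selector wins, regardless of the order the rules were written in. It handles a single scope only (no parent scopes or descendant selectors), and `ThemeRule`, `matchesPrefix`, `resolveColor`, and the colors are all invented for illustration.

```typescript
interface ThemeRule { selector: string; color: string; }

// "constant" matches "constant" and "constant.numeric", but "const" does not match "constant".
function matchesPrefix(selector: string, scope: string): boolean {
  return scope === selector || scope.slice(0, selector.length + 1) === selector + ".";
}

function resolveColor(rules: ThemeRule[], scope: string, fallback: string): string {
  let best: ThemeRule | null = null;
  for (const rule of rules) {
    if (matchesPrefix(rule.selector, scope) &&
        (best === null || rule.selector.length > best.selector.length)) {
      best = rule; // a longer (more specific) matching selector beats a shorter one
    }
  }
  return best === null ? fallback : best.color;
}

const theme: ThemeRule[] = [
  { selector: "constant.numeric", color: "#00ff00" },
  { selector: "constant", color: "#ff0000" },
  { selector: "const", color: "#0000ff" },
];

const hexColor = resolveColor(theme, "constant.numeric.hex", "#000000");   // "#00ff00"
const langColor = resolveColor(theme, "constant.language", "#000000");     // "#ff0000"
const otherColor = resolveColor(theme, "variable.parameter", "#000000");   // "#000000"
```

In this sketch the less specific `constant` rule comes later in the list yet still loses to `constant.numeric`; rule order plays no part in the ranking.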

------
joshschreuder
Those who switched from ST to VS Code, did you stick with VS Code? Do you have
any advice for making the transition easier - key bindings, packages etc.?

~~~
bicubic
I stuck with VS Code. To be honest, I don't think it's even in the same league
as ST or Atom due to the amount of integrated tooling. The integrated
debugger, git, and task runner management are a godsend. VS Code is also the
only editor I've used where JavaScript type lookups and auto completion/code
doc parsing Just Work out of the box.

------
pjmlp
Great article, especially regarding the type of optimisations that were
applied.

------
Animats
_(Syntax highlighting) is the one feature that turns a text editor into a code
editor._

Syntax highlighting is eye candy. Automatic indentation is what turns a text
editor into a code editor.

~~~
farnsworth
Automatic indentation saves a few keystrokes. A language service (go to
definition, etc.) is what turns a text editor into a code editor!

~~~
whatever_dude
A language service is for people with poor memory. A terminal window is what
turns a text editor into a code editor.

~~~
flamedoge
I need an OOP navigator to use C++/Java/C#

~~~
hex13
I think support for butterflies is what turns a text editor into a code editor
(xkcd#378) XD

