
TL;DR: CRDTs are completely irrelevant to any of the highlighting/etc. stuff.

Most highlighters are lexers. Advanced highlighters/folders are parsers.

The lexing/parsing that is required for highlighting is easy to make incremental for all sane programming languages.

For LL(*) grammars, adding incrementality is completely trivial (I sent patches to ANTLR4 to do this).

For LR(k) grammars, it's more annoying but possible (tree-sitter does this).

For lexing, doing it optimally is annoying; doing it near-optimally is very easy.

Optimal incremental lexing requires tracking, on a per-token basis, how far ahead in the character stream the recognizer looked (easy), and computing the affected sets (annoying).

Most real programming languages have a lookahead of 1 or 2. The near-optimal approach requires tracking only the maximum lookahead used, and assuming every token needs that max lookahead. In a world where the minimum token length is 1, that means you only need to re-lex an additional (max lookahead) tokens before the changed range. In a world where the minimum token length is 0, it's all zero-length tokens plus (max lookahead) tokens before the changed range. This does not require any explicit per-token computation.

Again, for basically all programming languages, this ends up re-lexing only 1 or 2 more tokens in total than strictly necessary.
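To make that concrete, here is a minimal sketch of the near-optimal strategy in TypeScript (my own illustration; Token, relexStart and maxLookahead are hypothetical names, not any particular library's API):

    // Hypothetical token type: character offsets into the document.
    interface Token {
      start: number; // offset of the first character
      end: number;   // offset one past the last character
      kind: string;
    }

    // Given the old token stream, the start of the changed character range,
    // and the grammar's maximum lookahead, return the index of the first
    // token that must be re-lexed. Instead of tracking per-token lookahead
    // (the optimal scheme), we conservatively back up maxLookahead tokens,
    // since any of them could have peeked into the edited range.
    function relexStart(tokens: Token[], editStart: number, maxLookahead: number): number {
      let i = tokens.findIndex(t => t.end > editStart);
      if (i < 0) i = tokens.length; // edit is past the last token
      return Math.max(0, i - maxLookahead);
    }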

Tree-sitter does context-aware on-demand lexing. I have patches on the way for ANTLR to do the same.

The only thing CRDT helps with in this equation at all is knowing what changed and producing the sequence of tree-edits for the lexer/parser.

The lexer only cares about knowing which character ranges changed, which does not require a CRDT. The typical model for this kind of edit is what VS Code's document text changes provide (i.e., for a given text edit: old start, old end, new start, new end).

The parser only cares about which token ranges changed, which does not require a CRDT.
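Roughly, the two edit records look like this (a sketch in TypeScript; illustrative types, not VS Code's actual API):

    // Character-level edit: all the lexer needs to know.
    interface TextEdit {
      oldStart: number; // where the replaced range began in the old text
      oldEnd: number;   // where the replaced range ended in the old text
      newStart: number; // where the inserted text begins in the new text
      newEnd: number;   // where the inserted text ends in the new text
    }

    // Token-level edit derived from the re-lex: all the parser needs to know.
    interface TokenEdit {
      firstToken: number;    // index of the first re-lexed token
      oldTokenCount: number; // how many old tokens the change replaced
      newTokenCount: number; // how many tokens the re-lex produced
    }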




Oh wow, cool!

I made a simple proof-of-concept realtime PEG parser a couple of years ago, which ingests text OT/CRDT operations ("insert at position X", etc.) and updates the compiler output by invalidating all overlapping ranges and recalculating from the root. My implementation is way slower than I expected it to be - though I'm sure you could use a lot of tricks from well-optimized parsers to speed it up. I agree - CRDT / OT / whatever is sort of orthogonal, although it's really nice being able to feed the parser the same operation format and have it update its output.

https://home.seph.codes/public/miniohm/ / https://github.com/josephg/miniohm if you're curious.
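For anyone curious what "invalidate all overlapping ranges" can look like, here's a minimal sketch (my own illustration, not miniohm's actual code) of a packrat-style memo table where cached results overlapping an edit are dropped and later entries are shifted:

    // One cached parse result: this rule, applied at this position,
    // consumed text up to `end`.
    interface MemoEntry {
      rule: string;
      start: number;
      end: number;
      result: unknown;
    }

    // Apply a character edit: drop every cached result whose consumed range
    // overlaps the edit, and shift entries after it by the length change.
    function applyEdit(memo: MemoEntry[], editStart: number, oldEditEnd: number,
                       lengthDelta: number): MemoEntry[] {
      return memo
        .filter(e => e.end <= editStart || e.start >= oldEditEnd)
        .map(e => e.start >= oldEditEnd
          ? { ...e, start: e.start + lengthDelta, end: e.end + lengthDelta }
          : e);
    }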

I'd love to see this sort of thing applied at a larger scale in a compiler. For example, I could imagine editing a C program with a compiler running live. Instead of batch-compiling artifacts to disk like it's 1970, as I type each character the compiler recompiles just the parts of my program that could be affected by my change. The compiler passes code changes to the linker (maybe with function-level granularity). And the linker could then (live) manage writes to an executable, allocating new byte ranges for changed functions and updating references on the fly.

Even on a very large C++ project like Chrome, there's no reason why incremental updates should take more than 1-2 ms. If all you're doing is changing a single function in the binary file, why do our linkers rewrite the whole thing?


> Instead of batch-compiling artifacts to disk like it's 1970 [...]

I am delighted to see people discussing this. Batch orientation made sense in a very resource-constrained era. But at this point we have more RAM and CPU than we know what to do with. It seems so obvious to me that the correct solution is to prioritize developer experience and keep everything hot.

My single biggest barrier to developing faster is the latency between making a change and seeing what that does to the test results. I would like that to be well under a second. Given the cost of programmer time, I could happily throw money at RAM and CPU. But standard, batch-oriented tools won't take advantage of it 99% of the time.

I think there's a revolution in developer experience out there, and I really want to see it happen.


The perfect place to start would be support in LLVM backends, maybe they are already working on this.


I don’t know that that’s true. I doubt you could do this sort of redesign simply in an LLVM backend, because a “hot” compiler would work differently at so many levels of abstraction. I think an easier place to start would be a self-contained compiler with a little less scope than LLVM. There we could figure out the patterns, find a clean set of internal APIs and make some sweet demos. And then, with that experience, approach the same task in LLVM.


I have done exactly this in the Markdown editor I developed, Stylo: https://www.textually.net if you're curious.

I used what I call "partial compilation": for each edit made in the Markdown editor, only the affected parts of the text are "re-compiled". A function is responsible for computing the ranges to be recompiled based on the edits made to the text. In Stylo this was absolutely essential, since CSS is used for syntax highlighting and a complete DOM is kept in memory for the entire Markdown text. Keeping the complete DOM in memory could have been avoided by not handling "following sibling" selectors, but I really wanted "complete" CSS support. Since Markdown can be used for texts ranging from 1 word to 2000 pages, complete recompilation was unthinkable, so a partial one, with some exceptions, was implemented.

The main difficulty with this kind of approach is that it's possible that the compilation, for whatever range was determined, does not stop there: the effect of the edit can go far beyond the determined range, and the range-computing function cannot know this without compiling... It can happen, for example, if you edit C code and type "{": all the scopes below are affected by this opened (and not yet closed) scope. In Markdown it can happen when you open a fenced code block (using the ``` opening fence). So you need a proper way to handle these edge cases. The other problem is handling the state associated with these regions, together with the non-terminating compilation case mentioned above.
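One way to handle that edge case is to keep recompiling block by block until the state at a block boundary matches what was recorded on the previous pass. A minimal sketch in TypeScript (my own illustration, not Stylo's code; `recompile` and the block/state shapes are assumptions):

    interface Block {
      text: string;
      startState: string; // parser state recorded at this block's boundary
    }

    // Assumed helper: parse one block starting in `state` and return the
    // state at its end (e.g. "in fenced code" vs "normal").
    declare function recompile(state: string, text: string): string;

    // Recompile from the first dirty block until states converge again, so an
    // opened ``` fence correctly propagates to every block below it.
    function recompileFrom(blocks: Block[], firstDirty: number): number {
      let state = blocks[firstDirty].startState;
      for (let i = firstDirty; i < blocks.length; i++) {
        state = recompile(state, blocks[i].text);
        const next = blocks[i + 1];
        if (next && next.startState === state) return i; // converged: stop here
        if (next) next.startState = state;               // change propagates further
      }
      return blocks.length - 1; // the edit affected everything to the end
    }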

All this, as I have discovered, makes this kind of compilation and its associated syntax highlighting really difficult to implement properly and fast enough.

In order to be fast, this process needed to be as asynchronous as possible, because there were some tiny but noticeable lags on the main thread. Even partial compilation took around 8 ms to 12 ms on my computer (an old MacBook Pro from 2012). But there are also some cases where asynchronous processing is not suitable; in those cases synchronous processing is used, but only when absolutely necessary. So a mechanism that allows switching between asynchronous and synchronous processing was needed.

All this to say: it's possible, but really difficult to do properly. Every bit of CPU power has to be extracted to make it comfortable to use. I can only imagine, given the complexity of C++, what a nightmare it would be to manage, but it should be possible with a proper compiler architecture that modularizes partial compilation and encapsulates the inherent complexity of this approach.


Thanks for writing this - it's really interesting to read, and I'm glad other people are trying this approach.

Is your markdown editor open source? I'd love to play with it.


"Even on very large C++ project like Chrome, there's no reason why incremental updates should take more than 1-2ms. If all you're doing is changing a single function in the binary file, why do our linkers rewrite the whole thing"

Because it requires N^3+ dataflow problems to track what is possibly changed with significant accuracy.


Does it?

Let’s say I edit a function, and my editor (attached at the hip to my compiler) knows which bytes have been changed. So it knows my change was isolated to a function in a .cpp file. The compiler has cached the compilation environment from the last build, and with that in mind it could recompile just this one function. Then we need to binary patch the function - which essentially requires allocating disk space inside the binary executable. If the previous compiler pass left some free space, we could allocate out of that and mark the old function memory as reusable. Then we need to update CALL pointers - either store them in a list or have every function do indirect calls through a table, and then you just need to update the table entry. If you want to get fancy you could do better than function level granularity, but I think that alone would give us sub-ms builds.
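A toy sketch of that indirect-call-table idea (purely illustrative; no real linker is organized quite like this): each function is reached through a table slot, so replacing its body means writing new bytes into free space and updating one slot.

    interface FnSlot { offset: number; size: number }

    interface Image {
      freeRanges: FnSlot[];           // holes left behind by earlier patches
      callTable: Map<string, FnSlot>; // function name -> where its body lives
      end: number;                    // current end of the binary image
    }

    function patchFunction(img: Image, name: string, newBody: Uint8Array,
                           write: (offset: number, bytes: Uint8Array) => void): void {
      // Allocate space: reuse a big-enough hole, otherwise grow the image.
      const hole = img.freeRanges.find(r => r.size >= newBody.length);
      let offset: number;
      if (hole) {
        offset = hole.offset;
        hole.offset += newBody.length;
        hole.size -= newBody.length;
      } else {
        offset = img.end;
        img.end += newBody.length;
      }
      write(offset, newBody);                                     // write the new body bytes
      const old = img.callTable.get(name);
      img.callTable.set(name, { offset, size: newBody.length });  // retarget all callers via the table
      if (old) img.freeRanges.push(old);                          // old body becomes reusable space
    }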

I wouldn’t want to change the process for release builds - but for incremental debug builds I don’t see where you’re getting an N^3 from.


Yes, it does.

"The compiler has cached the compilation environment from the last build, and with that in mind it could recompile just this one function."

First, this is actually the most trivial possible case, and even there, as you'll see, this is not correct.

Outside of a function, it does not handle changing types, changing access permissions in classes, changing between virtual and non-virtual, etc. It does not handle changing globals, things with attributes of various sorts, etc.

All of those require understanding the uses, or overestimating them.

As for whether function-local changes can have non-local effects, the answer is flatly "yes". You don't even have to be that creative, since you can do things like force template function output and other things that affect global state from a function in C++.

The assumption that because I changed text in one function, the possible changes are limited to that function, is not correct in C++ :)

There are languages in which it is, mind you, just C++ is 100% not one of them.

Honestly, I would strongly suggest you start reading papers on this; it's an incredibly well-studied area, and the costs are well known.

If it was easy and cheap, we'd still be doing it ;)

It used to be a pretty standard feature of editing environments for things like LISP, etc.


Do you have any recommendations for places to start reading? I didn't realise it was a well studied area - naively it seems like it would be possible in many languages. It would certainly be trivial for 90% of the transpiled / polyfilled stuff that we do in modern typescript / javascript.

I completely agree that doing this to live-update an optimized build would be extremely expensive and difficult. And I'll happily acknowledge that (for example) adding or moving fields in a struct would be hard to implement.

But adding a compiled specialization of a template function / class should be "easy" to handle. It's non-local, but I can imagine a dependency tree of the functions needed for the compilation. On update, we can figure out a diff of that tree, free all the now-unused function bodies in the binary, and then add (compiling if necessary) all the added / updated functions.
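A sketch of that diffing step (an illustration of the idea, not an existing tool): compute the set of functions reachable from the roots before and after the edit, compile what's newly needed, and free what's no longer referenced.

    // Walk the call/instantiation graph from the roots (e.g. main and exported
    // symbols) and collect every function the build needs.
    function closure(roots: string[], deps: Map<string, string[]>): Set<string> {
      const seen = new Set<string>();
      const stack = [...roots];
      while (stack.length > 0) {
        const fn = stack.pop()!;
        if (seen.has(fn)) continue;
        seen.add(fn);
        for (const callee of deps.get(fn) ?? []) stack.push(callee);
      }
      return seen;
    }

    // Diff the two closures. (Functions whose source changed but stayed in the
    // closure also need recompiling; that set comes from the editor's edit ranges.)
    function diffBuild(oldDeps: Map<string, string[]>, newDeps: Map<string, string[]>,
                       roots: string[]) {
      const before = closure(roots, oldDeps);
      const after = closure(roots, newDeps);
      return {
        toCompile: [...after].filter(f => !before.has(f)), // newly needed bodies
        toFree: [...before].filter(f => !after.has(f)),    // bodies no longer needed
      };
    }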

C++ might be one of the worst languages to implement this for - .NET, JVM, typescript or Go might all be much easier. But yeah, if this has been extensively looked into, I'd love to learn why we don't do it for modern development. It seems like an obvious win, and if we don't do it I hope it's for good reasons.


But this is a hard thing to do: the compiler code will have to be organized in a different way, and the language should probably be designed specifically towards this end. And the compiler code will be more brittle for sure, and will probably be slower for batch compiles.

Most languages do already support reasonably fine-grained recompiles (at the module level); it's often just the linkers that are slow (I keep hearing from people that this need not be the case). And some of them actually already do something called "incremental linking".

Even C compilers regularly compile 100K+ LOC/second, so I think there's not much to be gained here. Yes, complicated C++ can be slow to compile. Maybe we shouldn't write complicated C++ code to begin with? In the end, batch compile performance also matters.


Very interesting and informative. But I suppose what the user really wants is the next step: not just incremental parsing, but also incremental compilation (or at least the error-checking part of the compilation process).


The basic error checking comes with incremental parsing. A lot of semantic error checking is also not that bad to make incremental (though some languages are just really really bad here).

Incremental compilation depends on the constraints of the environment (i.e., are you trying to incrementally produce normal object files, or something else?).

The more you relax the environmental constraints the easier it gets.

It also depends on just how non-local an effect the programming language allows. For most programming languages, general incremental compilation requires either strongly overestimating possibly-modified sets, or global interprocedural dataflow + pointer analysis.

Most choose the former, or don't do general incremental compilation at all, but instead something like "edits to functions are incremental, edits to global data/types/etc are not".

(This is even without compiler optimization on).


The Skip language (which I mentioned in another comment) is a language based entirely on incremental computation. Every piece of code you write in it becomes incremental, as far as I know.

AND the compiler is self-hosted!!! So that means they have a fully incremental compiler.

Although there is an LLVM back end, so maybe only the front end of the compiler is incremental? That's still important. It comes from the same people as the Hack language, which I believe has a fast "online" typechecker -- which I think is incremental (?)

I believe you could write a C compiler in Skip the same way?

If you peek at the source code, it has a 6000-line recursive descent parser and a 6000-line type checker, much in the style of OCaml (but in Skip).

I think this deserves a blog post because it's not clear to me that the message of Skip got through (7 months ago on HN). Their website has some examples, but not too many.

Of course, I haven't tried it yet, and it's a super ambitious project, so there could be some big problem with it.

But if the goal is to make an entire suite of compiler and devtools that are incremental, then Skip seems like it's ahead of the state of the art and something to borrow from.

Salsa was also mentioned, and I believe it's based on some of the same academic work like Adapton. But I don't know if they have an entire self-hosted compiler as a demo like Skip does. It seems to be "query-oriented" rather than supporting arbitrary code, which is what I would expect for most incremental computing frameworks. Skip seems to be more ambitious.

https://github.com/salsa-rs/salsa

I believe that the Skip language does this, because its compiler is self-hosted. See http://skiplang.com/blog/2017/01/04/how-memoization-works.ht...

https://github.com/skiplang/skip/tree/master/src/frontend

Skip – A programming language to skip the things you have already computed https://news.ycombinator.com/item?id=18077612

https://github.com/skiplang/skip/blob/master/src/frontend/sk...


This is a great comment!


Yeah I think I get what you're saying -- basically that if you have incremental algorithms then you don't need to take a dependency on CRDT or OT to make the UI updates fast?

In other words, you can do things synchronously BUT still quickly, since you have an incremental algorithm. In any case, I basically asked that same question here:

https://lobste.rs/s/jembrx/thoughts_on_why_crdt_didn_t_work_...

I've seen Treesitter and am impressed by it, which I wrote a little about in that comment.

-----

Since you speak of "sane" languages, I wrote a parser for an "insane" language by hand, with a principled regex-based lexer:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

In fact, the first thing I tried was to use ANTLR to translate the POSIX shell grammar (which only covers about 1/4 of the shell language). This drove home the limitations of grammar-based approaches. I found its model of lexing to be very limited (along with ANTLR v4's insistence on materializing a parse tree rather than providing semantic actions).

So it's good to hear that it's being updated for "context-aware lexing", and incremental lexing. Are you using ANTLR for IDE-like tools? I wasn't aware that it was good or commonly used for that kind of use case.

-----

I briefly talked with the author of Treesitter on Github about his bash syntax highlighter with Treesitter, which is of course a parser and not just a lexer, and can be made incremental.

I wasn't sure if the context-aware lexing mechanisms here could really handle the shell language:

https://tree-sitter.github.io/tree-sitter/creating-parsers#l...

I think his response was basically that it doesn't have to be 100% correct, which I agree with for an editor. Trading off some corner cases for speed is probably the right thing.

But I have thought about what it would take to make a parser for a "crazy" language incremental and 100% correct (which you would want for further semantic analysis).

I wonder if you could write a recursive descent parser in the Skip language and then get incrementality for free? The language is designed so that every algorithm can be made incremental, and it's not limited to pure functional constructs like other systems. Supposedly it goes to some lengths to provide you with a more imperative / mutable state model, so it should be able to express an "arbitrary" parser.

http://skiplang.com/

In other words, instead of having different algorithms for making LL(*) and GLR parsing incremental, Skip might give you a more expressive model for the parser. A bunch of my blog posts are about the limitations of grammar-based approaches.

-----

FWIW the style of context-aware lexing I use is explained here:

When Are Lexer Modes Useful? http://www.oilshell.org/blog/2017/12/17.html

And the other post lists the 14 different modes or contexts that the shell front end has:

http://www.oilshell.org/blog/2019/02/07.html#toc_7

It's a very simple and powerful mechanism, but my parser only cares about batch mode for now. It's not clear to me how it differs in expressiveness from what Treesitter offers (or ANTLR). The current lexer mode basically depends on the control flow of the parser.
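To make the mechanism concrete, here is a tiny sketch of parser-driven lexer modes in TypeScript (hypothetical rules and names, not Oil's actual code): the same characters lex differently depending on which mode the parser's control flow has put the lexer in.

    type LexMode = 'Outer' | 'DQ' | 'Arith';

    // A few illustrative rules per mode; a real shell lexer has many more.
    const RULES: Record<LexMode, Array<[RegExp, string]>> = {
      Outer: [[/^[A-Za-z0-9_]+/, 'Word'], [/^"/, 'DQuote'], [/^\s+/, 'Space']],
      DQ:    [[/^[^"$]+/, 'Literal'], [/^\$[A-Za-z_]+/, 'VarSub'], [/^"/, 'DQuote']],
      Arith: [[/^[0-9]+/, 'Number'], [/^[-+*\/]/, 'Op'], [/^\s+/, 'Space']],
    };

    // The parser chooses the mode from its own control flow, e.g. it switches
    // to 'DQ' after consuming a DQuote token and back to 'Outer' at the close.
    function nextToken(src: string, pos: number, mode: LexMode):
        { kind: string; text: string } | null {
      for (const [re, kind] of RULES[mode]) {
        const m = re.exec(src.slice(pos));
        if (m) return { kind, text: m[0] };
      }
      return null; // no rule matched in this mode
    }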




