Hacker News new | past | comments | ask | show | jobs | submit login

Hm I don't really think RE2 is even that influential (unfortunately IMO). Rust is the only new language whose engine borrows its design AFAIK. For example, I was disappointed to learn that Julia uses PCRE.

----

I think backtracking is firmly in the past because it causes too many problems in practice [1], but I think there's scope for a text matching engine that's built on top of regular languages, but more expressive.

As mentioned here and on lobste.rs recently, regular languages aren't expressive enough for syntax highlighting (or lexing most programming languages in their entirety).

But it doesn't take much to make them more expressive, WHILE meaning these properties:

1. linear-time and constant space

2. Composition by OR

Backtracking fails at both properties. Backtracking is bad. Like you say with backreferences, there should be something more expressive than regular languages that maintains both properties (and it's likely to be useful).

Example of tiny mechanism on top of regular languages:

When Are Lexer Modes Useful? http://www.oilshell.org/blog/2017/12/17.html

https://lobste.rs/s/m24zv1/xi_editor_retrospective#c_p3zadg

-----

Here's an implementation idea for such a text matching engine:

(1) Start with re2c, because it has some different scaling properties than other engines https://news.ycombinator.com/item?id=23665569

The compile time and code size are linear with respect to the number of patterns. What was very surprising to me is that the compiled C code starts off slower than interpreters, but scales better as you increase the number of patterns.

(2) Write a WebAssembly backend for it. Compiling its C output to WebAssembly is not likely to be great, because it uses gotos on every line, and gotos in WebAssembly need the Relooper algorithm last time I checked.

(3) Now add a linear-space constant-time VM on top. As a test of this property, it should be able to express LL(1) parsing and LALR(1) parsing. Something like "lexer modes" should also fall out trivially.

I believe this strategy saves a bunch of work, and has some desirable properties:

- an entire RE front end, handling captures and such in DFAs: https://arxiv.org/abs/1907.08837 . I'm sure re2c is slower than Hyperscan, but as you hint, I think it's also more general and widely used.

- wasm could be more palatable to embed than other VMs. There are multiple implementations of it.

- wasm avoids dealing with C code and assembly code. It has a few high quality JITs.

- The first version of wasm has a ton of limitations (lack of GC, rich interfaces to the host), but this application hits none of them. It's a perfect problem for it.

(I'm also pretty sure this hasn't been done, but I could be wrong. I googled "re2c wasm" and found that it is used in some wasm tools, but I think that's unrelated: https://chromium.googlesource.com/external/github.com/WebAss... )

And it's also significantly different than RE2 or rust/regex, if that's your complaint :)

If any of that is interesting feel free to contact me at any method listed here: http://www.oilshell.org/ (email, zulip, etc.)

-----

[1] https://lobste.rs/s/m24zv1/xi_editor_retrospective#c_pyfpv1




Actually one more thing I realized about this idea that would be wild...

Since re2c itself is written in C, it can be easily compiled to wasm (even if it has a few gotos). So then you get an engine that can compile user patterns at runtime, rather than having it be an offline process.

(I don't really think there is that much difference in constraints between the offline and online processes, as some of have suggested. For example I observed that re2c compile time is linear with respect # constants strings in a large range in that benchmark.)

I have been thinking of embedding wasm in Oil, mainly to do reasonably fast string transformations. Because one goal of Oil is to provide an alternative to extremely tortured string transformation syntax like: https://github.com/dylanaraps/pure-bash-bible

And Oil has the eggex language, that compiles to egrep syntax, and uses libc's engine.

But it would be cool if it could compile to DFAs with an embedded re2c compiler.

Then again, I was surprised that the compiled C code was not faster than an intepreter for small cases, so maybe this whole thing is too complex for the benefit... I guess I would have to do some up front benchmarks to see if this is even worth it.

It could be that you need an extremely fast wasm engine to make it worthwhile, and all of those are tied to big JS engines and their very non-portable JITs.

[1] https://johnwickerson.wordpress.com/2020/05/21/diagrams-for-...

----

edit: found some related comments:

https://news.ycombinator.com/item?id=18553393

Surprise, the "totally shocking" result there is due to BACKTRACKING :) Automata-based regex in wasm beats native browser regex because it doesn't backtrack.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: