Yeah the "optimize for years" part is interesting... Supposedly the derivatives technique (re-popularized by a 2009 paper) will build a more optimal DFA directly, rather than building the NFA first, converting to DFA, and then optimizing the DFA.
I put a bunch of links and quotes about that here, including nascent implementations:
About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?).
If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc but there are a couple reasons why we might have our own (globstar and extended globs)
Trivia: extended globs in bash give globs the power of regular expressions, e.g.
[[ abcXabcXXabcabc == +(abc|X) ]] ; echo $?
0
where +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax, and == is very confusingly the fnmatch() operator.
The derivatives approach makes Unicode support easier since its able to keep the symbols sets for each transition edge (in the DFA) more compact by virtue of supporting negation. If you add in aggressive term-normalization, hash-consing, and an efficient dense-set implementation (all of which I’ve done in my implementation), the derivatives approach can be extremely fast, even when generating the DFA for something like the lexer of a full programming language (in my case, F#).
Very cool! And thanks for the reminder about Unicode. I think supporting union and intersection is also somewhat unique to the derivatives method, and also related? (Although I think there are really 2 derivatives methods: Brzozowski and Antimirov)
What happens if you don't do the optimizations? Does the DFA blow up in size, meaning the compile time is large? Or does it make for a slower runtime? I would expect most DFAs to run at about the same speed, unless they are really huge...
I'd be interested in any rough ideas about performance, e.g. how fast a realistic lexer+parser is, maybe in lines/ms.
It does look like the code is pretty short -- a large part of it is an AVL tree library I guess for hash consing?
I'm interested in any downsides of the derivatives technique vs. the NFA->DFA method. I feel like regex compile time shouldn't matter for many applications, and most DFAs will run in the same speed, which only leaves runtime memory usage (or code size for generating F# code like you appear to be doing).
* Normalization: this is where "smart constructors" come in handy; having a normal form for the terms allows the caching to work better. This also impacts the compactness of the generated DFA.
* Hash-consing: this turns structural equality (in this case) to a simple pointer equality; applied recursively, this makes it much faster to compare two terms for equality, and overall speeds up the DFA generation by a non-trivial amount (I forget the exact numbers, but it was significant).
* Dense set implementation: The AVL tree-based data structure in the facio/Reggie code is an implementation of the Discrete Interval Encoding Tree (DIET) data structure from "Diets for fat sets" and "More on Balanced Diets" papers.
Note the optimizations I've mentioned here impact the performance of generating the DFA. Once you have the DFA, it'll run at the same speed as one generated in any other way. Part of the motiviation for my writing this library was to learn about regex/DFAs/grammars, but also to try to improve on the performance of fslex/fsyacc at the time. Using this library, the FSharpLex tool can generate the DFA for the full F# language grammar in well under 1 sec; the code generation takes a bit longer, largely due to having to convert the DFA into a different form for backwards-compatibility with fslex.
Overall, I feel like the derivatives technique is generally better and simpler, and I'm not aware of any real downsides. The only one that comes to mind is if you're wanting to implement things like backreferences and capture groups -- those obviously make the implementation (of the DFA) more complicated, and there's a lot less literature on it (last I saw, maybe only one or two papers on implementing those features on top of a derivatives-based regex engine).
Great info, thanks! I think derivatives could be very suited for a compact implementation of POSIX regular expressions. You need to handle unicode classes but not backreferences!
Although it does seem more suited for functional languages for sure, whereas I basically only have a C runtime.
but I think it's actually being a bit pedantic, i.e. if "almost all POSIX implementations are buggy" then applications don't rely on that exact semantic (they probably rely on the buggy one, if anything ...)
I put a bunch of links and quotes about that here, including nascent implementations:
http://www.oilshell.org/blog/2020/07/ideas-questions.html
Also related: http://www.oilshell.org/blog/2020/07/eggex-theory.html
About Unicode, this derivatives project (with video linked in the post) appears to be motivated by Unicode support (though I don't recall exactly why, something about derivatives makes it easier?).
https://github.com/MichaelPaddon/epsilon
https://github.com/MichaelPaddon/epsilon/blob/master/epsilon...
If anyone wants to write a glob engine for https://www.oilshell.org/ let me know :) Right now we use libc but there are a couple reasons why we might have our own (globstar and extended globs)
Trivia: extended globs in bash give globs the power of regular expressions, e.g.
where +(abc|X) is equivalent to (abc|X)+ in "normal" regex syntax, and == is very confusingly the fnmatch() operator.