I think it's the same implementation described here: http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea... (not 100% sure as I lost my copy of Beautiful Code :()
EDIT: I see the author links to the article at the beginning of the post. Still, I missed this on my first reading, so I think posting the link here is still worthwhile. Especially because the translation to JS kind of misses the point - the beauty of the Rob's implementation comes from recursion and pointers and JS lacks the latter.
https://swtch.com/~rsc/regexp/regexp1.html (previously discussed at https://news.ycombinator.com/item?id=466845)
That's precisely where the exponential behaviour comes from; consider e.g. matching the pattern "a?a?a?aaa" against "aaa". It will try matching "a?" against the first "a", which succeeds, leading to a recursive call to match "a?a?aaa" with "aa". That eventually fails, so it tries matching "a?a?aaa" against "aaa"; and inside those two branches, it also splits into two depending on whether to match the "a?", etc. The result is, for each "a?" and "a" added, the total amount of work involved in matching doubles, so it is exponential.
"considerable simplification" sounds like it's kinda regex. It's not even basic regular expressions/regular languages.
The author reimplements code from https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which acknowledges that the implemented subset does not match all classes but they propose that the classes they can parse, are already useful. The author of the new article makes no such acknowledgement that their implementation just represents a subset of regular languages.
Of course, the regular languages themselves are also just a subset of regexp, which can match more languages due to constructs like references.
Bad side is that luarocks works fine on Windows only if no build step is involved, otherwise you’re doomed to mess with mingw/msys/msys2 environment that isn’t well-supported by third-parties; often not supported at all. It is not Lua’s fail, but it happens. New complicated build systems like CMake only make things worse since you cannot simply guess flags and gcc .c together anymore.
Edit: this is also true for all languages except maybe perl that includes full mingw system with it. Idk why some package managers do not prebuild windows packages on server-side. Windows actually does a lot to maintain binary compatibility.
The most obnoxious part is backreferences. The atom \3 is a backreference if the whole regexp contains at least 3 capture groups; otherwise it is an octal (!) escape for char code 3. But you don't know how many capture groups there are until you're done parsing. This is why JS regexp parsers sometimes must make two passes!
And my implementation of the same in scala (40 lines if you ignore some niceties, and its terribly fast asymptotically) https://gist.github.com/yellowflash/826004277874cadabbc502e6...
For TLDR on the paper, It slowly builds an abstraction and implementation on regex engine which runs on O(mn) where m is length of the regex and n - length of the text. Then they generalize it to do grouping and even extend it to match context free grammar (using lazy evaluation mostly).
The “match” function. And yes, this is ugly and needs to be cleaned up.
Glad to see that come up in this thread. It was the first clearly explained regex engine for me in _Software Solutions in C_ (Schumacher D., Academic Press, 1994). I still have a copy and the original disk (and disk image). If anyone is interested I could scan that chapter tonight.
For that you need to be able to build an equivalent to the regular language expression
Once you add captures and maybe even backreferences, you get a whole new world of weird :-)