Hacker News new | comments | show | ask | jobs | submit login
Building a Regex Engine in Fewer Than 40 Lines of Code (nickdrane.com)
148 points by tambourine_man 9 months ago | hide | past | web | favorite | 34 comments



In the "Beautiful Code" book there is a chapter, I think from Rob Pike, which presents a bare-bones regex implementation in C. It doesn't implement alternatives or grouping if I remember correctly, but the implementation is breathtakingly beautiful and not any longer than this one.

I think it's the same implementation described here: http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea... (not 100% sure as I lost my copy of Beautiful Code :()

EDIT: I see the author links to the article at the beginning of the post. Still, I missed this on my first reading, so I think posting the link here is still worthwhile. Especially because the translation to JS kind of misses the point - the beauty of the Rob's implementation comes from recursion and pointers and JS lacks the latter.


Literally the first sentence in the posted article mentions this:

I stumbled upon an article the other day where Rob Pike implements a rudimentary regular expression engine in c. I converted his code to Javascript and added test specs so that someone can self-guide themselves through the creation of the regex engine. The specs and solution can be found in this GitHub repository.


If you liked that chapter, check out some recent ideas of his here: https://www.youtube.com/watch?v=HxaD_trXwRE.


It's small, but unfortunately due to how ? is implemented with recursive backtracking (look at how matchQuestion() tries both alternatives), has an exponential worst-case runtime. Fortunately, the algorithm to do it in linear time is pretty simple too:

https://swtch.com/~rsc/regexp/regexp1.html (previously discussed at https://news.ycombinator.com/item?id=466845)


I actually don't think this is the case (I definitely might be wrong). Although the matchQuestion function has the pattern that typically resembles a function of exponential runtime (where a function invokes itself recursively multiple times), there is a slight difference in our scenario. If you look at matchQuestion's invocations of match, you'll notice that on both sides of the OR, the pattern is stripped of two characters (the "_?"). This means that the recursive invocation of match will never invoke matchQuestion a second time, unless there is a second '?', in which case it's entirely appropriate.


unless there is a second '?', in which case it's entirely appropriate.

That's precisely where the exponential behaviour comes from; consider e.g. matching the pattern "a?a?a?aaa" against "aaa". It will try matching "a?" against the first "a", which succeeds, leading to a recursive call to match "a?a?aaa" with "aa". That eventually fails, so it tries matching "a?a?aaa" against "aaa"; and inside those two branches, it also splits into two depending on whether to match the "a?", etc. The result is, for each "a?" and "a" added, the total amount of work involved in matching doubles, so it is exponential.


Here is the reference implementation for Russ Cox's work:

https://github.com/google/re2


It have been a few years since I read the code, but it was so beautifully written that I cannot imagine it was screwed up in the meantime: Edi Weitz' CL-PPCRE[1] is a beautiful implementation in CL and highly recommended if one understands one aspect (CL or Perl compatible regular expressions) and wants to learn the other one. IIRC he even discovered some bugs in the original perl implementation while creating this library.

[1] https://github.com/edicl/cl-ppcre


This does not implement grouping "()" or alternatives "|". Hence, looping is only required on individual characters. This is a considerable simplification over full regexp.


Which means that this CAN NOT BE USED TO ACCEPT REGULAR LANGUAGES. The language (a|b)*, any number of as or bs in any order, can not be descibed using that parser.

"considerable simplification" sounds like it's kinda regex. It's not even basic regular expressions/regular languages.

Edit: The author reimplements code from https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which acknowledges that the implemented subset does not match all classes but they propose that the classes they can parse, are already useful. The author of the new article makes no such acknowledgement that their implementation just represents a subset of regular languages.

Of course, the regular languages themselves are also just a subset of regexp, which can match more languages due to constructs like references.


I really hated that Lua's "pattern" [1] is never a regular expression nor a regular language. So annoying that it will be better not to have it.

[1] https://www.lua.org/manual/5.3/manual.html#6.4.1


Lua is not a language to be used as is (and it isn’t in practice). Simple pattern matches are there for internal needs like constructing package paths, but one can luarocks install any regex implementation at will.

Bad side is that luarocks works fine on Windows only if no build step is involved, otherwise you’re doomed to mess with mingw/msys/msys2 environment that isn’t well-supported by third-parties; often not supported at all. It is not Lua’s fail, but it happens. New complicated build systems like CMake only make things worse since you cannot simply guess flags and gcc .c together anymore.

Edit: this is also true for all languages except maybe perl that includes full mingw system with it. Idk why some package managers do not prebuild windows packages on server-side. Windows actually does a lot to maintain binary compatibility.


In Lua, LPeg[1] is my goto for anything that can't be done/gets too complex with patterns.

[1] http://www.inf.puc-rio.br/~roberto/lpeg/


I've written a JS regexp parser and engine. It did not fit in 40 lines.

The most obnoxious part is backreferences. The atom \3 is a backreference if the whole regexp contains at least 3 capture groups; otherwise it is an octal (!) escape for char code 3. But you don't know how many capture groups there are until you're done parsing. This is why JS regexp parsers sometimes must make two passes!


FWIW back references mean you're way outside of "regular languages" so e.g. DFA usually don't support them.


Regular expressions are truly elegant. If the regex engine is built in a functional (compositional style), it is even more elegant. This particular Functional pearl is my favorite, http://sebfisch.github.io/haskell-regexp/regexp-play.pdf

And my implementation of the same in scala (40 lines if you ignore some niceties, and its terribly fast asymptotically) https://gist.github.com/yellowflash/826004277874cadabbc502e6...

For TLDR on the paper, It slowly builds an abstraction and implementation on regex engine which runs on O(mn) where m is length of the regex and n - length of the text. Then they generalize it to do grouping and even extend it to match context free grammar (using lazy evaluation mostly).



That does indeed implement grouping and alternative


It's nicely coded but leaves out an important part of that algorithm: the regex derivatives never get simplified for comparison, so equal states appear different, so the set of states blows up as you march along the string. If you're OK with an inefficient matcher like that, then a backtracking algorithm probably gives you even simpler code.


Depending on how it’s being counted, I have a regex engine in around 30 lines [1] (the parser is longer). It handles branching, grouping, etc. and it’s run time is proportional to the length of the string being searched (I.e. no infinite loops on certain patterns, etc.).

[1] https://github.com/jason-johnson/frobo/blob/master/src/Text/...

The “match” function. And yes, this is ugly and needs to be cleaned up.


A bit longer than 40 lines, but back in the day, I was a big fan of the simplicity and clarity of Henry Spencer's regex code: https://github.com/garyhouston/regexp.old


> Henry Spencer's regex code: https://github.com/garyhouston/regexp.old

Glad to see that come up in this thread. It was the first clearly explained regex engine for me in _Software Solutions in C_ (Schumacher D., Academic Press, 1994). I still have a copy and the original disk (and disk image). If anyone is interested I could scan that chapter tonight.


I looked all over for a copy. I'd love a scan. Is the disk 3.5 or 5.25?


Hi, I'm the original author of this article. I just want to say thank you for all the positive and constructive comments. I'm happy to answer questions that anyone has.


I would suggest that since it lacks grouping, like one other commenter pointed out, it's not a regular expression engine, it only implements a useful subset which is not a regular expression language.

For that you need to be able to build an equivalent to the regular language expression

    (x|y)*


Nice, remind me of a custom Regex engine i built some years ago in JS while trying to build a fast & small lexer, it does not support ? but +, - and sub-rules, it is around 60 LOC if I remind, mostly built by "accident", it work with JSON as input and output a finite state automaton (compiled version ?)

https://github.com/grz0zrg/jsb/blob/master/lexer.js


In my experience, it's the edge cases that are a pain when implementing a regex engine. Think about weird regexes like "^*", "$?" (or any quantified zero-width match), anchors that appear in the middle of regexes, nested quantifiers (don't make much sense, but you need to generate good error messages).

Once you add captures and maybe even backreferences, you get a whole new world of weird :-)


You can straight away drop any repetition of a zero-width assertion, no?


You can, but you need to drop the entire zero-width assertion if the repetition allows zero occurrences.


I did that with grouping (), alternatives |, and classes [], for Asterisk some decades ago. They had a very special phonenumbering syntax. Less than 100 lines in C. In lisp it's trivial, less then 10 lines for a simple matcher.



I'd almost admire how convenient it was that the language that was used ships regexes already.


To be fair, he doesn't use the builtin regex support in his 40 LOC.


TBH, that's exactly why I can't technically admire it. :)




Applications are open for YC Winter 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: