
Building a Regex Engine in Fewer Than 40 Lines of Code - tambourine_man
https://nickdrane.com/build-your-own-regex/
======
klibertp
In the "Beautiful Code" book there is a chapter, I think from Rob Pike, which
presents a bare-bones regex implementation in C. It doesn't implement
alternatives or grouping if I remember correctly, but the implementation is
breathtakingly beautiful and not any longer than this one.

I think it's the same implementation described here:
[http://www.cs.princeton.edu/courses/archive/spr09/cos333/bea...](http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html)
(not 100% sure as I lost my copy of Beautiful Code :()

EDIT: I see the author links to the article at the beginning of the post.
Still, I missed this on my first reading, so I think posting the link here is
still worthwhile. Especially because the translation to JS kind of misses the
point - the beauty of Rob's implementation comes from recursion and
pointers, and JS lacks the latter.

~~~
13throwaway37
Literally the first sentence in the posted article mentions this:

> I stumbled upon an article the other day where Rob Pike implements a
> rudimentary regular expression engine in c. I converted his code to Javascript
> and added test specs so that someone can self-guide themselves through the
> creation of the regex engine. The specs and solution can be found in this
> GitHub repository.

------
userbinator
It's small, but unfortunately, because ? is implemented with recursive
backtracking (look at how matchQuestion() tries both alternatives), it has an
_exponential_ worst-case runtime. Fortunately, the algorithm to do it in
linear time is pretty simple too:

[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)
(previously discussed at
[https://news.ycombinator.com/item?id=466845](https://news.ycombinator.com/item?id=466845))
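
For illustration, here is a minimal Python sketch of the state-set idea behind that approach. It is not the code from the linked article: it assumes a tiny pattern language (literals, ".", "?" and "*"), whole-string matching, and hypothetical helper names (parse, closure, full_match). Instead of backtracking, it tracks every pattern position that could be active after each input character, so the work is bounded by (pattern length) x (text length).

```python
def parse(pattern):
    """Split a pattern into (char, quantifier) tokens; quantifier is '', '?' or '*'."""
    tokens = []
    i = 0
    while i < len(pattern):
        quant = ""
        if i + 1 < len(pattern) and pattern[i + 1] in ("?", "*"):
            quant = pattern[i + 1]
        tokens.append((pattern[i], quant))
        i += 2 if quant else 1
    return tokens


def closure(tokens, states):
    """Add every position reachable by skipping optional ('?' or '*') tokens."""
    result = set(states)
    # Visiting positions in increasing order handles chains such as a?b?c?.
    for s in range(len(tokens)):
        if s in result and tokens[s][1] in ("?", "*"):
            result.add(s + 1)
    return result


def full_match(pattern, text):
    """Return True if the whole text matches the whole pattern."""
    tokens = parse(pattern)
    states = closure(tokens, {0})
    for ch in text:
        advanced = set()
        for s in states:
            if s < len(tokens) and tokens[s][0] in (".", ch):
                # A starred token may repeat, so stay on it; otherwise move on.
                advanced.add(s if tokens[s][1] == "*" else s + 1)
        states = closure(tokens, advanced)
        if not states:
            return False
    return len(tokens) in states
```

With this sketch, full_match("a?" * 25 + "a" * 25, "a" * 25) returns quickly, because each input character touches at most one set entry per pattern position.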

~~~
zoner14
I actually don't think this is the case (I definitely might be wrong).
Although the matchQuestion function has the pattern that typically resembles a
function of exponential runtime (where a function invokes itself recursively
multiple times), there is a slight difference in our scenario. If you look at
matchQuestion's invocations of match, you'll notice that on both sides of the
OR, the pattern is stripped of two characters (the "_?"). This means that the
recursive invocation of match will never invoke matchQuestion a second time,
unless there is a second '?', in which case it's entirely appropriate.

~~~
userbinator
_unless there is a second '?', in which case it's entirely appropriate._

That's precisely where the exponential behaviour comes from; consider e.g.
matching the pattern "a?a?a?aaa" against "aaa". It will try matching "a?"
against the first "a", which succeeds, leading to a recursive call to match
"a?a?aaa" with "aa". That eventually fails, so it tries matching "a?a?aaa"
against "aaa"; and inside those two branches, it also splits into two
depending on whether to match the "a?", etc. The result is, for each "a?" and
"a" added, the total amount of work involved in matching doubles, so it is
exponential.
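
A rough way to see this is to count the recursive calls. The sketch below is a Python port written only for this comment, not the article's code: count_match_calls is a hypothetical name, and it assumes whole-string matching with just literals, "." and "?".

```python
def count_match_calls(pattern, text):
    """Backtracking matcher for literals, '.' and '?'.

    Returns (matched, number_of_recursive_calls).
    """
    calls = 0

    def match_here(p, t):
        nonlocal calls
        calls += 1
        if p == len(pattern):
            return t == len(text)  # anchored at both ends for simplicity
        if p + 1 < len(pattern) and pattern[p + 1] == "?":
            # First try consuming one character for the optional item...
            if (t < len(text) and pattern[p] in (".", text[t])
                    and match_here(p + 2, t + 1)):
                return True
            # ...then fall back to skipping it. Both branches recurse.
            return match_here(p + 2, t)
        if t < len(text) and pattern[p] in (".", text[t]):
            return match_here(p + 1, t + 1)
        return False

    matched = match_here(0, 0)
    return matched, calls


if __name__ == "__main__":
    for n in (3, 6, 9, 12):
        matched, calls = count_match_calls("a?" * n + "a" * n, "a" * n)
        print(n, matched, calls)
```

Running it prints a call count that grows much faster than n, since every "a?" is first tried as consuming an "a" and that whole subtree has to fail before the skip branch is explored.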

------
jlg23
It has been a few years since I read the code, but it was so beautifully
written that I cannot imagine it was screwed up in the meantime: Edi Weitz'
CL-PPCRE[1] is a beautiful implementation in CL and highly recommended if one
understands one aspect (CL or Perl compatible regular expressions) and wants
to learn the other one. IIRC he even discovered some bugs in the original Perl
implementation while creating this library.

[1] [https://github.com/edicl/cl-ppcre](https://github.com/edicl/cl-ppcre)

------
lindig
This does not implement grouping "()" or alternatives "|". Hence, looping is
only required on individual characters. This is a considerable simplification
over a full regexp engine.

~~~
lifthrasiir
I really hate that Lua's "pattern" [1] is neither a regular expression nor a
regular language. It is so annoying that it would be better not to have it.

[1]
[https://www.lua.org/manual/5.3/manual.html#6.4.1](https://www.lua.org/manual/5.3/manual.html#6.4.1)

~~~
wruza
Lua is not a language to be used as is (and it isn’t in practice). Simple
pattern matches are there for internal needs like constructing package paths,
but one can luarocks install any regex implementation at will.

The downside is that luarocks works fine on Windows only if no build step is
involved; otherwise you’re doomed to mess with a mingw/msys/msys2 environment
that isn’t well supported by third parties, and often not supported at all. It
is not Lua’s fault, but it happens. New complicated build systems like CMake
only make things worse, since you can no longer simply guess the flags and gcc
the .c files together.

Edit: this is also true for all languages except maybe Perl, which includes a
full mingw system with it. Idk why some package managers do not prebuild Windows
packages on the server side. Windows actually does _a lot_ to maintain binary
compatibility.

------
ridiculous_fish
I've written a JS regexp parser and engine. It did not fit in 40 lines.

The most obnoxious part is backreferences. The atom \3 is a backreference if
the whole regexp contains at least 3 capture groups; otherwise it is an octal
(!) escape for char code 3. But you don't know how many capture groups there
are until you're done parsing. This is why JS regexp parsers sometimes must
make two passes!
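
To make the two-pass point concrete, here is a simplified Python sketch. The function names are made up for this comment, and it deliberately ignores details a real JS parser must handle (for example, it treats every "(?" as non-capturing, which is wrong for named groups, and it ignores Unicode mode).

```python
def count_capture_groups(pattern):
    """First pass: count capturing '(' while skipping escapes and [...] classes."""
    count = 0
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the escaped character
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(" and pattern[i + 1:i + 2] != "?":
            # Simplification: any "(?" is assumed to be non-capturing.
            count += 1
        i += 1
    return count


def classify_decimal_escape(pattern, number):
    """Second pass decision for an escape such as \\3 inside the given pattern."""
    if number <= count_capture_groups(pattern):
        return "backreference"
    return "octal escape"
```

The forward-reference case shows why one pass is not enough: in the pattern \3(a)(b)(c) the escape appears before any group has been seen, yet classify_decimal_escape reports a backreference because the count is taken over the whole pattern.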

~~~
masklinn
FWIW back references mean you're way outside of "regular languages", so e.g.
DFA-based engines usually don't support them.

------
yellowflash
Regular expressions are truly elegant. If the regex engine is built in a
functional (compositional) style, it is even more elegant. This particular
functional pearl is my favorite:
[http://sebfisch.github.io/haskell-regexp/regexp-play.pdf](http://sebfisch.github.io/haskell-regexp/regexp-play.pdf)

And my implementation of the same in Scala (40 lines if you ignore some
niceties, and it's terribly fast asymptotically)
[https://gist.github.com/yellowflash/826004277874cadabbc502e6...](https://gist.github.com/yellowflash/826004277874cadabbc502e6d406b39e)

As a TLDR of the paper: it gradually builds up an abstraction and an
implementation of a regex engine that runs in O(mn), where m is the length of
the regex and n is the length of the text. Then they generalize it to do
grouping and even extend it to match context-free grammars (mostly using lazy
evaluation).

------
foota
Code golf link to a similar challenge
[https://codegolf.stackexchange.com/questions/125708/regular-...](https://codegolf.stackexchange.com/questions/125708/regular-expression-parser)

~~~
maweki
That does indeed implement grouping and alternation.

~~~
abecedarius
It's nicely coded but leaves out an important part of that algorithm: the
regex derivatives never get simplified for comparison, so equal states appear
different, so the set of states blows up as you march along the string. If
you're OK with an inefficient matcher like that, then a backtracking algorithm
probably gives you even simpler code.
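
As a sketch of what "simplified" can mean here, this is a small derivative-based matcher in Python where the constructors normalize as they build. It is my own illustration, not the code under discussion: the tuple encoding and function names are assumptions, and the simplification is only partial (it does not, for instance, reorder or flatten alternatives).

```python
EMPTY = ("empty",)  # matches nothing
EPS = ("eps",)      # matches only the empty string


def char(c):
    return ("chr", c)


def alt(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY or a == b:
        return a
    return ("alt", a, b)


def cat(a, b):
    if EMPTY in (a, b):
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("cat", a, b)


def star(a):
    if a in (EMPTY, EPS):
        return EPS
    return a if a[0] == "star" else ("star", a)


def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # cat


def deriv(r, c):
    """Regex matching what remains of r's matches after consuming c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    first = cat(deriv(r[1], c), r[2])  # cat case
    return alt(first, deriv(r[2], c)) if nullable(r[1]) else first


def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)
```

With these constructors, the derivative of (x|y)* with respect to "x" comes back as exactly the same tuple as (x|y)*, so the state is recognized as already seen instead of growing with every character.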

------
jason-johnson
Depending on how it’s being counted, I have a regex _engine_ in around 30
lines [1] (the parser is longer). It handles branching, grouping, etc. and
its run time is proportional to the length of the string being searched (i.e.
no infinite loops on certain patterns, etc.).

[1] [https://github.com/jason-johnson/frobo/blob/master/src/Text/...](https://github.com/jason-johnson/frobo/blob/master/src/Text/ExpressionEngine/NFA/Matcher.hs)

The “match” function. And yes, this is ugly and needs to be cleaned up.

------
microtherion
A bit longer than 40 lines, but back in the day, I was a big fan of the
simplicity and clarity of Henry Spencer's regex code:
[https://github.com/garyhouston/regexp.old](https://github.com/garyhouston/regexp.old)

~~~
rlonstein
> Henry Spencer's regex code:
> [https://github.com/garyhouston/regexp.old](https://github.com/garyhouston/regexp.old)

Glad to see that come up in this thread. It was the first clearly explained
regex engine for me in _Software Solutions in C_ (Schumacher D., Academic
Press, 1994). I still have a copy and the original disk (and disk image). If
anyone is interested I could scan that chapter tonight.

~~~
sitkack
I looked all over for a copy. I'd love a scan. Is the disk 3.5 or 5.25?

------
zoner14
Hi, I'm the original author of this article. I just want to say thank you for
all the positive and constructive comments. I'm happy to answer questions that
anyone has.

------
zaarn
I would suggest that since it lacks grouping, like one other commenter pointed
out, it's not a regular expression engine, it only implements a useful subset
which is not a regular expression language.

For that you need to be able to build an equivalent of the regular
expression

    (x|y)*

------
onirom
Nice, this reminds me of a custom regex engine I built some years ago in JS
while trying to build a fast & small lexer. It does not support ? but does
support +, - and sub-rules. It is around 60 LOC if I remember correctly, and
was mostly built by "accident". It takes JSON as input and outputs a finite
state automaton (a compiled version?).

[https://github.com/grz0zrg/jsb/blob/master/lexer.js](https://github.com/grz0zrg/jsb/blob/master/lexer.js)

------
perlgeek
In my experience, it's the edge cases that are a pain when implementing a
regex engine. Think about weird regexes like "^*", "$?" (or any quantified
zero-width match), anchors that appear in the middle of regexes, nested
quantifiers (don't make much sense, but you need to generate good error
messages).

Once you add captures and maybe even backreferences, you get a whole new world
of weird :-)

~~~
thesmallestcat
You can straight away drop any repetition of a zero-width assertion, no?

~~~
perlgeek
You can, but you need to drop the entire zero-width assertion if the
repetition allows zero occurrences.

------
rurban
I did that with grouping (), alternatives |, and classes [], for Asterisk some
decades ago. They had a very special phone-numbering syntax. Less than 100
lines in C. In Lisp it's trivial, less than 10 lines for a simple matcher.

------
hyperpallium
14 lines
[https://news.ycombinator.com/item?id=3202313](https://news.ycombinator.com/item?id=3202313)

------
mar77i
I'd almost admire how convenient it was that the language that was used ships
regexes already.

~~~
majewsky
To be fair, he doesn't use the builtin regex support in his 40 LOC.

~~~
mar77i
TBH, that's exactly why I can't technically admire it. :)

