
Write Your Own Regular Expression Parser - apotheon
http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c4093
======
mey
Now you've got 3 problems?

------
scorpioxy
This brings back old memories. We were given this task as an assignment for a
course in college called "Theory of Computation". I stayed up for 3 days
without sleep to implement this thing. I did it in c++ though, so a lot of
that time was due to the choice of the language.

Edit: Forgot to add, got 140pts on it(over a 100, there was a bonus). Yeah,
I'm awesome.

~~~
barrkel
Indeed. In college, I implemented a regular expression compiler in Java,
targeting JVM bytecodes directly, which ended up outperforming java.util.regex
of the time (JDK 1.4).

For regular expressions in actual practice, keeping the expression as an NFA
is often done, as things like captures with backreferences cannot be expressed
as a DFA, and also the NFA to DFA conversion may, in pathological cases,
result in an exponential explosion in states.

If you're comfortable with your lexing and parsing, tree building and
threading (i.e. threaded tree), you can implement a basic NFA regex matcher
supporting captures and backrefs quite easily, in no more than a day. Matching
is basically a recursive visit over the threaded parse tree nodes, with some
creative threading links, such as in series for concatenation, in parallel for
alternation, and in cycles for closures. Backing out from recursion handles
the nondeterminism.

~~~
ajross
Exactly. And that in a nutshell is the problem with this article. It's an
exposition of the kind of regex engine we all learned about in school, not the
kind of tool that is useful for practical software development.

A regex matcher, alone, isn't really useful for much beyond a lexer generator.
It's captured subexpressions and backreferences (well, really just captures)
that turn it into the gadget we all love. And as you point out, that is _not_
achievable with the simple state machine engine we all remember from class.

~~~
haberman
> It's captured subexpressions and backreferences (well, really just captures)
> that turn it into the gadget we all love. And as you point out, that is not
> achievable with the simple state machine engine we all remember from class.

Did you read [0]? It demonstrates that submatch extraction can be (and has
been) implemented with DFAs:

"The extraction of submatch boundaries has been mostly ignored by computer
science theorists, and it is perhaps the most compelling argument for using
recursive backtracking. However, Thompson-style algorithms can be adapted to
track submatch boundaries without giving up efficient performance. The Eighth
Edition Unix regexp(3) library implemented such an algorithm as early as 1985,
though as explained below, it was not very widely used or even noticed."

[0] <http://swtch.com/~rsc/regexp/regexp1.html>

~~~
barrkel
I said captures with backreferences, not captures alone. As the link you cite
mentions, "as far as the theoretical term is concerned, regular expressions
with backreferences are not regular expressions. The power that backreferences
add comes at great cost: in the worst case, the best known implementations
require exponential search algorithms, like the one Perl uses. [...] No one
knows how to implement regular expressions with backreferences efficiently,
though no one can prove that it's impossible either. (Specifically, the
problem is NP-complete [...])"

