
Regular Expression Matching Can Be Simple And Fast - ambition
http://swtch.com/~rsc/regexp/regexp1.html
======
jimrandomh
Textbook regex algorithms assume that you're trying to determine whether a
string matches an expression using a standard set of operators. Perl regular
expressions have to match text to subexpressions, rather than just return a
yes/no answer, and have a greatly expanded set of operators, some of which are
completely incompatible with the usual NFA and DFA algorithms. Perl even lets
you drop arbitrary Perl code into the middle of a regex. You can easily write
a Perl regex that is NP-complete to match. You can't do that with the
simplified regex language that textbooks talk about.

~~~
alexk
"some of which are completely incompatible with the usual NFA and DFA
algorithms" - It would be nice if you pointed to some article/resource giving
more info on this.

~~~
tome
Well, here's a starter for you. Using enhanced regex syntax you can match
exactly the strings of prime length. On the other hand (using the
pumping lemma[1]) you can prove that that set is not regular, hence not
detectable by a standard regex.

[1]
[http://books.google.com/books?id=MDQ_K7-z2AMC&pg=PA47...](http://books.google.com/books?id=MDQ_K7-z2AMC&pg=PA47&lpg=PA47&dq=finite+automaton+prime&source=bl&ots=0wWTv5AeBd&sig=6sOwOf6PNdQmcvI6QhWVMKLvKKA&hl=en&sa=X&oi=book_result&resnum=3&ct=result)
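
A minimal sketch of the prime-length trick in Python, assuming the input is a number written in unary (a run of `1` characters). The backreference `\1` is the "enhanced" feature doing the work; the name `is_prime_length` is made up for this example.

```python
import re

# Matches unary strings of composite length: a group of two or more ones,
# followed by one or more exact repetitions of that group (the backreference).
COMPOSITE = re.compile(r"(11+?)\1+")

def is_prime_length(s: str) -> bool:
    """True if s consists only of '1' characters and len(s) is prime."""
    if len(s) < 2 or set(s) != {"1"}:
        return False
    # A length is prime when it cannot be split into k >= 2 equal groups
    # of size >= 2, i.e. when the composite pattern fails to match.
    return COMPOSITE.fullmatch(s) is None

primes = [n for n in range(20) if is_prime_length("1" * n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]
```

The engine has to try group sizes by backtracking, which is exactly the kind of work a plain NFA or DFA simulation never does.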

~~~
llimllib
Sipser's book is great for understanding this stuff, IMHO:
[http://books.google.com/books?id=eRYFAAAACAAJ&dq=sipser+...](http://books.google.com/books?id=eRYFAAAACAAJ&dq=sipser+theory&ei=SvqKSfnBHZb0ygS22KG6BQ)

------
harpastum
This reminds me of a very interesting article in the same vein from the great
(now sadly defunct) Ridiculous Fish:

The Treacherous Optimization
[http://ridiculousfish.com/blog/archives/2006/05/30/old-age-a...](http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/)

~~~
ambition
That's a good article. Ridiculous Fish is awesome. You should submit it!

------
chime
So why don't Perl, Ruby etc. use Thompson NFA?

~~~
newt0311
Same question here. I didn't realize that they didn't. Regular expression NFA
and DFA algorithms are in _every_ compiler book out there. I find it hard to
believe that Guido et al. didn't know about these algorithms.

~~~
nostrademons
Yeah, jimrandomh's comment basically explains it. Perl regexps have features
that can't be implemented by a DFA, like backreferences and arbitrary code
evaluation. (I believe you _can_ implement capturing subgroups, you'd just
need to annotate the states with a marker for the current position in the
string.) Those features are used often enough that you can't just leave them
out.

I see no reason why popular languages couldn't implement _both_ algorithms,
though, and then select the backtracking one only if the regexp contains
backreferences or embedded code expressions. It's easy enough to tell if a
regexp uses these features, just by checking for their existence when the
regexp is compiled. Then you could use the fast algorithm when possible and
the featureful one when not.
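
A rough sketch of that compile-time check in Python. The function names `needs_backtracking` and `choose_engine` are hypothetical, and this is a heuristic scan rather than a real regex parser: it looks for numbered backreferences, Python-style named backreferences `(?P=name)`, and Perl-style embedded code `(?{...})` / `(??{...})`. It will misfire on things like `[\1]`, where the escape sits inside a character class.

```python
def needs_backtracking(pattern: str) -> bool:
    """Heuristically detect features a plain NFA/DFA matcher cannot handle."""
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            nxt = pattern[i + 1:i + 2]
            # \1 .. \9 is a numbered backreference.
            if nxt and nxt in "123456789":
                return True
            # Any other escape: skip the backslash and the escaped character,
            # so r"\\1" (literal backslash, then "1") is not misread.
            i += 2
            continue
        # Named backreference or embedded code block (assumed syntaxes).
        if pattern.startswith(("(?P=", "(?{", "(??{"), i):
            return True
        i += 1
    return False

def choose_engine(pattern: str) -> str:
    """Pick the slow-but-featureful engine only when the pattern needs it."""
    return "backtracking" if needs_backtracking(pattern) else "automaton"

print(choose_engine(r"(a|b)*abb"))   # automaton
print(choose_engine(r"(a+)b\1"))     # backtracking
```

A real implementation would make this decision on the parsed pattern rather than the raw string, but the cost is the same either way: one pass at compile time.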

~~~
newt0311
Nope. Backref and friends can be implemented on top of DFAs. Grep offers an
example of such an implementation.

------
litewulf
I'm curious why they don't do some sort of cost estimation when doing the
regex compile. Oftentimes you use a regex only a few times in the life of the
application, and it'd be nice if maybe it would build the regex automaton with
some time-budget cap (say, N states created) and fall back to PCRE until it's
fully built.

(Backseat nitpicker here. And just thinking about it seems deliciously hairy.)
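
One way the budget idea could look, as a sketch only. It builds DFA states lazily from an NFA and stops caching new ones once a cap is reached. Two assumptions to flag: the NFA below is hand-built for `(a|b)*abb` with no epsilon moves, and the fallback here is plain uncached NFA simulation rather than PCRE. The class name `BudgetedMatcher` and the parameter `max_dfa_states` are invented for the example.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical hand-built NFA for (a|b)*abb: NFA[state][char] -> next states.
NFA: Dict[int, Dict[str, Set[int]]] = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = 0, 3

class BudgetedMatcher:
    def __init__(self, max_dfa_states: int) -> None:
        self.max_dfa_states = max_dfa_states
        # Each DFA state is a set of NFA states; start with just the initial one.
        self.dfa_states: Set[FrozenSet[int]] = {frozenset({START})}
        # Memoized DFA transitions: (state set, char) -> state set.
        self.cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def _step(self, current: FrozenSet[int], ch: str) -> FrozenSet[int]:
        key = (current, ch)
        if key in self.cache:
            return self.cache[key]          # fast path: DFA edge already built
        nxt = frozenset(t for s in current for t in NFA[s].get(ch, ()))
        # Memoize only if the target is already a known DFA state or the
        # budget allows one more; otherwise recompute on every visit.
        if nxt in self.dfa_states or len(self.dfa_states) < self.max_dfa_states:
            self.dfa_states.add(nxt)
            self.cache[key] = nxt
        return nxt

    def matches(self, text: str) -> bool:
        current = frozenset({START})
        for ch in text:
            current = self._step(current, ch)
            if not current:                 # no live NFA states: cannot match
                return False
        return ACCEPT in current

tight, roomy = BudgetedMatcher(1), BudgetedMatcher(100)
print(tight.matches("babb"), roomy.matches("babb"))  # True True
```

The answer is the same whatever the budget; only the amount of cached automaton differs, so a rarely used regex never pays for states it does not visit.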

