
Regular Expression Matching Can Be Simple and Fast (2007) - humanarity
https://swtch.com/~rsc/regexp/regexp1.html
======
tikhonj
…if your regexes are, you know, _regular_. Which Perl-style regular
expressions most definitely are not. So it's not a matter of just implementing
the regexes you know faster, but limiting them to a subset of features first.

This is a perfectly valid thing to do, but it _is_ changing the power of
regexes significantly. I feel the article makes this fact easy to overlook—I
certainly did until I took a class on automata.

But yes, presumably Perl could use an NFA-based algorithm if given a regex
that is actually regular. One of my professors attributed the fact that it
doesn't to old patent issues, but I have not been able to find a good source
for that. It's also possible that it's just a case of worse is better:
backtracking is fast enough 99% of the time, and if performance really
matters, perhaps you shouldn't be using regexes anyhow.
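
For illustration, here is a small Python sketch (my example, not from the article) of the kind of pattern that makes a backtracking engine blow up while a true NFA/DFA matcher stays linear:

```python
import re
import time

# A classic catastrophic-backtracking pattern: the nested quantifiers
# give the engine exponentially many ways to split a run of 'a's
# before it can conclude the match fails.
pattern = re.compile(r'(a+)+$')

def time_failing_match(n):
    text = 'a' * n + '!'          # the trailing '!' forces failure
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

# Each extra 'a' roughly doubles the work for CPython's backtracker:
for n in (10, 14, 18):
    result, elapsed = time_failing_match(n)
    print(n, result, f'{elapsed:.4f}s')
```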

~~~
FreakLegion
I don't know about Perl proper, but PCRE does have an automaton option (which
the docs call a DFA but, if I remember correctly, is actually an NFA).

In practice, the vast majority of production regexen don't require
backreferences or generalized assertions, so automata are an attractive
option. This is particularly true for security applications (e.g.
firewalls/NIDS) and public-facing services like Google Code Search. I wouldn't
be surprised if people were attacking Snort boxes via PCRE, for example.

------
pcwalton
To be honest, I've never been much of a fan of this article, because it
neglects the fastest way to implement regexes _in practice_ : use a fast JIT
to emit code that performs recursive backtracking. The Thompson NFA is a
better algorithm for pathological regexes that people don't frequently write
(e.g. "(a?)﹡a﹡)"), but the constant factors are much worse for common cases.

The Thompson NFA has its place as a fallback to avoid exponential blowup on
pathological regexes, but if I were implementing a performance-critical regex
engine, I would start with a JIT for the recursive backtracking approach and
add the Thompson NFA only later, because in practice constant factors dominate
algorithmic worst-case bounds.
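
A rough sketch of that hybrid in Python (toy matchers for just literals, '.', and 'c*'; no real engine works exactly like this, and a memoized matcher stands in for the Thompson simulation, with the same O(pattern × text) guarantee):

```python
from functools import lru_cache

class BudgetExceeded(Exception):
    pass

def bt_full_match(pat, text, budget=100_000):
    """Toy backtracking matcher for literals, '.', and 'c*'.
    Returns True/False, or None if the step budget runs out."""
    steps = 0

    def go(p, t):
        nonlocal steps
        steps += 1
        if steps > budget:
            raise BudgetExceeded
        if p == len(pat):
            return t == len(text)
        if p + 1 < len(pat) and pat[p + 1] == '*':
            i = t
            while True:                      # try 0, 1, 2, ... copies
                if go(p + 2, i):
                    return True
                if i < len(text) and pat[p] in ('.', text[i]):
                    i += 1
                else:
                    return False
        if t < len(text) and pat[p] in ('.', text[t]):
            return go(p + 1, t + 1)
        return False

    try:
        return go(0, 0)
    except BudgetExceeded:
        return None

def linear_full_match(pat, text):
    """Stand-in for the Thompson simulation: memoization visits each
    (pattern position, text position) pair at most once, giving an
    O(len(pat) * len(text)) bound with no exponential blowup."""
    @lru_cache(maxsize=None)
    def go(p, t):
        if p == len(pat):
            return t == len(text)
        if p + 1 < len(pat) and pat[p + 1] == '*':
            if go(p + 2, t):
                return True
            return (t < len(text) and pat[p] in ('.', text[t])
                    and go(p, t + 1))
        return (t < len(text) and pat[p] in ('.', text[t])
                and go(p + 1, t + 1))
    return go(0, 0)

def match(pat, text, budget=100_000):
    result = bt_full_match(pat, text, budget)
    if result is None:   # pathological case: fall back to linear matcher
        result = linear_full_match(pat, text)
    return result

print(match('a*b', 'aaab'))                  # fast path succeeds
print(match('a*' * 10, 'a' * 30 + 'b'))      # budget blows, falls back
```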

~~~
humanarity
JIT sounds interesting. Do you have a good ref?

I don't feel the constant factors are large for Thompson's approach.
Construction is essentially linear, IIRC, and matching the next character is
then proportional to the number of currently valid states, which is bounded by
the number of vertices in the NFA, and that will mostly be fewer than in the
DFA.

I don't see how you can get more efficient than that, because you're just
keeping track of all the matching states, which would seem to be the minimum
required. I'm really keen to learn an even better way.
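
A minimal sketch of that set-of-states simulation (a hand-built NFA for the hypothetical pattern a*b, rather than one compiled from a regex): each input character advances every live state at once, so matching is O(len(text) × number of states) with no backtracking.

```python
# States: 0 = start (loops on 'a'), 1 = accept (reached via 'b').
TRANSITIONS = {
    (0, 'a'): {0},
    (0, 'b'): {1},
}
ACCEPT = {1}

def nfa_match(text):
    states = {0}                       # all currently viable states
    for ch in text:
        nxt = set()
        for s in states:
            nxt |= TRANSITIONS.get((s, ch), set())
        states = nxt
        if not states:                 # no viable state left: fail early
            return False
    return bool(states & ACCEPT)

print(nfa_match('aaab'))   # True
print(nfa_match('aabba'))  # False
```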

The proposal is a workable heuristic (use the algorithm that works well in
practice rather than the one with a better theoretical bound but costly
constants), and people do well to heed it in many places. However, I don't
believe it applies here, because Thompson's algorithm is also great in
practice.

You sound like you know a lot about this, what am I missing?

------
humanarity
And Ken Thompson's original 1968 paper,

[http://www.fing.edu.uy/inco/cursos/intropln/material/p419-th...](http://www.fing.edu.uy/inco/cursos/intropln/material/p419-thompson.pdf)

------
aioprisan
Any good web-based regex tools that you use daily?

~~~
milesokeefe
My personal favorite is RegExr:

[http://www.regexr.com/](http://www.regexr.com/)

------
smegel
And when was the last time anyone used grouping in a regex!

~~~
McUsr
I use groups (and lookahead and lookbehind) all the time. Now that I'm
comfortable with them, they are great in that I can write simpler regexps, or
use fewer patterns. And I use Perl for one-liners, which is as far as I can go
comfortably with Perl. Thing is, I use regexps mostly for one-liners, and with
Perl and grouping it is simpler for me than using sed, for instance, where I
would then have to think (and test) substantially more, or filter the output
through yet another sed expression. So grouping works for me. :) It is of
course slower, but so far the most critical time for me has been the speed
with which I can make a solution that works.
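
For instance (made-up Python examples, though the same idea carries over to Perl one-liners):

```python
import re

# Capture groups let a single substitution reorder text:
# swap "lastname, firstname" into "firstname lastname".
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'Thompson, Ken')
print(swapped)            # Ken Thompson

# Lookbehind matches a position without consuming characters:
# grab numbers only when they follow a dollar sign.
prices = re.findall(r'(?<=\$)\d+', 'cost: $42, qty: 7, fee: $5')
print(prices)             # ['42', '5']
```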

