
Regular Expression Matching Can Be Simple And Fast - edw519
http://swtch.com/~rsc/regexp/regexp1.html
======
tc
_Thompson introduced the multiple-state simulation approach in his 1968 paper.
In his formulation, the states of the NFA were represented by small machine-
code sequences, and the list of possible states was just a sequence of
function call instructions. In essence, Thompson compiled the regular
expression into clever machine code. Forty years later, computers are much
faster and the machine code approach is not as necessary._

Interestingly, if you think in Lisp, it's obvious how much more elegant
Thompson's approach is (compared to a C struct-based state machine), and how
you would implement it Thompson's way with closures.
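Not closures in Lisp, but the multiple-state simulation the quoted paragraph describes is easy to sketch in Python with plain sets (the state numbering and transition-table shape here are my own assumptions, not Thompson's machine-code representation):

```python
# Multiple-state NFA simulation: instead of tracking one current state
# and backtracking, keep the set of ALL states the NFA could be in
# after each character. Runtime is linear in the text length.

def eps_closure(states, eps):
    """Expand a state set through epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def run_nfa(start, accept, trans, eps, text):
    """trans maps (state, char) -> set of states; eps maps state -> set."""
    current = eps_closure({start}, eps)
    for ch in text:
        nxt = set()
        for s in current:
            nxt |= trans.get((s, ch), set())
        current = eps_closure(nxt, eps)  # advance every live state at once
    return bool(current & accept)

# Hand-built NFA for a(b|c)*: state 0 --a--> 1; state 1 loops on b or c.
trans = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'c'): {1}}
print(run_nfa(0, {1}, trans, {}, "abcb"))  # True
print(run_nfa(0, {1}, trans, {}, "ba"))    # False
```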

~~~
agazso
In fact Thompson invented JIT compiling with this in 1968.

------
apu
Does anyone know why this is not implemented within Python, Perl, etc.? I
don't think the article mentioned...

~~~
blasdel
It's because the standard 'Regular Expressions' include highly irregular
language features -- Backreferences in particular are a huge bitch.

The clean transform in Thompson's algorithm only works for Regular Languages.
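To see why backreferences break the transform, here's a quick illustration in Python: a backreference pattern can recognize a non-regular language, which no finite automaton can.

```python
import re

# Backreferences recognize non-regular languages: (.+)\1 matches any
# "doubled" string ww, which no finite automaton (and hence no clean
# Thompson NFA) can recognize.
doubled = re.compile(r'^(.+)\1$')

print(bool(doubled.match('abab')))  # True  ('ab' repeated)
print(bool(doubled.match('abaa')))  # False (not of the form ww)
```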

~~~
viraptor
But "they" could include two engines in languages that distinguish compiled
and ad-hoc REs (like Python)... It could work like this:

If you use an ad-hoc RE, you always get the big engine (it probably doesn't
matter anyway - it's only going to be used once). If you compile, the simple
engine starts constructing the pattern-matching graph; as soon as anything
unsupported is found, you fall back to the big RE engine. The one-time cost
shouldn't matter if you're already preparing to match many strings (you're
using the crazy features and compiling that RE explicitly, after all - that's
going to take a while). Optionally, a flag could force a specific engine.
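A sketch of that compile-time dispatch (the feature detector is a rough assumption, and the fast engine is a stand-in, not a real implementation):

```python
import re

# Rough detector for features a Thompson-style engine can't handle:
# backreferences (\1..\9) and lookaround ((?=, (?!, (?<=, (?<!).
# The exact feature set is an assumption for illustration.
IRREGULAR = re.compile(r'\\[1-9]|\(\?<?[=!]')

def compile_pattern(pattern):
    """Pick an engine at compile time, falling back for irregular features."""
    if IRREGULAR.search(pattern):
        # Pay the backtracking cost only when the pattern demands it.
        return ('backtracking', re.compile(pattern))
    # Stand-in for the fast NFA engine; here we just reuse re, but a
    # real implementation would build the state graph instead.
    return ('nfa', re.compile(pattern))

print(compile_pattern(r'(a|b)*c')[0])  # nfa
print(compile_pattern(r'(a)\1')[0])    # backtracking
print(compile_pattern(r'a(?=b)')[0])   # backtracking
```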

~~~
steveklabnik
As the article mentions, usually PCREs are just 'fast enough.' Not that an
anecdote is better than actual data, but I prefer to use regular expressions
wherever possible, and I've never had matching time be an issue. It's not as
if people are clamoring for better matching times. Supporting two different
matching engines so that a fairly trivial speed increase can be had seems
silly.

~~~
blasdel
But it's not a trivial speed increase -- for some cases PCREs are O(2^n) where
Thompson's NFA is at worst O(n^2). The Thompson NFA quickly gets to be
_millions of times faster_.

Thompson's algorithm would let you use regular expressions for cases where you
would otherwise never have tried.
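The blowup is easy to reproduce with the article's pathological case (Python's `re` is a backtracking engine, like PCRE):

```python
import re
import time

# The article's pathological pattern: a?^n a^n matched against a^n.
# Every 'a?' must match empty, but a backtracking engine explores the
# other choices too, so the search space roughly doubles with each n.
# A Thompson NFA would take time linear in the text instead.
def match_time(n):
    pattern = 'a?' * n + 'a' * n
    text = 'a' * n
    t0 = time.perf_counter()
    assert re.fullmatch(pattern, text)
    return time.perf_counter() - t0

# Time roughly doubles as n grows; try n = 20, 22, 24 to watch it climb.
print('%.6fs' % match_time(10))
```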

~~~
haberman
> The Thompson NFA quickly gets to be millions of times faster

The O(2^n) worst case only comes from highly ambiguous patterns, which are
better written in other ways, so this isn't really accurate in practice.

------
BlueZeniX
The D language library 'Tango' implements this:

http://dsource.org/projects/tango/docs/stable/tango.text.Regex.html

