
Regular Expression Matching Can Be Simple and Fast (2007) - prabhupant
https://swtch.com/~rsc/regexp/regexp1.html
======
feniv
The author, Russ Cox, has a whole series of articles on the topic of regex -
[https://swtch.com/~rsc/regexp/](https://swtch.com/~rsc/regexp/)

The Rust regex crate docs even refer to his posts as a reference, since the
Rust version is based on his efficient RE2 regex implementation at Google.

------
dang
Threads from 2018:
[https://news.ycombinator.com/item?id=16341519](https://news.ycombinator.com/item?id=16341519)

2015:
[https://news.ycombinator.com/item?id=9374858](https://news.ycombinator.com/item?id=9374858)

2009:
[https://news.ycombinator.com/item?id=820201](https://news.ycombinator.com/item?id=820201)

also 2009:
[https://news.ycombinator.com/item?id=466845](https://news.ycombinator.com/item?id=466845)

------
Quarrel
But if you want faster (and more complicated):

[https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)

~~~
rurban
With far fewer features. Hyperscan only works with very limited regular
expressions.

Russ's RE2 just cannot do backtracking.

------
DigitalTerminal
Also relevant is Google V8's Irregexp:
[https://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html](https://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html).
They claim 3x speedups.

~~~
zamadatix
[https://github.com/google/re2](https://github.com/google/re2) is probably the
better-matched Google library here, since the post focuses on truly regular
expressions, while much of Irregexp's design is about making the irregular
parts less costly.

------
pier25
A bit off topic, but here are some Regex benchmarks that might be of interest:

[https://github.com/mariomka/regex-benchmark](https://github.com/mariomka/regex-benchmark)

Surprisingly JS is about 10 times faster than Go.

~~~
statictype
Presumably JS's regex engine is implemented in C?

Can't believe the .NET implementation is so slow.

~~~
pier25
Yes, most likely, or maybe C++.

~~~
ridiculous_fish
v8 and JSC both JIT-compile regexes.

~~~
pier25
Directly to assembly?

~~~
ErikCorry
Yes, directly to machine code.

------
yxhuvud
What I wonder is whether the regexps could be preprocessed so that the fast
variant is used in all, or at least most, cases where it is possible to use
it.
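One way that could look (a hypothetical sketch, not something from the article): scan the pattern for irregular features like backreferences and lookaround, and route truly regular patterns to the linear-time engine. The `is_regular` helper below is my own invention and deliberately incomplete -- it ignores named backreferences like `(?P=name)` and would mis-fire on escapes inside character classes:

```python
import re

# Rough heuristic: does the pattern use features that make the language
# non-regular? If not, a linear-time engine could handle it; otherwise
# fall back to a backtracking engine.
IRREGULAR = re.compile(
    r'\\[1-9]'      # numbered backreference, e.g. \1
    r'|\(\?[=!]'    # lookahead: (?= or (?!
    r'|\(\?<[=!]'   # lookbehind: (?<= or (?<!
)

def is_regular(pattern):
    """True if the pattern appears to describe a regular language."""
    return IRREGULAR.search(pattern) is None
```

A real dispatcher would presumably do this check on the parsed AST rather than the raw pattern text.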

------
abainbridge
Is there any information about performance in the ordinary case, i.e. when not
matching pathological patterns like a?a?a?aaa against aaaaaa?
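For reference, the pathological family from Cox's article is a?^n a^n matched against a^n: a backtracking engine explores up to 2^n ways to assign the optional a?'s, while a Thompson-style simulation stays linear. A quick demonstration using CPython's backtracking `re` module (absolute timings will vary by machine):

```python
# The pattern a?^n a^n matched against the text a^n. CPython's re module
# backtracks, so it can try exponentially many assignments of the
# optional a?'s before finding the one where they all match empty.
import re
import time

def pathological(n):
    pattern = 'a?' * n + 'a' * n   # n=3 gives a?a?a?aaa
    text = 'a' * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    return matched, time.perf_counter() - start

# On a backtracking engine the time roughly doubles per extra a?.
for n in (5, 10, 15, 20):
    matched, secs = pathological(n)
    print(f'n={n:2d} matched={matched} time={secs:.4f}s')
```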

------
hyperpallium
Now you have an exponential space problem.

~~~
tempguy9999
A quick skim of the article doesn't show any sophisticated pattern reduction,
so the parent's comment would appear fair. Straightforward regex
implementations have to be exponential in time or space for pathological
patterns, and if it's not time here then it's space, so why is this being
downvoted?

~~~
chubot
No, neither your understanding nor the parent's is correct.

The NFA is linear in the size of the regex. That is, a regex like a+b+c+ will
have an NFA that's 3x the size of the NFA for a+.

Interpreting the NFA is done in time LINEAR with respect to the input data.
Code is given in Cox's articles. Yes, worst case linear.
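As a concrete illustration (my own minimal toy, not Cox's or RE2's code), here is a hand-built NFA for a+b+c+ simulated by tracking the set of live states. Each input character is processed exactly once, so the runtime is linear in the input length, with a constant factor bounded by the NFA size -- no backtracking:

```python
# Thompson-style NFA simulation for the fixed regex a+b+c+.
# State i means "currently inside the i-th plus-group".
# Transition table: (state, char) -> set of successor states.
TRANS = {
    (0, 'a'): {0, 1},   # stay in a+, or move on to expect b
    (1, 'b'): {1, 2},   # stay in b+, or move on to expect c
    (2, 'c'): {2, 3},   # stay in c+, or reach the accept state
}
ACCEPT = {3}

def match(text):
    states = {0}  # start state
    for ch in text:
        states = set().union(*(TRANS.get((s, ch), set()) for s in states))
        if not states:  # no live states left: can never match
            return False
    return bool(states & ACCEPT)
```

However long the input, this visits at most four states per character.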

You're probably thinking of NFA -> DFA conversion, i.e. the "subset
construction".

First of all, RE2 doesn't always do that; you can always operate on NFAs (and
it does that because capturing is hard with DFAs, IIRC).

Second of all, "exponential in size of input pattern" is NOT a problem.
Exponential in the size of the DATA is a problem, and that doesn't happen.
When you are measuring the runtime of regexes, the size of the pattern is a
small constant. (Think matching a+b+c+ against gigabytes of data.)

The magic of true regular expressions is that you can match _any regular
language_ in linear time and constant space. You can't do that with Perl-style
"regexes". That's the WHOLE POINT of these articles.

I have wanted to write a summary of Russ Cox's articles on my blog for a while.
They are very good, but they are very dense, and most people still aren't
getting the message.

The message is: "Use regular languages and not regexes". (e.g. constructs like
backreferences and lookaround assertions make the language nonregular)

[1] Side note: I tested re2c and its regex compilation time is linear up to
matching 47,000 fixed strings in parallel (from /usr/share/dict/words)

[https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L328](https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L328)

re2c ALWAYS builds a DFA, unlike RE2. (Note: they are completely unrelated
projects, despite their similar names.)

Even setting aside that exponential growth in the size of the pattern is not
usually a problem, the DFA is in fact not exponential in the size of the
pattern for the fgrep problem. If the DFA were exponentially sized, it would
take exponential time to construct, but it doesn't in this case.

~~~
hyperpape
There's definitely room for a helpful blog post laying out these principles in
a less dense way. Go for it!

There's also a lot of room for discussion of the choices around implementation
techniques different libraries use (things like whether or not DFAs are
created on the fly). I'm interested in this area, so I should probably work up
to writing up some of these examples, as a way of forcing myself to learn them
better.

Also, I think you're not painting the full picture about the issue of DFA
conversion being exponential. You're right that it's much less of an issue
than being exponential in the input size, as it affects fewer use cases.
However, it's still a potential DoS issue, so it definitely matters for any
application that takes user controlled input and produces a DFA. I think it
also matters for getting optimal performance out of an engine even in non-
adversarial conditions.

For an example of how easy this is to trigger: (a|b)*a(a|b)^(n-1) produces a
DFA of size exponential in n. With n = 15, re2c generates a 4 MB file.
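That blowup is easy to reproduce with a toy subset construction (my own sketch; the NFA encoding of (a|b)*a(a|b)^(n-1) is hand-built). The determinized automaton must remember which of the last n characters were 'a', giving 2^n states:

```python
# Count the states of the DFA obtained by determinizing the NFA for
# (a|b)*a(a|b)^(n-1). NFA: state 0 loops on a|b and also enters state 1
# on 'a'; states 1..n-1 advance on any character; state n accepts.
def dfa_size(n):
    def step(states, ch):
        out = set()
        for s in states:
            if s == 0:
                out.add(0)          # (a|b)* self-loop
                if ch == 'a':
                    out.add(1)      # nondeterministically take the 'a'
            elif s < n:
                out.add(s + 1)      # (a|b) advances on any character
            # s == n is the accept state: no outgoing transitions
        return frozenset(out)

    start = frozenset({0})
    seen = {start}
    work = [start]
    while work:                      # worklist over reachable DFA states
        cur = work.pop()
        for ch in 'ab':
            nxt = step(cur, ch)
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return len(seen)
```

Each DFA state corresponds to one pattern of 'a's among the last n characters, so `dfa_size(15)` comes out to 2^15 = 32768 states.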

~~~
chubot
RE2 was used in Google Code Search in an adversarial context, and it handles
the DFA issue by simply putting an 8MB limit on the DFA for a regex:

[https://github.com/google/re2/blob/master/re2/re2.h#L601](https://github.com/google/re2/blob/master/re2/re2.h#L601)

I learned a few months ago that Rust's regex crate, which is based heavily on
RE2, does the same thing.

There is perhaps something worth exploring there, but after reading the Cox
articles, I can see that there are so many issues that come up in regex
implementation, and this seems to be one of the smaller ones.

I don't think the "regex as untrusted data" use case is very common. That is,
there are probably 10,000 programs that use regexes for every one that accepts
regexes from users.

-----

Although, one thing I've wanted to explore is Regex derivatives. Supposedly
that technique constructs an optimal DFA directly. You don't have to make an
NFA, convert it to a DFA, and then optimize the DFA.

Cox describes it as a "win-win" here (in contrast to parsing with derivatives,
which is asymptotically worse than traditional parsing algorithms)

[https://research.swtch.com/yaccalive](https://research.swtch.com/yaccalive)

There are a bunch of non-production implementations floating around, like

[https://github.com/google/redgrep](https://github.com/google/redgrep)

[https://github.com/MichaelPaddon/epsilon](https://github.com/MichaelPaddon/epsilon)

(This technique was revived by the 2009 paper cited)
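A minimal sketch of the derivative idea (my own toy, not how redgrep implements it): take the Brzozowski derivative of the regex with respect to each input character, and accept if the final regex is nullable. Building the DFA then amounts to memoizing derivatives per (regex, character) pair, with nullability marking accepting states:

```python
# Toy regex matching via Brzozowski derivatives. A regex is a tuple AST:
#   ('empty',)     matches nothing
#   ('eps',)       matches the empty string
#   ('chr', c)     matches the single character c
#   ('cat', r, s)  concatenation
#   ('alt', r, s)  alternation
#   ('star', r)    Kleene star

def nullable(r):
    """Does r match the empty string?"""
    t = r[0]
    if t in ('eps', 'star'):
        return True
    if t in ('empty', 'chr'):
        return False
    if t == 'cat':
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # alt

def deriv(r, c):
    """The derivative of r w.r.t. c: matches s exactly when r matches c+s."""
    t = r[0]
    if t in ('empty', 'eps'):
        return ('empty',)
    if t == 'chr':
        return ('eps',) if r[1] == c else ('empty',)
    if t == 'alt':
        return ('alt', deriv(r[1], c), deriv(r[2], c))
    if t == 'star':
        return ('cat', deriv(r[1], c), r)
    # cat: d(rs) = d(r)s, plus d(s) when r is nullable
    d = ('cat', deriv(r[1], c), r[2])
    return ('alt', d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)

# Example: (ab)* as an AST.
AB_STAR = ('star', ('cat', ('chr', 'a'), ('chr', 'b')))
```

Without simplification the intermediate regexes grow, which is where the real implementations earn their keep.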

-----

Regarding the blog post, I have some material here. I gave about 10% of this
as a 5-minute talk, and realized it's about 2 hours worth of material.

[https://www.oilshell.org/share/05-31-pres.html](https://www.oilshell.org/share/05-31-pres.html)

I still think the table toward the end is a good summary and perhaps worth
publishing with some more context / explanation.

----

One interesting test of whether the message from Cox's articles got through is
what style of regex engine newer languages are using.

I just noticed Julia uses PCRE, which is unfortunate IMO. I like RE2 -- it has
a nice API, great performance, and a lot of features.

[https://docs.julialang.org/en/v1/manual/strings/index.html#Regular-Expressions-1](https://docs.julialang.org/en/v1/manual/strings/index.html#Regular-Expressions-1)

Swift also has Perl-style regexes, but maybe they inherited them from
Objective-C:

[https://developer.apple.com/documentation/foundation/nsregul...](https://developer.apple.com/documentation/foundation/nsregularexpression)

So basically Rust is the only newer language that appears to have gotten the
message.

The Oil shell will of course use "regular languages" rather than regexes, but
that's natural because shell never had Perl-style regexes.

One thing I think Cox didn't explore enough is that most libcs appear to use
the fast, good method, because they don't need Perl constructs.

I guess that would be a good follow-up article: test GNU libc and musl libc on
those pathological cases. I looked at their code a few years ago and it looked
"right" (I've forgotten the details), but I don't think I ever did the test.

