
Perl Regular Expression Matching Is NP-Hard - brudgers
http://perl.plover.com/NPC/
======
robinhouston
This result was (as far as I know) first published in 1990 by Aho. [1]

In 2001 I posted online a reduction from SUBSET-SUM to regular expression
matching with back-references (which I worked out on paper while on holiday,
in the days before the internet was readily accessible from Greek islands).
[2]

1. A. Aho. Algorithms for finding patterns in strings. In J. van Leeuwen,
editor, Handbook of Theoretical Computer Science, volume A, chapter 5, pages
255–300. Elsevier, Amsterdam, 1990.

2. [http://bumppo.net/lists/fun-with-perl/2001/06/msg00008.html](http://bumppo.net/lists/fun-with-perl/2001/06/msg00008.html)

~~~
williadc
For those that don't know, Alfred Aho is the "a" in awk (Aho, Weinberger, and
Kernighan).

------
kyrra
I'm not sure of the date of the linked page, but this is why RE2[0] was
created. Russ Cox has some examples where PCRE falls flat[1].

EDIT: Looks like the original linked page was first published in 2001[2].

[0] [https://github.com/google/re2](https://github.com/google/re2)

[1]
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

[2]
[https://web.archive.org/web/*/http://perl.plover.com/NPC/](https://web.archive.org/web/*/http://perl.plover.com/NPC/)

~~~
brudgers
From the Cox article:

 _As far as the theoretical term is concerned, regular expressions with
backreferences are not regular expressions. The power that backreferences add
comes at great cost: in the worst case, the best known implementations require
exponential search algorithms, like the one Perl uses. Perl (and the other
languages) could not now remove backreference support, of course._

Cox then proposes a hybrid engine that can be quicker when capture groups
aren't needed. Tcl has this sort of hybrid implementation,
[http://wiki.tcl.tk/396](http://wiki.tcl.tk/396). Its worst-case performance
won't be big-O better than Perl's for pathological cases that require
backtracking.

Since P equals NP no more and no less today than it did in 2001, the article's
conclusion has not lost its correctness.

~~~
thaumasiotes
> Cox then proposes a hybrid engine that can be quicker when capture groups
> aren't needed.

Capture groups aren't the same thing as backreferences. You can do capturing
groups fine in linear time (or at least, something close to linear time).
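To illustrate the distinction in Python: a capture group merely records the text it matched, while a back-reference additionally constrains later input to repeat an earlier group's text, and it is the latter that breaks the linear-time guarantee.

```python
import re

# Capture group: records what it matched; no constraint on later input.
m = re.fullmatch(r"(\w+)@(\w+)", "user@example")
print(m.groups())  # ('user', 'example')

# Back-reference: \1 must match the exact text the first group matched.
print(bool(re.fullmatch(r"(\w+) \1", "hello hello")))  # True
print(bool(re.fullmatch(r"(\w+) \1", "hello world")))  # False
```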

~~~
brudgers
Thanks for the clarification. This page was helpful:
[https://msdn.microsoft.com/en-us/library/thwdfzxy(v=vs.110).aspx](https://msdn.microsoft.com/en-us/library/thwdfzxy\(v=vs.110\).aspx)

------
onan_barbarian
This result has always been a bit of a head-scratcher for me, in that it seems
to require that we increase the pattern size (specifically, the number of
back-references). While the result is correct, it seems to run counter to
common usage of regular expressions, where the regular expression is fixed and
the interest is generally in how fast a given regular expression runs over N
bytes of input data.

It feels a bit like informing people that comparing two integers is O(N)
because maybe the integers are bignums with N bits each; true, but counter to
common usage.

~~~
gohrt
For a given piece of text, pattern matching is (assuming P != NP) exponential
in the number of back-references (one measure of the "size of the pattern") in
the general, non-lucky cases.

~~~
onan_barbarian
This seems plausible, but is the inverse of the traditional expectation of
what grows and what stays the same. I'm speaking only from pragmatic
experience of regex implementation (as opposed to offering a theoretical
refutation of the point): the size of the pattern is typically a constant, and
the metric of interest is usually the difficulty of scanning a _fixed_ pattern
over an arbitrarily large input buffer with worst-case input.

------
Houshalter
Maybe relevant: Regular Expression Denial of Service (ReDoS):
[https://en.wikipedia.org/wiki/ReDoS](https://en.wikipedia.org/wiki/ReDoS)
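The classic ReDoS shape: nested quantifiers like `(a+)+` give a backtracking engine exponentially many ways to carve up the input, so even a short non-matching string takes a long time, while an equivalent unambiguous pattern is instant. A minimal Python sketch (the input is kept short so the demo still finishes):

```python
import re
import time

evil = re.compile(r"(a+)+$")   # nested quantifiers: ambiguous partitioning
safe = re.compile(r"a+$")      # matches exactly the same set of strings

subject = "a" * 20 + "b"       # non-matching input triggers full backtracking

t0 = time.perf_counter()
assert evil.match(subject) is None   # ~2^20 backtracking steps before failing
print("evil: %.4fs" % (time.perf_counter() - t0))

t0 = time.perf_counter()
assert safe.match(subject) is None   # one linear scan
print("safe: %.6fs" % (time.perf_counter() - t0))
```

Each extra `a` roughly doubles the work for the ambiguous pattern, which is why a few dozen characters of attacker-controlled input can hang a service.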

------
rurban
Almost everyone knows that supporting backrefs makes matching slow, i.e.
exponential. I'm tired of these excuses, which come up all the time.

What is perl5's solution to this problem? Saying that it is unsolvable (like
here), and supporting external matchers. Nobody uses them because it's a joke.

What it should of course do is check for abnormal backrefs and use a fast
matcher when there are none. 90% of regexes have no backrefs.
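That dispatch can be sketched in Python (a sketch only; Python's `re` is itself a backtracker, so the "fast engine" branch here is a stand-in, and the detector is deliberately conservative): scan the pattern source for `\1`..`\9` or `(?P=name)` and only fall back to the backtracking engine when one is present.

```python
def has_backref(pattern):
    """Conservatively detect back-references in a regex source string,
    taking care not to be fooled by escaped backslashes like r'\\1'."""
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            if pattern[i + 1] in "123456789":
                return True          # numeric back-reference \1 .. \9
            i += 2                   # any other escape: skip both chars
            continue
        if pattern.startswith("(?P=", i):
            return True              # named back-reference (?P=name)
        i += 1
    return False

print(has_backref(r"(\w+) \1"))            # True
print(has_backref(r"\\1"))                 # False: literal backslash, then '1'
print(has_backref(r"(?P<w>\w+) (?P=w)"))   # True
print(has_backref(r"^\d{4}-\d{2}$"))       # False: safe for a fast engine
```

A real dispatcher would do this check at compile time (trivial in a two-pass compiler, as the comment notes) and route backref-free patterns to an NFA/DFA engine.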

There would be a tiny performance cost if perl5 used normal dynamic one-pass
compilation, as a modern matcher does. But it uses old-style two-pass
compilation, so it's trivial to check whether there's a backref, and to use a
proper Thompson NFA if not. Or a faster jitted variant. Instead p5p still
holds on to their awful Spencer backtracker with longjmp's all over the place
to fix the broken architecture, and engages heavily in badmouthing better
algorithms and implementations.

The few changes that were made there made it even slower. Davem's conversion
from recursion to iteration caused a significant performance regression.
Davem's support for proper /e (eval rhs expressions) caused another. You are
not allowed to criticize that; the p5p zealots will come after you. Russ Cox
ran away very fast.

------
Animats
If it's NP-hard, it's not really a "regular expression". It's more like a
SNOBOL pattern, but with worse syntax. In SNOBOL, backing up was referred to
as "pulling the needle back" as it backed out of a failed subpattern to try a
different one.

~~~
Scarblac
The title says "Perl regular expression". It is well known that they have
important extensions compared to the things known as regular expressions in
theory.

