

Regular Expressions Will Stab You in the Back - pjscott
http://tech.blog.cueup.com/regular-expressions-will-stab-you-in-the-back

======
kibwen
If you're really worried about the worst-case behavior of regular expressions
be sure to select an engine designed specifically to not choke on pathological
inputs, such as Russ Cox's RE2:

<http://code.google.com/p/re2/>

The rather interesting analysis is here:

<http://swtch.com/~rsc/regexp/regexp1.html>

Not that it's a free lunch: you pay a constant factor of overhead in exchange
for a sane growth pattern.

EDIT: Of course the article would mention RE2. That's what I get for reading
the comments before TFA...

~~~
pjscott
This gets a section in the article, but it's worth repeating, since RE2 is
just that good. It's as close to Perl-compatible as a non-backtracking regular
expression engine is likely to get, pleasantly fast, and stable enough for
Google to use in production.

Regarding speed, I've actually found that re2 is faster on some non-
pathological regexps, and a little slower on most. As usual, if you care about
this, benchmark it. I've managed to get some easy 2x speedups by switching
from Python's re to re2, even though re is usually faster.
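A minimal benchmarking sketch along those lines, using only the stdlib `re` and `timeit`. The third-party bindings (e.g. the pyre2 package, imported as `re2`) expose a mostly drop-in API, so swapping the engine is assumed to be roughly the commented-out line; the pattern and text here are purely illustrative.

```python
# Rough harness: time the same pattern on the same text before committing
# to an engine switch. Only the stdlib is used; the re2 line is commented
# out because it requires third-party bindings (e.g. pip install pyre2).
import re
import timeit

pattern = r"(\d{4})-(\d{2})-(\d{2})"
text = "log entry 2012-09-14 something happened " * 100

compiled = re.compile(pattern)
elapsed = timeit.timeit(lambda: compiled.findall(text), number=1000)
print(f"re: {elapsed:.3f}s for 1000 runs")

# import re2                      # hypothetical drop-in, per the bindings' docs
# compiled2 = re2.compile(pattern)
# print(timeit.timeit(lambda: compiled2.findall(text), number=1000))
```

As the parent says: if you care, benchmark your own patterns, since which engine wins depends on the regex.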

~~~
kibwen
Good to hear that RE2's constant factor doesn't seem too large. I know the
Rust guys were considering using it as the regex engine in their stdlib, but
were torn on whether that constant factor would prove troublesome.

------
nmcfarl
Or they won’t, as anyone who worked on a Perl-based website circa 2000 will
tell you. The title and lede massively overstate the problem.

However the solutions given, if you do end up with this problem, are good and
detailed.

They did miss Perl's solution, the special backtracking control verbs:

<http://perldoc.perl.org/perlre.html#Special-Backtracking-Control-Verbs>

~~~
pjscott
Perl also has some clever memoization which handles a lot of cases that choke
less mature regular expression engines. But neither of these is a satisfying
solution; backtracking control verbs can only help you if you use them, and
the memoization is not the solution to all woes, as the benchmark in the
"Performance" section of this article shows:

<http://swtch.com/~rsc/regexp/regexp1.html>

The sky isn't falling, but it's still possible to get struck by lightning.
(And of course, you're vulnerable to DoS attacks if you run user-supplied
regexps with backreferences -- this is provably NP-hard.)
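To make the backreference point concrete, here is a small Python sketch. A backreference like `\1` must repeat whatever group 1 happened to capture, which no finite automaton can track; this is why RE2 simply rejects backreferences, while backtracking engines accept them and inherit the worst-case cost.

```python
# Backreferences take matching beyond regular languages: \1 must equal
# whatever group 1 captured. RE2 refuses such patterns; Python's
# backtracking `re` accepts them.
import re

# Matches any string that is some chunk written twice in a row.
doubled = re.compile(r"^(.+)\1$")

print(bool(doubled.match("abcabc")))   # True: the chunk "abc", twice
print(bool(doubled.match("abcabd")))   # False: no such split exists
```

Deciding matches for patterns like this, in general, is the NP-hard problem the parent refers to.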

~~~
vampirechicken
Not to gainsay everything you've written here, because I think there's plenty
of room in my universe for new regex engines, but the article is five years
old and benchmarks Perl 5.8.7.

We are of course up to 5.16.1 these days, and I'd be more inclined to pay
attention to benchmarks if they were performed with remotely current versions
of the software.

Otherwise, thanks for the enlightening post.

~~~
pjscott
Just ran the same benchmark on Perl 5.12.3 (because it came installed by
default on my machine), and got the same exponential-time matching behavior.
If you have a more recent version of Perl than I do, I'd be interested in
hearing the results. This takes about 10 seconds on my machine, but should be
nigh-instantaneous if they've fixed the problem in a later version of Perl:

    
    
        "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/

~~~
vampirechicken
I have 5.8.8, and I have 5.14.2. I've also taken the liberty of converting the
string of a?'s into a{0,27}.

I recognize that you're trying to demonstrate pathologically bad behavior, but
as written, you've got a pathologically bad regex.

Note the speedup from 24 and 30 seconds to seven thousandths of a second in
both cases. Again, I like the idea of a regex engine with different design
considerations than Perl's.

Your article would have been much better if it had used a class of
pathological data with real-world examples, instead of contrived examples
that I rewrote in less time than it took to run them.

So here's my results:

5.14.2:

    
    
         $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
         1
         real    0m24.227s
         user    0m24.210s
         sys     0m0.008s
    
         $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
         1
         real    0m0.007s
         user    0m0.004s
         sys     0m0.004s
    

\----------------------------------

5.8.8:

    
    
         $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
         1
         real    0m30.065s
         user    0m30.046s
         sys     0m0.016s
    
         $ time perl -e 'print "aaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /a{0,27}aaaaaaaaaaaaaaaaaaaaaaaaaaa/ '
         1
         real    0m0.007s
         user    0m0.004s
         sys     0m0.004s
    

\--------------------------------

~~~
pjscott
Real-world examples tend to be subtler, and not quite as easy to explain in an
introductory article. Of course you wouldn't really use the for-demonstration-
only regex I mentioned. Instead, you would write something reasonable-looking
that works just fine unless it encounters a highly unlikely bad input. That's
the way I've always seen it happen.
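For what it's worth, here is an illustrative sketch (not from the article) of the kind of reasonable-looking pattern meant here: validating a line as whitespace-separated words. The nested quantifiers in `(\w+ ?)*` are harmless on matching input, but a non-matching string forces the engine to try every way of splitting the run of word characters, which is exponential in the run's length.

```python
# A "reasonable-looking" validator: a line of words separated by single
# spaces. Fine on ordinary input; on a long non-matching run of word
# characters, the nested quantifiers force exponential backtracking,
# roughly doubling the work with each extra character.
import re

words = re.compile(r"^(\w+ ?)*$")

print(bool(words.match("a perfectly ordinary line")))  # fast: True
print(bool(words.match("a" * 18 + "!")))               # noticeably slower: False
```

At length 18 the failure is merely sluggish; a few dozen characters more and it effectively never returns, which is exactly the "highly unlikely bad input" scenario.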

~~~
vampirechicken
In the "real world" I usually try to optimize for the average case, and deal
with the "highly unlikely bad input" as a special case.

I'm still impressed with the new regex engine, but I have a pathological
hatred of contrived examples.

It's not you, it's me.

