
Efficiency of regular expressions - wglb
http://www.johndcook.com/blog/2011/01/20/efficiency-of-regular-expressions/
======
stevek
The single worst thing that people do wrong when writing regular expressions
is to use .*x when really they should use [^x]*x (i.e. the common case of
looking for some kind of terminator 'x').

The worst case I ever saw had many .* patterns running over some C++ source,
which would take several minutes per file, presumably trying all combinations
of backtracking. With the negated character class [^x] it took under 0.1
seconds.

Edit: I see his twitter feed has .* as the icon. Ha!
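
A small demonstration of the difference, with a contrived pattern and input of my own devising to force the worst case (exact timings will vary by engine):

```python
import re
import time

# Contrived worst case: 20 commas and no terminating 'x' anywhere.
text = 'a,' * 20 + 'a'

# Each .*, may stop at any comma, so on a failed match a backtracking
# engine tries combinatorially many ways to split the input.
slow = re.compile(r'(.*,){8}x')
# Each [^,]*, must stop at the very next comma: one path, fails fast.
fast = re.compile(r'([^,]*,){8}x')

for name, pat in (('dot-star', slow), ('negated class', fast)):
    start = time.perf_counter()
    result = pat.search(text)
    elapsed = time.perf_counter() - start
    print(f'{name}: {result} in {elapsed:.4f}s')
```

Both searches fail (there is no 'x' in the input), but on a backtracking engine such as Python's re the dot-star version is dramatically slower.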

~~~
dedward
This would be highly dependent on the implementation of the regex engine
itself, would it not?

~~~
bmm6o
Right. The size of the difference will depend on the implementation details
(e.g. if it uses backtracking or not), but the form the GP proposes will never
be slower. And arguably it better captures the intent of the RE.

------
silentbicycle
Another problem with regular expressions is using them inappropriately. If you
have elaborate, convoluted regular expressions, it's usually a sign that you
should be using a parser instead. Regular expressions cannot maintain context
(though some implementations have ad hoc extensions for doing this
inefficiently and in a limited fashion; apparently matching with those
extensions is NP-complete), so matching recursive structure, like balanced
parentheses or nested HTML tags, is impossible.

Using regular expressions to break the input text into tokens ("tokenizing" or
"lexing") and then matching the sequence of tokens according to a grammar
("parsing") is usually much simpler. The two-phase approach often factors out
some complexity - only the lexer needs to care about things like whitespace,
for example.
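
As a sketch of that two-phase structure (the toy grammar and names here are my own, purely for illustration), a minimal regex tokenizer plus recursive-descent parser:

```python
import re

# Toy grammar: expr := term ('+' term)* ; term := NUM | '(' expr ')'
TOKEN = re.compile(r'\s*(?:(\d+)|(\+)|(\()|(\)))')

def tokenize(s):
    """Phase 1: the only place that has to think about whitespace."""
    tokens, pos = [], 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            raise SyntaxError(f'bad input at {pos}')
        pos = m.end()
        if m.group(1):
            tokens.append(('NUM', int(m.group(1))))
        elif m.group(2):
            tokens.append(('PLUS', '+'))
        elif m.group(3):
            tokens.append(('LPAREN', '('))
        else:
            tokens.append(('RPAREN', ')'))
    return tokens

def parse(tokens):
    """Phase 2: recursion handles the nesting a regex cannot."""
    def expr(i):
        val, i = term(i)
        while i < len(tokens) and tokens[i][0] == 'PLUS':
            rhs, i = term(i + 1)
            val += rhs
        return val, i
    def term(i):
        kind, v = tokens[i]
        if kind == 'NUM':
            return v, i + 1
        if kind == 'LPAREN':
            val, i = expr(i + 1)
            if i >= len(tokens) or tokens[i][0] != 'RPAREN':
                raise SyntaxError('expected )')
            return val, i + 1
        raise SyntaxError(f'unexpected {kind}')
    val, i = expr(0)
    if i != len(tokens):
        raise SyntaxError('trailing input')
    return val

print(parse(tokenize('(1 + (2 + 3)) + 4')))
```

The lexer's regexes stay trivial, and the arbitrarily deep parenthesis nesting lives entirely in the parser's recursion.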

While most compiler textbooks cover lexing and parsing, I particularly
recommend Andrew Appel's _Modern Compiler Implementation in ML_ and Niklaus
Wirth's _Compiler Construction_ (free online,
<http://www-old.oberon.ethz.ch/WirthPubl/CBEAll.pdf>).

More recommendations on learning parsing here:
<http://news.ycombinator.com/item?id=1820858>

~~~
amalcon
There are exactly two reasons that regular expression matching "needs" to be
slow. First, we have the "ad hoc extensions for doing this inefficiently and
in a limited fashion", and second, we have subexpression capture.
Backreferences (the former) are generally only useful in situations where you
really ought to be using some other construct, and they are the part that is
NP-complete.

Subexpression capture is great, and at least that can be done in (low-order)
polynomial time with a nondeterministic automaton approach. So really, the
only problem is the backreferences.
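
As a quick illustration of the feature being discussed (a toy example of my own): a backreference like \1 must repeat the exact text captured by an earlier group, which no true regular language can express, and which in general makes matching NP-complete.

```python
import re

# \1 must match whatever group 1 captured, so the pattern finds a
# repeated word. Harmless here, but general backreference matching
# is what pushes regex engines beyond regular languages.
doubled = re.search(r'\b(\w+)\s+\1\b', 'this is is a test')
print(doubled.group(1))
```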

------
numeromancer
The only useful part of this article is the article it refers to. To save
readers some time, here it is:

<http://swtch.com/~rsc/regexp/regexp1.html>

~~~
Semiapies
I hate when people submit a rambling blog post that contributes nothing beyond
the item it links to.

------
wbhart
I recently used a regexp library in Scheme for lexing. The result was O(n^2)
in the number of tokens in the input stream. I searched for ages to find where
I had gone wrong and finally realised it was the regexp library itself. I
switched to Common Lisp, which has a greater choice of regexp libraries. The
same library exists there too, with the same problems, but switching to the
cl-ppcre library made the problem disappear instantly.

One problem with Scheme itself appears to be that there is no way to examine a
portion of a string. The substring function actually makes a whole new copy of
the part you want to examine (which may be the remainder of the input string
after the token you just lexed, as it was in my case).

Most of the string functions in Common Lisp are set up so that you can specify
a starting index in a string so that it is possible to examine portions of a
string without making a whole new copy.
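
Python's re module takes the same approach as those Common Lisp functions: a compiled pattern accepts a start offset, so you can lex in place instead of slicing. A sketch (token pattern and names are my own):

```python
import re

TOKEN = re.compile(r'\w+|\s+')
s = 'alpha beta gamma'

# Copying approach (analogous to Scheme's substring): each step would
# allocate a fresh copy of the rest of the string, O(n) per token and
# O(n^2) overall:  rest = rest[m.end():]

# Index approach (analogous to a :start argument): hand the matcher an
# offset and never copy.
pos, tokens = 0, []
while pos < len(s):
    m = TOKEN.match(s, pos)  # match at offset pos; no substring created
    tokens.append(m.group())
    pos = m.end()
print(tokens)
```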

------
metageek
I wish we had a term that meant "things like regular expressions, that may not
actually define regular languages". A real regular expression can always be
translated to a DFA, and be executed in O(len(input)) time. Perl regexpen are
not regular.
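
To make the DFA point concrete, here is a hand-built transition table (an assumed example for the regular language (a|b)*abb, not generated by any particular tool) that matches in exactly one table lookup per input character:

```python
# DFA for (a|b)*abb; input is assumed to be over the alphabet {a, b}.
DFA = {
    0: {'a': 1, 'b': 0},
    1: {'a': 1, 'b': 2},
    2: {'a': 1, 'b': 3},
    3: {'a': 1, 'b': 0},
}
ACCEPT = {3}

def matches(s):
    state = 0
    for ch in s:
        state = DFA[state][ch]  # one lookup per character: O(len(s))
    return state in ACCEPT

print(matches('ababb'), matches('abab'))
```

No backtracking is possible: the current state is the only bookkeeping, so the run time is O(len(input)) regardless of the pattern's structure.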

