
The Power of PCRE Regular Expressions (2012) - doomrobo
https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
======
alipang
I've never understood the software engineering community's desire to call
PCREs regular expressions simply because they have similar syntax to "formal"
regular expressions.

There is good reason to distinguish the two, since they have completely
different characteristics. PCREs may use an unbounded amount of extra memory,
and may use exponential time. "Formal" regular expressions take linear time
and memory, and can match only the regular languages. They operate completely
differently (a finite automaton vs. a stack-based pattern matcher).

Regular expressions can NOT match HTML, PCREs CAN. Conflating the two is not
helpful.

~~~
nmadden
Actually, "plain" regular expressions/FSAs _can_ parse HTML up to any
arbitrary finite nesting depth (i.e. all those you will ever encounter in
practice). The trade-off is that they need exponentially more states to do so
than an equivalent CFG/pushdown automaton.

For example the a^n b^n example can be recognised up to n<=3 by the RE
^(|ab|aabb|aaabbb)$.
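
A minimal sketch of how that bounded pattern could be generated for any fixed depth, using Python's `re` module. The helper name `bounded_anbn` is made up for illustration; the point is only that the pattern is a plain alternation, with no recursion or backreferences, and that it grows with the depth limit.

```python
import re

def bounded_anbn(max_n: int) -> re.Pattern:
    """Build a regex matching a^n b^n for 0 <= n <= max_n (hypothetical helper)."""
    # One alternative per allowed n: "", "ab", "aabb", ...
    alternatives = ["a" * n + "b" * n for n in range(max_n + 1)]
    return re.compile("^(" + "|".join(alternatives) + ")$")

pattern = bounded_anbn(3)

def is_balanced(text: str) -> bool:
    """True if text is a^n b^n with n within the compiled bound."""
    return pattern.match(text) is not None
```

With `max_n = 3` this yields exactly the pattern above; anything nested deeper than the bound (e.g. `aaaabbbb`) is rejected.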

~~~
_delirium
Fwiw browsers do this in practice, so the subset of HTML usable on the web is
already regular. For example, Webkit-based browsers impose a nesting depth
limit of 512.

~~~
gsnedders
In case anyone is curious:

WebKit and Blink both use 512. Gecko uses 200. Trident uses a limit beyond
what I've tested quickly (over 4096). Presto uses 500.

------
plesner
Alternative title: the true danger of conflating regular expressions and
backtracking pattern matchers. I wonder if it would be possible to reintroduce
that distinction.

~~~
igravious
Backtracking, i.e. it utilizes a stack, i.e. it is more powerful than a
regular (language) pattern matcher.

Alternative title: you can use regex derivatives (PCRE, Ruby ::Regexp,
Boost.Regex, ...) to match non-regular syntax but you shouldn't.

The advice still holds. In the case of HTML you should reach for an XML
parser.

~~~
Veedrac
I think the warning should be against parsing complex domains with regex
libraries, not a restriction to regular syntaxes. There are simple cases where
backtracking (or some other extension) plus regex makes sense. Something like

    (\w+) \d+ \1

is simpler as a regex than any other common tool available.
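
For illustration, here is roughly how that backreference pattern behaves in Python's `re` module (a backtracking engine that supports `\1`). The function name `find_repeated_word` and the sample strings are made up for this sketch.

```python
import re

# A word, a space, a run of digits, a space, then the same word again.
# \1 is a backreference to group 1, which a "formal" regular expression lacks.
PATTERN = re.compile(r"(\w+) \d+ \1")

def find_repeated_word(text: str):
    """Return the repeated word if the pattern occurs in text, else None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None
```

For example, `find_repeated_word("foo 42 foo")` returns `"foo"`, while `find_repeated_word("foo 42 bar")` returns `None`.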

------
mpu
Good job on the 3-SAT encoding, it was very clear.

------
dang
Ok, we added "PCRE" to the title to satisfy the unsatisfied. Though now it's
like "PIN number".

------
JadeNB
I wrote about a different approach (via the SKI combinator calculus and Rule
110) on PerlMonks:
http://www.perlmonks.org/?node_id=809842

