The Power of PCRE Regular Expressions (2012) (nikic.github.io)
91 points by doomrobo on June 20, 2015 | 14 comments

I've never understood the software engineering community's desire to call PCREs regular expressions simply because they have similar syntax to "formal" regular expressions.

There is good reason to distinguish the two, since they have completely different characteristics. PCREs may use an unbounded amount of extra memory, and may take exponential time. "Formal" regular expressions take linear time and memory, and can match only the regular languages. They operate completely differently (a regular automaton vs. a stack-based pattern matcher).
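The exponential-time behaviour is easy to provoke. Here's a small illustration in Python, whose `re` module is also a backtracking engine (the input size is deliberately tiny so it finishes quickly):

```python
import re

# Nested quantifiers such as (a+)+ give a backtracking engine
# exponentially many ways to partition a run of 'a's; on a failing
# input it tries them all before reporting no match.
pathological = re.compile(r'(a+)+b$')

# n is kept small on purpose: each extra 'a' roughly doubles the work,
# so inputs only a few dozen characters long can already take minutes.
n = 15
print(pathological.match('a' * n))  # no trailing 'b', so this prints None
```

A linear-time engine (RE2-style) rejects the same input after a single left-to-right pass.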

Regular expressions can NOT match HTML, PCREs CAN. Conflating the two is not helpful.

> They operate completely differently (a regular automaton vs. a stack-based pattern matcher).

It's more exciting than this, actually! An implementation of formal regexes doesn't have to simulate an automaton explicitly. It can use backtracking and still keep its linear running time. Such an implementation would use an explicit stack and also keep track of each state that has been visited at a particular position in the search text. If a state has already been visited, the search doesn't revisit it.

A backtracking implementation can be faster than a full NFA simulation if: 1) you need to track submatch boundaries, 2) you have a small regex and 3) you have small input. (2) and (3) are necessary because of the memory required to keep track of the states you've visited (proportional to len(regex) * len(search text)). (1) is necessary because you could otherwise use a DFA, which, AFAIK, does not typically track submatch boundaries. Tracking submatch boundaries in a backtracking implementation is faster than simulating an NFA because you only need to keep one copy of the capture locations around at any point in time. The NFA simulation has to copy them around a lot.

(Of the regex implementations that guarantee linear running time, C++'s RE2, Go's `regexp` and now Rust's `regex` all implement a bounded backtracking algorithm... among others!)

Part of it may be that "pee cee are eey" (or less commonly, "picker", "picker eey", etc.) doesn't really sound as nice as "regex".

Also, PCRE does stand for "Perl Compatible Regular Expression".

Actually, "plain" regular expressions/FSAs can parse HTML up to any arbitrary finite nesting depth (i.e. all those you will ever encounter in practice). The trade-off is that they need exponentially more states to do so than an equivalent CFG/pushdown automaton.

For example the a^n b^n example can be recognised up to n<=3 by the RE ^(|ab|aabb|aaabbb)$.
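Checking that claim with Python's `re` (a quick sanity test, not tied to any particular engine):

```python
import re

# Recognises a^n b^n only for n <= 3; every additional nesting depth
# needs one more alternative spelled out in full, which is where the
# exponential state blow-up comes from.
bounded = re.compile(r'^(|ab|aabb|aaabbb)$')

for s in ['', 'ab', 'aabb', 'aaabbb', 'aaaabbbb']:
    print(repr(s), bool(bounded.match(s)))  # all True except 'aaaabbbb'
```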

Fwiw browsers do this in practice, so the subset of HTML usable on the web is already regular. For example, Webkit-based browsers impose a nesting depth limit of 512.

In case anyone is curious:

WebKit and Blink both use 512. Gecko uses 200. Trident uses a limit beyond what I've tested quickly (over 4096). Presto uses 500.

If you are really wondering, PCRE means Perl-Compatible Regular Expressions. That may have something to do with it.

Alternative title: the true danger of conflating regular expressions and backtracking pattern matchers. I wonder if it would be possible to reintroduce that distinction.

Backtracking, i.e. it utilizes a stack, i.e. it is more powerful than a regular (language) pattern matcher.

Alternative title: you can use regex derivatives (PCRE, Ruby ::Regexp, Boost.Regex, ...) to match non-regular syntax but you shouldn't.

The advice still holds. In the case of HTML you should reach for an XML parser.

I think the warning should be against parsing complex domains with regex libraries, not a restriction to regular syntaxes. There are simple cases where backtracking (or some other extension) plus regex makes sense. Something like

    (\w+) \d+ \1
is simpler as a regex than any other common tool available.
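For instance, in Python (whose `re` module supports the same backreference syntax), this hypothetical "word, number, same word again" check is a one-liner:

```python
import re

# \1 refers back to whatever (\w+) captured, so the pattern matches a
# word, a number, then the *same* word again - a backreference, which
# is not expressible as a formal regular language.
pat = re.compile(r'(\w+) \d+ \1')

print(bool(pat.search('error 404 error')))    # True: 'error' repeats
print(bool(pat.search('error 404 warning')))  # False: words differ
```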

> Alternative title: you can use regex derivatives...

This is particularly confusing because you mean "derivative" in the sense of "etymologically/conceptually related", but there is actually a (formal) notion of the derivative of a regular expression (the Brzozowski derivative).

Good job on the 3-SAT encoding, it was very clear.

Ok, we added "PCRE" to the title to satisfy the unsatisfied. Though now it's like "PIN number".

I wrote about a different approach (via the SKI combinator calculus and Rule 110) on PerlMonks: http://www.perlmonks.org/?node_id=809842 .
