The true power of regular expression

eru · on May 27, 2013

> But when programmers talk about “regular expressions” they aren’t talking about formal grammars. They are talking about the regular expression derivative which their language implements. And those regex implementations are only very slightly related to the original notion of regularity.

That's why we call those derivatives "Franken-xpressions". They can take exponential time in the worst case. See "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)" (http://swtch.com/~rsc/regexp/regexp1.html).

The additional power of Franken-xpressions comes at a cost.

For general purpose parsing, I suggest using a parser combinator library.

andrewcooke · on May 27, 2013

in case it's not clear - the "regular expressions" he's using are perl's (well, based on perl's - PCRE) which are notoriously extended past the theoretical definition, or past what almost all other regexp libs / languages provide.

the examples won't work in python or java, for example.

abecedarius · on May 27, 2013

Right. In Python you might check out my https://github.com/darius/peglet/blob/master/peglet.py which is PEG-based but tries to feel like a modest extension of the built-in re module.

garethadams · on May 27, 2013

"Your [computer] scientists were so preoccupied with whether or not they could, they didn't stop to think if they should"

daGrevis · on May 27, 2013

https://news.ycombinator.com/item?id=5109549

ggchappell · on May 27, 2013

A nice little article. But I think it misses a noteworthy issue.

> Context-sensitive languages are something that you will rarely encounter during “normal” programming.

In my experience, context-sensitive languages, as a class, are never encountered in programming. Certainly, one might need to deal with a formal language that happens to be context-sensitive but not context-free; however, the fact that it lies in the context-sensitive class does not ever seem to be a useful fact.

The concepts of regular, context-free, and recursively-enumerable languages are obviously useful ones for us. But I think the only reason programmers ever mention the class of context-sensitive languages is that they feel obligated to, having mentioned the other three classes in Chomsky's hierarchy.

If anyone knows of an instance where the fact the concept of context-sensitive languages/grammars was useful in a computational context[1], I'd like to hear about it.

[1] Heck, any context. Are context-sensitive grammars really useful in the study of natural languages? I wouldn't be surprised if the answer turns out to be "no".

nly · on May 27, 2013

One of my favourite modern regex libraries for C++ is Boost.Xpressive, which allows for parsing context-free grammars using recursively nested rules. After fighting with Boost.Spirit (a more formal parser framework) for some time, I achieved what I wanted with Xpressive in a fraction of the time.

From the manual:

The theoretical computer scientists out there will correctly point out that a self-referential regular expression is not "regular", so in the strict sense, xpressive isn't really a regular expression engine at all. But as Larry Wall once said, "the term [regular expression] has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here."

gilney · on May 27, 2013

> (Reminder: When I say “regular expression” here I obviously mean it in the programmer sense, not the formal language theory sense.)

Once I read an article where the autor recomended using "regular expression" meaning the formal theory and "regex" to point out the (more powerfull) implementation. I think it's a good way to provide a common vocabulary.

bwindels · on May 27, 2013

The main point against using regular expression for me is that it takes developers that are new to the code (or yourself in a few months) so long to figure out what is happening there, not that they are not powerful. The fact that there is no way to debug them in most environments and no comments does not help.

logn · on May 27, 2013

I've been on several projects maintaining code by people who were ignorant to regex or otherwise refused to use them. Their code is littered with indexOf operations and charAt and all sorts of things like that. And they build those calls up one after the other creating all sorts of intermediate variables. It's a gigantic mess and I can usually remove 50-80% of that code (my guess) with regex. I think regex is far more maintainable and easier to read.

As far as no comments or not being able to debug, you can place each regex in a clearly named variable. And for debugging, there are a wealth of external tools out there, and in Java you certainly can step through them in a debugger (if you like pain). Like code, there are ways to write regex that are more readable and maintainable than others. We should make our regexes clear and used in conjunction with well thought out code. Then the software is a joy to work with, after spending 2 days studying regular-expressions.info.

lightknight · on May 27, 2013

So...on the off-chance that the default SDKs for Java / .Net do not implement this natively, anyone have a link to some (preferably free) libraries that do make use of this implementation?

stefantalpalaru · on May 27, 2013

http://www.pcre.org/

gizzlon · on May 27, 2013

How Perl-compatible is pcre nowadays? Does it implement all the regular expression features in Perl?

stefantalpalaru · on May 27, 2013

The Perl 5 compatibility is good. Here's a list of the differences: http://www.gsp.com/cgi-bin/man.cgi?topic=pcrecompat

sjs1234 · on May 27, 2013

While these 'regular expressions' are unreadable, on the plus side they don't ensure O(n) runtime. There should be an anti-Turing award for their creation.