

The true power of regular expression - RohitS5
http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html

======
eru
> But when programmers talk about “regular expressions” they aren’t talking
> about formal grammars. They are talking about the regular expression
> derivative which their language implements. And those regex implementations
> are only very slightly related to the original notion of regularity.

That's why we call those derivatives "Franken-xpressions". They can take
exponential time in the worst case. See "Regular Expression Matching Can Be
Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)"
(<http://swtch.com/~rsc/regexp/regexp1.html>).

The additional power of Franken-xpressions comes at a cost.

For general purpose parsing, I suggest using a parser combinator library.

------
andrewcooke
in case it's not clear - the "regular expressions" he's using are perl's
(well, based on perl's - PCRE) which are notoriously extended past the
theoretical definition, or past what almost all other regexp libs / languages
provide.

the examples won't work in python or java, for example.

~~~
abecedarius
Right. In Python you might check out my
<https://github.com/darius/peglet/blob/master/peglet.py> which is PEG-based
but tries to feel like a modest extension of the built-in re module.

------
garethadams
"Your [computer] scientists were so preoccupied with whether or not they
_could_ , they didn't stop to think if they _should_ "

------
daGrevis
<https://news.ycombinator.com/item?id=5109549>

------
ggchappell
A nice little article. But I think it misses a noteworthy issue.

> Context-sensitive languages are something that you will rarely encounter
> during “normal” programming.

In my experience, context-sensitive languages, _as a class_ , are _never_
encountered in programming. Certainly, one might need to deal with a formal
language that happens to be context-sensitive but not context-free; however,
the fact that it lies in the context-sensitive class does not _ever_ seem to
be a useful fact.

The concepts of regular, context-free, and recursively-enumerable languages
are obviously useful ones for us. But I think the only reason programmers ever
mention the class of context-sensitive languages is that they feel obligated
to, having mentioned the other three classes in Chomsky's hierarchy.

If anyone knows of an instance where the fact the concept of context-sensitive
languages/grammars was useful in a computational context[1], I'd like to hear
about it.

[1] Heck, _any_ context. Are context-sensitive grammars really useful in the
study of natural languages? I wouldn't be surprised if the answer turns out to
be "no".

------
nly
One of my favourite modern regex libraries for C++ is Boost.Xpressive, which
allows for parsing context-free grammars using recursively nested rules. After
fighting with Boost.Spirit (a more formal parser framework) for some time, I
achieved what I wanted with Xpressive in a fraction of the time.

From the manual:

 _The theoretical computer scientists out there will correctly point out that
a self-referential regular expression is not "regular", so in the strict
sense, xpressive isn't really a regular expression engine at all. But as Larry
Wall once said, "the term [regular expression] has grown with the capabilities
of our pattern matching engines, so I'm not going to try to fight linguistic
necessity here."_

------
gilney
> (Reminder: When I say “regular expression” here I obviously mean it in the
> programmer sense, not the formal language theory sense.)

Once I read an article where the autor recomended using "regular expression"
meaning the formal theory and "regex" to point out the (more powerfull)
implementation. I think it's a good way to provide a common vocabulary.

------
bwindels
The main point against using regular expression for me is that it takes
developers that are new to the code (or yourself in a few months) so long to
figure out what is happening there, not that they are not powerful. The fact
that there is no way to debug them in most environments and no comments does
not help.

~~~
logn
I've been on several projects maintaining code by people who were ignorant to
regex or otherwise refused to use them. Their code is littered with indexOf
operations and charAt and all sorts of things like that. And they build those
calls up one after the other creating all sorts of intermediate variables.
It's a gigantic mess and I can usually remove 50-80% of that code (my guess)
with regex. I think regex is far more maintainable and easier to read.

As far as no comments or not being able to debug, you can place each regex in
a clearly named variable. And for debugging, there are a wealth of external
tools out there, and in Java you certainly can step through them in a debugger
(if you like pain). Like code, there are ways to write regex that are more
readable and maintainable than others. We should make our regexes clear and
used in conjunction with well thought out code. Then the software is a joy to
work with, after spending 2 days studying regular-expressions.info.

------
lightknight
So...on the off-chance that the default SDKs for Java / .Net do not implement
this natively, anyone have a link to some (preferably free) libraries that do
make use of this implementation?

~~~
stefantalpalaru
<http://www.pcre.org/>

~~~
gizzlon
How Perl-compatible is pcre nowadays? Does it implement all the regular
expression features in Perl?

~~~
stefantalpalaru
The Perl 5 compatibility is good. Here's a list of the differences:
<http://www.gsp.com/cgi-bin/man.cgi?topic=pcrecompat>

------
sjs1234
While these 'regular expressions' are unreadable, on the plus side they don't
ensure O(n) runtime. There should be an anti-Turing award for their creation.

