

Recursive Regular Expressions - pkrumins
http://www.catonmat.net/blog/recursive-regular-expressions/

======
madars
Shouldn't

    
    
        my $regex = qr/0(??{$regex})*1/; # (1)
    

rather be

    
    
        my $regex = qr/0(??{$regex})?1/; # (2)
    

? Otherwise it matches many other strings (e.g. 001011) in addition to 0^n
1^n; you actually want it to match zero or one times instead of zero or more.
Unfortunately the explaination why (1) works is a bit faulty.

I didn't know recursive regexes before, thanks for the article.

~~~
pkrumins
You're right! I just fixed this serious error in the article and updated the
explanation. Thanks!

I was thinking about ? but wrote it down as * without realizing that * may
insert several $regex$regex... thus breaking the regex.

------
baddox
To sum it all up: most "regular expression" implementations are actually
powerful enough to describe context-free grammars, which are essentially
regular expressions mixed with balanced parentheses. The fact that the author
manages to never use the term "context-free grammar" leads me to believe he's
unduly surprised at his discovery.

~~~
pkrumins
Well said. I will include this in the article. Couldn't figure out how to say
it well before because they are still not equivalent to context-free
languages, they are just sometimes powerful enough to describe some of the
context-free languages. Therefore I didn't include it. I think only the regex
implementations with recursion are context-free. Others are not, therefore I
wouldn't say "most" of them are.

You're right that I was surprised because I hadn't heard of recursive regular
expressions in Perl before.

Update: I just remembered that extended regular expressions can even recognize
recursive languages. Here is a regex that decides if the number is a prime:

    
    
         perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'
    

This goes beyond context-free, therefore I couldn't just say "context-free"
carelessly in the article.

~~~
baddox
Technically, a CFG is the intersection of an n-balanced parentheses language
and a regular language. Therefore, any regular language can be described by a
CFG (but obviously not vice-versa).

I am completely unfamiliar with Perl and its regular expressions, but from
your article and this comment I get the impression that they are Turing-
complete, which would mean they're capable of describing any formal language
as well as doing any "doable" computation. Is that correct?

~~~
pkrumins
Just found this comment.

I am not exactly sure if they are Turing-complete. The code in (??{code}) can
be anything, so I guess, they are.

------
andreyf
Allison Randal, the chief architect of the Parrot virtual machine (that Perl 6
runs on) has one of the most under-rated views regarding this --

 _There's an odd misconception in the computing world that writing compilers
is hard. This view is fueled by the fact that people don't write compilers
very often. People used to think writing CGI code was hard. Well, it is hard,
if you do it in C without any tools. I don't know anyone who writes CGI code
in C anymore. If we wrote compilers as often as we write shopping carts, or
web forums, or wikis, there would be just as many tools available to make the
job easy. (And just like web tools, only 10% of them would be worth using.
That's evolution in action.)

—Allison Randal, "Parrot Compiler Tools"_

Source: <http://www.lohutok.net/>

~~~
pkrumins
Completely true. In fact nothing is hard in the computing world. Everything is
doable. And if someone says it's not doable, they are not trying hard enough.

~~~
scott_s
Write a program that determines whether or not any arbitrary program goes into
an infinite loop.

~~~
pkrumins
You can still try hard and cover many cases where it would go into an infinite
loop. And then try even harder and cover even more cases.

~~~
scott_s
And you still won't have written a program that determines whether any
_arbitrary_ program goes into an infinite loop.

~~~
warfangle
If we can write a program that translates said arbitrary program into the
rules for an n-state, m-symbol turing machine, where n = (2|3|4|5) and m = 2,
then yes - we can determine if the program terminates or not.

We can make some fuzzy guesses whether or not the program terminates or not in
other state/symbol combinations.

<http://en.wikipedia.org/wiki/Busy_beaver>

If we ever did write said program that could determine whether any /arbitrary/
program is infinite, then we have solved the halting problem. And that would
be quite something.

------
compay
It's a neat trick but as the author says, it's probably too cryptic to use for
"real" code.

Lua's LPeg lets you create simple recursive grammars to solve problems that
are often hard with regular expressions (e.g., matching balanced parentheses).
It's the one library from Lua I find myself seriously missing when programming
in other languages.

<http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html#grammar>

------
scotty79
It's also supported in PHP preg_ functions, but instead (??{$regex}) you write
(?R)

------
moeffju
(OT) I would have to upvote this even if it were only the regeXzibit pic.

------
vlisivka
Perl6 already contains grammars.

Perl5 backported "named captures" from perl6:

    
    
      #!/usr/bin/perl
      use warnings;
      use strict;
      use feature qw(switch say);
      #use locale;
      #use re 'debug';
    
      my $expression = qr{
          (?(DEFINE)
                (?<expr>       (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* ; (?&space)* )
                (?<term>       (?&identifier)
                               | (?&number)
                               | (?&string)
                )
                (?<opterm>     (?&space)* (?&operator) (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* )
                (?<operator>   [=+*/-]                                          )
                (?<identifier> [A-Za-z_][A-Za-z0-9_]+            )
                (?<number>     [0-9]+                                           )
                (?<string>     (?&quote) ( (?&char) | (?&bsquote) )*? (?&quote) )
                (?<quote>      \"                                               )
                (?<char>       [^\\\"]                                          )
                (?<bsquote>    \\.                                              )
                (?<space>      [ \t]                                            )
          )
          (?&expr)
        }x;
    
      while (<>) {
      chomp;
      if ( /^ $expression $/x ) {
        say "Correct expression: $_";
      }else {
        say "Incorrect expression: $_";
      }
      }

~~~
pkrumins
This is an awesome backport. Thanks for reporting.

------
uriel
See also Structural Regular Expressions by Rob Pike:

<http://doc.cat-v.org/bell_labs/structural_regexps/>

