Recursive Regular Expressions

madars · on Dec 14, 2009

Shouldn't

    my $regex = qr/0(??{$regex})*1/; # (1)

rather be

    my $regex = qr/0(??{$regex})?1/; # (2)

? Otherwise it matches many other strings (e.g. 001011) in addition to 0^n 1^n; you actually want it to match zero or one times instead of zero or more. Unfortunately the explaination why (1) works is a bit faulty.

I didn't know recursive regexes before, thanks for the article.

pkrumins · on Dec 14, 2009

You're right! I just fixed this serious error in the article and updated the explanation. Thanks!

I was thinking about ? but wrote it down as * without realizing that * may insert several $regex$regex... thus breaking the regex.

tshaddox · on Dec 14, 2009

To sum it all up: most "regular expression" implementations are actually powerful enough to describe context-free grammars, which are essentially regular expressions mixed with balanced parentheses. The fact that the author manages to never use the term "context-free grammar" leads me to believe he's unduly surprised at his discovery.

pkrumins · on Dec 14, 2009

Well said. I will include this in the article. Couldn't figure out how to say it well before because they are still not equivalent to context-free languages, they are just sometimes powerful enough to describe some of the context-free languages. Therefore I didn't include it. I think only the regex implementations with recursion are context-free. Others are not, therefore I wouldn't say "most" of them are.

You're right that I was surprised because I hadn't heard of recursive regular expressions in Perl before.

Update: I just remembered that extended regular expressions can even recognize recursive languages. Here is a regex that decides if the number is a prime:

     perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'

This goes beyond context-free, therefore I couldn't just say "context-free" carelessly in the article.

tshaddox · on Dec 15, 2009

Technically, a CFG is the intersection of an n-balanced parentheses language and a regular language. Therefore, any regular language can be described by a CFG (but obviously not vice-versa).

I am completely unfamiliar with Perl and its regular expressions, but from your article and this comment I get the impression that they are Turing-complete, which would mean they're capable of describing any formal language as well as doing any "doable" computation. Is that correct?

pkrumins · on Dec 17, 2009

Just found this comment.

I am not exactly sure if they are Turing-complete. The code in (??{code}) can be anything, so I guess, they are.

haberman · on Dec 14, 2009

This is incorrect. What is being described is closer to PEGs. A parser for context-free grammars will correctly handle ambiguity between alternatives, whereas a PEG (and modern regex engines) will just try all of the alternatives and choose the first one that matches.

andreyf · on Dec 14, 2009

Allison Randal, the chief architect of the Parrot virtual machine (that Perl 6 runs on) has one of the most under-rated views regarding this --

There's an odd misconception in the computing world that writing compilers is hard. This view is fueled by the fact that people don't write compilers very often. People used to think writing CGI code was hard. Well, it is hard, if you do it in C without any tools. I don't know anyone who writes CGI code in C anymore. If we wrote compilers as often as we write shopping carts, or web forums, or wikis, there would be just as many tools available to make the job easy. (And just like web tools, only 10% of them would be worth using. That's evolution in action.)

—Allison Randal, "Parrot Compiler Tools"

Source: http://www.lohutok.net/

pkrumins · on Dec 14, 2009

Completely true. In fact nothing is hard in the computing world. Everything is doable. And if someone says it's not doable, they are not trying hard enough.

scott_s · on Dec 14, 2009

Write a program that determines whether or not any arbitrary program goes into an infinite loop.

pkrumins · on Dec 14, 2009

You can still try hard and cover many cases where it would go into an infinite loop. And then try even harder and cover even more cases.

scott_s · on Dec 14, 2009

And you still won't have written a program that determines whether any arbitrary program goes into an infinite loop.

warfangle · on Dec 14, 2009

If we can write a program that translates said arbitrary program into the rules for an n-state, m-symbol turing machine, where n = (2|3|4|5) and m = 2, then yes - we can determine if the program terminates or not.

We can make some fuzzy guesses whether or not the program terminates or not in other state/symbol combinations.

http://en.wikipedia.org/wiki/Busy_beaver

If we ever did write said program that could determine whether any /arbitrary/ program is infinite, then we have solved the halting problem. And that would be quite something.

pkrumins · on Dec 14, 2009

You're right. I wouldn't have.

chromatic · on Dec 14, 2009

> In fact nothing is hard in the computing world.

Allison's point is instead that the combination of the perception that compilers are difficult and the relative paucity of tools to make writing compilers easier is a vicious cycle which writing better tools might fix.

compay · on Dec 14, 2009

It's a neat trick but as the author says, it's probably too cryptic to use for "real" code.

Lua's LPeg lets you create simple recursive grammars to solve problems that are often hard with regular expressions (e.g., matching balanced parentheses). It's the one library from Lua I find myself seriously missing when programming in other languages.

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html#grammar

scotty79 · on Dec 14, 2009

It's also supported in PHP preg_ functions, but instead (??{$regex}) you write (?R)

moeffju · on Dec 14, 2009

(OT) I would have to upvote this even if it were only the regeXzibit pic.

vlisivka · on Dec 14, 2009

Perl6 already contains grammars.

Perl5 backported "named captures" from perl6:

  #!/usr/bin/perl
  use warnings;
  use strict;
  use feature qw(switch say);
  #use locale;
  #use re 'debug';

  my $expression = qr{
      (?(DEFINE)
            (?<expr>       (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* ; (?&space)* )
            (?<term>       (?&identifier)
                           | (?&number)
                           | (?&string)
            )
            (?<opterm>     (?&space)* (?&operator) (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* )
            (?<operator>   [=+*/-]                                          )
            (?<identifier> [A-Za-z_][A-Za-z0-9_]+            )
            (?<number>     [0-9]+                                           )
            (?<string>     (?&quote) ( (?&char) | (?&bsquote) )*? (?&quote) )
            (?<quote>      \"                                               )
            (?<char>       [^\\\"]                                          )
            (?<bsquote>    \\.                                              )
            (?<space>      [ \t]                                            )
      )
      (?&expr)
    }x;

  while (<>) {
  chomp;
  if ( /^ $expression $/x ) {
    say "Correct expression: $_";
  }else {
    say "Incorrect expression: $_";
  }
  }

pkrumins · on Dec 14, 2009

This is an awesome backport. Thanks for reporting.

uriel · on Dec 15, 2009

See also Structural Regular Expressions by Rob Pike:

http://doc.cat-v.org/bell_labs/structural_regexps/