Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Recursive Regular Expressions (catonmat.net)
44 points by pkrumins on Dec 14, 2009 | hide | past | favorite | 21 comments


Shouldn't

    my $regex = qr/0(??{$regex})*1/; # (1)
rather be

    my $regex = qr/0(??{$regex})?1/; # (2)
? Otherwise it matches many other strings (e.g. 001011) in addition to 0^n 1^n; you actually want it to match zero or one times instead of zero or more. Unfortunately the explaination why (1) works is a bit faulty.

I didn't know recursive regexes before, thanks for the article.


You're right! I just fixed this serious error in the article and updated the explanation. Thanks!

I was thinking about ? but wrote it down as * without realizing that * may insert several $regex$regex... thus breaking the regex.


To sum it all up: most "regular expression" implementations are actually powerful enough to describe context-free grammars, which are essentially regular expressions mixed with balanced parentheses. The fact that the author manages to never use the term "context-free grammar" leads me to believe he's unduly surprised at his discovery.


Well said. I will include this in the article. Couldn't figure out how to say it well before because they are still not equivalent to context-free languages, they are just sometimes powerful enough to describe some of the context-free languages. Therefore I didn't include it. I think only the regex implementations with recursion are context-free. Others are not, therefore I wouldn't say "most" of them are.

You're right that I was surprised because I hadn't heard of recursive regular expressions in Perl before.

Update: I just remembered that extended regular expressions can even recognize recursive languages. Here is a regex that decides if the number is a prime:

     perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'
This goes beyond context-free, therefore I couldn't just say "context-free" carelessly in the article.


Technically, a CFG is the intersection of an n-balanced parentheses language and a regular language. Therefore, any regular language can be described by a CFG (but obviously not vice-versa).

I am completely unfamiliar with Perl and its regular expressions, but from your article and this comment I get the impression that they are Turing-complete, which would mean they're capable of describing any formal language as well as doing any "doable" computation. Is that correct?


Just found this comment.

I am not exactly sure if they are Turing-complete. The code in (??{code}) can be anything, so I guess, they are.


This is incorrect. What is being described is closer to PEGs. A parser for context-free grammars will correctly handle ambiguity between alternatives, whereas a PEG (and modern regex engines) will just try all of the alternatives and choose the first one that matches.


Allison Randal, the chief architect of the Parrot virtual machine (that Perl 6 runs on) has one of the most under-rated views regarding this --

There's an odd misconception in the computing world that writing compilers is hard. This view is fueled by the fact that people don't write compilers very often. People used to think writing CGI code was hard. Well, it is hard, if you do it in C without any tools. I don't know anyone who writes CGI code in C anymore. If we wrote compilers as often as we write shopping carts, or web forums, or wikis, there would be just as many tools available to make the job easy. (And just like web tools, only 10% of them would be worth using. That's evolution in action.)

—Allison Randal, "Parrot Compiler Tools"

Source: http://www.lohutok.net/


Completely true. In fact nothing is hard in the computing world. Everything is doable. And if someone says it's not doable, they are not trying hard enough.


Write a program that determines whether or not any arbitrary program goes into an infinite loop.


You can still try hard and cover many cases where it would go into an infinite loop. And then try even harder and cover even more cases.


And you still won't have written a program that determines whether any arbitrary program goes into an infinite loop.


If we can write a program that translates said arbitrary program into the rules for an n-state, m-symbol turing machine, where n = (2|3|4|5) and m = 2, then yes - we can determine if the program terminates or not.

We can make some fuzzy guesses whether or not the program terminates or not in other state/symbol combinations.

http://en.wikipedia.org/wiki/Busy_beaver

If we ever did write said program that could determine whether any /arbitrary/ program is infinite, then we have solved the halting problem. And that would be quite something.


You're right. I wouldn't have.


> In fact nothing is hard in the computing world.

Allison's point is instead that the combination of the perception that compilers are difficult and the relative paucity of tools to make writing compilers easier is a vicious cycle which writing better tools might fix.


It's a neat trick but as the author says, it's probably too cryptic to use for "real" code.

Lua's LPeg lets you create simple recursive grammars to solve problems that are often hard with regular expressions (e.g., matching balanced parentheses). It's the one library from Lua I find myself seriously missing when programming in other languages.

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html#grammar


It's also supported in PHP preg_ functions, but instead (??{$regex}) you write (?R)


(OT) I would have to upvote this even if it were only the regeXzibit pic.


Perl6 already contains grammars.

Perl5 backported "named captures" from perl6:

  #!/usr/bin/perl
  use warnings;
  use strict;
  use feature qw(switch say);
  #use locale;
  #use re 'debug';

  my $expression = qr{
      (?(DEFINE)
            (?<expr>       (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* ; (?&space)* )
            (?<term>       (?&identifier)
                           | (?&number)
                           | (?&string)
            )
            (?<opterm>     (?&space)* (?&operator) (?&space)* (?&term) (?&space)* (?&opterm)? (?&space)* )
            (?<operator>   [=+*/-]                                          )
            (?<identifier> [A-Za-z_][A-Za-z0-9_]+            )
            (?<number>     [0-9]+                                           )
            (?<string>     (?&quote) ( (?&char) | (?&bsquote) )*? (?&quote) )
            (?<quote>      \"                                               )
            (?<char>       [^\\\"]                                          )
            (?<bsquote>    \\.                                              )
            (?<space>      [ \t]                                            )
      )
      (?&expr)
    }x;

  while (<>) {
  chomp;
  if ( /^ $expression $/x ) {
    say "Correct expression: $_";
  }else {
    say "Incorrect expression: $_";
  }
  }


This is an awesome backport. Thanks for reporting.


See also Structural Regular Expressions by Rob Pike:

http://doc.cat-v.org/bell_labs/structural_regexps/




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: