
Regular Expression Matching in the Wild - renata
http://swtch.com/~rsc/regexp/regexp3.html
======
avar
After reading Russ's original article in 2007 I did some work on bringing the
nascent regex API in perl's core up to par so one could lexically replace
perl's regex engine with the plan9 engine, PCRE and others.

One interesting product of this work was the ability to compare other regex
engines to Perl's in perl's own test suite. See this journal entry for some
info about that: <http://use.perl.org/~avar/journal/33585>

Maybe I'll resurrect some of that work once Perl 5.12 comes out and try
dropping in RE2 and see how it compares to PCRE and Perl on the various edge
cases involved.

This would be somewhat easier if RE2 supported Perl's syntax for named
captures.
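
(A minimal sketch of the syntax gap, not a change to RE2's parser. Perl spells a named capture `(?<name>...)`, while Python's `re` module spells it `(?P<name>...)`. Assuming the target engine accepts the Python-style form, a naive textual rewrite could bridge the two. The helper name `perl_to_p_named` is hypothetical, and the rewrite deliberately ignores escaped parentheses and character classes.)

```python
import re

# Matches the opening of a Perl-style named group, e.g. "(?<year>".
# Lookbehinds "(?<=" and "(?<!" are left alone because a group name
# must start with a letter or underscore.
_PERL_NAMED = re.compile(r"\(\?<([A-Za-z_][A-Za-z0-9_]*)>")


def perl_to_p_named(pattern: str) -> str:
    """Rewrite (?<name>...) as (?P<name>...). Naive: no escape handling."""
    return _PERL_NAMED.sub(r"(?P<\1>", pattern)


# Usage: convert a Perl-style pattern, then run it with Python's re module.
converted = perl_to_p_named(r"(?<year>\d{4})-(?<month>\d{2})")
match = re.match(converted, "2010-03")
print(converted)
print(match.group("year"), match.group("month"))
```

A real port would do this inside the regex parser rather than on the pattern text, since only the parser knows whether a `(` is escaped or sits inside a character class.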

~~~
rsc
It's easy enough to tweak the parser (re2/parse.cc) to add it. Please send me
an email (rsc@swtch.com) if you do run the tests. I'd be interested to see the
results.

Thanks.

~~~
avar
I'll make sure to do that, though no promises that I'll actually get around
to it.

Perl and others could definitely use the test work that was done on RE2,
though. The way the tests are being automatically generated is really neat.

------
crux_
Interesting fact: If you restrict yourself in certain ways (e.g. no
backreferences), it's possible to write an automatic "subtype" checking system
for regular expressions.

(By this I mean you can prove, for two regexes, that if the first matches
something then the second will always also match that same text. You can also
test for exact equivalence by making sure the relation holds in both
directions.)
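
(The check described above can be sketched for a tiny regex language with no backreferences. This is one illustrative approach using Brzozowski derivatives, not necessarily the algorithm from the papers I implemented. The constructor names, the tuple encoding, and the explicit `alphabet` argument are all assumptions made for the sketch.)

```python
EMPTY = ("empty",)  # matches nothing
EPS = ("eps",)      # matches only the empty string


def char(c):
    return ("chr", c)


def cat(r, s):
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)


def alt(*rs):
    # A flattened frozenset keeps alternation associative, commutative and
    # idempotent, which bounds the number of distinct derivatives.
    items = set()
    for r in rs:
        if r[0] == "alt":
            items |= r[1]
        elif r != EMPTY:
            items.add(r)
    if not items:
        return EMPTY
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items))


def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)


def nullable(r):
    """True if r matches the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    return False  # "empty" and "chr"


def deriv(r, c):
    """Regex matching what may follow c in a string matched by r."""
    tag = r[0]
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "cat":
        d = cat(deriv(r[1], c), r[2])
        return alt(d, deriv(r[2], c)) if nullable(r[1]) else d
    if tag == "alt":
        return alt(*(deriv(x, c) for x in r[1]))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    return EMPTY  # "empty" and "eps"


def included(a, b, alphabet):
    """True if every string matched by a is also matched by b."""
    seen, todo = set(), [(a, b)]
    while todo:
        pair = todo.pop()
        if pair in seen:
            continue
        seen.add(pair)
        x, y = pair
        if nullable(x) and not nullable(y):
            return False  # some string is accepted by a but not by b
        if x != EMPTY:
            todo.extend((deriv(x, c), deriv(y, c)) for c in alphabet)
    return True


def equivalent(a, b, alphabet):
    return included(a, b, alphabet) and included(b, a, alphabet)


ab_star = star(cat(char("a"), char("b")))    # (ab)*
any_star = star(alt(char("a"), char("b")))   # (a|b)*
print(included(ab_star, any_star, "ab"), included(any_star, ab_star, "ab"))
```

The search walks pairs of derivatives in lockstep, which amounts to exploring the product of the two automata without building either one up front.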

I've used this before in code that dealt with XML schema (which are roughly
regular expressions over trees), but it strikes me that it could be highly
useful to have for the analysis & optimization described...

Examples might be an optimization database (if you see a part of a regex
that's equivalent to X, substitute Y instead), or automatically generating
faster-but-coarser regexes.

~~~
bdr
Do you know whether the restrictions you're talking about limit regexps to the
set of actual regular languages?

~~~
crux_
The algorithm I used should work with anything that's a "Regular Hedge
Grammar".

An off-the-cuff understanding is that a regular hedge grammar is a CFG
written such that any recursion in the grammar is tail recursion. (Hopefully
that makes sense, although I'm mixing terms from different domains there.)

So, yes.

(n.b.: I haven't worked much with the theory; just implemented a couple of
papers. :) )

~~~
baddox
I'm not understanding what you're saying. Whether a grammar on one level of
the Chomsky hierarchy is also a grammar on a lower level is an undecidable
problem.

------
ximeng
A couple of the image links are broken. Try

<http://pdos.csail.mit.edu/~rsc/regexp-img/script_Greek.png>
<http://pdos.csail.mit.edu/~rsc/regexp-img/cat_Lu.png>

for the trees used to match Greek script and upper case letters (Unicode
category Lu) respectively.

Another interesting article from the guy who wrote Google Code Search if I'm
not mistaken.

~~~
rsc
Image links fixed.

~~~
ableal
Thanks for the great article, but I'd also suggest s/cacophany/cacophony/

(If theophany is "an appearance of a deity to man", 'cacophany' sounds like
"shit happens" ;-)

~~~
rsc
That's probably an even more appropriate term for modern regular expression
syntax, but you're right, it isn't the sense I was going for. Fixed, thanks.

------
andrewcooke
At last - been waiting for this. Will rewrite LEPL's regexp engine as soon as
I can get the current release out...

If you haven't read the previous articles, they are classics and really worth
the effort (linked at start of this one).

------
mkinsella
Great article. I've been interested in compilers since my college class and
this scratches the itch.

------
chasingsparks
This should be included by professors teaching, or programmers working
through, the Dragon Book. It really does a great job reinforcing some ideas.

------
andrewcooke
See also <http://news.ycombinator.com/item?id=1184971> which links to
<http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html>
(source for this).

