A new Regular Expression Matching paper ‎(Philip Bille)‎

crux_ · on July 20, 2009

To save time for those after the paper itself (actually an extended abstract, but that's good enough for me for now):

http://www.philipbille.org/files/c10.pdf?attredirects=0

Executive summary (from me, a hasty reader and a sloppy thinker): It matches multiple characters at a time by taking advantage the fact that, typically, more than one character its in a machine word. This quickly becomes hairy, leading to over 50% of the remaining text written in the LaTeX math environment.

Less jokingly, in order to understand what's going on, I think you need to have read+understood Myers' regex algorithm first, which speeds up the straightforward Thompson NFA O(nm) matching by using a table scheme, which they then extend to support performing transitions for multiple characters at once.

Finally ... I'm not sure how much practical use this improvement really is, given:

> "Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

From http://swtch.com/~rsc/regexp/regexp1.html

carterschonwald · on July 20, 2009

This is about as awesome hackery piece of theoretical computer science as you can get, the first asymptotic improvement to regular expression matching in at least 17 years!

carterschonwald · on July 20, 2009

to quote from the paper's abstract, the improvement is We show how to solve this ubiquitous task in linear space and O(nm(log log n)/(log n)3/2 +n+m) time where m is the length of the expression and n the length of the string. This is the ﬁrst improvement for the dominant O(nm/ log n) term in Myers’ O(nm/ log n + (n + m) log n) bound [JACM 1992]. We also get improved bounds for external memory.

yannis · on July 20, 2009

Mind boggling work!