Hacker News new | past | comments | ask | show | jobs | submit login
RE2/J: Linear time regular expression matching in Java (github.com/google)
42 points by fmela on Feb 18, 2015 | hide | past | favorite | 8 comments



Lucene has had this capability for a while now. A regexp query (as defined by the RegExp class) is compiled into an automaton for quick comparison with the inverted index, also represented as an automaton. Automatons and FSTs are used extensively by Lucene these days. Out of curiosity, I did some playing around with Lucene's RegExp class when it first came out (or rather, when I first learned about it). It provides a really interesting way to build regular expressions: https://gist.github.com/mumrah/6104234

Javadoc: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/...

Source: https://github.com/apache/lucene-solr/blob/trunk/lucene/core...


Related: the dk.brics.automaton library (http://www.brics.dk/automaton/) is excellent.

It's got a much different API than java.util.RegEx, but it let's you work with Automatons as first class things instead of operating just with regexes. Being able to compute intersections, unions, shortest match examples, from multiple automata etc.. can be really useful.


Lucene's regexp querying is based on http://www.brics.dk/automaton/

It's fast, but limited too: no word boundary matching for example.


In addition to writing RE2, Russ Cox has a set of very readable articles on implementing efficient regular expressions here: http://swtch.com/~rsc/regexp/


Obligatory comment about how if you can't match it in linear time then it's not a regular expression.


So Perl and Python don't have regex then?


No, they are not regular.

See http://en.wikipedia.org/wiki/Regular_language for more.


Something missing from all such libraries that I find a little disappointing, is the ability to compile multiple regular expressions into a single evaluation with a paired action on the result, so that the regexp becomes a general purpose map with string domain. At its most basic, each regexp could then be paired with a different replacement operation. Or an arbitrary more complex action.

With a DFA representation (linear time evaluation), this is pretty powerful. I've a regexp library that supports this, that I've been meaning to spruce up for wider release, but it's been several years and I've yet to do it.

https://code.google.com/p/jjoost/source/browse/#hg%2Fjjoost-...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: