
RE2: robust regexp library by Google - vr
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
======
dasht
Mr. Cox generally speaking has his act together on regular expressions - he's
more or less joined the very small number of real experts on the topic. I
especially admire his courage in promoting a regular expression engine that
disregards popular features not found in true regular languages, such as
backreferences.

That said, some pedantic gripes:

1\. It is not true (or is at least confusingly stated on the project page)
that searches are linear in space and time in the length of the regular
expression and the length of the target string. The simpler and more accurate
statement is that searches are (worst case) linear time-wise in the _product_
of the lengths of the regexp and the search string, and O(1) in space.
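
For third parties: the reason it's the product is the classic set-of-states
simulation. Here's a toy sketch of the idea (written for this comment, not
RE2's code; epsilon transitions are omitted and std::set adds a log factor a
real engine would avoid). At each of the n input bytes you advance a set of
at most m NFA states, hence the n*m worst case:

    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy NFA: nfa[s] lists (byte, next-state) pairs; the highest-numbered
    // state is the accepting one. Epsilon transitions are omitted.
    using Nfa = std::vector<std::vector<std::pair<char, int>>>;

    bool matches(const Nfa& nfa, const std::string& input) {
        std::set<int> current = {0};             // at most m live states
        for (char c : input) {                   // n iterations
            std::set<int> next;
            for (int s : current)                // at most m states per step
                for (const auto& [b, t] : nfa[s])
                    if (b == c) next.insert(t);
            current = std::move(next);
        }
        return current.count((int)nfa.size() - 1) > 0;
    }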

2\. Credit where credit is due: the caching-DFA technique was first widely
popularized in old editions of the so-called Dragon Book (Aho, Sethi, and
Ullman's "Compilers: Principles, Techniques, and Tools." My copy was published
in 1985. In personal correspondence with some old Bell Labs guys, I'm informed
that someone (Thompson, as I recall) puzzled out the technique in the 1970s or
perhaps the very early 1980s - but set it aside as being too hairy for their
needs at the time. Cox should ask Pike to correct my memory on that. The
technique has been practiced in a number of matchers since that time. In fact,
an O(1) memory DFA caching implementation has been available for more than a
decade as free software although its author (uh, me) hastens to add that he
has little doubt that Mr. Cox has an implementation with novel charms that was
well worth doing. My implementation got too hairy in some ways.

3\. Perhaps it will change over time but in this release, at least per the
documentation, when the cache of DFA states is filled, the entire cache is
discarded. This needlessly limits the range of applicability of the
matcher. An incremental approach to cache clearing using a (perhaps weighted)
LRU strategy would expand the reach of this implementation at relatively
little cost. This is _not_ to say that Mr. Cox made a bad decision taking the
simpler approach first (mumble mumble premature optimization mumble evil
mumble mumble just constant factors something something something the dark
side).
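
Concretely, the kind of thing I mean is sketched below -- hypothetical code,
not Rx's and not RE2's, and it conveniently dodges the hard part
(invalidating the transitions that point _into_ an evicted state):

    #include <cstddef>
    #include <list>
    #include <unordered_map>
    #include <utility>

    // Hypothetical LRU eviction for a DFA-state cache. Key would identify
    // a state (e.g. the underlying NFA state set); State would hold the
    // cached transitions. Assumes std::hash<Key> exists.
    template <typename Key, typename State>
    class LruStateCache {
        using Entry = std::pair<Key, State>;
        std::size_t capacity_;
        std::list<Entry> order_;                  // front = most recently used
        std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
      public:
        explicit LruStateCache(std::size_t capacity) : capacity_(capacity) {}

        State* lookup(const Key& k) {
            auto it = index_.find(k);
            if (it == index_.end()) return nullptr;
            order_.splice(order_.begin(), order_, it->second);  // touch
            return &it->second->second;
        }

        void insert(const Key& k, State s) {
            if (order_.size() >= capacity_) {     // evict only the coldest
                index_.erase(order_.back().first);
                order_.pop_back();
            }
            order_.emplace_front(k, std::move(s));
            index_[k] = order_.begin();
        }
    };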

4\. His inner loop looks like it uses _way_ more instructions than are
reasonably called for although, to be certain of that, I'd have to
know what requirements he's constrained to satisfy. When running out of the
DFA cache, my implementation could (at least when last measured) get by on
something like 12 or 20 instructions per character in the string being
searched. Perhaps I'm mis-reading the code but this thing looks like it has
much higher "constant factors".

To be clear: I'm a fan of Mr. Cox's series of articles (written over several
years!) about regexps. I like his general approach so much that I _would_ wish
I'd tried it first (except that I did, just wrong place, wrong time). I think
the topic is sufficiently cool that I mention the above pedantic points mainly
for the benefit of third parties, to give them some additional entry points to
learn about it. Not trying to damn Mr. Cox at all, yadda yadda. On the
contrary, he seems to have arrived in the weird little regular expression
engine zone. Nice calling card.

~~~
rsc
Thanks for the comments. I can't tell from your user name who you are and what
free software package you're referring to: can you post a link? I'm always
interested to read a regular expression implementation.

1\. Fair enough, changed linear in X and Y to linear in X and linear in Y.

2\. I wasn't trying to highlight the caching so much as the flushing. The
Dragon book algorithm does not cache. It generates the entire DFA ahead of
time. This may be appropriate in a lexer but not in a grep. Ken Thompson's
Plan 9 grep, which I do link to, does the same caching, as does the tiny DFA
in regexp1.html, but neither can flush. Being able to flush the cache is the
big deal, because it lets you bound the memory usage. I changed the heading
from "Use the DFA as a cache" to "Be able to flush the DFA cache". More
generally, I try pretty hard to give credit where credit is due, both in this
article and in the previous two. This is well trodden ground, and I don't
think any algorithm in RE2 is original, though many are unfairly ignored. If
you know of other specific citations that are missing, please let me know. I'm
especially not trying to steal any thunder from Ken.

3\. The entire cache is flushed and then rebuilt as needed. I'd be surprised
to see a fast implementation that can flush individual cache entries. The
cache is storing a very cross-linked graph. If you wanted to flush an
individual state, you'd have to keep all the backpointers for the arrows in
the graph so that you could invalidate them when flushing that state. It is
far easier (and I bet simpler and more efficient) to flush the whole graph and
start again. But hey, it's open source. Try it and see.
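
Roughly this shape, in sketch form (not the actual RE2 code):

    #include <cstddef>
    #include <vector>

    struct DfaState {
        DfaState* next[256];   // cross-linked: entries point at other states
        bool match;
    };

    // Full-flush strategy: when the state budget is exhausted, throw away
    // the whole graph and rebuild lazily. No backpointers are needed
    // because nothing survives the flush.
    class DfaCache {
        std::vector<DfaState*> states_;
        std::size_t budget_;
      public:
        explicit DfaCache(std::size_t budget) : budget_(budget) {}

        DfaState* alloc() {
            if (states_.size() >= budget_)
                flush();                       // drop everything
            states_.push_back(new DfaState());
            return states_.back();
        }

        void flush() {
            for (DfaState* s : states_) delete s;
            states_.clear();               // caller re-derives the start state
        }
    };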

4\. The inner loop is many many lines of code but the common path through it
is a small number of instructions, due to a bunch of big if statements that
rarely run. I think you'll find that it's not too far away from 12 or 20
instructions per byte (not character).

Thanks again for the comments.

~~~
dasht
Hi, rsc. I'm Thomas Lord. Also, I got something wrong in my pedantry there,
which is the absolute worst place to get this kind of thing wrong: it's not
O(1) space - it's O(length-of-regexp), no?

Going through the points:

1\. Ok. I think you can also say, overall, linear in X times Y.

2\. The 1985 Dragon edition has, in the second to last paragraph of section
3.7, these remarks: "A third approach is to use a DFA, but avoid constructing
all of the transition table by using a technique called 'lazy transition
evaluation'. Here, transitions are computed at run time but a transition from
a given state on a given character is not determined until it is actually
needed. The computed transitions are stored in a cache. Each time a transition
is about to be made the cache is consulted. If the transition is not there, it
is computed and stored in the cache. If the cache becomes full, we can erase
some previously computed transition to make room for the new transition."

There's nothin' new under the sun, kid ;-)

(And that paragraph cost me a few years of part time hacking :-)

Oh, and, I don't think you're "stealing" anything from anyone, let alone
thunder. Just that it's well known.
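
For anyone following along, the technique in that paragraph is small enough
to show in full. Toy code for the regexp "a*b" (nobody's production
implementation):

    #include <map>
    #include <set>
    #include <utility>

    using StateSet = std::set<int>;

    // Toy NFA for "a*b": state 0 loops to itself on 'a' and moves to the
    // accepting state 1 on 'b'.
    StateSet nfa_step(const StateSet& states, unsigned char c) {
        StateSet out;
        for (int s : states) {
            if (s == 0 && c == 'a') out.insert(0);
            if (s == 0 && c == 'b') out.insert(1);
        }
        return out;
    }

    // Lazy transition evaluation, per the Dragon book: a transition is not
    // determined until it is actually needed, then memoized. If the cache
    // fills, entries can be erased to make room -- the flushing question.
    std::map<std::pair<StateSet, unsigned char>, StateSet> cache;

    const StateSet& transition(const StateSet& from, unsigned char c) {
        auto key = std::make_pair(from, c);
        auto it = cache.find(key);
        if (it == cache.end())
            it = cache.emplace(key, nfa_step(from, c)).first;
        return it->second;
    }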

It's not currently separately distributed and, like I said, the code is a bit
embarrassingly hairy and messed up -- but if you grab the source of GNU Arch
and look deep in the tree under the "hackerlab" library you'll find Rx which
has a (fairly old, probably now slightly bitrotted) implementation of a
caching DFA. Atop that -- oh, you asked for references -- is Henry Spencer's
algorithm for full Posix matching (backreferences, etc.), which I recall he
dubbed "recursive decomposition". The implementation in Rx is haired up because I
wanted both tightly controllable DFA caching _and_ the ability to match
against strings that weren't contiguous in memory -- and then it got haired up
even more when I bolted on some Unicode support for UTF-8 and UTF-16.
I've no cite for you for the Spencer thing -- personal correspondence.

I've also no cite for you for the other interesting one to look at, which
(last I knew) is the current regexp implementation in GNU libc from damnit-i'm-lame-
but-the-guy's-name-escapes-me. He did an interesting twist on how to handle
the non-regular Posix regexp features and made the (to my knowledge) first new
thing since the naive backtracking versions and since Spencer's recursive
decomposition.

With the addition of you, btw, I think there are maybe about 5 or 10 of us in
the world who know or care about any of this stuff at the nitty-gritty
implementation level.

3\. In Rx, looking at the code (haven't touched for several years), it appears
that when flushing a DFA state I did it in two stages. First, I just brute-
force away the incoming transitions and mark them as "warning, that state is
going to go away." If one of those transitions is then taken before the state
_actually_ goes away it gets points to keep that state alive in the cache.
Otherwise, the state eventually gets flushed. Off the top of my head I'm not
able to prove that this wins over the naive full flush you've got. I am sure
the naive full flush is simpler to code in a confident way - the Rx code _is_
more hairy than I'd like. I'm "pretty sure" you can get better performance on
a wider range of cases with something more like the approach I took in Rx.
It's been a long time since I've handled the code and benchmarked stuff,
though, so please accept this fine grain of salt along with those opinions.
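
If it helps, here's the shape of the two-stage idea, reconstructed from
memory (the real Rx code differs):

    // Reconstructed from memory, not the actual Rx code. States are
    // condemned rather than freed outright; taking a transition into a
    // condemned state reprieves it and earns it points.
    struct CachedState {
        bool condemned = false;
        int points = 0;
    };

    // Stage one, under cache pressure: mark, don't free.
    void condemn(CachedState* s) {
        s->condemned = true;
    }

    // On each transition into s before the reclaimer runs.
    void on_enter(CachedState* s) {
        if (s->condemned) {
            s->condemned = false;   // still hot: keep it alive
            s->points += 1;         // bias future condemnation away from it
        }
    }

    // Stage two, the reclaimer: only states never re-entered go away.
    bool should_evict(const CachedState* s) {
        return s->condemned;
    }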

4\. I'll take your word for that. Back before unicode support when I was most
intensely optimizing Rx I spent a decent amount of time looking at the
assembly code generated by the compiler and shaving instructions here and
there. Incredibly f'ed up hobby, that :-)

-------------

Those points aside, more shop talk:

Back in yonder day when I was active on that project one of my fantasies
(hence all the stuff about tunable memory usage, non-contiguous strings, etc.)
was to make a dfa engine (not the posix regexp engine, just the dfa bit) that
would rock Cisco's world so much that they'd just have to throw money at me.
Heh.

The other thing is more interesting and useful, though: More recently I did
some crazy-strange work in a backwater of the "bioinformatics" world. The
problem took weeks or months to extract from the biologists but boiled down
to: "Here are a few million regexps, all of which are of the form [big hairy
regular expression]. Here is the reference sequence for the human genome (all
3 billion base pairs or their approximations, a bit over 2 bits per base pair). Go
find a way to make a list of all the matches, indexed by which regexp matches
what part, as quickly and cheaply as possible."

Rx was useless for this, of course.
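
(The standard shape of an answer, for the curious: compile all the patterns
into one automaton and scan the text once. RE2's multi-pattern interface,
RE2::Set, captures that shape -- sketch below with stand-in patterns, not
what I actually built. Note that Set reports _which_ patterns matched, not
_where_; getting positions for millions of patterns is the part that's
actually hard.)

    #include <re2/re2.h>
    #include <re2/set.h>
    #include <string>
    #include <vector>

    int main() {
        RE2::Set set(RE2::DefaultOptions, RE2::UNANCHORED);
        std::vector<std::string> patterns = {"GATTACA+", "TATA(AA)+"};  // stand-ins
        for (const std::string& p : patterns) {
            std::string error;
            if (set.Add(p, &error) < 0) return 1;  // bad pattern
        }
        if (!set.Compile()) return 1;              // one combined automaton

        std::string genome = "GATTACAAATATAAAA";   // stand-in for 3 billion bases
        std::vector<int> matched;                  // indices of patterns that hit
        set.Match(genome, &matched);
        return 0;
    }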

The stuff I learned building Rx was the opposite of useless. I'll bet (low
stakes) that sometime down the road you'll have a similar experience.

