
Doug McIlroy's C++ regular expression matching library - fanf2
https://github.com/arnoldrobbins/mcilroy-regex
======
mehrdadn
I still haven't found a single regex library that can do a few tasks that I
find critical: (1) accept the input string in pieces (i.e., lazily), (2) let
you copy the recognition automaton in the middle of recognition so that you
can run the copies on different suffixes, and (3) tell you which
characters/strings are valid suffixes from the automaton's current state.
(Well, I would also like it to (4) lazily build a DFA whenever the pattern
allows it, and (5) work on non-string data, but I've given up on these since
it seems to me nobody even cares about linear-time matching or non-string
data.)

Requirements 1-3 are extremely important when you have a trie-like data
structure that is expensive to traverse (like, say, file paths on a network
folder, or even a local disk sometimes) -- you don't want to expand nodes or
traverse edges needlessly. However, no library that I've seen lets you do
this. Has anyone else seen any?
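
To make requirements 1-3 concrete, here is a minimal sketch of the interface I
have in mind. It assumes a hand-built DFA for `ab*c` instead of a compiled
pattern, and `Dfa`, `Matcher`, `feed`, `next_chars`, and `make_abc` are
hypothetical names, not any existing library's API:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hand-built DFA; a real library would compile this from a pattern.
struct Dfa {
    std::vector<std::map<char, int> > next;  // next[state][ch] -> state
    std::set<int> accepting;
};

// Resumable matcher. The whole recognition state is one integer, so
// copying a Matcher forks the match (requirement 2).
class Matcher {
public:
    explicit Matcher(const Dfa& dfa) : dfa_(&dfa), state_(0) {
        assert(!dfa.next.empty());  // state 0 is the start state
    }

    // Requirement 1: consume the input piece by piece.
    void feed(const std::string& chunk) {
        for (char c : chunk) {
            if (state_ < 0) return;  // dead: no suffix can match any more
            const std::map<char, int>& row = dfa_->next[state_];
            std::map<char, int>::const_iterator it = row.find(c);
            state_ = (it == row.end()) ? -1 : it->second;
        }
    }

    bool dead() const { return state_ < 0; }
    bool matched() const { return dfa_->accepting.count(state_) != 0; }

    // Requirement 3: the characters that keep the match alive from here.
    std::string next_chars() const {
        std::string out;
        if (state_ >= 0) {
            for (const auto& kv : dfa_->next[state_]) out.push_back(kv.first);
        }
        return out;
    }

private:
    const Dfa* dfa_;  // not owned; must outlive the matcher
    int state_;       // -1 means dead
};

// DFA for ab*c, written out by hand for the example.
inline Dfa make_abc() {
    Dfa d;
    d.next.resize(3);
    d.next[0]['a'] = 1;
    d.next[1]['b'] = 1;
    d.next[1]['c'] = 2;
    d.accepting.insert(2);
    return d;
}
```

With this, feeding "a" and then "bb" leaves the matcher alive, `next_chars()`
reports which edges are worth expanding, and a plain copy of the matcher can
be sent down a different suffix without redoing the shared prefix.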

~~~
burntsushi
That doesn't exist because it's hard, or at least hard without sacrificing
something else that is typically considered more valuable. For example, I
wouldn't describe your desired operations as _generally_ "critical." The
things that are generally critical to a regular expression library are things
like "accept a string and report a match," "tell where the match occurred,"
and "extract submatch locations." A regex library may provide other niceties
such as iterators over matches or routines for replacing some text with
another piece of text, but even those aren't typically _critical_ and can be
implemented in user code just as effectively.

With that said, yes, there are definitely specific use cases in which the
operations you describe are indeed critical. A common one I've seen is in the
development of a text editor, which probably does not store the file it's
editing in a single contiguous block of memory, but still wants a way to
search it with a regex.

Supporting the streaming use case is something I'd like to work on. I don't
know if I'll succeed, but I've documented my thoughts here:
[https://github.com/rust-lang/regex/issues/425](https://github.com/rust-lang/regex/issues/425)
If you'd like to elaborate and go more in depth
about the specific use cases you have, I'd love to have that data, since it
would be quite useful! (I am particularly interested in supporting streaming
matching. Suffix extraction and more flexible automaton construction are also
interesting to me. Working on non-string data is probably outside my scope.
That's too hard to bake into a general purpose regex engine.)

> Requirements 1-3 are extremely important when you have a trie-like data
> structure that is expensive to traverse (like, say, file paths on a network
> folder, or even a local disk sometimes) -- you don't want to expand nodes or
> traverse edges needlessly.

I'm trying to unpack this... Is the trie data structure containing all of the
file paths in memory? If so, it should be possible to build a finite automaton
from a regex and then "simply" intersect it with your trie (which is, of
course, itself also a finite state automaton). This is, for example, what my
fst library does:
[https://docs.rs/fst/0.3.0/fst/#example-case-insensitive-search](https://docs.rs/fst/0.3.0/fst/#example-case-insensitive-search)
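
The intersection idea can be sketched as a lockstep walk over both structures.
This illustrates the general technique only and is not how the fst crate
implements it; `Dfa`, `TrieNode`, `trie_insert`, `intersect`, and
`matching_keys` are hypothetical names, and the DFA is assumed to have been
built from the regex already:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Assumed to be produced from a regex elsewhere; state 0 is the start state.
struct Dfa {
    std::vector<std::map<char, int> > next;  // next[state][ch] -> state
    std::set<int> accepting;
};

struct TrieNode {
    bool is_key = false;
    std::map<char, std::unique_ptr<TrieNode> > kids;
};

void trie_insert(TrieNode& root, const std::string& key) {
    TrieNode* n = &root;
    for (char c : key) {
        std::unique_ptr<TrieNode>& slot = n->kids[c];
        if (!slot) slot.reset(new TrieNode);
        n = slot.get();
    }
    n->is_key = true;
}

// Walk the trie and the DFA in lockstep. A trie edge is followed only if
// the DFA has a transition on the same character, so subtrees that cannot
// match are never visited.
void intersect(const TrieNode& node, const Dfa& dfa, int state,
               std::string& prefix, std::vector<std::string>& out) {
    assert(state >= 0 && static_cast<std::size_t>(state) < dfa.next.size());
    if (node.is_key && dfa.accepting.count(state) != 0) out.push_back(prefix);
    const std::map<char, int>& row = dfa.next[state];
    for (const auto& kv : node.kids) {
        std::map<char, int>::const_iterator it = row.find(kv.first);
        if (it == row.end()) continue;  // DFA is dead on this edge: prune
        prefix.push_back(kv.first);
        intersect(*kv.second, dfa, it->second, prefix, out);
        prefix.pop_back();
    }
}

// Returns every key in the trie that the DFA accepts, in DFS order.
std::vector<std::string> matching_keys(const TrieNode& root, const Dfa& dfa) {
    std::vector<std::string> out;
    std::string prefix;
    intersect(root, dfa, 0, prefix, out);
    return out;
}
```

When listing a node's children is itself the expensive step, the loop could
instead iterate over the DFA's outgoing characters and look up only those
children, which is the same pruning driven from the other side.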

~~~
mehrdadn
Yeah, the fact that it's hard is partly why I haven't implemented a working
one myself. :-)

Regarding streaming: I don't remember exactly when I ran into this, but a text
or hex editor, which you also mentioned, is a clear example (e.g. if I want to
search for a pattern on my disk). I would have to think back to what the
actual use case I ran into myself was; it may have been this or something
else... it was quite a while ago.

> Is the trie data structure containing all of the file paths in memory?

Well, obviously not in the network example (the network aspect would be moot),
but there are also times when this is the case. Thanks for linking to your
library! I haven't learned Rust yet but I'll take a look at it.

------
saagarjha
I guess this is supposed to be a C++ library, but the code still looks very
"C-like"; that is, there are almost no C++ features. Of course, it might just
be that Cfront didn't really support much, but either way, the current
maintainer has a lot of work to do if they want to "review the code and try to
improve the use of C++".

~~~
aap_
Quote from Doug:

> And finally, having followed the development of C++ from its infancy, I
> wanted to try out its new template facility, so there's a bit of that in the
> package, too. Arnold has discovered that not only has C++ evolved, but also
> that without the discipline of -Wall to force clean code, I was rather
> cavalier about casting, both explicitly and implicitly.

------
aap_
This is a very nice specimen of the clear and easy to read code that is so
typical of room 1127. A great example to follow.

------
glangdale
Is there a concise description of how this library works? Apparently I need to
read source code to find out whether or not it's a backtracker, etc.

~~~
burntsushi
Definitely does not look like a backtracker to me.

One interesting thing I found is that it appears to support intersection and
negation, as documented here:
[https://github.com/arnoldrobbins/mcilroy-regex/blob/master/README](https://github.com/arnoldrobbins/mcilroy-regex/blob/master/README)

Running `make` will build its `grep` utility, and using `grep -A` enables the
intersection and negation features.

~~~
glangdale
Yes, it looks kinda interesting. Am still scratching my head at the idea that
a summary of how it works isn't anywhere outside the source base.

------
beagle3
It's not every day that one sees a 26-year-old commit on GitHub....

Anything by Doug McIlroy is worth looking at.

~~~
chrstphrknwtn
If git is ~15 years old, what is the original VCS likely to be for this?

~~~
zaphar
Probably nothing, or maybe RCS. It looks like the history was "reconstructed"
from time stamps on the file system, so no VCS is a strong possibility.

~~~
jhayward
Bell Labs was using source control in the 1970s. SCCS was common in System V
Unix.

------
rgovostes
Hmm.

    
    
        static exef *excom[128] = {
        	vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,Ie,vv,vv,vv,vv,vv,
        	vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,
        	vv,vv,vv,Ie,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv, /* # */
        	vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,Ce,Se,vv,Ee,vv,vv, /* :;= */
        	vv,vv,vv,vv,De,vv,vv,Ge,He,vv,vv,vv,vv,vv,Ne,vv, /* DGHN */
        	Pe,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv,vv, /* P */
        	vv,ae,be,ce,de,vv,vv,ge,he,ie,vv,vv,le,vv,ne,vv, /* a-n */
        	pe,qe,re,se,te,vv,vv,we,xe,ye,vv,Le,vv,Re,vv,vv  /* p-y{} */
        };
    

(I think it's a lookup table of sed commands.)
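
For anyone unfamiliar with the pattern, here is a minimal sketch of a dispatch
table indexed by the command character. It assumes `vv` in the snippet above is
the fallback entry for unsupported characters, which the table suggests but
does not confirm; `handler`, `unknown`, `print_cmd`, `delete_cmd`, `Table`, and
`dispatch` are hypothetical names, not taken from the library:

```cpp
#include <cassert>
#include <string>

// One handler per command character.
typedef std::string handler(const std::string& arg);

std::string unknown(const std::string&) { return "unknown command"; }
std::string print_cmd(const std::string& arg) { return "print " + arg; }
std::string delete_cmd(const std::string& arg) { return "delete " + arg; }

// 128 slots, one per 7-bit ASCII code. Every slot starts out as the
// fallback, and only the supported command characters are overridden.
struct Table {
    handler* slot[128];

    Table() {
        for (int i = 0; i < 128; ++i) slot[i] = unknown;
        set('p', print_cmd);
        set('d', delete_cmd);
    }

    void set(char c, handler* h) {
        unsigned char u = static_cast<unsigned char>(c);
        assert(u < 128);  // table covers 7-bit ASCII only
        slot[u] = h;
    }
};

// Look up the handler for a command character and run it.
std::string dispatch(char cmd, const std::string& arg) {
    static const Table table;
    unsigned char u = static_cast<unsigned char>(cmd);
    if (u >= 128) return unknown(arg);  // outside the table
    return table.slot[u](arg);
}
```

The lookup is a single array index, so adding a command means filling one slot
and leaving the parsing code alone.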

~~~
pouta
Is this in the code?

~~~
agumonkey
That's ed's whole source.

~~~
pouta
Really? How? I can't find any references online

~~~
agumonkey
I was joking

