I still haven't found a single regex library that can do a few tasks I consider critical: (1) accept the input string in pieces (i.e., lazily), (2) let you make copies of the recognition automaton in the middle of recognition so that you can run them on different suffixes, and (3) tell you which characters/strings are valid continuations from the automaton's current state. (Well, I would also like it to (4) lazily build a DFA whenever the pattern allows it, and (5) work on non-string data, but I've given up on those, since it seems to me nobody even cares about linear-time matching or non-string data.)
Requirements 1-3 are extremely important when you have a trie-like data structure that is expensive to traverse (like, say, file paths on a network folder, or even a local disk sometimes) -- you don't want to expand nodes or traverse edges needlessly. However, no library that I've seen lets you do this. Has anyone else seen any?
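To make that concrete, here's roughly the interface I wish existed and how I'd use it to prune a trie walk. This is a hypothetical sketch -- none of these names come from a real library:

    #include <set>
    #include <string>
    #include <string_view>
    #include <utility>
    #include <vector>

    // Hypothetical interface -- requirements (1)-(3) from above.
    class incremental_matcher {
    public:
        explicit incremental_matcher(std::string_view pattern);
        void feed(char c);                        // (1) consume input lazily
        incremental_matcher fork() const;         // (2) copy mid-recognition
        std::set<char> valid_next_chars() const;  // (3) what may come next?
        bool is_match() const;                    // input so far is accepted
        bool is_dead() const;                     // no suffix can ever match
    };

    // The trie whose edges are expensive to traverse (directory listings, etc.).
    struct trie_node {
        std::vector<std::pair<char, const trie_node*>> children;
        bool is_leaf;
    };

    void search(const trie_node& node, const incremental_matcher& m,
                std::string& path, std::vector<std::string>& results) {
        if (node.is_leaf && m.is_match())
            results.push_back(path);
        const std::set<char> next = m.valid_next_chars();  // (3)
        for (const auto& [label, child] : node.children) {
            if (next.count(label) == 0)
                continue;                      // prune: never expand this edge
            incremental_matcher m2 = m.fork(); // (2) fork per branch
            m2.feed(label);                    // (1) feed one character
            if (m2.is_dead())
                continue;
            path.push_back(label);
            search(*child, m2, path, results);
            path.pop_back();
        }
    }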
The reason that doesn't exist is that it's hard -- or at least, hard without sacrificing something else that is typically considered more valuable. For example, I wouldn't describe your desired operations as generally "critical." The things that are generally critical to a regular expression library are things like "accept a string and report a match," "tell where the match occurred," and "extract submatch locations." A regex library may provide other niceties, such as iterators over matches or routines for replacing some text with another piece of text, but even those aren't typically critical and can be implemented in user code just as effectively.
With that said, yes, there are definitely specific use cases in which the operations you describe are indeed critical. A common one I've seen is in the development of a text editor, which probably does not store the file it's editing in a single contiguous block of memory, but still wants a way to search it with a regex.
Supporting the streaming use case is something I'd like to work on. I don't know if I'll succeed, but I've documented my thoughts here: https://github.com/rust-lang/regex/issues/425 --- If you'd like to elaborate and go more in depth about the specific use cases you have, I'd love to have that data, since it would be quite useful! (I am particularly interested in supporting streaming matching. Suffix extraction and more flexible automaton construction are also interesting to me. Working on non-string data is probably outside my scope. That's too hard to bake into a general purpose regex engine.)
> Requirements 1-3 are extremely important when you have a trie-like data structure that is expensive to traverse (like, say, file paths on a network folder, or even a local disk sometimes) -- you don't want to expand nodes or traverse edges needlessly.
I'm trying to unpack this... Does the trie data structure contain all of the file paths in memory? If so, it should be possible to build a finite automaton from a regex and then "simply" intersect it with your trie (which is, of course, itself also a finite state automaton). This is, for example, what my fst library does: https://docs.rs/fst/0.3.0/fst/#example-case-insensitive-sear...
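The intersection itself is just a simultaneous walk of the two machines -- conceptually something like this rough sketch (dense DFA table assumed; the names are mine, not the fst crate's API):

    #include <array>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch only: a regex compiled to a DFA with a dense transition table.
    struct dfa {
        std::vector<std::array<int, 256>> trans;  // trans[state][byte], -1 = dead
        std::vector<bool> is_match;
        int start;
    };

    struct trie_node {
        std::vector<std::pair<uint8_t, const trie_node*>> edges;
        bool is_key;  // a complete key (e.g. a full file path) ends here
    };

    // Walk the product of the trie and the DFA. Edges the DFA cannot accept
    // are never followed, so dead subtrees are never expanded.
    void intersect(const trie_node& n, const dfa& d, int state,
                   std::string& buf, std::vector<std::string>& out) {
        if (n.is_key && d.is_match[state])
            out.push_back(buf);
        for (const auto& [byte, child] : n.edges) {
            const int next = d.trans[state][byte];
            if (next < 0)
                continue;  // the regex has no continuation on this byte: prune
            buf.push_back(static_cast<char>(byte));
            intersect(*child, d, next, buf, out);
            buf.pop_back();
        }
    }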
Yeah, the fact that it's hard is partly why I haven't implemented a working one myself. :-)
Regarding streaming: I don't remember exactly when I ran into this, but a text or hex editor, which you also mentioned for the other use case, is a clear example (e.g., if I want to search for a pattern on my disk). I would have to think back to what my actual use case was; it may have been this or something else... it was quite a while ago.
> Does the trie data structure contain all of the file paths in memory?
Well, obviously not in the network example (otherwise the network aspect would be moot), but there are also times when this is the case. Thanks for linking to your library! I haven't learned Rust yet, but I'll take a look at it.
-f --storable-state
Generate a lexer which can store its inner state. This is useful in push-model lexers which are stopped by an outer program when there is not enough input, and then resumed when more input becomes available. In this mode users should additionally define YYGETSTATE () and YYSETSTATE (state) macros and variables yych, yyaccept and the state as part of the lexer state.
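Roughly, the glue you supply looks something like the sketch below (the struct and the return code are my names; the YY* macros are the hooks named in the documentation above, and the exact set varies between re2c versions):

    // Sketch of push-model glue for a storable-state (-f) lexer.
    // The macros expand inside the re2c-generated function, where a
    // `lexer_state &lx` is assumed to be in scope.
    struct lexer_state {
        int           state;     // saved DFA state (YYGETSTATE/YYSETSTATE)
        unsigned char yych;      // current character, preserved across suspensions
        unsigned int  yyaccept;  // last accepting rule, likewise
        const char   *cur, *lim; // cursor and limit into the chunk being fed
    };

    enum { DONE, NEED_MORE_INPUT };

    #define YYGETSTATE()  (lx.state)
    #define YYSETSTATE(s) (lx.state = (s))
    #define YYFILL(n)     return NEED_MORE_INPUT  /* suspend until more data */

    // Outer program (the "push" side): feed chunks as they arrive and
    // re-call the lexer until it stops asking for more input, e.g.:
    //
    //     while (lex(lx) == NEED_MORE_INPUT) {
    //         wait_for_chunk(&lx.cur, &lx.lim);  // hypothetical
    //     }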
[EDIT]: A slightly better explanation is found here:
Why don't you write it? I don't mean this in a snarky way - I don't share this particular problem, but other people writing C++ code probably do. It sounds like you're 1) experienced in writing C++, 2) reasonably knowledgeable about the landscape of regular expression libraries in C++, and 3) aware of a clear gap in the functionality of available libraries.
Whatever you develop might not make it into the STL, but I bet a bunch of other people would find it useful. If it's a matter of personal time and this is a professional need, maybe you can write it for your employer.
Haha, great question, and in fact, I started to do exactly this, back in... 2015, according to Git. I forgot about it and it sat gathering dust for a few years, and I think it was this year that I got around to finishing the actual parsing of the regex pattern (as a side effect of which I now have a library to reverse a regex -- which might be operation #6 they should all support!). But while my 'basic_regex' has been written, my 'basic_nfa' class has yet to have its guts filled in, and a 'basic_dfa' has yet to be started.
The first reason I haven't implemented this already is obviously time... it's not exactly a trivial undertaking, and no, it's by no means a professional need; it's something I've needed in my side projects. The second reason is that while I know enough about the theory to be (probably) able to do it given sufficient time, I don't know enough to do it in what I feel is a comfortable amount of time -- it's my first time implementing a regex engine, and I would still need to learn how to do capturing groups and DFA minimization efficiently, for example (and probably other things I can't think of right now). And lastly, I have this nasty habit of trying to make code ultra-optimal and ultra-extensible when I hope to see widespread usage (e.g., avoiding unnecessary heap allocations like crazy, making it fast, and making it so that it can be elegantly generalized to, or used within, a CFG parser). And if I'm doing it, it had better be perfect for all of my own uses -- I'm not going to write a second one because this one turned out to be inadequate!! That doesn't make the job any easier either. But the main obstacles are the first two, not this last one.
Ragel was designed to do #1 and #2 to make network scanners easy to implement. It technically doesn't do #4 because it only generates DFAs.[1] And it does #5 because it doesn't have any concept of strings, per se, just sequences of integers. (The pre-defined named patterns assume ASCII encoding, but you don't need to use those.)
Saving state (#2) is trivial because the state of the DFA is a single integer, `cs`. Ragel scans input through a very simple interface -- in C it uses just three pointers, `p`, `pe`, and `eof`, that are expected to be in scope and valid when its generated code block is resumed. (It pauses when p == pe but p != eof, allowing you to yield and resume when you have more data, and only halts when p == eof. You can also explicitly exit at any point while keeping the state consistent.) You can save and restore these however you want, and you can even change their names or how the generated code dereferences them (e.g., so they're dereferenced directly through your state structure).
[1] It only does DFAs but you can support more complex matchers by employing specialized constructs for tracking state transitions (allowing you to capture match positions), conditional matching (using user code to dynamically test if a character matches), direct branching to another DFA (fgoto), and recursion (fcall, fret). The latter (recursion) also requires that the user save and restore a stack of `cs` values.
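To make the resume mechanics concrete, the host-side code is basically this (C++ sketch; the commented %% lines mark where Ragel splices in its generated tables and exec block):

    #include <cstddef>

    struct scanner {
        int cs;  // the entire resumable state of the machine
    };

    // %% write data;   (Ragel emits its static transition tables here)

    void scanner_init(scanner &s) {
        int cs = 0;
        // %% write init;   (sets cs to the machine's start state)
        s.cs = cs;
    }

    // Feed one chunk; `last` marks the final chunk.
    void scanner_feed(scanner &s, const char *data, std::size_t len, bool last) {
        int cs = s.cs;                         // restore saved state
        const char *p   = data;                // start of this chunk
        const char *pe  = data + len;          // end of this chunk
        const char *eof = last ? pe : nullptr;
        // %% write exec;   (runs the DFA; pauses at p == pe while eof is null,
        //                   halts for good only when p == eof)
        s.cs = cs;                             // save state for the next chunk
    }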
Unfortunately[1] it's not lazy. The DFAs are static. And you can get blow-ups. I've used Ragel for many different projects. The only time that became an issue was for a project that involved transforming ~10,000 PCREs to Ragel-able syntax and into a giant union. A handful[2], when placed into a union, would produce (directly or indirectly) DFAs that were either too large or simply took too long to compile.[3] We developed heuristics to split them up, so we had one giant union and a few dozen smaller ones. Later the Ragel author implemented a cost function that allowed us to do that more consistently and efficiently.
Normally if you encounter undesirable blow ups you can use Ragel constructs to split your machines and dynamically jump between them.
I've also worked with RE2 and consulted on a project using re2c. Those projects run into trouble with DFA complexity much, much, much earlier than Ragel. For example, re2c took days to compile a union that Ragel compiled in seconds, and while the generated code was large it wasn't even noticeable in terms of performance. (And by not noticeable I mean that the run time wasn't noticeably different from other Ragel-using projects. In terms of performance there's simply no comparison between Ragel and re2c, or Ragel and RE2 for that matter.)
[1] On the plus side, if you compile the machine into a dynamically linked library the generated code and data will be shared across processes and with the file system buffer cache. Even multi-gigabyte machines don't necessarily result in as much memory pressure as you'd think because most of the states won't be visited and so won't need to be resident. Sometimes you make a little change to a machine and it doubles or triples in size; I had to learn to leave it be and move on because ultimately it was usually inconsequential and my time was better spent elsewhere rather than obsessing over what intuitively seemed inefficient and costly.
[2] EDIT: This originally said 5%, but now I remember that 5% was the number of PCREs that contained problematic constructs that required special treatment to emulate. (In other words, 95% readily transformed to equivalent DFAs.) The ones that truly blew up the DFA into something unusable were quite rare. And none were so large that they couldn't be compiled into their own, single machine.
[3] Part of that was our use of `when` conditionals, which we used heavily to translate zero-width assertions (e.g. "\B"). Because of how they work they can easily inflate the machine size when part of a union.
Check out Ragel. It's very flexible and allows you to generate code for different programming languages. It can also work with binary data. Its main application is the generation of optimized network-protocol scanners.
Oh, I just mean you should be able to ask a regex automaton what character(s) it will accept next. That can help you narrow down the set of strings to try feeding in.
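For a DFA this is a cheap query: the valid next characters are exactly the symbols with a live outgoing transition from the current state. A sketch, assuming a dense byte-indexed transition table:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Returns the bytes the automaton is willing to accept next, i.e. the
    // symbols whose transition from `state` doesn't lead to the dead state.
    std::vector<uint8_t> valid_next_bytes(
            const std::vector<std::array<int, 256>>& trans, int state) {
        std::vector<uint8_t> out;
        for (int b = 0; b < 256; ++b)
            if (trans[state][b] != -1)
                out.push_back(static_cast<uint8_t>(b));
        return out;
    }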
Regarding (5), this reminds me of my long quest to figure out how to make a logic / relational embedded language for C++, which can work with any type.
Well, by strings I meant 8-bit/16-bit chars (std::string/wstring). One "problem" with these string patterns is that they let the engine do more than mere comparison -- they let the engine actually enumerate all the characters in between. If I remember correctly, I've seen linear-ish-time regex engines (i.e., I'm not talking about the backtracking ones you see in most languages, although they might sometimes do this too) take advantage of this by building lookup tables for the characters in an interval, and this clearly doesn't scale when you go to 32-bit or 64-bit "characters". They really have to implement efficient set intersection for large sets, which is possible, but I'm not sure I've seen them do this.
If you're looking for specific examples of "large alphabets", there's a very easy one in my mind, which is UTF-32. If you're looking for data that isn't a plain sequence of integers in any shape or form... I have to think back, as I don't remember what the use case I ran into was, sadly. I'll post here if I remember.
You'll also find large alphabets quite difficult to deal with in an efficient DFA implementation. Naively, a large alphabet will either cause the size of each DFA node to be very large (a dense representation) or make accesses into the DFA transitions slower (a sparse representation).
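To make the trade-off concrete, a sketch of both representations (mine, not any particular engine's):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Dense: O(1) lookup, but each state stores one entry per symbol.
    // Fine for a 256-byte alphabet, painful for ~1.1 million codepoints.
    struct dense_state {
        std::vector<int> next;  // next[symbol] -> state, sized to the alphabet
    };

    // Sparse: each state stores only the ranges it actually transitions on,
    // so memory stays proportional to the state's complexity, but every
    // lookup becomes a binary search instead of an array index.
    struct sparse_state {
        struct range { uint32_t lo, hi; int next; };  // [lo, hi] -> next
        std::vector<range> ranges;                    // sorted, non-overlapping

        int lookup(uint32_t symbol) const {
            auto it = std::upper_bound(
                ranges.begin(), ranges.end(), symbol,
                [](uint32_t s, const range& r) { return s < r.lo; });
            if (it == ranges.begin()) return -1;      // no range starts <= symbol
            --it;
            return symbol <= it->hi ? it->next : -1;  // -1 = dead state
        }
    };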
For most DFA engines I've seen, they take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway).
Regex engines with modern Unicode support do generally need to support efficient set operations on large sets though, in order to support the various character class syntaxes (which include negation, intersection, symmetric difference and, of course, union). Usually you want some kind of interval set structure, and its implementation complexity, in my experience, generally correlates with the amount of additional memory you're willing to use. :-)
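For the curious, the core of an interval set is small -- the fiddly part is keeping the ranges sorted, merged, and non-overlapping after every operation, and deciding how much scratch memory you're willing to spend while doing it. A minimal sketch (not any library's actual code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Minimal interval-set sketch over codepoints.
    // Invariant after normalize(): ranges are sorted, non-overlapping, non-adjacent.
    struct range { uint32_t lo, hi; };  // both ends inclusive
    using interval_set = std::vector<range>;

    constexpr uint32_t MAX_CP = 0x10FFFF;

    interval_set normalize(interval_set v) {
        std::sort(v.begin(), v.end(),
                  [](const range& a, const range& b) { return a.lo < b.lo; });
        interval_set out;
        for (const range& r : v) {
            if (!out.empty() && r.lo <= out.back().hi + 1)
                out.back().hi = std::max(out.back().hi, r.hi);  // merge/extend
            else
                out.push_back(r);
        }
        return out;
    }

    interval_set set_union(interval_set a, const interval_set& b) {
        a.insert(a.end(), b.begin(), b.end());
        return normalize(std::move(a));
    }

    // Complement within the codepoint space (assumes `a` is normalized);
    // intersection and symmetric difference then fall out via De Morgan.
    interval_set set_negate(const interval_set& a) {
        interval_set out;
        uint32_t next = 0;
        for (const range& r : a) {
            if (r.lo > next) out.push_back({next, r.lo - 1});
            next = r.hi + 1;
        }
        if (next <= MAX_CP) out.push_back({next, MAX_CP});
        return out;
    }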
...yeah, exactly that, thanks for mentioning the character class issues. :-) I think it was partly for this reason (but also others -- like my idea that it would help SAT or SMT solving) that I've also had this idea of making a "universal" int_set class that adaptively adjusts its internal data structure (bitvectors, intervals, "shift-intervals", sparse sets, BDDs, etc.) based on its contents' sparsity patterns... which, for n different data structures, would require O(n^2) or O(n^3) methods to handle set (and perhaps arithmetic) operations between them: n^2 if the output type matches the input type, n^3 if we let the output data structure differ from the inputs'. That's kind of painful, but I think it's doable and likely to be useful. It's yet another ambitious endeavor I may never get around to, or at least not before someone else does...
Given a byte-based encoding like UTF-8, extended 8-bit ASCII variants, or Shift-JIS, a regular expression over Unicode characters can be translated to a more complicated, but not necessarily slow-to-match, regular expression over bytes.
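Worked example (mine): the single-character class [α-ω], i.e. U+03B1 through U+03C9, becomes exactly two byte-range branches in UTF-8, so a byte-level matcher handles it with two two-byte paths and no decoding step:

    #include <cstddef>
    #include <cstdint>

    // [α-ω] == U+03B1..U+03C9 over UTF-8 bytes:
    //     CE B1..BF   covers U+03B1..U+03BF
    //     CF 80..89   covers U+03C0..U+03C9
    bool matches_alpha_to_omega(const uint8_t* p, std::size_t n) {
        if (n != 2) return false;
        if (p[0] == 0xCE) return p[1] >= 0xB1 && p[1] <= 0xBF;
        if (p[0] == 0xCF) return p[1] >= 0x80 && p[1] <= 0x89;
        return false;
    }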
Indeed. That's what I meant by: "For most DFA engines I've seen, they take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway)." :-)
"UTF-8 decoding" usually means translating UTF-8 text to a more convenient character representation like UTF-32 (or, if you are naughty, UTF-16) before processing, not adapting the application to work with UTF-8 data directly.
I think the principle of charity applies here. What other possible interpretation of UTF-8 decoding is there for what I wrote? I mean, the automaton literally contains a UTF-8 automaton right in it, and semantically, it is totally matching in terms of codepoints.
I guess this is supposed to be a C++ library, but the code still looks very "C-like" -- that is, there are almost no C++ features. Of course, it might just be that Cfront didn't really support much, but either way, the current maintainer has a lot of work to do if they want to "review the code and try to improve the use of C++".
> And finally, having followed the development of C++ from its infancy, I wanted to try out its new template facility, so there's a bit of that in the package, too. Arnold has discovered that not only has C++ evolved, but also that without the discipline of -Wall to force clean code, I was rather cavalier about casting, both explicitly and implicitly.