I still haven't found a single regex library that can do a few tasks I consider critical: (1) accept the input string in pieces (i.e., lazily), (2) let you make copies of the recognition automaton in the middle of recognition so that you can run them on different suffixes, and (3) tell you which characters/strings are valid continuations from the automaton's current state. (Well, I would also like it to (4) lazily build a DFA whenever the pattern allows it, and (5) work on non-string data, but I've given up on those, since it seems to me nobody even cares about linear-time matching or non-string data.)
Requirements 1-3 are extremely important when you have a trie-like data structure that is expensive to traverse (like, say, file paths on a network folder, or even a local disk sometimes) -- you don't want to expand nodes or traverse edges needlessly. However, no library that I've seen lets you do this. Has anyone else seen any?
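To make that concrete, here's roughly the interface I wish existed and how I'd use it to prune a trie walk. This is a hypothetical sketch -- none of these names come from a real library:

    #include <set>
    #include <string>
    #include <string_view>
    #include <utility>
    #include <vector>

    // Hypothetical interface -- requirements (1)-(3) from above.
    class incremental_matcher {
    public:
        explicit incremental_matcher(std::string_view pattern);
        void feed(char c);                        // (1) consume input lazily
        incremental_matcher fork() const;         // (2) copy mid-recognition
        std::set<char> valid_next_chars() const;  // (3) what may come next?
        bool is_match() const;                    // input so far is accepted
        bool is_dead() const;                     // no suffix can ever match
    };

    // The trie whose edges are expensive to traverse (directory listings, etc.).
    struct trie_node {
        std::vector<std::pair<char, const trie_node*>> children;
        bool is_leaf;
    };

    void search(const trie_node& node, const incremental_matcher& m,
                std::string& path, std::vector<std::string>& results) {
        if (node.is_leaf && m.is_match())
            results.push_back(path);
        const std::set<char> next = m.valid_next_chars();  // (3)
        for (const auto& [label, child] : node.children) {
            if (next.count(label) == 0)
                continue;                      // prune: never expand this edge
            incremental_matcher m2 = m.fork(); // (2) fork per branch
            m2.feed(label);                    // (1) feed one character
            if (m2.is_dead())
                continue;
            path.push_back(label);
            search(*child, m2, path, results);
            path.pop_back();
        }
    }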
The reason that doesn't exist is that it's hard -- or at least, hard without sacrificing something else that is typically considered more valuable. For example, I wouldn't describe your desired operations as generally "critical." The things that are generally critical to a regular expression library are things like "accept a string and report a match," "tell where the match occurred," and "extract submatch locations." A regex library may provide other niceties, such as iterators over matches or routines for replacing some text with another piece of text, but even those aren't typically critical and can be implemented in user code just as effectively.
With that said, yes, there are definitely specific use cases in which the operations you describe are indeed critical. A common one I've seen is in the development of a text editor, which probably does not store the file it's editing in a single contiguous block of memory, but still wants a way to search it with a regex.
Supporting the streaming use case is something I'd like to work on. I don't know if I'll succeed, but I've documented my thoughts here: https://github.com/rust-lang/regex/issues/425 --- If you'd like to elaborate and go more in depth about the specific use cases you have, I'd love to have that data, since it would be quite useful! (I am particularly interested in supporting streaming matching. Suffix extraction and more flexible automaton construction are also interesting to me. Working on non-string data is probably outside my scope. That's too hard to bake into a general purpose regex engine.)
> Requirements 1-3 are extremely important when you have a trie-like data structure that is expensive to traverse (like, say, file paths on a network folder, or even a local disk sometimes) -- you don't want to expand nodes or traverse edges needlessly.
I'm trying to unpack this... Does the trie data structure contain all of the file paths in memory? If so, it should be possible to build a finite automaton from a regex and then "simply" intersect it with your trie (which is, of course, itself also a finite state automaton). This is, for example, what my fst library does: https://docs.rs/fst/0.3.0/fst/#example-case-insensitive-sear...
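The intersection itself is just a simultaneous walk of the two machines -- conceptually something like this rough sketch (dense DFA table assumed; the names are mine, not the fst crate's API):

    #include <array>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch only: a regex compiled to a DFA with a dense transition table.
    struct dfa {
        std::vector<std::array<int, 256>> trans;  // trans[state][byte], -1 = dead
        std::vector<bool> is_match;
        int start;
    };

    struct trie_node {
        std::vector<std::pair<uint8_t, const trie_node*>> edges;
        bool is_key;  // a complete key (e.g. a full file path) ends here
    };

    // Walk the product of the trie and the DFA. Edges the DFA cannot accept
    // are never followed, so dead subtrees are never expanded.
    void intersect(const trie_node& n, const dfa& d, int state,
                   std::string& buf, std::vector<std::string>& out) {
        if (n.is_key && d.is_match[state])
            out.push_back(buf);
        for (const auto& [byte, child] : n.edges) {
            const int next = d.trans[state][byte];
            if (next < 0)
                continue;  // the regex has no continuation on this byte: prune
            buf.push_back(static_cast<char>(byte));
            intersect(*child, d, next, buf, out);
            buf.pop_back();
        }
    }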
Yeah, the fact that it's hard is partly why I haven't implemented a working one myself. :-)
Regarding streaming: I don't remember exactly when I ran into this, but a text or hex editor, which you also mentioned for the other use case, is a clear example (e.g., if I want to search for a pattern on my disk). I would have to think back to what my actual use case was; it may have been this or something else... it was quite a while ago.
> Does the trie data structure contain all of the file paths in memory?
Well, obviously not in the network example (otherwise the network aspect would be moot), but there are also times when this is the case. Thanks for linking to your library! I haven't learned Rust yet, but I'll take a look at it.
-f --storable-state
Generate a lexer which can store its inner state. This is useful in push-model lexers which are stopped by an outer program when there is not enough input, and then resumed when more input becomes available. In this mode users should additionally define YYGETSTATE () and YYSETSTATE (state) macros and variables yych, yyaccept and the state as part of the lexer state.
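Roughly, the glue you supply looks something like the sketch below (the struct and the return code are my names; the YY* macros are the hooks named in the documentation above, and the exact set varies between re2c versions):

    // Sketch of push-model glue for a storable-state (-f) lexer.
    // The macros expand inside the re2c-generated function, where a
    // `lexer_state &lx` is assumed to be in scope.
    struct lexer_state {
        int           state;     // saved DFA state (YYGETSTATE/YYSETSTATE)
        unsigned char yych;      // current character, preserved across suspensions
        unsigned int  yyaccept;  // last accepting rule, likewise
        const char   *cur, *lim; // cursor and limit into the chunk being fed
    };

    enum { DONE, NEED_MORE_INPUT };

    #define YYGETSTATE()  (lx.state)
    #define YYSETSTATE(s) (lx.state = (s))
    #define YYFILL(n)     return NEED_MORE_INPUT  /* suspend until more data */

    // Outer program (the "push" side): feed chunks as they arrive and
    // re-call the lexer until it stops asking for more input, e.g.:
    //
    //     while (lex(lx) == NEED_MORE_INPUT) {
    //         wait_for_chunk(&lx.cur, &lx.lim);  // hypothetical
    //     }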
[EDIT]: A slightly better explanation is found here:
Why don't you write it? I don't mean this in a snarky way - I don't share this particular problem, but other people writing C++ code probably do. It sounds like you're 1) experienced in writing C++, 2) reasonably knowledgeable about the landscape of regular expression libraries in C++, and 3) aware of a clear gap in the functionality of available libraries.
Whatever you develop might not make it into the STL, but I bet a bunch of other people would find it useful. If it's a matter of personal time and this is a professional need, maybe you can write it for your employer.
Haha, great question, and in fact, I started to do exactly this, back in... 2015, according to Git. I forgot about it and it sat gathering dust for a few years, and I think it was this year that I got around to finishing the actual parsing of the regex pattern (as a side effect of which I now have a library to reverse a regex -- which might be operation #6 they should all support!). But while my 'basic_regex' has been written, my 'basic_nfa' class has yet to have its guts filled in, and a 'basic_dfa' has yet to be started.
The first reason I haven't implemented this already is obviously time... it's not exactly a trivial undertaking, and no, it's by no means a professional need; it's something I've needed in my side projects. The second reason is that while I know enough about the theory to be (probably) able to do it given sufficient time, I don't know enough to do it in what I feel is a comfortable amount of time -- it's my first time implementing a regex engine, and I would still need to learn how to do capturing groups and DFA minimization efficiently, for example (and probably other things I can't think of right now). And lastly, I have this nasty habit of trying to make code ultra-optimal and ultra-extensible when I hope to see widespread usage (e.g., avoiding unnecessary heap allocations like crazy, making it fast, and making it so that it can be elegantly generalized to, or used within, a CFG parser). And if I'm doing it, it had better be perfect for all of my own uses -- I'm not going to write a second one because this one turned out to be inadequate!! That doesn't make the job any easier either. But the main obstacles are the first two, not this last one.
Ragel was designed to do #1 and #2 to make network scanners easy to implement. It technically doesn't do #4 because it only generates DFAs.[1] And it does #5 because it doesn't have any concept of strings, per se, just sequences of integers. (The pre-defined named patterns assume ASCII encoding, but you don't need to use those.)
Saving state (#2) is trivial because the state of the DFA is a single integer, `cs`. Ragel scans input through a very simple interface -- in C it uses just three pointers, `p`, `pe`, and `eof`, that are expected to be in scope and valid when its generated code block is resumed. (It pauses when p == pe but p != eof, allowing you to yield and resume when you have more data, and only halts when p == eof. You can also explicitly exit at any point while keeping the state consistent.) You can save and restore these however you want, and you can even change their names or how the generated code dereferences them (e.g., so they're dereferenced directly through your state structure).
[1] It only does DFAs but you can support more complex matchers by employing specialized constructs for tracking state transitions (allowing you to capture match positions), conditional matching (using user code to dynamically test if a character matches), direct branching to another DFA (fgoto), and recursion (fcall, fret). The latter (recursion) also requires that the user save and restore a stack of `cs` values.
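To make the resume mechanics concrete, the host-side code is basically this (C++ sketch; the commented %% lines mark where Ragel splices in its generated tables and exec block):

    #include <cstddef>

    struct scanner {
        int cs;  // the entire resumable state of the machine
    };

    // %% write data;   (Ragel emits its static transition tables here)

    void scanner_init(scanner &s) {
        int cs = 0;
        // %% write init;   (sets cs to the machine's start state)
        s.cs = cs;
    }

    // Feed one chunk; `last` marks the final chunk.
    void scanner_feed(scanner &s, const char *data, std::size_t len, bool last) {
        int cs = s.cs;                         // restore saved state
        const char *p   = data;                // start of this chunk
        const char *pe  = data + len;          // end of this chunk
        const char *eof = last ? pe : nullptr;
        // %% write exec;   (runs the DFA; pauses at p == pe while eof is null,
        //                   halts for good only when p == eof)
        s.cs = cs;                             // save state for the next chunk
    }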
Unfortunately[1] it's not lazy. The DFAs are static. And you can get blow-ups. I've used Ragel for many different projects. The only time that became an issue was for a project that involved transforming ~10,000 PCREs to Ragel-able syntax and into a giant union. A handful[2], when placed into a union, would produce (directly or indirectly) DFAs that were either too large or simply took too long to compile.[3] We developed heuristics to split them up, so we had one giant union and a few dozen smaller ones. Later the Ragel author implemented a cost function that allowed us to do that more consistently and efficiently.
Normally if you encounter undesirable blow ups you can use Ragel constructs to split your machines and dynamically jump between them.
I've also worked with RE2 and consulted on a project using re2c. Those projects run into trouble with DFA complexity much, much, much earlier than Ragel. For example, re2c took days to compile a union that Ragel compiled in seconds, and while the generated code was large it wasn't even noticeable in terms of performance. (And by not noticeable I mean that the run time wasn't noticeably different from other Ragel-using projects. In terms of performance there's simply no comparison between Ragel and re2c, or Ragel and RE2 for that matter.)
[1] On the plus side, if you compile the machine into a dynamically linked library the generated code and data will be shared across processes and with the file system buffer cache. Even multi-gigabyte machines don't necessarily result in as much memory pressure as you'd think because most of the states won't be visited and so won't need to be resident. Sometimes you make a little change to a machine and it doubles or triples in size; I had to learn to leave it be and move on because ultimately it was usually inconsequential and my time was better spent elsewhere rather than obsessing over what intuitively seemed inefficient and costly.
[2] EDIT: This originally said 5%, but now I remember that 5% was the number of PCREs that contained problematic constructs that required special treatment to emulate. (In other words, 95% readily transformed to equivalent DFAs.) The ones that truly blew up the DFA into something unusable were quite rare. And none were so large that they couldn't be compiled into their own, single machine.
[3] Part of that was our use of `when` conditionals, which we used heavily to translate zero-width assertions (e.g. "\B"). Because of how they work they can easily inflate the machine size when part of a union.
Check out Ragel. It's very flexible and allows you to generate code for different programming languages. It can also work with binary data. Its main application is the generation of optimized network-protocol scanners.
Oh, I just mean you should be able to ask a regex automaton what character(s) it will accept next. That can help you narrow down the set of strings to try feeding in.
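For a DFA this is a cheap query: the valid next characters are exactly the symbols with a live outgoing transition from the current state. A sketch, assuming a dense byte-indexed transition table:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Returns the bytes the automaton is willing to accept next, i.e. the
    // symbols whose transition from `state` doesn't lead to the dead state.
    std::vector<uint8_t> valid_next_bytes(
            const std::vector<std::array<int, 256>>& trans, int state) {
        std::vector<uint8_t> out;
        for (int b = 0; b < 256; ++b)
            if (trans[state][b] != -1)
                out.push_back(static_cast<uint8_t>(b));
        return out;
    }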
Regarding (5), this reminds me of my long quest to figure out how to make a logic / relational embedded language for C++, which can work with any type.
Well, by strings I meant 8-bit/16-bit chars (std::string/wstring). One "problem" with these string patterns is that they let the engine do more than mere comparison -- they let the engine actually enumerate all the characters in between. If I remember correctly, I've seen linear-ish-time regex engines (i.e., I'm not talking about the backtracking ones you see in most languages, although they might sometimes do this too) take advantage of this by building lookup tables for the characters in an interval, and this clearly doesn't scale when you go to 32-bit or 64-bit "characters". They really have to implement efficient set intersection for large sets, which is possible, but I'm not sure I've seen them do this.
If you're looking for specific examples of "large alphabets", there's a very easy one in my mind, which is UTF-32. If you're looking for data that isn't a plain sequence of integers in any shape or form... I have to think back, as I don't remember what the use case I ran into was, sadly. I'll post here if I remember.
You'll also find large alphabets quite difficult to deal with in an efficient DFA implementation. Naively, a large alphabet will either cause the size of each DFA node to be very large (a dense representation) or make accesses into the DFA transitions slower (a sparse representation).
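To make the trade-off concrete, a sketch of both representations (mine, not any particular engine's):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Dense: O(1) lookup, but each state stores one entry per symbol.
    // Fine for a 256-byte alphabet, painful for ~1.1 million codepoints.
    struct dense_state {
        std::vector<int> next;  // next[symbol] -> state, sized to the alphabet
    };

    // Sparse: each state stores only the ranges it actually transitions on,
    // so memory stays proportional to the state's complexity, but every
    // lookup becomes a binary search instead of an array index.
    struct sparse_state {
        struct range { uint32_t lo, hi; int next; };  // [lo, hi] -> next
        std::vector<range> ranges;                    // sorted, non-overlapping

        int lookup(uint32_t symbol) const {
            auto it = std::upper_bound(
                ranges.begin(), ranges.end(), symbol,
                [](uint32_t s, const range& r) { return s < r.lo; });
            if (it == ranges.begin()) return -1;      // no range starts <= symbol
            --it;
            return symbol <= it->hi ? it->next : -1;  // -1 = dead state
        }
    };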
For most DFA engines I've seen, they take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway).
Regex engines with modern Unicode support do generally need to support efficient set operations on large sets though, in order to support the various character class syntaxes (which include negation, intersection, symmetric difference and, of course, union). Usually you want some kind of interval set structure, and its implementation complexity, in my experience, generally correlates with the amount of additional memory you're willing to use. :-)
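For the curious, the core of an interval set is small -- the fiddly part is keeping the ranges sorted, merged, and non-overlapping after every operation, and deciding how much scratch memory you're willing to spend while doing it. A minimal sketch (not any library's actual code):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Minimal interval-set sketch over codepoints.
    // Invariant after normalize(): ranges are sorted, non-overlapping, non-adjacent.
    struct range { uint32_t lo, hi; };  // both ends inclusive
    using interval_set = std::vector<range>;

    constexpr uint32_t MAX_CP = 0x10FFFF;

    interval_set normalize(interval_set v) {
        std::sort(v.begin(), v.end(),
                  [](const range& a, const range& b) { return a.lo < b.lo; });
        interval_set out;
        for (const range& r : v) {
            if (!out.empty() && r.lo <= out.back().hi + 1)
                out.back().hi = std::max(out.back().hi, r.hi);  // merge/extend
            else
                out.push_back(r);
        }
        return out;
    }

    interval_set set_union(interval_set a, const interval_set& b) {
        a.insert(a.end(), b.begin(), b.end());
        return normalize(std::move(a));
    }

    // Complement within the codepoint space (assumes `a` is normalized);
    // intersection and symmetric difference then fall out via De Morgan.
    interval_set set_negate(const interval_set& a) {
        interval_set out;
        uint32_t next = 0;
        for (const range& r : a) {
            if (r.lo > next) out.push_back({next, r.lo - 1});
            next = r.hi + 1;
        }
        if (next <= MAX_CP) out.push_back({next, MAX_CP});
        return out;
    }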
...yeah, exactly that, thanks for mentioning the character class issues. :-) I think it was partly for this reason (but also others -- like my idea that it would help SAT or SMT solving) that I've also had this idea of making a "universal" int_set class that adaptively adjusts its internal data structure (bitvectors, intervals, "shift-intervals", sparse sets, BDDs, etc.) based on its contents' sparsity patterns... which, for n different data structures, would require O(n^2) or O(n^3) methods to handle set (and perhaps arithmetic) operations between them: n^2 if the output type matches the input type, n^3 if we let the output data structure differ from the inputs'. That's kind of painful, but I think it's doable and likely to be useful. It's yet another ambitious endeavor I may never get around to, or at least not before someone else does...
Given a byte-based encoding like UTF-8, extended 8-bit ASCII variants, or Shift-JIS, a regular expression over Unicode characters can be translated to a more complicated, but not necessarily slow-to-match, regular expression over bytes.
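Worked example (mine): the single-character class [α-ω], i.e. U+03B1 through U+03C9, becomes exactly two byte-range branches in UTF-8, so a byte-level matcher handles it with two two-byte paths and no decoding step:

    #include <cstddef>
    #include <cstdint>

    // [α-ω] == U+03B1..U+03C9 over UTF-8 bytes:
    //     CE B1..BF   covers U+03B1..U+03BF
    //     CF 80..89   covers U+03C0..U+03C9
    bool matches_alpha_to_omega(const uint8_t* p, std::size_t n) {
        if (n != 2) return false;
        if (p[0] == 0xCE) return p[1] >= 0xB1 && p[1] <= 0xBF;
        if (p[0] == 0xCF) return p[1] >= 0x80 && p[1] <= 0x89;
        return false;
    }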
Indeed. That's what I meant by: "For most DFA engines I've seen, they take advantage of "strings" themselves and define the alphabet to be the set of bytes. If you're searching UTF-8, then this also implies that your DFA must include UTF-8 decoding (which is generally the right choice for performance reasons anyway)." :-)
"UTF-8 decoding" usually means translating UTF-8 text to a more convenient character representation like UTF-32 (or, if you are naughty, UTF-16) before processing, not adapting the application to work with UTF-8 data directly.
I think the principle of charity applies here. What other possible interpretation of UTF-8 decoding is there for what I wrote? I mean, the automaton literally contains a UTF-8 automaton right in it, and semantically, it is totally matching in terms of codepoints.
I guess this is supposed to be a C++ library, but the code still looks very "C-like" -- that is, there are almost no C++ features. Of course, it might just be that Cfront didn't really support much, but either way, the current maintainer has a lot of work to do if they want to "review the code and try to improve the use of C++".
> And finally, having followed the development of C++ from its infancy, I wanted to try out its new template facility, so there's a bit of that in the package, too. Arnold has discovered that not only has C++ evolved, but also that without the discipline of -Wall to force clean code, I was rather cavalier about casting, both explicitly and implicitly.