One of the most interesting ways to handle regex and general parsing IMO is by using derivatives. I'm not sure if this uses any of the ideas from that area, but with derivative-based regex parsing, instead of constructing some specific program to parse some language, you parse by repeatedly transforming the language itself: after each character you replace the language with its derivative with respect to that character, i.e. the language of strings that could follow it to complete a match.
In addition to being incredibly simple to implement compared to traditional regex and parser generators, I think it is also interesting in that while it is really simple and looks like you are interpreting a language, it is actually powerful enough to parse arbitrary CFGs in O(n^3) with some help from memoization and fixpoints. See http://matt.might.net/articles/parsing-with-derivatives/ for more details.
I think the interesting aspect in terms of performance with something like this is that the derivative-based parsing system is naturally streaming, since all it keeps as state is the current (derived) regex; the input can be consumed a character at a time. Also, I think you can get around some of the performance issues with constructing these intermediate regexes by replacing certain parts of the regex with more traditional NFAs/DFAs. In particular, I think this method provides an easy way to use simple equality matching for constant strings, DFAs for Kleene star, NFAs for the rest of the typical NFA features like alternation and negation, and finally the derivative for very complicated features like capturing groups and backreferences.
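To make that concrete, here is a minimal sketch of Brzozowski-derivative matching - my own toy code in plain Rust, with no external crates and without the smart-constructor simplification you would want in practice. The matcher just folds the derivative over the input and then asks whether the residual regex accepts the empty string.

    use std::rc::Rc;

    #[derive(Clone)]
    enum Re {
        Empty,               // matches nothing
        Eps,                 // matches the empty string
        Chr(char),           // matches a single character
        Alt(Rc<Re>, Rc<Re>), // r | s
        Cat(Rc<Re>, Rc<Re>), // r s
        Star(Rc<Re>),        // r*
    }

    // Does the regex accept the empty string?
    fn nullable(r: &Re) -> bool {
        match r {
            Re::Empty | Re::Chr(_) => false,
            Re::Eps | Re::Star(_) => true,
            Re::Alt(a, b) => nullable(a) || nullable(b),
            Re::Cat(a, b) => nullable(a) && nullable(b),
        }
    }

    // The derivative of r with respect to c: the language of suffixes w
    // such that c followed by w is in the language of r.
    fn deriv(r: &Re, c: char) -> Re {
        match r {
            Re::Empty | Re::Eps => Re::Empty,
            Re::Chr(a) => if *a == c { Re::Eps } else { Re::Empty },
            Re::Alt(a, b) => Re::Alt(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
            Re::Cat(a, b) => {
                let left = Re::Cat(Rc::new(deriv(a, c)), b.clone());
                if nullable(a) {
                    Re::Alt(Rc::new(left), Rc::new(deriv(b, c)))
                } else {
                    left
                }
            }
            Re::Star(a) => Re::Cat(Rc::new(deriv(a, c)), Rc::new(Re::Star(a.clone()))),
        }
    }

    // Matching is just folding the derivative over the input, one character
    // at a time, and checking whether what is left accepts the empty string.
    fn matches(r: &Re, input: &str) -> bool {
        let mut cur = r.clone();
        for c in input.chars() {
            cur = deriv(&cur, c);
        }
        nullable(&cur)
    }

    fn main() {
        // (ab)*a  -- accepts "a", "aba", "ababa", ...
        let ab = Re::Cat(Rc::new(Re::Chr('a')), Rc::new(Re::Chr('b')));
        let r = Re::Cat(Rc::new(Re::Star(Rc::new(ab))), Rc::new(Re::Chr('a')));
        assert!(matches(&r, "ababa"));
        assert!(!matches(&r, "abab"));
        println!("ok");
    }

Without simplification and memoization the derived regex can grow on every step, which is exactly the performance issue mentioned above.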
Derivatives are simple to explain, but not entirely straightforward to implement efficiently. I personally prefer the Glushkov construction.
The Glushkov construction (which Hyperscan uses among other things) is a similarly straightforward translation from a regular expression to an epsilon-free NFA with nice properties. There's a very neat paper explaining the construction in the form of a play: https://sebfisch.github.io/haskell-regexp/regexp-play.pdf
On the CFG side, I like to think of Earley parsing as analogous to the Glushkov construction. At the very least they have similar properties.
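For reference, here is a toy version of the Glushkov construction - my own sketch, not related to Hyperscan's actual implementation. Each character occurrence gets a unique position; the epsilon-free NFA is then read straight off the nullable/first/last/follow sets, with one state per position plus a start state.

    use std::collections::{BTreeSet, HashMap};

    #[allow(dead_code)]
    enum Re {
        Eps,
        Chr(char),
        Alt(Box<Re>, Box<Re>),
        Cat(Box<Re>, Box<Re>),
        Star(Box<Re>),
    }

    struct Glushkov {
        sym: Vec<char>,                          // sym[p] = character at position p
        follow: HashMap<usize, BTreeSet<usize>>, // positions that may follow p
    }

    impl Glushkov {
        // Returns (nullable, first, last) for the subexpression, assigning
        // positions to character occurrences as it goes and filling in `follow`.
        fn go(&mut self, r: &Re) -> (bool, BTreeSet<usize>, BTreeSet<usize>) {
            match r {
                Re::Eps => (true, BTreeSet::new(), BTreeSet::new()),
                Re::Chr(c) => {
                    let p = self.sym.len();
                    self.sym.push(*c);
                    let s: BTreeSet<usize> = [p].into_iter().collect();
                    (false, s.clone(), s)
                }
                Re::Alt(a, b) => {
                    let (na, fa, la) = self.go(a);
                    let (nb, fb, lb) = self.go(b);
                    (na || nb, &fa | &fb, &la | &lb)
                }
                Re::Cat(a, b) => {
                    let (na, fa, la) = self.go(a);
                    let (nb, fb, lb) = self.go(b);
                    // every last position of `a` may be followed by a first of `b`
                    for &p in &la {
                        self.follow.entry(p).or_default().extend(&fb);
                    }
                    let first = if na { &fa | &fb } else { fa };
                    let last = if nb { &la | &lb } else { lb };
                    (na && nb, first, last)
                }
                Re::Star(a) => {
                    let (_, fa, la) = self.go(a);
                    // the loop: lasts of `a` can be followed by firsts of `a`
                    for &p in &la {
                        self.follow.entry(p).or_default().extend(&fa);
                    }
                    (true, fa, la)
                }
            }
        }
    }

    fn main() {
        // (a|b)*abb
        let r = Re::Cat(
            Box::new(Re::Star(Box::new(Re::Alt(
                Box::new(Re::Chr('a')),
                Box::new(Re::Chr('b')),
            )))),
            Box::new(Re::Cat(
                Box::new(Re::Chr('a')),
                Box::new(Re::Cat(Box::new(Re::Chr('b')), Box::new(Re::Chr('b')))),
            )),
        );
        let mut g = Glushkov { sym: vec![], follow: HashMap::new() };
        let (nullable, first, last) = g.go(&r);
        // The NFA: start -> p on g.sym[p] for p in `first`; p -> q on g.sym[q]
        // for q in follow(p); accepting states are `last` (plus start if nullable).
        println!("nullable={nullable} first={first:?} last={last:?}");
        println!("follow={:?} symbols={:?}", g.follow, g.sym);
    }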
I'm aware of derivatives. My concern with derivatives (which I think are awesome) is that the combinatorial blow-up that we associate with DFAs is always around the corner.
I've never been a huge fan of putting all my eggs in the 'I hope it won't explode' basket. It's been a major point of difference with RE2 all this time - IMO you need a high-performance pass for things that don't determinize cleanly (even when you are doing lazy determinization).
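To make "around the corner" concrete: the textbook bad case is "the n-th symbol from the end is an a", whose NFA has n+1 states but whose DFA has 2^n. A toy subset construction (my own code, unrelated to anything in Hyperscan or RE2) shows the growth directly:

    use std::collections::HashSet;

    // Count reachable DFA states for the language over {a, b} where the
    // n-th symbol from the end is 'a'. The NFA has n + 1 states.
    fn dfa_state_count(n: usize) -> usize {
        // NFA: state 0 loops on both symbols and also moves to 1 on 'a'
        // (guessing "this is the n-th from the end"); state i moves to i + 1
        // on either symbol; state n accepts. Encode state sets as bitmasks.
        let step = |set: u64, sym: char| -> u64 {
            let mut next = 0u64;
            for i in 0..=n {
                if set & (1 << i) == 0 {
                    continue;
                }
                if i == 0 {
                    next |= 1;
                    if sym == 'a' {
                        next |= 1 << 1;
                    }
                } else if i < n {
                    next |= 1 << (i + 1);
                }
            }
            next
        };
        // Standard subset construction, counting the reachable subsets.
        let start = 1u64;
        let mut seen: HashSet<u64> = HashSet::new();
        let mut stack = vec![start];
        seen.insert(start);
        while let Some(s) = stack.pop() {
            for sym in ['a', 'b'] {
                let t = step(s, sym);
                if seen.insert(t) {
                    stack.push(t);
                }
            }
        }
        seen.len()
    }

    fn main() {
        for n in [4, 8, 12, 16] {
            println!("n = {:2}: NFA states = {:2}, DFA states = {}",
                     n, n + 1, dfa_state_count(n));
        }
    }

Lazy determinization only postpones paying for those states until the input actually drives you into them, which is why the explosion is "around the corner" rather than up front.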
I would need to see a lot more detail on using derivatives for capturing groups and back-references.
Sorry if I sound closed-minded, but derivatives look like they solve a lot of stuff I already know how to do really well.
Let me seize the opportunity. I have a problem where I need to match multiple person names (hundreds of thousands) in huge texts. Aho-Corasick works well for exact matches. Could Hyperscan work for approximate matches?
We have both Levenshtein and Hamming distance parameters. Do note that large distance values are not expected to perform well (e.g. matching "Smith" at Levenshtein distance = 3 won't be a happy experience).
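For context, "Levenshtein distance = 3" is asking the engine to decide, per name, the predicate sketched below. This is just the textbook dynamic-programming check, not how Hyperscan implements its approximate matching; it only illustrates what the distance parameter means.

    // A bounded Levenshtein check: is the edit distance between `a` and `b`
    // at most k? Plain O(|a|*|b|) dynamic programming, one row at a time.
    fn within_edit_distance(a: &str, b: &str, k: usize) -> bool {
        let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        for (i, &ca) in a.iter().enumerate() {
            let mut cur = vec![i + 1];
            for (j, &cb) in b.iter().enumerate() {
                let sub = prev[j] + usize::from(ca != cb); // substitute (or match)
                let del = prev[j + 1] + 1;                 // delete from `a`
                let ins = cur[j] + 1;                      // insert into `a`
                cur.push(sub.min(del).min(ins));
            }
            // Early exit: once every entry exceeds k, the distance can only grow.
            if cur.iter().all(|&d| d > k) {
                return false;
            }
            prev = cur;
        }
        *prev.last().unwrap() <= k
    }

    fn main() {
        assert!(within_edit_distance("Smith", "Smyth", 1));    // one substitution
        assert!(!within_edit_distance("Smith", "Schmidt", 1)); // needs several edits
        println!("ok");
    }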
I understand this possibility, but it's not the way I would do an ARM port in an ideal world.
Being the original designer of Hyperscan isn't much help; I don't work at Intel any more and wouldn't have all that much clout in trying to get an ARM port into the codebase.
I'm thinking about building a followup regex matcher - considerably smaller scale, and starting with a Hyperscan code base for the tools/infrastructure/boring bits - ("ure3"). This would definitely have ARM as a first-class architecture. Anyone 64-bit and little-endian can play, I would think (legacy stuff like 32-bit, etc can go somewhere else).
I just wish the ARM guys would get their shit together and support their own much-ballyhooed SVE. The idea of announcing something in 2016 and not supporting it in a server chip that's due for mid-2020 is... novel.
The reference to "Australian-free" has piqued my curiosity. Do you care to elaborate? In my imagination you had an Australian collaborator that you fell out with and have been busy expunging all "Australian" code from the project!
This appears to be the same algorithm used by ripgrep (https://github.com/BurntSushi/ripgrep) for searching when SIMD/AVX is enabled. Specifically, it uses the Teddy algorithm, which was extracted from the library described in this paper.
ripgrep author here. Yes, the Teddy algorithm was originally extracted from Hyperscan. But this is a teeny tiny piece of Hyperscan. And AFAIK, ripgrep's implementation (which is actually in the underlying regex library) doesn't carry over the full Teddy algorithm. Or so I've been told. :-)
Both libraries also rely on a number of techniques from existing published research.
Hyperscan is ruthlessly optimized for x86 platforms, but has dropped support for non-x86 platforms. Rust's regex crate is designed more as a general purpose, portable regular expression implementation, suitable for wide scale use.
I like 'ruthlessly'. It sounds very butch, especially as applied to an activity largely carried out by seated, balding men hunched over terminals in an office.
Rust regex is suited to wide-scale use. It is not suited to the kind of large-scale cases that Hyperscan handles, but it's well done.
Many of Hyperscan's techniques are sui generis and not published. Hopefully some of that will be rectified over time.
Well, it kinda is. AFAIK, that's exactly what this was made for, and Intel/McAfee was going to create a DPI device using this tech. That never happened, and now Snort uses/will use Hyperscan for its pattern matcher.
I'm drawn to hairy parsers for implementing DSL-rich syntactically-extensible programming languages. Hairy, as in, extensible and scoped; full-ish Perl-compatible RE, including parse-influencing code execution; n-ary multifix operator precedence; backtrack/operator-precedence/regex sandwiches. Subengines can be used to variously strength reduce parts of the grammar, and to dynamically prune search. It's been some years, so I'm out of date...
Does anyone have experience with MNCaRT? [1] I've long wished for a toolkit for doing grammar analysis and building assemblages of specialized high-performance engines into relatively performant parsers.
Has there been work on using multiple-pattern RE engines (like RE2::Set or Hyperscan) in some context similar to language parsing? Or as subengines of a larger parser?
I like the UVa guys, but MNCaRT isn't very mature and was mostly there to support the Automata Processor. Since Micron kicked that product out the door (into a startup that might be charitably described as 'moribund', Natural Intelligence Semiconductor) there's not much point in starting with that codebase.
Some customers used Hyperscan as a primary matching engine with a secondary state machine / parser behind, but I'm not at liberty to discuss specifics. In any case these guys were interested in network threat detection, not generalized parsing.
I wouldn't use RE2::Set for language parsing, as it can only tell you that certain patterns occurred, but won't give you offsets, just a single bit.
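The analogous multi-pattern API in Rust's regex crate, RegexSet, has the same shape, which makes the limitation easy to see (this snippet assumes the regex crate as a dependency; it's not RE2 itself):

    use regex::RegexSet;

    fn main() {
        let set = RegexSet::new(["\\bfn\\b", "\\bstruct\\b", "\\benum\\b"]).unwrap();
        let matched: Vec<usize> = set.matches("struct Foo; fn bar() {}").iter().collect();
        // We learn that patterns 0 and 1 matched somewhere in the input, but get
        // no offsets; locating them needs a second pass with individual regexes.
        assert_eq!(matched, vec![0, 1]);
        println!("patterns that matched: {:?}", matched);
    }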
The problem with the charmingly described 'hairy parser' is that it will just be 'one damn thing after another' - a lot of code, a lot of weird semantic corners, etc. Why not just use a parser that's intended for generality - something like ANTLR? What will taping all that stuff together accomplish?
We didn't set out to create a huge codebase with weird corners with Hyperscan either, and we still got one. If you go out with the intent of building something hairy from the start.... :-/
Let's see... I'm coming at this with objectives around programming experience, rather than around parser implementation. I want to be able to do grammar design that is tightly tied to problem domain expression, rather than to parser tech. The usual dance, of adapting a pretty problem-domain grammar to available parsers, by grammar uglification, transformation, and kludgery... I'd like that to be optional - a thing of optimization, not of minimum viable effort. I'd ideally like parser design choices to escape the parser only as performance variance.
So yes, I'd love a general parser generator, that accepts any grammar, and ideally does some reasonable best-effort transformation and compilation to subengines given the mess you've handed it. I've not seen that. But since I'd not seen RE2::Set, it's perhaps been a decade since I seriously looked around, so maybe there's new niftiness?
I've tried ANTLR a couple of times over the years, at least for toys, maybe a no-templates C++ parser. (It was weird - I never figured it out, and it hasn't happened with anything else, but I just viscerally disliked working with it.) Others... None were without sacrifices that required them being wrapped in hair.
So I dreamed of someone creating a parser generator toolkit, a library of engines, reusable rather than inextricably tangled in yet another parser silo. I saw Hyperscan and thought, hmm... might a toolkit come with a regexp api? :)
A rich parser doesn't have to be complex, if you sacrifice speed. A Perl 5 compatible (some old version) regexp engine can be done in a page or two of prolog. So I used to focus on expressivity, and then scramble to get back to minimally tolerable performance.
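In that spirit, here is roughly how small a naive backtracking matcher can be - a Kernighan/Pike-style toy translated to Rust by me, supporting only literals, '.', '*', '^' and '$'. Engines of this shape are easy to write and extend; the cost is performance, which you then scramble to recover.

    // Supports: literal chars, '.', 'c*', leading '^', trailing '$'.
    fn match_here(re: &[char], text: &[char]) -> bool {
        match re {
            [] => true,
            ['$'] => text.is_empty(),
            [c, '*', rest @ ..] => {
                // Try the rest of the pattern after consuming zero, one, two, ...
                // occurrences of `c` (or of any char, if `c` is '.').
                let mut t = text;
                loop {
                    if match_here(rest, t) {
                        return true;
                    }
                    match t.split_first() {
                        Some((&head, tail)) if *c == '.' || head == *c => t = tail,
                        _ => return false,
                    }
                }
            }
            [c, rest @ ..] => match text.split_first() {
                Some((&head, tail)) if *c == '.' || head == *c => match_here(rest, tail),
                _ => false,
            },
        }
    }

    fn regex_match(re: &str, text: &str) -> bool {
        let re: Vec<char> = re.chars().collect();
        let text: Vec<char> = text.chars().collect();
        if let Some(('^', rest)) = re.split_first().map(|(c, r)| (*c, r)) {
            return match_here(rest, &text);
        }
        // Unanchored: try every starting offset (this is where the slowness lives).
        (0..=text.len()).any(|i| match_here(&re, &text[i..]))
    }

    fn main() {
        assert!(regex_match("a.*b$", "xxaYYYb"));
        assert!(!regex_match("^ab*c", "abbd"));
        println!("ok");
    }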
Minimum-viable compiler performance is arguably dropping dramatically now, with cloud-parallel deterministic compilation and community-scale caching. So maybe something simple could now have viable pragmatics.
I saw Hyperscan, and was hit by old dreams of speed. Multiple wizzy subengines woven together. I'd woven in GLRs before, but multiple patterns... oooh, what leverage might be found there?! :)
> What will taping all that stuff together accomplish?
Extensible and scoped parsing... the immensely expensive Python 2 to 3 transition was in part a design choice to avoid scoped method dispatch and file-scoped parsing of both languages. I suggest it was the wrong call.
PCRE with parse-affecting code... say rather, each time you lose a feature, some set of problems gets harder. I'd prefer that to be the harder-but-slower of falling back to a less-specialized engine (which sometimes isn't a problem), rather than the harder-go-back-and-rewrite of nonimplementation (which always is). Grammar restriction as premature optimization.
N-ary multifix operators... lets you easily parse expressions of rich operators, from Smalltalk to math. Modern IDEs seem sufficient to address puzzlement over "how and why did this (not)parse".
A "sandwich" of regexp for tokens, operator precedence parser for extensible expressions, and something backtracking for an extensible list of statements, is one way to get a somewhat traditional language parser that's more nicely extensible and expressive.
> a lot of weird semantic corners
Yes, but... I'd like the choice to avoid semantic weirdness, or not, to happen at the application level, not at the parser api level. Because it's a tradeoff. I accept the PEG argument that often simplicity and composability is the right design choice, is worth the expressive cost. But not the argument that anything PEGs can't handle is a "legacy language" which people shouldn't be using anyway (fun conversation with a PEG person).
So yes, I'd like to see programming language parsing become far richer than currently, and thus somewhat hairier. Because avoiding that is, I suggest, inflicting far greater costs elsewhere.
"Hyperscan Things That Didn’t Make It Into Open Source Hyperscan ... Ports to non-x86 platforms."
This is a big shame, and undercuts the title. Is there any way to release the older stuff so that interested folks can work on it to bring the other arches up to parity? I'm a Power ISA bigot in particular but not having an ARM port seems like a big gap.
I suppose "Some Modern CPUs" was too long-winded a title?
As I said, it doesn't take a genius to understand Intel's motivations. They bought the project, after all.
Not being @ Intel anymore, I don't have access to the older stuff, and even if I did, the codebase has diverged significantly since.
Without going on too much of a tirade - the experience of developing for all those platforms really sucked. Almost all the non-x86 platforms had significant bugs in their toolchains. One of the MIPS variants (particular architecture elided to spare the guilty) had bugs in their gcc intrinsics in a way that suggested that no-one had ever done any significant third-party dev on the platform.
Big-endian was also a huge PITA.
It was a ton of work to keep all those systems alive, and our machine rack looked like a zoo of dev boards and weirdo devices.
In the "ure3" system I mention, I would make retargetability/portability to other systems a first-class goal.
One way of achieving this is not having such a huge profusion of methods and complexity. Hyperscan is over-engineered for many use cases if you aren't a network company looking to scan 5,000 complex regexes in streaming mode at hopefully maximal performance.
glangdale can probably say more, but I think the high level answer is pretty easy: re2c is intensely focused on building state machines, in their entirety, ahead of time. There's also some cool stuff that lets you integrate the state machine with the rest of your code. AFAIK, re2c sticks fairly strictly to state machines and doesn't do sophisticated optimizations like Hyperscan, particularly with respect to literals. To me, they are fundamentally solving different problems. re2c is only viable if you can actually deal with the full size of the DFAs you build.
Yes, this is an accurate summary. re2c works when it works, and there's clearly a good niche for "pattern matching stuff that determinizes cleanly". However, in the general case, DFAs catch fire (combinatorial explosion in # of states), especially when handling multiple regular expressions.
IIRC SpamAssassin has lots of quite hard patterns and re2c can only handle a subset. I forget what the fallback position is (libpcre?).