One of the most interesting ways to handle regex and general parsing IMO is by using derivatives. I'm not sure if this uses any of the ideas from that area, but with derivative-based regex parsing, instead of constructing some specific program to parse some language, you parse by repeatedly transforming the language itself: after each character you replace the language with its derivative with respect to that character, i.e. the language of strings that could follow it to complete a match.
In addition to being incredibly simple to implement compared to traditional regex and parser generators, I think it is also interesting in that while it is really simple and looks like you are interpreting a language, it is actually powerful enough to parse arbitrary CFGs in O(n^3) with some help from memoization and fixpoints. See http://matt.might.net/articles/parsing-with-derivatives/ for more details.
I think the interesting aspect in terms of performance with something like this is that the derivative-based parsing system is naturally streaming, since all it keeps as state is the current (derived) regex; the input can be consumed a character at a time. Also, I think you can get around some of the performance issues with constructing these intermediate regexes by replacing certain parts of the regex with more traditional NFAs/DFAs. In particular, I think this method provides an easy way to use simple equality matching for constant strings, DFAs for Kleene star, NFAs for the rest of the typical NFA features like alternation and negation, and finally the derivative for very complicated features like capturing groups and backreferences.
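To make that concrete, here is a minimal sketch of Brzozowski-derivative matching - my own toy code in plain Rust, with no external crates and without the smart-constructor simplification you would want in practice. The matcher just folds the derivative over the input and then asks whether the residual regex accepts the empty string.

    use std::rc::Rc;

    #[derive(Clone)]
    enum Re {
        Empty,               // matches nothing
        Eps,                 // matches the empty string
        Chr(char),           // matches a single character
        Alt(Rc<Re>, Rc<Re>), // r | s
        Cat(Rc<Re>, Rc<Re>), // r s
        Star(Rc<Re>),        // r*
    }

    // Does the regex accept the empty string?
    fn nullable(r: &Re) -> bool {
        match r {
            Re::Empty | Re::Chr(_) => false,
            Re::Eps | Re::Star(_) => true,
            Re::Alt(a, b) => nullable(a) || nullable(b),
            Re::Cat(a, b) => nullable(a) && nullable(b),
        }
    }

    // The derivative of r with respect to c: the language of suffixes w
    // such that c followed by w is in the language of r.
    fn deriv(r: &Re, c: char) -> Re {
        match r {
            Re::Empty | Re::Eps => Re::Empty,
            Re::Chr(a) => if *a == c { Re::Eps } else { Re::Empty },
            Re::Alt(a, b) => Re::Alt(Rc::new(deriv(a, c)), Rc::new(deriv(b, c))),
            Re::Cat(a, b) => {
                let left = Re::Cat(Rc::new(deriv(a, c)), b.clone());
                if nullable(a) {
                    Re::Alt(Rc::new(left), Rc::new(deriv(b, c)))
                } else {
                    left
                }
            }
            Re::Star(a) => Re::Cat(Rc::new(deriv(a, c)), Rc::new(Re::Star(a.clone()))),
        }
    }

    // Matching is just folding the derivative over the input, one character
    // at a time, and checking whether what is left accepts the empty string.
    fn matches(r: &Re, input: &str) -> bool {
        let mut cur = r.clone();
        for c in input.chars() {
            cur = deriv(&cur, c);
        }
        nullable(&cur)
    }

    fn main() {
        // (ab)*a  -- accepts "a", "aba", "ababa", ...
        let ab = Re::Cat(Rc::new(Re::Chr('a')), Rc::new(Re::Chr('b')));
        let r = Re::Cat(Rc::new(Re::Star(Rc::new(ab))), Rc::new(Re::Chr('a')));
        assert!(matches(&r, "ababa"));
        assert!(!matches(&r, "abab"));
        println!("ok");
    }

Without simplification and memoization the derived regex can grow on every step, which is exactly the performance issue mentioned above.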
Derivatives are simple to explain, but not entirely straightforward to implement efficiently. I personally prefer the Glushkov construction.
The Glushkov construction (which Hyperscan uses among other things) is a similarly straightforward translation from a regular expression to an epsilon-free NFA with nice properties. There's a very neat paper explaining the construction in the form of a play: https://sebfisch.github.io/haskell-regexp/regexp-play.pdf
On the CFG side, I like to think of Earley parsing as analogous to the Glushkov construction. At the very least they have similar properties.
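For reference, here is a toy version of the Glushkov construction - my own sketch, not related to Hyperscan's actual implementation. Each character occurrence gets a unique position; the epsilon-free NFA is then read straight off the nullable/first/last/follow sets, with one state per position plus a start state.

    use std::collections::{BTreeSet, HashMap};

    #[allow(dead_code)]
    enum Re {
        Eps,
        Chr(char),
        Alt(Box<Re>, Box<Re>),
        Cat(Box<Re>, Box<Re>),
        Star(Box<Re>),
    }

    struct Glushkov {
        sym: Vec<char>,                          // sym[p] = character at position p
        follow: HashMap<usize, BTreeSet<usize>>, // positions that may follow p
    }

    impl Glushkov {
        // Returns (nullable, first, last) for the subexpression, assigning
        // positions to character occurrences as it goes and filling in `follow`.
        fn go(&mut self, r: &Re) -> (bool, BTreeSet<usize>, BTreeSet<usize>) {
            match r {
                Re::Eps => (true, BTreeSet::new(), BTreeSet::new()),
                Re::Chr(c) => {
                    let p = self.sym.len();
                    self.sym.push(*c);
                    let s: BTreeSet<usize> = [p].into_iter().collect();
                    (false, s.clone(), s)
                }
                Re::Alt(a, b) => {
                    let (na, fa, la) = self.go(a);
                    let (nb, fb, lb) = self.go(b);
                    (na || nb, &fa | &fb, &la | &lb)
                }
                Re::Cat(a, b) => {
                    let (na, fa, la) = self.go(a);
                    let (nb, fb, lb) = self.go(b);
                    // every last position of `a` may be followed by a first of `b`
                    for &p in &la {
                        self.follow.entry(p).or_default().extend(&fb);
                    }
                    let first = if na { &fa | &fb } else { fa };
                    let last = if nb { &la | &lb } else { lb };
                    (na && nb, first, last)
                }
                Re::Star(a) => {
                    let (_, fa, la) = self.go(a);
                    // the loop: lasts of `a` can be followed by firsts of `a`
                    for &p in &la {
                        self.follow.entry(p).or_default().extend(&fa);
                    }
                    (true, fa, la)
                }
            }
        }
    }

    fn main() {
        // (a|b)*abb
        let r = Re::Cat(
            Box::new(Re::Star(Box::new(Re::Alt(
                Box::new(Re::Chr('a')),
                Box::new(Re::Chr('b')),
            )))),
            Box::new(Re::Cat(
                Box::new(Re::Chr('a')),
                Box::new(Re::Cat(Box::new(Re::Chr('b')), Box::new(Re::Chr('b')))),
            )),
        );
        let mut g = Glushkov { sym: vec![], follow: HashMap::new() };
        let (nullable, first, last) = g.go(&r);
        // The NFA: start -> p on g.sym[p] for p in `first`; p -> q on g.sym[q]
        // for q in follow(p); accepting states are `last` (plus start if nullable).
        println!("nullable={nullable} first={first:?} last={last:?}");
        println!("follow={:?} symbols={:?}", g.follow, g.sym);
    }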
I'm aware of derivatives. My concern with derivatives (which I think are awesome) is that the combinatorial blow-up that we associate with DFAs is always around the corner.
I've never been a huge fan of putting all my eggs in the 'I hope it won't explode' basket. It's been a major point of difference with RE2 all this time - IMO you need a high-performance pass for things that don't determinize cleanly (even when you are doing lazy determinization).
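To make "around the corner" concrete: the textbook bad case is "the n-th symbol from the end is an a", whose NFA has n+1 states but whose DFA has 2^n. A toy subset construction (my own code, unrelated to anything in Hyperscan or RE2) shows the growth directly:

    use std::collections::HashSet;

    // Count reachable DFA states for the language over {a, b} where the
    // n-th symbol from the end is 'a'. The NFA has n + 1 states.
    fn dfa_state_count(n: usize) -> usize {
        // NFA: state 0 loops on both symbols and also moves to 1 on 'a'
        // (guessing "this is the n-th from the end"); state i moves to i + 1
        // on either symbol; state n accepts. Encode state sets as bitmasks.
        let step = |set: u64, sym: char| -> u64 {
            let mut next = 0u64;
            for i in 0..=n {
                if set & (1 << i) == 0 {
                    continue;
                }
                if i == 0 {
                    next |= 1;
                    if sym == 'a' {
                        next |= 1 << 1;
                    }
                } else if i < n {
                    next |= 1 << (i + 1);
                }
            }
            next
        };
        // Standard subset construction, counting the reachable subsets.
        let start = 1u64;
        let mut seen: HashSet<u64> = HashSet::new();
        let mut stack = vec![start];
        seen.insert(start);
        while let Some(s) = stack.pop() {
            for sym in ['a', 'b'] {
                let t = step(s, sym);
                if seen.insert(t) {
                    stack.push(t);
                }
            }
        }
        seen.len()
    }

    fn main() {
        for n in [4, 8, 12, 16] {
            println!("n = {:2}: NFA states = {:2}, DFA states = {}",
                     n, n + 1, dfa_state_count(n));
        }
    }

Lazy determinization only postpones paying for those states until the input actually drives you into them, which is why the explosion is "around the corner" rather than up front.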
I would need to see a lot more detail on using derivatives for capturing groups and back-references.
Sorry if I sound closed-minded, but derivatives look like they solve a lot of stuff I already know how to do really well.
Let me seize the opportunity. I have a problem where I need to match multiple person names (hundreds of thousands) in huge texts. Aho-Corasick works well for exact matches. Could Hyperscan work for approximate matches?
We have both Levenshtein and Hamming distance parameters. Do note that large distance values are not expected to perform well (e.g. matching "Smith" at Levenshtein distance = 3 won't be a happy experience).
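For context, "Levenshtein distance = 3" is asking the engine to decide, per name, the predicate sketched below. This is just the textbook dynamic-programming check, not how Hyperscan implements its approximate matching; it only illustrates what the distance parameter means.

    // A bounded Levenshtein check: is the edit distance between `a` and `b`
    // at most k? Plain O(|a|*|b|) dynamic programming, one row at a time.
    fn within_edit_distance(a: &str, b: &str, k: usize) -> bool {
        let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
        let mut prev: Vec<usize> = (0..=b.len()).collect();
        for (i, &ca) in a.iter().enumerate() {
            let mut cur = vec![i + 1];
            for (j, &cb) in b.iter().enumerate() {
                let sub = prev[j] + usize::from(ca != cb); // substitute (or match)
                let del = prev[j + 1] + 1;                 // delete from `a`
                let ins = cur[j] + 1;                      // insert into `a`
                cur.push(sub.min(del).min(ins));
            }
            // Early exit: once every entry exceeds k, the distance can only grow.
            if cur.iter().all(|&d| d > k) {
                return false;
            }
            prev = cur;
        }
        *prev.last().unwrap() <= k
    }

    fn main() {
        assert!(within_edit_distance("Smith", "Smyth", 1));    // one substitution
        assert!(!within_edit_distance("Smith", "Schmidt", 1)); // needs several edits
        println!("ok");
    }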
I understand this possibility, but it's not the way I would do an ARM port in an ideal world.
Being the original designer of Hyperscan isn't much help; I don't work at Intel any more and wouldn't have all that much clout in trying to get an ARM port into the codebase.
I'm thinking about building a followup regex matcher - considerably smaller scale, and starting with a Hyperscan code base for the tools/infrastructure/boring bits - ("ure3"). This would definitely have ARM as a first-class architecture. Anyone 64-bit and little-endian can play, I would think (legacy stuff like 32-bit, etc can go somewhere else).
I just wish the ARM guys would get their shit together and support their own much-ballyhooed SVE. The idea of announcing something in 2016 and not supporting it in a server chip that's due for mid-2020 is... novel.
The reference to "Australian-free" has piqued my curiosity. Do you care to elaborate? In my imagination you had an Australian collaborator that you fell out with and have been busy expunging all "Australian" code from the project!
This appears to be the same algorithm used by ripgrep (https://github.com/BurntSushi/ripgrep) for searching when SIMD/AVX is enabled. Specifically, it uses the Teddy algorithm, which was extracted from the library described in this paper.
ripgrep author here. Yes, the Teddy algorithm was originally extracted from Hyperscan. But this is a teeny tiny piece of Hyperscan. And AFAIK, ripgrep's implementation (which is actually in the underlying regex library) doesn't carry over the full Teddy algorithm. Or so I've been told. :-)
Both libraries also rely on a number of techniques from existing published research.
Hyperscan is ruthlessly optimized for x86 platforms, but has dropped support for non-x86 platforms. Rust's regex crate is designed more as a general purpose, portable regular expression implementation, suitable for wide scale use.
I like 'ruthlessly'. It sounds very butch, especially as applied to an activity largely carried out by seated, balding men hunched over terminals in an office.
Rust regex is suited to wide-scale use. It is not suited to the kind of large-scale cases that Hyperscan handles, but it's well done.
Many of Hyperscan's techniques are sui generis and not published. Hopefully some of that will be rectified over time.
Well, it kinda is. AFAIK, that's exactly what this was made for, and Intel/McAfee was going to create a DPI device using this tech. That never happened, and now Snort uses/will use Hyperscan for its pattern matcher.
I'm drawn to hairy parsers for implementing DSL-rich syntactically-extensible programming languages. Hairy, as in, extensible and scoped; full-ish Perl-compatible RE, including parse-influencing code execution; n-ary multifix operator precedence; backtrack/operator-precedence/regex sandwiches. Subengines can be used to variously strength reduce parts of the grammar, and to dynamically prune search. It's been some years, so I'm out of date...
Does anyone have experience with MNCaRT? [1] I've long wished for a toolkit for doing grammar analysis and building assemblages of specialized high-performance engines into relatively performant parsers.
Has there been work on using multiple-pattern RE engines (like RE2::Set or Hyperscan) in some context similar to language parsing? Or as subengines of a larger parser?
I like the UVa guys, but MNCaRT isn't very mature and was mostly there to support the Automata Processor. Since Micron kicked that product out the door (into a startup that might be charitably described as 'moribund', Natural Intelligence Semiconductor) there's not much point in starting with that codebase.
Some customers used Hyperscan as a primary matching engine with a secondary state machine / parser behind, but I'm not at liberty to discuss specifics. In any case these guys were interested in network threat detection, not generalized parsing.
I wouldn't use RE2::Set for language parsing, as it can only tell you that certain patterns occurred, but won't give you offsets, just a single bit.
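The analogous multi-pattern API in Rust's regex crate, RegexSet, has the same shape, which makes the limitation easy to see (this snippet assumes the regex crate as a dependency; it's not RE2 itself):

    use regex::RegexSet;

    fn main() {
        let set = RegexSet::new(["\\bfn\\b", "\\bstruct\\b", "\\benum\\b"]).unwrap();
        let matched: Vec<usize> = set.matches("struct Foo; fn bar() {}").iter().collect();
        // We learn that patterns 0 and 1 matched somewhere in the input, but get
        // no offsets; locating them needs a second pass with individual regexes.
        assert_eq!(matched, vec![0, 1]);
        println!("patterns that matched: {:?}", matched);
    }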
The problem with the charmingly described 'hairy parser' is that it will just be 'one damn thing after another' - a lot of code, a lot of weird semantic corners, etc. Why not just use a parser that's intended for generality - something like ANTLR? What will taping all that stuff together accomplish?
We didn't set out to create a huge codebase with weird corners with Hyperscan either, and we still got one. If you go out with the intent of building something hairy from the start.... :-/
Let's see... I'm coming at this with objectives around programming experience, rather than around parser implementation. I want to be able to do grammar design that is tightly tied to problem domain expression, rather than to parser tech. The usual dance, of adapting a pretty problem-domain grammar to available parsers, by grammar uglification, transformation, and kludgery... I'd like that to be optional - a thing of optimization, not of minimum viable effort. I'd ideally like parser design choices to escape the parser only as performance variance.
So yes, I'd love a general parser generator, that accepts any grammar, and ideally does some reasonable best-effort transformation and compilation to subengines given the mess you've handed it. I've not seen that. But since I'd not seen RE2::Set, it's perhaps been a decade since I seriously looked around, so maybe there's new niftiness?
I've tried ANTLR a couple of times over the years, at least for toys, maybe a no-templates C++ parser. (It was weird - I never figured it out, and it hasn't happened with anything else, but I just viscerally disliked working with it.) Others... None were without sacrifices that required them being wrapped in hair.
So I dreamed of someone creating a parser generator toolkit, a library of engines, reusable rather than inextricably tangled in yet another parser silo. I saw Hyperscan and thought, hmm... might a toolkit come with a regexp api? :)
A rich parser doesn't have to be complex, if you sacrifice speed. A Perl 5 compatible (some old version) regexp engine can be done in a page or two of prolog. So I used to focus on expressivity, and then scramble to get back to minimally tolerable performance.
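In that spirit, here is roughly how small a naive backtracking matcher can be - a Kernighan/Pike-style toy translated to Rust by me, supporting only literals, '.', '*', '^' and '$'. Engines of this shape are easy to write and extend; the cost is performance, which you then scramble to recover.

    // Supports: literal chars, '.', 'c*', leading '^', trailing '$'.
    fn match_here(re: &[char], text: &[char]) -> bool {
        match re {
            [] => true,
            ['$'] => text.is_empty(),
            [c, '*', rest @ ..] => {
                // Try the rest of the pattern after consuming zero, one, two, ...
                // occurrences of `c` (or of any char, if `c` is '.').
                let mut t = text;
                loop {
                    if match_here(rest, t) {
                        return true;
                    }
                    match t.split_first() {
                        Some((&head, tail)) if *c == '.' || head == *c => t = tail,
                        _ => return false,
                    }
                }
            }
            [c, rest @ ..] => match text.split_first() {
                Some((&head, tail)) if *c == '.' || head == *c => match_here(rest, tail),
                _ => false,
            },
        }
    }

    fn regex_match(re: &str, text: &str) -> bool {
        let re: Vec<char> = re.chars().collect();
        let text: Vec<char> = text.chars().collect();
        if let Some(('^', rest)) = re.split_first().map(|(c, r)| (*c, r)) {
            return match_here(rest, &text);
        }
        // Unanchored: try every starting offset (this is where the slowness lives).
        (0..=text.len()).any(|i| match_here(&re, &text[i..]))
    }

    fn main() {
        assert!(regex_match("a.*b$", "xxaYYYb"));
        assert!(!regex_match("^ab*c", "abbd"));
        println!("ok");
    }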
Minimum-viable compiler performance is arguably dropping dramatically now, with cloud-parallel deterministic compilation and community-scale caching. So maybe something simple could now have viable pragmatics.
I saw Hyperscan, and was hit by old dreams of speed. Multiple wizzy subengines woven together. I'd woven in GLRs before, but multiple patterns... oooh, what leverage might be found there?! :)
> What will taping all that stuff together accomplish?
Extensible and scoped parsing... the immensely expensive Python 2 to 3 transition was in part a design choice to avoid scoped method dispatch and file-scoped parsing of both languages. I suggest it was the wrong call.
PCRE with parse-affecting code... say rather, each time you lose a feature, some set of problems gets harder. I'd prefer that to be the harder-but-slower of falling back to a less-specialized engine (which sometimes isn't a problem), rather than the harder-go-back-and-rewrite of nonimplementation (which always is). Grammar restriction as premature optimization.
N-ary multifix operators... lets you easily parse expressions of rich operators, from Smalltalk to math. Modern IDEs seem sufficient to address puzzlement over "how and why did this (not)parse".
A "sandwich" of regexp for tokens, operator precedence parser for extensible expressions, and something backtracking for an extensible list of statements, is one way to get a somewhat traditional language parser that's more nicely extensible and expressive.
> a lot of weird semantic corners
Yes, but... I'd like the choice to avoid semantic weirdness, or not, to happen at the application level, not at the parser api level. Because it's a tradeoff. I accept the PEG argument that often simplicity and composability is the right design choice, is worth the expressive cost. But not the argument that anything PEGs can't handle is a "legacy language" which people shouldn't be using anyway (fun conversation with a PEG person).
So yes, I'd like to see programming language parsing become far richer than currently, and thus somewhat hairier. Because avoiding that is, I suggest, inflicting far greater costs elsewhere.
"Hyperscan Things That Didn’t Make It Into Open Source Hyperscan ... Ports to non-x86 platforms."
This is a big shame, and undercuts the title. Is there any way to release the older stuff so that interested folks can work on it to bring the other arches up to parity? I'm a Power ISA bigot in particular but not having an ARM port seems like a big gap.
I suppose "Some Modern CPUs" was too long-winded a title?
As I said, it doesn't take a genius to understand Intel's motivations. They bought the project, after all.
Not being @ Intel anymore, I don't have access to the older stuff, and even if I did, the codebase has diverged significantly since.
Without going on too much of a tirade - the experience of developing for all those platforms really sucked. Almost all the non-x86 platforms had significant bugs in their toolchains. One of the MIPS variants (particular architecture elided to spare the guilty) had bugs in their gcc intrinsics in a way that suggested that no-one had ever done any significant third-party dev on the platform.
Big-endian was also a huge PITA.
It was a ton of work to keep all those systems alive, and our machine rack looked like a zoo of dev boards and weirdo devices.
In the "ure3" system I mention, I would make retargetability/portability to other systems a first-class goal.
One way of achieving this is not having such a huge profusion of methods and complexity. Hyperscan is over-engineered for many use cases if you aren't a network company looking to scan 5,000 complex regexes in streaming mode at hopefully maximal performance.
glangdale can probably say more, but I think the high level answer is pretty easy: re2c is intensely focused on building state machines, in their entirety, ahead of time. There's also some cool stuff that lets you integrate the state machine with the rest of your code. AFAIK, re2c sticks fairly strictly to state machines and doesn't do sophisticated optimizations like Hyperscan, particularly with respect to literals. To me, they are fundamentally solving different problems. re2c is only viable if you can actually deal with the full size of the DFAs you build.
Yes, this is an accurate summary. re2c works when it works, and there's clearly a good niche for "pattern matching stuff that determinizes cleanly". However, in the general case, DFAs catch fire (combinatorial explosion in # of states), especially when handling multiple regular expressions.
IIRC SpamAssassin has lots of quite hard patterns and re2c can only handle a subset. I forget what the fallback position is (libpcre?).