
Hyperscan: A Fast Multi-Pattern Regex Matcher for Modern CPUs - glangdale
https://branchfree.org/2019/02/28/paper-hyperscan-a-fast-multi-pattern-regex-matcher-for-modern-cpus/
======
slaymaker1907
One of the most interesting ways to handle regex and general parsing IMO is by
using derivatives. I'm not sure if this uses any of the ideas from that area,
but with derivative-based regex parsing, instead of constructing some specific
program to parse some language, you parse by repeatedly transforming the
language itself: after seeing a character, you replace the language with its
derivative with respect to that character - the set of strings w such that the
character followed by w was in the old language.

In addition to being incredibly simple to implement compared to traditional
regex engines and parser generators, I think it is also interesting in that,
while it looks like you are just interpreting a language, it is actually
powerful enough to parse arbitrary CFGs in O(n^3) with some help from
memoization and fixpoints. See [http://matt.might.net/articles/parsing-with-derivatives/](http://matt.might.net/articles/parsing-with-derivatives/)
for more details.
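
To make that concrete, here's a minimal Brzozowski-derivative matcher in
Python (my own illustration of the idea, not code from the article or the
linked post). There is no automaton: the only state is the current regex,
which gets rewritten once per input character.

    # A minimal Brzozowski-derivative regex matcher (illustrative sketch).
    # Regexes are tuples: ('empty',) matches nothing, ('eps',) matches "",
    # plus ('chr', c), ('cat', r, s), ('alt', r, s) and ('star', r).

    def nullable(r):
        """Does r match the empty string?"""
        tag = r[0]
        if tag in ('empty', 'chr'):
            return False
        if tag in ('eps', 'star'):
            return True
        if tag == 'cat':
            return nullable(r[1]) and nullable(r[2])
        return nullable(r[1]) or nullable(r[2])  # alt

    def deriv(r, c):
        """Derivative of r with respect to c: matches exactly the strings
        w such that c + w is matched by r."""
        tag = r[0]
        if tag in ('empty', 'eps'):
            return ('empty',)
        if tag == 'chr':
            return ('eps',) if r[1] == c else ('empty',)
        if tag == 'cat':
            left = ('cat', deriv(r[1], c), r[2])
            return ('alt', left, deriv(r[2], c)) if nullable(r[1]) else left
        if tag == 'alt':
            return ('alt', deriv(r[1], c), deriv(r[2], c))
        return ('cat', deriv(r[1], c), r)  # star: d(r*) = d(r) . r*

    def matches(r, s):
        for c in s:  # streaming: the rewritten regex is the only state
            r = deriv(r, c)
        return nullable(r)

    # (ab)*a matches "a", "aba", "ababa", ...
    r = ('cat', ('star', ('cat', ('chr', 'a'), ('chr', 'b'))), ('chr', 'a'))
    assert matches(r, 'aba') and not matches(r, 'ab')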

I think the interesting aspect in terms of performance with something like
this is that the derivative-based parsing system is explicitly streaming since
it only stores the remaining input string and a regex. Also, I think you can
get around some of the performance issues with constructing these intermediate
regexes by replacing certain parts of the regex with more traditional
NFAs/DFAs. In particular, I think this method provides an easy way to use
simple equality matching for constant strings, DFAs for Kleene star, NFAs for
the rest of the typical NFA features like the or operator and negation, and
finally the derivative for very complicated features like capturing groups and
backreferences.

~~~
fmap
Derivatives are simple to explain, but not entirely straightforward to
implement efficiently. I personally prefer the Glushkov construction.

The Glushkov construction (which Hyperscan uses among other things) is a
similarly straightforward translation from a regular expression to an epsilon-
free NFA with nice properties. There's a very neat paper explaining the
construction in the form of a play: [https://sebfisch.github.io/haskell-
regexp/regexp-play.pdf](https://sebfisch.github.io/haskell-regexp/regexp-
play.pdf)
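
For a concrete picture of the construction, here's a small sketch in Python
(my own illustration, not the paper's Haskell): number each character
occurrence, compute which positions can start a match, end a match, and
follow one another, and you get an epsilon-free NFA whose states are exactly
those positions.

    # A minimal Glushkov construction (illustrative sketch). AST nodes:
    # ('chr', c, pos) with a unique pos per occurrence, plus ('cat', r, s),
    # ('alt', r, s) and ('star', r).

    def nullable(r):
        tag = r[0]
        if tag == 'chr':  return False
        if tag == 'star': return True
        if tag == 'cat':  return nullable(r[1]) and nullable(r[2])
        return nullable(r[1]) or nullable(r[2])  # alt

    def first(r):
        """Positions that can match the first character."""
        tag = r[0]
        if tag == 'chr':  return {r[2]}
        if tag == 'star': return first(r[1])
        if tag == 'alt':  return first(r[1]) | first(r[2])
        f = first(r[1])  # cat
        return f | first(r[2]) if nullable(r[1]) else f

    def last(r):
        """Positions that can match the final character."""
        tag = r[0]
        if tag == 'chr':  return {r[2]}
        if tag == 'star': return last(r[1])
        if tag == 'alt':  return last(r[1]) | last(r[2])
        l = last(r[2])  # cat
        return l | last(r[1]) if nullable(r[2]) else l

    def follow(r, fol):
        """fol[p] = positions that may come right after position p."""
        tag = r[0]
        if tag == 'cat':
            follow(r[1], fol); follow(r[2], fol)
            for p in last(r[1]):
                fol.setdefault(p, set()).update(first(r[2]))
        elif tag == 'star':
            follow(r[1], fol)
            for p in last(r[1]):
                fol.setdefault(p, set()).update(first(r[1]))
        elif tag == 'alt':
            follow(r[1], fol); follow(r[2], fol)

    def matches(r, chars, s):
        """Simulate the epsilon-free NFA; states are character positions."""
        fol = {}
        follow(r, fol)
        active = set()
        for i, c in enumerate(s):
            src = first(r) if i == 0 else \
                  {q for p in active for q in fol.get(p, ())}
            active = {p for p in src if chars[p] == c}
        return nullable(r) if not s else bool(active & last(r))

    # (a|b)*abb with positions 1..5; chars maps position -> its letter
    chars = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b'}
    r = ('cat', ('cat', ('cat', ('star', ('alt', ('chr', 'a', 1),
         ('chr', 'b', 2))), ('chr', 'a', 3)), ('chr', 'b', 4)),
         ('chr', 'b', 5))
    assert matches(r, chars, 'aabb') and not matches(r, chars, 'ab')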

On the CFG side, I like to think of Earley parsing as analogous to the
Glushkov construction. At the very least they have similar properties.

------
glangdale
Author here, in case anyone wants to question, abuse, argue, etc.

~~~
bratao
Thank you! I'm very excited to test it in my projects (I found a working
Python extension here: [https://github.com/shenfe/python-hyperscan](https://github.com/shenfe/python-hyperscan))

Let me seize the opportunity. I have a problem where I need to match multiple
person names (hundreds of thousands) in huge texts. Aho-Corasick works well
for exact matches. Could Hyperscan work for approximate matches?

~~~
taleinat
Have you tried some existing Python libraries which support fuzzy searching,
such as regex and fuzzysearch?

~~~
bratao
Yes. With both, the complexity grows (linearly or even exponentially) with
the number of patterns to be searched.

~~~
taleinat
Ah. You should likely use a different algorithmic approach.

If you add contact details to your user info I'll be happy to get in touch and
help.

------
hlieberman
This appears to be the same algorithm used by ripgrep
([https://github.com/BurntSushi/ripgrep](https://github.com/BurntSushi/ripgrep))
for searching when SIMD/AVX is enabled. Specifically, it uses the Teddy
algorithm, which is derived from the library this paper describes.

~~~
burntsushi
ripgrep author here. Yes, the Teddy algorithm was originally extracted from
Hyperscan. But this is a teeny tiny piece of Hyperscan. And AFAIK, ripgrep's
implementation (which is actually in the underlying regex library) doesn't
carry over the full Teddy algorithm. Or so I've been told. :-)

~~~
glangdale
Imma keep telling you until you grab my merge logic. :-)

------
teddyh
This seems tailor-made for deep packet inspection.

~~~
j0hnml
Well, it kinda is. AFAIK, that's exactly what this was made for, and
Intel/McAfee was going to create a DPI device using this tech. That never
happened, and now Snort uses/will use Hyperscan for its pattern matcher.

~~~
ianhowson
Hello, fellow Sensory employee!

~~~
glangdale
We are everywhere...

------
pr0tocol_7
Hyperscan is also used for scanning repos at GitHub:
[https://github.blog/2018-10-17-behind-the-scenes-of-github-t...](https://github.blog/2018-10-17-behind-the-scenes-of-github-token-scanning/).
I have a similar project for scanning repos,
[https://github.com/zricethezav/gitleaks](https://github.com/zricethezav/gitleaks),
and plan on adding Hyperscan functionality in a future release. It really
speeds up scans when trying to match many regexes.
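
For anyone curious what that looks like in practice, here's a rough sketch of
multi-pattern scanning using a python-hyperscan-style binding (like the one
linked upthread). Caveat: the API names and signatures here are from memory
and differ between wrappers, so treat them as assumptions and check your
binding's docs; the patterns themselves are made up.

    # Sketch: compile many patterns into one database, then scan once.
    import hyperscan

    patterns = [
        (br'tok_[a-z0-9]{8}',  1),   # hypothetical token format
        (br'key-[A-F0-9]{16}', 2),   # hypothetical key format
    ]
    expressions = [p for p, _ in patterns]
    ids = [i for _, i in patterns]
    # HS_FLAG_SOM_LEFTMOST requests start-of-match offsets, not just ends.
    flags = [hyperscan.HS_FLAG_SOM_LEFTMOST] * len(patterns)

    db = hyperscan.Database()
    db.compile(expressions=expressions, ids=ids,
               elements=len(patterns), flags=flags)

    def on_match(pat_id, start, end, match_flags, context):
        # Invoked for every match; all patterns are scanned in one pass.
        print(f'pattern {pat_id} matched at [{start}:{end}]')

    db.scan(b'blob with tok_abc12345 inside', match_event_handler=on_match)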

------
mncharity
I'm drawn to hairy parsers for implementing DSL-rich syntactically-extensible
programming languages. Hairy, as in, extensible and scoped; full-ish Perl-
compatible RE, including parse-influencing code execution; n-ary multifix
operator precedence; backtrack/operator-precedence/regex sandwiches.
Subengines can be used to variously strength reduce parts of the grammar, and
to dynamically prune search. It's been some years, so I'm out of date...

Does anyone have experience with MNCaRT? [1] I've long wished for a toolkit
for doing grammar analysis and building assemblages of specialized high-
performance engines into relatively performant parsers.

Has there been work on using multiple-pattern RE engines (like RE2::Set or
Hyperscan) in some context similar to language parsing? Or as subengines of a
larger parser?

Thanks.

[1] [https://web.eecs.umich.edu/~weimerw/p/weimer-
mncart.pdf](https://web.eecs.umich.edu/~weimerw/p/weimer-mncart.pdf)
[https://github.com/kevinaangstadt/MNCaRT](https://github.com/kevinaangstadt/MNCaRT)

~~~
glangdale
I like the UVa guys, but MNCaRT isn't very mature and was mostly there to
support the Automata Processor. Since Micron kicked that product out the door
(into a startup that might be charitably described as 'moribund', Natural
Intelligence Semiconductor) there's not much point in starting with that
codebase.

Some customers used Hyperscan as a primary matching engine with a secondary
state machine / parser behind, but I'm not at liberty to discuss specifics. In
any case these guys were interested in network threat detection, not
generalized parsing.

I wouldn't use RE2::Set for language parsing, as it can only tell you that
certain patterns occurred somewhere; it won't give you offsets, just a single
bit per pattern.

The problem with the charmingly described 'hairy parser' is that it will just
be 'one damn thing after another' - a lot of code, a lot of weird semantic
corners, etc. Why not just use a parser that's intended for generality -
something like ANTLR? What will taping all that stuff together accomplish?

We didn't set out to create a huge codebase with weird corners with Hyperscan
_either_, and we still got one. If you go out with the intent of building
something hairy from the start.... :-/

~~~
mncharity
Ah, ok, thanks.

Let's see... I'm coming at this with objectives around programming experience,
rather than around parser implementation. I want to be able to do grammar
design that is tightly tied to problem domain expression, rather than to
parser tech. The usual dance, of adapting a pretty problem-domain grammar to
available parsers, by grammar uglification, transformation, and kludgery...
I'd like that to be optional - a thing of optimization, not of minimum viable
effort. I'd ideally like parser design choices to escape the parser only as
performance variance.

So yes, I'd love a general parser generator, that accepts any grammar, and
ideally does some reasonable best-effort transformation and compilation to
subengines given the mess you've handed it. I've not seen that. But since I'd
not seen RE2::Set, it's perhaps been a decade since I seriously looked around,
so maybe there's new niftiness?

I've tried ANTLR a couple of times over the years, at least for toys, maybe a
no-templates C++ parser. (It was weird - I never figured it out, and it's not
happened with anything else, but I just viscerally disliked working with
it.) Others... None were without sacrifices that required them being wrapped
in hair.

So I dreamed of someone creating a parser generator toolkit, a library of
engines, reusable rather than inextricably tangled in yet another parser silo.
I saw Hyperscan and thought, hmm... might a toolkit come with a regexp API? :)

A rich parser doesn't have to be complex, if you sacrifice speed. A Perl 5
compatible (some old version) regexp engine can be done in a page or two of
Prolog. So I used to focus on expressivity, and then scramble to get back to
minimally tolerable performance.

Minimum-viable compiler performance is arguably dropping dramatically now,
with cloud-parallel deterministic compilation and community-scale caching. So
maybe something simple could now have viable pragmatics.

I saw Hyperscan, and was hit by old dreams of speed. Multiple wizzy subengines
woven together. I'd woven in GLRs before, but multiple patterns... oooh, what
leverage might be found there?! :)

> What will taping all that stuff together accomplish?

Extensible and scoped parsing... the _immensely_ expensive Python 2 to 3
transition was in part a design choice to avoid scoped method dispatch and
file-scoped parsing of both languages. I suggest it was the wrong call.

PCRE with parse-affecting code... say rather, each time you lose a feature,
some set of problems gets harder. I'd prefer that to be the harder-but-slower
of falling back to a less-specialized engine (which sometimes isn't a
problem), rather than the harder-go-back-and-rewrite of nonimplementation
(which always is). Grammar restriction as premature optimization.

N-ary multifix operators... let you easily parse expressions of rich
operators, from Smalltalk to math. Modern IDEs seem sufficient to address
puzzlement over "how and why did this (not) parse".
A "sandwich" of regexp for tokens, operator precedence parser for extensible
expressions, and something backtracking for an extensible list of statements,
is one way to get a somewhat traditional language parser that's more nicely
extensible and expressive.
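
As a toy illustration of the middle layer of that sandwich (my own sketch,
nothing to do with Hyperscan): a precedence-climbing expression parser whose
operator table is plain data, so the expression grammar can be extended at
runtime without touching the parser itself.

    # Toy extensible operator-precedence (precedence-climbing) parser.
    # The expression grammar lives in OPS; registering an operator
    # extends the language at runtime.
    import re

    OPS = {}  # symbol -> (binding power, right-associative?)

    def register(symbol, power, right_assoc=False):
        OPS[symbol] = (power, right_assoc)

    def tokenize(src):
        # '**' is listed explicitly for brevity; a real system would
        # derive the token set from OPS.
        return re.findall(r'\d+|[A-Za-z_]\w*|\*\*|\S', src)

    def parse(tokens, min_power=0):
        """Parse an expression whose operators bind at least min_power."""
        tok = tokens.pop(0)
        if tok == '(':
            left = parse(tokens, 0)
            tokens.pop(0)  # discard ')' (assumes well-formed input)
        else:
            left = tok
        while tokens and tokens[0] in OPS:
            power, right = OPS[tokens[0]]
            if power < min_power:
                break
            op = tokens.pop(0)
            # Right-assoc operators allow equal power on their right side.
            rhs = parse(tokens, power if right else power + 1)
            left = (op, left, rhs)
        return left

    register('+', 10)
    register('*', 20)
    print(parse(tokenize('1 + 2 * 3')))    # ('+', '1', ('*', '2', '3'))

    register('**', 30, right_assoc=True)   # extend the grammar at runtime
    print(parse(tokenize('2 ** 3 ** 2')))  # ('**', '2', ('**', '3', '2'))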

> a lot of weird semantic corners

Yes, but... I'd like the choice to avoid semantic weirdness, or not, to happen
at the application level, not at the parser API level. Because it's a
tradeoff. I accept the PEG argument that simplicity and composability are
often the right design choice, worth the expressive cost. But not the argument
that anything PEGs can't handle is a "legacy language" which people shouldn't
be using anyway (a fun conversation with a PEG person).

So yes, I'd like to see programming language parsing become far richer than
currently, and thus somewhat hairier. Because avoiding that is, I suggest,
inflicting far greater costs elsewhere.

------
classichasclass
"Hyperscan Things That Didn’t Make It Into Open Source Hyperscan ... Ports to
non-x86 platforms."

This is a big shame, and undercuts the title. Is there any way to release the
older stuff so that interested folks can work on it to bring the other arches
up to parity? I'm a Power ISA bigot in particular but not having an ARM port
seems like a big gap.

~~~
glangdale
I suppose "Some Modern CPUs" was too long-winded a title?

As I said, it doesn't take a genius to understand Intel's motivations. They
_bought_ the project, after all.

Not being @ Intel anymore, I don't have access to the older stuff, and even if
I did, the codebase has diverged significantly since.

Without going on too much of a tirade - the experience of developing for all
those platforms _really sucked_. Almost all the non-x86 platforms had
significant bugs in their toolchains. One of the MIPS variants (particular
architecture elided to spare the guilty) had bugs in their gcc intrinsics in a
way that suggested that no-one had ever done any significant third-party dev
on the platform.

Big-endian was also a huge PITA.

It was a ton of work to keep all those systems alive, and our machine rack
looked like a zoo of dev boards and weirdo devices.

In the "ure3" system I mention, I would make retargetability/portability to
other systems a first-class goal.

One way of achieving this is not having such a huge profusion of methods and
complexity. Hyperscan is over-engineered for many use cases if you aren't a
network company looking to scan 5,000 complex regexes in streaming mode at
hopefully maximal performance.

------
fanf2
I’m curious how Hyperscan compares to the venerable re2c, which is used for
compiling SpamAssassin’s rules.

~~~
burntsushi
glangdale can probably say more, but I think the high level answer is pretty
easy: re2c is intensely focused on building state machines, in their entirety,
ahead of time. There's also some cool stuff that lets you integrate the state
machine with the rest of your code. AFAIK, re2c sticks fairly strictly to
state machines and doesn't do sophisticated optimizations like Hyperscan,
particularly with respect to literals. To me, they are fundamentally solving
different problems. re2c is only viable if you can actually deal with the full
size of the DFAs you build.

re2c is also notable for, AIUI, having a very principled solution to the
problem of submatch extraction using tagged DFAs. They wrote a paper about it:
[http://re2c.org/2017_trofimovich_tagged_deterministic_finite...](http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf)

~~~
glangdale
For what it's worth at this late stage:

Yes, this is an accurate summary. re2c works when it works, and there's
clearly a good niche for "pattern matching stuff that determinizes cleanly".
However, in the general case, DFAs catch fire (combinatorial explosion in # of
states), especially when handling multiple regular expressions.
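
A self-contained way to watch that happen (my example, not from the thread):
the regex (a|b)*a(a|b){k} forces any DFA to remember which of the last k+1
characters were 'a', so subset construction yields 2^(k+1) states.

    # Why DFAs "catch fire": determinizing the small NFA for
    # (a|b)*a(a|b){k} produces 2^(k+1) reachable DFA states.

    def dfa_size(k):
        # NFA: state 0 loops on a/b (and also moves to 1 on 'a');
        # state i>0 means the char read i steps ago was 'a';
        # state k+1 is the accepting state.
        def step(states, c):
            out = set()
            for s in states:
                if s == 0:
                    out.add(0)
                    if c == 'a':
                        out.add(1)
                elif s <= k:
                    out.add(s + 1)
            return frozenset(out)

        start = frozenset({0})
        seen, todo = {start}, [start]
        while todo:  # subset construction, counting reachable DFA states
            cur = todo.pop()
            for c in 'ab':
                nxt = step(cur, c)
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return len(seen)

    for k in range(2, 12):
        print(k, dfa_size(k))  # prints 8, 16, 32, ... = 2^(k+1)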

IIRC SpamAssassin has lots of quite hard patterns and re2c can only handle a
subset. I forget what the fallback position is (libpcre?).

