
Hyperscan, a high-performance multiple regex matching library - jsnell
https://01.org/hyperscan
======
jonstewart
Neat! There's also Lightgrep.

[https://github.com/LightboxTech/liblightgrep](https://github.com/LightboxTech/liblightgrep)

Lightgrep also supports a subset of PCRE and scales pretty well with the
number of patterns (has done millions of fixed strings). Lightgrep is designed
for digital forensics, so it searches data as a binary stream. Lightgrep also
supports searching for patterns in multiple encodings, with no requirements on
the underlying data. Lightgrep is GPL.

(I'm one of the authors of Lightgrep.)

~~~
mebrahim
Does lightgrep return which one(s) of the millions of patterns matched the
input?

------
trishume
One application this could work really well for is syntax highlighting.
Textmate/Sublime syntax files are basically lists of hundreds of regexes that
have to be matched very quickly over the same content.

With this library it might be possible for an open source editor to beat
Sublime Text at highlighting using the same grammars. That's saying something,
since Sublime is crazy fast and its highlighter is even parallelized.

~~~
glangdale
Hyperscan is likely a bit heavyweight for this application. Anything that only
needs to keep up with user input for a small number of patterns would probably
be better off with a lightweight solution with wider pattern support. Although
we would be interested in seeing the integration...

------
rch
The docs state that it's comparable to RE2, and others, but I don't see
benchmarks yet. Is it fair to assume it's faster, or is there a tradeoff to
emphasize the streaming features?

~~~
jonstewart
RE2 isn't a multipattern engine. That's the big difference. You can join a
bunch of patterns together with | in RE2, but alternation has a precedence, so
it will only report the first alternative to hit and won't tell you which of
your original patterns it was. Also, RE2 hits a wall when you join lots of
patterns together like this (depending on the patterns, of course). Comparing
multipattern and single pattern engines is a bit apples and oranges.
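The precedence point is easy to see with any ordered-alternation engine.
Here's a sketch using Python's `re` as a stand-in (RE2's default semantics are
also leftmost-first, though this isn't RE2 itself):

```python
import re

# Joining two "patterns" with '|': the engine reports a single match,
# and alternation order decides which one wins at a given position.
combined = re.compile(r"foo|foobar")

m = combined.search("foobar")
# The earlier alternative wins: only "foo" is reported, even though
# "foobar" also matches here. A true multipattern engine would report
# both pattern ids at their respective end offsets.
assert m.group(0) == "foo"
```

A multipattern engine instead keeps the patterns distinct and reports an id
per match, rather than collapsing everything into one alternation.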

I haven't run HyperScan but it looks like it should be very fast. It greatly
depends on the patterns selected. It also looks like HyperScan has
pragmatically sacrificed some aspects of PCRE compatibility/correctness wrt/
matching in order to achieve performance.

~~~
glangdale
We sacrifice features and compatibility. We have sacrificed zero correctness
as far as we are aware. We maintain wart-for-wart compatibility with libpcre
in order to have a good basis for comparison for fuzzing and we believe we
100% recognize libpcre constructs even when we don't support them.

Our semantics differ, but roughly speaking, we provide the set of matches that
libpcre would if you hook up a callout to libpcre that rejects each match and
sends libpcre back to look for another match (our ordering will be different,
but is also well defined).
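That "reject each match and keep looking" semantics can be sketched by brute
force: report every offset at which some match ends. This is an illustrative
quadratic rescan in Python, not how Hyperscan does it internally:

```python
import re

def match_end_offsets(pattern, text):
    """Report every offset at which some match of `pattern` ends,
    regardless of greediness -- a brute-force sketch of the
    'all matches, by end offset' semantics described above."""
    rx = re.compile(pattern)
    ends = []
    for end in range(1, len(text) + 1):
        # A match ends at `end` iff some substring text[start:end]
        # matches the pattern in full.
        if any(rx.fullmatch(text, start, end) for start in range(end)):
            ends.append(end)
    return ends

print(match_end_offsets(r"a+", "aaab"))  # [1, 2, 3]
```

A backtracking engine run normally would report only one of these; the
callout-and-reject loop (or an automaton reporting accept states as it scans)
yields all of them.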

~~~
jonstewart
If a pattern uses quantification and the match extends successfully, won't two
matches be reported (instead of one)? That's what I mean--not that it's buggy.
As I'm sure you know quite well, deciding when matches can be reported is
quite tricky, especially with Perl's ordered alternation rules.

A few questions:

\- What kind of overhead is actually implied by using start of matches in
streaming mode?

\- Is the backend a machine code JIT of the automaton, an interpreter, a
table-driven search, or...?

\- Why not lazy quantifiers? They're almost always preferable to greedy
quantification in a streaming scenario.

cheers, Jon

~~~
glangdale
Sorry - might have sounded a bit defensive there. We've seen a few systems
that have a tendency to sweep correctness under the carpet with shortcuts,
especially when you hit nasties like "large bounded repeats in streaming
mode".

Start of Match (SoM) in streaming mode is a real bear for some patterns. It
really depends on the pattern - some are free, others (that we would otherwise
support) are very expensive or impossible. Our SoM is fully general (i.e. it
doesn't time out or give up after N bytes or K stream writes) so the worst
case means tracking lots of 64-bit offsets through our automata.

The backend is a big pile of code that coordinates the efforts of many
different automata and subsystems, most of which are a combination of a C
implementation plus a big table of some sort to implement NFA/DFA/literal
matcher/whatever. This isn't something of which we are particularly proud. We
would like to make this a lot simpler and more interpreter-based.

A JIT would be really cool but we had some reluctance in our user base to run
arbitrary code. I would be fascinated to support a JIT project even if it's
not on our mainline.

I respectfully differ on lazy vs greedy. In streaming mode, both require
information that's expensive to maintain and imply that you can 'predict the
future' (especially once you put a complex pattern _after_ the lazy
quantifier). We've been using "automata semantics" for a long time and this is
simple and easy to keep track of in streaming mode.

Capturing semantics and lazy/greedy simulation are a work in progress for us.
We used to have more work in this area but we didn't have a strong demand for
it. We do have some IP in this that isn't in Hyperscan - I may blog about this
soon. We did have a subsystem that got 100% compatibility for some patterns
("100% right, 50% of the time"?) in block mode for libpcre that covered
quantifier semantics and capturing subexpressions.

~~~
jonstewart
\- With a PCRE/RE2-like NFA implementation, where you have green threads
representing the states in the automata, you can just store the start of match
in the thread. You do have to be careful then about not killing some threads
off. If HyperScan is a DFA, though, then SoM is indeed hard.
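The per-thread start-of-match idea can be sketched in a few lines. This is a
toy Pike-VM-style loop over a concatenation-only "pattern" (a list of
per-state predicates, which is an illustrative encoding, not RE2's actual
bytecode); each thread is just a (state, start_offset) pair:

```python
def pike_style_search(text, steps):
    """Toy breadth-first NFA simulation where each 'thread' carries
    its start-of-match offset. `steps` is one predicate per state,
    matched in sequence. Returns (start, end) pairs."""
    matches = []
    threads = []
    for i, ch in enumerate(text):
        threads.append((0, i))  # spawn a fresh attempt at every offset
        next_threads = []
        for state, start in threads:
            if steps[state](ch):
                if state + 1 == len(steps):
                    matches.append((start, i + 1))  # SoM comes for free
                else:
                    next_threads.append((state + 1, start))
        threads = next_threads
    return matches

# 'ab' as two predicates; both occurrences come back with their starts
print(pike_style_search("xabab", [lambda c: c == 'a', lambda c: c == 'b']))
# [(1, 3), (3, 5)]
```

In a DFA the threads are collapsed into one state, so the start offsets have
to be tracked separately, which is where the cost comes from.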

\- Redgrep is a cool project by Paul Wankadia that uses LLVM to JIT regexp
automata into machine code.
[https://github.com/google/redgrep](https://github.com/google/redgrep). The
two things that are tough about JITting very large automata, though, are
breaking the code up into functions (a compiler can't deal with one gigantor
block) and all that entails, and the fact that a machine code representation
of a large automaton might be many times bigger than a more traditional
representation, so what you gain in decode time you lose to L3 cache misses.

\- I don't grok what you mean about predicting the future wrt/ lazy/greedy
quantification. With lazy quantification, you can stop producing matches once
you've gotten the first match (or once you've moved past the quantification
iff the subsequent subpattern and the quantification subpattern are disjoint).
Looking for "<html>.+</html>" in a large stream is very bad, whereas looking
for "<html>.+?</html>" is almost certainly what's desired.
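The greedy/lazy difference on that example is stark in block mode (Python's
`re` here, purely as illustration):

```python
import re

text = "<html>one</html> junk <html>two</html>"

# Greedy: .+ runs to the end and backtracks to the LAST </html>,
# swallowing everything in between.
greedy = re.search(r"<html>.+</html>", text).group(0)
assert greedy == text  # the whole string, both documents at once

# Lazy: .+? stops at the first closing tag.
lazy = re.search(r"<html>.+?</html>", text).group(0)
assert lazy == "<html>one</html>"
```

In a stream, the greedy version can't even commit to an end offset until the
input is exhausted, which is the point being made above.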

\- Capturing in a streaming mode just doesn't seem worth it.

\- Large spans of bounded repetition are indeed the devil. It should be
possible to replace states with a counter, but I think it requires dynamic
mallocs (though I haven't worked on it yet).
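The counter-for-states idea can be sketched for the degenerate
single-character case. This is a hypothetical illustration (a saturating
counter standing in for `hi` duplicated NFA states for something like
`a{3,5}`); real patterns with distinguishable entry events are much harder,
which is what makes bounded repeats "the devil":

```python
def run_counted_repeat(text, lo=3, hi=5):
    """Sketch: track 'a{lo,hi}' with one saturating counter instead
    of hi duplicated states. Returns end offsets of matches.
    Only valid for this trivial single-character body."""
    ends = []
    count = 0  # one counter replaces up to `hi` NFA states
    for i, ch in enumerate(text):
        if ch == 'a':
            count = min(count + 1, hi)  # saturate at the upper bound
            if count >= lo:
                ends.append(i + 1)  # a match of length in [lo, hi] ends here
        else:
            count = 0
    return ends

print(run_counted_repeat("aaaaaa"))  # [3, 4, 5, 6]
```

The saturation trick works here because once the run is at least `hi` long,
some window of legal length always ends at the current offset; with a
non-trivial repeated subpattern you'd need a counter per live entry point,
which is where the dynamic allocation pressure comes from.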

~~~
jonstewart
re: predicting the future/streaming/"stitching", how do you need to remember
more state for quantification beyond the state that needs to be maintained to
support streaming at all? Why do you need to make a decision when you've
reached the current block--can't that decision wait until you've seen more
data or been told there is no more data?

\--

...I can see that if you're using bit vectors to represent active states,
that's how you can make use of SSE4/string instructions and POPCNT... I've
been thinking about how to do that but it just doesn't seem to work with a VM
implementation. Maybe for a hybrid filtration scheme, though...

\--

Have you implemented Watson's MultiRE with input skipping, or Wu-Manber? I've
tried some hacky approaches to skipping before and the problem seems to be
that a sufficiently large pattern set will always include a short pattern
(e.g., "PE") and now you're stuck with a bad l-min.
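The l-min problem is just arithmetic: skip distances in Wu-Manber-style
schemes are bounded by the shortest pattern. A hypothetical sketch of the
bound (the pattern strings and the 2-character block size are illustrative
assumptions):

```python
def wu_manber_shift_limit(patterns, block=2):
    """Sketch of the l-min bound in Wu-Manber-style skipping: shifts
    are computed over `block`-char blocks within the first lmin chars
    of each pattern, so the best possible shift is lmin - block + 1.
    One short pattern drags every shift down toward 1."""
    lmin = min(len(p) for p in patterns)
    return max(lmin - block + 1, 1)

# A healthy pattern set skips nicely...
assert wu_manber_shift_limit(["evil.exe", "backdoor", "rootkit"]) == 6
# ...until one two-byte pattern like "PE" ruins it for everyone.
assert wu_manber_shift_limit(["evil.exe", "backdoor", "PE"]) == 1
```

With shift 1 you've paid Wu-Manber's table overhead for no skipping at all,
which is the failure mode described above.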

\--

Ordering is horrible, both between patterns and within patterns (Perl's
ordered alternation rules). It took us a couple years to get full
compatibility with PCRE (for forensics, correctness is more important than
line speed searching; and most of our input comes from spinning rust, not
ephemeral packets). We had to sacrifice the strict guarantee on O(M)/static
memory usage during searching in order to gain full compatibility.
Fortunately, branch prediction and reserving memory pretty much remove that
cost, so that with normal pattern sets we don't see mallocs.

Streaming matching is criminally understudied from a theoretical perspective.
You cannot match some patterns in O(M) memory in a streaming model the way an
NFA algorithm should. The one class we know about is when a lower priority
alternative is a prefix of a higher priority one: you have to buffer the lower
priority match until the higher priority match resolves. E.g., "a+b|a" on a
string of a's is pathological. These kinds of patterns can be detected; what I
don't know is whether there are other classes (I don't think so, _but_...).
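The "a+b|a" case can be seen concretely in block mode, where the engine gets
to see the whole input at once (Python's `re` for illustration):

```python
import re

rx = re.compile(r"a+b|a")

# With no 'b' anywhere, the lower-priority alternative wins:
assert rx.match("aaaa").group(0) == "a"

# But if a 'b' arrives -- possibly megabytes later in a stream --
# the higher-priority 'a+b' takes over and the early 'a' match must
# never have been reported. A streaming matcher therefore has to
# buffer the low-priority match for an unbounded amount of input.
assert rx.match("aaaab").group(0) == "aaaab"
```

In block mode this is free; in streaming mode the decision about byte 0 stays
open until end-of-stream, which is exactly the O(M)-memory violation.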

\--

I've always found the DPI use case fascinating, as it is a completely
different input model than what I've worked on, which has its own rules, and
yet has many of the same demanding theoretical requirements in its treatment
of automata.

~~~
glangdale
I think we'll need to take this discussion offline, and we _will_ be posting
more details about how Hyperscan is implemented to our blog over the next few
weeks.

Our different implementations clearly start from different emphases; some of
our design decisions will probably become clearer as we post about our
implementation and the hard constraints in DPI.

