
Multiple regex performance shootout: RE2 vs. Intel's Hyperscan - glangdale
https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scanning-hyperscan-and-re2set
======
glangdale
This post is by no means the definitive statement on RE2 and Hyperscan
performance. We went very deep on subsets of one measurement case, but
different patterns would yield a different set of behaviors for both engines.

Single-pattern performance has been covered before: e.g. [https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/](https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/)

~~~
Aissen
What's fascinating is that it shows here that Hyperscan does not find the
same matches (!) as the other engines. A piece of information conveniently
forgotten in the Intel piece.

Also note that the rust regex crate now uses the same SIMD algorithm (Teddy)
invented as part of Hyperscan, see:
[http://blog.burntsushi.net/ripgrep/](http://blog.burntsushi.net/ripgrep/)
(also ripgrep is awesome, and you should use it.)

~~~
glangdale
This is a surprisingly negative-sounding comment. We've been quite clear in
our documentation and public language in the past about how 'all-matches'
semantics differ from the 'best match' semantics implemented by back-tracking
engines, so we generally find a few more matches.
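
To make the difference concrete, here's a toy Python sketch of the two semantics (my own illustration, not Hyperscan's actual API): a backtracking engine reports one "best" match, while an all-matches engine reports every end offset at which a pattern matches.

```python
import re

# Ordered alternation in a backtracking engine: the first
# alternative wins, so only one match is ever reported.
m = re.match(r"foo|foobar", "foobar")
assert m.group() == "foo"

def all_match_ends(pattern, text):
    """Toy all-matches semantics: every end offset at which the
    pattern matches starting from offset 0."""
    rx = re.compile(pattern)
    return [end for end in range(len(text) + 1)
            if rx.fullmatch(text, 0, end)]

# An all-matches engine reports both the "foo" and "foobar" matches:
assert all_match_ends(r"foo(bar)?", "foobar") == [3, 6]
```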

Given the comparison with RE2 here is under circumstances where RE2::Set is
not returning _any offset at all_, I'm not sure what your point is about the
"Intel piece".

Ripgrep is also awesome, and thanks (I guess) for mentioning Teddy, which
nicely illustrates my favorite instruction (PSHUFB). There are still some bits
of the Teddy algorithm that should be incorporated into ripgrep (the merging
of literals when you have >8 literals is naive).

~~~
Aissen
It is negative, I should probably have toned it down even more. My mistake,
sorry.

------
netheril96
It is a pity that regex engines based on regex theory have only just come
into widespread use when the theory has been in place for decades. Why so
many language standard libraries use an exponential algorithm is beyond my
comprehension.

~~~
glangdale
You can do things with these backtracking algorithms that are difficult
(capturing subexpressions, start of match) or impossible (backreferences,
arbitrary lookarounds) to do efficiently in an automata based algorithm.
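
A backreference is the canonical example: the language it matches isn't regular, so no finite automaton can track it. A tiny illustration in Python (whose re module is a backtracker):

```python
import re

# \1 must repeat whatever group 1 captured -- not a regular
# language, so automata-based engines like RE2 and Hyperscan
# reject the pattern, while a backtracker handles it directly.
doubled_word = re.compile(r"\b(\w+) \1\b")
m = doubled_word.search("it was the the best of times")
assert m.group() == "the the"
```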

Sometimes users want these features more than they want theoretical purity.
Who are we to tell them they are wrong?

That said, we do have some projects in mind to try to bridge these two worlds
(feature-rich backtracking world vs speedy/streamable automata-based). Anyone
interested should contact the Hyperscan team. Make a nice project at
undergraduate or even postgraduate level depending on scale.

~~~
vitus
> impossible (backreferences, arbitrary lookarounds) to do efficiently in an
> automata based algorithm.

But... backreferences are impossible [0] to do efficiently [1] in any
algorithm!

[0] Well, assuming P != NP.

[1] Polynomial-time.

See: [http://perl.plover.com/NPC/](http://perl.plover.com/NPC/)

~~~
glangdale
I don't really like this result. It's a bit like saying finding the maximum
integer in a sorted list is actually order N·N because... what if the N ints
are all BIGNUMs with N bits, huh?

The fact that you have to grow the _regex_ to get this result has a similar
vibe to it. All the regex matching we see - and most conventional regex
matching - assumes the regular expressions are fixed. Every now and then
you'll see an algorithm where the size of the regex contributes to the big-O
notation, but then it's usually broken out as a parameter.

I don't have a good method to actually _do_ a fixed number of back-references
in less than exponential time, but it seems fairly clear that if you are
willing to burn polynomial time gratuitously you should be able to handle a
back-reference (i.e. for input of length N and M back-references, there are
only O(N^(2M)) possible final places where those back-references could match
for a single match... huge, but if M is fixed, not exponential). So if you
had an otherwise poly-time regex algorithm you could run it over and over
again, trying each possible back-ref location, and still be poly-time.
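
A toy sketch of that enumeration argument (my own illustration, nothing from Hyperscan): for a pattern like (a*)b\1, try every candidate capture for group 1 and run a plain, backreference-free match each time.

```python
import re

def match_with_one_backref(text):
    # Try each of the O(N) possible captures for group 1 in
    # (a*)b\1, splicing the candidate in literally so the inner
    # match needs no backreference. With M backrefs this becomes
    # O(N^(2M)) trials -- huge, but polynomial for fixed M.
    for k in range(len(text) + 1):
        cap = "a" * k
        plain = re.escape(cap) + "b" + re.escape(cap)
        if re.fullmatch(plain, text):
            return True
    return False

assert match_with_one_backref("aabaa")       # group 1 = "aa"
assert not match_with_one_backref("aabaaa")  # no consistent capture
```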

~~~
vitus
> It's a bit like saying finding the maximum integer in a sorted list is
> actually order N·N because... what if the N ints are all BIGNUMs with N
> bits, huh?

I'm not sure I follow. Why would finding the max integer in a sorted list be
anything other than the time to copy the last (or first, depending on sort
order) element of the list?

Certainly a valid point, though, that this isn't a common use case, so that's
just not how experts in this field think about this problem.

> there are only O(N^(2M)) possible final places where those back-references
> could match for a single match... huge, but if M is fixed, not exponential).

Sure, that buys you a pseudo-polynomial algorithm. I just don't like the idea
of calling something like that polynomial -- if we did that across the board,
then we'd have a poly solution for knapsack (and consequently demonstrate P =
NP).

~~~
barrkel
Because comparing N bits takes time proportional to N.

------
rurban
The summary is very easy:

* Hyperscan is by far the fastest, but has only limited platform and regex support.

* Next is pcre2 (with the jit), which is a bit faster than the new rust regex. These support all options, everywhere.

* Then RE2 and all the others. RE2 has very limited regex support.

Why they compare the fastest with one of the slower and more limited ones is
beyond my understanding. The best overview is still [https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/](https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/)

~~~
glangdale
This summary was "very easy", because it is wrong. This assessment is unfair
to RE2, and we're comparing to RE2 because it's the only one of these that can
do multiple regex. Both Hyperscan and RE2 throw a good deal of regex under the
bus in order to get multiple regex support, performance and/or streaming
capability.

~~~
burntsushi
Rust's regex library has had support for regex sets for quite some time, and
it's based on the same "tricks" that RE2 uses:
[https://docs.rs/regex/0.2.2/regex/struct.RegexSet.html](https://docs.rs/regex/0.2.2/regex/struct.RegexSet.html)

~~~
glangdale
Oops. We're out of date in our thinking then. If the implementation is similar
to re2, you'll probably see similar performance, although I hear good things
about your small group literal matcher. :-)

~~~
burntsushi
:P Yeah, I would expect performance characteristics to be very similar to RE2!
I actually think RegexSet disables literal optimizations in more cases than
the standard regex, but it's been a while since I've touched that.

------
strictfp
Limiting the benchmark to engines supporting matching sets of regexes seems
unnecessarily narrow to me. Is matching a set of regexes not equivalent to
matching a single combined regex? Something like "(?:a)|(?:b)|(?:c)"? If so,
why not run the test with that regex instead, and include other engines?

~~~
glangdale
For engines like Hyperscan and RE2's underlying algorithms, alternation works
the way you describe. However, for backtracking engines, alternation does
_not_ do a search in parallel for all the patterns. Instead, it winds up first
looking for the first, then the second, etc., finds the first match, and
declares victory. There are particular patterns and inputs where this behavior
will be similar to parallel matchers like RE2::Set and Hyperscan, but the
general case just won't work.
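
For what it's worth, here's a naive Python sketch of emulating a set interface over a combined alternation (my own toy, not RE2's implementation). It reports which patterns matched somewhere in the input, but it still inherits the backtracker's first-alternative-wins behavior for overlapping matches at the same position:

```python
import re

def which_patterns(patterns, text):
    """Report the indices of all patterns with at least one match,
    using one capture group per pattern in a combined alternation.
    (Naive: assumes the patterns contain no groups of their own.)"""
    combined = re.compile("|".join(f"({p})" for p in patterns))
    hits = set()
    for m in combined.finditer(text):
        hits.update(i for i in range(len(patterns))
                    if m.group(i + 1) is not None)
    return sorted(hits)

assert which_patterns(["foo", "ba+r"], "xx foo yy baar") == [0, 1]
```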

------
VHRanger
How do both of those compare to the C++11 std::regex or Boost::regex? At what
point is it worth the switch?

~~~
glangdale
If you really need the backtracking features in those other systems, the point
is "never".

Similarly, if you have short-lived regexes that are compiled, used for a small
amount of data, and discarded, you might never see a performance benefit.

Multiple regexes, scanning a substantial amount of data and/or having a
requirement to 'stream' (i.e. process successive writes of input data when you
can't hold on to old data) are the sort of things that make Hyperscan use more
sensible. We see a lot of use in network security where these assumptions all
generally hold.

~~~
Blackthorn
Also, if you're taking user input into a regex, you want to use something
like RE2 or Hyperscan. To do otherwise is to expose yourself to DoS.

~~~
anentropic
is that just because they don't support the features which can be used to make
pathological regexes in other engines?

or they have additional sanitisation or security features?

~~~
glangdale
Neither. There are regexes that can be written in the common subset supported
by (say) libpcre, RE2 and Hyperscan that will induce exponential backtracking
with libpcre but not with the other libraries.
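
The textbook shape of such a pattern (a generic example, not one from the article) is a nested quantifier with a forced failure:

```python
import re

# A backtracker explores every way of splitting the run of a's
# between the two +'s -- exponential in the input length -- while
# an automata-based engine decides the same pattern in linear time.
evil = re.compile(r"(a+)+$")

# No match, but only after roughly 2**19 backtracking attempts;
# each additional "a" roughly doubles the running time.
assert evil.match("a" * 20 + "b") is None
```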

I'm not aware of any difference in terms of sanitisation or security.

------
bterlson
Wow, oddly relevant - I've been working on a RegExp implementation in
JavaScript using similar techniques (minus vectorization of course, RIP
SIMD.js). There is no efficient way in JavaScript to replace many patterns
(e.g. a list of sentence fragments with optional whitespace, capitalization,
pluralization, etc.) against a large document where the replacement requires
metadata about the subpattern that was matched (e.g. to replace a term with a
link to the term's definition).
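
A single-pass combined alternation with a replacement callback gets partway there; here's a toy Python sketch of the idea (the terms and URLs are invented for illustration):

```python
import re

# Map each term to the metadata its replacement needs -- here,
# a (made-up) URL for the term's definition.
definitions = {
    "regex": "/glossary/regex",
    "automaton": "/glossary/automaton",
}
combined = re.compile("|".join(re.escape(t) for t in definitions))

def linkify(text):
    # One pass over the document; the callback looks up which
    # term matched in order to build its replacement.
    return combined.sub(
        lambda m: f'<a href="{definitions[m.group(0)]}">{m.group(0)}</a>',
        text)

assert linkify("a regex is fun") == 'a <a href="/glossary/regex">regex</a> is fun'
```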

I wonder if I should switch gears and try to write a native Node module for
this instead. This looks really great.

~~~
glangdale
Thanks for the kind words. You'll have to roll your own 'replace', and the
all-matches semantics of Hyperscan may give you some headaches - you'll need a
careful reading of our Start of Match semantics and to craft patterns that
yield "less surprising results".

If you can share your workload, or want assistance in figuring out whether to
do this and/or how to do it, please contact us on the mailing list or on the
email link for Hyperscan on 01.org. We're always interested in sample
patterns, especially ones that fall outside our normal wheelhouse of 'network
security and then some more network security'.

------
phunge
If you're interested in getting parse trees from regexes (i.e. all the
submatch locations), check out Kleenex; it looks really interesting and is
based on some new research. Full parse trees are a harder problem than
matching, and one that is not a primary focus of RE2 performance, AFAICS.

~~~
nly
Isn't it true that once you have backtracking in your regex engine, there's
no longer a guarantee that there's only a single parse tree?

I've found the hardest part of using e.g. a recursive-descent parser is
having to correct captured state while backtracking.

------
jmnicolas
Maybe I'm not normal, but I'd rather write 30 lines of readable and
debuggable code than use a regex.

Would you embed Brainfuck in your code? Then why do you use regexes?

~~~
Xophmeister
How do you match arbitrarily repeating patterns, which probably have a
combinatoric expansion, without effectively writing your own parser? If
you're using a language where writing a parser is relatively easy (e.g.,
Lisp, Haskell, etc.) or your patterns are trivial, then fair enough.
Otherwise, good luck with that.

For example, in Python, if I want to match "foo" followed by any number of
"bar"s and/or "quux"s at the end of the string, I could write this:

    
    
        if re.match(r"foo(bar|quux)*$", my_string):
            ...
    

...without going to the trouble of writing a recursive-descent parser or
state machine. Your way would be something like this:

    
    
        def ends_with_foo_etc(s):
            # equivalent to re.match(r"foo(bar|quux)*$", s): the string
            # must start with "foo", and everything after it must be a
            # run of "bar"s and/or "quux"s reaching the end
            if not s.startswith("foo"):
                return False

            after_last = 3
            while after_last < len(s):
                if s[after_last:after_last + 3] == "bar":
                    after_last += 3
                elif s[after_last:after_last + 4] == "quux":
                    after_last += 4
                else:
                    return False

            return True

        if ends_with_foo_etc(my_string):
            ...
    

That's excessively verbose -- for a really simple pattern -- almost
certainly slower and, most importantly, brittle, in that it couples your
pattern to your implementation in a non-trivial way. If your pattern changes,
you may have to completely rewrite your matching function.

It's also worth noting that a typical regex API will do more than just match
patterns; e.g., extracting subpatterns/groups, splitting, etc.
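
For instance, with Python's re module:

```python
import re

# Extracting subgroups:
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2017-03-28")
assert m.groups() == ("2017", "03", "28")

# Splitting on a pattern rather than a fixed string:
assert re.split(r"[,;]\s*", "a, b;c") == ["a", "b", "c"]
```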

~~~
kazinator
Your implementation seems complicated. The regular expression "foo(bar|quux)*"
denotes a set of strings: { "foo", "foobar", "fooquux", "foobarbar",
"foobarquux", "fooquuxbar", "fooquuxquux", ... }

Since this is anchored at both ends, we can look at it from right to left:
it's any mixture, including empty, of "quux" or "bar" elements, preceded by
"foo" at the very start. From this, a very simple algorithm pops out:

    
    
       while (string is not exactly "foo") {
         if (string ends with "quux")
           chop off "quux" from end of string
         else if (string ends with "bar")
           chop off "bar" from end of string
         else
           return "no match"
       }
    
       return "match"

~~~
Xophmeister
My implementation was the first thing that popped into my head. Yours works
too (although note that yours will probably reallocate memory every time it
chops the end off the string, so it'll be slower). A RD parser or finite state
machine would also work. My point was rather that regular expressions are a
much better tool for this type of task (see the reasons given at the end of my
previous comment).

Writing fewer lines of code to do the same job is usually a good idea,
provided it remains understandable. Personally, I find the syntax of regular
expressions pretty straightforward, and a lot of languages even allow you to
write comments within them to make them even easier to follow.
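
Python's re.VERBOSE flag is one example of that commenting facility:

```python
import re

# re.VERBOSE lets whitespace and comments live inside the pattern:
pattern = re.compile(r"""
    foo           # literal prefix
    (bar|quux)*   # any number of bar/quux repetitions
    $             # anchored at the end of the string
""", re.VERBOSE)

assert pattern.match("foobarquux")
assert not pattern.match("foobarf")
```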

