Multiple regex performance shootout: RE2 vs. Intel's Hyperscan (01.org)
124 points by glangdale on June 21, 2017 | 55 comments



This post is by no means the definitive statement on RE2 and Hyperscan performance. We went very deep on subsets of one measurement case, but different patterns would yield a different set of behaviors for both engines.

Single-pattern performance has been covered before: e.g. https://rust-leipzig.github.io/regex/2017/03/28/comparison-o...


What's fascinating is that it shows that Hyperscan is not finding the same matches (!) as the other engines -- a piece of information conveniently left out of the Intel piece.

Also note that the rust regex crate now uses the same SIMD algorithm (Teddy) that was invented as part of Hyperscan, see: http://blog.burntsushi.net/ripgrep/ (also ripgrep is awesome, and you should use it.)


This is a surprisingly negative-sounding comment. We've been quite clear in our documentation and public language in the past about how 'all-matches' semantics differ from 'best match' semantics implemented by back-tracking engines, so we generally find a few more matches.

Given that the comparison with RE2 here is under circumstances where RE2::Set is not returning any offset at all, I'm not sure what your point is about the "Intel piece".

Ripgrep is also awesome, and thanks (I guess) for mentioning Teddy, which nicely illustrates my favorite instruction (PSHUFB). There are still some bits of the Teddy algorithm that should be incorporated into ripgrep (the merging of literals when you have >8 literals is naive).


It is negative, I should probably have toned it down even more. My mistake, sorry.


It is a pity that regex engines based on regex theory are only now coming into widespread use, when the theory has been in place for decades. Why so many language standard libraries use an exponential algorithm is beyond my comprehension.


Because the exponential algorithm (1) easily supports backtracking and (2) can often be implemented with less constant factor overhead than the Thompson NFA. Building the NFA and doing NFA to DFA conversion takes time. The simplicity and resulting speed in non-pathological cases is why JS regex engines all use the exponential algorithm.

Think about quicksort. Quicksort is O(n^2) worst case, so it must be awful, right? Well, no, not really, because with a sensible choice of pivot you rarely hit that case, and the constant overhead is significantly better than the guaranteed O(n log n) mergesort for common input. Regex algorithms have a similar dynamic.
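
To make "pathological" concrete, here's a rough sketch using Python's backtracking re module (exact timings will vary, but each extra character roughly doubles the work):

    import re, time

    # Nested quantifiers force a backtracker to try every way of
    # splitting the a's before it can report that nothing matches.
    for n in (18, 20, 22, 24):
        t = time.time()
        re.match(r"(a+)+b", "a" * n)   # can never match: there is no 'b'
        print(n, "->", round(time.time() - t, 2), "s")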

My understanding is that re2 and rust-regex manage parity with the backtracking algorithm for non-pathological regexes, much as the complicated mergesort-based Timsort can achieve parity with quicksort, but they have to do a lot of work to get there.


This sounds reasonable, but is simply not correct. There's nothing inherently slow about either algorithm in the average case and there's absolutely no reason that Thompson (or the One True NFA, Glushkov) should be slower than backtracking algorithms.

The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.

Our codebase is enormous compared to a relatively feature-rich back-tracker. Partly this is because we've been at this a while and under constant pressure to do well on lots of cases and metrics (a cleaner, 'toy' system would look nicer but maybe we wouldn't have survived as a startup if we hadn't optimized 'that one case'). But if you want to build out a Glushkov NFA graph and do a ton of optimizations, you're already more complex than a backtracking engine and you're just getting started.


> The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.

I'm not sure how that is saying anything substantively different from what I said... Point (a) is what I was alluding to with the quicksort analogy: you rarely hit the O(n^2) behavior with quicksort. Point (b) is what I was talking about with backtracking. Point (c) is what my simplicity remark was describing.

I didn't mean to imply that you can't go as fast as a good backtracking JIT, rather that it's an awful lot of work to get there.

That said, I do think that building the NFA or converting to DFA takes up-front time, right? Some regex benchmarks create regexes dynamically at runtime over and over and this can add up.


Yes, you're right. I was fixating on the wrong interpretation of your post. We do have exactly the problem you describe (startup time), and the cross-over analysis with RE2 ("how much data do I have to scan before I catch up with a simpler compiler?") would be interesting against libpcre or another backtracker.

Generally - if you can amortize the startup - we find that it's not that hard to outperform back-trackers. Some of this is chucking effort at the problem (writing SIMD implementations of some parts of the search problem) and could be replicated on the backtracking side of the fence. I do think it's simpler to make an NFA or DFA go fast, however, as the problem is very regular and predictable. The analysis of how you would do more than trivial SIMD optimizations on a true back-tracker breaks my brain. :-)


> yes, you can do capturing in automata but it's pretty hard

Huh, I thought it was fairly easy -- you just need a non-consuming state that pushes a mark onto a stack or pops one off of it. At least, I don't remember much of a problem implementing it when I implemented regex for Myrddin.

The trickiest part of getting matches to work right was getting the order of operations right so that when the VM forked, the state machine would pick the right order of operations to get greedy or non-greedy matches to end at the right place.

Backtracking is a far bigger problem, of course.


There are many types of automata, but if your description of "automata" includes phrases like "when the VM forked" I'm guessing that you're doing something pretty different to what we're doing. :-)

In our normal execution model, we also can't have a stack. Hyperscan is not allowed to go get more memory at run-time nor is it allowed to grab more storage proportional to the size of the input.

When we had a capturing prototype, we broke a lot of these rules (you needed either O(N) or O(lg(N)) side storage). We never had something that entirely fit our normal strictures and still did capturing; we couldn't do it in streaming mode either. We had something that could exactly simulate what you'd get with libpcre in terms of capturing and quantifiers as long as you could live with block mode and some limits on scan size dictated by our side storage use. The main problem? It was complex, and we didn't have a customer for it (this was back when we were a startup).


Fair -- I think the tagged DFA approach (https://laurikari.net/tre/) is the usual approach for doing captures without memory at runtime, but I haven't had a chance to look deeper into it.


You can do things with these backtracking algorithms that are difficult (capturing subexpressions, start of match) or impossible (backreferences, arbitrary lookarounds) to do efficiently in an automata based algorithm.
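
To make that concrete, here's a quick sketch with Python's backtracking re module; automata-based engines like RE2 or Hyperscan would reject patterns like these at compile time:

    import re

    text = "the the cat sat"

    # Backreference: \1 must repeat whatever group 1 captured.
    print(re.search(r"\b(\w+) \1\b", text).group())   # -> "the the"

    # Lookahead: match "cat" only if " sat" follows, without consuming it.
    print(re.search(r"cat(?= sat)", text).group())     # -> "cat"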

Sometimes users want these features more than they want theoretical purity. Who are we to tell them they are wrong?

That said, we do have some projects in mind to try to bridge these two worlds (the feature-rich backtracking world vs. the speedy/streamable automata-based one). Anyone interested should contact the Hyperscan team. It would make a nice project at undergraduate or even postgraduate level, depending on scale.


> impossible (backreferences, arbitrary lookarounds) to do efficiently in an automata based algorithm.

But... backreferences are impossible [0] to do efficiently [1] in any algorithm!

[0] Well, assuming P != NP. [1] Polynomial-time.

See: http://perl.plover.com/NPC/


I don't really like this result. It's a bit like saying finding the maximum integer in a sorted list is actually order N*N because... what if the N ints are all bignums with N bits, huh?

The fact that you have to grow the regex to get this result has a similar vibe to it. All the regex matching we see - and most conventional regex matching - assumes the regular expressions are fixed. Every now and then you'll see an algorithm where the size of the regex contributes to the big-O notation, but then it's usually broken out as a parameter.

I don't have a good method to actually do a fixed number of back-references in less than exponential time, but it seems fairly clear that if you are willing to burn polynomial time gratuitously you should be able to handle a back-reference (i.e. for input of length N and M back-references, there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge, but polynomial if M is fixed). So if you had an otherwise poly-time regex algorithm you could do it over and over again, trying each possible back-ref location, and still be poly-time.


Oops, should have been "unsorted list". Or order N lg N in a sorted list.

The point is that the number of back-references is not typically something that varies as part of a regex workload, so this proof derives something that is correct in itself, but not applicable to the way most people view regex. This has led to a meme that "all regex with backtracking is NP-hard".

My rather labored analogy is that people view comparison of two numbers as O(1) and would be nonplussed if someone demanded to know if they'd considered O(N) sized bignums... it isn't what most people are talking about when they talk about the cost of doing arithmetic operations. Similarly, arbitrarily expanding regexes is not what most people are talking about when they talk about the cost of doing regex.


> It's a bit like saying finding the maximum integer in a sorted list is actually order N*N because... what if the N ints are all bignums with N bits, huh?

I'm not sure I follow. Why would finding the max integer in a sorted list be anything other than the time to copy the last (or first, depending on sort order) element of the list?

Certainly a valid point, though, that this isn't a common use case, so that's just not how experts in this field think about this problem.

> there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge, but polynomial if M is fixed)

Sure, that buys you a pseudo-polynomial algorithm. I just don't like the idea of calling something like that polynomial -- if we did that across the board, then we'd have a poly solution for knapsack (and consequently demonstrate P = NP).


Because comparing N bits takes time proportional to N.


Just wanted to draw your attention to a proof-of-concept regex engine that I wrote a few years ago that does handle backreferences and arbitrary lookaround using a Thompson NFA and with no backtracking.

The point of this engine was to show that, while it seems to be commonly believed that an automata-based regex engine cannot handle backreferences and arbitrary lookaround, that is not entirely true. It does use a stack when it encounters backreferences, though.

The source code for the regex engine is at https://github.com/karthikj1/JRegexPlus and a brief explanation of how it works is at http://karthikj1.github.io/JRegexPlus/

I would be interested to hear any thoughts on this.


I think the idea is interesting. I don't think most reasonable people believe that you can't handle backreferences and arbitrary lookaround in something that's largely based on automata; rather that if you're using automata you probably made the decision to use them to avoid having to allocate arbitrary amounts of memory, to "stream" or for multiple pattern support and/or performance reasons.

If those criteria frame why you are using an automaton to begin with, then either (a) you can't use a stack, or (b) you at least need to make a performance argument that using NFA+stack is better than a back-tracker.

Arbitrary lookaround seems straightforward except in streaming mode, although efficient support for forward asserts for arbitrary regexes without "grows with the input size" side storage is IMO not trivial (it also breaks our ordering model). Backward lookaround seems easy if you are OK with adding logical AND to your NFA.


>Why so many language standard libraries use an exponential algorithm is beyond my comprehension.

"Because it enables very useful features like backtracking" doesn't really sound beyond comprehension.


The summary is very easy:

* Hyperscan is by far the fastest, but has only limited platform and regex support.

* Next is pcre2 (with the jit), which is a bit faster than the new rust regex. These support all options, everywhere.

* Then RE2 and all the others. RE2 has very limited regex support.

Why they compare the fastest one with one of the slower and more limited ones is beyond my understanding. The best overview is still https://rust-leipzig.github.io/regex/2017/03/28/comparison-o...


There are some inaccuracies in this comment.

Firstly, that link shows PCRE2 and Rust's regex engine are neck-and-neck. And that's including the fact that the benchmark contains a regex that is hard for DFA engines but contains no regexes that are hard for backtrackers. (I lodged this criticism with the author of the benchmark before they published it.)

Secondly, Rust's regex crate is heavily inspired by RE2. They support roughly the same features. So if RE2 is limited, then so is Rust's regex library.

Rust's regex library tends to do a little better than RE2 in a wide variety of common use cases because of aggressive literal optimizations. But if you mask those out, performance will be comparable. (Also, if you benchmarked captures, I bet RE2 would win handily. Rust's regex library needs some work in that area.)


I also tested it for inclusion into cperl, to replace the horrible old Spencer regex code with its longjmp for logical control and double parsing, and came to my own conclusions. Rust's will get better eventually, but PCRE2 also gets better every week. At the time PCRE2 was a bit faster and supported more features. And it has a JIT. Only Hyperscan, PCRE2 and sregex have a JIT.

I haven't checked Rust's feature list, sorry. I ruled it out for serious usage very early, and it's still evolving.

Captures are very important for us. Same as UTF-8. There was once the idea to switch to a limited engine like RE2 after the first scan of the regex, when it doesn't need backtracking. But then I would switch to Hyperscan instead, which has the best compiler, besides having a very good SIMD-optimized runtime.


Sorry, but there are still inaccuracies in your comments.

Hyperscan doesn't have a JIT. PCRE2 and sregex are not the only ones with a JIT. For example, PyPy's regex engine is JITed and so is v8's. The existence of Hyperscan (and Rust's regex engine) should show that a JIT is not necessary for high performance.

(Disclaimer: I am the author of Rust's regex library.)


I would qualify the Hyperscan engine as "JIT". Not a traditional one, but practically it is. It's dynamic, architecture-optimized, super-fast bytecode, but much more native than a normal platform-independent bytecode. Look e.g. at the NFA limex interpreter/runtime. It uses Intel opcodes dynamically for various SIMD widths: 128, 256, 384 and 512 bits.

Yes, PyPy and V8 also have a JIT, but they are not usable as a library. Ruby extracted its regex library, so I could use that as well; PyPy and V8 did not.


This summary was "very easy", because it is wrong. This assessment is unfair to RE2, and we're comparing to RE2 because it's the only one of these that can do multiple regex. Both Hyperscan and RE2 throw a good deal of regex under the bus in order to get multiple regex support, performance and/or streaming capability.


Rust's regex library has had support for regex sets for quite some time, and it's based on the same "tricks" that RE2 uses: https://docs.rs/regex/0.2.2/regex/struct.RegexSet.html


Oops. We're out of date in our thinking then. If the implementation is similar to re2, you'll probably see similar performance, although I hear good things about your small group literal matcher. :-)


:P Yeah, I would expect performance characteristics to be very similar to RE2! I actually think RegexSet disables literal optimizations in more cases than the standard regex, but it's been a while since I've touched that.


My summary is from the language implementor's point of view, not from the POV of the 1% of users who do need regex sets (batch regex searching), no backtracking and fast streaming. For fast streaming, sregex would be the first contender against Hyperscan, not RE2.

RE2 is fine, but nobody uses it because... why do you think that is? Because of no backtracking support. If I need a limited fast regex on modern Intel CPUs with SIMD, I'll choose Hyperscan over everything else, but I need to check the CPU capabilities before loading the shared lib. If I need a fast regex everywhere else, I choose PCRE2 over everything else. RE2 had its time for the last decade, but that time is over now.


Limiting the benchmark to engines supporting matching sets of regexes seems unnecessarily narrow to me. Is matching a set of regexes not equivalent to matching a single combined regex? Something like "(?:a)|(?:b)|(?:c)"? If so, why not run the test with that regex instead, and include other engines?


For engines like Hyperscan and RE2's underlying algorithms, alternation works the way you describe. However, for backtracking engines, alternation does not do a search in parallel for all the patterns. Instead, it winds up first looking for the first, then the second, etc., finds the first match, and declares victory. There are particular patterns and inputs where this behavior will be similar to parallel matchers like RE2::Set and Hyperscan, but the general case just won't work.


The complication occurs when you need to identify which sub-expression matched.
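
With a backtracker you can partly work around that with named groups, at the cost of the left-to-right alternation order described in the sibling comment. A rough Python sketch (the sub-patterns here are made up for illustration):

    import re

    # One combined pattern; each alternative gets a named group so the
    # match can report which sub-expression fired.  Order matters: the
    # hex alternative must come before plain digits, or "0xff" would
    # first match "0" as a number.
    combined = re.compile(r"(?P<hex>0x[0-9a-f]+)|(?P<num>\d+)|(?P<word>[a-z]+)")

    for m in combined.finditer("0xff foo 42"):
        print(m.lastgroup, m.group())   # hex 0xff / word foo / num 42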


How do both of those compare to the C++11 std::regex or Boost::regex? At what point is it worth the switch?


If you really need the backtracking features in those other systems, the point is "never".

Similarly, if you have short-lived regexes that are compiled, used for a small amount of data, and discarded, you might never see a performance benefit.

Multiple regexes, scanning a substantial amount of data and/or having a requirement to 'stream' (i.e. process successive writes of input data when you can't hold on to old data) are the sort of things that make Hyperscan use more sensible. We see a lot of use in network security where these assumptions all generally hold.


Also, if you're taking user input into a regex, you want to use something like RE2 or Hyperscan. To do otherwise is to expose yourself to DoS.


Is that just because they don't support the features which can be used to make pathological regexes in other engines?

Or do they have additional sanitisation or security features?


Neither. There are regexes that can be written in the common subset supported by (say) libpcre, RE2 and Hyperscan that will induce exponential backtracking with libpcre but not with the other libraries.

I'm not aware of any difference in terms of sanitisation or security.


C++ says what the regex must do, but not how it should do it or what the performance should be.

To be honest, I have no idea if the "what the regex must do" description rules out particular algorithms. I suspect that, in keeping with Committee tradition, it's basically a new interface over POSIX ( http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html ). By "Committee tradition," I mean "compare 'catgets' with C++ message catalogs."


std::regex does not work in half of the distros, Boost is a disaster, Qt is OK. Never.


Wow, oddly relevant - I've been working on a RegExp implementation in JavaScript using similar techniques (minus vectorization of course, RIP SIMD.js). There is no efficient way in JavaScript to replace many patterns (e.g. a list of sentence fragments with optional whitespace, capitalization, pluralization, etc.) against a large document where the replacement requires metadata about the subpattern that was matched (e.g. to replace a term with a link to the term's definition).
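
For a sense of the shape of the task, here is a rough Python sketch with a made-up glossary -- one big alternation plus a replacement callback that looks up the matched term's metadata. This is the naive approach whose performance I'm trying to beat:

    import re

    # Hypothetical glossary: term -> link target used in the replacement.
    terms = {"regex": "/defs/regex", "automaton": "/defs/automaton"}

    # One big alternation over all terms, plus a callback that looks up
    # which term matched in order to build its replacement.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    def linkify(m):
        word = m.group(0)
        return '<a href="%s">%s</a>' % (terms[word.lower()], word)

    print(pattern.sub(linkify, "A regex engine compiles to an automaton."))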

I wonder if I should switch gears and try to write a native Node module for this instead. This looks really great.


Thanks for the kind words. You'll have to roll your own 'replace', and the all-matches semantics of Hyperscan may give you some headaches - you'll need a careful reading of our Start of Match semantics and to craft patterns that yield "less surprising results".

If you can share your workload, or want assistance in figuring out whether to do this and/or how to do it, please contact us on the mailing list or on the email link for Hyperscan on 01.org. We're always interested in sample patterns, especially ones that fall outside our normal wheelhouse of 'network security and then some more network security'.


Well, it depends on how big and how fast.

This worked for me https://github.com/time4tea/glossary/blob/master/README.md

It's a very simple Aho-Corasick text search with replacement.

A similar thing in Java using Ropes rather than Strings was also ok, for some value of ok.


If you're interested in getting parse trees from regexes (i.e. all the submatch locations), check out Kleenex; it looks really interesting and is based on some new research. Full parse trees are a harder problem than matching, and one that is not a primary focus of RE2 performance AFAICS.


Isn't it true that once you have backtracking in your regex engine, there's no longer a guarantee that there's only a single parse tree?

I've found the hardest part of using e.g. a recursive descent parser is having to correct captured state while backtracking.


Maybe I'm not normal, but I'd rather write 30 lines of readable and debuggable code than use a regex.

Would you embed Brainfuck in your code? Then why do you use regexes?


How do you match arbitrarily repeating patterns, which may/probably have a combinatoric expansion, without effectively writing your own parser? If you're using a language where writing a parser is relatively easy (e.g., Lisp, Haskell, etc.) or your patterns are trivial, then fair enough. Otherwise, good luck with that.

For example, in Python, if I want to match "foo" followed by any number of "bar"s and/or "quux"s at the end of the string, I could write this:

    if re.search(r"foo(bar|quux)*$", my_string):
        ...
...without going to the trouble of writing a recursive-descent parser or state machine; your way would be something like this:

    def ends_with_foo_etc(s):
        # Zero repetitions of bar/quux: the string simply ends in "foo".
        if s.endswith("foo"):
            return True

        # Find the position just after the last "foo" by searching the
        # reversed string; if there is no "foo" at all, find() returns -1
        # and after_last overshoots len(s).
        after_last = len(s) - s[::-1].find("oof")
        if after_last > len(s):
            return False

        # Consume "bar"/"quux" units until we reach the end or get stuck.
        good_tail = True
        while good_tail and after_last < len(s):
            if s[after_last:after_last + 3] == "bar":
                after_last += 3
            elif s[after_last:after_last + 4] == "quux":
                after_last += 4
            else:
                good_tail = False

        return good_tail

    if ends_with_foo_etc(my_string):
        ...
That's excessively verbose for a really simple pattern, almost certainly slower and, most importantly, brittle in that it couples your pattern to your implementation in a non-trivial way. If your pattern changes, you may have to completely rewrite your matching function.

It's also worth noting that a typical regex API will do more than just match patterns; e.g., extracting subpatterns/groups, splitting, etc.


Your implementation seems complicated. The regular expression "foo(bar|quux)*" denotes a set of strings: { "foo", "foobar", "fooquux", "foobarbar", "foobarquux", "fooquuxbar", "fooquuxquux", ... }

Since this is anchored to the right, we can look at this from right to left: it's any mixture, including empty, of "quux" or "bar" elements, preceded by "foo". From this, a very simple algorithm pops out:

    def matches_foo_etc(s):
        # Strip trailing "quux"/"bar" units from the right until only a
        # string ending in "foo" remains.
        while not s.endswith("foo"):
            if s.endswith("quux"):
                s = s[:-4]       # chop off "quux" from end of string
            elif s.endswith("bar"):
                s = s[:-3]       # chop off "bar" from end of string
            else:
                return False     # no match
        return True              # match


My implementation was the first thing that popped into my head. Yours works too (although note that yours will probably reallocate memory every time it chops the end off the string, so it'll be slower). A RD parser or finite state machine would also work. My point was rather that regular expressions are a much better tool for this type of task (see the reasons given at the end of my previous comment).

Writing fewer lines of code to do the same job is usually a good idea, providing it remains understandable. Personally, I find the syntax of regular expressions pretty straightforward, and a lot of languages even allow you to write comments within them to make them even easier to follow.


To be frank, I don't have very complex use cases. But in your example, even while not knowing Python, I can understand your code.

To be sure I truly understand your regex, I would need to do a lot of Googling and a lot of thinking.

Regexes are not KISS IMO.


Regex syntax is fairly gross, I agree.

That being said, I'd be surprised if your open-coded version of simple regexes like /hatstand.*teakettle/ or /badgerbrush\s{12}/ was clearer than the regex, and it would almost certainly be dismally slow relative to a tuned regex implementation - unless you wrote a lot of one-off SIMD code.

Further, if given a bunch of different regexes to match simultaneously, your code would have to run them one by one.

Finally, quite frequently our customers/users are in a situation where they don't know what the regex is going to be, and they aren't going to let people inject arbitrary code into their applications.


A lot of the grossness is syntactic sugar; if one were so inclined, they could get by with a handful of operators and end up with something fairly clear, as you demonstrated.

So if anything, I blame lack of good taste for regular expressions' bad rep.


Your comment is like "I don't need nails because I have screws".


Exactly: why would I risk smashing my hand with a hammer when I have a nice electric screwdriver that is much more user-friendly?



