Hyperscan: High-performance multiple regex matching library from Intel (hyperscan.io)
212 points by goranmoomin on Dec 24, 2019 | hide | past | favorite | 53 comments

Based on the documentation[1], and not surprisingly, this library has been optimized for Intel processors. In particular, you can enable the AVX2 and AVX512 instruction sets when compiling the pattern database. In addition, you can tune the matching database for a specific Intel processor family (like Haswell or Sandy Bridge).

BSD-licensed, and they accept contributions. I wonder if they would accept flags that allow optimization on AMD processors.

[1]: http://intel.github.io/hyperscan/dev-reference/api_constants...

I wonder if they support allowing AVX2 optimizations on non-Intel processors; Intel MKL refuses to. I think for that reason alone I’d never use an upstream Intel library directly...

I built Hyperscan with my team in Australia (we've since moved on, or been moved on, from Intel). There are no shenanigans to prevent Hyperscan from using AVX2 on non-IA processors, although we didn't have any high performance examples of same when we were last at Intel (2017).

The mechanism is trivial to control IIRC and would not be difficult to patch. It is, after all and unlike MKL, open source.

Thanks for responding. Please excuse my wariness, then. (I think it’s fair to say it’s pretty well earned at this point, but it’s certainly not you or your team’s fault.)

Kudos for making one of the most impressive pieces of software ever. The level of optimization you achieved is truly rare these days.

I don't know anything about Hyperscan, but it surprises me that it has a truly rare level of optimization. How does the optimization compare with typical tuned linear algebra, for instance?

I don't know why you're being downvoted either, for what seems like a legitimate (if slightly oddly framed) question.

Linear algebra libraries have a ridiculously simple task (by comparison) that can be optimized incredibly well - there seems to be no amount of intellectual effort that isn't worth doing to get 1% more on BLAS given the massive usage of it.

By contrast, Hyperscan has many different subsystems (string search of various kinds, NFA, DFA, various quick SIMD 'lookarounds', different acceleration schemes for NFAs and DFAs, etc.). It's enormously complicated already (and a bit unpredictable in terms of the different tradeoffs that can happen) and the optimizations have to be best-effort. We couldn't put nearly as much optimization into any given hot loop as the linear algebra guys can.

That being said, we've done quite a lot, and many people have picked up useful techniques from our codebase. For example, the string search in Hacker News user burntsushi's regex scanner is an entirely legit (i.e. he freely credits where it comes from) clone of our "Teddy" literal matcher.
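
The prefilter-then-verify shape of a Teddy-style literal matcher can be sketched in scalar Python. This is only the shape of the idea, not Hyperscan's implementation: the real thing checks leading bytes across many literals at once with SIMD shuffles, while this toy version buckets literals by first byte and verifies candidates only where that byte occurs.

```python
def build_buckets(literals):
    """Bucket literals by their first byte (scalar stand-in for a SIMD prefilter)."""
    buckets = {}
    for lid, lit in enumerate(literals):
        buckets.setdefault(lit[0], []).append((lid, lit))
    return buckets

def find_literals(buckets, haystack):
    """Scan once; run the (comparatively expensive) verify step only on prefilter hits."""
    hits = []
    for pos, ch in enumerate(haystack):
        for lid, lit in buckets.get(ch, ()):
            if haystack.startswith(lit, pos):  # verify step
                hits.append((pos, lid))
    return hits
```

The win comes from the prefilter rejecting almost every position cheaply, so the verify loop runs rarely.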

Rather than just downvoting, could someone answer the genuine question about how the optimization that's been done compares with the work on linear algebra which underpins so much? As a specific example, there's libxsmm for small matrices.

There's no aspersions cast on Hyperscan at all, just a query about what makes it "truly rare" for the benefit of people who don't have time to study it. It would also be interesting to know how it compares with hardware regex implementations, of which I haven't heard anything recently in connexion with bioinformatics where the interest was.

I might be the only other person to solve multipattern regex searching. It’s easy conceptually, and fiendishly tricksy when you try to get the tests to pass. You want to take advantage of commonality between patterns, but the matches must still be independent. The engine I wrote uses a classical interpreter backend, optimized well enough to beat TRE and the old PCRE handily. I’d read about the bit-parallel algorithms and played with them a bit, but the pattern sets we’d throw at any given BP algorithm would break it, as each BP algorithm comes with several caveats.
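
For readers unfamiliar with the bit-parallel (BP) family mentioned here, a minimal Shift-And matcher for a single literal fits in a few lines; the caveats show up as soon as you add unbounded repeats, large pattern sets, or patterns longer than a machine word.

```python
def shift_and(pattern, text):
    """Bit-parallel Shift-And: state bit i means 'pattern[0..i] just matched'."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # shift in a fresh 'start of pattern' bit, then keep only
        # the states consistent with the current character
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)
    return hits
```

Every state transition for the whole pattern happens in one shift and one AND, which is what makes the approach attractive — and what makes it fragile once a pattern set doesn't fit the word-size assumptions.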

The genius of the HyperScan team is that they threw damn near every algorithm at their pattern sets, including some they invented, and divvied up the patterns into subsets that fit the particular algorithms well. Usually getting the tests to pass with one backend is the last act of any regex author—once it ain’t broke, you’re too exhausted to fix it. Contemplating the complexity that comes with a multi-algorithm backend makes me want to cry. But HyperScan did it.

So, to put it in perspective with linear algebra, imagine a library that first spent time analyzing given matrices, and then split them into submatrices and operated on them with different algorithms according to the particular patterns detected therein. That’s kind of insane to contemplate in linear algebra—it’s really not a domain that compares well at all—but it’s how HyperScan works... and that ignores all the crazy ISA tricks it used.
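
The decomposition idea can be illustrated with a toy dispatcher (the classification heuristic and backend names here are made up, not Hyperscan's): classify each pattern, route it to the cheapest subsystem that can handle it, and keep per-pattern match reporting independent.

```python
import re

def classify(pattern):
    """Toy heuristic: pure literals go to plain string search, the rest to a full engine."""
    if re.escape(pattern) == pattern:
        return "literal"
    return "general"

def build(patterns):
    """Split the pattern set into per-backend buckets, remembering each pattern's id."""
    buckets = {"literal": [], "general": []}
    for pid, p in enumerate(patterns):
        buckets[classify(p)].append((pid, p))
    return buckets

def scan(buckets, text):
    """Run each backend over the input; report matches per pattern id, independently."""
    hits = set()
    for pid, lit in buckets["literal"]:
        if lit in text:
            hits.add(pid)
    for pid, p in buckets["general"]:
        if re.search(p, text):
            hits.add(pid)
    return sorted(hits)
```

The hard part the comment describes is exactly what this sketch glosses over: doing the split so that each subset genuinely fits its backend, while the combined answers stay identical to running every pattern alone.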

The vast, vast majority of code is not optimized to the level of some linear algebra code. Pointing at MKL or comparable linear algebra packages as an example of why that level of optimization is not "rare" is silly.

I keep seeing complaints about MKL not supporting other architectures. Why do people care? For instance, for linear algebra, use BLIS (like AMD do), or OpenBLAS. Intel actually support BLIS (at least to some extent), and I'm not aware of libxsmm being Intel-only, though I haven't checked. MKL-DNN (whatever it's called now) also appeared to be fairly generic when I looked.

My complaint with MKL is that Intel is intentionally harming performance by only allowing AVX2 code to run on Intel’s own processors. It doesn’t matter that there are alternatives, because having the ability to avoid something doesn’t excuse bad behavior. That said, if you’re relying on a dependency that uses MKL, I guess you’re probably SOL without hacks.

I don't know how MKL works internally, but it's been reported that you can force its optimized paths to run on AMD processors, so the restriction seems to have no good technical reason. I do know that you're harming performance if you use an implementation for a different micro-architecture rather than one that's correctly tuned. Assuming we're talking GEMM, you don't get the performance just with generic SIMD code. You need to worry about details of caches and instruction latencies, in particular. Consider that you should even avoid AVX512 on some processors.

I'm no apologist for Intel, but I suppose people running MKL on AMD would complain that it was slow because it was wrongly tuned. Note that BLIS gets quite close to MKL on Intel anyhow, and the timing variance of a typical HPC application is likely to be greater than the contribution of ~10% from GEMM.

I don't know what people do that's MKL-specific, but its major functionality is available in free implementations that will be infinitely faster on POWER and ARM apart from AMD.

Now, I can’t claim to be an expert by any means, but I feel like you’re overcomplicating this. For one, we know in practice the performance is much better when you force-enable AVX2 on AMD with MKL; it has been benchmarked. For another, that’s already obvious, because even without micro-optimizing we’re still talking about code paths that can better fill the processor and use 256-bit integer operations. Certainly not everything will always be straightforward, but MKL seems like one of the single most obvious libraries for this.

Further, Intel has a rich history here. They have been sued and lost over the Intel C++ Compiler’s “Cripple AMD” bit. The funny thing is, it's still there, although now Intel has warnings throughout their pages. I guess removing a strcmp against the CPUID vendor string “GenuineIntel” was too much work.

Let’s be realistic. Intel wants to win the benchmarks. There’s no need to construct an elaborate reality where this is all done out of concern that they might make performance worse on microarchitectures they did not design.

If they don't accept AMD optimizations, just fork it.

We use this at GitHub to power the commit token scanning. It is very fast and handles multiple regexes at once. We are looking for secrets from multiple providers. Couldn’t find a better option for this type of usage.
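
A rough sketch of what multi-provider token scanning with a single combined scan looks like. These patterns are illustrative, not GitHub's actual rules: the AKIA-prefixed shape of AWS access key IDs is publicly documented, while the hex token pattern is made up.

```python
import re

# Illustrative provider patterns (NOT GitHub's actual rules).
PROVIDER_PATTERNS = {
    "aws_access_key_id": r"\bAKIA[0-9A-Z]{16}\b",   # documented AWS key ID shape
    "hex_api_token": r"\b[0-9a-f]{40}\b",            # hypothetical 40-hex-char token
}

# One combined scan over the commit text; named groups tag which provider matched.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in PROVIDER_PATTERNS.items())
)

def scan_commit(text):
    """Return (provider, token) pairs found in one pass over the input."""
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]
```

A real deployment has hundreds of such rules, which is where a multi-regex engine like Hyperscan beats compiling one giant alternation in a backtracking library.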

Can you run this as a pre-receive hook to actually prevent a public commit if it has suspicious patterns in it?

You can use gitleaks to do this now. I added that feature in v3: https://github.com/zricethezav/gitleaks/releases/tag/v3.0.0

Can you elaborate? What providers?

For example, if someone commits AWS keys to a GitHub repo, GitHub will alert AWS, AWS will revoke your keys and freeze the account until they're satisfied nothing is compromised.

> https://github.blog/2018-10-17-behind-the-scenes-of-github-t...

What??? Even in a private repo?

From the article:

> Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code).

Private repos are not scanned.

Private repos aren't scanned, and if we did offer the service to scan private repos we would only ever contact the repo owners, not the service providers. In the case of public repos we're in a race against time, where bad actors are scraping the site and have automation ready to use your tokens within seconds of a leak. Private repo scenarios are much simpler to deal with.

The blog says public repos... but good question. Is it only some people who can subscribe to high-entropy strings?

GitHub: I didn't opt into 2FA: #439658 (3rd-party auth required without 2FA)

I had to read Hyperscan source code for work.

Honestly, once you understand (some of...) the math/automata details, this is by far one of the best big codebases ever written.

So clean, beautiful, powerful. I've learned a lot from this codebase.

Does anyone know more codebases as well written as Hyperscan?

ex-Hyperscan guy here: thanks for the kind words. I'm sure my team (who are primarily responsible for the cleanliness - as a developer, I make a great "ideas person") would appreciate it.

Can you disclose how/why you were reading Hyperscan source for work? Just curious, no agenda.

Wow, thanks for your work then!

It was for a (now failed) startup selling IPS appliances. What really helped us in the end was that we could run the runtime in C (C++ was not possible); however the compiler/tooling we used was fairly old (...) so I had to dig into the codebase details to make it work.

As a side note, if I had to say something bad about Hyperscan, it would be the lack of high-level documentation. I don't know about now, but back then only a couple of blog articles were available... I had always been curious: was it intended to prevent copycats? Lack of time? Why not explain more of the high-level math/automata details?

If the high-level documentation were improved, I'm pretty sure the number of companies integrating Hyperscan would increase, and hence Intel's sales would increase (since it has been bought by Intel ;)

The lack of high level docs up to the point of open sourcing was partly due to prevention of copycats but mainly due to lack of resources (chiefly time) to spend time writing. We were a small team and time spent on documenting stuff we all understood pretty well was time spent not chasing customers or improving our product.

Later: well, there is a Hyperscan paper and there may be more material coming out later.

Also, not to be a jerk, but a lot of people claim that they will read/understand/use this kind of documentation and my experience was that only a fraction actually do, and of those fraction, most of them don't behave in a way that's actually useful enough to justify having made all those docs. One is more likely to wind up with people kibitzing and making inane tweaking suggestions ("use more NFAs! no, use more DFAs"); less likely is meaningful OSS contributions or using the software when they might not have before.

Wow I was looking for good codebases to dig into in my free time. This one looks like a good option.

As per the documentation, the library supports a subset of the PCRE syntax.

I wrote something very similar many years ago. I described it as a reverse search index: the queries/regexes get indexed into a search tree and then text is run through it. It supported hundreds of thousands of parallel regex searches. I called it dClass:

Related: ClickHouse uses this for its regex matching.

Unfortunately no backreferences

I use FLRE: https://github.com/BeRo1985/flre

That is also an impressive project, with multiple subengines from which it chooses the fastest one. All written by one guy. Basically no documentation, though.

I wonder how the performance compares

Could matching algorithms be expressed as a graph of tensorflow primitive ops and run on Google's TPU?

I suspect yes. This would be a wonderful project for someone. Do note that many regexes can be expressed as a dataflow problem. We had an experimental subsystem (called "Puggsley") that was a data-parallel take on the problem; this wasn't published, but is pretty much an independent invention of the same ideas in icgrep.

The major headache is the prospect of loops that are bigger than 1 character position - these aren't as trivial to express (I have ideas, naturally).

I don't know how well this fits in with Tensorflow - it may be too finegrained? But of course, getting onto custom hardware can yield massive speedups.
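
One way to see the dataflow/TPU angle: a single NFA step over one input character is a boolean vector-matrix product, which is exactly the primitive a tensor device accelerates. A pure-Python sketch with a hand-built three-state NFA for ab*c (anchored, whole-string match; the NFA construction itself is done by hand here):

```python
def step(state_vec, trans):
    # One NFA step as a boolean vector-matrix product:
    # next[j] = OR_i (state_vec[i] AND trans[i][j])
    n = len(state_vec)
    return [any(state_vec[i] and trans[i][j] for i in range(n)) for j in range(n)]

# Hand-built NFA for "ab*c": state 0 = start, 1 = seen 'a' (loops on 'b'), 2 = accept.
T = {
    "a": [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    "b": [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    "c": [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
}
EMPTY = [[0, 0, 0] for _ in range(3)]  # any other character kills all states

def matches(text):
    """Anchored whole-string match: consume every character, then check the accept state."""
    v = [1, 0, 0]  # start in state 0 only
    for ch in text:
        v = step(v, T.get(ch, EMPTY))
    return bool(v[2])
```

This is why the one-character-per-step structure maps cleanly onto matrix hardware, and also why loops spanning more than one character position (mentioned above) are the awkward part: they break the simple per-character dataflow.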

Time to raise the rule count limits in regexp-based adblockers then!

Maybe, when webassembly supports SIMD.

I’m fairly sure the parent was talking about content blocker-esque architectures, where the regex matching is performed in native, browser code.

Are there command-line tools for it?

There are example programs; that's about it. This would be a good project for someone, but there's a lot of code in a "grep" that isn't in Hyperscan (finding and displaying lines and line numbers, lots of command-line goop and options, pretty colors for results, etc.).

Also, the limitations of Hyperscan might be quite noticeable. It doesn't support all PCRE facilities (e.g. capturing and arbitrary lookaround). It has a considerable compile time - I wouldn't want to use Hyperscan to grep short files! Justin Viiret (another ex-Hyperscan team guy) has a blog post about this, comparing our relatively heavyweight optimization strategy with RE2 (which gets down to the business of scanning a lot quicker than we do). You can find it here: https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scann...

(sorry about the giant 01.org "dickbar", we couldn't control that)

The upshot is that most people looking for big collections of regexen in huge amounts of data aren't really running 'grep' type tools. If someone wanted to do that, like I said, it would be a good project.

FWIW, the regex engine in ripgrep is entirely pluggable and abstracted (to an extent). It's how it seamlessly supports both PCRE2 and my own regex engine.

It would be an interesting project to add Hyperscan support to it. You would get all the extra goop you mention for free. I would be happy to give guidance on that if anyone's interested.

The same blog on hyperscan's current website: https://www.hyperscan.io/2017/06/20/regex-set-scanning-hyper... (without the offending bar)

Any guesstimates as to when it starts to become worth it to use Hyperscan vs PCRE etc.?

30 regexes and 1 MB of data, for example.

I'm somewhat curious as I have a couple of scraping things I do where I can compile the regex once and keep it hanging around (or save to/load from disk if that's feasible) for 3 to 5 minutes at a time.

Well, 30 regexes is not a great case for libpcre, as libpcre really doesn't have a multiple-regex mode. Thanks to some excellent work on accelerated searching in libpcre2, it's no longer the "7-10x slower than Hyperscan on a single regex" case of the past (though still frequently slower), but it wouldn't surprise me if HS were 50-100x faster than libpcre once compile time isn't a factor.

However, the compile time of HS for 30 regexes might not be entirely trivial (maybe a second, probably less). At a megabyte you'd probably see benefits, but I think they might not be drastic.

Probably the best way to find out would be to fire up hsbench (a tool that comes with Hyperscan now), figure out the slightly weird corpus format (sorry) and get some numbers for yourself on your own workload.
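
For perspective on why a single-regex library pays per pattern: with an engine like libpcre (Python's re is used here as a stand-in, since it too has no multi-pattern mode) you either scan the input once per compiled pattern, or combine everything into one alternation and lose clean per-pattern reporting.

```python
import re

# Stand-ins for real rules; indices double as pattern ids.
PATTERNS = [r"foo\d+", r"bar[a-z]{3}", r"\bbaz\b"]

def match_loop(text):
    """Single-regex engines: one pass over the input per pattern, O(N) scans."""
    return [i for i, p in enumerate(PATTERNS) if re.search(p, text)]

# Workaround: one combined alternation scans once, but a hit only reveals
# the first alternative that matched, not every pattern that matches.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS))

def any_match(text):
    return COMBINED.search(text) is not None
```

A multi-regex engine gives you the one-pass scan of the combined form while still reporting each pattern's matches independently, which is the whole point of the comparison above.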

If you really need speed you should use an FPGA.

This is a pro-level troll!

For context, see slides 5 and 6 about the history of Hyperscan and Sensory Networks Inc.: https://openisf.files.wordpress.com/2015/11/oisf-keynote-201...

History is not a benchmark :) A universal software algorithm cannot be faster than a custom algorithm on an FPGA for each specific job.

Here is some real history: https://twitter.com/joeerl/status/1115990630793207808

You are correct that the failure of Sensory Networks from 2006-2009 to develop an effective FPGA matching algorithm does not imply that no-one can do it. However, the rest of what you said is an assertion for which there isn't really that much proof. FPGA is wonderful when you have the sort of available parallelism it needs; it's not great competing with s/w if the problem is expressed in algorithms with serial dependencies.

I think FPGA could do a great job of achieving worst-case performance by simulating NFAs in a very straightforward fashion (beating s/w). However, no-one does this as the headline numbers are terrible.

The thread you posted contains an assertion by someone who built an FPGA parser with no performance numbers to back it up; the performance comparison may well reflect relative skills with software and hardware as opposed to the limits of software. Interestingly, one of the first responses refers to another project of mine. :-) (simdjson). Unfortunately, given the subsequent posts on the thread, it does not look like the author can provide further details on the system, which is sad.

> A universal software algorithm cannot be faster than a custom algorithm on an FPGA for each job.

It often can, because it can clock 50x higher.

And if you want a fair comparison, customize both to the job.

And a truck driver should use a sports car to make his deliveries I guess.

It seems you don't really need speed, then. Indeed, you shouldn't use a sports car if you don't need to move really fast.

But if you ever do need it, an FPGA instance is not as expensive as you might think: https://aws.amazon.com/ec2/instance-types/f1/

Why not ASIC?

Because a custom algorithm on an FPGA for each job will run faster than an ASIC with a universal algorithm.

Making a custom ASIC with a non-universal algorithm is too expensive.

