Based on the documentation[1], and not surprisingly, this library has been optimized for Intel processors. In particular, you can enable the AVX2 and AVX512 instruction sets when compiling patterns, and you can also tune the matching database for a specific Intel processor family (like Haswell or Sandy Bridge).
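For anyone curious what that looks like in practice, the targeting happens at pattern-compile time via the optional platform argument. A minimal sketch in C (the pattern is just a placeholder; error handling trimmed):

    /* Sketch: compile a pattern database tuned for a Haswell-class CPU
       with AVX2 enabled, using the optional hs_platform_info_t argument. */
    #include <hs/hs.h>
    #include <stdio.h>

    int main(void) {
        hs_platform_info_t plat = {
            .tune = HS_TUNE_FAMILY_HSW,           /* tune for Haswell */
            .cpu_features = HS_CPU_FEATURES_AVX2  /* allow AVX2 code paths */
        };

        hs_database_t *db = NULL;
        hs_compile_error_t *err = NULL;
        if (hs_compile("foo.*bar", HS_FLAG_DOTALL, HS_MODE_BLOCK,
                       &plat, &db, &err) != HS_SUCCESS) {
            fprintf(stderr, "compile failed: %s\n", err->message);
            hs_free_compile_error(err);
            return 1;
        }
        hs_free_database(db);
        return 0;
    }

Passing NULL for the platform argument instead makes the database target whatever machine it is compiled on.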
BSD-licensed, and they accept contributions. I wonder if they would accept flags that allow optimization on AMD processors.
I wonder if they allow AVX2 optimizations on non-Intel processors; Intel MKL refuses to. I think for that reason alone I’d never use an upstream Intel library directly...
I built Hyperscan with my team in Australia (we've since moved on, or been moved on, from Intel). There are no shenanigans to prevent Hyperscan from using AVX2 on non-IA processors, although we didn't have any high-performance examples of that when we were last at Intel (2017).
The mechanism is trivial to control IIRC and would not be difficult to patch. It is, after all and unlike MKL, open source.
Thanks for responding. Please excuse my weariness, then. (I think it’s fair to say it’s pretty well earned at this point, but it’s certainly not you or your team’s fault.)
I don't know anything about Hyperscan, but the claim that it has a truly rare level of optimization surprises me. How does the optimization compare with typical tuned linear algebra, for instance?
I don't know why you're being downvoted either, for what seems like a legitimate (if slightly oddly framed) question.
Linear algebra libraries have a ridiculously simple task (by comparison) that can be optimized incredibly well - there seems to be no amount of intellectual effort that isn't worth doing to get 1% more on BLAS given the massive usage of it.
By contrast, Hyperscan has many different subsystems (string search of various kinds, NFA, DFA, various quick SIMD 'lookarounds', different acceleration schemes for NFA and DFA, etc.). It's enormously complicated already (and a bit unpredictable in terms of the different tradeoffs that can happen), and the optimizations have to be best-effort. We couldn't put nearly as much optimization into any given hot loop as the linear algebra folks can.
That being said, we've done quite a lot, and many people have picked up useful techniques from our codebase. For example, the string search in Hacker News user burntsushi's regex engine is an entirely legit (i.e. he freely credits where it comes from) clone of our "Teddy" literal matcher.
Rather than just downvoting, could someone answer the genuine question about how the optimization that's been done compares with the work on linear algebra which underpins so much? As a specific example, there's libxsmm for small matrices.
No aspersions are being cast on Hyperscan at all, just a query about what makes it "truly rare" for the benefit of people who don't have time to study it. It would also be interesting to know how it compares with hardware regex implementations, of which I haven't heard anything recently in connexion with bioinformatics, where the interest was.
I might be the only other person to have solved multipattern regex searching. It’s easy conceptually, and fiendishly tricksy when you try to get the tests to pass. You want to take advantage of commonality between patterns, but the matches must still be independent. The engine I wrote uses a classical interpreter backend, optimized well enough to beat TRE and the old PCRE handily. I’d read about the bit-parallel algorithms and played with them a bit, but the pattern sets we’d throw at any given BP algorithm would break it, as every BP algorithm comes with several caveats.
The genius of the Hyperscan team is that they threw damn near every algorithm at their pattern sets, including some they invented, and divvied up the patterns into subsets that fit the particular algorithms well. Usually getting the tests to pass with one backend is the last act of any regex author: once it ain’t broke, you’re too exhausted to fix it. Contemplating the complexity that comes with a multi-algorithm backend makes me want to cry. But Hyperscan did it.
So, to put it in perspective with linear algebra, imagine a library that first spent time analyzing the given matrices, then split them into submatrices and operated on them with different algorithms according to the particular patterns detected therein. That’s kind of insane to contemplate in linear algebra (it’s really not a domain that compares well at all), but it’s how Hyperscan works... and that ignores all the crazy ISA tricks it uses.
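To make the "bit parallel algorithms come with caveats" point above concrete: the textbook bit-parallel matcher (Shift-And) packs all of its state into one machine word, which is exactly why it breaks down as patterns get long or numerous. This is not Hyperscan code, just an illustrative sketch:

    /* Shift-And (Baeza-Yates/Gonnet): bit-parallel matching of one literal
       pattern. The classic caveat: the pattern must fit in a machine word,
       so anything longer than 64 bytes here needs a different algorithm. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static long shift_and(const char *pat, const char *text) {
        size_t m = strlen(pat), n = strlen(text);
        if (m == 0 || m > 64)
            return -1;                        /* the word-size caveat */
        uint64_t mask[256] = {0};
        for (size_t i = 0; i < m; i++)        /* bit i set where pat[i] == c */
            mask[(unsigned char)pat[i]] |= 1ULL << i;
        uint64_t state = 0, accept = 1ULL << (m - 1);
        for (size_t i = 0; i < n; i++) {
            state = ((state << 1) | 1) & mask[(unsigned char)text[i]];
            if (state & accept)
                return (long)(i - m + 1);     /* offset of match start */
        }
        return -1;
    }

    int main(void) {
        printf("%ld\n", shift_and("needle", "finding a needle in a haystack"));
        return 0;
    }

Extensions of this idea handle character classes and simple repeats, but each extension narrows the set of patterns it copes with, which is the caveat problem mentioned above.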
The vast, vast majority of code is not optimized to the level of some linear algebra code. Pointing at MKL or comparable linear algebra packages as an example of why that level of optimization is not "rare" is silly.
I keep seeing complaints about MKL not supporting other architectures. Why do people care? For instance, for linear algebra, use BLIS (like AMD do), or OpenBLAS. Intel actually support BLIS (at least to some extent), and I'm not aware of libxsmm being Intel-only, though I haven't checked. MKL-DNN (whatever it's called now) also appeared to be fairly generic when I looked.
My complaint with MKL is that Intel is intentionally harming performance by only allowing AVX2 code to run on Intel’s own processors. It does not matter that there are alternatives, because having the ability to avoid something doesn’t excuse bad behavior. That said, if you’re relying on a dependency that uses MKL, I guess you are probably SOL without hacks.
I don't know how MKL works internally, but it's been reported that it restricts its fast code paths on AMD processors for no good reason, and that you can override this. I do know that you're harming performance if you use an implementation tuned for a different micro-architecture rather than one that's correctly tuned for yours. Assuming we're talking about GEMM, you don't get the performance just with generic SIMD code; you need to worry about details of caches and instruction latencies in particular. Consider that you should even avoid AVX512 on some processors.
I'm no apologist for Intel, but I suppose people running MKL on AMD would complain that it was slow because it was wrongly tuned. Note that BLIS gets quite close to MKL on Intel anyhow, and the timing variance of a typical HPC application is likely to be greater than the contribution of ~10% from GEMM.
I don't know what people do that's MKL-specific, but its major functionality is available in free implementations that will be infinitely faster on POWER and ARM (where MKL doesn't run at all), quite apart from AMD.
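Just to illustrate the point about caches for anyone following along, here is a toy sketch (nothing like a real BLAS kernel): the two functions compute the same thing, but the tiled version reuses data while it is still resident in cache, and the right tile sizes, like the right instruction mix, depend on the specific micro-architecture.

    /* Toy illustration of why tuning matters: the same O(n^3) multiply,
       untiled vs tiled for cache reuse. Real BLAS kernels go much further
       (packing, register blocking, per-uarch SIMD micro-kernels). */
    #include <stddef.h>

    #define N 512
    #define B 64   /* block size chosen so tiles stay resident in cache */

    void gemm_naive(const double *a, const double *b, double *c) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                for (size_t k = 0; k < N; k++)
                    c[i * N + j] += a[i * N + k] * b[k * N + j];
    }

    void gemm_blocked(const double *a, const double *b, double *c) {
        for (size_t ii = 0; ii < N; ii += B)
            for (size_t kk = 0; kk < N; kk += B)
                for (size_t jj = 0; jj < N; jj += B)
                    for (size_t i = ii; i < ii + B; i++)
                        for (size_t k = kk; k < kk + B; k++)
                            for (size_t j = jj; j < jj + B; j++)
                                c[i * N + j] += a[i * N + k] * b[k * N + j];
    }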
Now, I can’t claim to be an expert by any means, but I feel like you are overcomplicating this. For one, we know that in practice the performance is much better when you force-enable AVX2 on AMD with MKL; it has been benchmarked. For another, that’s already obvious, because even without micro-optimizing we’re still talking about code paths that are able to keep the processor better fed and that have 256-bit integer support. Certainly not everything will always be straightforward, but MKL seems like one of the single most obvious libraries for this.
Further, Intel has a rich history here. They have been sued, and lost, over the Intel C++ Compiler’s “Cripple AMD” bit. The funny thing is, it's still there, although Intel now has warnings throughout their pages. I guess removing a strcmp against the “GenuineIntel” CPUID vendor string was too much work.
Let’s be realistic. Intel wants to win the benchmarks. There’s no need to construct an elaborate reality where this is all done out of concern that they might make performance worse on microarchitectures they did not design.
We use this at GitHub to power token scanning on commits. It is very fast and handles multiple regexes at once; we are looking for secrets from multiple providers. We couldn’t find a better option for this type of usage.
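For readers who haven't used the API, scanning for multiple patterns at once looks roughly like the sketch below. This is not GitHub's actual code, and the two patterns are made-up placeholders for "token-shaped" strings:

    /* Sketch: compile several patterns into one database and scan a block
       of text, getting a callback per match. Patterns are placeholders. */
    #include <hs/hs.h>
    #include <stdio.h>
    #include <string.h>

    static int on_match(unsigned int id, unsigned long long from,
                        unsigned long long to, unsigned int flags, void *ctx) {
        (void)from; (void)flags; (void)ctx;
        printf("pattern %u matched, ending at offset %llu\n", id, to);
        return 0;  /* returning non-zero would stop the scan */
    }

    int main(void) {
        const char *patterns[] = { "AKIA[0-9A-Z]{16}",       /* AWS-style id */
                                   "ghp_[0-9A-Za-z]{36}" };  /* GitHub-style */
        unsigned int flags[] = { HS_FLAG_SINGLEMATCH, HS_FLAG_SINGLEMATCH };
        unsigned int ids[] = { 1, 2 };

        hs_database_t *db = NULL;
        hs_compile_error_t *err = NULL;
        if (hs_compile_multi(patterns, flags, ids, 2, HS_MODE_BLOCK,
                             NULL, &db, &err) != HS_SUCCESS) {
            fprintf(stderr, "compile failed: %s\n", err->message);
            hs_free_compile_error(err);
            return 1;
        }

        hs_scratch_t *scratch = NULL;
        hs_alloc_scratch(db, &scratch);

        const char *data = "blob containing AKIAABCDEFGHIJKLMNOP somewhere";
        hs_scan(db, data, (unsigned int)strlen(data), 0, scratch,
                on_match, NULL);

        hs_free_scratch(scratch);
        hs_free_database(db);
        return 0;
    }

The compile step is the expensive part; the compiled database can be reused across scans (and, as noted further down the thread, serialized to disk).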
For example, if someone commits AWS keys to a GitHub repo, GitHub will alert AWS, AWS will revoke your keys and freeze the account until they're satisfied nothing is compromised.
> Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code).
Private repos aren't scanned, and if we did offer a service to scan private repos we would only ever contact the repo owners, not the service providers. In the case of public repos we're in a race against time: bad actors are scraping the site and have automation set up to use your tokens within seconds of a leak. Private repo scenarios are much simpler to deal with.
ex-Hyperscan guy here: thanks for the kind words. I'm sure my team (who are primarily responsible for the cleanliness - as a developer, I make a great "ideas person") would appreciate it.
Can you disclose how/why you were reading Hyperscan source for work? Just curious, no agenda.
It was for a (now failed) startup selling IPS appliances. What really helped us in the end was that we could run the runtime in C (C++ was not possible); however, the compiler/tooling we used was fairly old (...), so I had to dig into the details of the codebase to make it work.
As a side note, if I had to say something bad about Hyperscan, it would be the lack of high-level documentation. I don't know about now, but back then only a couple of blog articles were available... I'd always been curious: was that intended to prevent copycats? Lack of time? Why not explain more of the high-level math/automata details?
If the high-level documentation were improved, I'm pretty sure the number of companies integrating Hyperscan would increase, and hence Intel's sales would increase (since it has been bought by Intel ;)
The lack of high-level docs up to the point of open sourcing was partly to deter copycats, but mainly due to a lack of resources (chiefly time) for writing. We were a small team, and time spent documenting stuff we all understood pretty well was time not spent chasing customers or improving our product.
Later: well, there is a Hyperscan paper and there may be more material coming out later.
Also, not to be a jerk, but a lot of people claim that they will read/understand/use this kind of documentation, and my experience was that only a fraction actually do, and of that fraction, most don't behave in a way that's useful enough to justify having made all those docs. You're more likely to wind up with people kibitzing and making inane tweaking suggestions ("use more NFAs! no, use more DFAs!"); meaningful OSS contributions, or people using the software when they might not have before, are less likely.
I wrote something very similar many years ago. I described it as a reverse search index: the queries/regexes get indexed into a search tree and then the text is run through it. It supported hundreds of thousands of parallel regex searches. I called it dClass:
That is also an impressive project, with multiple subengines from which it chooses the fastest one. All written by one guy. Basically no documentation, though.
I suspect yes. This would be a wonderful project for someone. Do note that many regexes can be expressed as a dataflow problem. We had an experimental subsystem (called "Puggsley") that was a data-parallel take on the problem; this wasn't published, but is pretty much an independent invention of the same ideas in icgrep.
The major headache is the prospect of loops that span more than one character position - these aren't as trivial to express (I have ideas, naturally).
I don't know how well this fits in with TensorFlow - it may be too fine-grained? But of course, getting onto custom hardware can yield massive speedups.
There is some example code; that's about it. This would be a good project for someone, but there's a lot of code in a "grep" that isn't there in Hyperscan (finding and displaying lines and line numbers, lots of command-line goop and options, pretty colors for results, etc.).
Also, the limitations of Hyperscan might be quite noticeable. It doesn't support all PCRE facilities (e.g. capturing and arbitrary lookaround). It has a considerable compile time - I wouldn't want to use Hyperscan to grep short files! Justin Viiret (another ex-Hyperscan team guy) has a blog post comparing our relatively heavyweight optimization strategy with RE2 (which gets down to the business of scanning a lot quicker than we do). You can find it here: https://01.org/hyperscan/blogs/jpviiret/2017/regex-set-scann...
(sorry about the giant 01.org "dickbar", we couldn't control that)
The upshot is that most people looking for big collections of regexen in huge amounts of data aren't really running 'grep' type tools. If someone wanted to do that, like I said, it would be a good project.
FWIW, the regex engine in ripgrep is entirely pluggable and abstracted (to an extent). It's how it seamlessly supports both PCRE2 and my own regex engine.
It would be an interesting project to add Hyperscan support to it. You would get all the extra goop you mention for free. I would be happy to give guidance on that if anyone's interested.
Any guesstimates as to when it starts to become worth it to use Hyperscan vs PCRE, etc.?
30 regexes and 1 MB of data, for example.
I'm somewhat curious as I have a couple of scraping things I do where I can compile the regex once and keep it hanging around (or save to/load from disk if that's feasible) for 3 to 5 minutes at a time.
Well, 30 regexes is not a great case for libpcre, as libpcre really doesn't have a multiple-regex mode. Thanks to some excellent work in libpcre2 on accelerated searching, it's no longer the "7-10x slower than Hyperscan on a single regex" case of the past (it's still frequently slower), but it wouldn't surprise me if Hyperscan was 50-100x faster than libpcre once compile time isn't a factor.
However, the compile time of Hyperscan for 30 regexes might not be entirely trivial (maybe a second, probably less). With a megabyte of data you'd probably see benefits, but I think they might not be drastic.
Probably the best way to find out would be to fire up hsbench (a tool that comes with Hyperscan now), figure out the slightly weird corpus format (sorry) and get some numbers for yourself on your own workload.
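On the save-to/load-from-disk part of the question: the library does support that, via hs_serialize_database and hs_deserialize_database, so the compile cost can be paid once. A rough sketch of two helper functions (error and I/O checks trimmed):

    /* Sketch: persist a compiled Hyperscan database and load it back,
       so pattern compilation happens once rather than on every run. */
    #include <hs/hs.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void save_db(const hs_database_t *db, const char *path) {
        char *bytes = NULL;
        size_t len = 0;
        hs_serialize_database(db, &bytes, &len);  /* allocates 'bytes' */
        FILE *f = fopen(path, "wb");
        fwrite(bytes, 1, len, f);
        fclose(f);
        free(bytes);
    }

    static hs_database_t *load_db(const char *path) {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long len = ftell(f);
        rewind(f);
        char *bytes = malloc((size_t)len);
        fread(bytes, 1, (size_t)len, f);
        fclose(f);

        hs_database_t *db = NULL;
        hs_deserialize_database(bytes, (size_t)len, &db);
        free(bytes);
        return db;
    }

One caveat: a serialized database records the platform it was compiled for, so deserialization fails on a machine that can't run it (e.g. an AVX512 database on a CPU without AVX512).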
You are correct that the failure of Sensory Networks from 2006-2009 to develop an effective FPGA matching algorithm does not imply that no-one can do it. However, the rest of what you said is an assertion for which there isn't really that much proof. FPGA is wonderful when you have the sort of available parallelism it needs; it's not great competing with s/w if the problem is expressed in algorithms with serial dependencies.
I think FPGA could do a great job of achieving worst-case performance by simulating NFAs in a very straightforward fashion (beating s/w). However, no-one does this as the headline numbers are terrible.
The thread you posted contains an assertion by someone who built an FPGA parser with no performance numbers to back it up; the performance comparison may well reflect relative skills with software and hardware as opposed to the limits of software. Interestingly, one of the first responses refers to another project of mine. :-) (simdjson). Unfortunately, given the subsequent posts on the thread, it does not look like the author can provide further details on the system, which is sad.
[1]: http://intel.github.io/hyperscan/dev-reference/api_constants...