1) I see they are doing fully qualified imports. Please add major bounds to your version deps; Stack doesn't excuse not tracking which versions you used!
2) They have

    data Next a
      = Done !a
      | Step !a

in their code; this tells me they could possibly have benefited from using the stream abstraction in Vector!
https://hackage.haskell.org/package/vector-0.12.0.2/docs/Dat... is a link to the latter
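For reference, the stream Step type from the vector package (paraphrased from Data.Vector.Fusion.Stream.Monadic) has much the same shape:

    data Step s a
      = Yield a s   -- produce an element and a new seed
      | Skip    s   -- new seed, no element
      | Done        -- stream exhausted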
The exact versions they used are specified by the stack.yaml file.
In this case, see https://stackage.org/lts-13.10
Stack has a feature that lets you automatically add bounds to a Hackage upload based on the stack.yaml. (However, I personally do not recommend using this feature, as those bounds usually end up being overly restrictive. Manually written bounds are better.)
I do agree that human-judgement-based bounds are best, but Stackage snapshots aren't that. Furthermore, Stackage/FPCo LTS is not LTS in the same sense as Linux distributions and their BSD-Unix cousins: proper long-term support in any meaningful sense is a commitment by the distro provider to backport security and bug fixes to the LTS. Stackage LTS doesn't do that.
Stack tricks its users into not thinking about their relationship with their dependencies, and it poisons library code so that it isn't usable by Haskell devs/users/contributors who don't use Stack. :)
the more you know! :)
> Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double.
This would scare me away from doing any real performance optimization in a language as high-level as Haskell: am I just fiddling with compiler internals now?
> A naive Python script that loops over all needles and calls str.find in a loop to find all matches.
> working through all 3.5 gigabytes of test data once
> Time is ~6.5s.
I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C/Rust/hand-written Haskell implementations. In many, many more cases, well, even Python with a naive implementation is good enough.
Not to diss Haskell or Rust. Only saying: choose your tools as you see fit, and benchmark.
Also, kudos to the authors for using the mean instead of the average in the graphs.
I'm not going to outright call this "cheating", but most of the time, virtually unconditionally, when Python is fast it's because it isn't Python, it's C.
It definitely feels worth noting that Python is radically slower than other languages, and the main "tool" it provides for improving performance is to rely on another language entirely.
But I don't disagree that Python owes its performance to C... it owes everything to C :)
What do you mean? "Mean" with no qualification usually denotes arithmetic average (or mean).
For benchmarks, the geometric mean would be a much better option, because what matters is the ratio between result values, not the difference.
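Adding the textbook definition for anyone following along (not from the article): the geometric mean of n results r_1 .. r_n is (r_1 * r_2 * ... * r_n)^(1/n). For normalised benchmark numbers it has the nice property that comparisons don't depend on which implementation you pick as the baseline, which is not true of the arithmetic mean.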
Just because it isn't identical in implementation to the other languages doesn't mean it's much simpler. It's fast because enough people cared to make it that way - it wasn't by some accident.
Once you start working on that data using procedures actually written in Python, performance is usually far from stellar.
I have had that happen to me many times. Back in the Guile 2.0 days I tried porting some utils to Python, based on preliminary benchmarks where Python was often an order of magnitude faster than my Guile versions, but once the actual logic was implemented the difference was gone. Then Guile 2.2 (and now Guile 3) happened and everything got magically faster, even though less and less of the runtime is written in C.
It's a shame Haskell copies strings across FFI boundaries, because binding the Rust version would have been almost as fast, with (presumably) significantly less effort.
Presumably creating lots of small strings is a typical use case for Text, so losing the compacting GC hurts. ByteString, on the other hand, is used more with binary data and is designed for FFI use.
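A minimal sketch of what "designed for FFI use" buys you (my example, not from the article; unsafeUseAsCStringLen is real bytestring API, withHaystack is a name I made up):

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Unsafe as BU
    import Foreign.C.String (CStringLen)

    -- Hand foreign code a pointer into the ByteString's existing buffer
    -- without copying; the callee must not mutate it or keep the pointer.
    withHaystack :: B.ByteString -> (CStringLen -> IO a) -> IO a
    withHaystack = BU.unsafeUseAsCStringLen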
Out of curiosity: how much optimization potential is there in the software that BurntSushi wrote in Rust?
In the literature, there are generally two different variants of Aho-Corasick. The traditional formulation follows failure transitions during search, which comes with overhead. In effect, it's an NFA since search can traverse through multiple states for each byte of input that it sees. The other variant is the "advanced" form, which is just effectively turning the NFA into a DFA by precomputing all of the failure transitions into a single table. This can be quite a bit faster depending on the workload, but obviously uses more memory and takes more time to build. My library supports the DFA version, but it had a bug (which has now been fixed) which prevented the OP from effectively benchmarking it.
The OP hints that some substantial number of samples use a very small number of patterns. Sometimes only one. In that case, it probably wouldn't be best to use Aho-Corasick at all. Instead, it should use an optimized single substring search algorithm, such as Two Way. Even for multiple patterns, if there aren't that many of them, it might be better to use Teddy from Hyperscan.
This article is fairly timely. I'm almost done with a rewrite of my Aho-Corasick library. It will bring some small performance improvements (but more importantly a simpler API), and I hope to move the Teddy algorithm into it at some point.
As always, specific workloads can often have their performance improved quite a bit.
The DFA approach is something that I had not heard about before (I never looked into what "full" meant), but we should definitely try it; we reuse the automaton against possibly millions of haystacks, so extra memory or preprocessing time is likely worth it. Thanks for pointing this out!
An adaptive approach would indeed work better for fewer needles, but the threshold depends a lot on the data. We compared Alfred–Margaret against ICU regexes for case-insensitive replaces, and we found that even for a single needle, ours was faster in half of the cases. (It is a bit unfair because of FFI call overhead.)
Many thanks for your contribution. I was not expecting to see such an extensive elaboration. Looking forward to seeing your rewrite and learning something from it!
AC sucks. It's simple enough, but it's always going to be either slow or big - sometimes both. I'd say that's a source of frustration to me, but it's not really, as we made a lot of money selling string matching to customers back when Hyperscan was closed-source. It's so much easier to sell string matching than regex matching as regexes are full of surprises, but strings are quite tractable.
I'm not delighted with where we left string matching in Hyperscan - "FDR" was wheezing under load in a bunch of ways, and I had some better things in the pipeline. But it still should generally murder AC barring some corner cases.
Unfortunately, Hyperscan has moved away from having a pure-play literal matcher in it (what remains is really more of a specialist 1- to 8-char literal matcher), which I regard as a semi-mistake (so there may be more corner cases than there used to be).
One of the hurdles for using the algorithms from Hyperscan is that it's not always clear when particular algorithms are used and in what conditions. I have the same problem in my regex library, but Hyperscan takes it to another level because it's so much more sophisticated. :)
FDR is for more literals than Teddy. It's Noodle that's for single literals. Hard to imagine how you could get confused given that the names we chose are so transparent. :-)
I've read masters theses that were significantly less thorough and well-researched.
I don’t have much to add to BurntSushi’s reply. I studied the implementation briefly to take inspiration from when we decided to write our own, and there is no obvious low-hanging fruit. One small thing is that it uses a Vec of transitions per state. In Alfred–Margaret we pack all transition tables together in one array (with a separate array that maps state to start index), so it can use the cache more efficiently.
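A rough sketch of that packed layout (a toy version of my own, not Alfred-Margaret's actual code; all the names are mine):

    import qualified Data.Vector.Unboxed as U
    import Data.Word (Word16)

    -- All per-state transition lists concatenated into one flat array, plus an
    -- offsets array mapping each state to the start of its slice.  offsets has
    -- numStates + 1 entries so the last state's slice is bounded as well.
    data Packed = Packed
      { transitions :: U.Vector (Word16, Int)  -- (code unit, next state), grouped by state
      , offsets     :: U.Vector Int            -- offsets U.! s = first transition of state s
      }

    followEdge :: Packed -> Int -> Word16 -> Maybe Int
    followEdge (Packed ts offs) state c = go (offs U.! state)
      where
        end = offs U.! (state + 1)
        go i
          | i >= end            = Nothing
          | fst (ts U.! i) == c = Just (snd (ts U.! i))
          | otherwise           = go (i + 1)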
There are ways to do parallel automaton traversal with SIMD, but it only works for a very small number of states.
Yeah, the DFA packs everything into a single table. The NFA keeps them distinct so that it can support a dense representation near the root of the trie (for speed) but a sparse representation farther away from the root (for smaller memory footprint). (Confusingly, "dense" and "sparse" are flipped in the source code, much to my chagrin.) It would be possible to convert this into contiguous memory, but I'm not sure it's worth it.
Beyond that, the other possible (maybe micro) optimizations in this realm are:
1. Use equivalence classes of bytes for your state transitions instead of all possible byte values. However, this is only applicable in a dense representation, and even then, if your alphabet is 2^16 instead of 2^8, it's not clear how much this helps. Byte classes require an extra lookup at search time, but the decreased size of the automaton can be dramatic, which means better locality and overall better performance. But again, this is compared to the typical dense representation, which is probably not relevant in your case.
2. Premultiply state identifiers. Again, probably only pertinent for a dense representation. Although, it sounds like this could eliminate your extra table that maps state identifiers to start indices. The only real downside of this is that it potentially increases the minimum required integer size to represent state identifiers. But that's only relevant if you support using smaller integer sizes for state identifiers in the first place.
3. Rearrange the states such that all match states precede all non-match states (with perhaps an exception or two for the fail/dead states, depending on how you implement Aho-Corasick). In the core match loop, a match state can be determined with a comparison against a fixed integer that's probably in a register instead of a memory access. (Points 2 and 3 are sketched in code a little further down.)
With all that said, the best possible next step for y'all is to probably switch to a dense representation. But that might imply finding a way to efficiently use UTF-8 since a dense representation with a 2^16 sized alphabet is tricky.
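To make points 2 and 3 concrete, here is a toy dense layout (my sketch, not code from either library; all the names are made up): state identifiers are premultiplied by the alphabet size, and match states are numbered before non-match states.

    import qualified Data.Vector.Unboxed as U
    import Data.Word (Word8)

    data DenseDfa = DenseDfa
      { transitions   :: U.Vector Int  -- length = numStates * 256, entries are premultiplied ids
      , maxMatchState :: Int           -- premultiplied id of the last match state
      }

    -- Following a byte is one addition and one read; no multiply at search time.
    stepByte :: DenseDfa -> Int -> Word8 -> Int
    stepByte dfa sid b = transitions dfa U.! (sid + fromIntegral b)

    -- The match check is a comparison against a value that can live in a register.
    isMatch :: DenseDfa -> Int -> Bool
    isMatch dfa sid = sid <= maxMatchState dfa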
All of this stuff should be available in my rewrite of the crate. It's also in my regex-automata crate.
based on a very readable paper https://www.microsoft.com/en-us/research/wp-content/uploads/...
It doesn't allow expansion up to 32 bits. How would that even be possible? Some of the bits are needed to indicate that a second pair of bytes follows, so there's a hole in the range and you get less than 32 bits.
UCS-2 is limited to only 16-bit representation, making it _impossible_ to express Unicode values outside the basic multilingual plane (BMP).
Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8, except the state machine is simpler (as it's just 1–2 code units per codepoint, instead of 1–4). It also means that if you validate that you have a high surrogate and a low surrogate, the combination is guaranteed to be a valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any 1-byte codepoint can actually be encoded using 2 or 3 bytes instead, which the state machine must reject).
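Concretely, once the two code units have been validated, combining them is just bit twiddling (my sketch of the standard formula; the function name is made up):

    import Data.Bits (shiftL, (.&.))
    import Data.Word (Word16)

    -- hi must already be in 0xD800-0xDBFF and lo in 0xDC00-0xDFFF; the low
    -- 10 bits of each code unit form a 20-bit offset above the BMP.
    decodeSurrogatePair :: Word16 -> Word16 -> Int
    decodeSurrogatePair hi lo =
      0x10000
        + (fromIntegral (hi .&. 0x3FF) `shiftL` 10)
        + fromIntegral (lo .&. 0x3FF)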
"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many unicode code points as there are (0–0x10FFFF)."
This explanation can't be quite right, because the possible surrogate code unit values are carved out of the set of 2^16 code units. So you are double counting those surrogate code units.
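To put numbers on that: single code units that are not surrogates cover 2^16 - 2048 = 63,488 values, and surrogate pairs cover another 2^20 = 1,048,576, for 1,112,064 Unicode scalar values in total. The quoted 2^20 + 2^16 = 0x110000 = 1,114,112 only works out because it also counts the 2,048 surrogate code points themselves, which are never assigned characters.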
"Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair."
"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points."
The other common fusion technique used in Haskell is stream fusion, which is very similar to the mutual-recursion-encoded-as-a-sum-type version in the blog post.
There is also indexing-based fusion which the Repa library uses. It allows more control over traversal order (think optimizing stencil codes) but requires regular arrays, e.g. no filtering.
There are some other research projects, like specialized compiler passes via HERMIT, but AFAIK nothing really production-ready.
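If anyone wants a feel for stream fusion, here is a toy version of the encoding (heavily simplified from the vector package, which uses a monadic variant; mapS and these type definitions are my own):

    {-# LANGUAGE ExistentialQuantification #-}

    data Step s a = Yield a s | Skip s | Done

    -- A stream is just a seed plus a step function; nothing is materialised.
    data Stream a = forall s. Stream (s -> Step s a) s

    -- Composing maps composes the step functions, so mapS f . mapS g
    -- allocates no intermediate structure.
    mapS :: (a -> b) -> Stream a -> Stream b
    mapS f (Stream step s0) = Stream step' s0
      where
        step' s = case step s of
          Yield a s' -> Yield (f a) s'
          Skip s'    -> Skip s'
          Done       -> Done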
To do this right you need a fast prefilter and a reasonably fast string match. Our string matcher in Hyperscan blew AC away on size and speed, typically. Unfortunately, I never had a chance to build its successor. But AC is a disaster of one dependent load after another and is mediocre at best in terms of performance.
1. There is no write up of the Hyperscan algorithm of which you speak. I wrote up a part of Teddy, but I still don't understand the full algorithm. Hyperscan's code takes serious effort to understand. I spent days staring at Teddy and I still didn't get everything.
2. Aho-Corasick is reasonably simple to implement and does fairly decently for a large number of tasks.
3. Aho-Corasick is trivially portable.
The fact that AC is portable and simple and that everyone feels good about themselves as a result of implementing a classical-sounding algorithm (why AC and not Rabin-Karp? who knows?) is why it keeps cropping up, like kudzu. It's still either (a) huge, (b) slow, or (c) both, and (d) a poor fit for modern architectures - it has a way of turning the problem into a latency problem for whichever level of memory its structures fit into (rather than a throughput problem).
It's a solution. Big deal. There are half-a-dozen better ones just floating around in the literature, but this shitty one seems to be the one people fixate on. Maybe it's because Aho and Corasick sort earlier in the dictionary than Rabin and Karp?
I'll have to take a closer look at FDR. I thought it was for single pattern search, not multi pattern search.
I don't really understand why you're being so bombastic about this. AC has a number of nice qualities, as I mentioned above. We don't need to resort to bullshit about how it's only used because it makes people feel good about themselves. That's not a good way to win people over to your cause.
Rabin-Karp needs a fair bit of help to work; I mention it as a starting point. As a core matching technique it isn't much more fancy than "put your stuff in a hash table". Handling strings of different length is in fact hard but using multiple tables is viable. It scales a lot better than Aho-Corasick in that you can engineer the size of the hash tables and anticipate performance.
Conversely, AC just develops a huge 'root cloud' when the pattern sets get big - increasingly, you're always in a prefix of something. It tends to benchmark well (sorta, kinda) especially if you feed it input that doesn't look much like what you're trying to match. You then just bounce around a couple shallow states that get to be resident in L1 and you look pretty smart.
Also, in Rabin-Karp or fancy variants, every access is independent of every other, a property that you can only dream of in AC as you pootle from state to state making reads that depend on what you did last.
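A toy illustration of the "one hash table per needle length" idea (my sketch, nothing to do with Hyperscan's actual hlm/FDR; a real version would use a rolling hash so each window costs O(1) rather than O(length)):

    import qualified Data.ByteString as B
    import qualified Data.HashMap.Strict as HM
    import Data.Function (on)
    import Data.Hashable (hash)
    import Data.List (groupBy, sortOn)

    -- Group needles by length, build one table per length, slide a window of
    -- each length over the haystack, look it up, and verify on a hit.  Every
    -- lookup is independent of the previous one.
    findMatches :: [B.ByteString] -> B.ByteString -> [(Int, B.ByteString)]
    findMatches needles haystack =
      [ (i, w)
      | (len, table) <- tables
      , i <- [0 .. B.length haystack - len]
      , let w = B.take len (B.drop i haystack)
      , needle <- HM.lookupDefault [] (hash w) table
      , needle == w
      ]
      where
        tables =
          [ (B.length (head grp), HM.fromListWith (++) [(hash n, [n]) | n <- grp])
          | grp <- groupBy ((==) `on` B.length) (sortOn B.length needles)
          ]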
"FDR" works on a different principle - it is 'bucketed SIMD shift-or with super-characters' (see the paper).
I have a third matcher in the wings based on PEXT, which I must finish up.
Honestly I don't care what people do. Arguably a good deal of the money we made from Hyperscan in its closed-source era came from selling a 'regex' matcher that was really just doing the simplistic case of multiple literal matching. Our Rabin-Karp thingy (multiple hash tables with different lengths and strides) was called "hlm" (originally, "Vomitsauce"), took a few pages of code, wasn't that conceptually complex, and probably by itself (sans regex) netted a few hundred thousand a year (we eventually replaced it with FDR). This is borderline ridiculous; making money off a big complex regex matcher seemed well-deserved, but making money off a simplistic task like this seemed like picking up bills off the ground.
What frustrates me is that people don't generally look beyond one solution, don't innovate, and don't care that they aren't innovating or reading the literature. Apparently algorithm design is a parlor trick for getting into Google, not a thing that one might actually do.
You can't have it both ways. You clearly care to some extent. Whatever though. Clearly we disagree on how to approach this stuff. I realize you probably don't intend this, but you really come off as belittling others for not being as smart as you. It's not cool.
> Conversely, AC just develops a huge 'root cloud' when the pattern sets get big - increasingly, you're always in a prefix of something. It tends to benchmark well (sorta, kinda) especially if you feed it input that doesn't look much like what you're trying to match. You then just bounce around a couple shallow states that get to be resident in L1 and you look pretty smart.
The OP constructed a benchmark from exactly the task they are trying to do and did algorithm development until they got a better result. That seems like exactly the right way to approach this. Not everyone is in a position to make an extra few hundred thousand dollars by switching from AC to Vomitsauce.
That's all I'm trying to do too. It's going to take a _long_ time for the field to collectively catch up to Hyperscan. It'll take me years, if I keep on my current path.
As for the rest I'm done. I don't think I can adequately explain my irritation at the general state of discourse about the things I worked on for 12+ years without coming off in all sorts of ways I don't really intend. Suffice to say I've seen things you people wouldn't believe, strawmen comparison implementations on fire off the shoulder of Orion, etc. :-)
I will try to write up my new string matcher at some point, in a less convoluted context than Hyperscan. HS, from sheer complexity, makes almost all algorithms more or less undiscoverable - it's hard to avoid having some busy-body subsystem jump in front of the simple demo you're trying to do.
Right. I meant it as a short-hand way of saying, "it will take a while for the field to catch up to the battle tested knowledge you've accumulated, of which, some is expressed inside Hyperscan." :-)
It also seemed to require a lot of work to construct the code to make tail call optimization happen, but that’s par for the course in Haskell.
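For readers who haven't seen it, the restructuring usually ends up looking something like this (a generic illustration, not code from the article):

    {-# LANGUAGE BangPatterns #-}

    -- A strict accumulator plus a local worker loop: GHC compiles the
    -- recursion as a jump, and the bangs keep thunks from piling up.
    countMatchesUpTo :: Int -> (Int -> Bool) -> Int
    countMatchesUpTo n p = go 0 0
      where
        go !acc !i
          | i >= n    = acc
          | p i       = go (acc + 1) (i + 1)
          | otherwise = go acc (i + 1)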
Either way, optimization in any language often requires annotations. The main benefit of Haskell, though, is still equational reasoning, which this does not obviate. You can get rid of all the annotations and your code will still run, still be correct, and can still be reasoned about.
Similarly, the existence of generators and zero-argument functions in Python doesn't undermine the idea that strictness by default could have benefits.
Though, I suppose you could then say that if the rules are different for low level optimization, they may as well have imported an external library written in Rust.
I'm the author of pyahocorasick, a Python C extension that implements Aho-Corasick. I'd love to see it in the comparison. :)