Hacker News new | past | comments | ask | show | jobs | submit login
How we made Haskell search strings as fast as Rust (channable.com)
246 points by duijf on March 13, 2019 | hide | past | favorite | 87 comments

as maintainer of two of the haskell libraries they use, a member of the core libraries committee, and a hackage trustee, i'd like to throw in my two cents:

1) i see they are doing fully qualified imports. please add major bounds to your version deps, stack doesn't excuse not tracking what versions you used!

2) they have

"data Next a = Done !a | Step !a " in their code,

this tells me they could have possibly benefited using the stream abstraction in Vector!

https://hackage.haskell.org/package/vector- is a link to the latter

About (1).

The exact versions they used are specified by the stack.yaml file.

In this case, see https://stackage.org/lts-13.10

Stack has a feature that lets you automatically added bounds to a hackage upload based on the stack.yaml. (However, I personally do not recommend using this feature, as those bounds usually end up being overly restrictive. Manually-written bounds are better.)

yet its off by default and most users dont use it.

I do agree human judgement based bounds are best, but stackage snapshots aren't that. Furthermore, stackage/fpo LTS is not LTS in the same sense as linux-distributions and their BSD-unix cousins. Proper long term support in any meaningful sense is a commitment by the distro provider to backport security/bug fixes to the LTS. Stackage LTS doesnt do that.

stack tricks its users into not thinking about their relationship with their dependencies and poisons library code from being usable by haskell devs/users/contributors who dont use stack. :)

the more you know! :)

I can't speak to the content or Haskell at all but I really like the clean formatting of the post, particularly the charts and images. Does anyone know what tool might have been used to produce them? So clean and crisp.

It's LaTeX, tikz, and pgfplots for the images. Complied with xelatex to pdf and pdf2svg to svg. The rest is custom HTML and CSS. We use the Hakyll static site generator.

Yes, can confirm. Check out [1] and [2] to learn more about TikZ and PGFPlots.

[1]: http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf [2]: http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pg...

I particularly appreciate how the text in the diagrams matches the font (& font size) of the body text. Almost nobody does that and it looks great.

The graph looks a lot like GraphViz output. Not sure about the charts, however.

As always, I will bitch about the colour of the text being grey (#333) instead of black. Hurts readability, doesn't improve design.

#333 is a great color for text on the web.

That depends on the medium used to view the web. On my KoboHD e-ink reader, grey text is not legible. It's sad because thousands of web sites that would be gorgeously legible if their text were #000 are not. But this is more of a browser issue; Kobo should disable colors IMO. And this is something that people can do on regular browsers.

objectively its color contrast ratio on a white bg is more than high enough to meet web content accessibility guidelines

Always depends on the color of the background : )

Well, I disagree.

Impressive work! The clincher is at the end

> Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double.

This would scare we about doing any real performance optimization in a language as high-level as Haskell: am I just fiddling with compiler internals now?

What actually amazes me is that... Python is good enough:

> A naive Python script that loops over all needles and calls str.find in a loop to find all matches.

> working through all 3.5 gigabytes of test data once

> Time is ~6.5s.

I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C/Rust/hand-written Haskell implementations. In many-many more cases, well, even Python with a naive implementation is good enough.

Not to diss Haskell or Rust. Only saying: chose your tools as you see fit, and benchmark.

Also kudos to the authors to use mean instead of average in the graphs.

> Firstly, the program spends most of its time in the heavily-optimized str.find function, which is implemented in C.

I'm not going to outright call this "cheating", but most of the time, virtually unconditionally, when Python is fast it's because it isn't Python, it's C.

It definitely feels worth noting that Python is radically slower than other languages, and the main "tool" it provides for improving performance is to rely on another language entirely.

Well Python is written IN C, at least the standard version is, so I really wouldn't call it anywhere near cheating... :).

But I don't disagree that the Python owe's it's performance to C.. it owes everything to C :)

I think there's a very big difference between the runtime being implemented in C and the libraries being implemented in C.

Example Java: Runtime written in C++, practically all libraries written in Java, including the standard library, and Java libraries are greatly preferred.

> Also kudos to the authors to use mean instead of average in the graphs.

What do you mean? "Mean" with no qualification usually denotes arithmetic average (or mean).

Argh! Confused it with median. No kudos then!

> Also kudos to the authors to use mean instead of average in the graphs.

For benchmarks much better option would be geometric mean. Because the relation of the results values not the difference.


> even Python with a naive implementation is good enough. There is nothing naive about their implementation. 1) It uses C FFI (as many low level python functions do) 2) It uses a suboptimal but still not naive algo.

Just because it isnt identical in implementation to the other languages doesn't mean its much simpler. Its fast because enough people cared to make it that way - it wasn't by some accident

That is a fragile benchmark. For simple scripts where you rely exclusively on python procedures that are written in C python is often very fast.

Once you start working on that data using procedures actually written I python performance is usually far from stellar.

I have had that happen to me many times. Back in the guile 2.0 days I tried porting some utils to python based on preliminary benchmarks that were often an order of magnitude faster than my guile ones, but when the logic was implemented the difference was gone. Then guile 2.2 (and now guile 3) happened and everything got magically faster, even though less and less of the runtime is written in C.

> Clearly there was plenty of room for speeding up our string matching. We considered binding the Rust library, but due to the way that Haskell’s foreign function interface interacts with the garbage collector, passing a string to a foreign function requires copying it. That copy alone puts any foreign library at a significant disadvantage, so we decided to see how close we could get in Haskell alone.

It's a shame Haskell copies strings across FFI boundaries, because binding the Rust version would have been almost as fast, with (presumably) significantly less effort.

As I wrote elsewhere that depends on the string type used: https://www.reddit.com/r/haskell/comments/b0l93l/how_we_made...

I wondered about this and poked at the internals of ByteString (suitable for ffi) and Text. Turns out Text could allow ffi by changing a single function (newByteArray to newAlignedPinnedByteArray).

Presumably creating lots of small strings is a typical use case for Text so losing compacting gc hurts. ByteString on the other hand is more used with binary data and is designed for ffi use.

Hmm ByteString has the opposite issue and solved it with ShortByteString. Seems like PinnedText could be the analogue.

Interesting article.

Out of curiosity: how much optimizing potential is in the software that was programmed by Burntsushi in Rust?

There is definitely some room. As usual, it depends. I'm away from the computer, so I haven't had a chance to look and see if the OP's benchmarks are published. So it's hard for me to say anything specific at the moment.

In the literature, there are generally two different variants of Aho-Corasick. The traditional formulation follows failure transitions during search, which comes with overhead. In effect, it's an NFA since search can traverse through multiple states for each byte of input that it sees. The other variant is the "advanced" form, which is just effectively turning the NFA into a DFA by precomputing all of the failure transitions into a single table. This can be quite a bit faster depending on the workload, but obviously uses more memory and takes more time to build. My library supports the DFA version, but it had a bug (which has now been fixed) which prevented the OP from effectively benchmarking it.

The OP hints that some substantial number of samples use a very small number of patterns. Sometimes only one. In that case, it probably wouldn't be best to use Aho-Corasick at all. Instead, it should use an optimized single substring search algorithm, such as Two Way. Even for multiple patterns, if there aren't that many of them, it might be better to use Teddy from Hyperscan.

This article is fairly timely. I'm almost done a rewrite of my Aho-Corasick library. It will bring some small performance improvements (but more importantly a simpler API), and I hope to move the Teddy algorithm into it at some point.

As always, specific workloads can often have their performance improved quite a bit.

Thanks for the elaborate reply! Our benchmark input data contains customer data which we cannot share, but we could share the benchmark program.

The DFA approach is something that I had not heard about before (I never looked into what "full" meant), but we should definitely try it; we reuse the automaton against possibly millions of haystacks, so extra memory or preprocessing time is likely worth it. Thanks for pointing this out!

An adaptive approach would indeed work better for fewer needles, but the threshold depends a lot on the data. We compared Alfred–Margaret against ICU regexes for case-insensitive replaces, and we found that even for a single needle, ours was faster in half of the cases. (It is a bit unfair because of FFI call overhead.)

Dear burntsushi,

many thanks for your contribution. I was not expecting to see such an extensive elaboration. Looking forward to see your rewrite, and learn something from it!


You should borrow "FDR" from Hyperscan as well. :-)

AC sucks. It's simple enough, but it's always going to be either slow or big - sometimes both. I'd say that's a source of frustration to me, but it's not really, as we made a lot of money selling string matching to customers back when Hyperscan was closed-source. It's so much easier to sell string matching than regex matching as regexes are full of surprises, but strings are quite tractable.

I'm not delighted with where we left string matching in Hyperscan - "FDR" was wheezing under load in a bunch of ways, and I had some better things in the pipeline. But it still should generally murder AC barring some corner cases.

Unfortunately, Hyperscan has moved away from having a pure-play literal matcher in it (what remains is really more of a specialist 1- to 8-char literal matcher), which I regard as a semi-mistake (so there may be more corner cases than there used to be).

I thought FDR was for single literals only?

One of the hurdles for using the algorithms from Hyperscan is that it's not always clear when particular algorithms are used and in what conditions. I have the same problem in my regex library, but Hyperscan takes it to another level because it's so much more sophisticated. :)

(see other response)

FDR is for > literals than Teddy. It's Noodle that's for single literals. Hard to imagine how you could get confused given that the names we chose are so transparent. :-)

Hah. I've only made it halfway through your paper so far. Excited to read the rest. :)

I recommend reading https://blog.burntsushi.net/ripgrep/

I've read masters theses that were significantly less thorough and well-researched.

I just had a quick look - indeed, the level of detail in the ripgrep blog post is impressive. Many thanks for the link!

Let me start by saying that I am a big fan of BurntSushi’s work, and this post was by no means intended to detract from that. If anything, it set a high bar for us.

I don’t have much to add to BurntSushi’s reply. I studied the implementation briefly to take inspiration from when we decided to write our own, and there is no obvious low-hanging fruit. One small thing is that it uses a Vec of transitions per state. In Alfred–Margaret we pack all transition tables together in one array (with a separate array that maps state to start index), so it can use the cache more efficiently.

There are ways to do parallel automaton traversal with SIMD, but it only works for a very small number of states.

Thanks for the kind words. :)

Yeah, the DFA packs everything into a single table. The NFA keeps them distinct so that it can support a dense representation near the root of the trie (for speed) but a sparse representation farther away from the root (for smaller memory footprint). (Confusingly, "dense" and "sparse" are flipped in the source code, much to my chagrin.) It would be possible to convert this into contiguous memory, but I'm not sure it's worth it.

Beyond that, the other possible (maybe micro) optimizations in this realm are:

1. Use equivalence classes of bytes for your state transitions instead of all possible byte values. However, this is only applicable in a dense representation, and even then, if your alphabet is 2^16 instead of 2^8, it's not clear how much this helps. Byte classes require an extra lookup at search time, but the decreased size of the automaton can be dramtic, which has means better locality and overall better performance. But again, this is compared to the typical dense representation which is probably not relevant in your case.

2. Premultiply state identifiers. Again, probably only pertinent for a dense representation. Although, it sounds like this could eliminate your extra table that maps state identifiers to start indices. The only real downside of this is that it potentially increases the minimum required integer size to represent state identifiers. But that's only relevant if you support using smaller integer sizes for state identifiers in the first place.

3. Rearrange the states such that all match states precede all non-match states (with perhaps an exception or two for the fail/dead states, depending on how you implement Aho-Corasick). In the core match loop, a match state can be determined with a comparison against a fixed integer that's probably in a register instead of a memory access.

With all that said, the best possible next step for y'all is to probably switch to a dense representation. But that might imply finding a way to efficiently use UTF-8 since a dense representation with a 2^16 sized alphabet is tricky.

All of this stuff should be available in my rewrite of the crate. It's also in my regex-automata crate.

Nobody's perfect, but BurntSushi isn't someone who typically does things by half measures.

That would be ToastedSushi.

let's not push him over the pyrolysis threshold

Why UTF-16? Seems like an odd choice of character encodings.

It's what Haskell's 'text' type uses internally. This stackoverflow question seems to discuss why that is the case:


Just to add, there are a few use cases in what UTF-8 makes a large difference, and there is a UTF-8 version of Text on Hackage for those cases.

Sadly every language developed in the 90s has this flaw. Stuck somewhere between bytes and actual UNICODE.

To be more clear, this flaw happened because way back when we thought UCS-2 was sufficient. Then UTF-16 happened and all these systems designed for UCS-2 had to be hacked up for UTF-16 instead.

To be even more clear, this happened because early versions of Unicode consisted of fewer than 2^16 code points, and this was thought to be enough. In 1996, Unicode 2.0 was published and allowed characters whose encoding required more than 16 bits. This necessitated UTF-16 which allows expansion up to 32 bits to represent characters in higher planes (and is compatible with most text that had been encoded in UCS-2).

> allows expansion up to 32 bits

It doesn't allow expansion up to 32 bits. How would that even be possible? Some of the bits are needed to indicate to use the second two bytes, so there's a hole in the range and it's less than 32 bits.

32 bit-representations (compared to UCS-2), not 32 bits worth of values, so it can represent all of Unicode.

UCS-2 is limited to only 16-bit representation, making it _impossible_ to express Unicode values outside the basic multilingual plane (BMP).

Correct, this is what I meant. UTF-16 can use up to 32 bits to represent a larger number of characters, but cannot uniquely represent 2^32 possible characters.

"Surrogate pairs" use two code units (32 bits total) to increase the number of code points that can be represented. It seems a lot more complicated than UTF-8. I have no idea how many code points can actually be represented in theory. There are under 21 bits' worth of Unicode codepoints, though, so I'm sure the claim that UTF-16 can represent all of them isn't some kind of trick.

A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many unicode code points as there are (0–0x10FFFF).

Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8 except the state machine is simpler (as it's just 1–2 code units per codepoint, instead of 1–3). It also means that if you validate that you have a high surrogate and a low surrogate, the combination is guaranteed to be a valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any 1-byte codepoint can actually be encoded using 2 or 3 bytes instead, which the state machine must reject).

Wait a minute.

"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is precisely as many unicode code points as there are (0–0x10FFFF)."

This explanation can't be quite right, because the possible surrogate code unit values are carved out of the set of 2^16 code units. So you are double counting those surrogate code units.

From Wikipedia: "Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair."

OK, I've found the gross reason why the 2^16 + 2^20 arithmetic works:

"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points."


Errata: I stated UTF-8 decoding was 1–3 bytes per codepoint. It's 1–4.

One of us should try to make this clearer on the relevant Wikipedia page.

UTF-16 encodes a wider character with two surrogate pairs, each pair holding 10 bits of the codepoint. That's why Unicode has 17 planes each containing 65,536 characters, which would otherwise seem odd to have a non-power of two number of Unicode codepoints.

The U in UTF-16 is for Unicode.

> Because we use the Text type for strings in Haskell, which is an array of UTF-16 code units in native endianness, the benchmarks would take UTF-16 data as input.

Converting the function to act as a fold instead of just returning a lazy list, is that necessary because all the unboxing and strictness internally meant the matches couldn't actually be lazy?

Lazy data structures require allocation. GHC's lists use build/fold fusion to remove intermediate lists but that's basically just a less reliable way of converting to folds http://hackage.haskell.org/package/base-

The other common fusion technique used in haskell is stream fusion which is very similar to the mutual-recursion-encoded-as-sum-type version in the blog post.

There is also indexing-based fusion which the Repa library uses. It allows more control over traversal order (think optimizing stencil codes) but requires regular arrays, e.g. no filtering.

There are some other research projects like specialized compiler passes via hermit but afaik nothing really production ready.

Ok so not using a lazy list for matches is basically just around avoiding the requisite allocation? I suppose if you consume the list immediately at the call-site the compiler could in theory after inlining convert it to the closure-based approach for you, but probably wouldn't.

We only did this to fix the performance regression we observed after upgrading to GHC 8.6, but in hindsight I think it is cleaner anyway. With GHC 8.2 the list construction was eliminated, but the optimization turned out to be fragile.

It's a pity that Aho-Corasick always seems to be the starting point for this kind of work, because it's as slow as a wet week and/or the tables are huge (pick your poison).

To do this right you need a fast prefilter and a reasonable fast string match. Our string matcher in Hyperscan blew AC away on size and speed, typically. Unfortunately, I never had a chance to build its successor. But AC is a disaster of one dependent load after another and is mediocre at best in terms of performance.

It's really not a pity at all, considering that:

1. There is no write up of the Hyperscan algorithm of which you speak. I wrote up a part of Teddy, but I still don't understand the full algorithm. Hyperscan's code takes serious effort to understand. I spent days staring at Teddy and I still didn't get everything.

2. Aho-Corasick is reasonably simple to implement and does fairly decently for a large number of tasks.

3. Aho-Corasick is trivially portable.

Ha, there's a paper about FDR now.

The fact that AC is portable and simple and that everyone feels good about themselves as a result of implementing a classical-sounding algorithm (why AC and not Rabin-Karp? who knows?) is why is keeps cropping up, like kudzu. It's still either (a) huge, (b) slow, (c) both and (d) a poor fit for modern architectures - it has a way of turning the problem into a latency problem for whichever level of memory its structures fit into (rather than a throughput problem).

It's a solution. Big deal. There are half-a-dozen better ones just floating around in the literate, but this shitty one seems to be the one people fixate on. Maybe it's because Aho and Corasick sort earlier in the dictionary than Rabin and Karp?

It's been a while since I looked into Rabin Karp, but I remember there being two issues: 1) it really wants all of its patterns to be the same length and 2) it doesn't scale to a large number of patterns. Aho-Corasick has its own issues at scale, but only to the extent that your proportion of memory reads that hit cache decreases. But honestly, I've never implemented it in anger.

I'll have to take a closer look at FDR. I thought it was for single pattern search, not multi pattern search.

I don't really understand why you're being so bombastic about this. AC has a number of nice qualities, as I mentioned above. We don't need to resort to bullshit about how it's only used because it makes people feel good about themselves. That's not a good way to win people over to your cause.

I've spent years benchmarking against various AC implementations; typically in Snort. They're usually stupidly huge and slow.

Rabin-Karp needs a fair bit of help to work; I mention it as a starting point. As a core matching technique it isn't much more fancy than "put your stuff in a hash table". Handling strings of different length is in fact hard but using multiple tables is viable. It scales a lot better than Aho-Corasick in that you can engineer the size of the hash tables and anticipate performance.

Conversely, AC just develops a huge 'root cloud' when the pattern sets get big - increasingly, you're always in a prefix of something. It tends to benchmark well (sorta, kinda) especially if you feed it input that doesn't look much like what you're trying to match. You then just bounce around a couple shallow states that get to be resident in L1 and you look pretty smart.

Also, in Rabin-Karp or fancy variants, every access is independent of every other, a property that you can only dream of in AC as you pootle from state to state making reads that depend on what you did last.

"FDR" works on a different principle - it is 'bucketed SIMD shift-or with super-characters' (see the paper).

I have a third matcher in the wings based on PEXT, which I must finish up.

Honestly I don't care what people do. Arguably a good deal of the money we made from Hyperscan in its closed-source era came from selling a 'regex' matcher came from just doing the simplistic case of multiple literal matching. Our Rabin-Karp thingy (multiple hash tables with different lengths and strides) was called "hlm" (originally, "Vomitsauce"), took a few pages of code, wasn't that conceptually complex, and probably by itself (sans regex) netted a few hundred thousand a year (we eventually replaced it with FDR). This is borderline ridiculous; making money of a big complex regex matcher seemed well-deserved, but making money off a simplistic task like this seemed like picking up bills of the ground.

What frustrates me is that people don't generally look beyond one solution, don't innovate, and don't care that they aren't innovating or reading the literature. Apparently algorithm design is a parlor trick for getting into Google, not a thing that one might actually do.

> Honestly I don't care what people do.

You can't have it both ways. You clearly care to some extent. Whatever though. Clearly we disagree on how to approach this stuff. I realize you probably don't intend this, but you really come off as belittling others for not being as smart as you. It's not cool.

> Conversely, AC just develops a huge 'root cloud' when the pattern sets get big - increasingly, you're always in a prefix of something. It tends to benchmark well (sorta, kinda) especially if you feed it input that doesn't look much like what you're trying to match. You then just bounce around a couple shallow states that get to be resident in L1 and you look pretty smart.

The OP constructed a benchmark from exactly the task they are trying to do and did algorithm development until they got a better result. That seems like exactly the right way to approach this. Not everyone is in a position to make an extra few hundred thousand dollars by switching from AC to Vomitsauce.

That's all I'm trying to do too. It's going to take a _long_ time for the field to collectively catch up to Hyperscan. It'll take me years, if I keep on my current path.

I think the field should pass Hyperscan, not catch up to it. There are a number of things about HS that I'm increasingly uneasy about.

As for the rest I'm done. I don't think I can adequately explain my irritation at the general state of discourse about the things I worked on for 12+ years without coming off in all sorts of ways I don't really intend. Suffice to say I've seen things you people wouldn't believe, strawmen comparison implementations on fire off the shoulder of Orion, etc. :-)

I will try to write up my new string matcher at some point, in a less convoluted context than Hyperscan. HS, from sheer complexity, makes almost all algorithms more or less undiscoverable - it's hard to avoid having some busy-body subsystem jump in front of the simple demo you're trying to do.

> I think the field should pass Hyperscan, not catch up to it. There are a number of things about HS that I'm increasingly uneasy about.

Right. I meant it as a short-hand way of saying, "it will take a while for the field to catch up to the battle tested knowledge you've accumulated, of which, some is expressed inside Hyperscan." :-)

This does seem to actively circumvent a number of the things that are always held up as why Haskell is a great language: it uses strict types in a number of places, has explicit inline and no-inline annotations everywhere.

It also seemed to require a lot of work to construct the code to make tail call optimization happen, but that’s par for the course in Haskell.

I don't think that's true at all. The goal here was to make it as fast as rust (which basically compiles down to 'straight' native code by default). Haskell, while also compiled, kind of runs on a virtual machine (the STG). It should be expected that a language with a different execution model will need some work to fit another one. Of course, one would expect the same work if you were trying to make rust run as fast as Haskell on a processor designed to execute STG expressions (of which there are none, but there could be, like the lisp machines of the past).

Either way, optimization in any language often requires annotations. The main benefit of Haskell though is still equational reasoning, which this does not obviate. You can get rid of all the annotations and your code will still run, be correct, and can be reasoned about

The thesis of laziness by default isn't "everything should be lazy", just that there's some benefit to having strictness be something you add as necessary - maybe that it's easier to add strictness to something lazy than add laziness to something strict, that it gives you more abstraction power, or maybe increases surface area for optimizations.

Similarly, the existence of generators and zero-argument functions in Python doesn't undermine the idea that strictness by default could have benefits.

What's the significance of zero-argument functions in Python?

That's a lazy-evaluated value, if the function is pure.

No one says "Haskell is great because it's fast". They say variants of "Haskell is great because it offers ultra-high-level abstraction, mathematically precise data types, static type-checking and inference, and still giving you the freedom to write nasty low-level preprocesser line noise when you need to be as fast as C"

I'm not that versed in Haskell myself, but is this really the case? I would think that expectations for low level optimization and general code are different.

Though, I suppose you could then say that if the rules are different for low level optimization, they may as well have imported an external library written in Rust.

Optimized Rust feels more idiomatic at least.

but I don't think it's a suprise, since rust is made to be fast while haskell is made to be beautiful (for some sense of beautiful)

They explain in the post that they did consider it but decided to avoid it due to the way the FFI and Gc interact (requiring the string to be copied)

The inline/noinline annotations on the inner functions we only had to add for GHC 8.6 though; on GHC 8.2 it was fast without these. (That turned out to be a downside, relying on a long chain of optimizations is fragile.)

"A naive Python script that loops over all needles and calls str.find in a loop to find all matches. This is not Aho–Corasick [...]".

I'm the author of pyahocrasick, a python C extension that implements the AC. I'd love to see it in the comparison. :)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact