
How we made Haskell search strings as fast as Rust - duijf
https://tech.channable.com/posts/2019-03-13-how-we-made-haskell-search-strings-as-fast-as-rust.html
======
carterschonwald
as maintainer of two of the haskell libraries they use, a member of the core
libraries committee, and a hackage trustee, i'd like to throw in my two cents:

1) i see they are doing fully qualified imports. please add major version
bounds to your dependencies; using stack doesn't excuse not tracking which
versions you used!

2) they have

"data Next a = Done !a | Step !a " in their code,

this tells me they could have possibly benefited from using the stream
abstraction in Vector!

[https://hackage.haskell.org/package/vector-0.12.0.2/docs/Dat...](https://hackage.haskell.org/package/vector-0.12.0.2/docs/Data-Vector-Fusion-Stream-Monadic.html) is a link to the latter
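
For anyone who hasn't seen that module, here is a rough, hedged sketch of the
shape of the Stream abstraction (a simplified, pure version of the monadic
types in the module linked above, not the library's exact API):

    {-# LANGUAGE BangPatterns #-}
    {-# LANGUAGE ExistentialQuantification #-}

    -- One step of a stream: at most one element plus a new seed, which is
    -- essentially what the article's "Next" type encodes.
    data Step s a = Yield a s | Skip s | Done

    data Stream a = forall s. Stream (s -> Step s a) s

    -- A stream of the integers [lo..hi].
    enumFromToS :: Int -> Int -> Stream Int
    enumFromToS lo hi = Stream step lo
      where
        step i
          | i > hi    = Done
          | otherwise = Yield i (i + 1)

    -- Consume a stream with a strict left fold; pipelines of stream
    -- transformers written against this shape fuse into a single loop.
    foldlS :: (b -> a -> b) -> b -> Stream a -> b
    foldlS f z (Stream step s0) = go z s0
      where
        go !acc s = case step s of
          Yield x s' -> go (f acc x) s'
          Skip s'    -> go acc s'
          Done       -> acc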

~~~
drb226
About (1).

The exact versions they used are specified by the stack.yaml file.

In this case, see
[https://stackage.org/lts-13.10](https://stackage.org/lts-13.10)

Stack has a feature that lets you automatically add bounds to a hackage
upload based on the stack.yaml. (However, I personally do not recommend using
this feature, as those bounds usually end up being overly restrictive.
Manually-written bounds are better.)

~~~
carterschonwald
yet it's off by default and most users don't use it.

I do agree human judgement based bounds are best, but stackage snapshots
aren't that. Furthermore, stackage/fpo LTS is not LTS in the same sense as
linux-distributions and their BSD-unix cousins. Proper long term support in
any meaningful sense is a commitment by the distro provider to backport
security/bug fixes to the LTS. Stackage LTS doesn't do that.

stack tricks its users into not thinking about their relationship with their
dependencies, and it prevents library code from being usable by haskell
devs/users/contributors who don't use stack. :)

the more you know! :)

------
whalesalad
I can't speak to the content or Haskell at all but I really like the clean
formatting of the post, particularly the charts and images. Does anyone know
what tool might have been used to produce them? So clean and crisp.

~~~
duijf
It's LaTeX, TikZ, and pgfplots for the images. Compiled with xelatex to PDF
and pdf2svg to SVG. The rest is custom HTML and CSS. We use the Hakyll static
site generator.

~~~
ruuda
Yes, can confirm. Check out [1] and [2] to learn more about TikZ and PGFPlots.

[1]:
[http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf](http://mirrors.ctan.org/graphics/pgf/base/doc/pgfmanual.pdf)
[2]:
[http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pg...](http://mirrors.ctan.org/graphics/pgf/contrib/pgfplots/doc/pgfplots.pdf)

------
paulddraper
Impressive work! The clincher is at the end:

> Relying so much on optimizations to ensure that very high level code
> compiles to the right low-level code is fragile. Case in point: after
> upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with
> GHC 8.6.3, we saw the running time of our benchmark double.

This would scare me away from doing any real performance optimization in a
language as high-level as Haskell: am I just fiddling with compiler internals
now?

------
dmitriid
What actually amazes me is that... Python is good enough:

> A naive Python script that loops over all needles and calls str.find in a
> loop to find all matches.

> working through all 3.5 gigabytes of test data once

> Time is ~6.5s.

I love these kinds of benchmarks. They consistently show that for specialised
cases you do need to drop down to C/Rust/hand-written Haskell implementations.
In many, many more cases, well, even Python with a naive implementation is good
enough.

Not to diss Haskell or Rust. Only saying: choose your tools as you see fit, and
benchmark.

Also, kudos to the authors for using the median instead of the mean in the graphs.

~~~
staticassertion
> Firstly, the program spends most of its time in the heavily-optimized
> str.find function, which is implemented in C.

I'm not going to outright call this "cheating", but most of the time,
virtually unconditionally, when Python is fast it's because it isn't Python,
it's C.

It definitely feels worth noting that Python is radically slower than other
languages, and the main "tool" it provides for improving performance is to
rely on another language entirely.

~~~
zie
Well Python is written _IN_ C, at least the standard version is, so I really
wouldn't call it anywhere near cheating... :).

But I don't disagree that Python owes its performance to C... it owes
_everything_ to C :)

~~~
staticassertion
I think there's a very big difference between the runtime being implemented in
C and the libraries being implemented in C.

~~~
blattimwind
Take Java as an example: the runtime is written in C++, but practically all
libraries, including the standard library, are written in Java, and Java
libraries are greatly preferred.

------
timerol
> Clearly there was plenty of room for speeding up our string matching. We
> considered binding the Rust library, but due to the way that Haskell’s
> foreign function interface interacts with the garbage collector, passing a
> string to a foreign function requires copying it. That copy alone puts any
> foreign library at a significant disadvantage, so we decided to see how
> close we could get in Haskell alone.

It's a shame Haskell copies strings across FFI boundaries, because binding the
Rust version would have been almost as fast, with (presumably) significantly
less effort.

~~~
Tarean
I wondered about this and poked at the internals of ByteString (suitable for
FFI) and Text. It turns out Text could allow FFI by changing a single function
(newByteArray to newAlignedPinnedByteArray).

Presumably creating lots of small strings is a typical use case for Text, so
losing compacting GC hurts. ByteString, on the other hand, is used more with
binary data and is designed for FFI use.
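
As a side note, this is also why ByteString can already be handed to C without
a copy. A minimal sketch, assuming a hypothetical C function count_newlines
(the FFI wrapper is standard; the C symbol is made up for illustration):

    {-# LANGUAGE ForeignFunctionInterface #-}

    import qualified Data.ByteString as B
    import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
    import Foreign.C.String (CString)
    import Foreign.C.Types (CInt (..), CSize (..))

    -- Hypothetical C routine; any byte-scanning function would do.
    foreign import ccall unsafe "count_newlines"
      c_count_newlines :: CString -> CSize -> IO CInt

    countNewlines :: B.ByteString -> IO CInt
    countNewlines bs =
      -- No copy: the ByteString's pinned buffer is passed to C directly.
      unsafeUseAsCStringLen bs $ \(ptr, len) ->
        c_count_newlines ptr (fromIntegral len)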

~~~
whateveracct
Hmm ByteString has the opposite issue and solved it with ShortByteString.
Seems like PinnedText could be the analogue.

------
wuschel
Interesting article.

Out of curiosity: how much optimization potential is there in the Rust
library written by Burntsushi?

~~~
burntsushi
There is definitely some room. As usual, it depends. I'm away from the
computer, so I haven't had a chance to look and see if the OP's benchmarks are
published. So it's hard for me to say anything specific at the moment.

In the literature, there are generally two different variants of Aho-Corasick.
The traditional formulation follows failure transitions during search, which
comes with overhead. In effect, it's an NFA since search can traverse through
multiple states for each byte of input that it sees. The other variant is the
"advanced" form, which is just effectively turning the NFA into a DFA by
precomputing all of the failure transitions into a single table. This can be
quite a bit faster depending on the workload, but obviously uses more memory
and takes more time to build. My library supports the DFA version, but it had
a bug (which has now been fixed) which prevented the OP from effectively
benchmarking it.
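
To make the difference concrete, here is a hedged sketch in Haskell of the two
kinds of transition step (illustrative only, not any particular library's
actual representation):

    import qualified Data.IntMap.Strict as IM
    import qualified Data.Vector as V
    import Data.Word (Word8)

    -- One state of the classical automaton: sparse goto edges plus a failure link.
    data AcState = AcState
      { gotoEdges :: IM.IntMap Int  -- keyed by byte value
      , failLink  :: Int
      }

    -- NFA-style step: chase failure links until a goto edge matches, so a
    -- single input byte may visit several states.
    nfaStep :: V.Vector AcState -> Int -> Word8 -> Int
    nfaStep states = go
      where
        go st b = case IM.lookup (fromIntegral b) (gotoEdges (states V.! st)) of
          Just nxt -> nxt
          Nothing
            | st == 0   -> 0
            | otherwise -> go (failLink (states V.! st)) b

    -- DFA-style step: failure transitions precomputed into a dense 256-entry
    -- row per state, so each input byte is exactly one table lookup.
    dfaStep :: V.Vector (V.Vector Int) -> Int -> Word8 -> Int
    dfaStep table st b = (table V.! st) V.! fromIntegral b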

The OP hints that some substantial number of samples use a very small number
of patterns. Sometimes only one. In that case, it probably wouldn't be best to
use Aho-Corasick at all. Instead, it should use an optimized single substring
search algorithm, such as Two Way. Even for multiple patterns, if there aren't
that many of them, it might be better to use Teddy from Hyperscan.

This article is fairly timely. I'm almost done with a rewrite of my Aho-Corasick
library. It will bring some small performance improvements (but more
importantly a simpler API), and I hope to move the Teddy algorithm into it at
some point.

As always, specific workloads can often have their performance improved quite
a bit.

~~~
glangdale
You should borrow "FDR" from Hyperscan as well. :-)

AC sucks. It's simple enough, but it's always going to be either slow or big -
sometimes both. I'd say that's a source of frustration to me, but it's not
really, as we made a lot of money selling string matching to customers back
when Hyperscan was closed-source. It's so much easier to sell string matching
than regex matching as regexes are full of surprises, but strings are quite
tractable.

I'm not delighted with where we left string matching in Hyperscan - "FDR" was
wheezing under load in a bunch of ways, and I had some better things in the
pipeline. But it still should generally murder AC barring some corner cases.

Unfortunately, Hyperscan has moved away from having a pure-play literal
matcher in it (what remains is really more of a specialist 1- to 8-char
literal matcher), which I regard as a semi-mistake (so there may be more
corner cases than there used to be).

~~~
burntsushi
I thought FDR was for single literals only?

One of the hurdles for using the algorithms from Hyperscan is that it's not
always clear when particular algorithms are used and in what conditions. I
have the same problem in my regex library, but Hyperscan takes it to another
level because it's so much more sophisticated. :)

~~~
glangdale
(see other response)

FDR is for more literals than Teddy. It's Noodle that's for single literals. Hard
to imagine how you could get confused given that the names we chose are so
transparent. :-)

~~~
burntsushi
Hah. I've only made it halfway through your paper so far. Excited to read the
rest. :)

------
sevensor
Why UTF-16? Seems like an odd choice of character encodings.

~~~
theoh
It's what Haskell's 'text' type uses internally. This stackoverflow question
seems to discuss why that is the case:

[https://stackoverflow.com/questions/23765903/why-is-text-utf...](https://stackoverflow.com/questions/23765903/why-is-text-utf-16-as-opposed-to-utf-8)

~~~
spullara
Sadly, every language developed in the 90s has this flaw: stuck somewhere
between bytes and actual Unicode.

~~~
eridius
To be more clear, this flaw happened because way back when we thought UCS-2
was sufficient. Then UTF-16 happened and all these systems designed for UCS-2
had to be hacked up for UTF-16 instead.

~~~
stouset
To be even more clear, this happened because early versions of Unicode
consisted of fewer than 2^16 code points, and this was thought to be enough.
In 1996, Unicode 2.0 was published and allowed characters whose encoding
required more than 16 bits. This necessitated UTF-16 which allows expansion up
to 32 bits to represent characters in higher planes (and is compatible with
most text that had been encoded in UCS-2).

~~~
chrisseaton
> allows expansion up to 32 bits

It doesn't allow expansion up to 32 bits. How would that even be possible?
Some of the bits are needed to indicate to use the second two bytes, so
there's a hole in the range and it's less than 32 bits.

~~~
theoh
"Surrogate pairs" use two code units (32 bits total) to increase the number of
code points that can be represented. It seems a lot more complicated than
UTF-8. I have no idea how many code points can actually be represented in
theory. There are under 21 bits' worth of Unicode codepoints, though, so I'm
sure the claim that UTF-16 can represent all of them isn't some kind of trick.

~~~
eridius
A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte
is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is
precisely as many unicode code points as there are (0–0x10FFFF).

Decoding UTF-16 (with surrogate pairs) is pretty similar to decoding UTF-8
except the state machine is simpler (as it's just 1–2 code units per
codepoint, instead of 1–4). It also means that if you validate that you have a
high surrogate and a low surrogate, the combination is guaranteed to be a
valid codepoint (whereas with UTF-8 there are invalid encodings, e.g. any
1-byte codepoint can actually be encoded using 2 or 3 bytes instead, which the
state machine must reject).
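
For concreteness, the encoding direction is just this bit of arithmetic (a
small sketch of the standard UTF-16 scheme, not any particular library's code):

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Word (Word16, Word32)

    -- Encode a code point above the BMP (0x10000..0x10FFFF) as a surrogate pair.
    toSurrogatePair :: Word32 -> (Word16, Word16)
    toSurrogatePair cp =
      let v  = cp - 0x10000                              -- 20 bits remain
          hi = 0xD800 .|. fromIntegral (v `shiftR` 10)   -- high surrogate: top 10 bits
          lo = 0xDC00 .|. fromIntegral (v .&. 0x3FF)     -- low surrogate: bottom 10 bits
      in (hi, lo)

    -- Example: toSurrogatePair 0x1F600 == (0xD83D, 0xDE00).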

~~~
theoh
Wait a minute.

"A surrogate pair can represent 20 bits, or 2^20 values. A single UTF-16 byte
is of course 16 bits, or 2^16 values. 2^20 + 2^16 == 0x110000, which is
precisely as many unicode code points as there are (0–0x10FFFF)."

This explanation can't be quite right, because the possible surrogate code
unit values are carved out of the set of 2^16 code units. So you are double
counting those surrogate code units.

From Wikipedia: "Since the ranges for the high surrogates (0xD800–0xDBFF), low
surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF,
0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP
character, or for two adjacent code units to look like a legal surrogate
pair."

~~~
theoh
OK, I've found the gross reason why the 2^16 + 2^20 arithmetic works:

"The Unicode standard permanently reserves these code point values for UTF-16
encoding of the high and low surrogates, and they will never be assigned a
character, so there should be no reason to encode them. The official Unicode
standard says that no UTF forms, including UTF-16, can encode these code
points."

Ugh.

------
eridius
Converting the function to act as a fold instead of just returning a lazy
list, is that necessary because all the unboxing and strictness internally
meant the matches couldn't actually be lazy?

~~~
Tarean
Lazy data structures require allocation. GHC's lists use build/foldr fusion
to remove intermediate lists, but that's basically just a less reliable way of
converting to folds:
[http://hackage.haskell.org/package/base-4.12.0.0/docs/src/GH...](http://hackage.haskell.org/package/base-4.12.0.0/docs/src/GHC.Base.html#build)
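
A tiny sketch of what that looks like in practice (using build from GHC.Exts
and the rewrite rule foldr k z (build g) = g k z from the linked source):

    import GHC.Exts (build)

    -- A producer written in terms of build...
    upto :: Int -> [Int]
    upto n = build (\cons nil ->
      let go i | i > n     = nil
               | otherwise = i `cons` go (i + 1)
      in go 1)

    -- ...fuses with a foldr consumer, so the intermediate list is never
    -- allocated once the rewrite rule fires.
    sumUpto :: Int -> Int
    sumUpto n = foldr (+) 0 (upto n)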

The other common fusion technique used in Haskell is stream fusion, which is
very similar to the mutual-recursion-encoded-as-sum-type version in the blog
post.

There is also indexing-based fusion which the Repa library uses. It allows
more control over traversal order (think optimizing stencil codes) but
requires regular arrays, e.g. no filtering.

There are some other research projects, like specialized compiler passes via
HERMIT, but AFAIK nothing is really production-ready.

~~~
eridius
Ok so not using a lazy list for matches is basically just about avoiding the
requisite allocation? I suppose if you consume the list immediately at the
call-site the compiler could in theory after inlining convert it to the
closure-based approach for you, but probably wouldn't.

------
glangdale
It's a pity that Aho-Corasick always seems to be the starting point for this
kind of work, because it's as slow as a wet week and/or the tables are huge
(pick your poison).

To do this right you need a fast prefilter and a reasonably fast string match.
Our string matcher in Hyperscan blew AC away on size and speed, typically.
Unfortunately, I never had a chance to build its successor. But AC is a
disaster of one dependent load after another and is mediocre at best in terms
of performance.

~~~
burntsushi
It's really not a pity at all, considering that:

1. There is no write-up of the Hyperscan algorithm of which you speak. I
wrote up a part of Teddy, but I still don't understand the full algorithm.
Hyperscan's code takes serious effort to understand. I spent days staring at
Teddy and I still didn't get everything.

2. Aho-Corasick is reasonably simple to implement and does fairly decently
for a large number of tasks.

3. Aho-Corasick is trivially portable.

~~~
glangdale
Ha, there's a _paper_ about FDR now.

The fact that AC is portable and simple and that everyone feels good about
themselves as a result of implementing a classical-sounding algorithm (why AC
and not Rabin-Karp? who knows?) is why it keeps cropping up, like kudzu. It's
still either (a) huge, (b) slow, (c) both and (d) a poor fit for modern
architectures - it has a way of turning the problem into a latency problem for
whichever level of memory its structures fit into (rather than a throughput
problem).

It's _a_ solution. Big deal. There are half-a-dozen better ones just floating
around in the literature, but this shitty one seems to be the one people fixate
on. Maybe it's because Aho and Corasick sort earlier in the dictionary than
Rabin and Karp?

~~~
burntsushi
It's been a while since I looked into Rabin Karp, but I remember there being
two issues: 1) it really wants all of its patterns to be the same length and
2) it doesn't scale to a large number of patterns. Aho-Corasick has its own
issues at scale, but only to the extent that your proportion of memory reads
that hit cache decreases. But honestly, I've never implemented it in anger.

I'll have to take a closer look at FDR. I thought it was for single pattern
search, not multi pattern search.

I don't really understand why you're being so bombastic about this. AC has a
number of nice qualities, as I mentioned above. We don't need to resort to
bullshit about how it's only used because it makes people feel good about
themselves. That's not a good way to win people over to your cause.

~~~
glangdale
I've spent years benchmarking against various AC implementations; typically in
Snort. They're usually stupidly huge and slow.

Rabin-Karp needs a fair bit of help to work; I mention it as a starting point.
As a core matching technique it isn't much more fancy than "put your stuff in
a hash table". Handling strings of different length _is_ in fact hard but
using multiple tables is viable. It scales a lot better than Aho-Corasick in
that you can engineer the size of the hash tables and anticipate performance.
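
To make the "hash table" framing concrete, a hedged sketch of the naive idea
for patterns of one fixed length (no rolling hash, and nothing like the tuned
Hyperscan version; the names are made up):

    import qualified Data.ByteString as B
    import qualified Data.HashSet as HS
    import Data.Hashable (hash)

    -- Offsets where some pattern of length k matches. Each window's hash is
    -- looked up independently, unlike the dependent state-to-state loads of AC.
    matchesOfLength :: Int -> [B.ByteString] -> B.ByteString -> [Int]
    matchesOfLength k pats haystack =
      [ i
      | i <- [0 .. B.length haystack - k]
      , let window = B.take k (B.drop i haystack)
      , hash window `HS.member` hashes
      , window `elem` pats              -- verify, since hashes can collide
      ]
      where
        hashes = HS.fromList (map hash pats)

Patterns of different lengths would then get one such table per length, as
described above.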

Conversely, AC just develops a huge 'root cloud' when the pattern sets get big
- increasingly, you're always in a prefix of something. It tends to benchmark
well (sorta, kinda) especially if you feed it input that doesn't look much
like what you're trying to match. You then just bounce around a couple shallow
states that get to be resident in L1 and you look pretty smart.

Also, in Rabin-Karp or fancy variants, every access is independent of every
other, a property that you can only dream of in AC as you pootle from state to
state making reads that depend on what you did last.

"FDR" works on a different principle - it is 'bucketed SIMD shift-or with
super-characters' (see the paper).

I have a third matcher in the wings based on PEXT, which I must finish up.

Honestly I don't care what people do. Arguably a good deal of the money we
made from Hyperscan in its closed-source era selling a 'regex' matcher came
from just doing the simplistic case of multiple literal matching.
Our Rabin-Karp thingy (multiple hash tables with different lengths and
strides) was called "hlm" (originally, "Vomitsauce"), took a few pages of
code, wasn't that conceptually complex, and probably by itself (sans regex)
netted a few hundred thousand a year (we eventually replaced it with FDR).
This is borderline ridiculous; making money off a big complex regex matcher
seemed well-deserved, but making money off a simplistic task like this seemed
like picking up bills off the ground.

What frustrates me is that people don't generally look beyond one solution,
don't innovate, and don't care that they aren't innovating or reading the
literature. Apparently algorithm design is a parlor trick for getting into
Google, not a thing that one might actually _do_.

~~~
burntsushi
> Honestly I don't care what people do.

You can't have it both ways. You clearly care to some extent. Whatever though.
Clearly we disagree on how to approach this stuff. I realize you probably
don't intend this, but you really come off as belittling others for not being
as smart as you. It's not cool.

> Conversely, AC just develops a huge 'root cloud' when the pattern sets get
> big - increasingly, you're always in a prefix of something. It tends to
> benchmark well (sorta, kinda) especially if you feed it input that doesn't
> look much like what you're trying to match. You then just bounce around a
> couple shallow states that get to be resident in L1 and you look pretty
> smart.

The OP constructed a benchmark from exactly the task they are trying to do and
did algorithm development until they got a better result. That seems like
exactly the right way to approach this. Not everyone is in a position to make
an extra few hundred thousand dollars by switching from AC to Vomitsauce.

That's all I'm trying to do too. It's going to take a _long_ time for the
field to collectively catch up to Hyperscan. It'll take me years, if I keep on
my current path.

~~~
glangdale
I think the field should _pass_ Hyperscan, not catch up to it. There are a
number of things about HS that I'm increasingly uneasy about.

As for the rest I'm done. I don't think I can adequately explain my irritation
at the general state of discourse about the things I worked on for 12+ years
without coming off in all sorts of ways I don't really intend. Suffice to say
I've seen things you people wouldn't believe, strawmen comparison
implementations on fire off the shoulder of Orion, etc. :-)

I will try to write up my new string matcher at some point, in a less
convoluted context than Hyperscan. HS, from sheer complexity, makes almost all
algorithms more or less undiscoverable - it's hard to avoid having some busy-
body subsystem jump in front of the simple demo you're trying to do.

~~~
burntsushi
> I think the field should pass Hyperscan, not catch up to it. There are a
> number of things about HS that I'm increasingly uneasy about.

Right. I meant it as a short-hand way of saying, "it will take a while for the
field to catch up to the battle tested knowledge you've accumulated, of which,
some is expressed inside Hyperscan." :-)

------
olliej
This does seem to actively circumvent a number of the things that are always
held up as why Haskell is a great language: it uses strict types in a number
of places and has explicit inline and no-inline annotations everywhere.

It also seemed to require a lot of work to construct the code to make tail
call optimization happen, but that’s par for the course in Haskell.
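
For readers who haven't written this style of Haskell, here is a made-up toy
showing the kind of annotations being described (not the article's code):
strict constructor fields plus INLINE/NOINLINE pragmas.

    import Data.List (foldl')

    -- Strict fields keep the accumulator evaluated instead of building thunks.
    data Acc = Acc !Int !Int

    {-# INLINE step #-}          -- ask GHC to inline the hot inner function
    step :: Acc -> Int -> Acc
    step (Acc n s) x = Acc (n + 1) (s + x)

    {-# NOINLINE summarize #-}   -- keep a stable, non-inlined entry point
    summarize :: [Int] -> Acc
    summarize = foldl' step (Acc 0 0)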

~~~
orblivion
I'm not that versed in Haskell myself, but is this really the case? I would
think that expectations for low level optimization and general code are
different.

Though, I suppose you could then say that if the rules are different for low
level optimization, they may as well have imported an external library written
in Rust.

~~~
monocasa
Optimized Rust feels more idiomatic at least.

~~~
LeanderK
But I don't think it's a surprise, since Rust is made to be fast while
Haskell is made to be beautiful (for some sense of beautiful).

------
wmu
"A naive Python script that loops over all needles and calls str.find in a
loop to find all matches. This is not Aho–Corasick [...]".

I'm the author of pyahocorasick, a Python C extension that implements AC.
I'd love to see it in the comparison. :)

