
Is Prefix of String in Table? A Journey into SIMD String Processing - trentnelson
http://trent.me/is-prefix-of-string-in-table/
======
trentnelson
Side note: people have been asking how I made the diagrams and whatnot.
I'll put a little section on that in the next article. For now, though:

The two main diagrams, depicting the struct layout and the 5-pane register
walk-through, were done in Visio, then saved as SVG, with fonts edited
directly in Vim (it took a few cycles to figure out which fonts looked best
across all devices).

The charts were done in Excel, connected to the CSV data files produced by the
benchmark utility, then wired up using a PivotTable. A PivotChart was created,
Save As -> PDF, edit PDF in Inkscape, delete surrounding border, crop canvas
to fit diagram, Save As -> SVG. Edit SVG fonts in Vim, voila, nice vectorized
graph!

Sounds laborious, and it took forever to get the flow down, but once I'd
settled on a process it was pretty easy to whip up new graphs.

~~~
BonesJustice
I admire your attention to detail and keen sense of aesthetics. Well done.

~~~
trentnelson
Thanks! Joe Walnes should get credit for the underlying HTML and CSS and site
design and whatnot... I lovingly pilfered[1] his websocketd[2] site a while
back for PyParallel[3] and liked it so much I decided to use it for this
personal site as well.

[1]: [https://twitter.com/joewalnes/status/567495778349113345](https://twitter.com/joewalnes/status/567495778349113345)
[2]: [http://websocketd.com/](http://websocketd.com/)
[3]: [http://pyparallel.org/](http://pyparallel.org/)

------
dragontamer
Absolutely excellent!

The SIMD units of x86-64 CPUs are not discussed enough, IMO. And every time
they are discussed, I'm always blown away by their efficiency at solving
certain tasks.

---------------

Full disclosure: I don't understand what you've done yet. I've just skimmed
the article really quickly, but I already have some questions. Hopefully you
haven't already answered them elsewhere, lol.

* Any reason to highly optimize for "strings of length 16" instead of "strings of length 32"? The vast majority of your code uses "xmm" registers (16 bytes), but it would seem straightforward to target ymm registers (32 bytes) instead.

Was there any particular reason why you chose 16 bytes instead of 32 bytes?
Like maybe targeting AVX but not AVX2? (Then again, I see a rare ymm register
here and there... sooo... I guess I'm just a bit confused by the design
decision.)

* SSE4.2 has string-specific instructions. Did you look into any of the SSE4.2 string instructions to see if they'd help in your case? (Ironically, they'd force you back down to xmm registers, since it's SSE only. So it's kind of a contradiction with my first question, lol.) I fully admit that I don't actually know how to use the SSE4.2 string instructions, so don't read into my question too hard. (A rough sketch of these instructions is below.)
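
For reference, this is roughly what the SSE4.2 string facilities look like. A minimal sketch with illustrative names, not something the article does:

    /* Minimal sketch of one SSE4.2 string instruction: _mm_cmpestri with
       _SIDD_CMP_EQUAL_ORDERED performs a substring search for `needle`
       within a 16-byte chunk of `hay`.  Assumes both buffers have at
       least 16 readable bytes (i.e., are padded). */
    #include <nmmintrin.h>   /* SSE4.2 */
    
    int find_in_chunk(const char *hay, int hay_len,
                      const char *needle, int needle_len)
    {
        __m128i h = _mm_loadu_si128((const __m128i *)hay);
        __m128i n = _mm_loadu_si128((const __m128i *)needle);
    
        /* Returns 0..15 for the first ordered match, 16 if none. */
        int idx = _mm_cmpestri(n, needle_len, h, hay_len,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED);
        return idx < 16 ? idx : -1;
    }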

------------

I'll have to spend some time studying your code and explanation. Thank you
very much for the writeup! It will be helpful for my SIMD studies. There
really isn't enough material out there on AVX/SSE optimizations.

~~~
gameswithgo
One benefit of sticking with 16 bytes is that it works on any x86-64 machine,
and you also don't run into the weird frequency-throttling issues that you do
with AVX/AVX2/AVX-512.

So it is a lot easier to make something general purpose, where almost anyone
can use it and it will always be faster.

With AVX you sometimes have to take some care.

~~~
dragontamer
He's using the "v"-prefixed instructions, which require AVX, and probably AVX2
(I haven't memorized all the instructions... but they're at least AVX). Even
if he sticks with 16 bytes, the use of the "v" instructions basically requires
a relatively recent processor.

~~~
trentnelson
Yeah it's actually a pretty inconsistent mish-mash of ISA usage. I _could_ go
straight SSE4.x but I'd lose out on the vp* goodies you get with AVX and AVX2,
even if I'm only using XMM registers.

The "length of 16, num elements is 16" parity is nice though and simplifies
the logic of the algorithm.

------
trentnelson
Fun tidbits: a) I bombed, like, 6 coding interviews whilst writing this
article. Two of them were related to string processing, too, which I thought
was funny. (In a depressingly ironic kind of way.) No real point, other than
that I seemingly suck at the "leetcode" interview style companies seem to rely
upon these days.

And b) there are at least two Easter eggs in this post that no-one seems to
have noticed.

Oh, and c) a perk of religiously keeping line length to <= 80 characters: the
code is surprisingly readable on mobile devices!

~~~
garmaine
So you’re looking for a job then?

~~~
trentnelson
Sure am!

------
jhallenworld
Here is a way to do it with a perfect hash table (so only one lookup).

Find a 128-bit mask that will be used to select bits of the input string. You
want to minimize the number of 1 bits in the mask, since the table size is
2^(number of 1 bits). On the other hand, you need to pick enough 1 bits so
that there is only one string per table entry.

AND this mask with the input string, then COMPRESS the surviving bits using
the mask. The result is the index into the table. Each table entry holds a
string to compare against and a thermometer-coded mask (whose number of 1 bits
equals the length of that string in bits). AND the input string with the
thermometer-coded mask and compare the result with the entry's string. If they
match, you have a hit.

COMPRESS (or generalized extract) means moving the selected bits to the right
so that they are all next to each other, adjacent to the LSB. There is a
clever algorithm for this; see section 7-4 of Hacker's Delight:

[https://doc.lagout.org/security/Hackers%20Delight.pdf](https://doc.lagout.org/security/Hackers%20Delight.pdf)
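
For what it's worth, modern x86 has this in hardware: BMI2's PEXT instruction is exactly COMPRESS for 64-bit operands (the 128-bit mask above would take two PEXTs plus a shift). A minimal sketch of the lookup, with illustrative names and an 8-byte value for simplicity:

    /* Sketch of the perfect-hash lookup using BMI2's PEXT.  Layout and
       names are illustrative; e->value must be pre-ANDed with e->thermo
       when the table is built, and the table has 2^popcount(mask)
       entries. */
    #include <immintrin.h>   /* _pext_u64 (requires BMI2) */
    #include <stdint.h>
    #include <string.h>
    
    struct entry {
        uint64_t value;    /* the entry's string bits, pre-ANDed with thermo */
        uint64_t thermo;   /* thermometer mask: 1 bits cover the string      */
    };
    
    /* Assumes 8 readable bytes at s (i.e., padded input). */
    int lookup(const struct entry *table, uint64_t mask, const char *s)
    {
        uint64_t v;
        memcpy(&v, s, 8);
        uint64_t idx = _pext_u64(v, mask);      /* AND + COMPRESS in one op */
        const struct entry *e = &table[idx];
        return (v & e->thermo) == e->value;     /* verify the candidate */
    }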

~~~
trentnelson
Patches welcome :-)

But seriously though, I played around with compression a bit initially:
[https://github.com/tpn/tracer/blob/master/StringTable/String...](https://github.com/tpn/tracer/blob/master/StringTable/StringTable.h#L966).
Referenced Hacker's Delight when I was working on it. Granted, this was just
for compressing the USHORT lengths into BYTE-sized lengths.

Are you sure your compression suggestion would work for prefix matches? If you
can get lower latency than the fastest assembly version, I'm all ears:
[http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...](http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStringInTable_x64_12)

All of the compress/expand routines in that Hacker's Delight section you're
quoting seem to require upwards of 160 instructions. My negative-match
fast path only requires 12 instructions.

~~~
rurban
True. A gperf-like hash table (with special key indices) works OK only with
compile-time-known strings. But a simple CMP-alike perfect hash table with
double indirection would be worth testing against.

I called it Hanov, after
[http://stevehanov.ca/blog/index.php?id=119](http://stevehanov.ca/blog/index.php?id=119).
The problem there is that you have to calculate a hash for each string, which
is only fast with __builtin_crc/_mm_crc32_u64.
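
For illustration, the hardware CRC32 in question folds 8 bytes per instruction; a minimal sketch of hashing a string with it (names are illustrative):

    /* Sketch of hashing with SSE4.2's hardware CRC32. */
    #include <nmmintrin.h>   /* SSE4.2 CRC32 intrinsics */
    #include <stdint.h>
    #include <string.h>
    
    uint32_t crc_hash(const char *s, size_t len)
    {
        uint64_t crc = 0;
        while (len >= 8) {
            uint64_t chunk;
            memcpy(&chunk, s, 8);
            crc = _mm_crc32_u64(crc, chunk);   /* one 8-byte fold per call */
            s += 8; len -= 8;
        }
        while (len--)                          /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, (uint8_t)*s++);
        return (uint32_t)crc;
    }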

~~~
glangdale
A hash alone isn't going to help much; he has a 1-char string in the set and,
to top it off, most of the other strings begin with the same string (one
distinct from the 1-char string). There are some tricks (overpopulate the hash
table with short strings), but hashing is tricky here.

Source: 11 years of string-matching work. :-) We built a multiple-hash-table
string matcher in the early days of Hyperscan (it's gone now).

------
lossolo
I was wondering how long it took to write such a content-rich article; then I
saw this:

> Hours spent on this article to date: 230.56.

~~~
trentnelson
On the plus side, it only takes, like, 3 hours to read end-to-end.

~~~
floor_
You should SIMD your reading process and split it into threads.

~~~
gridspy
Inter-thread communication is hard.

------
trentnelson
The article is much longer than I originally anticipated. If you're looking
for some good bits, you could probably jump straight to here:
[http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...](http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStringInTable_10)

That discusses the last C implementation before I took a stab at the assembly
versions.

The final assembly section was pretty juicy as well:
[http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...](http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStringInTable_x64_3-review).

I was investigating why an earlier assembly version was so slow, and ended up
writing a version that unequivocally beat all the C and previous assembly
versions across the board: both prefix and negative matching, and worst-case
false-positive matching, where up to three comparisons needed to be done.

TL;DR: with data structures optimized specifically for SIMD instructions, you
can definitely get sizable performance improvements in C with intrinsics, and
then again with raw assembly.

------
gumby
> A reference implementation was written in C as a baseline, which simply
> looped through an array of strings, comparing each one, byte-by-byte,
> looking for a prefix match.

This is a poor choice; even 1970s-era Boyer-Moore is significantly faster for
most cases, as it avoids many pointless comparisons. And if you have a
dedicated search table you can optimize the search even more (e.g. pre-sorting
keys and targets, using a more specialized structure like a trie, etc.).

I think these approaches would also be amenable to deliberate acceleration
with intrinsics, as you did with your brute force search.

Regardless, interesting work.

~~~
trentnelson
Well... I mean the first version (IsPrefixOfStringInTable_1) uses the lengths
from the string array to improve performance versus the baseline:
[http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...](http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStringInTable_1)

The super-simple baseline referred to in the intro was whipped up as a trivial
example of the simplest approach you could take for the problem.

Trie was never in contention as I didn't want to rely on any algorithms that
involved pointer chasing. There's this comment at the top of StringTable.h:

        The design is optimized for relatively short strings (less than or equal to
        16 chars), and relatively few of them (less than or equal to 16).  These
        restrictive size constraints facilitate aggressive SIMD optimizations when
        searching for the strings within the table, with the goal to minimize the
        maximum possible latency incurred by the lookup mechanism.  The trade-off
        is usability and flexibility -- two things which can be better served by
        prefix trees if the pointer-chasing behavior of said data structures can
        be tolerated.

[https://github.com/tpn/tracer/blob/v0.1.12/StringTable2/Stri...](https://github.com/tpn/tracer/blob/v0.1.12/StringTable2/StringTable.h#L23)

I'm not sure Boyer-Moore is well suited to the specific use case I wanted to
address. It is geared toward finding occurrences of a pattern in a stream of
text bytes, which isn't really what I'm doing at all. I have an input string
of known length and up to 16 strings in a table, and I want to quickly check
if any of those strings "start with or are equal to" my input string, with
special emphasis on negative matching being on the fast path, and on
preserving the location of the match in the table (so it can be cast to an
enum, for example).

So, this article was specifically about the SIMD-oriented STRING_TABLE
structure, not about the history of string processing. Another article
comparing this approach to tries, Boyer-Moore, or Aho-Corasick would make
sense in the context of larger prefix-table data sets (and a larger text
corpus being searched, e.g. Wikipedia dumps).

~~~
Myrmornis
What's the issue with "pointer-chasing"? That it might be inefficient in
practice (despite preferable runtime complexity?) due to making poor use of
CPU caches?

~~~
dragontamer
> That it might be inefficient in practice

That's basically it.

Consider that his algorithm completes 16 searches in 20 cycles, or roughly 5
nanoseconds.

A single fetch from main memory takes ~50 nanoseconds, making it ~10x slower
than the entire lookup discussed in this post. Even a fetch from L3 cache is
typically 10 nanoseconds, while a fetch from L1 cache is ~1 ns, or ~4 cycles.
This SIMD methodology is incredibly FAST.

The "best-case" scenario of this SIMD code is 6-cycles, which is faster than
two L1 fetches. (!!!)

Note: two L1 fetches would likely go into the reorder buffer and be out-of-
ordered into an efficient manner. But... ignore that plz since that destroys
the point I'm trying to make. Lol. Still, you can see how repeatedly forcing
even L1 fetches would slow down the code, and "pointer chasing" is more likely
to fill up the Reorder buffers, since you cannot rely upon prefetchers to pre-
fetch the data, or other such tricks of the CPU.

Algorithmic complexity means the linear / SIMD methodology will eventually be
slower with a big enough dataset. But when looking at small data sets and
optimizing them to the maximum (i.e. searching 16 strings ASAP), linear /
sequential scans are king. It's kinda like why insertion sort ends up winning
over other sorts in a lot of small cases. CPUs and RAM are incredibly well
optimized for sequential data scans.
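
If you want to feel this effect directly, here's an illustrative throwaway microbenchmark (mine, not from the article): it sums the same N integers via a sequential scan and via a randomly permuted linked list. The scan lets the prefetchers stream data in; the pointer chase serializes every load behind the previous one.

    /* Illustrative microbenchmark: array scan vs. pointer chase. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define N (1 << 22)
    
    struct node { struct node *next; long value; };
    
    static unsigned long long rng_state = 88172645463325252ULL;
    static unsigned long long xorshift64(void)
    {
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return rng_state;
    }
    
    int main(void)
    {
        long *array = malloc(N * sizeof *array);
        struct node *nodes = malloc(N * sizeof *nodes);
        size_t *perm = malloc(N * sizeof *perm);
        size_t i;
    
        for (i = 0; i < N; i++) { array[i] = (long)i; perm[i] = i; }
    
        /* Fisher-Yates shuffle so the list hops randomly through memory. */
        for (i = N - 1; i > 0; i--) {
            size_t j = xorshift64() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (i = 0; i < N; i++) {
            nodes[perm[i]].value = (long)perm[i];
            nodes[perm[i]].next = (i + 1 < N) ? &nodes[perm[i + 1]] : NULL;
        }
    
        clock_t t0 = clock();
        long sum1 = 0;
        for (i = 0; i < N; i++) sum1 += array[i];
    
        clock_t t1 = clock();
        long sum2 = 0;
        for (struct node *p = &nodes[perm[0]]; p; p = p->next) sum2 += p->value;
        clock_t t2 = clock();
    
        printf("array scan:    sum=%ld  %.1f ms\n", sum1,
               (t1 - t0) * 1000.0 / CLOCKS_PER_SEC);
        printf("pointer chase: sum=%ld  %.1f ms\n", sum2,
               (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
        free(array); free(nodes); free(perm);
        return 0;
    }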

------
gravypod
Is this somewhere a Bloom filter (or a similar probabilistic data structure)
could be used to improve performance? From a quick survey of cryptographic
hash functions, something like BLAKE2 would serve your needs at ~4 ops/byte.
If you had a wide enough Bloom filter you could turn your check, in most
cases, into a binary &.

Just a thought. The author here obviously already knows what they're doing and
has likely thought of this but I'd be interested in hearing why this ASM-
optimized approach was chosen instead.

~~~
trentnelson
The initial length + unique-character test allows us to negative match in a
way that is effectively identical to a Bloom filter.

If the initial negative match test indicates no match, then we can be certain
there is not a match. Otherwise, we might have a match, and need to compare
the candidate strings in more detail.
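
To make that concrete, here's a minimal sketch of the general shape of that test (not the exact STRING_TABLE layout; the per-slot arrays and the fixed character position are illustrative):

    /* Minimal sketch of the length + unique-character negative match.
       Assumptions: lengths[i] is slot i's string length, uniq[i] is slot
       i's character at some fixed discriminating position (position 0
       here for simplicity).  A zero return proves no slot can match,
       without ever touching the actual strings. */
    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>
    
    int possible_match_mask(const uint8_t lengths[16], const uint8_t uniq[16],
                            const char *search, uint8_t search_len)
    {
        __m128i vlen  = _mm_loadu_si128((const __m128i *)lengths);
        __m128i vuniq = _mm_loadu_si128((const __m128i *)uniq);
    
        /* Slots longer than the search string can't be a prefix of it.
           Saturating subtract is nonzero exactly where lengths[i] >
           search_len, giving the unsigned compare SSE2 lacks directly. */
        __m128i diff   = _mm_subs_epu8(vlen, _mm_set1_epi8((char)search_len));
        __m128i len_ok = _mm_cmpeq_epi8(diff, _mm_setzero_si128());
    
        /* Slots whose discriminating character differs can't match. */
        __m128i char_ok = _mm_cmpeq_epi8(vuniq, _mm_set1_epi8(search[0]));
    
        /* One bit per slot; 0 => definitely no match (the fast path). */
        return _mm_movemask_epi8(_mm_and_si128(len_ok, char_ok));
    }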

I can't imagine how hashing would help in either the fast-path negative match
case, or the prefix match case. Hashing is sort of at odds with prefix
matching, in my opinion.

~~~
gravypod
Could you get good-enough performance with something simpler and more
portable?

The optimized assembly is much more complicated and specialized. Porting it
across platforms/architectures is very difficult, and you lose out on the
compiler's ability to make assumptions and optimize your code.

String length is a very poor probabilistic filter as you're specifically
looking for substrings.

Also, if you already have a hash of your input string's prefix and the hashes
of all of your known strings, then the case where you actually need to check
against your "known" strings becomes a fast search through a set of numbers,
making the fail-through case of your code negligible. If you impose a limit on
the number of strings in your "known" pool, the compiler can have a field day.

~~~
trentnelson
> Could you get good-enough performance with something simpler and more
> portable?

Hmmm. Did you read the article? Any of the C versions are relatively simple
and portable. They basically work the same as the assembly ones and give great
performance... I'm not sure what you're trying to get at.

> The optimized assembly is much more complicated and specialized.

Well, it's my article, and I wanted to see if I could beat a PGO C compiler
with all its optimizations turned on :-)

> Porting it across platforms/architectures is very difficult, and you lose
> out on the compiler's ability to make assumptions and optimize your code.

I know the trade-offs that come with assembly. But if you're okay with them,
you can get the best possible performance out of the underlying hardware. That
was the goal of this article.

> String length is a very poor probabilistic filter as you're specifically
> looking for substrings.

I don't understand this comment. String length is a great filter because if
any strings in the table are longer than the input search string, then there's
no way we can have a prefix match.

> Also, if you already have a hash of your input string's prefix and the
> hashes of all of your known strings, then the case where you actually need
> to check against your "known" strings becomes a fast search through a set of
> numbers, making the fail-through case of your code negligible. If you impose
> a limit on the number of strings in your "known" pool, the compiler can have
> a field day.

I don't know where this fascination with hashes comes from... have you looked
at the implementation? Even when it has to do a string match, the timings now
are around 13-14 cycles. The nature of the data structure means you're really
only ever doing at most 2 comparisons... 3 if someone is sending you
worst-case data (and even then, with 3, you're only clocking ~20 cycles).

Hashing and prefix matching just don't play nicely together.

------
glangdale
This is fine work, and well presented. The only flaw (which I've discussed on
Twitter with Trent) is that the performance analysis runs against a single
string at a time, which means we can't really analyze the effects of branch
prediction on realistic input (the branch predictor will converge to
'perfection' almost immediately). I think these effects would be small, but it
really does need to be properly analyzed.

------
godelski
I always love learning about performance and how this low-level stuff works.
Is there a book people recommend for this type of stuff?

~~~
trentnelson
I bought a bunch of assembly/MASM books from the 90s recently, if that helps.
Writing assembly is a skill that atrophies quickly without constant exercise,
so once you spool up to an acceptable level of assembly productivity, you
really need to make a conscious effort to keep practicing in order to
maintain it.

If I go for a couple of weeks without writing any, I'll come back and be like
“wait, how do I check if this is a NULL pointer again?”.

The Computer Architecture books are good; you can pick up second-hand copies
of older editions for cheap. The Intel and AMD manuals are good, too.

------
ridiculous_fish
This seems like a complicated way to speed up a linear scan. What's the
advantage of this approach over just sorting the strings and doing a binary
search, which is O(log N)?

(Not a naive binary search using strcmp, but the trick where you find the
lower and upper bounds using index 0, refine them using index 1, etc., until
your input is exhausted. A sketch of this follows below.)
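
For concreteness, a minimal sketch of that refinement trick, assuming a lexicographically sorted table of C strings (names are illustrative; this checks whether any table entry is a prefix of the input):

    /* Character-at-a-time binary search over a sorted string table.
       Because '\0' sorts first, a table entry that is a prefix of the
       input surfaces at table[lo] as the range narrows. */
    #include <stddef.h>
    
    int prefix_in_sorted_table(const char **table, int n,
                               const char *input, size_t len)
    {
        int lo = 0, hi = n;
        for (size_t i = 0; i < len && lo < hi; i++) {
            if (table[lo][i] == '\0')
                return lo;             /* table[lo] is a proper prefix */
            unsigned char c = (unsigned char)input[i];
            int l, h, m;
            /* Narrow lower bound: first row with row[i] >= c. */
            for (l = lo, h = hi; l < h; ) {
                m = l + (h - l) / 2;
                if ((unsigned char)table[m][i] < c) l = m + 1; else h = m;
            }
            lo = l;
            /* Narrow upper bound: first row with row[i] > c. */
            for (l = lo, h = hi; l < h; ) {
                m = l + (h - l) / 2;
                if ((unsigned char)table[m][i] <= c) l = m + 1; else h = m;
            }
            hi = l;
        }
        /* Any surviving row equal in length to the input is a match. */
        return (lo < hi && table[lo][len] == '\0') ? lo : -1;
    }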

~~~
xenadu02
Chasing pointers is cache death. Many theoretically optimal algorithms have
been struck down by the God of Memory Latency.

E.g., if you measure it, tree performance will often be much improved by
storing slightly less than one cache line's worth of data at each node. The
CPU can pipeline and predict the shit out of the linear search at each node,
to the point that all the theoretical O(1) or O(log N) lookups in the world
get demolished by a CS101 scan of the array. This makes a relatively good
general-purpose cache-aware data structure, because small collections tend to
fit into a single cache line while large data structures have relatively few
pointers to chase.

Of course as the OP demonstrates, optimizing for your specific use case can
yield even more dramatic improvements.
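
A minimal sketch of what such a node might look like (sizes and field names are made up; the point is that keys plus metadata fill exactly one 64-byte line, and children are reached by index rather than pointer):

    /* Hypothetical cache-line-sized tree node. */
    #include <stdint.h>
    
    #define NODE_KEYS 13
    
    struct node {
        uint32_t keys[NODE_KEYS];   /* 52 bytes of sorted keys, scanned linearly */
        uint32_t first_child;       /* pool index of child 0, not a pointer      */
        uint16_t count;             /* keys currently in use                     */
        uint16_t leaf;              /* nonzero for leaf nodes                    */
    } __attribute__((aligned(64)));
    
    _Static_assert(sizeof(struct node) == 64, "node fills one cache line");
    
    /* The branch-predictor-friendly linear scan at each node: returns the
       child slot to descend into (or the key's position if present). */
    static inline uint32_t node_find(const struct node *n, uint32_t key)
    {
        uint32_t i = 0;
        while (i < n->count && n->keys[i] < key)
            i++;
        return i;
    }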

~~~
trentnelson
Heh, “predict the shit out of”, I like that.

------
rabyss
Have you done any comparison with a trie implementation? Tries are known for
good performance on prefix matching; see
[https://en.wikipedia.org/wiki/Trie](https://en.wikipedia.org/wiki/Trie).

~~~
trentnelson
I love tries for larger data sets, and have used them many times for longest
prefix matching problems.

However, they inevitably involve pointer chasing, are quite branchy, and are
not really amenable to SIMD optimizations. They also have no means of doing a
fast-path negative match, which was important for me.

I think you'd really struggle to get anything based on pointer chasing and
binary search to compete with the CPU cycle counts I was seeing toward the end
for the assembly versions (e.g. 6 cycles for negative match, 14 for prefix
match, 21 for worst-case false-positive negative match).

I comment a little bit about it here:
[http://trent.me/is-prefix-of-string-in-table/#other-applicat...](http://trent.me/is-prefix-of-string-in-table/#other-applications)

> The other nice side-effect is that it forces you to pick which table a given
> string should go in. I made this decision by looking at which types occurred
> most frequently, and simply put those in the first table. Less frequent
> types go in subsequent tables.

> I have a hunch there's a lot of mileage in that approach; that is, linear
> scanning an array of string tables until a match is found. There will be an
> inflection point where some form of a log(n) binary tree search will perform
> better overall, but it would be very interesting to see how many strings you
> need to potentially match against before that point is hit.

> Unless the likelihood of matching any given string in your set is completely
> random, by ordering the strings in your tables by how frequently they occur,
> the amortized cost of parsing a chunk of text would be very competitive
> using this approach, I would think.

> A fun experiment for next time, perhaps!

------
billconan
Curious whether GPUs are suitable for similar tasks? Are there any successful
GPU-based JSON or HTTP parsers?

~~~
trentnelson
I'm optimizing for ultra-low latency here; i.e., what's the absolute fastest I
can detect a prefix match against this known string table? The final assembly
version is down in the 6-20 cycle range, depending on what you're doing.

The overhead of dispatching a kernel to the GPU (and the subsequent memory
copy of the search string) would take tens of thousands (if not hundreds of
thousands) of CPU cycles, so it's not really applicable to this single-shot
prefix lookup case.

Where it would be interesting: much larger prefix data sets (say, the names of
all known cities, states, countries, etc.) and identifying where they occur in
an even larger text corpus (like a Wikipedia dump).

In that approach, you'd obviously compare against a multi-threaded CPU
implementation... but it'd be interesting to see the comparison.

------
rurban
This is now the 6th duplicate of this article; click on the (trent.me) link to
see the other discussions. Excellent article, of course.

~~~
trentnelson
I only submitted one of those about two weeks ago, if that’s any consolation
:-)

