Hacker News

> A reference implementation was written in C as a baseline, which simply looped through an array of strings, comparing each one, byte-by-byte, looking for a prefix match.

This is a poor choice; the 1970s-era Boyer-Moore algorithm is also significantly faster for most cases, as it avoids many pointless comparisons. And if you have a dedicated search table, you can optimize the search even more (e.g. pre-sorting keys and targets, or using a more specialized structure like a trie).

I think these approaches would also be amenable to deliberate acceleration with intrinsics, as you did with your brute force search.

Regardless, interesting work.



Well... I mean the first version (IsPrefixOfStringInTable_1) uses the lengths from the string array to improve performance versus the baseline: http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...
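As a rough sketch of the length-based idea (hypothetical code, not the linked IsPrefixOfStringInTable_1): if each entry's length is stored alongside its string, any entry shorter than the input can be rejected without comparing a single byte.

```c
#include <string.h>

/* Hypothetical sketch (not the linked IsPrefixOfStringInTable_1):
 * keep each entry's length next to its string so that any entry
 * shorter than the input is rejected by a length test alone. */
typedef struct {
    const char *string;
    size_t      length;
} TABLE_ENTRY;

static int
FindPrefixMatch(const TABLE_ENTRY *entries, int count, const char *input)
{
    size_t input_len = strlen(input);
    for (int i = 0; i < count; i++) {
        if (entries[i].length < input_len) {
            continue;  /* Too short to start with the input. */
        }
        if (memcmp(entries[i].string, input, input_len) == 0) {
            return i;  /* Entry i starts with (or equals) the input. */
        }
    }
    return -1;
}
```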

The super-simple baseline referred to in the intro was whipped up as a trivial example of the simplest approach you could take for the problem.

Trie was never in contention as I didn't want to rely on any algorithms that involved pointer chasing. There's this comment at the top of StringTable.h:

    The design is optimized for relatively short strings (less than or equal to
    16 chars), and relatively few of them (less than or equal to 16).  These
    restrictive size constraints facilitate aggressive SIMD optimizations when
    searching for the strings within the table, with the goal to minimize the
    maximum possible latency incurred by the lookup mechanism.  The trade-off
    is usability and flexibility -- two things which can be better served by
    prefix trees if the pointer-chasing behavior of said data structures can
    be tolerated.
https://github.com/tpn/tracer/blob/v0.1.12/StringTable2/Stri...

I'm not sure if Boyer-Moore is well suited to the specific use case I wanted to address. It is geared toward looking for occurrences of a pattern in a stream of text bytes, which isn't really what I'm doing at all. I have an input string of known length. I have up to 16 strings in a table, and I want to quickly check if any of those strings "start with or are equal to" my input string, with special emphasis on negative matching being on the fast path, and preserving the location of the match in the table (so it can be cast to an enum, for example).
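To make the SIMD flavor of this concrete, here is a hedged sketch (SSE2, not the actual STRING_TABLE code) of the kind of single-compare filter such a table enables: pack the first character of each of up to 16 table strings into one XMM register and test the input's first character against all 16 slots at once, so a zero mask is an immediate negative match with no per-string loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hedged sketch, not the article's implementation: compare the input's
 * first character against the first characters of up to 16 table
 * strings in a single SIMD compare.  Bit i of the result is set iff
 * table string i starts with the same character as the input; a zero
 * mask is a guaranteed negative match (the fast path). */
static unsigned
FirstCharMatchMask(const char first_chars[16], char input_first_char)
{
    __m128i slots  = _mm_loadu_si128((const __m128i *)first_chars);
    __m128i needle = _mm_set1_epi8(input_first_char);
    __m128i eq     = _mm_cmpeq_epi8(slots, needle);
    return (unsigned)_mm_movemask_epi8(eq);
}
```

A nonzero mask only narrows the candidates; surviving slots would still need a full prefix comparison, and the bit position preserves the table index (so a match can still be cast to an enum).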

So, this article was specifically about the SIMD-oriented STRING_TABLE structure, not about the history of string processing. Another article comparing this approach to tries, Boyer-Moore, or Aho-Corasick would make sense in the context of larger prefix-table data sets (and a larger text corpus being searched, e.g. Wikipedia dumps).


What's the issue with "pointer-chasing"? That it might be inefficient in practice (despite preferable runtime complexity?) due to making poor use of CPU caches?


> That it might be inefficient in practice

That's basically it.

Consider that his algorithm completes 16 searches in 20 cycles, or roughly 5 nanoseconds.

A single fetch from main memory takes around 50 nanoseconds, i.e. is ~10x slower than the entire lookup discussed in this post. Even a fetch from L3 cache is typically 10 nanoseconds, while a fetch from L1 cache is about 1 ns, or ~4 cycles. This SIMD methodology is incredibly FAST.

The best-case scenario of this SIMD code is 6 cycles, which is faster than two L1 fetches. (!!!)

Note: two L1 fetches would likely go into the reorder buffer and be executed out of order efficiently. But... ignore that plz since it undercuts the point I'm trying to make. Lol. Still, you can see how repeatedly forcing even L1 fetches would slow down the code, and pointer chasing is more likely to fill up the reorder buffer, since you cannot rely upon prefetchers to fetch the data ahead of time, or other such tricks of the CPU.

Algorithmic complexity means that, with a big enough dataset, the linear / SIMD methodology will eventually be slower. But when looking at small data sets and optimizing them to the maximum (i.e. searching 16 strings ASAP), linear / sequential scans are king. It's kind of like how insertion sort ends up winning over other sorts in a lot of small cases. CPUs and RAM are incredibly well optimized for sequential data scans.


I think with SIMD type stuff and small data sets (e.g. <=16 strings), conventional big-O complexity analysis can leave a bit to be desired.

Similarly, with SIMD and small data sets, you're going to be faster than anything involving a pointer-chasing data structure. Pointers tend to mean branchiness; they require more back-end resources; they often introduce dependencies that can inhibit out-of-order pipelines.


In itself, pointer chasing does not mean misusing CPU caches (your dataset could fit in cache and you could still be pointer chasing inside it).

The main issue strictly due to pointer chasing is that you risk having long dependency chains in your instruction stream (the address of each load depends on the result of the previous load, whose address depends on the result of the one before, ...).

This is pretty bad for a modern CPU since a lot of the speed comes from overlapping independent computations.
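The chain described above can be made concrete with a small sketch (hypothetical types, not from the code under discussion): a linked-list walk serializes its loads, while an array scan lets the CPU keep many loads in flight.

```c
#include <stddef.h>

/* Hypothetical illustration of the dependency chain described above:
 * each load's address depends on the previous load's result, so the
 * memory accesses cannot overlap. */
struct node { struct node *next; int key; };

static int list_contains(const struct node *head, int key)
{
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* n->next cannot even be fetched until n itself arrives. */
        if (n->key == key)
            return 1;
    }
    return 0;
}

/* By contrast, array addresses are computable up front, so the CPU
 * (and the hardware prefetcher) can issue loads well ahead of use. */
static int array_contains(const int *keys, size_t count, int key)
{
    for (size_t i = 0; i < count; i++) {
        if (keys[i] == key)
            return 1;
    }
    return 0;
}
```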


To the person who downvoted this: you're using downvotes wrongly. I was asking a question about computer science. Down-voting people who ask questions about computer science is not the way to encourage the sort of conversation we're trying to encourage on HN.



