I'd love to see some benchmarks/comparisons on variable-length strings. For strings with random lengths between 10 and 30, gxhash was significantly faster than xxhash for me. I would assume that's because it processes chunks of up to 32 chars at a time, avoiding branch misses.
Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.
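Something like the following is what I have in mind (a dependency-free Rust sketch; the stand-in hash is just FNV-1a-style, and you'd swap in the gxhash/xxhash one-shot calls for the real comparison):

```rust
use std::time::Instant;

/// Time a hash function over many strings of random length in [10, 30].
fn bench(name: &str, inputs: &[Vec<u8>], hash: impl Fn(&[u8]) -> u64) {
    let start = Instant::now();
    let mut acc = 0u64;
    for s in inputs {
        acc = acc.wrapping_add(hash(s));
    }
    // Print the checksum so the hashing can't be optimized away.
    println!("{name}: {:?} (checksum {acc})", start.elapsed());
}

fn main() {
    // Tiny LCG so the sketch has no dependencies.
    let mut state = 0x243F6A8885A308D3u64;
    let mut rng = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        state
    };
    let inputs: Vec<Vec<u8>> = (0..1_000_000)
        .map(|_| {
            let len = 10 + (rng() % 21) as usize; // lengths 10..=30
            (0..len).map(|_| rng() as u8).collect()
        })
        .collect();

    // Stand-in FNV-1a-style hash; swap in the gxhash / xxhash one-shot
    // functions here for the real measurement.
    bench("stand-in hash", &inputs, |s| {
        s.iter()
            .fold(0xcbf29ce484222325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
    });
}
```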
Nice overview of sorting methods, thanks for sharing!
I also looked a bit into radix and distribution sort at some point over the past year, but in the end high performance sorting is actually too big of a thing to just do quickly on the side, as your post well shows :")
In fact I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.
Will definitely refer back to it once I'm looking at sorting again in more detail at some point!
I got stuck on sorting too; I was working on SingeliSort (https://github.com/mlochbaum/SingeliSort) for a while. The basic performance is there, but I need to get serious about testing before using it. The radix sort and counting sort should be very solid, though. The approach is about the same as the C code currently used in CBQN, linked below. The main complication is reducing constant overhead for shorter arrays, with a small count type and better prefix sums: interleaved SWAR in SingeliSort, since it targets a generic architecture, and shared SIMD utilities in CBQN.
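For readers who haven't written one, the core of a counting/radix pass with an explicit prefix sum is roughly this (a plain scalar Rust sketch over u32 keys; the real code additionally shrinks the count type for short arrays and vectorizes the prefix sum, as described above):

```rust
/// One least-significant-byte counting-sort pass over 32-bit keys.
/// A full LSD radix sort repeats this for each of the 4 bytes.
fn radix_pass(src: &[u32], dst: &mut [u32], shift: u32) {
    let mut counts = [0usize; 256];
    // Histogram the current byte.
    for &x in src {
        counts[((x >> shift) & 0xFF) as usize] += 1;
    }
    // Exclusive prefix sum: counts[b] becomes the first output slot for byte b.
    let mut sum = 0;
    for c in counts.iter_mut() {
        let this = *c;
        *c = sum;
        sum += this;
    }
    // Scatter elements into their buckets, preserving order (stable).
    for &x in src {
        let b = ((x >> shift) & 0xFF) as usize;
        dst[counts[b]] = x;
        counts[b] += 1;
    }
}

fn radix_sort(v: &mut Vec<u32>) {
    let mut buf = vec![0u32; v.len()];
    for pass in 0..4u32 {
        radix_pass(v, &mut buf, 8 * pass);
        std::mem::swap(v, &mut buf);
    }
}
```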
My email is in my GitHub profile; feel free to contact me any time if you'd like to talk about algorithms!
The final goal is to index DNA, say a human genome, or a bunch of them, and this is static data.
Then, as new DNA comes in (e.g. as it is read by a DNA sequencer), we can efficiently query it against the index built on the reference genome, and this reference is often fixed.
Yup. I've now added to the future work section that the plan is to use this to speed up suffix array searching. Suffix arrays are pretty regularly used to build indices over, say, a human genome, that can then be queried.
And since DNA sequencers are still increasing their throughput and accuracy, it's always good to have faster algorithms as well.
Thanks! I did spend some time fiddling with CSS to add the yellow highlights so that the titles stand out a bit more, and to improve the code blocks, but otherwise it's a relatively common theme for Hugo. (I think it's linked in the footer.)
You're absolutely right. At some point I spent so long working on the plotting code that I really couldn't be bothered anymore to make it prettier. Hopefully I eventually find some motivation to fix this.
At some point I was playing around with interpolation search on the human genome dataset, and it was really really terrible.
It has some very bad worst-case behaviour when the input data has big plateaus and you're searching for something around the edges of a plateau. This can be fixed by only using interpolation for the first few iterations and doing binary search for as long as needed afterwards, but that kills a lot of the predictability of the current approach. So while I really like it in theory, in practice I'm not quite convinced it can be made to work efficiently.
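The hybrid I mean looks roughly like this (a sketch, not the code I actually benchmarked): a few interpolation steps to narrow the range, then plain binary search so plateaus can't blow up the iteration count.

```rust
/// First index i with a[i] >= x ("lower bound") on a sorted slice.
/// A few interpolation steps narrow the range quickly when the data is
/// roughly uniform; the binary-search fallback bounds the damage on plateaus.
fn hybrid_lower_bound(a: &[u64], x: u64) -> usize {
    let (mut lo, mut hi) = (0usize, a.len()); // candidate range a[lo..hi]
    for _ in 0..3 {
        if hi - lo <= 1 {
            break;
        }
        let (l, r) = (a[lo], a[hi - 1]);
        if x <= l {
            return lo; // everything from lo onwards is >= x
        }
        if x > r {
            return hi; // everything before hi is < x
        }
        // Guess the position of x by linear interpolation (r > l holds here).
        let frac = (x - l) as u128 * (hi - 1 - lo) as u128 / (r - l) as u128;
        let mid = lo + frac as usize;
        if a[mid] < x {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    // Finish with plain binary search on whatever range is left.
    lo + a[lo..hi].partition_point(|&y| y < x)
}
```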
There are tricks: if you're searching uniformly distributed data it's unbeatable, guaranteed to work in 5 hops or fewer. After 5 hops, it's better to switch to binary search.
If the data is not uniformly distributed, one trick is to sort by a hash of the element (store the hash with the element or calculate it on the fly).
But yeah, without those adjustments it has a worst case of O(n).
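For concreteness, the hash trick for exact-match lookups would look roughly like this (a sketch; key_hash is whatever decent 64-bit hash you prefer, and partition_point stands in for the interpolation search to keep it short):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn key_hash<T: Hash>(x: &T) -> u64 {
    let mut h = DefaultHasher::new();
    x.hash(&mut h);
    h.finish()
}

/// Build a table of (hash, value) pairs sorted by hash. The hashes are
/// roughly uniform over u64, so interpolation search over them behaves
/// well even if the values themselves are badly distributed.
fn build_table<T: Hash>(values: Vec<T>) -> Vec<(u64, T)> {
    let mut t: Vec<_> = values.into_iter().map(|v| (key_hash(&v), v)).collect();
    t.sort_by_key(|&(h, _)| h);
    t
}

/// Exact-match lookup: search for the hash, then confirm the value.
fn contains<T: Hash + Eq>(table: &[(u64, T)], x: &T) -> bool {
    let h = key_hash(x);
    let i = table.partition_point(|&(eh, _)| eh < h);
    table[i..].iter().take_while(|e| e.0 == h).any(|e| &e.1 == x)
}
```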
Assuming this is going to save someone 10ns per query in the end and I spent 150h on coding and writing this up, we only(?) need around 10^14 queries to make it worth it!
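(Back of the envelope: 150 h ≈ 5.4×10^5 s, and 5.4×10^5 s / 10 ns ≈ 5.4×10^13 ≈ 10^14 queries.)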
But yes, as new technologies stack up, we achieve exponential growth in throughput in the end, which usually enables new science :)
You're right. While I wasn't very explicit about this (because algorithmica and the linked paper spend plenty of words on it already), the code snippet for Eytzinger does indeed do both the prefetching and the formula.
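For readers who don't want to click through, the binary Eytzinger search is roughly this (a Rust sketch; the prefetch intrinsic is platform-specific, so it's only indicated in a comment):

```rust
/// Lower bound on a 1-indexed Eytzinger-layout array `a` (a[0] is padding).
/// Returns the Eytzinger index of the first element >= x, or 0 if none.
fn eytzinger_lower_bound(a: &[u32], x: u32) -> usize {
    let mut k = 1;
    while k < a.len() {
        // The real code prefetches around a[16 * k] here (the cache line
        // holding k's great-great-grandchildren) to hide memory latency.
        k = 2 * k + (a[k] < x) as usize;
    }
    // The formula: each left turn appended a 0 bit, each right turn a 1 bit.
    // The answer is the last node where we turned left; after that we only
    // turned right, so strip the trailing 1s plus that final 0.
    k >> (k.trailing_ones() + 1)
}
```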
In fact, I'm pretty sure a similar formula could be made to work for higher branching factors, although it would surely be slower. (Probably it depends on the number of times 17 divides the final index or so, which is not great, but with B=15 it would be the number of factors of 16, which is easy again.)
The more annoying issue with not storing everything in the final layer is that we have to keep a running answer for each query as we go down the tree, which I suspect will add measurable runtime overhead. But maybe that's a tradeoff one is willing to make to avoid the 6.25% overhead.