Very interesting post. The best results I knew of for static ordered search [1] were from the 2015 Khuong-Morin paper [2], which is not cited in the article, though the author cites it in the previous post; in their benchmarks, Eytzinger+prefetching performed best. The improvement on top of that is pretty impressive: at the larger sizes, the latencies are basically one cache miss (~50ns).
However, I have a comment on the benchmark: the iterations have no loop dependency on the previous result, so the CPU is allowed to pipeline some loads. This is rarely representative of real workloads, where the key being searched comes from other computations, so this is more a measure of the highest possible reciprocal throughput than of actual latency. It would be interesting to see what happens if they make the key being searched a function of the result of the previous lookup. (EDIT: as gfd points out below, the article actually discusses this!)
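Roughly what I have in mind (just a sketch; search() here stands in for whatever lookup routine is being benchmarked, not their actual harness):

    #include <cstdint>
    #include <vector>

    uint32_t search(uint32_t key);   // the routine under test (placeholder)

    // Mix a bit of the previous result into the next key, so the loads form
    // a dependency chain and we measure latency, not reciprocal throughput.
    uint64_t bench_latency(const std::vector<uint32_t>& queries) {
        uint64_t checksum = 0;
        uint32_t carry = 0;
        for (uint32_t q : queries) {
            uint32_t r = search(q ^ (carry & 1));  // key depends on the last result
            checksum += r;
            carry = r;
        }
        return checksum;  // also keeps the chain from being optimized away
    }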
Also, some of the tricks only work for 32-bit keys, which is what allows the vectorized search within nodes. For larger keys, different layouts may still be preferable.
[1] Ordered search is when you need to look up keys that are not in the dataset and find the predecessor/successor. If you only do exact lookups, a hashtable will almost certainly perform better, though it would require a bit more space.
[2] https://arxiv.org/pdf/1509.05053.pdf
Thanks! You're right, I missed it because it was in the section "Comparison with std::lower_bound" and I thought there would be just some gloating :)
Most published data structure benchmarks have this pitfall and don't discuss it; then, when you try to measure the improvement after plugging the structure into a real workload, it turns out the microbenchmark wasn't predictive. When I see latencies that top out at 50ns when they clearly have to make more than one causal load, I immediately get skeptical.
This is one of the very few articles I've seen that clearly explains this, which makes it even more excellent. I only wish it didn't label the other graphs "Latency" too, or at least put an asterisk there.
Somewhat understandable. Since I sometimes stumble over comparable situations, I've made it a habit to never make graphs by hand: I always write a script that generates them. That makes it easy and quick to go back and regenerate them if you later change your mind about some detail.
If you ever get around to updating the graphs, there's also a bug in where you're drawing the vertical dotted lines (for example, I'd expect 4M to be at 2^22).
Just this past day, I prototyped a Skip List backing a sorted key/value store, and it was super simple to do (compared to AVL / red-black trees).
B-Trees are notorious for their complexity, and so SingleStore (memSQL) reasoned that their move to use Skip Lists to back their tables was justified: https://archive.is/Qhpgn
It would be interesting to see the author attempt faster Skip Lists, as their expertise in this topic seems deep enough.
I’ve been wondering something similar. I recently published a rust rope library based on skip lists. (Ropes are “fat strings” optimized for efficient insertion and deletion of arbitrary characters.)
My rope is backed by a skip list, where each node in the skip list has a gap buffer. Even without the gap buffer my skip list is ~1.5x faster than the next fastest library on cargo (ropey - which is b-tree based). With gap buffers the resulting code is 4x faster. My library is also significantly smaller, both in lines of code and compiled size.
I’ve got another b-tree library I’ve been using for CRDT editing in Diamond Types. Based on this result I’m considering rewriting my btree as a skip list to see how it compares. It’d be a direct apples-to-apples comparison - I’ve already tuned my btree as well as I can. I’m very curious how performance would change.
I'd had, but never gotten at all close to implementing, this idea of something like a mutable RRB tree (think https://docs.rs/im/latest/im/struct.Vector.html but, again, mutable) with somewhat chunky (say, few-KB) gap buffers as the leaves. The tree gives you asymptotically not-terrible behavior with zillions of items, and the gap buffers could help practical performance when there's a lot of locality in your accesses.
I figure merging two lists of items or filtering some items out of a list might have the sort of locality of edits gap buffers handle well, and largish gap buffers could allow iteration to mostly spend time walking through contiguous chunks.
I still don't know how well something like that applies to a non-rope use case, but it's incredibly cool to see you actually did the work to build a hybrid structure out, validated it, and it does very well.
> might have the sort of locality of edits gap buffers handle well, and largish gap buffers could allow iteration to mostly spend time walking through contiguous chunks.
Yeah, this was my insight too. The gap buffers allow the leaves to be larger (because there are fewer memcpys), and large leaves improve the cache behaviour. I was really surprised how well it works in practice - it's crazy fast. Like, 30M keystrokes / second fast, based on some real-world editing traces. I can replay the edits from an 11-page academic paper in 5ms. Doing the same thing naively in javascript takes ~1s - which is 200x slower.
> I still don't know how well something like that applies to a non-rope use case,
My intuition agrees with you. I think the performance gain is mostly thanks to the high locality of edits in a rope. And you'd get that with some data sets in a tree structure, but not all.
But what if the data is semi-random? I'm imagining a BTreeMap-like data structure with gap buffers at the leaves. In this case, you would still get fast O(log n) performance of the btree / skip list. The gap buffer would help a lot less - but it would still halve the number of items memcpy'ed in the average case when inserting or deleting. Performance might still improve a little thanks to the gap buffers.
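For concreteness, the leaves could look something like this (rough sketch of a plain fixed-capacity gap buffer, not code from either library):

    #include <cstring>

    // A fixed-capacity gap buffer: repeated inserts near the same position
    // only pay for moving the gap, not for shifting the whole leaf.
    // (For trivially copyable T only, since it uses memmove.)
    template <typename T, int N>
    struct GapLeaf {
        T data[N];
        int gap_start = 0, gap_end = N;   // physical range [gap_start, gap_end) is empty

        int size() const { return N - (gap_end - gap_start); }

        // Move the gap so that it starts at logical position pos.
        void move_gap(int pos) {
            if (pos < gap_start) {
                int k = gap_start - pos;
                std::memmove(&data[gap_end - k], &data[pos], k * sizeof(T));
                gap_start -= k; gap_end -= k;
            } else if (pos > gap_start) {
                int k = pos - gap_start;
                std::memmove(&data[gap_start], &data[gap_end], k * sizeof(T));
                gap_start += k; gap_end += k;
            }
        }

        // Insert at logical position pos; the caller splits the leaf when full.
        bool insert(int pos, const T& x) {
            if (gap_start == gap_end) return false;   // full
            move_gap(pos);
            data[gap_start++] = x;
            return true;
        }
    };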
> I can replay the edits from an 11 page academic paper in 5ms.
That's super cool.
FWIW, what got me curious about RRB trees and so on was the idea of trying for a structure that's fast in some happy cases and degrades to... maybe slow, but not, say, quadratic. The thread around https://news.ycombinator.com/item?id=24357976 got me started and mentions some other fun stuff, including a "tier vector" with sqrt(N) insert/remove anywhere.
Cool! I was surprised to see ropey as a dependency; I agree those utility functions should be inlined.
I'm currently using ropey to keep the server side state of documents in an LSP server. I recommend you add at least a slice type, as that's typically desirable. But looks great otherwise!
Yeah those utility methods will be inlined anyway by the compiler. But copy+paste (with a comment for attribution) is still probably the right tool here.
I like the idea of adding a slice type. What use case do you have in mind for that?
I don't know about other people, but my tokenizer is parameterized over string slice types. So I tokenize from a slice (be it &str or ropey's RopeSlice) and some of the returned tokens have subslices. This allows tokenization to do no heap allocation.
Note that the article is exclusively about static data. The vast majority of the complexity of B-trees comes from supporting updates, but constructing a balanced B-tree from a sorted list is relatively straightforward.
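For example, filling an implicit layout by an in-order walk is about all it takes (rough sketch of one common way, with a (B+1)-ary implicit layout and INT_MAX padding; not the article's exact code):

    #include <climits>
    #include <vector>

    constexpr int B = 16;                    // keys per node (one 64-byte cache line)

    std::vector<int> btree;                  // nblocks * B keys, node-major
    int nblocks;

    // In-order walk over the implicit tree: node k's i-th child
    // lives at node k * (B + 1) + i + 1.
    void fill(const std::vector<int>& a, int k, int& t) {
        if (k >= nblocks) return;
        for (int i = 0; i < B; i++) {
            fill(a, k * (B + 1) + i + 1, t);
            btree[k * B + i] = (t < (int) a.size()) ? a[t++] : INT_MAX;  // pad the tail
        }
        fill(a, k * (B + 1) + B, t);         // rightmost child
    }

    void build(const std::vector<int>& a) {  // a must be sorted
        nblocks = ((int) a.size() + B - 1) / B;
        btree.assign((size_t) nblocks * B, INT_MAX);
        int t = 0;
        fill(a, 0, t);
    }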
Thanks, I used that neat Knuth coin-flip trick from your code.
Why does the comment in the file say that the lock-free implementation is half as fast as the 'sequential' one? Where does the slowness come from? Is it all those while(1) loops waiting to win races on atomic operations?
I agree it's a beautiful data structure, but this imagined high school student would be one who is comfortable with (learning about) pointers and locks. Those exist of course, in fact we probably both qualified back in the day.
My point, however, is that that implies some major selection bias towards high-school students who love programming for programming's sake, not the average high-school student who learned a bit of Python and JavaScript in their introduction to computer science class or because they want to write a game (which is a totally valid reason to learn programming). I don't really see why B-trees as a data structure would be that much more difficult to grok if we're already talking about that type of student?
Read the memSQL blog I linked to; it talks about why not B-Trees for their workloads. It also points out that the concurrent and lock-free nature of Skip Lists was overall a win, despite cache-locality (and other such) concerns. Though, if one can tame allocs (ref: https://archive.is/kwhnG), then they may have cache-locality (just don't expect it to be at Judy Array levels).
I wouldn't say RB-tree rebalancing is much simpler. On the other hand, in B-trees there are a whole lot of poorly documented nuances; binary trees are more intuitive. But at the end of the day, the implementation effort is perfectly worth it. https://www.scylladb.com/2021/11/23/the-taming-of-the-b-tree...
> we combine the vector masks using the packs instruction and readily extract it using movemask just once
Blend instructions are generally slightly faster than pack instructions - _mm256_blend_epi16 in this case. The order is not the same, so the permute() function needs a different shuffle vector, but otherwise it should work the same way.
> B-tree from Abseil to the comparison, which is the only widely-used B-tree implementation I know of
Hmm... using blend is actually a good suggestion. You can pre-shuffle elements [1 3 5 7] [2 4 6 8], do a comparison against the search key, and then blend the low 16 bits of the first register with the high 16 bits of the second. Will add that to the article, thanks!
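Something like this, I think (untested sketch; the de-interleaved even[]/odd[] arrays and the function name are just for illustration):

    #include <immintrin.h>   // AVX2 + BMI1

    // 16 sorted 32-bit keys per node, stored de-interleaved at build time:
    // even[] = keys 0, 2, ..., 14 and odd[] = keys 1, 3, ..., 15
    // (both 32-byte aligned). Returns the index of the first key > x.
    unsigned rank16_blend(int x, const int* even, const int* odd) {
        __m256i xv = _mm256_set1_epi32(x);
        __m256i a  = _mm256_load_si256((const __m256i*) even);
        __m256i b  = _mm256_load_si256((const __m256i*) odd);

        __m256i ca = _mm256_cmpgt_epi32(a, xv);   // all-ones where key > x
        __m256i cb = _mm256_cmpgt_epi32(b, xv);

        // Low 16 bits of each mask in ca, high 16 bits of each mask in cb:
        // this re-interleaves the masks back into key order.
        __m256i m = _mm256_blend_epi16(ca, cb, 0b10101010);

        unsigned mask = (unsigned) _mm256_movemask_epi8(m);  // 2 bits per key
        return _tzcnt_u32(mask) >> 1;
    }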
Speaking about pre-shuffling and blending, you can try doing twice as much work with vectors. Pre-shuffle slices of 32 keys into [ 0, 4, 8, 12 .. ], [ 1, 5, 9, 13.. ] etc, then when searching combine 0-2 and 1-3 pairs of vectors with _mm256_blend_epi16(a, b, 0b10101010), then merge with _mm256_blendv_epi8 using _mm256_set1_epi16((short)0xFF00) for the blending mask.
This way _mm256_movemask_epi8 will get you a bitmap of 32 bits for 32 integer keys you can scan with _mm_tzcnt_32. Especially good on AMD where vpblendvb completes in a single cycle, Zen 3 can even run 2 of them per clock.
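In code, roughly (untested sketch; assumes the node's 32 keys are pre-shuffled so the r-th vector holds keys r, r+4, ..., r+28, padded to 32 keys and 32-byte aligned):

    #include <immintrin.h>   // AVX2 + BMI1

    // Returns the index (0..32) of the first key > x among the 32 keys.
    unsigned rank32_blend(int x, const int* node) {
        __m256i xv = _mm256_set1_epi32(x);
        __m256i c[4];
        for (int r = 0; r < 4; r++) {
            __m256i v = _mm256_load_si256((const __m256i*) (node + 8 * r));
            c[r] = _mm256_cmpgt_epi32(v, xv);             // all-ones where key > x
        }
        // First level: interleave 16-bit halves of the 0-2 and 1-3 pairs.
        __m256i m02 = _mm256_blend_epi16(c[0], c[2], 0b10101010);
        __m256i m13 = _mm256_blend_epi16(c[1], c[3], 0b10101010);
        // Second level: low bytes from m02, high bytes from m13.
        __m256i m = _mm256_blendv_epi8(m02, m13, _mm256_set1_epi16((short) 0xFF00));
        unsigned mask = (unsigned) _mm256_movemask_epi8(m);  // 1 bit per key, in key order
        return _tzcnt_u32(mask);
    }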
I've already implemented a 32-key version. But, as mentioned in the article, it's not faster (in terms of throughput) because it needs to fetch two cache lines.
>The numeration we will use is actually half a millennium old, and chances are, you already know it.
Then, it is revealed:
>Michaël Eytzinger is a 16th-century Austrian nobleman known for his work on genealogy, particularly for a system for numbering ancestors called ahnentafel (German for “ancestor table”).
Probably not a lot, but the idea of using 2k and 2k+1 as the indices of the children is used in a lot of basic data structures taught in CS courses, e.g. for binary heaps.
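For example, with that numbering a branchless lower_bound over an Eytzinger-ordered array fits in a few lines (sketch from memory, not the article's exact code):

    #include <vector>

    // t[1..n] holds the keys in Eytzinger order; t[0] is unused padding.
    // Returns the index of the first key >= x, or 0 if every key is < x.
    int eytzinger_lower_bound(const std::vector<int>& t, int x) {
        int n = (int) t.size() - 1;
        int k = 1;
        while (k <= n)
            k = 2 * k + (t[k] < x);   // children of node k are 2k and 2k + 1
        k >>= __builtin_ffs(~k);      // cancel the trailing "go right" turns
        return k;
    }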
Good point, though technically a hashtable can only test for existence and can't find the lower bound. Also, a hashtable is likely to require more memory.
Btw, if you need ordering for ranges, CHM minimal perfect hashes keep the words sorted in the array, so you can easily use it for ranges. But I haven't yet tested whether CHM is faster than this optimized B-tree. CHM is usually pretty slow and is more recommended for >100,000 entries.
In general the "best" algorithm seems dependent on specific usage patterns. A different angle is to study or sample actual usage patterns and automatically adjust the structure/algorithm based on that. The indexes can be readjusted during off/slow hours. (If you don't have off hours, then such a system may not be the best fit.)
I guess for static data a carefully crafted trie would probably be even faster, as it just has to compare a byte or more in each node and is thus O(K), where K is the key length, instead of O(log n), where n is the number of keys.
If it isn't static the height optimized trie by Robert Binna is the most advanced trie, I guess. It might also be a good fit for a static trie as it keeps the height to a minimum.
This is very cool. Imagine using this for checking if a group ID in an ACL is present in a subject's group list. Binary search is fast, but this might be faster.