Skip lists and splay trees, two frequent suggestions on this thread, are "lesser known" for a reason. Skip lists, because for any given skip list implementation there is almost certainly an encoding of balanced binary trees that outperforms it, even under concurrency. Splay trees, because every balanced binary tree outperforms them; what you get in exchange for not writing and testing balancing code is the drawback that even reading the tree modifies the data structure.
for any given skip list implementation there is most probably an encoding of balanced binary trees that outperforms it
Not just probably, absolutely. Skip lists are probably pretty balanced, which is always less balanced than actually pretty balanced. Their sole claim to fame is their simplicity in implementation compared to actual self-balancing trees.
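That simplicity claim is easy to demonstrate. Here's a minimal sequential skip list sketch in C; the names, the MAX_LEVEL cap, and the p = 1/2 level distribution are my own choices, and this version is not safe for concurrent use:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_LEVEL 16

typedef struct node {
    int key;
    struct node *next[MAX_LEVEL];   /* forward pointers, one per level */
} node;

typedef struct {
    node head;    /* sentinel; its next[] pointers start each level */
    int level;    /* highest level currently in use */
} skiplist;

/* Flip coins: each node reaches level i+1 with probability 1/2. */
static int random_level(void) {
    int lvl = 1;
    while (lvl < MAX_LEVEL && (rand() & 1)) lvl++;
    return lvl;
}

void sl_init(skiplist *sl) {
    memset(&sl->head, 0, sizeof sl->head);
    sl->level = 1;
}

void sl_insert(skiplist *sl, int key) {
    node *update[MAX_LEVEL];
    node *x = &sl->head;
    /* Walk down the levels, remembering the rightmost node before key. */
    for (int i = sl->level - 1; i >= 0; i--) {
        while (x->next[i] && x->next[i]->key < key) x = x->next[i];
        update[i] = x;
    }
    int lvl = random_level();
    if (lvl > sl->level) {
        for (int i = sl->level; i < lvl; i++) update[i] = &sl->head;
        sl->level = lvl;
    }
    node *n = calloc(1, sizeof *n);
    n->key = key;
    for (int i = 0; i < lvl; i++) {       /* splice in at each level */
        n->next[i] = update[i]->next[i];
        update[i]->next[i] = n;
    }
}

int sl_contains(const skiplist *sl, int key) {
    const node *x = &sl->head;
    for (int i = sl->level - 1; i >= 0; i--)
        while (x->next[i] && x->next[i]->key < key) x = x->next[i];
    x = x->next[0];
    return x && x->key == key;
}
```

Note there's no rebalancing code anywhere: the coin flips in random_level are the entire "balancing" story, which is exactly why the structure is only probably pretty balanced.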
Not so! There's one thing you can do with skip lists that I don't know of any easy way to do with other data structures. Suppose you want a priority queue, and you have a bunch of cores, and these cores want to insert into the queue and remove the minimum element concurrently. How do you implement this to allow fast concurrent access from a lot of threads?
First, there's the approach everybody remembers from Intro to Algorithms: use a binary min-heap. It's guaranteed to be balanced, so you get O(lg n) time insertions and delete-the-minimum operations, with low constant factors. Nice! But how do you make it concurrent? You can put a lock on the whole thing and only let one thread use it at once, but that's slow. You could use fancy fine-grained locking, but there will still be inter-thread memory conflicts arising from the heapify operations you need to maintain the heap invariant. There has been some work on this, and they've come up with some decent ideas, but it still has scaling problems.
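The put-a-lock-on-the-whole-thing version looks something like this sketch (names and layout are my own; the point is that every push and pop serializes on one mutex, which is exactly the scaling problem):

```c
#include <pthread.h>
#include <stdlib.h>

/* Coarse-grained locking: correct, simple, and a bottleneck,
   because only one thread can touch the heap at a time. */
typedef struct {
    int *a;
    size_t n, cap;
    pthread_mutex_t lock;
} heap;

void heap_init(heap *h, size_t cap) {
    h->a = malloc(cap * sizeof *h->a);
    h->n = 0;
    h->cap = cap;
    pthread_mutex_init(&h->lock, NULL);
}

void heap_push(heap *h, int v) {
    pthread_mutex_lock(&h->lock);
    size_t i = h->n++;
    h->a[i] = v;
    while (i > 0 && h->a[(i - 1) / 2] > h->a[i]) {   /* sift up */
        int t = h->a[i];
        h->a[i] = h->a[(i - 1) / 2];
        h->a[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
    pthread_mutex_unlock(&h->lock);
}

/* Caller must ensure the heap is nonempty. */
int heap_pop_min(heap *h) {
    pthread_mutex_lock(&h->lock);
    int min = h->a[0];
    h->a[0] = h->a[--h->n];
    size_t i = 0;
    for (;;) {                                        /* sift down */
        size_t l = 2 * i + 1, r = 2 * i + 2, s = i;
        if (l < h->n && h->a[l] < h->a[s]) s = l;
        if (r < h->n && h->a[r] < h->a[s]) s = r;
        if (s == i) break;
        int t = h->a[i];
        h->a[i] = h->a[s];
        h->a[s] = t;
        i = s;
    }
    pthread_mutex_unlock(&h->lock);
    return min;
}
```

Even with fine-grained per-node locks instead of this single mutex, the sift-up/sift-down paths all cross the root, so contention concentrates there.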
Now look at skip lists. A skip list is a randomized sorted-list data structure, and it claims to be, as you put it, probably pretty balanced. Various threads can insert concurrently without breaking that "probably pretty balanced" property. The memory read and write sets are very local, and it's possible to do all this with lock-free synchronization. The end result is a priority queue data structure that scales to hundreds of cores. And the code doesn't fry your brain, which is a plus. There's a pretty neat paper about it here:
By the way, if you happen to be using a processor with hardware transactional memory support (you aren't, yet), then this code becomes even easier to write, as you don't have to worry about how to do lock-free synchronization. I almost felt cheated by how simple it was.
The general lesson is that randomized algorithms often remove the need for synchronization between concurrent processes. For example, choosing ids at random from a large space rather than coordinating to choose unique ids.
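A sketch of the random-id idea, with the birthday-bound math in the comments (rand() here is a stand-in; real code would use a proper per-thread RNG):

```c
#include <stdint.h>
#include <stdlib.h>

/* Each thread draws its id independently from a 64-bit space;
   no coordination, no shared counter. With k ids drawn, the
   collision probability is roughly k^2 / 2^65 (birthday bound):
   for a million ids that's about 2.7e-8, which is usually an
   acceptable trade for removing all synchronization. */
uint64_t fresh_id(void) {
    uint64_t id = 0;
    for (int i = 0; i < 4; i++)            /* 4 x 16 random bits */
        id = (id << 16) | (uint64_t)(rand() & 0xFFFF);
    return id;
}
```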
Only in simulations, I'm afraid. As far as I know, the only processors to have HTM support are Sun's Rock processor (now discontinued by Oracle, I think) and the Vega chips from Azul Systems. I don't have access to either of these, and neither of them has the HTM enhancements that I'm studying. The state of the art in research is a lot more complex than the playing-it-safe state of the art in production chips.
That said, future HTM systems have a lot of potential, and I think we'll see them come into wider use eventually. The main problem with them seems to be that existing software ecosystems aren't written with HTM in mind, so you get pathological memory access patterns slowing things down. But with some minor enhancements to the HTM design (like a load instruction which immediately adds its destination address to the write set of a transaction) and compiler and runtime support, and a reasonable set of concurrent data structures, HTM can absolutely fly.
You can find blog posts and Stack Overflow articles talking about lock overhead in rb-trees versus skip lists, and just looking at a diagram you can intuitively sense where that idea comes from. But these kinds of analyses are usually flawed by the fact that they assume the most naive possible encoding of tree nodes: a struct holding a void* value and two tree* child pointers.
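To make "naive encoding" concrete, here's a sketch contrasting the three-pointer struct those analyses assume with one denser alternative, an index-linked node pool (field names and the pool size are illustrative, not from any particular library):

```c
#include <stdint.h>

/* The naive encoding: three pointers per node, 24 bytes on a
   64-bit machine before you've stored a key anywhere. */
struct naive_node {
    void *value;
    struct naive_node *left, *right;
};

/* One alternative: allocate nodes from a contiguous pool and link
   them with 32-bit indices instead of pointers. The links shrink
   by half and hot nodes end up close together in memory. */
#define NIL UINT32_MAX            /* index marking an absent child */

typedef struct {
    int32_t  key;
    uint32_t left, right;         /* pool indices, not pointers */
} packed_node;                    /* 12 bytes */

typedef struct {
    packed_node pool[1024];
    uint32_t next_free;
    uint32_t root;
} tree_pool;

uint32_t pool_alloc(tree_pool *t, int32_t key) {
    uint32_t i = t->next_free++;
    t->pool[i] = (packed_node){ key, NIL, NIL };
    return i;
}
```

This is only one of many tighter encodings, but it shows why a benchmark against the 24-byte struct isn't the last word on trees.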
I don't have an answer for you (I'm just making the point that when people argue about how inefficient trees are, they're usually assuming a struct with two edge pointers), but a more direct answer to your question might be:
Judy arrays came up once too; there's a really excellent critique of Judy arrays here:
http://www.nothings.org/computer/judy/
(Long story short: you can get comparable performance from a straightforward hash table, and Judy Arrays have to be tuned to the microarchitecture).
Favorite data structure not cited here:
Aguri trees, which marry a bounded-size radix trie (like you'd use in a software routing table) to an LRU list, and automatically synthesize aggregates (like, 10.0.0.0/16 from 1,000 observations across all IPs) from the pattern of insertion. They're best known in traffic analysis, but we've used them in runtime memory analysis as well.