
What are the lesser known but cool data structures? - limist
http://stackoverflow.com/questions/500607/what-are-the-lesser-known-but-cool-data-structures
======
tptacek
Skip lists and splay trees, two frequent suggestions on this thread, are
"lesser known" for a reason: skip lists because for any given skip list
implementation there is most probably an encoding of balanced binary trees
that outperforms it, even in concurrency, and splay trees because _every_
balanced binary tree outperforms them --- and, in order to avoid writing and
testing balancing code, you have to trade off the fact that _reading_ the tree
modifies the data structure.

Judy arrays came up once too; there's a really excellent critique of Judy
arrays here:

<http://www.nothings.org/computer/judy/>

(Long story short: you can get comparable performance from a straightforward
hash table, and Judy arrays have to be tuned to the microarchitecture.)

Favorite data structure not cited here:

Aguri trees, which marry a bounded-size radix trie (like you'd use in a
software routing table) to an LRU list, and automatically synthesize
aggregates (like, 10.0.0.0/16 from 1,000 observations across all IPs) from the
pattern of insertion. They're best known in traffic analysis, but we've used
them in runtime memory analysis as well.
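I haven't seen a compact reference implementation around, so here's a very
rough toy in Python (names and details are all mine; the real Aguri keeps a
radix trie with an LRU policy rather than a flat counter table). It only
illustrates the aggregation idea: when the table overflows, the least-seen
prefix is folded into its parent, which is how aggregates get synthesized:

```python
from collections import Counter

def parent(prefix):
    """Shorten an IPv4 prefix by 8 bits, e.g. '10.0.0.1/32' -> '10.0.0.0/24'."""
    addr, length = prefix.split("/")
    new_len = max(int(length) - 8, 0)
    keep = addr.split(".")[:new_len // 8]
    return ".".join(keep + ["0"] * (4 - len(keep))) + "/" + str(new_len)

class ToyAguri:
    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self.counts = Counter()

    def observe(self, ip):
        self.counts[ip + "/32"] += 1
        while len(self.counts) > self.max_entries:
            # Aggregation step: fold the least-seen prefix into its parent.
            prefix, n = min(self.counts.items(), key=lambda kv: kv[1])
            up = parent(prefix)
            if up == prefix:        # already at 0.0.0.0/0, nothing coarser
                break
            del self.counts[prefix]
            self.counts[up] += n
```

Feed it a burst of addresses from one /24 and the individual /32 entries
collapse into a single aggregate, while the total count is preserved.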

~~~
sesqu
_for any given skip list implementation there is most probably an encoding of
balanced binary trees that outperforms it_

Not just probably, absolutely. Skip lists are probably pretty balanced, which
is always less balanced than actually pretty balanced. Their sole claim to
fame is their simplicity in implementation compared to actual self-balancing
trees.

~~~
pjscott
Not so! There's one thing you can do with skip lists that I don't know of any
easy way to do with other data structures. Suppose you want a priority queue,
and you have a bunch of cores, and these cores want to insert into the queue
and remove the minimum element concurrently. How do you implement this to
allow fast concurrent access from a lot of threads?

First, there's the approach everybody remembers from Intro to Algorithms: use
a binary min-heap. It's guaranteed to be balanced, so you get O(lg n) time
insertions and delete-the-minimum operations, with low constant factors. Nice!
But how do you make it concurrent? You can put a lock on the whole thing and
only let one thread use it at once, but that's slow. You could use fancy fine-
grained locking, but there will still be inter-thread memory conflicts arising
from the heapify operations you need to maintain the heap invariant. There has
been some work on this, and they've come up with some decent ideas, but it
still has scaling problems.

Now look at skip lists. A skip list is a randomized sorted-list data
structure, and it claims to be, as you put it, probably pretty balanced.
Various threads can
insert concurrently without breaking that "probably pretty balanced" property.
The memory read- and write-sets are very local, and it's possible to do all
this with lock-free synchronization. The end result is a priority queue data
structure that scales to hundreds of cores. And the code doesn't fry your
brain, which is a plus. There's a pretty neat paper about it here:

<http://www-cs-students.stanford.edu/~itayl/ipdps.pdf>

By the way, if you happen to be using a processor with hardware transactional
memory support (you aren't, yet), then this code becomes even easier to write,
as you don't have to worry about how to do lock-free synchronization. I almost
felt cheated by how simple it was.
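To give a feel for why delete-min is so cheap, here's a single-threaded
Python sketch (my own simplification, not the paper's algorithm; the
lock-free version keeps the same layout but replaces the pointer updates
with CAS loops):

```python
import random

class Node:
    __slots__ = ("key", "next")
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level   # forward pointers, one per level

class SkipListPQ:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel before everything

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, key):
        # Find the rightmost node before `key` on every level.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.next[i] is not None and node.next[i].key < key:
                node = node.next[i]
            update[i] = node
        new = Node(key, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def delete_min(self):
        # The minimum is always the first node on the bottom level.
        first = self.head.next[0]
        if first is None:
            return None
        for i in range(len(first.next)):  # unlink from every level it's on
            self.head.next[i] = first.next[i]
        return first.key
```

Note how both operations only touch a handful of pointers near the front of
the list, which is what makes the memory read/write sets so local.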

~~~
scott_s
What and where are you working that you have access to a processor with
hardware transactional memory support?

~~~
pjscott
Only in simulations, I'm afraid. As far as I know, the only processors to have
HTM support are Sun's Rock processor (now discontinued by Oracle, I think) and
the Vega chips from Azul Systems. I don't have access to either of these, and
neither of them has the HTM enhancements that I'm studying. The state of the
art in research is a lot more complex than the playing-it-safe state of the
art in production chips.

That said, future HTM systems have a lot of potential, and I think we'll see
them come into wider use eventually. The main problem with them seems to be
that existing software ecosystems aren't written with HTM in mind, so you get
pathological memory access patterns slowing things down. But with some minor
enhancements to the HTM design (like a load instruction which immediately adds
its destination address to the write set of a transaction) and compiler and
runtime support, and a reasonable set of concurrent data structures, HTM can
absolutely fly.

------
panic
Nobody mentioned the unrolled linked list, a simple, useful, and often-
overlooked data structure.

<http://en.wikipedia.org/wiki/Unrolled_linked_list>

The idea is the same as a linked list, but instead of just one element in each
node, it stores an entire array. This simple change fixes the two biggest
problems with linked lists — memory overhead and cache efficiency. It's also
easy to tweak to be more "array-like" or more "list-like" as needed.
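A minimal Python sketch of the idea (append-only, names my own; real
implementations also split and merge nodes on insert/delete to keep each
array at least half full):

```python
class UnrolledList:
    """A singly linked list of fixed-capacity arrays. Appends go into the
    tail node's array; when it fills up, a new node is started."""
    NODE_CAPACITY = 4  # tune larger for "array-like", smaller for "list-like"

    class _Node:
        def __init__(self):
            self.items = []
            self.next = None

    def __init__(self):
        self.head = self.tail = self._Node()

    def append(self, value):
        if len(self.tail.items) == self.NODE_CAPACITY:
            node = self._Node()
            self.tail.next = node
            self.tail = node
        self.tail.items.append(value)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield from node.items   # one pointer chase per CAPACITY elements
            node = node.next
```

With a capacity of k you do one pointer dereference per k elements instead of
one per element, which is where the cache behavior and memory overhead win
comes from.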

~~~
pjscott
At one point, Python tried to move to something similar for their standard
list data structure:

<http://www.python.org/dev/peps/pep-3128/>

The BList is similar to a B+Tree. It was meant to act like an array for small
lists, while making operations like append, concatenation, and slicing faster.
It was a pretty cool combination of arrays (for memory compactness and good
cache performance and small constant factors) with trees, for the asymptotic
improvements they can bring.

Sadly, it didn't take off, as it would have broken backward compatibility with
extension modules.

~~~
_delirium
Some lisp and scheme implementations have used it as well, partly spurred by
this 1994 paper: <http://portal.acm.org/citation.cfm?id=182453>

(I don't know offhand if any widely used ones currently do, though.)

------
Darmani
ZDDs. A ZDD is a DAG of outdegree two where each node represents a set of
subsets of some arbitrarily-ordered domain. Each node is labeled with the
smallest element appearing in its subsets; its left child holds the subsets
that lack that element, and its right child holds the subsets that contain it.

This allows for tremendous compression of search spaces -- one example Knuth
gave in a talk I went to a few weeks ago was representing all five-letter
words in English (represented as subsets of {a_1, a_2, ..., z_4, z_5},
where e.g.: {k_1, n_2, u_3, t_4, h_5} represents "knuth"), and efficiently
making queries such as finding all words that, when a 'b' is replaced with an
'o', yield another word. More impressive to me was representing all of a
certain class of tilings with only a few hundred thousand nodes, when the
total number of such tilings was, IIRC, on the order of 10^20.
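A minimal Python sketch of the node semantics described above (my own toy
encoding: BOT is the empty family, TOP is the family containing only the
empty set, and the zero-suppression rule collapses any node whose
"with this element" branch is empty):

```python
from functools import lru_cache

BOT = "BOT"   # the empty family (no subsets at all)
TOP = "TOP"   # the family containing only the empty set

def node(v, lo, hi):
    """Node labeled v: `lo` = subsets without v, `hi` = subsets with v."""
    return lo if hi == BOT else (v, lo, hi)   # zero-suppression rule

def singleton(elements):
    """ZDD for the family containing exactly one subset."""
    z = TOP
    for v in sorted(elements, reverse=True):
        z = node(v, BOT, z)
    return z

@lru_cache(maxsize=None)
def union(a, b):
    if a == BOT: return b
    if b == BOT: return a
    if a == b: return a
    if a == TOP: a, b = b, a
    if b == TOP:                      # add the empty set to family a
        return node(a[0], union(a[1], TOP), a[2])
    if a[0] < b[0]:
        return node(a[0], union(a[1], b), a[2])
    if b[0] < a[0]:
        return node(b[0], union(b[1], a), b[2])
    return node(a[0], union(a[1], b[1]), union(a[2], b[2]))

@lru_cache(maxsize=None)
def count(z):
    """Number of subsets in the family -- linear in nodes, not in subsets."""
    if z == BOT: return 0
    if z == TOP: return 1
    return count(z[1]) + count(z[2])
```

The memoized count is the key to the compression story: counting 10^20
tilings costs time proportional to the few hundred thousand shared nodes,
not to the number of tilings.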

~~~
jules
It's not all good, however. Try representing all valid Sudokus with a ZDD.

------
ljlolel
By far the coolest data structure is the soft heap
([http://www.link.cs.cmu.edu/15859-f07/papers/chazelle-soft-
he...](http://www.link.cs.cmu.edu/15859-f07/papers/chazelle-soft-heap.pdf)).

From wikipedia: <http://en.wikipedia.org/wiki/Soft_heap>

In computer science, the soft heap, designed by Bernard Chazelle in 2000, is a
variant on the simple heap data structure. By carefully "corrupting"
(increasing) the keys of at most a certain fixed percentage of values in the
heap, it is able to achieve amortized constant-time bounds for all five of its
operations: create, insert, meld, delete, and findmin.

~~~
igravious
Off-topic but meh. The guy is frighteningly funny and shrewd to boot. See
any of his posts at A Tiny Revolution <http://www.tinyrevolution.com/mt/>

Oh, and thanks for the heads-up on this data structure, ljlolel.

------
sjs
Not sure if kd-trees are lesser known, but they are neat. I only learned about
them recently so to me they were "lesser known".

"In computer science, a kd-tree (short for k-dimensional tree) is a space-
partitioning data structure for organizing points in a k-dimensional space.
kd-trees are a useful data structure for several applications, such as
searches involving a multidimensional search key (e.g. range searches and
nearest neighbor searches). kd-trees are a special case of BSP trees." --
<http://en.wikipedia.org/wiki/Kd-tree>

~~~
cloudkj
I second kd-trees. I also only recently learned about them, and at first
encounter thought they seemed conceptually similar to the binary space
partitions used in computer graphics. Nearest neighbor search can be
implemented using a pretty
simple two-dimensional kd-tree. Useful for things like finding the n closest
neighboring points for a given point on a plane (used for mapping
applications, etc.)
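A sketch of the 2-d case in Python (my own minimal version): build by median
split on alternating axes, then during search prune the far half-plane
whenever it can't possibly contain anything closer than the best point found
so far:

```python
import math

def build_kdtree(points, depth=0):
    """Build a 2-d tree: split alternately on x and y at the median point."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(tree, target, best=None):
    if tree is None:
        return best
    point, axis = tree["point"], tree["axis"]
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    # Search the side of the splitting plane containing the target first.
    if target[axis] < point[axis]:
        near, far = tree["left"], tree["right"]
    else:
        near, far = tree["right"], tree["left"]
    best = nearest(near, target, best)
    # Only descend the far side if the plane is closer than the best so far.
    if abs(target[axis] - point[axis]) < math.dist(best, target):
        best = nearest(far, target, best)
    return best
```

For n-closest-neighbors queries the same pruning works; you just keep a
small bounded heap of candidates instead of a single best point.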

------
seldo
Nothing makes me want to hire developers like reading lists of algorithms and
discovering I don't understand them.

~~~
Tichy
If only the daily grind of programming required interesting algorithms more
often. It's certainly not the case for web development :-(

~~~
nailer
The most practical lesser known types for my day to day programming are:

\- element trees (etrees), particularly the lxml implementation. Simple insert
and append operations at xpaths (e.g. /body/html/table/tr[3]/td[7]), iterating
over children, accessing element properties as object properties.

\- dependency graphs (aka graphs). I have no idea why they're called graphs (I
did business, not CS) but anywhere you have dependencies to store and work out
(say for a project management app, or a packaging tool) these are the best
fit.
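Working out an order from stored dependencies is usually a topological sort.
A sketch using Kahn's algorithm (my own example; it assumes every dependency
also appears as a key in the map):

```python
from collections import deque

def resolve_order(deps):
    """deps maps each item to the set of items it depends on. Returns an
    order in which every item comes after all of its dependencies (e.g. a
    package install order)."""
    pending = {item: set(d) for item, d in deps.items()}
    dependents = {item: [] for item in deps}   # reverse index
    for item, d in deps.items():
        for dep in d:
            dependents[dep].append(item)
    ready = deque(item for item, d in pending.items() if not d)
    order = []
    while ready:
        item = ready.popleft()
        order.append(item)
        for dependent in dependents[item]:
            pending[dependent].discard(item)
            if not pending[dependent]:         # all dependencies satisfied
                ready.append(dependent)
    if len(order) != len(deps):
        raise ValueError("cycle in dependencies")
    return order
```

This is exactly what a packaging tool does: install "core" before "lib" and
"db", and only then "app".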

------
argv_empty
An earlier HN item discussed an interesting one
(<http://news.ycombinator.com/item?id=1156628>)

------
RodgerTheGreat
I didn't see any mention of "Threaded Trees":
<http://en.wikipedia.org/wiki/Threaded_binary_tree>

They're discussed in detail in volume one of Knuth's "The Art of Computer
Programming".

------
jacquesm
it's hard to qualify 'lesser known', I'm not sure if red-black trees qualify,
but they're certainly cool:

<http://en.wikipedia.org/wiki/Red-black_tree>

~~~
pjscott
I like relaxed balanced red-black trees. In those, you're allowed to violate
the balancing conditions; you can go back and fix the violations later. This
is useful if you want to use them concurrently. The balancing transformations
tend to cause lots of contention between threads, so deferring that
rebalancing for a later, clean-up thread can really help with scalability.

~~~
jacquesm
I only learned about this fairly recently:

<http://www.itl.nist.gov/div897/sqg/dads/>

And I feel pretty stupid for not having realized earlier that something like
that almost has to exist.

~~~
pjscott
Heh; it hit me that way, too. It seems obvious in retrospect. I found that
this page has a nice mini-introduction:

<http://www.imada.sdu.dk/~kslarsen/RelBal/>

If requests that use a red-black tree tend to come in bursts, then you can
probably get speedups by deferring rebalancing for idle periods, even single-
threaded. That's pretty darn cool. I would like to see some programming
language runtime that watched rb-tree access patterns and decided if this
would be a good idea at runtime.

------
jperras
The Markov Chain:
[http://www.itl.nist.gov/div897/sqg/dads/HTML/markovchain.htm...](http://www.itl.nist.gov/div897/sqg/dads/HTML/markovchain.html)
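For anyone who hasn't played with them: a Markov chain is just states plus
transition probabilities, and the classic toy use is babbling text. A tiny
sketch (my own) where the transition table is built by counting word bigrams:

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words observed to follow it; repeats in
    the list encode the transition probabilities."""
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def generate(chain, start, length, rng=random):
    """Random walk over the chain, stopping at dead ends."""
    out = [start]
    while len(out) < length and chain[out[-1]]:
        out.append(rng.choice(chain[out[-1]]))
    return out
```

Every step of the walk depends only on the current word, which is the
defining (memoryless) property of a Markov chain.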

------
ulvund
I found a lot of the data structures in "Computational Geometry: Algorithms
and Applications" interesting.

------
budwin
happy to see disjoint sets in there
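For anyone who hasn't met them: disjoint sets (union-find) give near-constant
amortized time for merging sets and asking whether two items are in the same
set. The standard sketch, with union by size and path halving:

```python
class DisjointSets:
    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x):
        """Return the root of x's set, flattening the path as we go."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets containing a and b (smaller root under larger)."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```

Classic uses: Kruskal's minimum spanning tree, connected components, and
unification in type inference.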

