
Skip Lists: A Probabilistic Alternative to Balanced Trees - silentbicycle
http://en.wikipedia.org/wiki/Skip_list
======
BrandonM
I've used a skiplist before to implement the "Six Degrees of Kevin Bacon"
program in C++. It was easier to implement than a balanced tree, and it had
similar performance, so it was pretty interesting and worthwhile. It took
about 30 seconds to read in over 100K actors (if I remember correctly), but
after that, searches were very fast.

There are several negatives, however. One is that skiplists use about (lg n)/2
times as many pointers as a balanced tree, if you actually make your skiplist
tall enough to be worthwhile. This could mean an extra MB or more of memory
for a skiplist with 100K items in it. In some cases it could be that half of
the memory consumed by the skiplist is the overhead involved in creating the
skiplist. This is also true of hash tables that don't have many elements, of
course, but with hash tables, the situation gets better as more items are
added. With skiplists, the overhead remains high regardless of the number of
items in the list.

Another "problem" is that of standard libraries. Even C has a tsearch function
(in search.h) that implements balanced search trees, and many other languages
include balanced search trees as well. Balanced search trees have less
overhead, both in terms of memory and performance. With a skiplist, you have
to choose a few random numbers (on average) for every item added to the list.
Then with searches, you always have to start at the top (the "skippiest" part)
with comparisons, working your way down towards the bottom until you figure
out which node to jump to. With BSTs, it's just a single comparison and then a
movement left or right.

And of course, if your items are hashable, skiplists really lose out to hash
tables, which have near-constant lookup time, very little overhead as the
number of items in your table approaches the capacity, and implementations in
nearly every language.

~~~
ajross
I think your size computation is wrong, no? A skip list is a doubly-linked
list at each level, so two pointers per node at level zero (i.e. same as a
tree). Each higher level has (statistically) half as many entries, so you have
to multiply by the sum of the 1/2^N series, which is 2. So on average a skip
list takes 4 pointers per node, which is twice as much as a tree. It's just a
constant factor, there's no dependence on set size.

(It's true that balanced trees take a little more than two pointers per node,
but e.g. red-black trees only need one extra bit, which you can pack into the
bottom of one of the pointers if you like).

On the whole, skip lists are cute, but not really "better" than a red-black or
AVL tree. And the balanced trees aren't _that_ hard to implement, even if you
decide to do your own. Add that to the fact that tree structures in code are
kinda esoteric (almost always, you just want to be able to fetch an item by
identity. Range comparisons are pretty rare in memory -- that's what databases
are for) and skip lists leave me a little ... meh.

~~~
BrandonM
Yes, the size _could_ be computed that way. However, I was going by the
original paper in my implementation. This meant that I had to first estimate
how many items would be in the tree, then take the base-2-log of that (in
order to get O(lg n) performance), and then make _each_ node allocate that
many pointers. Of course, with most nodes, most of these pointers are
uninitialized and unused.

In my comment, I was referring to my (and the reference) implementation. After
reading your comment, though, you're right that you could allocate a number of
pointers equal to the height of the specific node. The only downside is that
you then have two mallocs for each new node (once for the node and once for
its array of pointers), unless you do something hackish like putting the array
of pointers at the end of the struct and manually adjusting your malloc call
to allocate the right amount.

One other note: I don't think my skiplist was doubly-linked. The search
algorithm involves moving forward on the highest level until the move would
take you past the desired element, then you move down a level and continue. By
the time you hit the bottom, you're guaranteed to either hit the element
you're looking for, or your position is the node before where the element
would be if it was in the list. Thus double-linking is entirely unnecessary.

------
ntoshev
It is a cool structure, but not a practical one because it is very cache-
unfriendly. I'm surprised Wikipedia doesn't mention this.

~~~
gaius
There are cases in which your dataset is larger than the cache, y'know.

~~~
seregine
That is the most common reason for using a cache in the first place, so you
seem to be stating the obvious. Am I missing something?

If I remember right, skip lists are unfriendly to caches (compared to balanced
search trees) because they don't optimize locality of reference. This matters,
for example, when your huge dataset is on disk and you're trying to cache the
working set in memory. In a skip list, related elements don't end up on the
same page as often, which means you have to read more data than you need from
disk, and read from the disk more often.

~~~
gaius
Well yes, that is true, but if you happen to know upfront that you're going to
be accessing your data randomly, and that for whatever reason you can't cache
it all, then you should optimize for that case. If on the other hand, you know
that you're going to have hotspots then you can size your cache accordingly.
Choosing before you know (or have had a chance to observe) is a premature
optimization.

~~~
johnm
But that's another argument for where skip lists are _not_ appropriate. Or do
you have a real example?

~~~
gaius
It is simply a matter of your cache hit ratio. If that is very low then you
have two choices - increase the size of your cache, or use an algorithm that
is less reliant on caching for predictable performance. The size of your cache
is a pure price/performance calculation, "is it worth spending X to improve
performance by Y%".

By all means stick to caching/LRU/B*tree if that works for your application. A
cache is a brute-force solution, tho', and it helps to have more than one tool
in your bag.

~~~
dreish
But by increasing the number of pointers by log n, you reduce your cache hit
ratio, do you not? It would seem that the only time this doesn't matter is
when you've long since blown past the CPU cache and your RAM cache hit ratio
is going to be in the <5% range, with almost every access to the data
structure resulting in disk accesses.

~~~
gaius
Yes that is the case to which I am referring. I am often working with multi-
terabyte datasets.

------
silentbicycle
I meant to link to William Pugh's original paper when posting, but submit
doesn't seem to take FTP links:
ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf

------
alexstaubo
Skip lists are one of my favourite data structures. They're slower than
balanced trees, but ridiculously simple to implement.

