Hacker News new | comments | show | ask | jobs | submit login
Skip Lists Done Right (ticki.github.io)
451 points by signa11 on Dec 19, 2016 | hide | past | web | favorite | 82 comments

Java has a lock-free concurrent skip list in the ConcurrentSkipListMap class [1]. Herlihy et al. claim that existing skip list algorithms (including the Java implementation) are complicated, and instead provide a supposedly faster implementation [2].

The two techniques they use are:

- Optimistic traversal of lists without grabbing a lock: Only when it arrives at the item being seeked, it grabs the lock and validates that the list is unchanged.

- Lazy deletion: marking a node as deleted and removing it from the list later.

[1] https://docs.oracle.com/javase/8/docs/api/java/util/concurre...

[2] http://people.csail.mit.edu/shanir/publications/LazySkipList...

Redis uses a skip list (doubly linked) implementation for sorted sets, here are antirez's (BDFL) comments as to why from 7 years ago:

There are a few reasons:

1) They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make then less memory intensive than btrees.

2) A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.

3) They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.

Data structures in Redis have evolved a bit in the last 7 years. One of the main workhorses is indeed something called a "ziplist". The implementation (especially the format description at the top) is worth a read: https://github.com/antirez/redis/blob/90a6f7fc98df849a9890ab...

For those who are familiar with pre-2.6 Redis code, there used to be something called zipmap, but it was deprecated in favor of ziplist.

I have not looked at Redis code in eons, but I think it's not exactly true - all the structures are there, it's just that redis will favor ziplist for the very small ones.

A ziplist is a list of compact varying length integers (or strings) where only the bits that make a difference are stored (I contributed the 24-bit version to Redis[1]). The argument is that a sequential search over a small data structure always wins over more complex things like hash tables, skiplists, etc.

As you get more data, Redis will switch to using skiplists or whatever - there is a config setting for that (-max-ziplist-entries and -max-ziplist-value).

[1] https://github.com/antirez/redis/commit/5a86ab47995586f0a0ef...

Elaborating on gtrubetskoy comment, ziplists are doubly linked lists (of integers and strings) encoded into an array of bytes that stores the size of both the next and previous element so you can quickly find the location of the next/previous element. This makes ziplists space efficient and have great cache efficient, but expensive to insert into since you need to reallocate the array and shift a potentially large number of elements over every time you inset/delete. These properties make ziplists useful when they are small, but not so as they get larger, so Redis will switch to different implementations once the ziplists reach a certain size.

Interestingly, Redis lists are implemented as a double linked list of ziplists, where the size of each ziplists is bounded. This gives you good caching as the quicklist is being traversed, while limiting the cost of an insertion. Redis will also compress ziplists not near the edge in order to save space.

Yes, that is correct. The code checks the encoding at runtime ( https://github.com/antirez/redis/blob/8444b46d20ef9c8de3f7e2... ) and converts between the structures as needed (see calls to `zsetConvert`). The alternative (for large zsets) is called zskiplist, the manipulation code is directly in t_zset.c and the structure is defined at https://github.com/antirez/redis/blob/8444b46d20ef9c8de3f7e2...

Looks like I already do pretty much all those things in twoskip:


(except for the level choice. Which is a bit meh to me, because a write will probably trigger 3 fsyncs, so the cost of level choice is very low)

That said, my next DB won't have any skiplists in it I don't think, or at most an in-memory one, but not on disk.

>That said, my next DB won't have any skiplists in it I don't think, or at most an in-memory one, but not on disk.

What will you use instead (and why)?

B-trees, LSM trees, or any data structure that's actually designed to be cache friendly.

I think his claim that his optimizations can beat a b-tree is bullshit otherwise he would've shown benchmarks.

I haven't written it yet. I suspect I can't beat b-tree in entirely random read workloads.

Adding to rawnlq's comments:

Disk access (reads and writes) still happen in chunks, even with SSDs. It's much more efficient to write entire chunks of data out than to write a few hundred bytes over a dozen different chunks.

The requirements for writing out a B-Trees to disk is in line with writing out chunks of data, which makes them very popular (and performant) with database implementations.

And this doesn't cover how performant CPUs are at performing operations over a contiguous list of data. The complexity may be O(n), but the implementation introduces a fractional constant which can outperform a O(log(n)) algorithm which involves cache misses.

> Disk access (reads and writes) still happen in chunks, even with SSDs. It's much more efficient to write entire chunks of data out than to write a few hundred bytes over a dozen different chunks.

To add, also memory access happens in chunks. Currently chunk (cache line) size is 64 bytes on almost all x86 and ARM CPUs. It's the smallest quantity of data you can read from or write to RAM.

CPUs can also efficiently prefetch when they detect sequential access patterns, so accessing large contiguous blocks of memory is preferable to achieve high performance.

Sorry I didn't get back to this.

It's a virtual linear log file which gets split occasionally by appending a sorted list of pointers to the end and "closing" that log file, along with occasional tasks repacking the log files together.

Individual records (either key/value or key/tombstone) have a pointer showing how far back they cast a "shadow" so you know how long you need to keep tombstones for.

Since only the active log file can ever have changes made, you can repack any two adjacent files together into a single compacted log file at any time, removing any duplicates or tombstones that don't cast a shadow outside the range being compacted.

Skip lists are a hidden gem! I first learned about them from a GitHub developer at OSCon a few years ago, and at time they seemed like black magic.

It's a shame that universities (or at least mine) don't teach more interesting algorithms like this... it would make the algorithms/data structures coursework so much more interesting.

It was part of my normal university computer science data structures course. Was actually one of the structures we had to implement. Also, for all the "hard to get right" comments I remember them being much easier to write than RB Trees or B Trees.

Georgia Tech definitely teaches it in the intro data structures and algorithms course. I don't remember if we had to implement it, but it definitely came up in lecture.

We had to implement them when I took it with Dr. Waters at least (Summer 2015). CS 1332 is a pretty good intro course.

WUSTL does teach Skip List. We spent a good two weeks learning about it. Intro, analysis, lab, test about Skip List. It's very cool and hard to get right.

Yeah, I was just thinking about that 241 lecture. I do feel like we did the naive version though

Berkeley did cover skip lists, along with a lot of other algorithms, both practical and esoteric.

UT Austin's multicore computing course had an optional project to benchmark concurrent skip lists or implement faster skip list algorithms. My group aimed to improve element access times by boosting the height of frequently accessed elements.

My University (UMD) does, but that might only be because Bill Pugh teaches/taught here.

I'm pretty sure my program at KU covered skip lists in our data structures class.

Illinois covers them.

While I really do like this article and the really interesting techniques it showed me, I'm bothered by the fact that they constantly mention performance but there's not a single benchmark on that page. It's not that I don't believe that these are huge theoretical gains, it's just that I'd like to see how these perform in practice compared to other approaches.

Skip lists are probably the easiest way of getting O(log(n)) lookups on ordered lists.

Recently I got rid of huge bottleneck on an oldish piece of software by moving from a vanilla linked list to a skip list. And I did do many things suggested in the article (e.g. having a vector of fixed length for the pointers, and thus a fixed tallness).

Funnily enough, for my case, I managed to do without a RNG just fine. I just have an element that is of tallness 'k' every 2^(k*3) insertions (e.g. every 8th insertion is 1 level tall; every 64th insertion is 2 levels tall; and so on). For my particular pattern of insertions, this proved to be more than enough, and and simplified things a little bit.

> Skip lists are probably the easiest way of getting O(log(n)) lookups on ordered lists.

I wouldn't say that. The algorithms for AVL/red-black/splay trees are not especially complex, and they contain much less subtlety than skip lists. If you just look them up on wikipedia and see the list of operations, it's not particularly hard to implement them. Skip lists, on the other hand, are very easy to get wrong (this is the premise of the linked article, after all).

Using a skip list is frequently the better choice for performance and memory reasons (especially if you're iterating through the list), but they are not necessarily simpler data structures.

I did a brief survey before deciding to go with skip lists, and found them to be easier to grasp and reason about. (e.g. there is no need to balance trees; a skip list is very similar to a linked list, and linked lists are very simple).

I remember somebody saying that "in a sane world skip lists would always have been discovered before red-black trees".

This historical accident (skip lists were only discovered 1989; while rb-trees date back to the 1970s) is probably the reason why skip list use is not more widespread, and seen as somewhat exotic at times.

(AVL and Splay-trees also predate skip lists.)

Skip lists are just B+ trees in disguise.

Look at the basic diagram here.


See how there is an "express lane" containing just (3 5), and the lower layer contains the entire (1 2 3 4 5 6 7) list?

The main difference is that it's a combination of arrays and links rather than parallel lists.

Binary search tree rotations are hard to implement correctly, but relatively easy to test that you have them right.

There are a lot of subtleties in any stochastic algorithm (like a skip list) that are much harder to correctly test.

While we're discussing "easy to implement" algorithms, I find Tries (aka radix trees) to be by far the easiest to implement for both ordered and unordered sets/maps. Naive implementations are very space inefficient though.

> The algorithms for AVL/red-black/splay trees are not especially complex

and can be coded once, then used in millions of programs. That whole point is largely moot.

A red-black tree gives you certain guarantees without any probabilistic arguments. Wikipedia: A skip list does not provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures, because it is always possible (though with very low probability) that the coin-flips used to build the skip list will produce a badly balanced structure.

A skip list algorithm which guards against the worst case will necessarily have to do some rebalancing that requires nodes to move among levels according to some heuristic.

Skip lists are not storage efficient unless the nodes are variably sized. If each node includes enough pointers to participate in any level, then you waste space. Variable size means we cannot draw nodes from a single, compact pool array of nodes for a skip list. Variable size also creates difficulties if we want to use this as an "intrusive container". (A container which works by inserting the necessary node links and other fields into the client's data structure, so that the client data structures directly link into the graph as nodes, rather than being separately allocated and referenced from it.)

I'm looking at a C implementation pointed at by DADS. http://epaperpress.com/sortsearch/txt/skl.txt

This uses the C struct hack for variable node sizing, which makes it rely on malloc for the nodes. A client cannot pass in a pre-allocated node.

A mistake in this type of code can lead to a buffer overflow: accessing the n-th level of a node which only has k < n levels worth' of pointers.

If logic were added which moves a node from one skip level to another, that would require the entire node to be realloc'ed, if its number of links needs to increase. Realloc-ing can change its address; all existing pointers to the node must be rewritten.

Another consideration regarding memory: a binary tree with no parent pointers has two pointers per node. (Traversal then needs temp space due to to recursion or iteration with explicit stack.)

However, a binary tree can be traversed in O(n) in both directions.

A skip list with a singly-linked level zero cannot be traversed in reverse.

If we make it doubly linked to meet the requirement for reverse traversal then ... it has two pointers per node and then some. There goes your storage advantage.

Actually, you can. As per the article, you can view the skiplist as a tree when turned 1/8th.

So you can start at the head, put it on the stack, move to the next node referenced at the hightest level, put it on the stack, etc. just like you can start from the top of the tree, put it on the stack, move right, put it on the stack, etc.

The advantage of the skiplist over the tree is that you can traverse forward in O(1) space. But both are equally bad in reverse

[Shameless plug] I've written a skip list implementation you can play with from the browser:


Tip: click "Insert a random item" a bunch of times to populate the (initially empty) skip list.

Most interesting aspect of the article was the last section and first comment, linking to SkipNet: http://research.microsoft.com/en-us/um/people/alecw/usits-20...

Ideas like SkipNet are what differentiates a hack like me from genuinely smart people who can look at the definition of in-memory data structure and spot properties useful to an Internet-sized system

MemSQL (distributed relational database with in-memory tables) uses skip lists: http://blog.memsql.com/the-story-behind-memsqls-skiplist-ind...

The pseudocode + example given for the initial implementation of 'insert' doesn't seem right, no 'below' connection is made between the two '13' nodes.

The way to fix this would be to let 'insert' return the node inserted at the given level. There already are a few returns in the 'insert' pseudocode right now, but it doesn't seem like anything is being returned right now.

Something like this:

    -- Recursive skip list insertion function.    
    define insert(elem, root, height, level):
        if right of root < elem:
            return insert(elem, right of root, height, level)
            if level = 0:
                new ← makenode elem
                old ← right of root
                right of root ← new
                right of elem ← old
                return new
                if level ≤ height:
                    new ← makenode elem
                    old ← right of root
                    right of root ← new
                    right of elem ← old
                    below new ← insert(elem, below root, height, level - 1)
                    return new
                    return insert(elem, below root, height, level - 1)
Or more simply:

    define insert(elem, root, height, level):
        if right of root < elem:
            return insert(elem, right of root, height, level)
        elif level > height:
            return insert(elem, below root, height, level - 1)
            new ← makenode elem
            old ← right of root
            right of root ← new
            right of elem ← old
            if level > 0:
                below new ← insert(elem, below root, height, level - 1)
            return new

Maybe off-topic, but thanks for putting what tools you used to do your illustrations with, always happy to see what people are using for that.

Dia and TiKZ, they make the diagrams I produce seem a bit..rough..

Excellent article, I really love algorithm and datastructure. if anyone else is interested like me, I recommend this book https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press...

It is also very handy if you need to flatten a small (< 2 kg) household pet.

> It is also very handy if you need to flatten a small (< 2 kg) household pet.

or your ego :)

I always say college was very practical learning as at different points in time I was able to kill a mouse with both algorithms (this very book) and computer architecture.

It took you 2 books to kill 2 mice, but I'm sure with more observations it would have been obvious that your approach was actually O(logN).

It would be bad to have a book about multiple trees which implemented less than 1 tree in its printed form.

Implemented? I don't think it implements tree, so much as inherits from tree.

On the upside, it is really cheap by weight.

I'm not a big fan of that book. For all its weight its still surprisingly informal.

The author states in the "Extra Remarks" section:

"Performance is mostly about the same, but skip lists are more DoS resistant if you make sure that all links are F2F."

Can someone tell me what F2F stands for? I am not familiar with this acronym.

My guess is friend-to-friend, i.e. a private peer-to-peer network.

I see - P2P, so something like Chord. Thanks.

In the context of a DHT, F2F stands for friend-to-friend, i.e. authenticated nodes.

I've read this page before, and again today, and I still don't understand how these unrolled lists are supposed to work in practice.

Based on the author's example at https://i.imgur.com/FYpPQPh.png, how do you take an unrolled skiplist that has a bottom row like this:

    [1,2,3] -> [4,5,_] -> [7,8,9]
And insert 2.5? An inevitable tree restructuring would have to occur, which vastly complicates the insertion logic.

That first chunk will get split, 3 will get copied to new chunk, and list is restructured. I think the selling point here is you paid this price but get dividends on reduced cache misses in the future.

Interesting. Whether that approach is efficient depends entirely on the workload. That complicates the analysis. (And one of the major benefits of skiplists is that the analysis is supposed to be simple.)

Aren't skip lists more-or-less essential for indexed search engines? I wouldn't say they're uncommon considering how much indexed search surrounds us.

e.g., http://lucene.apache.org/core/6_3_0/core/org/apache/lucene/c...

I just came across skip lists as part of a sliding-window aggregator in Apache Samza's metrics system[0]. So cool! Being able to work with a range of values in somewhere between O(1)and O(log n) time, while still providing ordered key/value mappings, is great. Dug through java.util.concurrent looking for more structures, but this seems to be the most interesting of the bunch.

[0] https://github.com/apache/samza/blob/master/samza-api/src/ma...

Archived: https://archive.fo/H7F9e

The site doesn't work without scripts and trips up links2, so you can use the archive link instead (it works without scripts).

Something to note: you state that BST's have logarithmic time in order successor, but many implementations use threaded binary trees which provide an amortized constant successor operation. Great article though.

So my takeaway is that two main use cases for this DS are:

1: as an alternative to a distributed hash tables

2: as an alternative to a self-balancing tree such AVL tree etc.

Are there any other common uses cases for skip lists?

Some memory allocators use them for tracking free blocks because insertions/removals are cheap and they're good at best-fit style searches.

> I can't seem to find my favorite trick: deterministic (ie. without rng) skipping by linking on traversal instead of insert/delete.

Says one comment. Can anyone explain that?

It sounds like they mean that, if you have all the data you want for your skiplist and all you need going forward is a read-only data structure for doing lookups, then you can put your data in order in a regular list and then build up the towers of skip links in deterministic fashion to get an optimally balanced distribution.

Doesn't seem like much of a trick, though. If you can do that, then you can just stick it all in an array and bsearch to find things.

I initially thought it might be referring to deterministic skip lists (Munro, Papadakis, Sedgewick, http://www.ic.unicamp.br/~celio/peer2peer/skip-net-graph/det...), but the "linking on traversal" part stumps me. Maybe it's referring to the top-down approach in section 3 of the paper.

Might anyone familiar with both data structures be able to briefly explain how this actually differs from a B-tree?

There are a lot of differences; here's a couple:

A skip list node contains one key and has a random fanout; a B-tree node contains 1-N keys and has fanout equal to <number of keys> - 1.

A B-tree is guaranteed amortized log(N) lookup/insert/delete time, a skip-list has a high probability of a log(N) lookup/insert/delete time, but does not guarantee it.

The advantage of skip lists over b-trees is that they can have a smaller constant-factor, and are more cache friendly under many workloads.

Another nice thing about skip lists is that you can traverse them backwards without adding backpointers with a recursive algorithm and an O[logn] size stack.

What framework are you using to publish the blog?

Looks like Hugo[1] running the Bleak[2] theme.

  [1] http://gohugo.io/
  [2] https://github.com/zenithar/hugo-theme-bleak

Footer says that Hugo is being used.


Skip Lists. Seem handy for increasing access times on ordered lists...

Any Lispers care to comment?

what's your method for quickly searching an ordered list? Binary search falls apart since you don't have free random access.

Skip lists allow for that, sort of.

In Lisp, most people would just, you know, not use a list for that. You can do that, you know...

The only reason I asked about lispers is that lispers really like lisps, and I was curious if any of them had implemented skip lists for some reason, and if so, how.

> ... just, you know ... you know ...

Yes, I know. Do you? You made the suggestion that skip lists are inferior to ordered lists, not me. (good for "increasing access times on ordered lists") I'm still perplexed as to your rationale for that statement.

No, I didn't, AFAIK. I'm baffled as to where you drew that from...

Maybe because you wrote "increasing access times", which could be read as "skip lists are slower." Not what you meant, probably.

This was my understanding of that comment. How else could that be read? I am not able to think of another interpretation of that statement or a context when increased access times would be desired.

The intended interpretation is that skip lists make access to specific items in ordered lists faster.

I am actually a native speaker, so I have no excuse here. I just suck at writing, I guess...

Skip lists are container data structures for keeping items in order, an alternative to various balanced trees like red-black.

The funny thing is how rarely you need that sort of thing if you have an unordered associative container, and a sorting function.

It's possible to combine a hash table with a list. E.g. the hash table references not the items directly, but the conses of a list in which those items are stored as the cdr. Then you have keyed random access to the list, as well as a particular order. Finding a spot where to insert is linear, of course.

A container which keeps items in order is useful primarily for real-time programming, for implementing a priority queue: the situation is that a dynamic set of items is in constant churn, and needs to be in order all the time (where the order is different from the order in which those items come to life or arrive or whatever). For performance, we don't want to just add an item to the end of the queue and then call sort.

If we don't care about this real time aspect, we can just accumulate items and then sort them. Even if we sort a list several times, that may not be of consequence if it is bounded to a small length. Unix treats directories like this: the "dents" are not ordered in any way: globbing expansion and the "ls" utility accumulate names and sort them. Unix could keep directories in sorted order (and there are, I think, some file systems which actually do).

There is an anti-pattern of abusing sorting when people don't have an associative container. An example of this is shell scripts which use sorting to help de-duplicate items, instead of an obvious awk job with an associative array. Another anti-pattern occurred in C++: for long time C++ had only std::map and std::set: ordered containers. These were used all the time, even when order was not required (Now there is std::unordered_map).

In the late 1990's, I wrote a pretty nice red-black tree data structure in C, which ended up used in various places, such as in ext2fs tools. one company I worked for used it for the basis of a "real-time database" for software running on a switch node for real time packet inspection and processing. The implementation had to be modified to allow for shared memory.

I don't miss that kind of container in everyday Lisp programming somehow. I have a several years old branch where that code is half-added to my Lisp dialect. There isn't much motivation to finish the job; the need for that just somehow doesn't arise. Most of the time, either you're dealing with dynamic sets which are unordered (by definition of "set", practically), or else with lists which are ordered, but don't have to be constructed by out-of-order insertions. A lot of the time, the order of a list doesn't actually follow any predicate: the only specification of order is that list instance itself and its actual order. (The order came from somewhere, but its reason is not necessarily known, only that it should be preserved.)

ANSI CL doesn't have any optimized ordered list container, only hashes. Why? It seems like nothing notable of that sort originated in any of the base dialects from which CL was formed.

Suppose you needed some real-time database for some service application: it would make a lot of sense to just pick one and make Lisp bindings to it.

Makes sense. Thanks for the insight Kaz.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact