
What are the lesser known but useful data structures? - mck-
http://stackoverflow.com/questions/500607/what-are-the-lesser-known-but-useful-data-structures?rq=1
======
tikhonj
There's a whole set of interesting data structures that are not very well
known: succinct data structures[1]. The idea is simple: we want to store data
in a compressed form, but also perform certain operations quickly without
uncompressing.

These can be very useful for certain applications. The article on "Cramming
80,000 Words into a JavaScript File"[2] is a nice example. It shows you how
you can store a compressed trie in memory but still use it. I also like
this[3] series of blog posts leading up to wavelet trees.

These certainly count as obscure data structures, unlike many of the ones
listed on SO. I had never even considered the _idea_ of compressing data in
memory like this, much less encountered actual examples of succinct data
structures! I have to thank Edward Kmett for introducing me to the whole
field.

These data structures are important not just because they're neat themselves,
but because they got me to think in a new way. In particular, I realized that
using pointers all over the place--to represent things like trees--is not
always efficient. Instead of parsing data, it might be better to store it as a
blob of some sort with a binary index. Just starting to consider details like
that is valuable all on its own.
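
To make the "operate without uncompressing" idea concrete, here is a minimal sketch of a rank query, one of the basic building blocks of succinct structures. The class name and the block size are my own assumptions for illustration. A real implementation would pack the bits into machine words instead of a Python list.

```python
class RankBitVector:
    """Bit vector with precomputed per-block counts for rank queries."""

    BLOCK = 64  # bits per block; an arbitrary choice for this sketch

    def __init__(self, bits):
        self.bits = list(bits)
        # block_ranks[i] = number of 1 bits before block i
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1 bits in positions [0, i)."""
        block, offset = divmod(i, self.BLOCK)
        start = block * self.BLOCK
        # One table lookup plus a scan of at most BLOCK - 1 bits.
        return self.block_ranks[block] + sum(self.bits[start:start + offset])
```

The small table of block counts is the "binary index": queries touch it plus one block, not the whole vector.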

[1]:
[http://en.wikipedia.org/wiki/Succinct_data_structure](http://en.wikipedia.org/wiki/Succinct_data_structure)

[2]:
[http://stevehanov.ca/blog/index.php/?id=120](http://stevehanov.ca/blog/index.php/?id=120)

[3]: [http://alexbowe.com/rrr/](http://alexbowe.com/rrr/) and
[http://alexbowe.com/wavelet-trees/](http://alexbowe.com/wavelet-trees/)

~~~
ot
> Instead of parsing data, it might be better to store it as a blob of some
> sort with a binary index.

This is exactly what I did for JSON; I call it semi-indexing [1]: instead of
parsing it into a tree of pointers, I create a succinct representation of the
parse tree, which is orders of magnitude smaller than the original JSON.
Construction is much faster than parsing because there are basically no memory
allocations, and access is not that much slower.

About performance of succinct data structures in general, it is true that they
have historically shown poor practical performance, but things are changing,
both because we have better CPUs (while memory latency is pretty much
unchanged) and because better algorithms are being found. I did my Ph.D. on
practical succinct data
structures. We found that in some applications, the access times are
competitive or faster than the non-succinct counterparts, while the space is
much smaller. One example is tries: in my thesis [2] there are experiments for
string dictionaries, and for query autocompletion (for example for search
engines).

Another area where (quasi-)succinct data structures are having some success is
inverted indexes: recently proposed posting lists based on Elias-Fano [3] have
been shown to outperform standard delta-encoded posting lists for queries with
sparse intersection, and are used in Facebook's graph search [4].
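
For readers who haven't seen Elias-Fano, a rough sketch of the encoding follows. This is not the code from [3] or [4]; the function names are hypothetical. Each value is split into low bits stored verbatim and high bits stored in unary in a bit vector. The linear scan in `ef_access` stands in for the constant-time select structure a real implementation would use.

```python
def ef_encode(values, universe):
    """Encode a sorted list of ints in [0, universe) as (l, lows, high_bits)."""
    n = len(values)
    # l = number of low bits kept verbatim, about log2(universe / n)
    l = max(0, (universe // n).bit_length() - 1) if n else 0
    lows = [v & ((1 << l) - 1) for v in values]
    # High parts are stored in unary: element i sets bit (v >> l) + i.
    high_bits = [0] * (n + (universe >> l) + 1)
    for i, v in enumerate(values):
        high_bits[(v >> l) + i] = 1
    return l, lows, high_bits


def ef_access(encoded, i):
    """Recover the i-th value without decoding the rest of the list."""
    l, lows, high_bits = encoded
    seen = -1
    for pos, bit in enumerate(high_bits):  # naive select1(i)
        seen += bit
        if bit and seen == i:
            return ((pos - i) << l) | lows[i]
    raise IndexError(i)
```

Random access to any element is what makes skipping through a posting list during intersection cheap, unlike delta encoding where you have to decode sequentially.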

Finally, the biggest success story of SDS has historically been molecular
biology, because the size of the DNA sequences processed is so large that
non-succinct data structures are impractical. Many sequence assemblers/aligners
use variants of FM-indexes and Compressed Suffix Arrays, which are self-indexes
based on the Burrows-Wheeler Transform and Wavelet Trees.

[1] [https://github.com/ot/semi_index](https://github.com/ot/semi_index)

[2]
[http://www.di.unipi.it/~ottavian/files/phd_thesis.pdf](http://www.di.unipi.it/~ottavian/files/phd_thesis.pdf)

[3]
[http://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pdf](http://vigna.di.unimi.it/ftp/papers/QuasiSuccinctIndices.pdf)

[4]
[http://www.vldb.org/pvldb/vol6/p1150-curtiss.pdf](http://www.vldb.org/pvldb/vol6/p1150-curtiss.pdf)

~~~
ocfnash
IMHO, the FM-index deserves to be highlighted. That it is possible to store a
string in a compressed format which can answer length-P substring queries in
O(P) time (with good constant factors) is quite surprising at first sight.
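
A toy sketch of the backward search behind that O(P) bound, for anyone curious. The function names are made up, and this version stores plain occurrence tables rather than a compressed BWT, so it shows only the query logic, not the space savings. It assumes the text never contains the NUL character used as the sentinel.

```python
def build_fm_index(text):
    """Build (first, occ, n) from the Burrows-Wheeler Transform of text."""
    text += "\0"  # sentinel, assumed smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)
    # first[c] = number of characters in text strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(text)


def count_occurrences(index, pattern):
    """Count matches of pattern with one step per pattern character."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching suffix array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character costs two table lookups, regardless of how long the text is.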

I recently wrote a few words about this here:
[http://ocfnash.wordpress.com/2014/01/03/dna-of-a-password-
di...](http://ocfnash.wordpress.com/2014/01/03/dna-of-a-password-disaster/) By
the time I finished, I had decided that the whole area is exciting and not
nearly as well known as it should be. I also think there are plenty of related,
as-yet-undiscovered ideas waiting for somebody to find them!

~~~
Blahah
A nice application for the FM-index: it has massively accelerated alignment of
short DNA sequences against large databases.

------
kintamanimatt
> This question exists because it has historical significance, but _it is not
> considered a good, on-topic question for this site_ , so please do not use
> it as evidence that you can ask similar questions here.

Yet it's one of the best questions on SO. Something's very wrong with SO if
this isn't considered a good, on-topic question for a programming Q&A site.

~~~
YZF
How is that one of the best questions on SO? There is always someone who wants
SO to be what they want it to be. I'm not saying everything is perfect in SO
but I think the standard answer is that if you want a site that is about
subjective discussions related to programming you should make one...

~~~
kintamanimatt
How is it not one of the better questions on SO? It's educational, it's
on-topic, it's more cerebral than the usual questions about jQuery, it has high
quality answers, the question itself is as clear as day, it's interesting, and
it's very popular in terms of views and upvotes.

There's a degree of opinion in most answers anyway especially as there's often
multiple ways to do the same thing and every answerer will have a preference
of some kind. I don't necessarily see this one as a particularly subjective
discussion anyway, or why you think I'm trying to make SO be something it's
not. In any event, why are marginally subjective discussions a horrible
thing anyway in a Q&A site?

~~~
YZF
How can anyone answer "really useful but are unknown to most programmers"?
They are so useful we don't know about them? Aren't the answers, by
definition, known, to a lot of programmers? Useful for what exactly?

I think I've heard this criticism of SO about 1 million times and this is the
1 million + 1 that tripped my fuse, so apologies for that. This question and
its answers, though, are IMO bits of trivia and opinions more suited to a blog
post.

Does it have some marginal usefulness in exposing some people without
background to some random data structures or algorithms- maybe. The question
is still there, it wasn't deleted, so SO is acting as the resource that you'd
like it to be. If it's your first exposure to data structures I think there
are better options out there (text books, online courses). If you are looking
to solve some particular problem this is probably not the best resource. So it
just stands as a (maybe interesting) bit of computer science trivia.

There are endless subjective topics that are tangentially related to
programming and may be fun/amusing/interesting but SO is about things that
have answers, not about discussions. I like it that when I have a question I
can search and get a good answer for it, not just someone's opinion. For
entertainment value and interesting things I go to other places, such as HN...

~~~
kintamanimatt
> The question is still there, it wasn't deleted, so SO is acting as the
> resource that you'd like it to be.

Try asking such a question today, or any time in the last couple of years!

> So it just stands as a (maybe interesting) bit of computer science trivia.

Just because you label it trivia doesn't necessarily mean it wasn't important
to the asker or the thousands of people that read, commented, and upvoted.
Just like people asking about jQuery, the asker had a question in mind and
wanted to crowdsource the answer.

> but SO is about things that have answers

An opinion can be an answer too.

> For entertainment value and interesting things I go to other places, such as
> HN...

Why must the answers on SO be bland as tofu? You seem to give the impression
that learning and the SO answers must be plain and dull, and any hint of
entertainment must be quickly quashed. I see nothing wrong with being
entertained and educated at the same time, especially as things that are
entertaining tend to be more readily recalled.

~~~
oneeyedpigeon
Why do meat eaters have such a desire to disparage vegetarians in completely
irrelevant contexts?

~~~
kintamanimatt
I didn't disparage vegetarians, and not even their diets. Tofu on its own is
bland and that's really hard to refute. Tofu with other stuff is delicious.

------
teddyh
Even though they are included in the GNU C library, most people do not seem to
know about Obstacks:

 _An "obstack" is a pool of memory containing a stack of objects. You can
create any number of separate obstacks, and then allocate objects in specified
obstacks. Within each obstack, the last object allocated must always be the
first one freed, but distinct obstacks are independent of each other.

Aside from this one constraint of order of freeing, obstacks are totally
general: an obstack can contain any number of objects of any size. They are
implemented with macros, so allocation is usually very fast as long as the
objects are usually small. And the only space overhead per object is the
padding needed to start each object on a suitable boundary._

[https://www.gnu.org/software/libc/manual/html_node/Obstacks....](https://www.gnu.org/software/libc/manual/html_node/Obstacks.html)

Sure, they’re not very _interesting_ , but the point is that you get them _for
free_ in the GNU C standard library.
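
To illustrate the last-in-first-out constraint, here is a rough Python model of the idea. This is not the glibc obstack API; `StackArena` and its methods are invented for the sketch, and it ignores alignment padding and growing the pool in chunks.

```python
class StackArena:
    """Fixed-size pool where objects are freed in reverse allocation order."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.top = 0  # offset of the first free byte

    def alloc(self, size):
        """Reserve size bytes and return their offset into the pool."""
        if self.top + size > len(self.buf):
            raise MemoryError("arena full")
        offset = self.top
        self.top += size  # allocation is just a bounds check and an add
        return offset

    def free(self, offset):
        """Free the object at offset and everything allocated after it."""
        if not 0 <= offset <= self.top:
            raise ValueError("offset is not a live allocation")
        self.top = offset
```

Freeing an object simply moves the top of the stack back, which is why more recently allocated objects go away with it.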

~~~
dllthomas
The FIFO equivalent can also be useful.

~~~
teddyh
I’m not sure what you are referring to, do you have a link?

~~~
dllthomas
I don't have a link, but it's an obvious transformation - basically, using a
ring buffer for allocations where I can guarantee that older things will be
freed before newer stuff.

------
bazzargh
It's not highlighting one thing, but Chris Okasaki's book on Purely Functional
Data Structures, and this brilliant top answer to a question about functional
data structures published since the book will keep you in reading material for
a while:
[http://cstheory.stackexchange.com/a/1550](http://cstheory.stackexchange.com/a/1550)

(it was all 'lesser known' to me when I started using haskell not so long ago)

~~~
dllthomas
I second the plug for Purely Functional Data Structures - brilliant stuff!

------
batbomb
I use a Hierarchical Triangular Mesh for indexing gamma ray events from the
universe. The data is partitioned in the database according to its HTM id.

[http://arxiv.org/pdf/cs/0701164.pdf](http://arxiv.org/pdf/cs/0701164.pdf)

Currently I use this for indexing ~11 billion gamma ray events. Researchers
typically supply a region in the sky, a search radius, and some cuts (energy,
event quality, etc...)

~~~
codezero
HTM is generally good. We used it to index white light data from our
satellite. It makes for a very flexible way to store spherical surface data
without losing context or information when projecting.

------
hyperpape
Looking at these lists, I strongly suspect that people upvote based on whether
they personally recognize the data structure.

It goes against the intent of the original question, but it's almost ideally
designed to make you feel good--you get the rush of knowledge, then get nerd
sniped as you head to Wikipedia.

~~~
andrewflnr
Lest people think this is useless negativity, the point is that you need to
scroll down and look at the other pages to see the really interesting ones. :)

------
VexXtreme
I love how the question was locked because "it is not considered a good,
on-topic question for this site". It's crazy. Unless an extremely specific
concrete answer can be given, a question immediately gets killed. SO has
turned into such a turd of a website.

~~~
joelthelion
Speaking of which, is there a good alternative site for questions that are
forbidden on SO?

~~~
jzwinck
There are other sites in the StackExchange network like Programmers and Code
Review and some others for math and other things. But I'm not sure which site
(SE or otherwise) would be the best for this particular topic.

~~~
sherwin
Probably the CS stackexchange
([http://cs.stackexchange.com/](http://cs.stackexchange.com/)), or CStheory
([http://cstheory.stackexchange.com/](http://cstheory.stackexchange.com/)).
Unfortunately, neither is anywhere near as popular as StackOverflow, so the
chances of getting a good discussion are lower.

------
chas
I'm happy to see finger trees got mentioned. Finger trees[0] are an extremely
useful and general data structure that can be used to implement persistent
sequences, priority queues, search trees and priority search queues.
(Haskell's Data.Sequence[1] uses specialized 2-3 finger trees internally) They
can form the basis of all sorts of interesting custom structures by supplying
the appropriate monoid[3], but this does make them harder to approach if you
are not familiar with the abstractions.

[3] A monoid is any structure that has members that can combine associatively.
In addition, it must have an element that can combine with any other element
and result in the other element. Some examples: (strings, string
concatenation, the empty string); (integers, addition, 0); (natural numbers,
max, 0); (booleans, and, True); (functions, composition, the identity
function). The functional pearl[2] that describes the design of Haskell's
diagrams library[4] goes into much more detail if you are interested in their
application to programming.
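
The examples above can be written down directly. In this small sketch, `mconcat` is just a name borrowed from Haskell for illustration: a monoid is represented as a combining function plus its identity element, and any sequence can be folded with it.

```python
import operator
from functools import reduce


def mconcat(combine, identity, items):
    """Fold items with an associative operation, starting from its identity."""
    return reduce(combine, items, identity)


def compose(f, g):
    """Function composition: apply g first, then f."""
    return lambda x: f(g(x))


def identity_fn(x):
    return x


joined = mconcat(operator.add, "", ["fin", "ger", "tree"])  # strings
total = mconcat(operator.add, 0, [1, 2, 3, 4])              # integers, +
largest = mconcat(max, 0, [3, 1, 4])                        # naturals, max
all_true = mconcat(operator.and_, True, [True, False])      # booleans, and
pipeline = mconcat(compose, identity_fn,
                   [lambda x: x + 1, lambda x: x * 2])      # functions
```

Folding an empty sequence returns the identity element, which is exactly why the definition requires one.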

[0] [http://apfelmus.nfshost.com/articles/monoid-
fingertree.html](http://apfelmus.nfshost.com/articles/monoid-fingertree.html)

[1]
[http://hackage.haskell.org/package/containers-0.5.4.0/docs/D...](http://hackage.haskell.org/package/containers-0.5.4.0/docs/Data-
Sequence.html)

[2] [http://www.cis.upenn.edu/~byorgey/pub/monoid-
pearl.pdf](http://www.cis.upenn.edu/~byorgey/pub/monoid-pearl.pdf)

[4]
[http://projects.haskell.org/diagrams/](http://projects.haskell.org/diagrams/)

~~~
dllthomas
_" In addition, it must have an element that can combine with any other
element and result in the other element."_

Aka, an identity element.

Monoids that have inverses (that is, every element has another element that,
when combined, produces the identity element) are "groups". Of the examples
you gave:

"Strings over concatenation" does not form a group - there's nothing you can
concatenate with a non-empty string to get an empty string.

"Integers over addition" does form a group. The inverse of x is -x.

"Natural numbers over max" is not a group. Once you get above 0 you cannot get
back to it just by applying max.

"Booleans over and" is not a group. Once you have false you can't get back to
true.

"Functions over composition" is not a group. Many functions have inverses, but
some do not. If you restrict the set to "functions with inverses" then you
_do_ have a group.

~~~
chas
Exactly. I was trying to avoid any mathematical jargon while describing
monoids because they are such general and pervasive structures in programming.
Further, being able to construct useful new monoids is crucial for making full
use of a finger tree and I didn't want to have someone find an interesting
structure unapproachable because they had to walk through an abstract algebra
jargon storm in order to understand it. I realize I referenced associativity
without describing it, so I didn't quite achieve my goal, but hopefully the
examples got the idea across to someone who might otherwise have dismissed it
due to the perception that it was too academic and unapproachable.

As an aside, trying to remove the jargon from a definition like this is
surprisingly hard and really expands the size of the definition. I think it is
still worthwhile to try when writing for a general audience because algebraic
structure is everywhere and very useful to recognize for people who are
interested in rigorously manipulating abstraction, i.e. programmers.

~~~
dllthomas
I agree; my "aka" wasn't meant as criticism. It's easy to get lost in the
longer definitions, so I was just restating it directly in the hopes that one
or the other - or the combination - will be clear to most people.

Though I'm not sure "abstract algebra jargon storm" properly describes use of
"associative" and "identity" - I remember learning about the "associative
property of multiplication" and such in elementary school and my wife confirms
she had similar experience.

------
jboggan
Bloom filters and count-min sketch are awesome.

[http://en.wikipedia.org/wiki/Bloom_filter](http://en.wikipedia.org/wiki/Bloom_filter)
[http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch](http://en.wikipedia.org/wiki/Count%E2%80%93min_sketch)
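
For anyone who hasn't used one, a Bloom filter fits in a few lines. This is only a sketch: the default sizes and the trick of deriving several hash functions by salting SHA-256 are my own assumptions, not a tuned design.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The items themselves are never stored, only the bits they hash to, which is where the space savings come from.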

------
nilkn
I don't really consider tries and bloom filters all that poorly known. These
commonly come up in interviews for fresh graduates at Google/Facebook.
Zippers, skip lists, ropes, round-robin databases, etc. are more genuinely not
known I think.

------
abcd_f
XOR linked list:
[http://en.wikipedia.org/wiki/XOR_linked_list](http://en.wikipedia.org/wiki/XOR_linked_list)

It's a doubly linked list with just one link per node. However, to start
traversing it you have to know at least two adjacent nodes.

PS. May not be _useful_ per se, but interesting nonetheless.
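
Python has no raw pointers, so the sketch below simulates them with list indices, using index 0 as the null pointer. The class and method names are made up for illustration. Each node stores a single value, `prev XOR next`.

```python
class XorList:
    """Doubly linked list storing one combined link (prev ^ next) per node."""

    def __init__(self):
        self.values = [None]  # slot 0 is reserved as the null "pointer"
        self.links = [0]
        self.head = self.tail = 0

    def append(self, value):
        node = len(self.values)
        self.values.append(value)
        self.links.append(self.tail)       # prev ^ 0, since there is no next
        if self.tail:
            self.links[self.tail] ^= node  # old tail: prev ^ 0 becomes prev ^ node
        else:
            self.head = node
        self.tail = node

    def traverse(self, start):
        """Walk away from start; pass head to go forward, tail to go backward."""
        prev, cur = 0, start
        while cur:
            yield self.values[cur]
            # XOR-ing out where we came from leaves where to go next.
            prev, cur = cur, self.links[cur] ^ prev
```

At either end, null plays the role of the second adjacent node, which is why traversal can start from the head or the tail but not from an arbitrary node in the middle.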

~~~
aaronem
The XOR swap is interesting, too. But I wouldn't want to see one in code
someone wrote today, and which wasn't targeted at an embedded platform -- and
a heavily constrained embedded platform, at that. (MSP430? Sure. ARM
Cortex-M3? You've got enough space to do it properly.)

~~~
dllthomas
It's not just "you've got enough space" - it's only _relevant_ if both
operands are in registers already. If they're both in memory, then the xors
are entirely redundant (the first thing you'd have to do is read the values
into two registers, at which point you can just write them back the other way).
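
For reference, the XOR swap itself is just three steps. This sketch uses Python integers purely to show the arithmetic; Python has no notion of registers, so it says nothing about the performance argument above.

```python
def xor_swap(a, b):
    """Swap two integers without a temporary, using three XORs."""
    a ^= b  # a now holds a ^ b
    b ^= a  # b becomes the original a
    a ^= b  # a becomes the original b
    return a, b
```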

I wouldn't object to seeing it in a critical section in high performance code
with substantial register pressure, but ideally it could be inserted by the
compiler!

------
krisgee
I was going to say Trie but it was the first response to the SO thread so I
guess it wasn't as little known as I thought.

I implemented it because I was making a game that had scrabble elements in it
and needed to check ahead to see if the player had a word that could still
take letters (a prefix) or if they'd hit a dead end. Fit the whole SOWPODS
into a remarkably tiny space with millisecond lookups. Probably my favourite
part of the project.
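
A minimal sketch of that kind of prefix check, using nested dicts. This is not the original game code; the class and method names are hypothetical, and a real word list like SOWPODS would want a more compact representation.

```python
class Trie:
    """Dictionary of words supporting word and prefix checks."""

    END = ""  # end-of-word marker; can never collide with a single character

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def _walk(self, s):
        """Return the node reached by following s, or None at a dead end."""
        node = self.root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def is_word(self, s):
        node = self._walk(s)
        return node is not None and self.END in node

    def can_extend(self, s):
        """True if at least one longer word starts with s."""
        node = self._walk(s)
        return node is not None and any(k != self.END for k in node)
```

Both checks cost one dict lookup per letter, independent of how many words are stored.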

------
lifthrasiir
I found the following page in the Concatenative wiki particularly interesting:
[http://concatenative.org/wiki/view/Exotic%20Data%20Structure...](http://concatenative.org/wiki/view/Exotic%20Data%20Structures)
(Note that the page itself is not related to the concatenative languages.)

------
nl
Bloom Filters:
[http://en.wikipedia.org/wiki/Bloom_filter](http://en.wikipedia.org/wiki/Bloom_filter)

Hamming Codes:
[http://en.wikipedia.org/wiki/Hamming_code](http://en.wikipedia.org/wiki/Hamming_code)

------
ww520
Extendible hashing is amazing in space utilization while retaining the
performance of hashing.

------
shurcooL
Does anyone know of an implementation of rope in golang? Something a little
more feature complete than
[https://github.com/christianvozar/rope](https://github.com/christianvozar/rope).

------
swah
Let me just drop this video that is on my watchlist:
[http://www.youtube.com/watch?v=-sEdiFMntMA&feature=share&lis...](http://www.youtube.com/watch?v=-sEdiFMntMA&feature=share&list=PLFDnELG9dpVxEpbyL53CYebmLI58qJhlt)
(Erik Demaine is the lecturer)

------
pnathan
I recently learned about the spatial index tree family in connection with data
mining. I hope to implement a data-mining centric X tree (n-dimensional)
solution for a data analytics package I'm writing soon. That family is how
you efficiently handle KNN lookups, afaict.

------
WWKong
Me and my friend were pretty serious about creating a new data structure
called "drum". A drum is a one way store. You write to it but can't read from
it. We put it off till we figured a practical use.

~~~
statusgraph
An append only log? The value of writing data is that you can read it,
otherwise your structure is semantically equivalent to not writing at all.

~~~
dllthomas
Or reading some information computed from it, I suppose. Something needs to
read it, ever, for that of course, but it doesn't necessarily need to be
exposed in the interface.

------
keefe
imho most DS & Algo are like kihon - the basic principles t'werk just fine,
it's about augments and applications.

------
halfdeadcat
Cup-a-bits

------
serge2k
It's mentioned in the link, but circular/ring buffers.

I've been grappling with decoding/playing back an audio stream and wouldn't
have gotten it working if I hadn't found out about Boost's lock-free ring
buffer.
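
For anyone unfamiliar with the structure, here is a bare-bones sketch of a ring buffer. It is single-threaded and has nothing to do with Boost's API or its lock-free guarantees; the class and method names are invented for the example.

```python
class RingBuffer:
    """Fixed-capacity FIFO queue that reuses its slots in a circle."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self.head = 0  # index of the oldest item
        self.size = 0

    def push(self, item):
        if self.size == len(self.data):
            raise OverflowError("ring buffer full")
        # The write position wraps around to the start of the array.
        self.data[(self.head + self.size) % len(self.data)] = item
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("ring buffer empty")
        item = self.data[self.head]
        self.data[self.head] = None
        self.head = (self.head + 1) % len(self.data)
        self.size -= 1
        return item
```

No memory is allocated after construction, which is a large part of the appeal for audio and similar streaming work.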

~~~
dllthomas
Recently I learned that if you're using a ring buffer for message passing, the
clflush opcode (after sending and after receiving) can dramatically reduce
cache misses.

