
Choosing the right data structures - jwdunne
http://mathieularose.com/choosing-the-right-data-structures/
======
tlarkworthy
The underlying lesson here is having a process to identify hotspots (profile),
and then being able to swap one data structure for a contextually better one
(polymorphic collections).

Presumably the developer of the topological sort was operating on small graphs
where time complexity was not an issue. It's better to have the inefficient but
numerically correct algorithm at hand than no implementation at all. So I
don't think the author of the original topological sort made a mistake.

The best method for developing high-performance algorithms is to write the least
error-prone implementation first, with no eye on efficiency. Then build a
separate high-performance implementation. You test that the second one works
by feeding random data into both and double-checking that the fast one computes
the same thing as the simple-to-understand implementation. That's how I built a
persistent red-black tree over a space of two weeks [1].

1:
[http://wiki.edinburghhacklab.com/persistentredblacktreeset](http://wiki.edinburghhacklab.com/persistentredblacktreeset)
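
A minimal sketch of that reference-versus-fast workflow in Python. It assumes a toy problem (removing duplicates while keeping first-seen order) rather than the red-black tree from the comment. The names `unique_reference`, `unique_fast`, and `check_agreement` are hypothetical, chosen only for illustration.

```python
import random


def unique_reference(items):
    """Simple, easy-to-check version: quadratic, since `in` scans a list."""
    result = []
    for item in items:
        if item not in result:
            result.append(item)
    return result


def unique_fast(items):
    """Faster version: membership tests go through a set."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def check_agreement(trials=200, seed=0):
    """Feed the same random inputs to both versions and compare the results."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.randint(0, 20) for _ in range(rng.randint(0, 50))]
        assert unique_fast(data) == unique_reference(data), data
    return trials


check_agreement()
```

A fixed seed keeps any failure reproducible, and the failing input is attached to the assertion so it can be turned into a regular test case.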

~~~
xixi77
While we should avoid premature optimizations, it is quite important to keep
efficiency considerations in mind from the beginning. If we are talking about
internal workings of one isolated function, fine -- but quite often a data
structure picked "with no eye on efficiency" early on gets baked into the rest
of the code so deep that by the time it becomes a problem, only extensive
refactoring and debugging can get it out of there. And quite often the error-
proneness of the more efficient data structure is no worse -- the classic
structure-of-arrays vs. array-of-structures question is one example.

------
recentdarkness
Hmm, this post really reminds me of Container Choice:
[http://i.stack.imgur.com/HNMy4.png](http://i.stack.imgur.com/HNMy4.png)

It was pretty famous, at least in the C++ community I was active in.

It actually helped a lot of beginners choose the right C++ Standard Library
containers for their needs.

~~~
Volscio
Thanks for sharing this.

Are there any similar posts to the original that provide simple explanations &
benches of the effects of implementing different data structures/algos?

------
joubert
I often come across developers who only think to use data structures that are
native to their language (a Sapir-Whorf trap, if you will).

For example, in JavaScript, we essentially have arrays (great for tracking
order, O(n) lookup) and objects (great for O(1) lookup, no order), but that
doesn't mean you can't: a) implement other data structures such as doubly
linked lists, heaps, red-black trees, etc.; b) combine the strengths of two
different data structures in a new object.
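
As a rough illustration of point (b), here is a sketch in Python rather than JavaScript. It pairs a list (for order and positional access) with a set (for fast membership tests). `OrderedSet` is a hypothetical name, not a standard-library class, and the sketch deliberately leaves out removal.

```python
class OrderedSet:
    """Keeps insertion order (via a list) and fast membership (via a set)."""

    def __init__(self, items=()):
        self._order = []       # remembers the order items were first added
        self._members = set()  # answers "is it in here?" in O(1) on average
        for item in items:
            self.add(item)

    def add(self, item):
        # Duplicates are ignored, so each item keeps its first position.
        if item not in self._members:
            self._members.add(item)
            self._order.append(item)

    def __contains__(self, item):
        return item in self._members

    def __getitem__(self, index):
        return self._order[index]

    def __len__(self):
        return len(self._order)

    def __iter__(self):
        return iter(self._order)


letters = OrderedSet("banana")
print(list(letters), "n" in letters, letters[1])
```

The cost is storing every item twice, which is the usual trade when two structures are combined to get the strengths of both.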

I think it was Bill Gates who emphasized the importance of picking good data
structures up front.

A book I recommend for learning how to implement data structures (and
concomitant algorithms) is: Introduction to Algorithms
([http://mitpress.mit.edu/books/introduction-algorithms](http://mitpress.mit.edu/books/introduction-algorithms))

It is also useful to have some heuristic for picking data structures, and I
have found the recipe proposed by Gayle Laakmann in her book, Cracking the
Coding Interview, useful in most non-esoteric cases.

~~~
kmfrk
The Algorithm Design Manual is basically an algorithm cookbook, which makes
for a nice companion book.

------
nano_o
There is some very interesting work in the area of "choosing the right data
structures".

For example "An introduction to data representation synthesis"
([http://theory.stanford.edu/~aiken/publications/papers/cacm13...](http://theory.stanford.edu/~aiken/publications/papers/cacm13.pdf))
presents a technique to do the following:

1) specify what the data structure should do using relational algebra (like
for data bases)

2) automatically generate a data structure that is optimized for your specific
hardware and workload.

A big advantage is that you can optimize for different hardware and workloads
without changing your code. Also, the relational specification is much easier
to get right than an optimized implementation.

It has also been extended to concurrent data structures:
[http://theory.stanford.edu/~hawkinsp/papers/pldi12concurrent...](http://theory.stanford.edu/~hawkinsp/papers/pldi12concurrent.pdf)

------
maaaats
Somewhat related, I have always liked Jonathan Blow's (creator of "Braid")
talk on data structures. Slides and audio here:
[http://www.myplick.com/view/7CRyJCWLM71](http://www.myplick.com/view/7CRyJCWLM71)

> Using the right data structure is usually bad

He argues, because really, most of the time it won't matter.

~~~
baddox
Perhaps the more important practice is to establish visibility into which
areas of your code are the performance bottlenecks. If you're developing in an
environment where you're comfortable quickly profiling things with low
friction, then it's much easier to know where to spend your time optimizing
data structures and algorithms.
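
For a sense of what low-friction profiling can look like, here is a small sketch using Python's standard `cProfile` and `pstats` modules. The function `slow_membership` and the helper `profile` are hypothetical stand-ins for real application code.

```python
import cProfile
import io
import pstats


def slow_membership(n):
    """Deliberately slow: `in` on a list is a linear scan."""
    seen = []
    for i in range(n):
        if i not in seen:
            seen.append(i)
    return len(seen)


def profile(func, *args):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, out.getvalue()


result, report = profile(slow_membership, 2000)
print(report)
```

The report lists where the time went, which is usually enough to decide whether a data structure swap is worth the effort.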

------
stiff
It is a pity that people only think of "data structures" when they deal with
large amounts of data and are forced to think of lists, trees, and all that.
Even in simple business problems, choosing the appropriate set of variables to
model a given thing can make a huge difference in how easy it is to solve a
problem and how elegant the solution is. Recognizing that a data structure is
needed in the first place seems to be a craft in itself, and one that isn't
taught much.

It's all the more funny, given the structured programming guys talked a lot
about this some thirty years ago (at least), and now most of it is forgotten.

~~~
eru
Don't you always have a data structure by default? How could you not have any
data structure?

~~~
stiff
I will give you a really prosaic example: I was writing a JavaScript carousel,
where from a collection of N elements, K were displayed at a time, and you
could jump J elements to the left, or J elements to the right, additionally
the whole thing was looped so you could endlessly scroll to the left or to the
right.

I initially just wrote it like most JavaScript is written: here define an
onclick, here remove some HTML elements, here add some HTML elements, etc. I
tracked some state here, some state there, but the data wasn't very
structured (that's better phrasing, I guess). Soon I ended up having bugs in the
logic that I had problems fixing. What helped me immensely was defining an
abstract SlidingWindow type, which stored the complete array of the elements,
the width of the window, the width of one "slide" and carefully defined the
elementary operations like slideLeft, and what they do to the data. This is
what I mean by the ability to recognize underlying data structures and make
those explicit.
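
A sketch of what such a type might look like, written in Python rather than the original JavaScript. The field names, the method names `slide_left` and `slide_right`, and the choice that sliding left decreases the start index are all assumptions. The comment only names the stored data and the idea of carefully defined elementary operations.

```python
class SlidingWindow:
    """A looping window of `width` items over `items`, moved `step` at a time."""

    def __init__(self, items, width, step):
        self.items = list(items)
        if not self.items:
            raise ValueError("items must not be empty")
        self.width = width  # K: how many elements are shown at once
        self.step = step    # J: how far one slide moves the window
        self.start = 0      # index of the leftmost visible element

    def visible(self):
        """Return the currently shown elements, wrapping past the end."""
        n = len(self.items)
        return [self.items[(self.start + i) % n] for i in range(self.width)]

    def slide_left(self):
        # Modulo keeps the index in range, so scrolling never runs out.
        self.start = (self.start - self.step) % len(self.items)

    def slide_right(self):
        self.start = (self.start + self.step) % len(self.items)


window = SlidingWindow("abcde", width=3, step=2)
window.slide_right()
print(window.visible())
```

With all the state in one place, the click handlers only need to call a slide method and redraw whatever `visible()` returns.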

Going from the first form of the code to the second was the essence of the
structured programming movement. When OO came along, it fell out of fashion,
but the OO people tend to only teach modelling on problems with really quite
obvious decompositions, and as a result one doesn't learn this ability of
noticing a data structure that could be formalized in a mess of straight-line
code.

~~~
klibertp
For a very similar purpose I once defined a CircularIterator class (also in JS,
which has no built-in equivalent of, for example, Python's itertools.cycle). It
wrapped any kind of collection and had next, prev, get and set methods (at the
time I was yet to learn about Smalltalk's streams; otherwise the methods would
be named like nextPut and next would have an optional argument, but I digress :)).
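
A rough Python sketch of a class along these lines, using the four method names from the description (next, prev, get, set). Everything else is an assumption, including copying the collection into a list and having next and prev return the new current element.

```python
class CircularIterator:
    """Steps forwards or backwards through a sequence, wrapping at both ends."""

    def __init__(self, items):
        self._items = list(items)  # assumption: work on a private copy
        if not self._items:
            raise ValueError("items must not be empty")
        self._pos = 0

    def get(self):
        """Return the current element."""
        return self._items[self._pos]

    def set(self, value):
        """Replace the current element."""
        self._items[self._pos] = value

    def next(self):
        """Move one step forward, wrapping to the start, and return the element."""
        self._pos = (self._pos + 1) % len(self._items)
        return self.get()

    def prev(self):
        """Move one step back, wrapping to the end, and return the element."""
        self._pos = (self._pos - 1) % len(self._items)
        return self.get()


mode = CircularIterator(["off", "on", "auto"])
print(mode.get(), mode.next(), mode.next(), mode.next())
```

Unlike `itertools.cycle`, which only moves forward, this keeps a position that can move in either direction and be overwritten in place.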

In essence it was exactly the same thing you did, but due to it being OO, I
managed to reuse it quite a few times since (having something which cycles
between three states on click is a breeze now). The thing here is that the
problem of "wrapping" a sequence instead of throwing some kind of
Out-of-bounds-the-sky-is-falling error is pretty common and it deserves a
generic solution. You - and structured programming, for that matter - trapped a
solution to this problem inside a solution for the carousel. That's exactly one
of the more important things OOP came to fix.

But I guess it's true that "the OO people tend to only teach modelling on
problems with really quite obvious decompositions". I think nearly 100% of
examples of OO in tutorials and books are completely useless and that there
should be some major change in the way of teaching and thinking about OO. In
my case, coming up with a useful abstraction back then was pure luck. Later,
when I saw Smalltalk - and worked in it for some time - I finally
understood what OO is about and how beautifully it can simplify and generalize
problems which would be one-off with procedural/structured programming.

~~~
eru
What made you think the solution was trapped inside the carousel?

I understand what both of you are trying to say. Thanks for the answers. Yes,
I'm applying that kind of thinking quite often myself. (I do Haskell at work,
where these abstractions are particularly natural. No OOP necessary. )

------
mercurial
Short version: if all you need to do is to insert and check for existence (in
this case, whether a node has already been explored), use a set, not a list.
News at 11.
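
A minimal sketch of that advice in Python, assuming the graph is a dict mapping each node to a list of its successors. `reachable` is a hypothetical helper, not code from the article.

```python
def reachable(graph, start):
    """Return every node reachable from `start` in an adjacency-dict graph."""
    explored = set()  # `in` and `add` are O(1) on average; a list would be O(n)
    stack = [start]
    while stack:
        node = stack.pop()
        if node in explored:
            continue  # already handled, skip it
        explored.add(node)
        stack.extend(graph.get(node, ()))
    return explored


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
print(sorted(reachable(graph, "a")))
```

Swapping `explored` for a list would give the same answer, but every `in` check would then scan all the nodes explored so far.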

~~~
xerophtye
Yep, that pretty much sums it up. I was hoping for a more detailed analysis of
situations and data structures... not just ONE specific scenario.

------
zamalek
Tarjan's strongly connected components algorithm[1] is designed to discover
strongly connected components (interdependent vertices in graphs) but a side
effect is that it does a topological sort (I think I remember it being
reversed). It runs in O(|V| + |E|) time (V = vertices, E = edges).

You might want to look into it.

[1]:
[http://en.wikipedia.org/wiki/Tarjan's_strongly_connected_com...](http://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm)
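
For reference, a compact recursive sketch of Tarjan's algorithm in Python, assuming the graph is a dict mapping each node to a list of successors. The function name `tarjan_scc` is hypothetical. Being recursive, it can hit Python's recursion limit on very deep graphs.

```python
def tarjan_scc(graph):
    """Return the strongly connected components of an adjacency-dict graph.

    Components come out in reverse topological order: a component is
    emitted before any other component that has an edge into it.
    """
    index_of = {}     # discovery order of each visited node
    lowlink = {}      # smallest discovery index reachable from the node
    stack = []        # nodes not yet assigned to a component
    on_stack = set()  # same nodes as `stack`, for O(1) membership tests
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)

        for succ in graph.get(node, ()):
            if succ not in index_of:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index_of[succ])

        # `node` is the root of a component: pop the stack down to it.
        if lowlink[node] == index_of[node]:
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


dag = {"shirt": ["tie"], "tie": ["jacket"], "jacket": []}
order = [c[0] for c in reversed(tarjan_scc(dag))]
print(order)
```

On an acyclic graph every component is a single node, so reversing the output gives a topological order. Any component with more than one node signals a dependency cycle.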

------
jheriko
i would point out that implementing a list with an array is lazy madness. i've
seen this before and it always results in the same fundamental problem - you
lose the performance benefits of the list because it incurs the performance
penalties of an array for insertion and removal when you already have a node.

i've seen this same fundamental blunder in AAA game code bases that will
remain nameless...

so it's not just choosing the right data structure, but understanding the
implementation well enough that you choose the correct implementation of a data
structure - which might not be the one in the standard library, unfortunately

~~~
icebraining
I'm hardly a very experienced programmer, but almost all uses of lists I see
in Python codebases involve appending, iterating and reading/setting specific
values; I couldn't tell you the last time I saw an insert to the head/middle
of a list.

------
wfunction
Data structures 101.

------
thinkersilver
It's the simple things that have the biggest impact. Even though the analysis
is elementary to many I thought the post was useful.

------
pramalin
This is a well-established formula: Algorithms + Data Structures = Programs -
Niklaus Wirth, 1st edition, February 1976

------
bsaul
using nsset instead of nsarray is my recent hobby because i've only recently
realized that many operations i did on arrays are in fact intersection or
membership tests, and that those are already implemented in the lib.

does anyone know if a membership test in nsset is also as much faster than in
nsarray as described in the article?

------
dschiptsov
There is CS61A for that, not even a specialized course on algorithms and data
structures.)

Lists, being an "asymmetrical" recursive data structure, have O(1) time for
adding an element to its head but O(n) time for sticking an element to its
tail, and a search (for membership) would be O(n) in the worst case. Keeping
the list pre-sorted would save some time.

In case one needs a quick (close to O(1)) look-up there are hash-tables. It is
even funny to see that it comes as a surprise.)

As Brian Harvey explicitly mentioned a few times, constant factors don't
matter, while choosing an appropriate data structure can make a dramatic
difference, such as going from linear time to near-constant.

Another topic is about using non-memoized recursion, like in the classic case
of a naive recursive function of computing Fibonacci numbers.
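
A small Python sketch of the difference, using the standard library's `functools.lru_cache` for the memoized version. The function names are illustrative.

```python
from functools import lru_cache


def fib_naive(n):
    """Recomputes the same subproblems over and over: exponential time."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recursion, but each value is computed once and then cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


print(fib_naive(15), fib_memo(50))
```

The memoized version does linear work in `n`, while the naive one is already noticeably slow somewhere in the 30s.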

That is another proof that taking a decent introductory CS course (preferably,
based on SICP) is a must for any person who for some reason decided to code.

~~~
icebraining
_Lists, being an "asymmetrical" recursive data structure, have O(1) time for
adding an element to its head but O(n) time for sticking an element to its
tail_

Well, a list is an abstract data structure. In Python, it's implemented as
an array, so adding an element to the head actually costs O(n) - since all
elements must be shifted to the next position - while appending to the tail is
O(1) (amortized, since in certain cases it triggers a full resize, requiring a
full new copy of the list).

The language wiki has a page with the time complexity of a few of its data
structures:
[https://wiki.python.org/moin/TimeComplexity](https://wiki.python.org/moin/TimeComplexity)
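
A short sketch of the operations in question, with `collections.deque` shown as the standard-library option when cheap insertion at the head matters. The variable names are illustrative.

```python
from collections import deque

items = [1, 2, 3]
items.append(4)      # amortized O(1): writes into spare capacity at the end
items.insert(0, 0)   # O(n): every existing element shifts one slot right

queue = deque([1, 2, 3])
queue.append(4)      # O(1)
queue.appendleft(0)  # O(1): deque supports cheap insertion at both ends

print(items, list(queue))
```

The trade-off is that indexing into the middle of a deque is slower than indexing a list, so it only pays off when work happens at the ends.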

~~~
eru
Thanks for the insight.

By the way, you could make appending one element to the back of a Python list
worst-case O(1), too, if you were willing to waste some space.

