Hacker News new | comments | show | ask | jobs | submit login
Choosing the right data structures (mathieularose.com)
82 points by jwdunne on Oct 23, 2013 | hide | past | web | favorite | 35 comments

The underlying lesson here is having a process to identify hotspots (profile), and then being able to swap one data structure for a contextually better one (polymorphic collections).

Presumably the developer of the topological sort was operating on small graphs where time complexity was not an issue. Its better to have the inefficient but numerically correct algorithm at hand than no implementation at all. So I don't think the author of the original topological sort has make a mistake.

The best method for developing high performance algorithms is write the least error prone implementation first, with no eye on efficiency. Then build a separate high performance implementation. You test the second one is working by blowing random data into both and double check the fast one computes the same thing as the simple to understand implementation. That's how I built a persistent red-black tree over a space of two weeks[1]

1: http://wiki.edinburghhacklab.com/persistentredblacktreeset

While we should avoid premature optimizations, it is quite important to keep efficiency considerations in mind from the beginning. If we are talking about internal workings of one isolated function, fine -- but quite often a data structure picked "with no eye on efficiency" early on gets baked into the rest of the code so deep that by the time it becomes a problem, only extensive refactoring and debugging can get it out of there. And quite often the error-proneness of the more efficient data structure is no worse -- the classic structure-of-arrays vs. array-of-structures question is one example.

The author might not have even chosen an inefficient implementation. For smal sets of data, arrays can easily be faster at search and insertion than hash tables, particularly if the hash function or collision resolution scheme is inefficient.

X having a higher growth rate than Y does not mean that Y is unconditionally faster. It means that there exists some problem size n such that Y is faster than X for all problems larger than n. Below that value, all those terms that big-O notation ignores are still significant performance factors.

I wonder if it would be possible for a sufficiently intelligent compiler to do this automatically.

Hmm this post reminds me really of Container Choice: http://i.stack.imgur.com/HNMy4.png

Which was, at least in the C++ community I was active in, pretty famous

It actually helped a lot of beginners choosing the right C++ Standard Library containers for their needs.

Thanks for sharing this.

Are there any similar posts to the original that provide simple explanations & benches of the effects of implementing different data structures/algos?

Thanks! Is there something similar for Java?

I often come across developers who only think to use data structures that are native to their language (a Sapir-Whorf trap, if you will).

For example, in JavaScript, we essentially have arrays (great for tracking order, O(n) lookup) and objects (great for O(1) lookup, no order), but that doesn't mean you can't: a) implement other data structures such as doubly linked lists, heaps, red black trees, etc. b) combine the strengths of two different data structures in a new object

I think it was Bill Gates who emphasized the importance of picking good data structures up front.

A book I recommend for learning how to implement data structures (and concomitant algorithms) is: Introduction to Algorithms (http://mitpress.mit.edu/books/introduction-algorithms)

It is also useful to have some heuristic for picking data structures, and I have found the recipe proposed by Gayle Laakmann in her book, Cracking the Coding Interview, useful in most non-esoteric cases.

The Algorithm Design Manual is basically an algorithm cookbook, which makes for a nice companion book.

I don't know about Bill Gates, but this quote from Linus Torvalds sprang to mind:

"I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships."


There is some very interesting work in the area of "choosing the right data structures".

For example "An introduction to data representation synthesis" (http://theory.stanford.edu/~aiken/publications/papers/cacm13...) presents a technique to do the following:

1) specify what the data structure should do using relational algebra (like for data bases)

2) automatically generate a data structure that is optimized for your specific hardware and workload.

A big advantage is that you can optimize for different hardware and workloads without changing your code. Also, the relational specification is much easier to get right than an optimized implementation.

It has also been extended to concurrent data structures: http://theory.stanford.edu/~hawkinsp/papers/pldi12concurrent...

Somewhat related, I have always liked Jonathan Blow's (creator of "Braid") talk on data structures. Slides and audio here: http://www.myplick.com/view/7CRyJCWLM71

> Using the right data structure is usually bad

He argues, because really, most of the time it won't matter.

Perhaps the more important practice is to establish visibility into which areas of your code are the performance bottlenecks. If you're developing in an environment where you're comfortable quickly profiling things with low friction, then it's much easier to know where to spend your time optimizing data structures and algorithms.

This is now one of my favorite technical talks. Very salient points!

I absolutely love his section on how calling some programmers lazy when all visible evidence shows otherwise is silly.

This is a pretty awesome talk. I wish he expanded on the "array of records" thing and how he implemented it.

Maybe an array of key-value pairs?

It is a pity that people only think of "data structures" when they deal with large amounts of data and are forced to think of lists, trees, and all that. Even in simple business problems, choosing the appropriate set of variables to model a given thing can make a huge difference in how easy it is to solve a problem and how elegant the solution is. Recognizing that a data structure is needed in the first place seems to be a craft in itself, and one that isn't taught much.

It's all the more funny, given the structured programming guys talked a lot about this some thirty years ago (at least), and now most of it is forgotten.

Don't you always have a data structures by default? How could you not have any data structure?

I will give you a really prosaic example: I was writing a JavaScript carousel, where from a collection of N elements, K were displayed at a time, and you could jump J elements to the left, or J elements to the right, additionally the whole thing was looped so you could endlessly scroll to the left or to the right.

I initially just wrote it like most JavaScript is written, here define an onclick, here remove some HTML elements, here add some HTML elements etc. I tracked some state here, some state there, but the data wasn't very structured, that's better phrasing I guess. Soon I ended up having bugs in the logic that I had problems fixing. What helped me immensely was defining an abstract SlidingWindow type, which stored the complete array of the elements, the width of the window, the width of one "slide" and carefully defined the elementary operations like slideLeft, and what they do to the data. This is what I mean by the ability to recognize underlying data structures and make those explicit.

Going from the first form of the code to the second was the essence of the structured programming movement. When OO came to be it felt out of fashion, but the OO people tend to only teach modelling on problems with really quite obvious decompositions, and as a result one doesn't learn this ability of noticing a data structure that could be formalized in a mess of straight-line code.

For very similar purpose I once defined a CircularIterator class (also in JS, where it's not available as itertools.cycle in Python for example) which wrapped any kind of collection and had next, prev, get and set methods (at the time I was yet to learn about Smalltalk's streams - the methods would be named like nextPut and next would have an optional argument, but I digress :)).

In essence it was exactly the same thing you did, but due to it being OO, I managed to reuse it quite a few times since (having something which cycles between three states on click is a breeze now). The thing here is that the problem of "wrapping" a sequence instead of throwing some kind of Out-of-bounds-the-sky-is-falling error is pretty common and it deserves generic solution. You - and structural programming, for that matter - trapped a solution to this problem inside a solution for carousel. That's exactly one of more important things OOP came to fix.

But I guess it's true that "the OO people tend to only teach modelling on problems with really quite obvious decompositions". I think nearly 100% of examples of OO in tutorials and books are completely useless and that there should be some major change in the way of teaching and thinking about OO. For me, that I managed to come up with a useful abstraction back then was pure luck. Later, when I saw Smalltalk - and worked in it for some time - I finally understood what OO is about and how beautifully it can simplify and generalize problems which would be one-off with procedural/structured programming.

What made you think the solution was trapped inside the carousel?

I understand what both of you are trying to say. Thanks for the answers. Yes, I'm applying that kind of thinking quite often myself. (I do Haskell at work, where these abstractions are particularly natural. No OOP necessary. )

There is CS61A for that, not even a specialized course of algorithms and data-structures.)

Lists, being an "asymmetrical" recursive data structure, have O(1) time for adding an element to its head but O(n) time for sticking an element to its tail, and a search (for membership) would be O(n) in the worst case. Keeping the list pre-sorted would save some time.

In case one needs a quick (close to O(1)) look-up there are hash-tables. It is even funny to see that it comes as a surprise.)

As Brian Harvey explicitly mentioned few times, constant factors doesn't matter, while choosing appropriate data-structure could make a dramatic changes, such as changing form a linear time to near constant.

Another topic is about using non-memoized recursion, like in the classic case of a naive recursive function of computing Fibonacci numbers.

That is another proof that taking a decent introductory CS course (preferably, based on SICP) is the must for any person who for some reason decided to code.

I agree with your overall sentiment, but I have some objections with the specific things you are saying:

- SICP the book and the original MIT course that is available online do not cover data structures or algorithms very well, it isn't that much in the focus of the book, they implement a queue in one place from what I remember, there is some mention of big-oh, but it's all rather cursory. Thomas Cormen published "Algorithms Unlocked" recently, which seems like a good intro book for programmers who are not willing to devote a fraction of their life to studying the big Cormen book or Knuth.

- Constant factors matter increasingly more and more. There has been a lot of talk in the last years about how things like big and very fast processor caches make some not-asymptotically optimal algorithms outperform the asymptotically optimal ones for all input sizes we can ever expect to handle. One should of course still understand asymptotic analysis, but you also have to understand caches, locality of reference and do real-world benchmarking. One doesn't substitute the other.

There is a wonderful set of lectures by the two magicians.) I used to watch some before going to bed.

Actually it takes a long time to understand what they are saying (big ideas behind the words) and why they chose so (why this or that particular example).

But the great idea is quite simple: "All you need is Lambda and the List structure as a glue (to create layers of DSLs) plus self-discipline of not violating abstraction barriers". But one has to really understand them. (Well, arrays and hash-tables are also necessary due to some efficient algorithms based on its unique properties).

Lists, being an "asymmetrical" recursive data structure, have O(1) time for adding an element to its head but O(n) time for sticking an element to its tail

Well, a list is an abstract data structure. Being Python, it's implemented as an array, so adding an element to the head actually costs O(n) - since all elements must be copied to the next position - while appending to the tail is O(1) (amortized, since in certain cases it triggers a full resize, requiring a full new copy of the list).

The language wiki has a page with the time complexity of a few of its data structures: https://wiki.python.org/moin/TimeComplexity

Yes, my inaccuracy - I considered them from the classic Lisps point of view. The trade-offs they made in Python about list's implementation are disputable.)

Thanks for the insight.

By the way, you could make Python's list's adding of one element at the back worst case O(1), too, if you were willing to waste some space.

Short version: if all you need to do is to insert and check for existence (in this case, whether a node has already been explored), use a set, not a list. News at 11.

Yep, that pretty much sums it up. I was hoping for a more detailed analysis of situations and data structures... not just ONE specific scenario.

Tarjan's strongly connected components algorithm[1] is designed to discover strongly connected components (interdependent vertices in graphs) but a side effect is that it does a topological sort (I think I remember it being reversed). It runs in O(|V| + |E|) time (V = vertices, E = edges).

You might want to look into it.

[1]: http://en.wikipedia.org/wiki/Tarjan's_strongly_connected_com...

i would point out that implementing a list with an array is lazy madness. i've seen this before and it always results in the same fundamental problem - you lose the performance benefits of the list because it incurs the performance penalties of an array for insertion and removal when you already have a node.

i've seen this same fundamental blunder in AAA game code bases that will remain nameless...

so its not just choosing the right data structure, but understanding the implementation well enough that you chose the correct implementation of a data structure - which might not be the one in the standard library unfortunately

I'm hardly a very experienced programmer, but almost all uses of lists I see on Python codebases involve appending, iterating and reading/setting specific values; I couldn't tell you the last time I saw an insert to the head/middle of a list.

Data structures 101.

It's the simple things that have the biggest impact. Even though the analysis is elementary to many I thought the post was useful.

This is a well established formula: Algorithms + Data Structures = Programs - Niklaus Wirth 1st Edition February 1976

using nsset instead of nsarray is my recent hobby because i've only recently realized that many operation i did on arrays are in fact intersection or membership tests, and that those are already implemented in the lib.

anyone knows if membership test in nsset is also as faster than in nsarray as described in the article ?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact