Hacker News
Topics in Advanced Data Structures [pdf] (stanford.edu)
931 points by htiek on April 29, 2019 | 83 comments

I was going to roll my eyes about stuff no one will ever hear of, much less use in their day job, but there are some really relevant structures here: finger trees, cache-oblivious structures, R-trees, etc., just to name a few from a random page or two. The "why they're worth studying" summaries are gold.


A doctor mostly treats common colds and influenza. But that's no excuse for not knowing the purpose of the left pulmonary vein. The more basics doctors know, the better they are. Sorry if this analogy is a bit extreme, but I want to make a point.

I think a more apt analogy would be if you would disqualify doctors who don't know the nucleotide sequence of the genes that code for the cytochrome p450 enzyme process from treating your cold or flu.

The left pulmonary veins are gross anatomy, akin to knowing what an 'if' statement is. Knowing the current state of the art in determining a minimum complexity for an operation on an obscure tree implementation is deep domain knowledge used by only a few specialists.

> much less use in their day job

Yes, some are indeed very relevant. In one of my previous businesses, we built a search engine where we used DAWGs (Directed Acyclic Word Graphs) [I didn't do this part]. And they worked really, really well; in fact, they work well to this day.

Good to hear they do. :)

When was the last time you wrote a finger tree? At work?

[I'm just responding to your first clause. :-)]

FM indices as well.

The material seems to be for this course: http://web.stanford.edu/class/cs166/

There are more slides and info on that site :)

It's a pity there are no video lectures. Are there lectures out there for a similar algorithms class?

MIT 6.851 Advanced Data Structures, Spring 2012: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu...

I took that class that year, one of the best classes I ever took. A ton of work, too.

> Traditional data structures assume a single-threaded execution model and break if multiple operations can be performed at once. (Just imagine how awful it would be if you tried to access a splay tree with multiple threads.) Can you design data structures that work safely in a parallel model – or, better yet, take maximum advantage of parallelism? In many cases, the answer is yes, but the data structures look nothing like their single-threaded counterparts.

Makes me wonder which data structures have "parallel" versions besides the two mentioned.

The Art of Multiprocessor Programming by Herlihy and Shavit has many examples, including stacks, queues, skip lists, and hash tables. There are plenty of papers out there too with parallel implementations of vectors, various tree structures, and more.

Would all of Scala's default (immutable) data structures fit the bill?

I was just curious. Not sure why this is downvoted.

This kind of stuff fascinates me. General software engineering seems comparatively boring. I was considering going back to university to do a PhD in CS, and this is making me realize that research in algorithms/data structures would actually be viable (there's still lots of stuff to discover).

Does anyone here know what it is like doing research in these areas? Any general advice?

I've got a PhD in CS, and these structures fascinate me as well. The Bloomier Filter is an amazing read.

Unfortunately, my day-to-day job has almost nothing to do with these algorithms; it's mainly software architecture and maintainability.

I think the majority of these algorithms are not really something you will find yourself implementing in day-to-day work. At most I can see myself using a library that provides one of them. Even if you do research, it will have to be very focused on data structures and algorithms to really get deep into some of these.

Still, very enjoyable.

Just today I finished implementing radix heaps in C# to optimize some of my Dijkstras. When I googled I saw zero implementations in C#, so I may be the first to write one. It worked well, giving me a 3x speedup in practice over a d-ary heap and very low GC pressure, which was fortunate as my process was exceeding 256 GB of RAM.

Rolling your own data structures is pretty vital if you're working on algorithms.
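For anyone curious, the core radix-heap idea mentioned above is small enough to sketch. This is a minimal monotone version in Python (not the commenter's C# code); the bucket layout, the `(key, value)` tuples, and the `last`-key invariant are my assumptions about a typical textbook implementation:

```python
class RadixHeap:
    """Monotone min-priority queue for non-negative integer keys.

    Assumes extracted keys are non-decreasing over time, which holds
    for the tentative distances popped during Dijkstra's algorithm."""

    def __init__(self, max_bits=32):
        self.buckets = [[] for _ in range(max_bits + 1)]
        self.last = 0   # last key extracted; lower bound for all live keys
        self.size = 0

    def _bucket(self, key):
        # Index = position of the highest bit where key differs from `last`.
        return (key ^ self.last).bit_length()

    def push(self, key, value):
        assert key >= self.last, "monotonicity violated"
        self.buckets[self._bucket(key)].append((key, value))
        self.size += 1

    def pop(self):
        # Find the first non-empty bucket (raises StopIteration if empty).
        i = next(i for i, b in enumerate(self.buckets) if b)
        if i > 0:
            # Redistribute around the new minimum: items fall into strictly
            # lower buckets, so each item moves at most max_bits times total.
            items, self.buckets[i] = self.buckets[i], []
            self.last = min(k for k, _ in items)
            for k, v in items:
                self.buckets[self._bucket(k)].append((k, v))
        self.size -= 1
        return self.buckets[0].pop()
```

The appeal for Dijkstra is exactly what the commenter reports: buckets are flat lists, so there is far less pointer churn (and GC pressure) than with a binary or d-ary heap.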

> I think the majority of these algorithms are not really something you will find yourself implementing in day-to-day work. At most I can see myself using a library that provides one of them.

That's true but if you don't know what's possible, it's likely that you don't even look for it or stumble over it randomly.

So you should know about them, even if you never implement them. You wouldn't need to know about them in that detail for that but it's just plain fun to learn stuff like this. :-)

Does anyone here find in their work that they are able to employ data structures like this, and if so, what do you work on? I've almost always had to delegate all my state to a database using default indexes, etc., which is productive, yet a little disappointing, because I'm always applying my brain power toward more mundane tasks instead.

When working on a high-volume web application I found data sketches (specifically the count-min sketch) very useful for finding and logging unique query patterns (check whether we've seen a query before and how often; if it's new, log it). Using a sketch bounded how much memory we used and eliminated database trips. In theory a hashmap might have worked as well, but its size is unbounded given too many unique queries, while all we needed was an estimate.
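For reference, a count-min sketch is only a few lines. This is an illustrative Python sketch (not the poster's production code); the width/depth values and the blake2b-salted row hashes are arbitrary choices:

```python
import hashlib

class CountMinSketch:
    """Fixed-memory frequency estimator.

    Estimates are never below the true count; collisions can only
    inflate them, and taking the min across rows bounds the inflation."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent-ish hash per row, derived by salting blake2b.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(),
                                salt=row.to_bytes(16, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Min across rows is the tightest available upper bound.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))
```

Memory is fixed at `width * depth` counters no matter how many distinct queries flow through, which is exactly the bounded-memory property the comment relies on.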

Probabilistic data structures like bloom filters and sketches are really useful for gathering statistics on data sets that are too large to manage or would have a negative performance impact by having an extra database trip. So something to do with internal application diagnostics/debugging/logging is a great low-risk place to use these sorts of algorithms even when it makes sense to keep all business logic in the database.

There's a general class of problems to be solved with data structures, which is that you have a collection of records of data, and you're trying to find [1] a record, or collection of records, given some criteria. In essence, you have a database of some kind, so it should not surprise you to learn that databases (and things that are databases but not normally thought of as such--filesystems, for example) rely very heavily on these more "exotic" data structures internally. Indeed, a database is basically a heavily-optimized data structure that usually optimizes for the bursty nature of disk access [2]. And given that implementing these data structures tends to be a rich source of bugs, if you can offload the work to a database implementation, you probably should.

I will also point out that when you view the problem of data structures as an implementation that solves various queries, you can start building tools that automate the implementation of complex data structures given the queries you want to ask the data structure. I fully expect to see these tools start to become available over the next decade or so, given the current progress of program synthesis I see in the academic literature.

[1] Or insert, delete, or update. But usually you already have the record at that point, and you just want to update the data structure's representation of that record.

[2] You can apply the same techniques to cache access and even vector register usage for purely in-memory data structures. Same problem, just with different sizes of disparity.

Stratos Idreos's team has been working on exactly this: automatic design of data structures based on workload.

Their work has been on HN before: https://news.ycombinator.com/item?id=18314555

I was specifically thinking of https://homes.cs.washington.edu/~mernst/pubs/collection-synt..., but mostly because that's the paper I actually read.

>I fully expect to see these tools start to become available over the next decade or so, given the current progress of program synthesis I see in the academic literature.

Interesting. I did know about program verification efforts (trying to prove that programs are correct), but had never heard of program synthesis (assuming it is somewhat different from code generation). Do you know of any examples of the kinds of areas it can be applied to?

The best overview of the topic I know of is https://www.microsoft.com/en-us/research/wp-content/uploads/..., which describes the state of the field ~18 months ago.

The major field where it's already being applied is in automating the conversion of data--building the program that can take "John Doe" and turn it into "<name>John</name> <surname>Doe</surname>" just by example. You can also do similar generalization to automate things like constructing highly repetitive graphics (e.g., draw a few teeth of a gear and the program does the rest). Superoptimization is the other main prong, where you really combine the synthesis and the verification to get correct faster code.

Those applications sound cool. Will check out that paper. Thanks.

I do plasma simulations and recently had the problem of finding the distance to the nearest neighbor for each of the particles in the simulation. Doing that naively is O(n^2) and took hours even for small test problems. Building an R-tree once and using it for nearest-neighbor look-ups brought that down to 5 minutes.

libspatialindex lacks documentation, but worked really nicely. The rtree interface in python is much friendlier.

I’m taking Advanced Data Structures at UCSD right now, and our first assignment was building a k-d tree and an efficient KNN classifier. It was surprisingly simple, and the efficiency difference between the k-d tree and the brute-force implementation was quite drastic.
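A 2-D k-d tree with nearest-neighbor search really is surprisingly small. Here is a Python sketch (not the UCSD assignment code); the median-split build and the prune-by-splitting-plane query are the textbook approach:

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a 2-D k-d tree; node = (point, left, right)."""
    if not points:
        return None
    axis = depth % 2                       # alternate x / y per level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median split keeps it balanced
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = depth % 2
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Only descend the far side if the splitting plane is closer than best.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best
```

The pruning step is where the drastic speedup over brute force comes from: most queries never touch the "far" subtrees at all.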

If you only build the tree once and do no insertions what is the benefit of an R-Tree vs KDTree?

I actually do plan to update the tree as I insert additional particles in locations where the distance to the nearest neighbor is large.

There is a library called FLANN that does approximate nearest-neighbor search using a kd-tree and a best-bin-first method. It is typically used for finding nearest neighbors in higher dimensions.

Plasma simulation sounds really damn cool.

It's actually usually pretty hot. </pun>

I'm curious, how many particles do your simulations typically contain?

Typical full sized simulation runs contain between 100 million (1e8) and 100 billion (1e11) particles. Small tests while debugging might be as small as a few thousand particles (1e4). And tests of the code might only use one or two particles.

The thing I was working on when I switched from naive O(n^2) to R-trees had half a million (5e5) particles.

Gamedev, but even then actually implementing the data structures is rare. And nothing outside of geometric/parallel data structures listed towards the end.

I've had to use a concurrent priority queue once (locking a regular priority queue was too slow, dropped in a concurrent implementation), implemented kd trees a few times, and I've used half-edge once (a cousin of quadedges). Mainly doing graphics work.

Often the naive data structures work better because there's less indirection and it's easier to vectorize. In my case, depending on how much GPU budget you have available, you can also just scrap the CPU implementation and throw it in a compute shader.

I think the reality is that >80% of SWEs will never need any of this, because most SWEs are high-level integrators and consumers of these algorithms via libraries. For a lot of people, their time is better spent learning how to architect and use available tools to quickly solve problems while writing bug-free code.

But then life changes and you may find yourself in need of this knowledge. I used to laugh at graph-algorithm questions in interviews because I never directly used a graph in 20 years of SWE. Then I got a job where I am on a team maintaining a graph-based API. Joke's on me; now those 'silly' algorithms are very relevant.

That means the graph questions were only really relevant for the position where you maintain a graph based API. They were completely irrelevant for the other positions.

My guess is, even after working with graph algorithms for a while, you still needed more than 20-30 minutes to make implementation changes, and still needed a reference to look at while doing so.

I know for a fact graphs are quite useful. I also know complex algorithmic questions during an interview are, in most cases, not useful. The fact that you obtained the position and were able to adapt and manage graph structures shows that those questions didn't gauge your fundamental abilities. Maybe you didn't hit the ground running on day one, but after a little learning and practice, you were fine under reasonable delivery conditions.

I managed to use DAWGs (Directed Acyclic Word Graphs) in a search engine, and later HyperLogLog+ and a Bloom Filter for a scalable distributed Multi-Armed Bandit implementation.

Otherwise I find that the built-in Clojure data structures are so good for all-round use that it's difficult to justify the time to use anything else, except in extreme cases.
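For anyone who hasn't seen one, the Bloom filter mentioned above is tiny to sketch. This Python version is illustrative only; the bit count and hash count are arbitrary here, while a real deployment would derive `num_bits` and `num_hashes` from the expected item count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Bit-array membership test: no false negatives, tunable false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by prefixing the item with the hash index.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set => "probably present"; any bit clear => definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The "no false negatives" guarantee is what makes it usable in a distributed bandit setting: a negative answer is always safe to act on.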

>Otherwise I find that the built-in Clojure data structures are so good for all-round use that it's difficult to justify the time to use anything else, except in extreme cases.

I don't know Clojure, so I can't compare its data structures with those of Python, but I've read a few times that Python's built-in data structures are so useful that for many applications it suffices to just use them as-is, or in combinations (e.g. lists of dicts, dicts of lists, ...), instead of defining your own custom ones using classes. More so when used in combination with other Python features like {list,dict,set} comprehensions and libraries like itertools.

And I've found that to be somewhat true in my own Python work.

I recently had to store large numbers of geospatial points (lat/lng pairs with data) in RAM for querying. I used a basic quad tree, which is similar to R-trees, but much simpler. Getting a subset of the points based on a bounding box is very fast and simple.
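A point quadtree with bounding-box queries can be sketched in a few dozen lines. This Python version is illustrative, not the poster's implementation; the node capacity of 8 and the half-open bounds convention are my choices:

```python
class QuadTree:
    """Point quadtree over (x, y, payload) records with box queries."""
    CAPACITY = 8  # points held in a node before it splits (arbitrary)

    def __init__(self, x0, y0, x1, y1):
        self.bounds = (x0, y0, x1, y1)   # half-open: [x0, x1) x [y0, y1)
        self.points = []
        self.children = None             # four sub-quadrants once split

    def insert(self, x, y, payload=None):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False                 # outside this quadrant
        if self.children is None:
            if len(self.points) < self.CAPACITY:
                self.points.append((x, y, payload))
                return True
            self._split()
        return any(c.insert(x, y, payload) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my), QuadTree(mx, y0, x1, my),
                         QuadTree(x0, my, mx, y1), QuadTree(mx, my, x1, y1)]
        for px, py, pl in self.points:   # push existing points down
            any(c.insert(px, py, pl) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Yield all stored points inside the query box."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return                       # box misses this quadrant entirely
        for px, py, pl in self.points:
            if qx0 <= px <= qx1 and qy0 <= py <= qy1:
                yield (px, py, pl)
        for c in self.children or []:
            yield from c.query(qx0, qy0, qx1, qy1)
```

For lat/lng data you would store the coordinates directly as x/y, which is fine as long as the bounding boxes are small enough that the spherical distortion doesn't matter for your queries.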

You've answered your own question - the person writing the database you're using, for example.

Computing is so cheap that most people can delegate to a database. At some point, though, if money is a problem and you don't have the computing resources, you can squeeze more out of the hardware by using more specialized DBs (graph, doc, column) or something like Redis. And at some point those might not map very nicely to your problem, in which case custom data structures will get you far. Hence the reason FAANGs are big on algorithm & DS questions during interviews.

I'm a Bioinformatics student, so it's not for work, but I have been looking into Suffix Arrays, BWTs and FM indexes for my project.

I've seen them applied in some real-world next-gen genomic tools; for example, https://github.com/vgteam/odgi implements a multi-string BWT. Technically they're in academia and not making money afaik, but it's interesting stuff.

Yes, but not often!

Recently I used nested R-trees for in-memory geospatial querying for a game (millions of atomic updates a second, where sharding would be a pain to manage as items move around the world).

For example as a tree gets too big I split it at the hot spot and run the job processing for each tree on separate threads.

A few years ago I implemented a simple tree for a survey product I built. The pathing in the survey is stored in the tree, and I use the tree to find loops (which are invalid) in the survey.

Professional multimedia software here. We deal with fundamentally concurrent architectures with a variety of not-so-soft realtime requirements, so thread safety, lock-/wait-free guarantees, and cache performance are my key concerns.

Most off-the-shelf data structures are poorly suited for that environment so I need to roll my own or modify existing structures. I'm no data structure expert, so this kind of stuff is really helpful.

Though my use of stranger data structures is relatively rare in my line of work, I enjoy learning them because they keep revealing new ways to approach problems. It's less "I use CRDTs" and more "CRDTs have changed what I think is feasible", similarly for stuff like bloom filters (no-false-negatives can be very powerful) and locality-sensitive hashing (don't search for all X every time, pre-compute a simpler version so you only check promising items). Concurrent data structures have also been pretty fruitful, since they're also often "how can we guarantee X in our distributed system, and what primitives do we need" strategies / solutions.

(obviously some of this is "I learned a tangential thing while learning about X", but that's kinda the point. you may never use a rope, but you might be introduced to a new idea in the process of learning about it)

It's hard to beat an R-Tree for doing in-memory geospatial queries.

I work in speech recognition; from this list we use lots of advanced automata algorithms, Bloomier filters, interesting types of hashing, and fancy priority queues.

It's not as interesting as some of the structures in the PDF, but we are working on a problem where we have lots of timestamped data and need to quickly and frequently access the nearest data point to a given input timestamp. We ended up using a tree-like structure; that said, we just used whatever was available in the language.

I work on a mapping team. We implemented our own R-tree for spatial search, as existing libraries had some unoptimized hot paths for our use case.

FM-index is super useful for large string collections.

I love learning Algorithms and Data Structures. The issue is that I don't get to use these frequently, not even basic DS. Most of what I need exists in the language or some framework, and if I am to implement it from scratch, I am sure I will do worse. The only time I really use this knowledge is during interviews.

I felt exactly the same. Now, after working for 17 years, I realize that I barely ever use this stuff, and when asked about it in an interview I do a lot worse than I did straight out of university. I am, however, a much better software engineer than I was then.

Hm, where else are lowest common ancestors in tree structures useful? Storing, for instance, ORDPATH/DeweyIDs (hierarchical node labels) allows you to simply check for a common prefix. I think maybe for locking in a storage system with multiple read/write transactions, or to determine which of two given nodes is the first in preorder, which is useful for XQuery/XPath processing (to determine the so-called document order). Can anyone think of other uses, either for LCAs or for hierarchical node labels in trees?

You can use lowest common ancestor queries in conjunction with suffix trees to solve a lot of interesting string problems. For example, take two indices within a string, find their corresponding suffixes in the suffix tree, and then take their LCA. That gives you an internal node corresponding to the longest string that appears starting at both indices (this is called their "longest common extension.") You can use this as a subroutine in a bunch of genomics applications.
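To make "longest common extension" concrete, here is the naive O(n) version in Python; the suffix-tree-plus-LCA machinery described above exists precisely to answer this same query in O(1) after linear-time preprocessing:

```python
def longest_common_extension(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:].

    Naive O(n) character scan. With a suffix tree and constant-time LCA
    queries, the same answer is the string depth of the LCA of the two
    suffix leaves, retrievable in O(1) per query."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k
```

In genomics pipelines this subroutine gets called millions of times (e.g. inside approximate-matching algorithms), which is why the preprocessing-heavy O(1) version pays off.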

Thanks, great use case. I have to say I should read up on genomics... :-)

Error-boundaries in VDOM frameworks. If you have an asynchronous diff cycle, you might get multiple errors from more than one node in a single diff cycle. One option is to bubble up errors to the nearest LCA that is an error-boundary. In general if you have marked two nodes in a tree and need to modify the smallest subtree that contains those nodes, then you will probably need to know the LCA.

Lowest Common Ancestor shows up in graph theory, particularly to help compute dominator trees or dominator frontiers.

These problems arise in compilers that convert programs into static single assignment (SSA) form.
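A standard way to answer LCA queries fast in any of these settings is binary lifting. A Python sketch (the adjacency-list input and BFS preprocessing are my choices here; O(n log n) preprocessing, O(log n) per query):

```python
from collections import deque

class LCA:
    """Binary-lifting lowest common ancestor for an undirected tree.

    `adj` is an adjacency list; `up[k][v]` is the 2^k-th ancestor of v."""

    def __init__(self, adj, root=0):
        n = len(adj)
        self.LOG = max(1, (n - 1).bit_length())
        self.depth = [0] * n
        self.up = [[root] * n for _ in range(self.LOG)]
        # BFS from the root to fill depth[] and the direct-parent row.
        seen = [False] * n
        seen[root] = True
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    self.depth[w] = self.depth[v] + 1
                    self.up[0][w] = v
                    queue.append(w)
        # Each row doubles the jump length of the row below it.
        for k in range(1, self.LOG):
            for v in range(n):
                self.up[k][v] = self.up[k - 1][self.up[k - 1][v]]

    def query(self, u, v):
        if self.depth[u] < self.depth[v]:
            u, v = v, u
        # Lift the deeper node up to the other's depth.
        diff = self.depth[u] - self.depth[v]
        for k in range(self.LOG):
            if diff >> k & 1:
                u = self.up[k][u]
        if u == v:
            return u
        # Lift both in lockstep until their parents coincide.
        for k in range(self.LOG - 1, -1, -1):
            if self.up[k][u] != self.up[k][v]:
                u, v = self.up[k][u], self.up[k][v]
        return self.up[0][u]
```

The same table doubles as a "jump to the 2^k-th ancestor" primitive, which is handy on its own, e.g. for level-ancestor queries.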

I've seen it used in blockchain analysis and calculating the base for a three way merge operation.

So much stuff you can learn in the computer science world. For me, blockchains are still more or less a black box; I should read a book about this stuff, as it seems to be hyped everywhere ;-)

Thanks :)

Perhaps worth starting with Merkle trees before digging into blockchain implementations.

I'm familiar with merkle trees :-)

Thanks for the suggestion/tip :-)

Does anyone know books which explore in depth recent development in advanced data structure (or just not well known advanced data structure) ?

I'm surprised no one mentioned the "Crazy Good Chocolate Pop Tarts" algorithm. That one took me by surprise :)

The list is great, but unfortunately it has no pointers to documentation. I can find nothing online for some of the topics.

Where can I find the description and discussion of 'ravel trees'?

They also caught my eye. I finally found a paper here (the correct name seems to be RAVL tree): http://sidsen.azurewebsites.net/papers/ravl-trees-journal.pd...

Super, thanks for being better at searching! (The name makes more sense this way.) :-)

This is awesome, thank you!

I used nested R-trees in a personal project recently (a sharded in-memory spatial DB for a game), which is not something I thought I'd ever have to do.

What font are they using? I find it very unpleasant to read on a 4k screen set to 125% scale. Or maybe it's Firefox.

I loaded the page on Chrome and the text is definitely heavier (and more readable) but also less sharp.

Where can I learn more about ravel trees?

My favourite somewhat obscure data structure that I didn't see on the list: Hash Array Mapped Tries.

Great! More fodder for interview questions /s.

That's probably it; any use of these algorithms in business software would need to be justified against the extra training your fresh-out-of-school replacement would need in order to maintain it.

This is quite cool, thanks.

Nice share!

Please provide the solutions in a git repo. Thanks.

Paraphrasing (almost every) math book: "Solutions are left as an exercise for the reader."

Sure, but it will be quite expensive for you.
