Hacker News new | past | comments | ask | show | jobs | submit login
The Case for Learned Index Structures (arxiv.org)
398 points by anjneymidha on Dec 11, 2017 | hide | past | favorite | 66 comments

This is a very interesting paper, lots of ideas I haven't seen before and it gets a few ideas very right that most computer scientists overlook (more on that below). But comparing performance against B-trees tilts the results. State-of-the-art access methods use succinct data structures which encode the data distribution, which was called out in the paper as something that doesn't really exist currently. I've built and seen these structures at a few different companies. In fairness, I don't recall any literature on these (as is the norm for high-end database computer science). The key points on succinct structures:

- They can be constructed adaptively for high-throughput mixed workloads, also called out in the paper as not being a feature of these structures -- which is true if you limit the scope to succinct structures that don't encode data distribution. One of the driving use cases, beside reducing B-tree index bloat, is real-time write performance.

- These structures are extremely compact. The equivalent search structure for the test data set in the paper easily fits (per my cocktail napkin calculation) entirely in L2 cache and each level is traversed with a handful of (admittedly clever) bit-twiddling operations. While the algorithms in the paper are much more compact than B-trees, which is an interesting and valuable result, they are still much larger than alternatives. It should be noted that the succinct data structures used here are not tiny B-trees -- they operate on different principles.

- Multidimensional versions of the succinct data structures already exist. The majority of the performance of my spatial databases can be attributed to the development of succinct index structures that generalize to spatial data models. The spatial algorithms allow it to scale out but the performance is due to succinctness.

Which is to say, the ideas in the paper are really neat, but they are unlikely to supplant other algorithms for databases.

Where the paper really gets it right is framing "indexing" as a learning/prediction problem. Most computer scientists think of indexes as a data structures to search for things but give little thought to the theoretical limits of indexing in the abstract. As in, what is the best possible indexing structure for data models generally, and how close can we get to that for practical purposes? The abstract description of optimal indexing is essentially as an algorithmic induction/prediction problem, which makes an ideal implementation intractable but when you start to think of indexes in terms of algorithmic information instead of organizing values, it leads to interesting constructs like the data structures mentioned above that are dramatically more efficient and effective than traditional top-down indexing algorithm designs.

Optimal index construction is, oddly enough, closely related to the problem of AI. Consequently, it doesn't surprise me that algorithms from AI can be applied to produce efficient index representations. At the limit, you would expect the data structures for indexing and AI to converge.

> In fairness, I don't recall any literature on these (as is the norm for high-end database computer science).

Check out Navarro's book Compact Data Structures: A Practical Approach. It's a really great survey of the literature from a data structures perspective. It also spends a lot of time on compressing these structures, an alternate perspective to learning/prediction, as you mention.

Could you share some pointers for someone who'd like to read more on the topic of succinct data structures, esp. multidimensional ones? A book or maybe a paper? Both theory and implementation.

How does it relate to things like fm-indexes [0], searching in the compressed domain or projecting the data into a higher dimensional but flat space? To state of the obvious, we wouldn't need indexes if data was linear and evenly distributed.

Indexing feels like a subset of compression. And compression's logical conclusion is AGI [1].

[0] https://en.wikipedia.org/wiki/FM-index

[1] http://prize.hutter1.net/hfaq.htm

Here is a brief summary of the approach. http://muratbuffalo.blogspot.com/2017/12/paper-summary-case-...

I like it. This is a natural trend: we will see databases and distributed systems become more data-aware to achieve efficiency and performance. http://muratbuffalo.blogspot.com/2015/08/new-directions-for-...

Thanks for posting this summary. I haven't seen it yet.

The really cool application of this, which they allude to at the end of the paper, is multi dimensional indexes. For example right now kNN for similarity search (think recommender systems) is usually done with some form of K-D tree or projection method. This has a bunch of downsides, including memory usage. If we can "learn" that index directly, that would be really fancy.

Sounds like an interesting approach, but just that I understand the scope or impact of the paper right: Surely data-aware indexing can't be the novel part, right? Or was it always so complicated to model the data distribution that no one managed to do it until now? It seems natural to try to adapt your index to the type of data you see more often than not.

Very cool idea though.

The novel part is using machine learning to implement that data awareness, in principle this means that the data owner doesn't need to use any human knowledge or heuristics to build an efficient index -- the machine running a general algorithm can find the optimized representation of the index.

Thanks for summarizing one of our major points.

Some thoughts on the paper as a researcher in space-efficient data structures:

When we design compressed data structures, we generally assume that the data has been generated by some statistical or combinatorial model. Then we design the encoding for the data and the data structure itself, assuming that model. There are often principled ways to achieve that.

We also need an index to find the correct block of data to decompress. This is where we often have to resort to ugly hacks. The constant-time select structure for bitvectors is a good example. To find the i-th 1-bit, we split the bitvector into blocks and superblocks of certain number of 1-bits. Then we have different ways to encode the blocks and superblocks, depending on whether they are dense or sparse. There are lots of special cases, but the index still often ends up wasting space for redundant information. We could adapt to the data better by adding even more special cases, but that quickly becomes impractical.

The key idea in this paper is that data structures often contain components that are learnable. For example, B-trees and rank/select indexes are essentially increasing functions, which should be easy to learn. We can afford having more special cases, if we don't hard-code them but handle them as data. This in turn can make the indexes smaller and faster.

The drawback is index construction, which requires a lot of resources. (This is a common problem with sophisticated data structures.) Spending minutes to index 200 million elements can be justified in some applications, but it's way too much in other applications.

This seems interesting but to me there is a flaw near the beginning. They state a btree assumes worst case distribution. That's a feature . Much better than a "this will be fast, if you're lucky" distribution.

But who knows, maybe for read heavy analytical workloads this will be an interesting way of improving performance or reducing space usage.

> Much better than a "this will be fast, if you're lucky" distribution.

This is a really wrongheaded way to look at this. BTrees are themselves an adaptive function that can be used at the top of bottom of an index. We've just got other ones here being proposed that are much faster and smaller, when they work. And there is a methodology here to identify and train them, when they do.

This exists in sharp contrast to our existing family of algorithms, which tend to lack any sort of introspective features in their operation. Tree building, for example, is (like many divide-and-conquer algorithms that are actually trees) incredibly sensitive to the ordering of the arguments, and algorithms that balance them use only structural information and balance them at the last possible (and often most expensive) moment.

What's most amazing about this approach is that you've got a recursive model where the model is trained in layers that propose a submodel, or "an expert which has a better knowledge about certain keys."

That model could be a tree, or a b-tree, or a linear regression, or a neural network. They actually test using other tree types and search algorithms (which are implicitly tree types) in the paper.

This process adaptively trains models which can, in a very real sense, increase and decrease their level of detail as the structure of the data demands.

And in the later case (which seems to excite people less but excites me more) they propose a bloom filter alternative that can automatically identify useful structural elements and latch onto those as part of its categorization, which is is perhaps one of the most revolutionary ideas I've seen bringing ML into the distributed systems world.

B-trees incrementally rebalance on every insert


I thought maybe I had made an error in writing or editing that implied that b-trees were not self-balancing. So I reread it twice. Maybe I just don't see where I implied otherwise.

That doesn't mean that the amount of computation and memory comparisons done doesn't vary based on the order of the input.

This is the exact point of view they are rejecting. You want spectacular average-case performance at the cost of a slow but not catastrophic worst-case.

So basically suitable for batch mode only? There's really no other situation in software where average is a useful measure.

How about searching the web? I'd rather most queries take 1 second, and 10% taking 10 seconds, than every query taking 5 seconds.

I don't understand how you can be so confident about attaching a utility function to latency.

Well, no, for many user-facing use cases your performance essentially averages out if you have enough users - i.e. for a single user even the worst case performance is faster than they care about, but you're limited by how many concurrent users your hardware can serve.

In that case, you want the average to be as low as possible because then you need less hardware for the same throughput, or you can get faster response time for any users once the waiting/delay caused by other users is taken into account.

A blocking read operation that predictably takes 10 ms will limit you to 100 requests per second, and with an uneven/bursty distribution many of those requests will be queued for 100ms; A blocking read operation that sometimes takes 50 ms but has 5 ms on average will have almost no queuing in 100 requests per second, so the 50ms will really be the worst case.

This is not only wrong for most user-facing cases, but it misses the key notion of training 1-N sub-models, which is a gateway to actually measuring the degree of success or failure in the index building step and modifying it as needed.

Depends on application. They're looking at analytical workloads, which presumably means they're focused on batch jobs. For OLTP-type workloads, I care very much about worst-case performance.

And they will be forced to learn, the hard way, that this is a bad way to approach algorithms design. The same argument applies to quicksort, and we know how that turned out: http://www.cs.dartmouth.edu/~doug/mdmspe.pdf

That argument cannot be generalized to every use-case. In fact it's very specific to quick sort.

Yes, this approach may not work for a lot of situations, but there are some (as pointed out elsewhere in the comments), where it's perfectly reasonable, even desirable to improve the average runtime at the cost of the worst case.

That adversarial quicksort only applies under specific constraints (2. Pivot-choosing takes O(1) comparisons; all other comparisons are for partitioning. ) that are easily worked around.

Honestly, for this specific use case what I want is for btree to be _much_ better than learned index structures in some cases and _much_ worse in other cases. That way I can run both in parallel and whichever queries back first is the one I go with. My overall performance skyrockets.

Databases see order-of-magnitude slowdowns depending whether the query plan uses an index. Operating system caches result in high performance variation for disk I/O. Even CPU's have highly variable performance depending on branch prediction and the size of on-CPU caches. So this is nothing new; other than hard real-time programming, we already live with high performance variation.

This seems like a rehash of hashing. sorry for the pun. Neural Networks are, in essence, just really good at compression.

All model building is compression. That's the point of building models!

The innovation is not in the technique itself. Showing the technique is practical in current hardware for current problems is what the paper is about.

If you want to view it that way, it is data dependent hashing. That is the innovation: the mapping from data to hash is learned from the distribution of the data.

It's not really that new - gperf has been doing that for decades.

Perfect Hashing was crica 1984.

Perfect hashing provably has size that is O(n) in the data being indexed. You can think of this as approximate perfect hashing. You trade off a bit of accuracy for a lot of compression.

Thanks. That [is] what I gathered as well. My OP was merely to note relevant existing results.

Sorry to be a Debbie Downer, but what's the point?

It's not that the performance of the existing indices is bad. It's the fact that they are often ignored by the more complex operations even when the column is indexed.

I'd be more curious to see that addressed instead.

the performance of the existing indices _is_ bad in many cases.

With the reason being, they are ignored.

Not when they are used.

As it says in the paper, this might be useful for data warehouses. But, it’s not coming to postgres anytime soon. Index updates on the order of seconds to minutes would be too much for a transactional db.

There is also the cold start problem. How do you start to lay out the data on disk as you begin inserting it? Do you have a pre-trained net and use it at first (inserting where the net thinks the data should be)? The strategy probably differs by index type.

Most of current storage backends have Log-structured Merge Tree implementation or something like that.

The larger layers of LSMT have enormous size and should be accessed/built as rare as possible.

Being able to predict that given element exists in the larger layers at all is quite a bonus. You can skip reading megabytes of data.

The rareness of building of the larger layers justifies training deep neural model for them.

I cannot verify existence of LSMT backend for major SQL DB engines, but NoSQL engines use it a plenty: https://en.wikipedia.org/wiki/Log-structured_merge-tree

So you want a LSM for inserts and the DNN for reads? Seems OK. You still have to update/retrain the DNN after an insert into a larger layer, which will be expensive. So you’d probably get high latency at the 99% (or some high number).

There are no inserts into larger layers, only merges. Which are long (usually processed in background by separate thread) and that longness justifies training a new net in parallel to merge process.

Rocksdb is lstm for an SQL engine

Well, Postgres does already have something pretty close in Dexter. Made possible due to the Hypothetical Indexes extension. Dexter can either automatically create indexes using concurrent index creation or it can build a list that you can load at your convenience.

I'm just waiting for it to make it into one of the big PG providers.


> it’s not coming to postgres anytime soon

Isn't Peloton close? http://pelotondb.io/

Can someone please explain this to me

> For efficiency reasons it is common not to index every single key of the sorted records, rather only the key of every n-th record, i.e., the first key of a page. [1] This helps to significantly reduce the number of keys the index has to store without any significant performance penalty

Is this true of B-Tree indexes in Postgres, for example?

This only applies to indexes over a pre-sorted structure. For unordered, PG has something vaguely similar in the form of BRIN indexes, but really they're totally different tricks

I'd say this is a generally misleading statement, but if you like it is right in some context.

In particular, one could argue that it is true for clustered indexes (e.g. in SQL Server or MySQL/InnoDB). In that case the table is actually in an ordered fashion and happens to be identical to the leaf-node-level of the index. However, if you argue that this level doesn't belong to the index but is just the payload, the level above the leaf nodes is as they say: it has one pointer per page.

Not complaining, but curious about why this link took off and mine didn't. (https://news.ycombinator.com/item?id=15892476)

Did adding the "Data Storage and Retrieval as an ML Task" synopsis make it less appealing?

There's a lot of randomness in what makes it to the front page on HN.

Show HN: (2018) Using Learned Index Structures to optimize HN titles.

(Sorry if that doesn't make any sense, I haven't actually read the link yet.)

Can anyone with a research/professional background in DB shed some light about the significance/implications behind this paper?

Normal DB indexes are designed to be general-purpose so that they give predictable performance no matter the data you insert. The paper basically describes a more efficient way to look up data by using an index that's tailored to the specific dataset it's built over. It's expected that you'd be able to get a performance increase by doing this. In more detail, they use neural nets to learn an approximation of the distribution of the data and use that to look up the rough position of each key faster than usual. They still use traditional structures for the "last-mile" which is not so easily learned. You could accomplish the same thing without NNs by using anything that can approximate the distribution of the data. E.g. a histogram would work for some cases, and you could do some PCA and normalization first to deal with more cases. NNs have the advantage that they can learn more complex distributions.

The other advantage is more opportunity to parallize the work. So with new silicon architectre like the Google TPUs. This part is an important element to the equation.

This is actually pretty cool - it makes sense to figure out ways to apply ML to core infrastructure in storage and databases. One wonders what other parts of such systems can be ML optimized... ?

Any place where heuristics and arbitrary constants are used.

Interesting idea. It seems like this technique could be applicable to bioinformatics, especially sequence assembly.

Also known as Skynet-index

OK. In a way, this is sort of terrifying.

Meh. That's only because of the association between AI and disaster movies. If you just think about it as math, it's no more (or less) scary than what came before it.

The invention of atomic weapons was "just math".

The invention of atomic weapons was physics, a precise description of the elementals of our universe. Results from Riemann geometry are not dangerous, but the statement "our universe's spacetime obeys Riemann geometry" is.

We can play a similar game with nuclear weapons, no?

The maths used to discover nuclear weapons weren't dangerous. What was dangerous was that the maths happened to align with our physical world.

Actually, neither was that dangerous. For it to be dangerous took years of engineering work.

What does it matter if it's "just physics" or "just engineering" or "just math"? Anything you can say about AI, you can say about the atomic bomb. AI involves a quite a bit of engineering as well. This argument is silly.

At the end of the day we are still building a powerful new technology. To assume it will be completely harmless is naive. No technology is completely harmless. But at least our previous inventions didn't sprout minds of their own. And have desires and plans independent of their creators.

This paper summarizes the results of 'years of engineering work'.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact