- They can be constructed adaptively for high-throughput mixed workloads, which the paper calls out as not being a feature of these structures -- true only if you limit the scope to succinct structures that don't encode data distribution. One of the driving use cases, besides reducing B-tree index bloat, is real-time write performance.
- These structures are extremely compact. The equivalent search structure for the test data set in the paper easily fits (per my cocktail napkin calculation) entirely in L2 cache and each level is traversed with a handful of (admittedly clever) bit-twiddling operations. While the algorithms in the paper are much more compact than B-trees, which is an interesting and valuable result, they are still much larger than alternatives. It should be noted that the succinct data structures used here are not tiny B-trees -- they operate on different principles.
- Multidimensional versions of the succinct data structures already exist. The majority of the performance of my spatial databases can be attributed to the development of succinct index structures that generalize to spatial data models. The spatial algorithms allow it to scale out but the performance is due to succinctness.
Which is to say, the ideas in the paper are really neat, but they are unlikely to supplant other algorithms for databases.
Where the paper really gets it right is framing "indexing" as a learning/prediction problem. Most computer scientists think of indexes as data structures to search for things but give little thought to the theoretical limits of indexing in the abstract. As in, what is the best possible indexing structure for data models generally, and how close can we get to that for practical purposes? The abstract description of optimal indexing is essentially an algorithmic induction/prediction problem, which makes an ideal implementation intractable. But when you start to think of indexes in terms of algorithmic information instead of organizing values, it leads to interesting constructs like the data structures mentioned above, which are dramatically more efficient and effective than traditional top-down indexing designs.
Optimal index construction is, oddly enough, closely related to the problem of AI. Consequently, it doesn't surprise me that algorithms from AI can be applied to produce efficient index representations. At the limit, you would expect the data structures for indexing and AI to converge.
Check out Navarro's book Compact Data Structures: A Practical Approach. It's a really great survey of the literature from a data structures perspective. It also spends a lot of time on compressing these structures, an alternate perspective to learning/prediction, as you mention.
Indexing feels like a subset of compression. And compression's logical conclusion is AGI.
I like it. This is a natural trend: we will see databases and distributed systems become more data-aware to achieve efficiency and performance. http://muratbuffalo.blogspot.com/2015/08/new-directions-for-...
Very cool idea though.
When we design compressed data structures, we generally assume that the data has been generated by some statistical or combinatorial model. Then we design the encoding for the data and the data structure itself, assuming that model. There are often principled ways to achieve that.
We also need an index to find the correct block of data to decompress. This is where we often have to resort to ugly hacks. The constant-time select structure for bitvectors is a good example. To find the i-th 1-bit, we split the bitvector into blocks and superblocks containing a certain number of 1-bits each. Then we have different ways to encode the blocks and superblocks, depending on whether they are dense or sparse. There are lots of special cases, but the index still often ends up wasting space on redundant information. We could adapt to the data better by adding even more special cases, but that quickly becomes impractical.
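To make that concrete, here is a toy sketch in Python of the superblock idea (sample the position of every k-th 1-bit, then scan within the superblock). A real constant-time select structure adds block-level encodings and the dense/sparse special cases described above, so treat this only as an illustration of the layout:

    # Toy select(i) index: record the position of every SAMPLE-th 1-bit
    # ("superblocks"), then scan forward within a superblock.
    SAMPLE = 4  # 1-bits per superblock (real structures use hundreds)

    def build_select_index(bits):
        index, ones = [], 0
        for pos, b in enumerate(bits):
            if b:
                if ones % SAMPLE == 0:
                    index.append(pos)      # position of the (ones)-th 1-bit
                ones += 1
        return index

    def select(bits, index, i):
        """Return the position of the i-th 1-bit (0-based)."""
        pos = index[i // SAMPLE]           # jump to the right superblock
        ones = i % SAMPLE                  # 1-bits still to skip
        while ones or not bits[pos]:       # linear scan inside the superblock
            if bits[pos]:
                ones -= 1
            pos += 1
        return pos

    bits = [0,1,1,0,0,1,0,1,1,0,1,1,0,0,1]
    idx = build_select_index(bits)
    assert select(bits, idx, 5) == 10      # the 6th 1-bit sits at position 10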
The key idea in this paper is that data structures often contain components that are learnable. For example, B-trees and rank/select indexes are essentially increasing functions, which should be easy to learn. We can afford to have more special cases if we don't hard-code them but handle them as data. This in turn can make the indexes smaller and faster.
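As a rough illustration of "an increasing function that should be easy to learn", here is a sketch (my own, not from the paper) that fits a straight line to the key-to-position mapping of a sorted array, records the worst-case error, and corrects each prediction with a bounded local search:

    import bisect

    def fit_line(keys):
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys) or 1.0
        slope = cov / var
        intercept = mean_p - slope * mean_k
        # Worst-case prediction error bounds the local search window.
        err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
        return slope, intercept, int(err) + 1

    def lookup(keys, model, key):
        slope, intercept, err = model
        guess = int(slope * key + intercept)
        lo = max(0, guess - err)
        hi = min(len(keys), guess + err + 1)
        i = bisect.bisect_left(keys, key, lo, hi)   # search only inside the band
        return i if i < len(keys) and keys[i] == key else None

    keys = sorted(range(0, 2000, 2))                # a very learnable distribution
    model = fit_line(keys)
    assert lookup(keys, model, 1234) == keys.index(1234)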
The drawback is index construction, which requires a lot of resources. (This is a common problem with sophisticated data structures.) Spending minutes to index 200 million elements can be justified in some applications, but it's way too much in other applications.
But who knows, maybe for read-heavy analytical workloads this will be an interesting way of improving performance or reducing space usage.
This is a really wrongheaded way to look at this. B-trees are themselves an adaptive function that can be used at the top or bottom of an index. We've just got other ones being proposed here that are much faster and smaller, when they work. And there is a methodology here to identify and train them, when they do.
This exists in sharp contrast to our existing family of algorithms, which tend to lack any sort of introspective features in their operation. Tree building, for example, is (like many divide-and-conquer algorithms that are actually trees) incredibly sensitive to the ordering of the arguments, and the algorithms that balance trees use only structural information, rebalancing at the last possible (and often most expensive) moment.
What's most amazing about this approach is that you've got a recursive model where the model is trained in layers that propose a submodel, or "an expert which has a better knowledge about certain keys."
That model could be a tree, or a b-tree, or a linear regression, or a neural network. They actually test using other tree types and search algorithms (which are implicitly tree types) in the paper.
This process adaptively trains models which can, in a very real sense, increase and decrease their level of detail as the structure of the data demands.
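If it helps, here is a very rough sketch of that staged structure: a root model routes a key to one of a handful of second-stage "experts", each of which predicts a position over its own slice of the keys. Both stages are plain linear fits here purely for illustration; the paper lets any model type sit at any stage, and a real index would correct the final prediction with a local search:

    EXPERTS = 4

    def fit(keys, positions):
        """Least-squares line through (key, position) pairs."""
        n = len(keys)
        mk, mp = sum(keys) / n, sum(positions) / n
        var = sum((k - mk) ** 2 for k in keys) or 1.0
        slope = sum((k - mk) * (p - mp) for k, p in zip(keys, positions)) / var
        return slope, mp - slope * mk

    def build_rmi(keys):
        n = len(keys)
        # The root model maps a key to an expert number in [0, EXPERTS).
        root = fit(keys, [i * EXPERTS / n for i in range(n)])
        buckets = [[] for _ in range(EXPERTS)]
        for i, k in enumerate(keys):
            e = min(EXPERTS - 1, max(0, int(root[0] * k + root[1])))
            buckets[e].append((k, i))
        experts = [fit([k for k, _ in b], [p for _, p in b]) if b else (0.0, 0.0)
                   for b in buckets]
        return root, experts

    def predict(rmi, key):
        root, experts = rmi
        e = min(EXPERTS - 1, max(0, int(root[0] * key + root[1])))
        slope, intercept = experts[e]
        return int(slope * key + intercept)      # approximate position only

    keys = sorted(set(x * x for x in range(100)))   # deliberately non-uniform
    rmi = build_rmi(keys)
    print(predict(rmi, 2500), keys.index(2500))     # predicted vs true position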
And in the latter case (which seems to excite people less but excites me more) they propose a bloom filter alternative that can automatically identify useful structural elements and latch onto those as part of its categorization, which is perhaps one of the most revolutionary ideas I've seen bringing ML into the distributed systems world.
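The architecture, as I understand it, is roughly: a classifier screens membership queries first, and every key the classifier misses goes into a small exact backup structure, so there are no false negatives. A toy sketch (the "model" here is a deliberately dumb hand-written rule, and a Python set stands in for the backup Bloom filter):

    def model_says_present(key):
        return key % 2 == 0            # stand-in for a trained classifier

    def build(keys):
        # The backup holds only the keys the model fails to recognize.
        return {k for k in keys if not model_says_present(k)}

    def maybe_contains(backup, key):
        return model_says_present(key) or key in backup

    keys = {2, 4, 6, 8, 10, 7, 13}     # mostly even, a couple of exceptions
    backup = build(keys)               # backup == {7, 13}, much smaller than keys
    assert all(maybe_contains(backup, k) for k in keys)   # no false negatives
    print(maybe_contains(backup, 12))  # True: a false positive, as with any Bloom filter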
That doesn't mean that the amount of computation and memory comparisons done doesn't vary based on the order of the input.
I don't understand how you can be so confident about attaching a utility function to latency.
In that case, you want the average to be as low as possible because then you need less hardware for the same throughput, or you can get faster response time for any users once the waiting/delay caused by other users is taken into account.
A blocking read operation that predictably takes 10 ms will limit you to 100 requests per second, and with an uneven/bursty arrival distribution many of those requests will be queued for 100 ms. A blocking read operation that sometimes takes 50 ms but averages 5 ms will have almost no queuing at 100 requests per second, so the 50 ms really will be the worst case.
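A quick back-of-the-envelope simulation of that argument (illustrative numbers only): Poisson arrivals at 100 req/s into a single FIFO server, comparing a fixed 10 ms read against one that is usually a few ms but occasionally 50 ms, with a mean of about 5 ms:

    import random

    def simulate(service_time, n=100_000, rate=100.0, seed=1):
        random.seed(seed)
        t, busy_until, latencies = 0.0, 0.0, []
        for _ in range(n):
            t += random.expovariate(rate)          # next arrival (seconds)
            start = max(t, busy_until)
            busy_until = start + service_time()
            latencies.append(busy_until - t)       # queueing delay + service
        latencies.sort()
        return latencies[len(latencies) // 2], latencies[int(0.99 * len(latencies))]

    fixed = lambda: 0.010                                        # always 10 ms
    bursty = lambda: 0.050 if random.random() < 0.05 else 0.0026 # ~5 ms mean

    print("fixed 10 ms  (median, p99):", simulate(fixed))
    print("bursty ~5 ms (median, p99):", simulate(bursty))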
Yes, this approach may not work for a lot of situations, but there are some (as pointed out elsewhere in the comments), where it's perfectly reasonable, even desirable to improve the average runtime at the cost of the worst case.
It's not that the performance of the existing indices is bad. It's the fact that they are often ignored by the more complex operations even when the column is indexed.
I'd be more curious to see that addressed instead.
Not when they are used.
There is also the cold start problem. How do you start to lay out the data on disk as you begin inserting it? Do you have a pre-trained net and use it at first (inserting where the net thinks the data should be)? The strategy probably differs by index type.
The larger layers of an LSMT are enormous and should be accessed/built as rarely as possible.
Being able to predict whether a given element exists in the larger layers at all is quite a bonus: you can skip reading megabytes of data.
The rarity with which the larger layers are rebuilt justifies training a deep neural model for them.
I cannot verify the existence of an LSMT backend for major SQL DB engines, but NoSQL engines use it plenty: https://en.wikipedia.org/wiki/Log-structured_merge-tree
I'm just waiting for it to make it into one of the big PG providers.
Isn't Peloton close? http://pelotondb.io/
> For efficiency reasons it is common not to index every single key of the sorted records, rather only the key of every n-th record, i.e., the first key of a page. This helps to significantly reduce the number of keys the index has to store without any significant performance penalty
Is this true of B-Tree indexes in Postgres, for example?
In particular, one could argue that it is true for clustered indexes (e.g. in SQL Server or MySQL/InnoDB). In that case the table itself is stored in sorted order and happens to be identical to the leaf-node level of the index. However, if you argue that this level doesn't belong to the index but is just the payload, the level above the leaf nodes is as they say: it has one pointer per page.
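For anyone unfamiliar with the sparse-index trick the quote describes, a small sketch: keep only the first key of each fixed-size page, binary-search those "fence" keys to pick a page, then search within the page (PAGE_SIZE and the data here are made up):

    import bisect

    PAGE_SIZE = 4
    records = sorted(range(0, 100, 3))                 # sorted keys: 0, 3, 6, ...
    fence_keys = records[::PAGE_SIZE]                  # first key of every page

    def find(key):
        page = bisect.bisect_right(fence_keys, key) - 1        # which page?
        if page < 0:
            return None
        lo = page * PAGE_SIZE
        i = bisect.bisect_left(records, key, lo, lo + PAGE_SIZE)  # within the page
        return i if i < len(records) and records[i] == key else None

    assert find(42) == records.index(42)
    assert find(43) is None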
Did adding the "Data Storage and Retrieval as an ML Task" synopsis make it less appealing?
The maths used to discover nuclear weapons weren't dangerous. What was dangerous was that the maths happened to align with our physical world.
At the end of the day we are still building a powerful new technology. To assume it will be completely harmless is naive. No technology is completely harmless. But at least our previous inventions didn't sprout minds of their own, with desires and plans independent of their creators.