
The Case for Learned Index Structures - anjneymidha
https://arxiv.org/abs/1712.01208
======
jandrewrogers
This is a very interesting paper, lots of ideas I haven't seen before and it
gets a few ideas very right that most computer scientists overlook (more on
that below). But comparing performance against B-trees tilts the results.
State-of-the-art access methods use succinct data structures that encode the
data distribution, which the paper calls out as something that doesn't really
exist currently. I've built and seen these structures at a few
different companies. In fairness, I don't recall any literature on these (as
is the norm for high-end database computer science). The key points on
succinct structures:

\- They can be constructed adaptively for high-throughput mixed workloads,
also called out in the paper as _not_ being a feature of these structures --
which is true if you limit the scope to succinct structures that _don't_
encode the data distribution. One of the driving use cases, besides reducing
B-tree index bloat, is real-time write performance.

\- These structures are extremely compact. The equivalent search structure for
the test data set in the paper easily fits (per my cocktail napkin
calculation) entirely in L2 cache and each level is traversed with a handful
of (admittedly clever) bit-twiddling operations. While the algorithms in the
paper are much more compact than B-trees, which is an interesting and valuable
result, they are still much larger than alternatives. It should be noted that
the succinct data structures used here are not tiny B-trees -- they operate on
different principles.

\- Multidimensional versions of the succinct data structures already exist.
The majority of the performance of my spatial databases can be attributed to
the development of succinct index structures that generalize to spatial data
models. The spatial algorithms allow it to scale out but the performance is
due to succinctness.

Which is to say, the ideas in the paper are really neat, but they are unlikely
to supplant other algorithms for databases.

Where the paper really gets it right is framing "indexing" as a
learning/prediction problem. Most computer scientists think of indexes as
data structures to search for things but give little thought to the
theoretical limits of indexing _in the abstract_. As in, what is the best
possible indexing structure for data models generally, and how close can we
get to that for practical purposes? The abstract description of optimal
indexing is essentially an algorithmic induction/prediction problem, which
makes an ideal implementation intractable. But when you start to think of
indexes in terms of algorithmic information instead of organizing values, it
leads to interesting constructs, like the data structures mentioned above,
that are dramatically more efficient and effective than traditional top-down
indexing algorithm designs.

Optimal index construction is, oddly enough, closely related to the problem of
AI. Consequently, it doesn't surprise me that algorithms from AI can be
applied to produce efficient index representations. At the limit, you would
expect the data structures for indexing and AI to converge.

~~~
sujayakar
> In fairness, I don't recall any literature on these (as is the norm for
> high-end database computer science).

Check out Navarro's book _Compact Data Structures: A Practical Approach_. It's
a really great survey of the literature from a data structures perspective. It
also spends a lot of time on compressing these structures, an alternate
perspective to learning/prediction, as you mention.

------
mad44
Here is a brief summary of the approach.
http://muratbuffalo.blogspot.com/2017/12/paper-summary-case-for-learned-index.html

I like it. This is a natural trend: we will see databases and distributed
systems become more data-aware to achieve efficiency and performance.
http://muratbuffalo.blogspot.com/2015/08/new-directions-for-distributed-systems.html

~~~
edchi
Thanks for posting this summary. I haven't seen it yet.

------
anon1253
The really cool application of this, which they allude to at the end of the
paper, is multi-dimensional indexes. For example, right now kNN for similarity
search (think recommender systems) is usually done with some form of k-d tree
or projection method. This has a bunch of downsides, including memory usage.
If we can "learn" that index directly, that would be really fancy.

------
Asdfbla
Sounds like an interesting approach, but just so I understand the scope and
impact of the paper right: surely data-aware indexing can't be the novel part,
right? Or was it always so complicated to model the data distribution that no
one managed to do it until now? It seems natural to try to adapt your index to
the type of data you see more often than not.

Very cool idea though.

~~~
goialoq
The novel part is using machine learning to implement that data awareness. In
principle this means that the data owner doesn't need to use any human
knowledge or heuristics to build an efficient index -- the machine running a
general algorithm can find the optimized representation of the index.

~~~
edchi
Thanks for summarizing one of our major points.

------
jltsiren
Some thoughts on the paper as a researcher in space-efficient data structures:

When we design compressed data structures, we generally assume that the data
has been generated by some statistical or combinatorial model. Then we design
the encoding for the data and the data structure itself, assuming that model.
There are often principled ways to achieve that.

We also need an index to find the correct block of data to decompress. This is
where we often have to resort to ugly hacks. The constant-time select
structure for bitvectors is a good example. To find the i-th 1-bit, we split
the bitvector into blocks and superblocks of a certain number of 1-bits. Then we
have different ways to encode the blocks and superblocks, depending on whether
they are dense or sparse. There are lots of special cases, but the index still
often ends up wasting space for redundant information. We could adapt to the
data better by adding even more special cases, but that quickly becomes
impractical.
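
For concreteness, a stripped-down cousin of that select structure just samples
the position of every k-th 1-bit and scans from the nearest sample; the real
constant-time structures replace the scan with the dense/sparse block encodings
described above. A toy sketch, not the actual structure:

    class SampledSelect:
        """Toy select index: store the position of every k-th 1-bit and scan
        from the nearest sample."""

        def __init__(self, bits, k=64):
            self.bits = bits
            self.k = k
            self.samples = []  # positions of the 0th, k-th, 2k-th, ... 1-bits
            ones = 0
            for pos, bit in enumerate(bits):
                if bit:
                    if ones % k == 0:
                        self.samples.append(pos)
                    ones += 1

        def select(self, i):
            """Return the position of the i-th 1-bit (0-based)."""
            ones = i - (i % self.k)        # rank at the sampled position
            pos = self.samples[i // self.k]
            while True:
                if self.bits[pos]:
                    if ones == i:
                        return pos
                    ones += 1
                pos += 1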

The key idea in this paper is that data structures often contain components
that are learnable. For example, B-trees and rank/select indexes are
essentially increasing functions, which should be easy to learn. We can afford
to have more special cases if we don't hard-code them but handle them as data.
This in turn can make the indexes smaller and faster.

The drawback is index construction, which requires a lot of resources. (This
is a common problem with sophisticated data structures.) Spending minutes to
index 200 million elements can be justified in some applications, but it's way
too much in other applications.

------
anonacct37
This seems interesting, but to me there is a flaw near the beginning. They
state that a B-tree assumes a worst-case distribution. That's a feature. Much
better than a "this will be fast, if you're lucky" distribution.

But who knows, maybe for read heavy analytical workloads this will be an
interesting way of improving performance or reducing space usage.

~~~
Donald
This is the exact point of view they are rejecting. You want spectacular
average-case performance at the cost of a slow but not catastrophic worst-
case.

~~~
anonacct37
So basically suitable for batch mode only? There's really no other situation
in software where average is a useful measure.

~~~
willchang
How about searching the web? I'd rather have most queries take 1 second, and
10% take 10 seconds, than have every query take 5 seconds.

I don't understand how you can be so confident about attaching a utility
function to latency.

------
posterboy
This seems like a rehash of hashing. Sorry for the pun. Neural networks are,
in essence, just really good at compression.

~~~
noelwelsh
If you want to view it that way, it is data dependent hashing. That is the
innovation: the mapping from data to hash is learned from the distribution of
the data.
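
If I read the paper's hash-index variant right, that is more or less literally
what they do: scale a learned CDF of the key distribution to the table size and
use it as the hash function. A toy sketch (cdf_model stands in for whatever
model you fit; it should return an estimate of P(X <= key)):

    def learned_hash(key, cdf_model, num_buckets):
        # Scaling the estimated CDF by the bucket count spreads keys according
        # to their actual distribution, which reduces collisions compared to a
        # distribution-oblivious hash -- assuming the model fits the data well.
        return min(num_buckets - 1, int(cdf_model(key) * num_buckets))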

~~~
eternalban
Perfect Hashing was circa 1984.

~~~
noelwelsh
Perfect hashing provably has size that is O(n) in the data being indexed. You
can think of this as approximate perfect hashing. You trade off a bit of
accuracy for a lot of compression.

~~~
eternalban
Thanks. That is what I gathered as well. My OP was merely to note relevant
existing results.

------
vadimberman
Sorry to be a Debbie Downer, but what's the point?

It's not that the performance of the existing indices is bad. It's the fact
that they are often ignored by the more complex operations even when the
column is indexed.

I'd be more curious to see that addressed instead.

~~~
goialoq
the performance of the existing indices _is_ bad in many cases.

~~~
vadimberman
With the reason being that they are ignored.

Not when they are used.

------
jmcminis
As it says in the paper, this might be useful for data warehouses. But it's
not coming to Postgres anytime soon. Index updates on the order of seconds to
minutes would be too much for a transactional DB.

There is also the cold start problem. How do you start to lay out the data on
disk as you begin inserting it? Do you have a pre-trained net and use it at
first (inserting where the net thinks the data should be)? The strategy
probably differs by index type.

~~~
thesz
Most current storage backends have a log-structured merge tree (LSMT)
implementation or something like that.

The larger layers of an LSMT are enormous and should be accessed and rebuilt
as rarely as possible.

Being able to predict whether a given element exists in the larger layers at
all is quite a bonus. You can skip reading megabytes of data.
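
Something like the paper's learned Bloom filter fits here: a classifier answers
most "is this key in the big layer at all?" queries, and a small conventional
filter catches the keys the classifier misses, so real keys never produce false
negatives. A rough sketch -- the model and backup filter interfaces
(predict/add/contains) are assumptions, not any particular library:

    class LearnedExistenceFilter:
        def __init__(self, keys, model, backup_filter, threshold=0.5):
            # `model.predict(key)` -> score in [0, 1]; `backup_filter` is any
            # small set-like filter (a real Bloom filter in practice).
            self.model = model
            self.threshold = threshold
            self.backup = backup_filter
            for key in keys:
                if model.predict(key) < threshold:
                    self.backup.add(key)   # keys the model would miss

        def might_contain(self, key):
            if self.model.predict(key) >= self.threshold:
                return True                # model says "probably present"
            return key in self.backup      # no false negatives for real keys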

The rarity of rebuilding the larger layers justifies training a deep neural
model for them.

I cannot verify the existence of an LSMT backend for the major SQL DB engines,
but NoSQL engines use it plenty:
https://en.wikipedia.org/wiki/Log-structured_merge-tree

~~~
jmcminis
So you want an LSM for inserts and the DNN for reads? Seems OK. You still have
to update/retrain the DNN after an insert into a larger layer, which will be
expensive. So you'd probably get high latency at the 99th percentile (or some
similarly high percentile).

~~~
thesz
There are no inserts into larger layers, only merges. Those are long (usually
processed in the background by a separate thread), and that length justifies
training a new net in parallel with the merge process.

------
djhworld
Can someone please explain this to me

> For efficiency reasons it is common not to index every single key of the
> sorted records, rather only the key of every n-th record, i.e., the first
> key of a page. [1] This helps to significantly reduce the number of keys the
> index has to store without any significant performance penalty

Is this true of B-Tree indexes in Postgres, for example?

~~~
_wmd
This only applies to indexes over a pre-sorted structure. For unordered data,
PG has something vaguely similar in the form of BRIN indexes, but really
they're totally different tricks.
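
For the pre-sorted case, the trick the quote describes is roughly this (toy
sketch, arbitrary page size):

    import bisect

    def build_sparse_index(sorted_keys, page_size):
        # Keep only the first key of every page.
        return [sorted_keys[i] for i in range(0, len(sorted_keys), page_size)]

    def lookup(sorted_keys, sparse_index, page_size, key):
        # Find the last page whose first key is <= key, then scan that page.
        page = bisect.bisect_right(sparse_index, key) - 1
        if page < 0:
            return None
        start = page * page_size
        for i in range(start, min(start + page_size, len(sorted_keys))):
            if sorted_keys[i] == key:
                return i
        return None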

------
AlexCoventry
Not complaining, but curious about why this link took off and mine didn't.
([https://news.ycombinator.com/item?id=15892476](https://news.ycombinator.com/item?id=15892476))

Did adding the "Data Storage and Retrieval as an ML Task" synopsis make it
less appealing?

~~~
Kinrany
Show HN: (2018) Using Learned Index Structures to optimize HN titles.

~~~
Kinrany
(Sorry if that doesn't make any sense, I haven't actually read the link yet.)

------
oh-kumudo
Can anyone with a research/professional background in DB shed some light about
the significance/implications behind this paper?

~~~
power
Normal DB indexes are designed to be general-purpose so that they give
predictable performance no matter the data you insert. The paper basically
describes a more efficient way to look up data by using an index that's
tailored to the specific dataset it's built over. It's expected that you'd be
able to get a performance increase by doing this. In more detail, they use
neural nets to learn an approximation of the distribution of the data and use
that to look up the rough position of each key faster than usual. They still
use traditional structures for the "last mile", which is not so easily learned.
You could accomplish the same thing without NNs by using anything that can
approximate the distribution of the data. E.g. a histogram would work for some
cases, and you could do some PCA and normalization first to deal with more
cases. NNs have the advantage that they can learn more complex distributions.
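
Roughly, in code (a toy sketch: the "model" here is plain linear interpolation
between the smallest and largest key, where the paper uses a staged hierarchy
of neural nets, but the shape of the lookup is the same):

    import bisect

    class LearnedRangeIndex:
        def __init__(self, sorted_keys):
            self.keys = sorted_keys
            lo, hi, n = sorted_keys[0], sorted_keys[-1], len(sorted_keys)
            self.lo = lo
            self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
            # Worst-case prediction error over the actual data bounds the
            # "last-mile" search below.
            self.max_err = max(abs(self._predict(k) - i)
                               for i, k in enumerate(sorted_keys))

        def _predict(self, key):
            # Approximate position = scaled estimate of the key's CDF value.
            return int(self.slope * (key - self.lo))

        def lookup(self, key):
            pos = min(max(self._predict(key), 0), len(self.keys) - 1)
            left = max(0, pos - self.max_err)
            right = min(len(self.keys), pos + self.max_err + 1)
            # Traditional "last mile": binary search, but only inside the
            # model's error window.
            i = bisect.bisect_left(self.keys, key, left, right)
            return i if i < len(self.keys) and self.keys[i] == key else None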

~~~
jacksmith21006
The other advantage is more opportunity to parallelize the work, especially
with new silicon architectures like the Google TPUs. That part is an important
element of the equation.

------
Upvoter33
This is actually pretty cool - it makes sense to figure out ways to apply ML
to core infrastructure in storage and databases. One wonders what other parts
of such systems can be ML-optimized...?

~~~
perfmode
Any place where heuristics and arbitrary constants are used.

------
harveywi
Interesting idea. It seems like this technique could be applicable to
bioinformatics, especially sequence assembly.

------
nabla9
Also known as Skynet-index

------
w_t_payne
OK. In a way, this is sort of terrifying.

~~~
tw1010
Meh. That's only because of the association between AI and disaster movies. If
you just think about it as math, it's no more (or less) scary than what came
before it.

~~~
Houshalter
The invention of atomic weapons was "just math".

~~~
backpropaganda
The invention of atomic weapons was physics, a precise description of the
fundamentals of our universe. Results from Riemannian geometry are not
dangerous, but the statement "our universe's spacetime obeys Riemannian
geometry" is.

~~~
MaxBarraclough
We can play a similar game with nuclear weapons, no?

The maths used to discover nuclear weapons weren't dangerous. What was
dangerous was that the maths happened to align with our physical world.

~~~
moomin
Actually, neither was that dangerous. For it to be dangerous took years of
engineering work.

~~~
Houshalter
What does it matter if it's "just physics" or "just engineering" or "just
math"? Anything you can say about AI, you can say about the atomic bomb. AI
involves quite a bit of engineering as well. This argument is silly.

At the end of the day we are still building a powerful new technology. To
assume it will be completely harmless is naive. No technology is completely
harmless. But at least our previous inventions didn't sprout minds of their
own, with desires and plans independent of their creators.

