Hacker News new | past | comments | ask | show | jobs | submit login
PGM Indexes: Learned indexes that match B-tree performance with 83x less space (unipi.it)
583 points by hbrundage 86 days ago | hide | past | favorite | 123 comments

Some context: A few years ago, there was a paper from Google (https://dl.acm.org/doi/10.1145/3183713.3196909) that made learned data structures popular for a while. They started from the idea that indexes such as B-trees approximate an increasing function with one-sided error. By using that perspective and allowing two-sided error, they were able to make the index very small (and consequently quite fast).

Many data structure researchers got interested in the idea and developed a number of improvements. The PGM-index is one of those. Its main idea is to use piecewise linear approximations (that can be built in a single quick pass over the data) instead of the machine learning black box the Google paper was using.

Hi Jouni!

You may find interesting these other papers of ours:

- The ALENEX21 paper "A 'learned' approach to quicken and compress rank/select dictionaries" (http://pages.di.unipi.it/vinciguerra/publication/learned-ran..., https://github.com/gvinciguerra/la_vector), where we introduce a compressed bitvector supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures.

- The ICML20 paper "Why are learned indexes so effective?" (http://pages.di.unipi.it/vinciguerra/publication/learned-ind...) where we prove that, under some general assumptions on the input data, the space of the PGM-index is actually O(n/B^2) whp (versus Θ(n/B) of classic B-trees).

Only having skimmed the work, read the following as a somewhat educated guess.

I think one could see this similar to repeated applications of interpolation search. If you are looking for x in a sorted array of n numbers between a and b, then index (x - a) / (b - a) * (n - 1) would be a good guess assuming uniform distribution of the numbers.

But as one can not assume a uniform distribution in general, one does that repeatedly. The first interpolation leads to better interpolation coefficients for the relevant subrange, which may lead to even better interpolation coefficient for an even smaller subrange until one eventually finds what one was looking for.

If there is no structure in the data that can be exploited, this degenerates into a more or less ordinary tree as we one certainly fit a line through two points, but if at some level a larger range of the data can be well approximated with the interpolation function, then it can save space and search time because one can get close to all the values in the range with only one set of interpolation coefficients and only one interpolation.

Isn't it sometimes better to assume data has a structure rather than having less performance but knowing it works equally well with unstructured semi-random data?

But if you are inventing or implementing a general purpose data structure, then you can not assume any structure by definition, especially not when it comes to worst case performance. On the other hand you can certainly include special handling of common special cases that will improve the performance or even come up with special data structures that only work under specific circumstances but then outperform general purpose solutions. As always it is a trade-off, here between generality and performance.

An interesting case in which Google trolled people into action by publishing a purely hypothetical paper for which there's no evidence they intended to put it into actual practice.

This is a major practical advance from the succinct data structure community.

This community has produced so many brilliant results in the past years. But, they work in the shadows. Since the rise of interest in neural network methods, I've often described their work as "machine learning where epsilon goes to 0." It's not sexy, but it is extremely useful.

For instance, Ferragina previously helped to develop the FM-index that enabled the sequence alignment algorithms used for the primary analysis of short genomic reads (100-250bp). These tools were simply transformative, because they reduced the amount of memory required to write genome mappers by orders of magnitude, allowing the construction of full-text indexes of the genome on what was then (~2009) commodity hardware.

I don't get it. I've implemented B-trees. The majority of space the used by a B-tree is the data itself. Each N-ary leaf of the tree is a basically a vector of data with maybe some bookkeeping at the ends. The leaves are more than half of the tree.

Sure, you can compress the data. But that depends on the data, completely random data can't be compress. Other data can be. But a point blank 83x space claim seems bizarre - or it's comparing to a very inefficient implementation of a B-tree.

Edit: It seems the 83x claim is a product of the HN submission. I could not find it on the page. But even the page should say something like "a compressed index that allows full speed look-up" (akin to succinct data structures) and then it would make sense.

So, from a quick read, I think there are a few things at play here that allow for "compression of random data".

One, and probably the biggest one, _this isn't lossless compression_. As other commenters mentioned, this aggregates groups of points into line segments and stores their slopes (allowing for a pre-specified error of up to epsilon).

Two, while the sample input data is randomly generated, it then needs to be sorted before it can be used here. This completely changes the distributional qualities (see: order statistics sampled from a uniform distribution [0]). Just as a toy example, suppose this was a million randomly-generated binary digits. Sure, you could store the million digits in sorted order, or you could just use run-length encoding and say "I have 499,968 zeroes and 500,032 ones" [1].

[0] https://en.wikipedia.org/wiki/Order_statistic#Order_statisti...

[1] I know, this is a dense sampling on the input space. But that's the sort of intuition that allows you to compress sorted data better than you'd be able to compress the unsorted data. The provided C++ code provides a sparse sampling.

The index does not store the data at all, it store slopes.

You start by sorting the data, make a piecewise linear interpolation, and you store each slope as triplet (key, slope, intercept) with key being the smallest value in the piecewise interpolation.

I find it quite clever to be honest.

I am not sure how it works on inserts and delete, didn't read the whole paper.

More digestible info on the slides.

I'm almost good at math. Could you enlighten me, when you say "slope", are you talking about the slope of a line as defined by "The Equation of a Straight line" [0], i.e. the "m" in "y=mx+b"?

And what do you mean by "intercept". Are you referring to the "b" in that same equation?

If that's true, in what way is that an optimization to storing "x" and "y" at each node?

[0] https://www.mathsisfun.com/equation_of_line.html

In a traditional tree you just store a bunch of y's. So let's say 7 points are just y0-y6. In a tree you would need at least 7 keys stored. In this you would just have m and b stored and you would use the formula to look up the index.

Thanks, that really helps get an intuition for how this works! It's pretty clever really.

As parent said, it is a piecewise linear interpolation, which means it is divided into line segments. Hence you don't need the parameters for the whole index, only for each segment, as the parameters are the same for each point in the same segment.

"piecewise linear interpolation"

After having linearly interpolated the points on the curve piecewise, instead of a smooth curve, you now have line segments that connect to other line segments, each point now connected to the next, not smoothly, but linearly, and each of those lines/segments can be optimized by storing only slope and intercept. Yes?

That seems both ingenious and very, very straight-forward. What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method? Has the drawback of this method been too great?

The drawback to me seems to be the amount of computation needed to produce such an index.

Hi @kreeben, the construction is very fast, it can be done in linear time with a single scan of the data. Just to give you some figures, we constructed a PGM-index on 10^12 key-value pairs (8 bytes + 8 bytes) loaded in main memory in less than 3 seconds.

About the information that we store, we prefer to say "piecewise linear approximation" because the line segments do not necessarily connect points, and they are not connected to each other. The guarantee of the piecewise linear approx is that each line segment is far from the input points by at most a user-given integer ε. With this guarantee, when we compute the approximate position p of a given query key q, we can always find the true position of q after a fast binary search in the range of positions [p-ε,p+ε].

The piecewise linear model is nothing more than a list of triples (key,slope,intercept). The key is needed because we need to know where a segment starts and where the previous one ends. The slope and the intercept are exactly as you said, the parameters that let us compute the approximate position via p = slope * q + intercept.

It is, arguably, deployed at scale, if you squint and call it delta compression. The reason its not more widely used is there are drawbacks. Updates can be more expensive, for instance: what was a single-key change perhaps blew up a 1000 key "line" and now all those keys need reshuffling. You also see more muted returns because generally keys in indexes point to something, so compressing the keys creates complexity in how you map each key to its value or reference. Compared to a solution that stores keys close to their values, a solution that packs keys together gets poor cache locality for resolving the values, and may need mapping tables.

Basically, delta compression is used all over the place, but it isnt a silver bullet.

> What is the reason behind us not having deployed massive amounts of search indices into the world that were built using this method?

There are a lot of other strategies for segmented data indexes. Many of which are roughly equivalent to this in practice. For example, using a computed strategy for distributing data among various nodes on a server. In such a situation, a master node can run a computation on an indexed value to identify the node(s) in the cluster which contain the data and send a query request to only the nodes containing that information.

I would think the biggest drawback to this particular strategy is, often times, you need to analyze the data in the index to get an answer anyway. So it's not really beneficial to throw away the index data. For example, if the indexed field is a timestamp, and the result set is often sorted on that field, then it makes total sense to keep the index values and use that to perform sorting while a separate thread fetches the data needed to flesh out the query.

If there's even a rough order to the underlying data, I'll buy their claim. On ordered data, a Postgres block-range index (BRIN) is often several orders of magnitude smaller than a B-tree index.

If the data is random, I suspect you're right and the PGM index is no-better than a B-tree index. Most data does have an order and would probably see similar gains.

Not only is there a rough order, they explicitly require the ability to meaningfully embed data into the reals, and the performance gains come from assuming those embeddings have simple delta distributions. I wouldn't be surprised if the technique is worse than useless when that assumption is violated.

Edit: I don't have time right now, but a toy example I like to throw at these kinds of problems is mapping primes to their indices (e.g. 2->0, 3->1, 5->2, ...). General-purpose learning algorithms can usually make a little headway with it, but not much, and only with substantial resources thrown at the problem. I'd be shocked if that toy example were any faster with their solution than a b-tree.

Due to the prime number theorem, your toy example actually has a very good approximated mapping.

Kind of. Appropriately normalizing the primes with some kind of function based on log(log(n)) would probably allow the article's technique to perform well (still struggling on larger primes -- you can't avoid needing a lot of bits to express those gaps), but it's precisely the shifting density which makes the problem hard to learn with any standard algorithm, and I don't think theirs would be an exception since they explicitly rely on gaps drawn from some fixed distribution.

I discovered BRIN indexes fairly recently. You trade odd a bit of speed (data & query dependant, of course), but potentially gain a huge amount of disk space vs B-Trees - the first time I created a BRIN index, I did a double take because I thought something must be wrong because the index was so small!

The size comparison here only concerns the size of index. That is the size of the data is excluded.

Parent means the size of the key data present in the index, I suspect. In a B-tree every byte of every key value is present.

They do it by not storing keys in the index. A B-tree has copies of all the keys in the tree, and (normally) also in the data. Here, they just have slopes that get you close to the right actual element.

> completely random data can't be compress

Given that a normal B-Tree can't retrieve the original order, it must be more compressible than random data and a representation that would let you represent a wrong order has invalid sequences, so it must be less space efficient than one that would use those sequences to mean something valid.

The paper says 83x, and even makes stronger claims that are fairly unqualified:

In short, the experimental achievements of the PGM-index are: (i) better space occupancy than the FITing-tree by up to 75% and than the CSS-tree by a factor 83×, with the same or better query time; (ii) uniform improvement of the performance of RMI in terms of query time and space occupancy, and 15× faster construction, while requiring no hyperparameter tuning; (iii) better query and update time than a B+-tree by up to 71% in various dynamic workloads while reducing its space occupancy by four orders of magnitude (from gigabytes to few megabytes).

I wonder why it was just four orders of magnitude, though. Why not six or twelve?

Have you examined the results presented in section 7? They take pains to qualify these claims.

If you are curious, the authors have provided completely reproducible experiments.

I didn't say that these specific claims are simply false. Just that they sure seem to hint at extraordinary improvements in a broad range of cases. The website itself also seems to me to be suggestive in just the same way, only much more so.

I myself don't think that that amounts to taking pains to qualify their claims. Other people will have to make their own minds up about whether or not that's the case.

Their slides https://pgm.di.unipi.it/slides-pgm-index-vldb.pdf about PGM index, page 21.

They stop at page size of 1024 bytes - that indicates they are tested in-memory situation. And, which is worse, their compression ratio advantage almost halves when block size is doubled. Thus, what about B-tree with blocks of 16K or even 256K?

Also, what about log-structured merge trees where bigger levels can use bigger pages and, which is quite important, these bigger levels can be constructed using (partial) data scan. These bigger levels can (and should) be immutable, which enables simple byte slicing of keys and RLE compression.

So, where's a comparison with more or less contemporary data structures and algorithms? Why beat half a century old data structure using settings of said data structure that favors your approach?

My former colleague once said "give your baseline some love and it will surprise you". I see no love for B-trees in the PGM work.

Hi @thesz!

The experiment you are referring to is done in main memory with an optimised in-memory B+tree implementation. We didn't plot the performance for larger page sizes because in our machine they performed poorly, as you can already see from the configuration with 1024-byte pages. So we're not favouring our approach at all.

Note also that next-gen memories have smaller and smaller access granularities. For example, the Intel's Optane DC Persistent Memory accesses blocks of 256 bytes, while the Intel's Optane DC SSD accesses blocks of 4 KB. I guess that data structures with blocks of 16K-256K are disproportionate in these cases.

About LSM-trees, nothing prevents you to use a PGM-index (which you can construct during the compaction of levels, thus without scanning data twice) to speed up the search on a long immutable level. Or also, to use a PGM-index on data which is organised into RLE-compressed disk pages ;)

These blocks of 256 bytes most probably are stored in wear-leveling database of some sort hidden inside NVME. These databases are often LSM-tree-based. So, writing larger blocks still has benefits, especially when you use compression.

If you think you only need 256 byte pages, average price for 10G hard disk drive is ~$300 [1] and average price for 2G SSD drive is also ~$300 [2].

[1] https://pcpartpicker.com/trends/internal-hard-drive/?__cf_ch...

[2] https://pcpartpicker.com/trends/internal-hard-drive/?__cf_ch...

Five times price/Gb difference. If you need large storage, you need hard disks.

B-trees are good with hard disks, is PGM index good with them too?

From a Big-Oh point of view, the answer is a big yes. No matter the memory technology or the disk page size, be it 256B or 16KB, the PGM-index can scale as B-trees or even better (see my comment here https://news.ycombinator.com/item?id=25901889).

Can you provide us with (preferably drop-in) replacement of LMDB as a proof?

Because your big-O looks like big-O of cache-oblivious algorithm and I saw no proof of that.

Are you aware of any contemporary on-disk or in memory DB system that typically uses a page size of 4KiB or less? And if not, are you expecting that to become a trend in the future?

Hello everyone. I'm Giorgio, the co-author of the PGM-index paper together with Paolo Ferragina.

First of all, I'd like to thank @hbrundage for sharing our work here and also all those interested in it. I'll do my best to answer any doubt in this thread.

Also, I'd like to mention two other related papers:

- "Why are learned indexes so effective?" presented at ICML 20, and co-authored with Paolo Ferragina and Fabrizio Lillo.

PDF, slides and video: http://pages.di.unipi.it/vinciguerra/publication/learned-ind...

TL;DR: In the VLDB 20 paper, we proved a (rather pessimistic) statement that "the PGM-index has the same worst-case query and space bounds of B-trees". Here, we show that actually, under some general assumptions on the input data, the PGM-index improves the space bounds of B-trees from O(n/B) to O(n/B^2) with high probability, where B is the disk page size.

- "A 'learned' approach to quicken and compress rank/select dictionaries" presented at ALENEX 21, and co-authored with Antonio Boffa and Paolo Ferragina.

PDF and code: http://pages.di.unipi.it/vinciguerra/publication/learned-ran...

TL;DR: You can use piecewise linear approximations to compress not only the index but the data too! We present a compressed bitvector/container supporting efficient rank and select queries, which is competitive with several well-established implementations of succinct data structures.

I enjoyed glossing over the paper (will read it properly after work), but it was not immediately obvious to me how to implement this for strings.

I'm no expert in this area, so this might have an obvious answer.

I mean I guess you could treat characters as base 2^32 or something like that and convert a string to a real that way, but often strings have a non-trivial sort orders.

You are right, variable-length strings are difficult. You could try to pack as many characters as possible in a computer word (or in a big int data type), say P characters, and then use the PGM-index to find the strings that share a prefix of P chars with the given query string. I discussed this solution in a GitHub issue (https://github.com/gvinciguerra/PGM-index/issues/8#issuecomm...). It may work in practice, but it's far from being adequate if compared to trie data structures (and their recent advancements).

Hi Giorgio, thanks for sharing!

You mentioned that tries have had recent advancements, could you please point me to a paper about these advancements so I can learn more? I did a quick Google search but wasn't able to find anything that seemed relevant.

Hi @Gh0stRAT, you are very welcome! For prefix search on strings, I recommend the classic String B-tree paper (https://dl.acm.org/doi/10.1145/301970.301973). Among recent results, there's the c-trie++ paper (https://arxiv.org/pdf/1904.07467.pdf) and the papers mentioned in their Related Work section.

If you do not need to do similarity searches, indexing over the (fixed size and numeric, possibly cryptographic) hash of the string might work.

edit: but that will map to an uniform range, making interpolation trivial. As this solution can be used for any type, I'm probably missing something. The index discussed in the paper can probably be used for range queries (which hashing would prevent), not just punctual searches.

edit: bah, I'm just reinventing an hash map.

Can PGM index and linear approximation models in general be applied to clustered indexes, where actual data of variable size are stored in the index along with the keys?

Yep, in that case you could use an indirection vector containing, for each key k, the offset to the first byte of k. This is what is typically done in B-trees, where the indirection vector is stored in the header of a disk page. It's described for example in Section 3.3 "Variable-length records" of Goetz Graefe's "Modern B-Tree Techniques".

Thank you for your work!

Are there current efforts in your research going in mainstream RDBMS (say postgres)?

The space improvements are so great columns could just be indexed by default.

Thank you so much for your interest, BenoitP!

Right now I'm focusing more on the design of compressed data structures. RDBMS are complex systems, and gaining sufficient knowledge of their internals would require several months of work. Though, it would wonderful for me to collaborate with some RDBMS engineers to integrate my current research efforts in their system.

Actually, some time ago, we asked a bachelor's student at the University of Pisa to integrate the PGM-index in Redis (which is simpler than an RDBMS). He did it, and the results were really promising, -3x overall memory usage with respect to Redis ZSETs.

Does Savitch's theorem apply here? If yes then you can have a strict bound in section 7.1.

Did I say something wrong?

Not at all @virattara ;)

I don't have an answer right now, I just need to think about it more deeply

ok ok

The slides: https://pgm.di.unipi.it/slides-pgm-index-vldb.pdf

It seems they are only talking about compressing the index (keys) not the values.

Also, the slides seem to imply the keys need to be set in sorted order? That way their memory locations will be in increasing order too. That’s quite an important limitation, that means the index is read-only in practice once populated. Though it may still be useful in some cases.

Did I misunderstand?

Hi @zupa-hu!

In the paper we focused on indexing keys, as the compression of keys and values is an orthogonal problem. For example, you can compress disk pages containing keys and values with a method of your choice, and then use the PGM-index to efficiently locate the page containing the query key.

For what concerns insertion and deletions (in non-sorted order), they are discussed in Section 3 "Dynamic PGM-index" and experimented in Section 7.3. The implementation is available at https://github.com/gvinciguerra/PGM-index/blob/master/includ... and the documentation at https://pgm.di.unipi.it/docs/cpp-reference/#classpgm_1_1_dyn....

Hi @gvinciguerra, thanks for your reply!

I mainly stated your solution is about the size of the index and not the values because others comments at the time questioned if it's possible at all. I wanted to add my bits trying to decipher what this is. :)

I have implemented a custom index myself so I'm really interested in this stuff but I have to admit I don't understand the language used to explain it. I'm not sure who the audience is. If it's not only targeted at researchers but also random programmers out there, maybe consider adding either an ELI5 translation to things or more proof that it's worth investing time in this (eg. learning your lingo). For example, by adding performance charts. I totally missed that.

Wow, looking at it again, I now see you have some charts at the bottom of your slides. I'm pretty sure 99% of your visitors did not find it. And even so, I'm not quite sure I understand those charts.

My 2 cents:

1) Simplify them. Don't pack so much info in a single chart. This prevents me from getting to the AHA moment. Maybe show them on separate tabs.

2) Move this chart/these charts to the top of your home page. I knew this is about an index before landing on your site, so when I landed, I had exactly 1 question in my mind: Should I care? Your homepages did not answer that. I found links to papers which I don't even open as I believe they are huge time investment and I want to first figure out if I should invest more time. I was happy about the slides because slides are about selling the big ideas. It was better than the home page but in the end I left not having my only question answered. I saved the link. I will probably remember to look at it again when I have special requirements for an index. But I'm not motivated to invest more time right now, I'll keep reading my book in the mornings instead of your paper.

(I'm sharing this so you can improve the site if you want.)

Of course it begs the question: if the keys are sorted, what do we need an index for? A simple halfing method would trivially do it then with btree like performance and infinitely better index size (0).

Maybe they may have made an improvement here trading some space for even better lookup times? In that case, the 83x space over btree indexes is certainly possible - given that infinite improvement is possible too.

Binary search is quite slow on modern hardware, particularly for this use, where you would need to fault in a page for each probe. With a billion records that is 30 probes. They get much better than log2(n).

This is a lot like how you look up words in the dictionary. It is roughly radix-like, but the closer you get, the better the fit. If you are looking up a word that starts with S, you adjust to how broad S is as you home in.

This is on in memory data. So binary search would seem reasonable except that you can't do inserts, updates, or deletes efficiently in an ordered array. That inevitably leads to using a btree or trie structure.

"In-memory" doesn't mean so much as it once did. Faulting a page into cache from RAM takes hundreds or even thousands of cycles. Same for faulting an index page, if the whole index doesn't fit in cache.

A B-tree replaces a log2(N) binary search of the whole range into a K-ary search, with log-base-K(N) probes; but adds searches through the K keys per tree node, which all have to be brought into cache.

Even once the K keys have been brought into cache, a binary search in the index page is quite slow on modern hardware because the iterations involve randomly mispredicted branches.

A great advantage of PGM indexes seems to be that the whole index can be held in (L3) cache. Faulting a line from L3 to L1 cache takes only ~30 cycles. Once the index has been walked, you are close to the desired record and can walk forward or back a few steps to locate it.

If you have to handle insertions, it is often better to keep an overflow table searched first (or last) so you can batch-rebuild during downtime. Deletions may be handled by marking dead entries, and updates are easy. Most multigigabyte tables don't see much churn, relative to their size.

Thanks, that's a very good point!

Only watched the video, was disappointed by https://youtu.be/gCKJ29RaggU?t=408 , where they are comparing against tiny b*tree page sizes that nothing uses any more - 4k, 16k and 64k are way more common

Hi @jabberwcky! The plot refers to a B+tree implementation optimised for main-memory (https://panthema.net/2007/stx-btree/). We didn't show the performance for larger/smaller page sizes because in our machine they performed poorly. Indeed, you can see from the figure that the fastest B+tree configuration had page size set to 512 bytes. The one using 1024-byte pages is already much slower, that's why we didn't clutter the plot with page sizes larger than 1k ;)

This sounds like a great advancement, however an implementation in RDBMS products may be some way away yet - MSSQL uses 8KB pages by default, and I believe (without checking) that most other RDBMSes use at least 4KB. B+ tree index implementations on RDBMS products may be here to stay for a while yet unless these performance issues can be minimised, or unless there is a paradigm shift to use smaller pages - which would have a knock-on effect to query performance unrelated to indexes, such as number of page lookups, increased I/O for non-contiguous page reads...

Thank you @mr_gibbins.

I think an implementation in RDBMS is worth exploring. After all, the page size of the PGM-index can be tuned to match the one of the media you are storing your data to (or to match the default setting of the RDBMS).

About the paradigm shift to use smaller pages, it's worth mentioning technological innovations like the Intel's Optane DC Persistent Memory, in which the access granularity is 256 bytes. I expect that similar products will be available in the near future and that new/updated RDBMS implementations will take advantage of them.

I assumed that a bigger page size would incur a worse query performance. You can already see the trend in the figure. So the index size comparison is based on the b+-tree which has a similar query performance with the proposed learned index.

AFAIK sqlite's default page size is 4k

We produced a detailed comparison of such "fitting" and "learning" techniques, available here: https://vldb.org/pvldb/vol14/p1-marcus.pdf

(Thomas Neumann, one of authors of the blog post, is a co-author of the linked paper)

How would one (very roughly) approximate what this index does in terms of big-O notation for time and space? Is it the same as a b-tree in time but with linearly less space?

The paper submitted to VLDB [1] has a table (Table 1) which lists the time complexity for the PGM Index and compares it with a Sorted Array, a B-Tree and another type of Data Aware/Learned Index - FITing-tree

[1] http://www.vldb.org/pvldb/vol13/p1162-ferragina.pdf

Hi @etaioinshrdlu!

The worst-case bounds are discussed in *Section 2.2* and *Theorem 1*. Essentially, we have the following bounds:

Query: O(log_c(m) log_2(ε/B)) I/Os Space of the index: O(m)

where: n = number of input keys B = disk page size ε = user-given maximum error of the piecewise linear approximation (determines how many keys you need to search at each level) m = number of segments in the piecewise linear approximation c = fan out of the data structure (differently from standard B-trees it is not fixed, and it is potentially large)

Intuitively, the query complexity comes from the fact that the PGM-index has O(log_c(m)) levels, and at each level you do a binary search that costs O(log_2(ε/B)) I/Os.

Note that m and c depend on the "linearity" of the given input data. For example, if the input data can be approximated by a few segments, i.e. if m=O(1), and you choose ε=Θ(B), then the PGM-index takes O(1) space and answer queries in O(1) I/Os!

In general, you can remove the dependence from m and c if you can prove a lower bound on the length of a segment (i.e. the number of keys it "covers"), irrespective of the input data. We proved that the length of a single segment is at least (thus c≥2ε), or equivalently, that the number of segments m is upper bounded by n/(2ε) [Lemma 2, the proof is very straightforward]. Again, if you choose ε=Θ(B), then you have the following (rather pessimistic) worst-case bounds:

Query: O(log_B(n)) I/Os Space of the index: O(n/B)

Basically, these bounds tell you that the PGM-index is *never* worse in time and in space complexity than a B-tree!


However, in our experiments, the performance of the PGM-index was better than what the above bounds show, and this motivated us to study what happens when you make some (general) assumptions on the input data. The results of this study are in the ICML20 paper "Why are learned indexes so effective?" (http://pages.di.unipi.it/vinciguerra/publication/learned-ind...).

We found that, if you assume that the gaps between input sorted keys are taken from a distribution with finite mean and variance, then you can prove (Corollary 2 of the ICML20 paper) that the space of the PGM-index is actually O(n/B^2) whp (versus Θ(n/B) of classic B-trees).

Note that the result applies to *any* distribution, as long as the mean and variance of the RVs modelling the gaps are finite. Indeed, we specialised our main result to some well-known distributions, such as Uniform, Lognormal, Pareto, Exponential, and Gamma (Corollary 1 of the paper).

For a detailed study of learned indexes, see this work: https://vldb.org/pvldb/vol14/p1-marcus.pdf

All code is available in open source: https://github.com/learnedsystems/SOSD

I've already thought about the idea of making statistics to optimize access time, so I guess this a viable implementation to do it correctly.

That's pretty amazing... I can somehow imagine this tech landing on every modern computer, allowing users to search for anything that is on their machine.

Many devs are probably familiar with perfect hashes as the gperf tool seems omnipresent on Linux machines. Is this a related concept? The learning part makes me suspect so but the slopes and interpolation part makes me doubt it.

This is interesting. Could this be adapted to store 2D data, like how a quadtree is a 2D range tree? (If you link me to a paper / pseudocode for that, I could implement it.) I imagine it would be useful in GIS, gaming, etc.

Check out work by Jialin Ding and Vikram Nathan, they both work on multi-dimensional learned index structures.


We index geospatial data using a learned index in this work (cf. Section 3): http://cidrdb.org/cidr2021/papers/cidr2021_paper19.pdf

Code: https://github.com/learnedsystems/RadixSpline

Hi @crazypython and thank you! Yep, I just added an implementation of the multidimensional PGM-index in the main repo. If you want to improve it, you are more than welcome. Drop me an email if you have some ideas. Thanks again!

The multidimensional version definitely looks exciting! Do you have any benchmarks for it yet? And will you be publishing a paper on it?

Maybe space filling curve 2D to !D map gets you most of the way there?

Also, a learned index from Microsoft: https://github.com/microsoft/ALEX

In the example they sort the data array. Does that mean this works just on sorted arrays? Insert and delete performance would be horrible I guess.

Hi @legulere!

Yep, the example of Figure 2 shows only a static PGM-index on a sorted array.

Insertion and deletions are discussed in Section 3 "Dynamic PGM-index" and experimented in Section 7.3.

The Dynamic PGM-index is open-source too: you can find the implementation at https://github.com/gvinciguerra/PGM-index/blob/master/includ... and the documentation at https://pgm.di.unipi.it/docs/cpp-reference/#classpgm_1_1_dyn...

They propose a solution for dynamic PGM indexes in the paper (section 3) and benchmark it (section 6). A summary is that, in their benchmark, their index is faster by 13%-71% in most cases, but can be slower (1%-15.2%) in a few cases.

I agree the example would be more eye-catching without that sort.


In the full paper they quote a rather interesting method [1] that allows you to insert values in amortized O(log(n)) time (deletes are apparently handled with tombstones, presumably rebuilding the whole thing when a sufficiently large proportion is deleted).

A very abridged explanation of how they handle inserts: you split the collection in a list of collections where position k contains either nothing or a collection size 2^k. When you want to add a new value you find the first empty spot and fill it by building a set of your new value together with the collections of all the preceding spots (because the sizes are all sequential powers of two this will fit exactly). Provided that merging the collections takes linear time this takes an amortized O(log(n)) per inserted item.

Of course once you have this you can use it for any learned index that can be learned in linear time.

[1]: M. H. Overmars. The Design of Dynamic Data Structures, volume 156 of Lecture Notes in Computer Science. Springer, 1983.

Skimming the paper it appears so. B-trees also require sorted data.

B-trees don't need sorted data. The B-tree insertion algorithm performs the sorting for you.

B-trees need sortable data; GP is not saying that the data has to be pre-sorted, but that it can't be unorderable or otherwise uncomparable.

Isn’t sorting required for range based indexes?

Not necessarily. If you have indexing structures for data types that do not have a total order, only a partial order, you can store and do an indexed range search on data types that do have a total order. The primary implication is that the output of the range search will not reflect the total order in the way it would for a traditional B+Tree.

Same question. Can you give me an example of an indexing scheme that works on non-sortable data? I can’t think of any.

The canonical example is indexing rectangles. They have no total order. It is far from the only example. Any data type where equality and intersection are not equivalent test functions will effectively be non-sortable.

There are many indexing schemes for data with these properties. They focus on topological relationships rather than order relationships.

Do you know of any blog/papers which talks about this - using topology for such interval data types.

Any Postgres implementation of this yet?

Not likely. I have done a lot of work on B-Tree indexing in Postgres in the past several years, and this is all Greek to me.

It looks like it found its way into Google Bigtable though: https://arxiv.org/abs/2012.12501

Could this be used for/generalized for Nd spatial proximity lookup tables?

Yep, I'm working on a multidimensional version that I hope to upload to the main repo (https://github.com/gvinciguerra/PGM-index) in a few weeks.

How does it compare to RTree?

Neat :)

Yeah I was curious about using it for raytracing, maybe like a kd-tree where you consider one dimension per level.

Hi @magicalhippo and @midjji. Please, have a look at the main repo, I just uploaded an implementation of the multidimensional PGM-index supporting orthogonal range searches ;)

Awesome, thanks for the follow-up!

Any chance it will make its way to Postgres?

There are a lot of fixes PostgreSQL can do before this exotic one. Starting from ZHeap etc etc.

Can custom index types be packaged into an extension, or would implementing this require deeper integration?

ZHeap a fix ? A different data management option, yes. A fix, no.

In the sense of lowering per-row-overhead, supposed to be faster on commit and slower on rollback(usual workloads).

Reminds me of TokuDB. What happened to it?

It failed to attract customers and was sold at a fire sale. (I was an investor.)

I was a happy customer!

That makes you part of a very exclusive club :-(

I've only heard of B-trees in passing. In what kinds of situations are these data structures used?

Databases. If you want to be able to quickly do a lot of useful operations on large amounts of data, B-trees and their variants (B+ trees) are the way to go. Using a B-tree, you can find an entry, sort and do range queries by key, and inserts and deletes are fast.

And here I was thinking probabilistic graphical models were finally getting the spotlight :)

I wonder if databases and/or browsers will make use of it

They should have chosen another name, the acronym PGM already stands for Probabilistic Graphical Model and they overlap in possible usages.

This type of comment is pretty common, but never adds to the discussion. Some things are going to have similar names, and usually it just doesn’t matter.

Take Rust the game and Rust the programming language. How often do people confuse them? I’ve never seen it happen. Never mind the fact that rust is also a compound that forms when iron combines with oxygen.

In my book, it’s better to come up with a name that makes sense or is memorable as long as it’s not very confusing.

People confuse the game and the language all the time in Reddit. We even had a talk at RustConf a few years back about teaching an ML model how to distinguish them.

(That said I agree with your post generally...)

portable grey map? precision guided missiles? idk.

PGM is platinum group metals for me, just another naming collision.

You mean like they had to call the bug planet in Starship Troopers “Planet P” because all the other names in the universe had been taken?

I thought it was called Klendathu

That would probably be a good name for some programming project, I doubt the Name is taken. Then again, we all now Klendathu is riddled with bugs...

PGM index should clarify it :)

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact