
Index Search Algorithms for Databases and Modern CPUs (2010) [pdf] - tjalfi
https://arxiv.org/ftp/arxiv/papers/1706/1706.06697.pdf
======
gravypod
I can't wait for for spatial databases with GPU acceleration. Things like
massively parallel collision checking for complex shapes become relatively
fast when you have a GPU.

------
flgr
Original author here; was quite surprised to find this on the front page
today.

Let me know if you have any questions. Happy to answer them!

~~~
fnl
Foremost, thank you for your work on this subject and your contributions.

As this is work from 2010, and, to cite from your paper:

> In this work, we will not consider finding multiple records at the same
> time. The algorithms given in pseudo-code will not directly support range
> queries. In case there is multiple matching records for the same key, the
> algorithms will only return one of the matching records. For most of the
> algorithms given here, it is however simple to extend them for range
> queries, or returning all of the records matching a key. We also will not
> consider set queries, where all records matching a set of keys are returned
> from one search operation. There is some potential for follow-up work in
> this area ...[snip]... Current commercial database systems already know when
> to switch from index probing to a full table scan. There might be some
> potential for performing better than both of those when explicitly
> considering this case in index structures.

The precise reason for using b-tree(-ish) data structures in databases and
search engines like Lucene is precisely those two issues: to enable set and
range retrievals. Have you published any follow-up of your work that extends
the algorithms presented to cope with these two issues?

~~~
lobster_johnson
Last I checked, Lucene doesn't use B-trees. It uses a variation on skip lists
to implement efficient inverted indices.

As I understand it, its performance comes from a column-store-like approach
where large document lists can be efficiently intersected. Lucene also uses a
LSM-style approach when writing index files; instead of updating a single
large index in place, it incrementally appends to new partial index files, and
to execute a query it queries every partial index and merges the results.
(These partial index files are then incrementally merged in the background to
reduce the number of files that need to be aggregated at query time.)

Lucene is slower than an RDBMS on narrowly targeted queries where an RDBMS can
use a B-tree, but is more efficient when a query matches many documents
because its index data structures are more efficient than a classical
"sequential scan" across row pages.

~~~
fnl
Not sure I understand your reply - which of it explains how NitroGen (i.e. the
indexing algorithm described in the linked paper) has evolved?

~~~
lobster_johnson
I was responding to the assertion that Lucene uses B-trees: "The precise
reason for using b-tree(-ish) data structures in databases and search engines
like Lucene is..."

~~~
fnl
Oh, that's why I had mentioned tree-ish structures, sort of referring to skip-
lists & friends on a broad level. Sorry for the misleading expression. Rather,
I'm interested if any ranged/set retrieval approach can be found that scales
better than log n. (n.b. _ranged_ , so LSH doesn't count either ;-))

