

R-trees – adapting out-of-core techniques to modern memory architectures - nkurz
http://sebastiansylvan.spaces.live.com/blog/cns!4469F26E93033B8C!224.entry

======
crux_
Hierarchical memory performance aside, and probably at least as important, the
main motivation for using an R-Tree is to store "data objects of non-zero size
located in multidimensional spaces."
(<http://www-db.deis.unibo.it/courses/SI-LS/papers/Gut84.pdf>)

In particular, a lot of other spatial indexing strategies that work well with
points (kd-trees, quad/oct-trees, bsp-trees) get cumbersome when you adapt
them to deal with objects that can span across multiple nodes, particularly if
the objects in question are dynamic.
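
To make the "non-zero size" point concrete, here is a minimal sketch of the bounding-rectangle bookkeeping an R-tree relies on. The `Rect` class and `node_box` helper are hypothetical names made up for illustration (they are not pyrtree's API), and the sketch assumes 2D axis-aligned boxes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned minimum bounding rectangle (MBR) in 2D."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "Rect") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

    def union(self, other: "Rect") -> "Rect":
        # Smallest box covering both; a parent node's box grows this way.
        return Rect(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                    max(self.xmax, other.xmax), max(self.ymax, other.ymax))

    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def node_box(rects):
    """Bounding box of a node: the union of its children's boxes."""
    box = rects[0]
    for r in rects[1:]:
        box = box.union(r)
    return box


# A long road segment and a small building, each stored once by its MBR.
road = Rect(0.0, 4.0, 10.0, 5.0)
building = Rect(2.0, 1.0, 3.0, 2.0)
leaf = node_box([road, building])

# Window query: test the node's box first, then the children inside it.
query = Rect(8.0, 3.0, 9.0, 6.0)
hits = [r for r in (road, building)
        if leaf.intersects(query) and r.intersects(query)]
```

Each object lives in exactly one node no matter how large it is; only the node's box grows to cover it. A point-oriented structure would have to split the road across cells or duplicate it.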

For those who don't give a hoot about ultimate raw speed but want to store and
retrieve 2D spatial things conveniently, I've been working on an open source
pure python R-tree implementation: <http://code.google.com/p/pyrtree/>

(I also have a WIP implementation in straight C, following a "packed" approach
with node compression and tree flattening, focusing on a build-then-query
workload rather than a fully dynamic one. Release TBD, maybe dependent on
interest since the python version works well enough for my immediate needs.)

~~~
timtadh
If the kind of query you are interested in running is a "K Nearest Neighbor
Query" (that is, for a given point, give me the K nearest objects), you should
also consider looking at metric trees. To have a metric tree you must have a
metric function which takes two objects and returns a distance between them.
The distance must satisfy:

    
    
       1. d(x, y) ≥ 0 (non-negativity)
       2. d(x, y) = 0 if and only if x = y (identity of indiscernibles)
       3. d(x, y) = d(y, x) (symmetry)
       4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality).
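
As a rough sanity check, the four conditions above can be spot-tested on a sample of your data. This is only a sketch: `violates_metric_axioms` is a hypothetical helper written for this example, the edit distance is the textbook Levenshtein dynamic program, and a clean result on a sample is evidence, not proof.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def violates_metric_axioms(d, sample):
    """Return the name of the first axiom violated on `sample`, or None."""
    for x, y in product(sample, repeat=2):
        if d(x, y) < 0:
            return "non-negativity"
        if (d(x, y) == 0) != (x == y):
            return "identity of indiscernibles"
        if d(x, y) != d(y, x):
            return "symmetry"
    for x, y, z in product(sample, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):
            return "triangle inequality"
    return None


dna = ["ACGT", "ACGA", "TTGA", "ACG", ""]
print(violates_metric_axioms(edit_distance, dna))

# Squared difference looks like a distance but breaks the triangle inequality.
print(violates_metric_axioms(lambda a, b: (a - b) ** 2, [0, 1, 2]))
```

The check is cubic in the sample size, so keep the sample small.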
    

Metric trees can be highly useful for data which is either high-dimensional
(i.e., having more than 3 or 4 dimensions) or non-dimensional (like strings,
for instance DNA sequences).
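
The reason no coordinates are needed is that the triangle inequality alone lets you skip candidates. Below is a minimal sketch of that pruning idea with a single pivot and a flat table. It is not an M-tree or any specific published structure; `build_pivot_table` and `range_query` are hypothetical names, and Hamming distance on equal-length strings stands in for whatever metric you actually have.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions; a metric on equal-length strings."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(objects, pivot, d):
    """Precompute the distance from one pivot to every object (done once)."""
    return [(d(pivot, o), o) for o in objects]


def range_query(table, pivot, d, query, radius):
    """Return objects within `radius` of `query`, plus the number of distance calls."""
    dq = d(query, pivot)
    computed = 1
    hits = []
    for dp, obj in table:
        # Triangle inequality: d(query, obj) >= |d(query, pivot) - d(pivot, obj)|.
        # If that lower bound already exceeds the radius, skip the real distance.
        if abs(dq - dp) > radius:
            continue
        computed += 1
        if d(query, obj) <= radius:
            hits.append(obj)
    return hits, computed


objects = ["AAAA", "AAAT", "TTTT", "TTTA", "AATT"]
pivot = "AAAA"
table = build_pivot_table(objects, pivot, hamming)
hits, computed = range_query(table, pivot, hamming, "AAAC", 1)
```

Here the query needs 4 distance computations instead of the 5 a brute-force scan would use. Real metric trees apply the same bound recursively with many pivots, which is where the savings become significant.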

The M-Tree is probably the most generally useful metric tree:
<http://en.wikipedia.org/wiki/M-tree>

If your data is static or only updated very infrequently, you should use an MVP
tree. It is probably the best static structure, closely followed by Sergey
Brin's GNAT structure.

MVP Tree:
[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7...](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7492)
GNAT: <http://infolab.stanford.edu/~sergey/near.html>

Finally, for some data (like strings) it can be very expensive to calculate the
distance function. For that case there is another set of structures which relax
the triangle inequality and are "near metric" trees. These can be useful for
pruning your search space.

For more info on metric data structures see:
[http://www.amazon.com/Foundations-Multidimensional-Metric-Da...](http://www.amazon.com/Foundations-Multidimensional-Metric-Data-Structures/dp/0123694469/ref=cm_cr_pr_product_top)

~~~
crux_
Agreed. Also: R-trees don't do so well as you add data dimensions. (This is
partly received wisdom; my rough understanding is that the empty volume in
each node becomes cavernous and nice splitting becomes much less achievable.)
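
A back-of-the-envelope illustration of the empty-volume effect, not a measurement of any R-tree: assume the data in a node roughly fills a ball, and ask what fraction of the ball's bounding box is actually occupied. The function name is made up for this sketch; the formula is the standard volume of a d-dimensional ball.

```python
import math


def ball_fraction_of_bounding_cube(dim: int) -> float:
    """Volume of a unit-diameter ball divided by the volume of its unit bounding cube."""
    radius = 0.5
    # V_d(r) = pi^(d/2) / Gamma(d/2 + 1) * r^d; the bounding cube has volume 1.
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


for dim in (2, 3, 5, 10, 20):
    print(dim, ball_fraction_of_bounding_cube(dim))
```

The occupied fraction is about 79% in 2D and 52% in 3D, but only about 0.25% in 10 dimensions, so a query box overlapping a node's rectangle is increasingly likely to hit nothing but empty corners.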

Another reference dump: If you haven't seen them I found spill trees a nice
extension of metric trees for certain types of problems (CV anyone?); found
via surfing links from some forgotten HN article:
<http://books.nips.cc/papers/files/nips17/NIPS2004_0187.pdf>

------
hga
Cache is the new RAM, RAM is the new disk....

------
jallmann
To make it easier to follow the post, do read through the slides that are
linked in there.

As an aside, I was dreading having to download a .pptx, but Office Live will
display it with a slick web-based viewer. Nice.

