
Spatial Data Structures for Better Map Interactions - efkv
http://chairnerd.seatgeek.com/spatial-data-structures-for-better-map-interactions/
======
ryandrake
Spatial data structures are great, and heavily used in all kinds of mapping
applications. Most of the systems I've worked with used the more constrained
quadtree structure, where the map is divided into uniformly-sized tiles, and a
level-of-detail hierarchy is built. This is good for raster data, where each
tile is a picture and the whole world is covered by tiles. The R-tree has
great advantages when the complexity is not uniformly distributed across the
map, for example, vector road data, where vast areas on earth have no data,
and most of the data is concentrated in small urban areas.

I believe PostGIS, which adds spatial operations to Postgres, uses an R-tree
as its spatial index.

An improvement on the R-tree, the R(star)-tree, uses a different node
splitting algorithm and includes re-insertions (similar to balancing a
B-tree), reducing both coverage and overlap. The insertion complexity is
greater, but in general, R(star)-tree query performance tends to be a bit
better for mapping applications. Generally, maps don't change that often, so
building the tree tends to happen far less often than querying it.

There are many more specialized spatial data structures available, for
example, Kd-trees, which can be perfectly balanced and are useful for storing
point data.
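
To illustrate the "perfectly balanced" point, here is a minimal sketch of building a Kd-tree over 2D points by splitting on the median along alternating axes. The names `Node`, `build_kdtree` and `height` are made up for this example and not taken from any library, and it assumes the whole point set is known up front (a static build).

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, float]


class Node(NamedTuple):
    point: Point             # the point stored at this node
    axis: int                # 0 = split on x, 1 = split on y
    left: Optional["Node"]   # points before the median on this axis
    right: Optional["Node"]  # points after the median on this axis


def build_kdtree(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a balanced Kd-tree by recursively splitting at the median."""
    if not points:
        return None
    axis = depth % 2  # alternate between x and y at each level
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        point=ordered[mid],
        axis=axis,
        left=build_kdtree(ordered[:mid], depth + 1),
        right=build_kdtree(ordered[mid + 1:], depth + 1),
    )


def height(node: Optional[Node]) -> int:
    """Number of levels in the tree (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))
```

Because each split puts half the remaining points on either side, the height stays logarithmic in the number of points. Re-sorting at every level keeps the sketch short at the cost of some build time.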

If you're really interested in this stuff, the holy bibles for spatial data
structures (which I keep in a special place on my bookshelf) are a pair of
books written by H. Samet: The Design and Analysis of Spatial Data Structures,
and Applications of Spatial Data Structures: Computer Graphics, Image
Processing, and GIS.

EDIT: Formatting, and apparently you can't write the asterisk character on HN

R-Tree paper (1984):
[http://postgis.org/support/rtree.pdf](http://postgis.org/support/rtree.pdf)

R(star)-Tree (1990):
[http://epub.ub.uni-muenchen.de/4256/1/31.pdf](http://epub.ub.uni-muenchen.de/4256/1/31.pdf)

PostGIS: [http://postgis.net](http://postgis.net)

H.Samet textbooks:
[http://www.cs.umd.edu/~hjs/design.html](http://www.cs.umd.edu/~hjs/design.html)

~~~
jandrewrogers
The spatial data structure literature is quite incomplete, and there is a lot
of confusion about what to use when. At the companies I go into, many spatial
indexing performance issues come from doing it wrong, not necessarily from a
fundamental problem.

It should be pointed out that R-family data structures should only be used for
data sets that are (1) small and (2) relatively static. They scale very poorly
in a number of ways. The primary reason they are used in databases is that
they can handle interval (i.e. non-point) spatial data types without the
possibility of pathological space complexity (bad when talking about
databases) and fit B-tree indexing models adequately, which that software
understands.

Quad-tree variants can scale well for point data types but are often terrible
for indexing non-point data types due to pathological space complexity.

Grid-file variants are the canonical structure for large-scale spatial data
sets if the data is static. However, the number of companies implementing
these correctly at scale is approximately zero in my experience.

For general, scalable, online/dynamic spatial indexing structures, there is
another family of algorithms and data structures (adaptive spatial sieves)
that are basically ignored in the literature even though they were first
described in 1990, albeit in a not very useful form at the time. If you are
doing petabyte-scale real-time indexing of polygons at extremely high rates,
this is what you would use but little is published about them because most
modern variants did not come out of academia.

And the most advanced algorithm family for indexing point-like data is not in
the literature _at all_.

(Background: my day job involves extreme-scale, high-performance spatial
indexing software and I invented a few spatial indexing algorithms back in the
day that are still the state-of-the-art in their respective algorithm
families.)

~~~
rdtsc
> And the most advanced algorithm family for indexing point-like data is not
> in the literature at all.

So what is it? Is it patented, secret (classified)?

And how do you know it is the most advanced without a peer review?

BTW "adaptive spatial sieves" on Google search produces exactly 0 results. Is
there another name for it?

Spatial indexing is interesting and complex but not exactly rocket science;
a good number of people have been thinking about it. It is hard to believe
there are general approaches there that haven't been thought of yet. But
again, I am not an expert, so I could be wrong.

Can you write some more about it? I am very curious.

~~~
jandrewrogers
What makes spatial indexing different than other areas of computer science is
that its fundamental operands are interval types (like hypercubes). Almost all
other areas of computer science focus on solutions that only work on
point-like data and so many idioms and intuitions computer scientists use are not
actually valid when dealing with spatial representations. Data types that
require no less than two integers to describe break the assumptions of
computer science developed for data types that require one integer to
describe.

Oddly, there are almost no people working on the computer science of spatial
representations and indexing. That was how I became involved in the first
place; I needed solutions to specific problems for which no active research
was being done in academia (and I was talking to people like Samet at the
time). And to this day, academics _still_ aren't doing any interesting work on
spatial structures.

The algorithms are not classified, just not published. There are a couple
patents out there on spatial sieves -- I am the inventor on the first
practical one -- but more sophisticated and advanced variants exist that will
probably never be patented. Widmayer (respected CS academic) was the first to
propose this approach to the problem of generalized interval indexing but a
useful algorithm was not discovered until my work in 2007.

The point indexing algorithm family I mentioned has never been patented or
published anywhere but was based on the development of a novel theoretical CS
construct that allows the expression of polymorphic space-filling curves.
These are essentially adaptive to the information theoretic properties of the
high-dimensionality data set but still embarrassingly parallel and
distributable. I don't think these are being used in production anywhere (yet)
but they've been around for several years. Very cool but the number of people
that grok them can be counted on one hand.

All modern research on computing with spatial representations is being done at
one of a few companies, which is why nothing is published. People like me, and
I haven't done the pure research in years, don't have time to generate
hundreds of pages of fairly deep content. Consequently, learning advanced
spatial indexing is fairly prohibitive; you have to figure it out yourself,
which is far from easy, or be at one of the handful of places where they are
using it.

~~~
kastnerkyle
I found some related papers for others who are interested:

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.56.3068&rep=rep1&type=pdf)

[http://bmi.osu.edu/resources/techreports/ahmedBokhariV21.pdf](http://bmi.osu.edu/resources/techreports/ahmedBokhariV21.pdf)

Also a book:

Space-Filling Curves: An Introduction with Applications in Scientific
Computing

by

Michael Bader
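
For readers who have not met space-filling curves before, here is a minimal sketch of the classic fixed Z-order (Morton) curve for 2D integer coordinates. This is the simple textbook construction, not the adaptive or "polymorphic" curves discussed above, which are unpublished. The function names and the 16-bit default are assumptions made for the example.

```python
from typing import Tuple


def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return code


def morton_decode(code: int, bits: int = 16) -> Tuple[int, int]:
    """Invert morton_encode: split a Z-order key back into (x, y)."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y
```

Sorting points by such a key tends to keep spatially close points close together in a one-dimensional index, so an ordinary sorted structure can answer many point queries.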

------
adam-a
The other way to do this, which is common in (older?) 3D video games, is to
render your scene twice, once normally and once with each object shaded in a
unique flat colour. Then sample your second bitmap at the mouse position and
map the pixel value back to your list of objects. This is very fast and only
falls over when you start to have transparent objects, in which case
ray-casting and R-trees become appropriate.
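
A rough sketch of the bookkeeping involved, leaving out the actual rendering: each object's list index is packed into a 24-bit RGB colour, and the pixel sampled under the mouse is unpacked back into an index. Everything here (`id_to_color`, `color_to_id`, `fill_rect`, `pick`, and the list-of-rows "buffer" standing in for the second bitmap) is hypothetical and only meant to show the mapping.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

Color = Tuple[int, int, int]
T = TypeVar("T")

BACKGROUND: Color = (0, 0, 0)  # reserved: means "no object here"


def id_to_color(index: int) -> Color:
    """Pack an object index into a unique RGB colour (0 is kept for background)."""
    n = index + 1
    if not 0 < n <= 0xFFFFFF:
        raise ValueError("index does not fit in 24 bits")
    return ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)


def color_to_id(color: Color) -> Optional[int]:
    """Unpack a sampled pixel back into an object index, or None for background."""
    r, g, b = color
    n = (r << 16) | (g << 8) | b
    return None if n == 0 else n - 1


def fill_rect(buffer: List[List[Color]], x0: int, y0: int,
              x1: int, y1: int, color: Color) -> None:
    """Stand-in for drawing an object flat-shaded into the pick buffer."""
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            buffer[y][x] = color


def pick(buffer: List[List[Color]], x: int, y: int,
         objects: Sequence[T]) -> Optional[T]:
    """Return the object drawn at pixel (x, y), if any."""
    index = color_to_id(buffer[y][x])
    return None if index is None else objects[index]
```

In a real renderer the pick pass would need anti-aliasing and blending turned off, since a blended edge pixel would decode to the wrong index.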

~~~
munificent
This is basically a grid[1] where the resolution is down to 1 pixel. Not a bad
solution!

[1]:
[http://en.wikipedia.org/wiki/Grid_(spatial_index)](http://en.wikipedia.org/wiki/Grid_\(spatial_index\))

------
sigil
If you find yourself doing big-time spatial queries, consider using the
PostGIS extension for Postgres, which uses R-Trees-on-top-of-GiST indices:
[http://postgis.net/](http://postgis.net/)

In addition to ray-casting and R-Trees, there's another alternative for the
"point in polygon test" not mentioned by the article: compute the Winding
Number. Appropriate for a very low number of polygons where a full spatial
index might be overkill.

[http://en.wikipedia.org/wiki/Point_in_polygon#Winding_number...](http://en.wikipedia.org/wiki/Point_in_polygon#Winding_number_algorithm)
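
A minimal sketch of the winding number test, assuming a simple polygon given as a list of (x, y) vertices without the closing vertex repeated. The helper names `is_left` and `point_in_polygon` are just for this example, and points lying exactly on an edge are not treated specially.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def is_left(p0: Point, p1: Point, p2: Point) -> float:
    """> 0 if p2 is left of the line p0->p1, < 0 if right, 0 if on the line."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])


def winding_number(point: Point, polygon: Sequence[Point]) -> int:
    """Count how many times the polygon winds around the point."""
    wn = 0
    n = len(polygon)
    for i in range(n):
        a = polygon[i]
        b = polygon[(i + 1) % n]  # wrap around to close the polygon
        if a[1] <= point[1]:
            # Upward crossing with the point strictly to the left of the edge.
            if b[1] > point[1] and is_left(a, b, point) > 0:
                wn += 1
        elif b[1] <= point[1] and is_left(a, b, point) < 0:
            # Downward crossing with the point strictly to the right of the edge.
            wn -= 1
    return wn


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """The point is inside when the winding number is non-zero."""
    return winding_number(point, polygon) != 0
```

With only a handful of polygons, looping over all of them with `point_in_polygon` on each mouse event is linear in the total vertex count, which is why it can be enough without an index.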

I think seatgeek made the right choice in this case though with a client-side
R-Tree.

Aside: does anyone know what browsers use for hit-testing polygons defined by
the <map/> tag?

------
cwmma
The R-tree implementation they found [1] isn't connected to Leaflet, except
that the org it lives in also manages some Leaflet plugins. Vladimir, the
author of Leaflet, does have his own R-tree implementation [2].

For the record I manage the one in the article.

1: [https://github.com/leaflet-extras/RTree](https://github.com/leaflet-extras/RTree)

2: [https://github.com/mourner/rbush](https://github.com/mourner/rbush)

~~~
efkv
Ah, you are right. I must have looked at Vladimir's implementation and then
just associated him in my mind with the leaflet-extras implementation.

------
kstop
tl;dr: we re-implemented imagemaps.

