
Why Are Geospatial Databases So Hard to Build? - chippy
http://www.jandrewrogers.com/2015/03/02/geospatial-databases-are-hard
======
jandrewrogers
Author of that article here. Feel free to ask me questions about all things
spatial (and beyond), I will attempt to answer them.

~~~
ChuckMcM
I enjoyed the article, and wondered what queries you support. Something like
"SELECT * FROM world WHERE (distance_from_me < 500 meters);"?

I ask because a lot of what you discuss seems to resonate with folks in the
graphics business who are, for example, culling object geometry from a render
scene by doing what is effectively the above operation. It's also what a
'Whisper' or 'Yik Yak' type of app does when trying to find people nearby,
although in that case the problem is greatly simplified by assuming a sphere
and testing for containment in the sphere (or the circle, if working in the
plane).
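For the simplified case, a toy sketch of that containment test, assuming
plain Euclidean coordinates (all names here are illustrative, not from any
particular product):

    def nearby(objects, center, radius):
        # Keep only objects whose position falls inside the query sphere.
        # `objects` is an iterable of (id, (x, y, z)) tuples in a Euclidean
        # frame; a real geodetic "within 500 meters" query would differ.
        r2 = radius * radius
        cx, cy, cz = center
        return [oid for oid, (x, y, z) in objects
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= r2]

    # e.g. everything within 500 units of the origin
    print(nearby([("a", (10, 0, 0)), ("b", (900, 0, 0))],
                 (0, 0, 0), 500.0))   # -> ['a']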

~~~
jandrewrogers
You can do much more sophisticated operations than simple point-in-polygon
queries.

The geospatial implementation in SpaceCurve is quite sophisticated and in some
aspects the state-of-the-art. Complex geometry types and operators are
obviously supported, and the implementation is fully geodetic and very high
precision by default. Complex polygon constraints, aggregates, joins,
intersections, etc on data models that also contain complex geometries are
supported and massively parallelized. There is nothing else comparable for big
data in this regard that I know of.

It is useful to understand that the entire platform, database engine on up, is
fundamentally spatially organized even for plain old SQL "text and numbers"
data. The SQL implementation is a thin layer that is translated into the
underlying spatial algebra. You can use it for non-geospatial things, it just
lends itself uniquely well to geospatial data models because it can properly
represent complex spatial relationships at scale.

~~~
mrec
Graphics will tend to want 3D spatial queries, though, and this gives the
impression of being optimized for, if not restricted to, 2D or 2D-plus-
curvature. The article consistently talks about "polygons" as opposed to
"polytopes", for example.

Have I got the wrong impression?

~~~
jandrewrogers
The geospatial implementation is a true 3-space model. You can content-address
objects and features past low-earth orbit by default. All surfaces upon which
geometries are constructed are embedded in this space. Polygons relative to an
embedded surface, such as a geodetic 2-ellipsoid, are actually represented as
3-space shapes under the hood; the 2-surface is a convenience for popular use
cases but the data is not organized that way.

There are an unbounded number of possible shapes and surfaces in 3-space.
Storing and content-addressing them is easy enough. The challenge is that to
be useful you need, at a minimum, generally correct intersection algorithms
for all of the representable shapes. It is an open-ended computational
geometry problem and we continuously extend geometry capabilities as needed.
Currently, the most complex shapes can be constructed relative to a number of
well-behaved mathematical surfaces embedded in 3-space. Shapes directly
constructible in 3-space are significantly more limited but I expect that to
expand over time as well, particularly if there are specific requirements and
use cases.

The underlying representation was intended to mirror the spatiotemporal
organization of the physical world at a data level. We can generate
projections but the data lives in 3-space.

------
gumby
This is an OK introduction to the issues of geospatial DBs (a meta meta
article as it were) but two interesting (to me) points were made in passing:

> Most software engineers assume today, as I did a decade ago, that the design
> of scalable geospatial databases is a straightforward task.

You could substitute almost any computing topic for "geospatial databases" and
still (sadly) have a perfectly reasonable sentence. Partly that is because
our field is so new, and partly it's simple naïveté, which is occasionally
powerful but mostly dangerous.

> Algorithms in computer science, with rare exception, leverage properties
> unique to one-dimensional scalar data models. In other words, data types you
> can abstractly represent as an integer.

Alas, this is painfully true, and the field of algorithms is so complex that
the one-dimensional nature of our exploration is easy to overlook.

</grumble>

------
geophile
I'm not sure why the author is dismissing space-filling curves so quickly,
just based on the age of the idea. I've worked with them extensively, in the
context of spatial indexing in a database system, in both research and product
settings. They address a number of problems described for interval data types.
The techniques I've been working with "decompose" a spatial object into a
number of regions by recursively partitioning the space, ending up with a
small number of "z-values" for each. A z-value is represented by an integer,
which is then used as a key in a conventional index, e.g. a sorted array,
balanced binary tree, B-tree, skiplist, etc.
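For concreteness, here is a toy sketch of one common form of z-value, a 2-D
Morton code built by bit interleaving; the recursive decomposition of an
object into covering cells described above is omitted:

    def morton2(x, y, bits=16):
        # Interleave the bits of two unsigned integer cell coordinates into
        # a single z-value that can serve as an ordinary integer key.
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    # Nearby cells tend to get nearby z-values, so the keys can live in any
    # ordered structure: sorted array, balanced tree, B-tree, skiplist, ...
    print(morton2(3, 5))   # -> 39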

Getting a uniform distribution for sharding is easy in one dimension. I'm not
sure what kind of scaling the author has in mind, but if you put the z-values
in a B-tree, it scales exactly like a B-tree, i.e., extremely well. I agree
there is no exploitable ordering of the spatial objects, but the trick is to
order the z-values. If a given spatial object gives rise to 4 z-values, then
they -- and the spatial object -- appear in 4 places in the ordering. That's
fine. Run your search (in one dimension), get qualifying records, and then get
rid of duplicates later.
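And a rough sketch of the query side under the same assumptions: each object
appears once per z-value, so the one-dimensional range scans are followed by
de-duplication (the data and ranges here are made up):

    import bisect

    # (z_value, object_id) pairs kept sorted by z_value; in a real system
    # this would be a B-tree, skiplist, or distributed ordered k/v store.
    index = sorted([(9, "road_7"), (12, "road_7"), (12, "park_2"),
                    (40, "park_2")])

    def query(z_ranges):
        # Scan each 1-D z-value range, then drop duplicate object ids.
        hits = set()
        for lo, hi in z_ranges:
            i = bisect.bisect_left(index, (lo, ""))
            while i < len(index) and index[i][0] <= hi:
                hits.add(index[i][1])
                i += 1
        return hits

    # The query region would itself be decomposed into z-value ranges.
    print(query([(10, 13)]))   # both objects, each reported once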

~~~
jandrewrogers
The point was really that space-filling curve indexes do not solve the
underlying issues with polygon indexing, their oldness was not a factor. That
type of indexing is so old that the patents (yes, they were patented by the
big database companies) have long since expired. And yet they are rarely used
commercially. There is good reason for that. There is a resurgence in their
usage by people new to spatial indexing but it is pretty obvious that few
implementors are aware of the extent of the computer science, which goes back
into the 1960s.

The assumption of size was "large-scale", which in this day and age usually
means massively distributed. A B-tree is not a solution to that problem
domain.

~~~
geophile
> The assumption of size was "large-scale", which in this day and age usually
> means massively distributed. A B-tree is not a solution to that problem
> domain.

Another thought on this one: space-filling curves decompose the problem
nicely. Scale out your favorite ordered key/value store, and then layer
spatial indexing on top. There is no need to develop a new data structure
that both scales out and handles spatial data.

~~~
agentargo
This is my thought as well. It seems like a perfectly feasible solution to me,
but I might be speaking out of the naivete mentioned in the article.

------
joe_the_user
Hmm, do you have to index intervals directly?

Algorithms for spatially indexing points are tractable - see k-d trees etc.
R-trees seem pretty lame to be honest, like an ad hoc data structure that only
partly meets a number of requirements.

Of course, some kinds of indexing and searching are inherently harder than
others. But I don't see why one couldn't program any needed spatial algorithm
directly from just an index of all points in the system.

For example, suppose you have an index of triangles. If you want to find all
triangles intersecting triangle (x,y,z), you could iteratively expand a near-
neighbor search from each point. Every point you find that is closer than the
given triangle point's furthest point is a potential neighbor and could be
checked reasonably quickly. This gives the set S of triangles intersecting a
given triangle in log^n(size(S)), making the process relatively scalable.
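A rough sketch of that idea, using scipy's k-d tree over triangle vertices to
gather candidates; the bounding-box check at the end is just a cheap stand-in
for an exact triangle-triangle intersection test, and everything here is
illustrative rather than a tuned implementation:

    import numpy as np
    from scipy.spatial import cKDTree

    def candidate_intersections(tris, query_tri):
        # tris: (n, 3, 2) array of n planar triangles; query_tri: (3, 2).
        verts = tris.reshape(-1, 2)              # all indexed vertices
        tree = cKDTree(verts)
        centroid = query_tri.mean(axis=0)
        r_query = np.linalg.norm(query_tri - centroid, axis=1).max()
        # Over-estimate of how far an intersecting triangle's nearest
        # vertex can lie from the query centroid.
        r_index = 2 * np.linalg.norm(
            tris - tris.mean(axis=1, keepdims=True), axis=2).max()
        hits = tree.query_ball_point(centroid, r_query + r_index)
        tri_ids = {h // 3 for h in hits}         # vertex row -> triangle
        qmin, qmax = query_tri.min(axis=0), query_tri.max(axis=0)
        return [i for i in sorted(tri_ids)
                if (tris[i].max(axis=0) >= qmin).all()
                and (tris[i].min(axis=0) <= qmax).all()]

    tris = np.array([[[0, 0], [1, 0], [0, 1]],
                     [[5, 5], [6, 5], [5, 6]]], dtype=float)
    print(candidate_intersections(
        tris, np.array([[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]])))   # -> [0]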

K-d trees seem more sensible:
[http://en.wikipedia.org/wiki/K-d_tree](http://en.wikipedia.org/wiki/K-d_tree)

And there's more: "Storing objects in a space-partitioning data structure (kd-
tree or BSP for example) makes it easy and fast to perform certain kinds of
geometry queries – for example in determining whether a ray intersects an
object, space partitioning can reduce the number of intersection test to just
a few per primary ray, yielding a logarithmic time complexity with respect to
the number of polygons."

This lends credence to my argument that polygon-polygon intersection should
be obtainable in log^n or maybe even log time.

Space partitioning in general does cool stuff:
[http://en.wikipedia.org/wiki/Space_partitioning](http://en.wikipedia.org/wiki/Space_partitioning)

~~~
falcolas
The problem is less with the mathematics than with efficient storage and
retrieval of a large number of points (i.e., an index that won't fit in
memory) from disk. Doing a search across a tree stored on disk is really
slow.

Even with mathematically sound storage trees as indexes, locations which are
close in space could easily be split between two arms of the tree, making
proximity searches quite expensive - you would have to traverse the entire
tree to ensure that you do not miss data separated by a poor splitting plane.

There are a number of theoretically efficient solutions to this, but they all
become remarkably inefficient when you add in a hard drive.

~~~
gobengo
I think if you're going to be doing sufficiently 'biggish data' with those
kinds of retrieval characteristics, you've got to just give up on disk and
use flash only.

~~~
falcolas
Flash solves the "added overhead for random access" portion of the problem,
but the overhead for any access at all is still remarkably high, particularly
when compared to main memory.

Perhaps when we get more "flash connected to the main bus" options (there are
a few commercial options which sit in PCI Express, AGP, and even DRAM slots),
we can begin discounting the overhead more and more, but we're not there quite
yet.

------
nigwil_
A possible solution to the firehose problem is stream databases; the catch
phrase is "rather than data waiting for queries, it is queries waiting for
data".

The people behind Truviso
([http://en.wikipedia.org/wiki/Truviso](http://en.wikipedia.org/wiki/Truviso),
subsequently bought out by Cisco) had an implementation of stream processing;
an overview is here:

[http://www.neilconway.org/docs/cidr2009_franklin.pdf](http://www.neilconway.org/docs/cidr2009_franklin.pdf)

We looked at Truviso when we had to solve video-player analytics, and it
seemed to us to solve that problem. I imagine a tree structure of stream
queries, where the leaves are the data-gathering points and summaries
aggregate up the tree, would address the scaling problem.
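A toy sketch of that tree-of-stream-queries idea (my own illustration, not
Truviso's API): leaves count raw events per time window, and interior nodes
only ever merge their children's summaries.

    from collections import Counter

    class Node:
        def __init__(self, children=()):
            self.children = list(children)
            self.counts = Counter()          # window -> event count

        def ingest(self, window, n=1):       # leaves: data-gathering points
            self.counts[window] += n

        def summary(self):                   # roll up the whole subtree
            total = Counter(self.counts)
            for child in self.children:
                total.update(child.summary())
            return total

    # Two leaf collectors feeding one aggregator.
    leaf_a, leaf_b = Node(), Node()
    root = Node(children=[leaf_a, leaf_b])
    leaf_a.ingest("12:00"); leaf_a.ingest("12:00"); leaf_b.ingest("12:05")
    print(root.summary())   # Counter({'12:00': 2, '12:05': 1})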

~~~
jandrewrogers
The problem with stream processing is that spatiotemporal analytics are rarely
summarizations. Typically, the only aggregates that are computed at ingest are
path, vector, and polygon features (e.g. an entity path reconstructed from
samples). This is not the real-time analysis part of the application, just the
ingest processing, and these features must be continuously indexed and online.

Most spatiotemporal analytics compare events and entity behavior in the
present with the past at a pretty fine-grained level, which is why
summarizations or in-memory shortcuts do not work that well in practice. Your
working set is frequently measured in days to months of data for common use
cases, limited primarily by the storage budget.

In short, spatiotemporal analytics requires fully online indexing of the data
in the stream, and a pretty big index at that for most apps.

------
mbq
Instead of Vincenty's, there is a very nice method from Meeus for great-circle
distance, which has virtually identical accuracy and no loop inside:
[http://www.abecedarical.com/zenosamples/zs_great_circle_route.html](http://www.abecedarical.com/zenosamples/zs_great_circle_route.html)
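For reference, a sketch of a non-iterative ellipsoidal distance along those
lines, using Andoyer's closed-form approximation (the method Meeus describes);
this is written from memory, so treat the linked page as authoritative:

    import math

    def andoyer_distance(lat1, lon1, lat2, lon2,
                         a=6378137.0, f=1 / 298.257223563):
        # Approximate geodesic distance in metres between two points given
        # in degrees. Closed form, no loop; accuracy is on the order of
        # tens of metres. Coincident or antipodal points are not handled.
        p1, p2 = math.radians(lat1), math.radians(lat2)
        F, G = (p1 + p2) / 2, (p1 - p2) / 2
        lam = math.radians(lon1 - lon2) / 2
        S = (math.sin(G) ** 2 * math.cos(lam) ** 2
             + math.cos(F) ** 2 * math.sin(lam) ** 2)
        C = (math.cos(G) ** 2 * math.cos(lam) ** 2
             + math.sin(F) ** 2 * math.sin(lam) ** 2)
        omega = math.atan2(math.sqrt(S), math.sqrt(C))
        R = math.sqrt(S * C) / omega
        D = 2 * omega * a
        H1 = (3 * R - 1) / (2 * C)
        H2 = (3 * R + 1) / (2 * S)
        return D * (1 + f * H1 * math.sin(F) ** 2 * math.cos(G) ** 2
                      - f * H2 * math.cos(F) ** 2 * math.sin(G) ** 2)

    # e.g. Paris to New York, roughly 5.8e6 metres
    print(andoyer_distance(48.8566, 2.3522, 40.7128, -74.0060))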

------
tomjen3
The article seems to lack even one good concrete example. All it came up with
was a bunch of high-level complaints.

------
kyberias
I'd love to hear some opinions (from the author?) on how well the current
offerings of major DBMS vendors (e.g. Microsoft, Oracle, ...) handle
geospatial data. How well do they scale, and do they have inherent
limitations?

------
ericfrenkiel
A great article, and very true, as MemSQL (w2011) just announced its own
real-time geospatial capabilities:
[http://blog.memsql.com/geospatial-intelligence/](http://blog.memsql.com/geospatial-intelligence/)

------
aristus
I respect the author's experience, but "Database Engines Cannot Handle Real-
Time Geospatial" is probably not true. We (MemSQL) have developed geospatial
on top of our existing skiplist indexes, and can deliver thousands of
intersection queries per second, over billions of points, in milliseconds per
query, on very modest hardware. If you want more performance, just add more
hardware.

We previewed our geo code at the Esri Developer's Conference last week. The
demo delivers queries in millisecond time over 170M records without any
tuning. The naive version worked fine.

[http://blog.memsql.com/geospatial-intelligence/](http://blog.memsql.com/geospatial-intelligence/)

~~~
jandrewrogers
With all due respect, 170M points is a trivial data model; it fits on a thumb
drive. That is seconds' worth of data in IoT; you can brute-force the
implementation and look great. Several well-known companies do. It just
doesn't scale to anything real.

The location analytics data models I need to support, e.g. mobile telecom,
have _trillions of polygons_. Indexing millions of records per second
continuously concurrent with sub-second queries is pretty normal for IoT, and
easy to support with good computer science. (Location is almost never
described as a point by primary sources but as polygons; these are converted
to centroid approximations for platforms incapable of doing analysis on the
source.)
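To illustrate the centroid reduction mentioned above, a minimal planar
(shoelace) version; this is purely illustrative, and a real system would of
course compute it geodetically on the source geometry:

    def polygon_centroid(ring):
        # Centroid of a simple planar polygon given as a list of (x, y)
        # vertices, via the shoelace formula.
        area2 = cx = cy = 0.0
        for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
            cross = x0 * y1 - x1 * y0
            area2 += cross
            cx += (x0 + x1) * cross
            cy += (y0 + y1) * cross
        return cx / (3 * area2), cy / (3 * area2)

    # A polygonal footprint reduced to a single point approximation.
    print(polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)]))   # (2.0, 1.0)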

While I do not know the details of your implementation, I do know that
indexing complex geometries using a skiplist index won't achieve the necessary
volume or velocity required for IoT-like applications. If you restrict
yourself to point data, people have been building extremely fast, extremely
large quad-trees for years; you just can't scale high-value analytics with
them, which limits their utility.

~~~
aristus
Fair enough; though I'm intrigued by the mention of trillions of polygons. How
complex are they? 10 points? 100? 1000? What benefit does the raw data give
that centroids can't?

