
SpaceCurve: A Unique and Fast Geospatial Database - jamii
http://www.jandrewrogers.com//2015/10/08/spacecurve/
======
th0br0
Lots of buzzwords everywhere and http://www.geekwire.com/2014/big-data-startup-spacecurve-cuts-staff-ceo-departs/ makes me sceptical. Though they do have a VMware image of their DB.

~~~
jandrewrogers
As a caveat, the VMware image is really for demoing basic functionality. It is
a true parallel-distributed database engine, and is not designed to run as a
tiny local instance. The VMware instance is a custom build that disables a
bunch of things mooted by being non-distributed, and is tuned to be reasonably
well-behaved on a system small enough that almost any GIS database would
likely work.

------
moconnor
I'm impressed - this is the first system I've read about from a startup that
actually sounds as if it is architected to be fast. Targeting I/O and memory
bottlenecks as the only acceptable limits is exactly what you should do.

Nice to see HPC learning being used outside the field. Many people
horrifically underestimate or simply overlook the cost of fetching data from
outside your NUMA node.

~~~
jandrewrogers
I spent a few years designing new parallel algorithms on a number of
supercomputing architectures that I enjoyed immensely, mixed in with my usual
database work. I learned a tremendous amount about software design and
optimization I probably would have not learned otherwise. There is a lot of
knowledge in HPC that tends not to travel beyond that domain. It gave me the
"assume you have a million cores..." mindset when thinking about algorithms
and data structures.

In fact, debugging an HPC code on a Cray machine late one night, staring at
large dumps of binary coded hyper-rectangles, was where I first noticed the
deep bit-wise patterns in some types of spatial representations that
SpaceCurve now exploits. It allows computation on hyper-rectangle primitives
to be nearly as fast as integer primitives on ordinary silicon, which was not
the case with earlier versions of my work.
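
To give a rough flavor of why that matters (this is not our encoding, just ordinary integer arithmetic on an axis-aligned hyper-rectangle in Python), the core predicates come down to a couple of comparisons per dimension:

    # Illustration only, not SpaceCurve's representation: a plain axis-aligned
    # hyper-rectangle with integer coordinates. The core predicates reduce to
    # a few integer comparisons per dimension, which is why rectangle
    # primitives can in principle be very cheap.

    from typing import Tuple

    Rect = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (mins, maxs) per dimension

    def intersects(a: Rect, b: Rect) -> bool:
        """True if the two hyper-rectangles overlap in every dimension."""
        (amin, amax), (bmin, bmax) = a, b
        return all(lo1 <= hi2 and lo2 <= hi1
                   for lo1, hi1, lo2, hi2 in zip(amin, amax, bmin, bmax))

    def contains(outer: Rect, inner: Rect) -> bool:
        """True if outer fully contains inner."""
        (omin, omax), (imin, imax) = outer, inner
        return all(olo <= ilo and ihi <= ohi
                   for olo, ohi, ilo, ihi in zip(omin, omax, imin, imax))

    # A degenerate rectangle (min == max in every dimension) behaves as a point.
    point = ((3, 7), (3, 7))
    box = ((0, 0), (10, 10))
    assert intersects(point, box) and contains(box, point)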

~~~
nickpsecurity
That's a trip. I agree with you about HPC having much wisdom that isn't
exploited outside of it. Lots of what I heard from "cloud," etc had already
been done in MPP's, etc. They still haven't caught up to the field totally and
I still plan on resurrecting MPP eventually.

A prior discussion here suggests what holds it back is cultures. The HPC
people keep to themselves and don't try to make anything useful outside the
field. The cloud projects act like they're the only thing going on. And so on.
Bridging the gap may never happen but I always encourage upstarts to dig into
the HPC literature on their topics as they might find gold. One of few cases
an ACM/IEEE fee is worthwhile. ;)

Personal example was trying to find a patent-resistant, NUMA architecture to
use for security-enhanced SPARC CPU's. Stumbled upon the MIT Alewife machine
that did NUMA up to 512 nodes using (a) SPARC processor modified for quick
context-switching & message passing; (b) a single, custom chip to handle NUMA
aspects. Making cheap, COTS hardware NUMA w/ open design would take one ASIC
development using that architecture. Just cuz I studied supercomputing once
and remembered to look it up. :)

------
noblethrasher
I wonder if this is related to using space-filling curves for indexing:
http://www.drdobbs.com/database/space-filling-curves-in-geospatial-appli/184410998

One of the professors at my institution is big in that field.

~~~
jandrewrogers
Common misconception. It is not related to space-filling curves. Space-filling
curves are for linearizing points but have limited utility for interval data
types. Technically though, all binary space decompositions can be described in
terms of space-filling curves. However, for most kinds of space decomposition
it is a mistake to use something describable on a fixed curve (e.g. Hilbert);
instead, the curve definitions are dynamically constructed and expressed in terms of a
simple algebra. In extremely adaptive representations that squeeze as much
information theoretic locality as possible out of the data model, those
expressions can be pretty long but from a software standpoint you only need to
reason about a local fragment of the expression at any point in the
processing.

In SpaceCurve, all points are represented as hyper-rectangles. Literally
everything is a hyper-rectangle. It is the most unusual part of the whole
thing. There is no linearization on a space-filling curve.
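
To make the distinction concrete, here is what point linearization on a fixed curve looks like, using a throwaway Z-order (Morton) sketch that has nothing to do with our internals:

    # Throwaway sketch of point linearization on a Z-order (Morton) curve, the
    # simplest fixed space-filling curve. This is what "linearizing points"
    # means; it is not how SpaceCurve represents anything.

    def morton2(x: int, y: int, bits: int = 16) -> int:
        """Interleave the bits of two coordinates into one sortable integer."""
        code = 0
        for i in range(bits):
            code |= ((x >> i) & 1) << (2 * i)
            code |= ((y >> i) & 1) << (2 * i + 1)
        return code

    # A point maps to exactly one position on the curve...
    print(morton2(3, 5))  # 39

    # ...but an interval/rectangle does not: the curve positions of the points
    # inside [4,7] x [2,5] break into multiple disjoint runs, which is why a
    # fixed curve is awkward for interval (extent) data types.
    codes = sorted(morton2(x, y) for x in range(4, 8) for y in range(2, 6))
    runs = 1 + sum(1 for a, b in zip(codes, codes[1:]) if b != a + 1)
    print(runs)  # 2: the rectangle fragments into separate curve segments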

~~~
shawn-butler
Why did you choose the name?

~~~
jandrewrogers
Honestly? Because I had been sitting on that domain name for years and
couldn't find anything better. I have no idea why I originally registered it.
:-)

It was evocative while being vague. People only started associating it with
space-filling curves in the last year or two once those started becoming
trendy again among programmers.

------
calpaterson
I wish he had stated that he is the founder of SpaceCurve more prominently. I
would be interested to learn more, but this post seems to be mostly marketing.

------
jamii
Frustratingly light on detail, but the possible connections to
http://arxiv.org/abs/1404.0703 are
intriguing.

~~~
jamii
Specifically, the linked paper models indexes on a table as sets of
n-dimensional volumes showing where rows aren't. It then gives a join
algorithm that is asymptotically instance-optimal (up to a poly-log factor), i.e.
better than any previous result.

Unfortunately it relies on efficiently representing and querying the set of
volumes built up during the query process and the data structure used in the
paper is ridiculously impractical. I haven't seen an implementation yet that
managed to tame the constant factors. Perhaps seeing how the OP models
volumes/constraints would help.
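
For flavor, here is a deliberately naive sketch of the idea as I read it, with made-up tuples; the paper's whole contribution is a structure that does this efficiently, which a flat list obviously is not:

    # Naive sketch: record n-dimensional "gap boxes" where rows are known not
    # to exist, and skip any candidate tuple that falls inside one. The
    # paper's actual data structure (and its constant factors) is the hard
    # part; a flat list like this is the trivially slow version.

    from typing import List, Tuple

    Box = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (mins, maxs), inclusive

    class GapSet:
        def __init__(self) -> None:
            self.boxes: List[Box] = []

        def add(self, box: Box) -> None:
            """Record a region known to contain no rows."""
            self.boxes.append(box)

        def ruled_out(self, point: Tuple[int, ...]) -> bool:
            """True if some recorded gap box already excludes this point."""
            return any(
                all(lo <= c <= hi for c, lo, hi in zip(point, mins, maxs))
                for mins, maxs in self.boxes
            )

    gaps = GapSet()
    gaps.add(((0, 10), (4, 20)))    # no rows with a in [0, 4] and b in [10, 20]
    print(gaps.ruled_out((2, 15)))  # True: inside a known gap, skip it
    print(gaps.ruled_out((7, 15)))  # False: still a live candidate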

------
nickpsecurity
Great work. It's a counter-example to my meme that few learn from the past of
our field. The author dug deep into the topics to find whatever he could,
mixed that with an incredibly different way of looking at things to produce a
truly original work, and focused on the right areas to prevent bottlenecks
where it counted. Probably the best work I've seen on databases in a while.

I expect plenty more to come from this.

------
shin_lao
Truly shared-nothing architectures are very difficult to do for a database
unless you are willing to copy all the data for each core, which then creates
a problem of heavy memory usage.

Then when it comes to network I/O there is still the problem of the IRQ
arriving on a specific core (or maybe a random core, depending on the OS)
which is not the core your thread runs on (if you have a shared-nothing
architecture).
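
For reference, the thread-placement half is the easy part, something like this Linux-only sketch with an arbitrary core number; steering the NIC interrupts to match is where it gets fiddly:

    # Linux-only sketch of the thread-placement half: pin this process to one
    # core. The core number is arbitrary. Steering the NIC interrupt onto the
    # same core is a separate step (e.g. /proc/irq/<n>/smp_affinity) and is
    # exactly the part a shared-nothing design has to get right.

    import os

    TARGET_CORE = 2                          # arbitrary choice for this worker
    os.sched_setaffinity(0, {TARGET_CORE})   # 0 = the calling process
    print(os.sched_getaffinity(0))           # {2}: we now stay on that core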

Lots of claims in this post.

------
DennisP
> Second, the database is built on discrete topology internals, not ordered
> sets, which is more efficient for massively parallelizing most database
> operations, notably geospatial and join operations...relational databases
> could be designed with similar characteristics.

Anybody know how a relational join algorithm like this would work?

~~~
jandrewrogers
Yes. The closest analogy is a distributed hash join, though it is more general
than just equi-joins. In a distributed hash join, there is a hashing stage
that reorganizes the data to improve locality and selectivity before the
actual join stage. For topological representations, there is information
theoretic locality inherent in the sharding organization but the underlying
"shape" of two tables may look very different such that there is no direct
mapping of records between the two. Fortunately, there is a trivially
constructible and very fast transform function on shard definition (which is
just a constraint) that allows it to directly map to constraints (shards) in
different data models without moving any data. In effect, it is as though the
table has already been hash-distributed for whatever join you want to run.
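
For comparison, the hashing stage of an ordinary distributed hash join looks roughly like this toy sketch with made-up tables; this is the standard technique I am using as the analogy, not our constraint transform:

    # Toy sketch of the hashing stage of a distributed hash join. Both tables
    # are repartitioned by a hash of the join key so matching rows land on the
    # same shard, after which each shard joins independently and in parallel.

    from collections import defaultdict

    NUM_SHARDS = 4  # stand-in for the number of nodes/cores

    def repartition(rows, key):
        shards = defaultdict(list)
        for row in rows:
            shards[hash(row[key]) % NUM_SHARDS].append(row)
        return shards

    users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
    orders = [{"user_id": 1, "sku": "a-1"}, {"user_id": 2, "sku": "b-7"}]

    u_shards = repartition(users, "id")
    o_shards = repartition(orders, "user_id")

    for s in range(NUM_SHARDS):  # each shard is an independent local join
        for u in u_shards.get(s, []):
            for o in o_shards.get(s, []):
                if u["id"] == o["user_id"]:
                    print(u["name"], o["sku"])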

The capabilities of these generalized parallel ad hoc join algorithms were
demonstrated on a supercomputer circa 2009.

~~~
ibdknox
> The capabilities of these generalized parallel ad hoc join algorithms were
> demonstrated on a supercomputer circa 2009.

Is this external work? If so, do you have any pointers to it?

~~~
jandrewrogers
It is easy to forget that in 2009 even MapReduce was still considered exotic.
Particularly back then, this was far in front of what most people were
thinking about.

The only people at the time that immediately grokked the broad implications
and had a deep interest in the theory behind it were computer science
researchers at certain government agencies. Hence why so much of the early
computer science was verified on supercomputers instead of commodity clusters.
Unlike today, you could not rent a cluster for this type of thing back then.

None of the early research produced public results.

------
njharman
Sounds like DB for The Machine from "Person of Interest" TV Show.

------
imaginenore
* Not free

* Not open source

* Not a single benchmark

This is just an ad for a commercial product.

~~~
dang
Please don't post dismissive comments to HN.

The OP may not answer every question but this is obviously serious work, so it
deserves a fair treatment, not a glib dismissal—especially since the people
working on it are here answering questions. This is deeply interesting stuff
and there's no excuse for unsubstantive discussion.

------
hellbanner
How does this compare to
http://docs.mongodb.org/manual/applications/geospatial-indexes/ ?

Nevermind, it's not open source.

~~~
mmalone
I was lead architect at SimpleGeo and spent most of the time working on our
distributed geospatial database. I haven't looked at the MongoDB
implementation in years, but at that time it was really a toy compared to what
other people were doing. Basically, they used a space-filling curve to do
dimensionality reduction and then stored geospatial data in the same b-tree
index that they use for everything else. It's suitable for a lot of trivial
use cases at scale, and even for non-trivial use cases at low data/throughput
scale. But the architecture was (too lazy to determine if it still is) not
suitable for large data sets, high query throughput, or distribution for fault
tolerance / improved locality, etc.
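
Roughly, the scheme they used looks like this toy sketch with made-up points (not MongoDB's actual code):

    # Toy sketch of the approach described above: reduce (x, y) to one
    # interleaved-bit key and keep it in an ordinary ordered index, with a
    # sorted list standing in for the b-tree. One range scan covers one
    # contiguous curve segment; real bounding-box queries need several such
    # scans, which is where the scheme starts to strain.

    import bisect

    def interleave(x: int, y: int, bits: int = 16) -> int:
        code = 0
        for i in range(bits):
            code |= ((x >> i) & 1) << (2 * i)
            code |= ((y >> i) & 1) << (2 * i + 1)
        return code

    points = [(3, 5), (4, 4), (10, 2), (12, 13)]
    index = sorted((interleave(x, y), (x, y)) for x, y in points)  # "b-tree"

    # Scan the key range covering the curve-aligned box [0,7] x [0,7].
    lo, hi = interleave(0, 0), interleave(7, 7)
    start = bisect.bisect_left(index, (lo, (0, 0)))
    stop = bisect.bisect_right(index, (hi, (7, 7)))
    print([pt for _, pt in index[start:stop]])  # [(3, 5), (4, 4)]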

At the time it was really frustrating to hear comments like yours, since the
MongoDB solution looked superficially similar to the consumer but was basically
worse on every measurable dimension if you knew what you were doing. I hadn't
heard of SpaceCurve before today, but I skimmed the linked article and some
other stuff on the blog and it looks like this guy knows what he's talking
about. My guess is it's a much better geospatial architecture than MongoDB's.

