
How Do You Partition Data for Linear Scalability in Geospatial Queries? - ninjakeyboard
How do you partition geospatial data for horizontal scalability? It seems the best option is less partitioning per se and more defining geographic regions and then duplicating the data to query against a region. Otherwise you'll have these awkward borders: a query might butt up against the corner of 4 tiles, so you'd have to query 4 nodes (for a single geospatial query) to get the data from all 4 tiles. I wonder how the Google Places API etc. handle this sort of problem.

The other potential solution is to overlap data, so a node contains the tiles along its edges from the neighboring nodes as well. Not 100% sure how to handle this, or what the best technology is.

Any recommendations welcome. I'm probably looking at the problem wrong - e.g. the partition key in a columnar database query (e.g. Cassandra) could be the floored lat & long integers, with a column range over the less significant digits. But maybe there is another way of looking at the data/problem space?
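A minimal sketch of the floored-lat/long idea from the question (function names are hypothetical, not from any specific database driver): the partition key is the integer-degree tile a point falls in, and a radius query near a tile corner can touch up to four tiles, i.e. up to four nodes.

```python
import math

def tile_key(lat, lon):
    """Coarse partition key: floored integer degrees (one tile per 1-degree cell)."""
    return (math.floor(lat), math.floor(lon))

def tiles_for_radius(lat, lon, radius_deg):
    """All tile keys a radius query around (lat, lon) can touch.
    Checking the four corners of the bounding box is enough at this tile size;
    near a tile corner this returns up to 4 keys, i.e. up to 4 nodes to query."""
    keys = set()
    for dlat in (-radius_deg, radius_deg):
        for dlon in (-radius_deg, radius_deg):
            keys.add(tile_key(lat + dlat, lon + dlon))
    return keys

# A small query near the corner of tile (49, -123) touches four tiles;
# the same query in the middle of a tile touches only one.
print(tiles_for_radius(49.001, -122.999, 0.01))
print(tiles_for_radius(49.5, -122.5, 0.01))
```

The remaining lat/long digits would then be the clustering columns for the in-partition range scan the question describes.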
======
brudgers
This podcast talks about scaling Second Life which has a strong geographic
component:

[http://www.se-radio.net/2009/07/episode-141-second-life-and-mono-with-jim-purbrick/](http://www.se-radio.net/2009/07/episode-141-second-life-and-mono-with-jim-purbrick/)

My naive intuition is that sharding on two or more axes with some
denormalization makes sense: e.g. sharding on both geospatial location and
information layers. Infrequently modified elements that overlap several
geospatial regions could be stored with each region they overlap. This implies
eventual consistency and high availability. On the other hand, some elements
might need higher consistency and therefore have lower availability.

Which is to say that the proper architecture is one that allows accurate
metrics and high levels of tuning based on actual use and application
requirements.

Good luck.

------
SamReidHughes
Having to query against 2 or 4 nodes is not bad, because you can and should
run them concurrently, so you've still got the latency of one query. I
wouldn't want to overlap data because that opens a new door for
inconsistencies to occur.
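A minimal sketch of that fan-out, assuming a hypothetical `query_node` function standing in for the real per-node call (Cassandra, HTTP, etc.): the per-tile queries run concurrently, so the total latency is roughly that of the slowest single query rather than their sum.

```python
from concurrent.futures import ThreadPoolExecutor

def query_node(tile):
    """Hypothetical per-node query; stands in for a real Cassandra/HTTP call."""
    return [f"result-from-{tile}"]

def fanout_query(tiles):
    """Query all tiles concurrently and merge the result sets.
    pool.map preserves input order, so the merge is deterministic."""
    with ThreadPoolExecutor(max_workers=len(tiles)) as pool:
        results = pool.map(query_node, tiles)
    return [row for rows in results for row in rows]

print(fanout_query([(49, -123), (49, -124), (48, -123), (48, -124)]))
```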

~~~
ninjakeyboard
Yeah, that was my fear with the duplication as well.

------
ninjakeyboard
I found this - may be relevant.
[http://arxiv.org/abs/1509.00910](http://arxiv.org/abs/1509.00910)

