
Geospatial indexing on Hilbert curves - poudro
https://blog.zen.ly/geospatial-indexing-on-hilbert-curves-2379b929addc
======
digsmahler
> Instead of projecting the Earth directly onto a single square, it projects
> the Earth onto the 6 faces of a cube enclosing the Earth and then applies an
> extra non-linear transformations to reduce even more the deformations. Each
> cell in s2 is in fact part of one of six quadtrees that describe the whole
> planet.

That's a super cool detail! I once implemented a 2D index using a Z-Order
curve that directly translated lat/lon coordinates to a linear ordering. It
works well enough because nobody really lives at the poles--the search regions
with a single projection get really obtuse there. Projecting the earth onto a
6-sided die is a really elegant solution to that problem! Go go Google
engineering!
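
A minimal sketch of the cube idea from the quote, for illustration only: pick the cube face from the dominant axis of the point's unit vector, then project centrally onto that face. The face numbering (0..5) and the function names are assumptions made here, not S2's actual conventions, and the extra non-linear transformation the article mentions is left out.

```python
import math

def latlng_to_unit_vector(lat_deg, lng_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lng = math.radians(lng_deg)
    return (math.cos(lat) * math.cos(lng),
            math.cos(lat) * math.sin(lng),
            math.sin(lat))

def cube_face(lat_deg, lng_deg):
    """Return (face, u, v) with face in 0..5 and u, v in [-1, 1].

    Hypothetical numbering: faces 0, 1, 2 are +x, +y, +z and
    faces 3, 4, 5 are -x, -y, -z.
    """
    p = latlng_to_unit_vector(lat_deg, lng_deg)
    # The component with the largest magnitude decides which face is hit.
    axis = max(range(3), key=lambda i: abs(p[i]))
    face = axis if p[axis] > 0 else axis + 3
    # Central projection: scale the other two components by the dominant one.
    others = [p[i] for i in range(3) if i != axis]
    u = others[0] / abs(p[axis])
    v = others[1] / abs(p[axis])
    return face, u, v
```

Each face can then carry its own quadtree over (u, v), so the poles are ordinary interior points of a face rather than singularities.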

~~~
jandrewrogers
There is a subtle concept at work here that many people don't know but which
manifests in geospatial data models: the representation you use to shard data
should be homeomorphic to the intrinsic topology of the data model. Using
cartographic projections is popular but does not meet this criterion, and it
does eventually break for non-trivial geospatial data models. A cube, on the
other hand, is homeomorphic to a spheroid (like a donut is to a coffee cup)
and therefore capable of practically representing much more complex data
models.

That said, using a cube projection has its own set of limitations and issues
for advanced geospatial analytics even though it is well-behaved for sharding.
Current best practice representations embed a spheroid in a synthetic 3-space
and shard the 3-space, which has few edge cases to worry about and is very
efficient in time and space.

~~~
tel
> the representation you use to shard data should be homeomorphic to the
> intrinsic topology of the data model

Why is that?

~~~
jandrewrogers
Some valid relationships in the data model may not be representable in the
sharding scheme. As a simple example, this is why many projection-based
sharding schemes do not allow geometries in the polar regions. For any
sufficiently rich analytical data model, you will eventually run into data or
derived relationships that can't be properly represented in the system.

On a more practical level, non-homeomorphic representations also create a
large number of additional edge cases that need to be handled to ensure
correctness, many of which are obscure and non-obvious. In most
implementations (including open source), developers tend to ignore many of
these defects because they rarely affect simple mapping applications -- the
reason they used a projection based representation in the first place is
because it was easy. For complex, massive-scale geospatial analytics,
customers have an uncanny ability to find these edge cases almost immediately.

~~~
jillesvangurp
I spent some time adapting some geometry algorithms to work with
geospatial coordinates a few years ago. The problem with the poles is that
latitude converges to 90 degrees whereas longitude degrees are all over the
place if you move even slightly. This combined with precision issues with
floating point math causes all sorts of issues.

A good example is a simple algorithm I did to draw circles on a map by turning
them into polygons:
[https://github.com/jillesvangurp/geogeometry/blob/master/src...](https://github.com/jillesvangurp/geogeometry/blob/master/src/main/java/com/jillesvangurp/geo/GeoGeometry.java#L693)

This works perfectly fine if you stay away from the poles but if you get close
enough the circles become a bit irregular. The algorithm tries to work around
some of the issues but the results don't look pretty.
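
For comparison, here is a rough sketch of one way to turn a circle into a polygon. It is not the linked Java code: it swaps in the standard great-circle destination formula on a spherical Earth, and the function name, the mean radius constant, and the segment count are assumptions.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation

def circle_to_polygon(lat_deg, lng_deg, radius_m, segments=36):
    """Approximate a circle on a sphere as a closed ring of (lat, lng) points."""
    lat1 = math.radians(lat_deg)
    lng1 = math.radians(lng_deg)
    delta = radius_m / EARTH_RADIUS_M  # angular distance in radians
    points = []
    for k in range(segments):
        theta = 2.0 * math.pi * k / segments  # bearing from the centre
        sin_lat2 = (math.sin(lat1) * math.cos(delta)
                    + math.cos(lat1) * math.sin(delta) * math.cos(theta))
        sin_lat2 = max(-1.0, min(1.0, sin_lat2))  # guard against rounding
        lat2 = math.asin(sin_lat2)
        lng2 = lng1 + math.atan2(
            math.sin(theta) * math.sin(delta) * math.cos(lat1),
            math.cos(delta) - math.sin(lat1) * sin_lat2)
        # Wrap the longitude back into [-180, 180).
        lng2_deg = (math.degrees(lng2) + 180.0) % 360.0 - 180.0
        points.append((math.degrees(lat2), lng2_deg))
    points.append(points[0])  # close the ring
    return points
```

Near a pole the points stay valid, but the ring's longitudes sweep across the whole range, so drawing it on a projected map still needs special handling.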

Other issues I encountered were several datasources with invalid degrees due
to rounding errors. This is an issue along the dateline (180 degrees
longitude). E.g. 180.0000001 degrees is invalid.
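
A small defensive sketch for that kind of input, assuming the out-of-range values really are rounding noise. Whether to wrap or to reject such values is a policy choice, and both helper names are made up here.

```python
def normalize_longitude(lng_deg):
    """Wrap any longitude into [-180, 180); note that 180 itself maps to -180."""
    return (lng_deg + 180.0) % 360.0 - 180.0

def clamp_latitude(lat_deg):
    """Clamp latitude into [-90, 90]; latitude does not wrap around."""
    return max(-90.0, min(90.0, lat_deg))
```

With this, 180.0000001 wraps to roughly -179.9999999, which is the same meridian seen from the other side of the dateline.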

Another fun edge case in geo is Null Island, a fictional island off the coast of
Africa at (0,0) that has become a fun little easter egg in many datasources. A
friend of mine dedicated this website to it:
[https://www.vicchi.org/2014/04/05/welcome-to-the-republic-of...](https://www.vicchi.org/2014/04/05/welcome-to-the-republic-of-null-island/)

------
kgraves
Tinder's engineering team wrote a post about using Hilbert curves for their
geolocation-based recommendations engine [1].

Part 2 of Tinder post is still in my bookmarks to read, but already I am
loving Zenly's in-depth analysis of their findings!

It seems that using Google's S2 library is pretty much standard for this
problem, I'm curious if other companies are doing this too?

[1] [https://tech.gotinder.com/geosharded-recommendations-part-1-...](https://tech.gotinder.com/geosharded-recommendations-part-1-sharding-approach-2/)

------
dylrich
I'm interested in their reasons for not using an R-Tree or an R*-Tree for the
index. I know they mentioned the debate in this post but I'm quite curious how
they arrived at their decision. Have they done performance tests with both
methods? Many other high performance applications use R-Tree based structures,
and I've always been under the impression that R-Trees usually outperform
Quadtree style indexes in PIP queries.

~~~
repsilat
Funny, almost every r-tree in the codebase at a former job of mine eventually
became a performance problem and got replaced by square fixed-size buckets.

Of course, it helped that most of the queries we did could be phrased like
"Find all things within a fixed (and known ahead of time) radius of this
point." R-trees are much more versatile, but much slower to query and much
_much_ slower to construct/maintain.
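
A minimal sketch of what such fixed-size buckets could look like, assuming planar coordinates (for example metres after a local projection). The class and method names are invented for illustration and are not from any particular codebase.

```python
import math
from collections import defaultdict

class FixedGrid:
    """Points bucketed into square cells of a fixed size (planar coordinates)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Integer cell coordinates; floor keeps negative values consistent.
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def within(self, x, y, radius):
        """Return every item whose point lies within `radius` of (x, y)."""
        cx0, cy0 = self._cell(x - radius, y - radius)
        cx1, cy1 = self._cell(x + radius, y + radius)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # .get avoids creating empty buckets while querying.
                for px, py, item in self.cells.get((cx, cy), ()):
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                        hits.append(item)
        return hits
```

If the cell size is chosen to be at least the known query radius, a query touches at most a 3x3 block of buckets, and inserts are a single dictionary append.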

------
MauranKilom
Any particular reason why you'd use a Hilbert curve instead of, say, a Z curve
([https://en.wikipedia.org/wiki/Z-order_curve](https://en.wikipedia.org/wiki/Z-order_curve))?
Conversion to and from actual coordinates seems much more straightforward for
that one (just bit [de]interleaving).

I mean, any Quad/Octree/N-dimensional equivalent can have its cells numbered
by giving each quadrant/octant/each of the 2^N sub-cells a certain bit
combination and then chaining those together as you descend the tree. The
Hilbert curve version is just a special case of this with complicated rules
for the "sub cell" <-> "bit sequence" mapping. If you were to use a Z curve,
the resulting data format and querying algorithm would be exactly identical to
the one in the article, just a lot less complicated in the (not presented)
details of "where is this child" than the Hilbert version...
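
The bit (de)interleaving mentioned above can be sketched as follows. The quantisation of lat/lon onto an integer grid is an assumption made for this example; it expects in-range coordinates, and the function names are illustrative.

```python
def interleave_bits(x, y, bits=16):
    """Z-order (Morton) code: x takes the even bit positions, y the odd ones."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def deinterleave_bits(code, bits=16):
    """Inverse of interleave_bits: recover (x, y) from a Z-order code."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y

def latlng_to_z(lat_deg, lng_deg, bits=16):
    """Quantise lat/lon onto a 2^bits x 2^bits grid and return its Z-order code."""
    scale = (1 << bits) - 1
    x = int((lng_deg + 180.0) / 360.0 * scale)
    y = int((lat_deg + 90.0) / 180.0 * scale)
    return interleave_bits(x, y, bits)
```

Dropping the two lowest bits (`code >> 2`) gives the parent cell's code, which is the "chain the quadrant bits as you descend the tree" property described above.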

~~~
digsmahler
The "where is this child" query is unaffected by the ordering, whether
Z-order, Hilbert, or other. However querying "which children are in this area"
requires that you come up with corresponding ranges along the curve. This is
where the Hilbert curve is slightly better because in many cases the same area
can be covered by fewer ranges.

Follow-up questions: How does the number of ranges compare across the different
orderings? How much does having fewer range segments affect database query
performance? Does it make up for the added computational complexity of Hilbert
curves? I don't have answers, but these can be answered by science.
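
One way to start on the first question empirically is to count, for a rectangular query, how many contiguous index ranges each ordering needs. This is a rough sketch: the Hilbert mapping is the textbook iterative xy-to-index conversion, `count_ranges` and `average_ranges` are helper names invented here, and the brute-force enumeration only suits small grids.

```python
def z_index(x, y, order):
    """Z-order index of cell (x, y) on a 2^order x 2^order grid."""
    d = 0
    for i in range(order):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d

def hilbert_index(x, y, order):
    """Hilbert index of cell (x, y) on a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def count_ranges(index_fn, x0, y0, x1, y1, order):
    """Number of contiguous index runs covering the inclusive rectangle."""
    ids = sorted(index_fn(x, y, order)
                 for x in range(x0, x1 + 1)
                 for y in range(y0, y1 + 1))
    return 1 + sum(1 for a, b in zip(ids, ids[1:]) if b != a + 1)

def average_ranges(index_fn, order, width, height):
    """Average run count over every placement of a width x height rectangle."""
    n = 1 << order
    counts = [count_ranges(index_fn, x0, y0, x0 + width - 1, y0 + height - 1, order)
              for x0 in range(n - width + 1)
              for y0 in range(n - height + 1)]
    return sum(counts) / len(counts)

if __name__ == "__main__":
    for name, fn in (("z-order", z_index), ("hilbert", hilbert_index)):
        print(name, average_ranges(fn, order=6, width=5, height=5))
```

The other two questions depend on the database and workload, so they would need measurements against a real store rather than a toy grid.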

------
isaachier
Shameless plug: why not use H3
([https://github.com/uber/h3](https://github.com/uber/h3))?

~~~
lainga
what does H3 gain over S2 in exchange for cells at different levels no longer
matching up? do you think their use case justifies that change?

~~~
ISV_Damocles
H3 has advantages in the analysis of the gathered data. Movement of users
between cells is guaranteed to go through edges instead of points, so you can
do flow analyses with electric current modeling.

A hex grid is the most efficient way to pack circles and is therefore the best
"pixel" type to approximate radii, so simply choosing a hex size best matching
the desired query radius can give you a very fast nearest-neighbor
approximation.

And H3 retains all of S2's good features like hexagons following a curve (not
the Hilbert curve, though) so hexagon IDs of similar value will more than
likely be near each other, making range queries from a database still useful.

------
speleo
Didn't Randall Munroe invent this technique?

[https://xkcd.com/195/](https://xkcd.com/195/)

~~~
detaro
Using Hilbert curves for accessing spatial data is way older, going back at
least to the early 1990s. (And arguably that xkcd works the other way round,
mapping a one-dimensional thing (IPs) to a 2D image.)

