Damn Cool Algorithms: Quadtrees and Hilbert Curves

jrockway · on Jan 19, 2012

This might be obvious to many, but spatial indexes are good for more than just "show me the restaurants within five miles of me". It's a good technique to know for efficiently searching any two columns simultaneously. If you have, for example, a database of people, their hometown, and their favorite color, and you want to know all people from Chicago that like green, a spacial index over color and hometown will be vastly more efficient than a "regular" single-column index over any one of those columns. (The reason is that an index on one column doesn't efficiently produce a small resultset. If we index on color and there are five colors, an index over color only reduces the dataset by 1/5th. Similarly, indexing over hometown might reduce the size of the dataset by 1/1000th. But if we index over both columns at the same time, then the only results that the index will ever produce are results that are the exact answer to our question. No unnecessary scanning!)

Anyway, lat/long might be the easiest "pair" of columns to imagine when discussing anything "spatial", but most other two-column pairs work the exact same way.

tom_b · on Jan 19, 2012

Yes - I remember using these techniques addressing 2d dense matrices. The specific case was when the matrices involved grew so large that the linear access operations were causing page faults for sequential row or column access. This was a perfect example of pathologically bad memory accesses when data was stored in row-major or column-major layouts.

Very impressive speedups for these interleaved data layouts. The end goal was to embed these layouts into the compiler tools, so that the programmer would do normal two dimensional matrix allocations and would get the optimized layout based on analysis.

I wrote terrible C to accomplish these layouts and minimize the instruction counts.

Smerity · on Jan 19, 2012

Whilst I was an intern at Google I worked with Hilbert Curves. Range queries are something BigTable and Cassandra excel at and are the core database component of any Hilbert Curve based spatial index.

The real issue is building the database queries. Google released Uzaygezen[1], a Java library for multi-dimensional indexing, that handles the majority of this for you.

[1]: http://code.google.com/p/uzaygezen/

manmal · on Jan 19, 2012

I did not get this in the article - how exactly do we construct range queries (which the Hilbert curve improves on) here? Is the quadtree range resolved into actual coordinates, or is the datatype of the index bound to be integer, so we can query a range over the index (0110110 to 0110111 = 54 to 55, so the query range is [54, 55])?

emson · on Jan 19, 2012

Thanks Smerity that was really interesting - we are looking at applying special searches to Riak.

robrenaud · on Jan 18, 2012

This technique is simultaneously cool and depressing. Remember all that cool stuff you learned in computational geometry? Throw it out.

Just associate geometry with a set of well chosen ids, and you can stuff it into really simple, conventional data structures and do all of the spatial operations you wanted to quickly and simply in practice.

willvarfar · on Jan 19, 2012

I'm a bit confused. I've naively put axis-aligned bounding-boxes (AABBs) in my octree without any kind of thought. I put them in the smallest box that contains them. I also keep a byte using bits to mark the sectors they intersect, a cache that greatly cheapens things.

It works very well. I even keep small things like bullets in my octree (most game engines keep them separate), putting them in an AABB representing their position for some duration.

I am doing my queries in nanoseconds.

Its worth saying that for big worlds (think minecraft++), octrees don't work as well as zoning systems e.g. pages. You can of course use a hierarchy, using pages or zones with octrees in them.

jasonkolb · on Jan 19, 2012

Does anyone know if there's a commercial database using quadtrees or hilbert curves? (Aside from the libraries mentioned in the comments, I mean something more productized)

abbotao · on Jan 19, 2012

Oracle used to use a similar fractal curve system for their spatial index, but has since moved to R-trees.

I believe MongoDB is using a geohash system for their spatial queries as well.

rcfox · on Jan 18, 2012

Thanks, I had never heard of geohashes before, but it sounds like they might really help with a problem I've been having lately.

manmal · on Jan 19, 2012

How would someone implement this, then? Is it really just an index column besides the data tuples, containing the recursive quadrant position (111111 for bottom-right in a 3-level grid)? And, what datatype would you use for this - clearly, an integer is not long enough if you have a great number of levels..

manmal · on Jan 19, 2012

Ok I just wrapped my head around it. For querying to work here, the datatype HAS to be integer or long, or else queries on higher levels would not return the tuples associated with lower-level-quads.

on Jan 19, 2012

[deleted]

sodiumphosphate · on Jan 19, 2012

I believe you commented on the wrong article.

philf · on Jan 19, 2012

The Wikipedia description of Geohashes is a bit odd in that its base32 decoding uses the alphabet with some letters (a, i...) left out instead of simply using [0-9a-v]. Anyone know why that would make sense?