This might be obvious to many, but spatial indexes are good for more than just "show me the restaurants within five miles of me". It's a good technique to know for efficiently searching any two columns simultaneously. If you have, for example, a database of people, their hometown, and their favorite color, and you want to know all people from Chicago that like green, a spacial index over color and hometown will be vastly more efficient than a "regular" single-column index over any one of those columns. (The reason is that an index on one column doesn't efficiently produce a small resultset. If we index on color and there are five colors, an index over color only reduces the dataset by 1/5th. Similarly, indexing over hometown might reduce the size of the dataset by 1/1000th. But if we index over both columns at the same time, then the only results that the index will ever produce are results that are the exact answer to our question. No unnecessary scanning!)
Anyway, lat/long might be the easiest "pair" of columns to imagine when discussing anything "spatial", but most other two-column pairs work the exact same way.
Yes - I remember using these techniques addressing 2d dense matrices. The specific case was when the matrices involved grew so large that the linear access operations were causing page faults for sequential row or column access. This was a perfect example of pathologically bad memory accesses when data was stored in row-major or column-major layouts.
Very impressive speedups for these interleaved data layouts. The end goal was to embed these layouts into the compiler tools, so that the programmer would do normal two dimensional matrix allocations and would get the optimized layout based on analysis.
I wrote terrible C to accomplish these layouts and minimize the instruction counts.
Whilst I was an intern at Google I worked with Hilbert Curves. Range queries are something BigTable and Cassandra excel at and are the core database component of any Hilbert Curve based spatial index.
The real issue is building the database queries. Google released Uzaygezen[1], a Java library for multi-dimensional indexing, that handles the majority of this for you.
I did not get this in the article - how exactly do we construct range queries (which the Hilbert curve improves on) here? Is the quadtree range resolved into actual coordinates, or is the datatype of the index bound to be integer, so we can query a range over the index (0110110 to 0110111 = 54 to 55, so the query range is [54, 55])?
This technique is simultaneously cool and depressing. Remember all that cool stuff you learned in computational geometry? Throw it out.
Just associate geometry with a set of well chosen ids, and you can stuff it into really simple, conventional data structures and do all of the spatial operations you wanted to quickly and simply in practice.
I'm a bit confused. I've naively put axis-aligned bounding-boxes (AABBs) in my octree without any kind of thought. I put them in the smallest box that contains them. I also keep a byte using bits to mark the sectors they intersect, a cache that greatly cheapens things.
It works very well. I even keep small things like bullets in my octree (most game engines keep them separate), putting them in an AABB representing their position for some duration.
I am doing my queries in nanoseconds.
Its worth saying that for big worlds (think minecraft++), octrees don't work as well as zoning systems e.g. pages. You can of course use a hierarchy, using pages or zones with octrees in them.
Does anyone know if there's a commercial database using quadtrees or hilbert curves? (Aside from the libraries mentioned in the comments, I mean something more productized)
How would someone implement this, then? Is it really just an index column besides the data tuples, containing the recursive quadrant position (111111 for bottom-right in a 3-level grid)? And, what datatype would you use for this - clearly, an integer is not long enough if you have a great number of levels..
Ok I just wrapped my head around it. For querying to work here, the datatype HAS to be integer or long, or else queries on higher levels would not return the tuples associated with lower-level-quads.
The Wikipedia description of Geohashes is a bit odd in that its base32 decoding uses the alphabet with some letters (a, i...) left out instead of simply using [0-9a-v]. Anyone know why that would make sense?
Anyway, lat/long might be the easiest "pair" of columns to imagine when discussing anything "spatial", but most other two-column pairs work the exact same way.