Great write-up! I cringe whenever I see that blog post; other spatial indexing systems at Uber were and are designed more thoughtfully, and with better performance characteristics.
The article didn't really go into detail about how this works within each city.
I am surprised they're scanning each city linearly; even a crude index (such as bounding boxes) or a binary search would greatly improve the lookup time. I guess if the data fits into the CPU cache, the order in which these are scanned doesn't matter so much?
Seriously. S2 is just quadtrees over a cubemap, with a nice mapping from integers to cells.
So, instead of using a quadtree (because it's "complicated"), they chose to make their own two-level tree requiring O(N) linear scans at both levels, as opposed to O(1) with a simple trick.
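(The "simple trick" here is presumably something like a Morton code: interleave the bits of the quantized coordinates, and the resulting integer addresses the quadtree cell directly, no tree walk needed. A rough Go sketch, names mine:)

    // Interleave the bits of quantized x/y coordinates into a
    // Z-order (Morton) code; the result identifies a quadtree cell
    // directly, so point -> cell is O(1) with no tree traversal.
    func morton(x, y uint32) uint64 {
        var m uint64
        for i := uint(0); i < 32; i++ {
            m |= uint64(x>>i&1) << (2 * i)
            m |= uint64(y>>i&1) << (2*i + 1)
        }
        return m
    }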
S2 was much less well maintained (in the OSS release) when Uber wrote this. Even 4-5 months ago, all that was available was a mirror from when Google Code hosting shut down, with no documentation. This repo is new (and great to see).
Library also available in Java[0] and to an extent in Golang[1], though the latter is incomplete.
S2 is fantastic for geofencing. Arbitrary regions can be covered by a set of S2 cells of varying level. A point (lat/lon) can be converted to an S2 cell at the finest level (covering around a cm^2 of area on the sphere). Checking whether one S2 cell is contained in another amounts to an integer range check, since all children of a given S2 cell have IDs that fall within a fixed range. See also [2]
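To make the range check concrete, here's a minimal sketch using the Go port mentioned above (github.com/golang/geo/s2); the point and level are arbitrary:

    package main

    import (
        "fmt"

        "github.com/golang/geo/s2"
    )

    func main() {
        // Leaf cell (level 30, roughly a cm^2) for an arbitrary point.
        point := s2.CellIDFromLatLng(s2.LatLngFromDegrees(40.7484, -73.9857))

        // A coarser cell containing it, e.g. its level-12 ancestor.
        region := point.Parent(12)

        // Containment is just an integer range check: every descendant's
        // ID falls within [RangeMin, RangeMax] of the ancestor.
        fmt.Println(region.RangeMin() <= point && point <= region.RangeMax()) // true
        fmt.Println(region.Contains(point)) // same check, via the library
    }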
4500 queries per second per node doesn't sound like much.
"100s" of linear polygon scans could likely turn into a few, plus a quad tree or spatial grid hash lookup. I don't quite understand why those approaches are considered "complex?" Every simplistic game engine does this as soon as it implements collision.
Finally, choosing languages for workloads is fine. Python and JavaScript are famously non-threaded. But what about Java, C#, or C++?
And why is the index updated in-process, instead of spinning up a new process with the new index and then switching over which one takes requests?
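(For what it's worth, staying in-process doesn't have to mean locking either: you can build the new index off to the side and publish it with one atomic pointer swap. A minimal Go sketch, with Index as a stand-in type, requires Go 1.19+:)

    package geofence

    import "sync/atomic"

    // Index is a stand-in for an immutable, fully built geofence index.
    type Index struct{ /* fields elided */ }

    var current atomic.Pointer[Index]

    // Lookup path: readers never block on the refresher.
    func activeIndex() *Index { return current.Load() }

    // Refresh path: build the replacement off to the side, then
    // publish it in one atomic store; in-flight reads finish on
    // the old index, which the GC reclaims once they're done.
    func refresh(build func() *Index) { current.Store(build()) }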
Not saying what they built doesn't work, just curious about options.
> Finally, choosing languages for workloads is fine. Python and JavaScript are famously non-threaded.
Python has famously misunderstood threading support. It is certainly single-threaded sometimes (or rather, multithreaded but with only one thread running at a time, even on a multi-core machine). But it can be usefully multithreaded if most of your computation is done in a C extension module that releases the GIL while it's being called, e.g. numpy. If such a wrapper for the C++ S2 library exists (I haven't checked), it would be viable for this task.
I agree that choosing languages for workloads is fine, though, and admittedly even in the best case Python would still not be as efficient as Go.
It's interesting - their use case requires consistent background refresh, which might make PostGIS less optimal, but I'm unsure what rate they were doing it at.
Query-wise, PostGIS should be able to use geohashes, which can do these lookups in O(1) time (S2 geometry would be similar). That said, indexing scales terribly with geohashing/S2 at high granularity levels, though they wouldn't need anything particularly granular for this (it seems fine to be off by tens or hundreds of meters).
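(Concretely, the geohash lookup is a prefix check: hash the point once, then test it against each region cell's prefix, or keep the cells in a sorted set. A sketch using the github.com/mmcloughlin/geohash Go package; the cell value is an assumption for illustration:)

    package main

    import (
        "fmt"
        "strings"

        "github.com/mmcloughlin/geohash"
    )

    func main() {
        cell := "dr5ru" // a 5-char geohash cell, a few km across
        point := geohash.Encode(40.7484, -73.9857)

        // A point lies inside a cell iff its hash has the cell as a
        // prefix, so containment needs no geometry at query time.
        fmt.Println(strings.HasPrefix(point, cell))
    }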
I know it's fun to hate on Uber here, and there are plenty of reasons to do so. But this is surprising to me. I was at one of the Uber tech meetups in NYC last year or maybe a year and a half ago, and it seemed like a total circus. The guy who was championing golang was up there talking about how cool go is and that they were rewriting all their microservices in it just because go is cool.
He also talked about how many problems it had caused them and all that fun stuff. I left that meetup thinking two things: 1. Who the fuck breaks stuff just to use a cool new language? 2. Good lord, I really don't ever want to work on this team. It's chaos.
> Instead of indexing the geofences using R-tree or the complicated S2, we chose a simpler route based on the observation that Uber’s business model is city-centric; the business rules and the geofences used to define them are typically associated with a city. This allows us to organize the geofences into a two-level hierarchy where the first level is the city geofences (geofences defining city boundaries), and the second level is the geofences within each city.
What about Uber's business model impacted this choice of algorithm?
I'm not quite sure what you are referring to by strict background checks, but here's how I think the service is used. A user initiates a call to the service requesting a ride. The service takes the user's location and first finds the city (first tier), then searches the geofences in that city and returns the ones that contain the user's location. These geofences might be attached to drivers, and those drivers might be analyzed to find good ones to send that ride request to, or they could be something like neighborhoods, where Uber would search that neighborhood for drivers. I'd guess that city-specific requirements could be associated with geofences in either tier.
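A rough sketch of that two-tier lookup in Go (the types and the ray-casting helper are mine, not Uber's actual code):

    package geofence

    type Point struct{ Lat, Lng float64 }

    type Geofence struct {
        Name    string
        Polygon []Point
    }

    type City struct {
        Boundary Geofence   // tier 1: the city boundary
        Fences   []Geofence // tier 2: geofences within the city
    }

    // contains is the standard ray-casting point-in-polygon test.
    func contains(poly []Point, p Point) bool {
        in := false
        for i, j := 0, len(poly)-1; i < len(poly); j, i = i, i+1 {
            if (poly[i].Lat > p.Lat) != (poly[j].Lat > p.Lat) &&
                p.Lng < (poly[j].Lng-poly[i].Lng)*(p.Lat-poly[i].Lat)/
                    (poly[j].Lat-poly[i].Lat)+poly[i].Lng {
                in = !in
            }
        }
        return in
    }

    // Lookup scans cities linearly, then the fences of any matching
    // city -- the O(N) scans being questioned elsewhere in the thread.
    func Lookup(p Point, cities []City) []Geofence {
        var hits []Geofence
        for _, c := range cities {
            if !contains(c.Boundary.Polygon, p) {
                continue
            }
            for _, f := range c.Fences {
                if contains(f.Polygon, p) {
                    hits = append(hits, f)
                }
            }
        }
        return hits
    }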
Disagree. It's impertinent, both in the sense of its being off-topic for the technical achievement in The Fine Article, and that of its being ill-mannered.
I say that as someone who assiduously avoids using Uber's services, who finds what I know of their culture to be odious, and who has strongly criticized some of their other technical decisions, so please don't construe my comment as fanboy defensiveness.
https://medium.com/@buckhx/unwinding-uber-s-most-efficient-s...