
Geosharded Recommendations with Hilbert Curve at Tinder - realfun
https://tech.gotinder.com/geosharded-recommendations-part-1-sharding-approach-2/
======
jamesrom
Using a Hilbert Curve for indexing is a pretty standard approach for
geospatial queries. Nothing new here...

However, using a Hilbert Curve for sharding doesn't seem like the best
approach. You can partition by anything you like; it doesn't have to be
arbitrary points along your index. Using one dimension to shard 2D data isn't
optimal.

For example, construct a heatmap of your 'load score' and shard based on that,
in two dimensions. Then use an S2 curve to index inside that shard.
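
A toy sketch of that idea (the grid, numbers, and splitting rule here are all
made up for illustration): build a coarse "load score" heatmap, then
recursively bisect the map along its wider axis so each shard carries roughly
equal load.

```python
def split_by_load(grid, x0, y0, x1, y1, shards, depth):
    """Recursively bisect [x0,x1) x [y0,y1) so both halves hold ~equal load."""
    if depth == 0 or (x1 - x0 <= 1 and y1 - y0 <= 1):
        shards.append((x0, y0, x1, y1))
        return
    total = sum(grid[y][x] for y in range(y0, y1) for x in range(x0, x1))
    if x1 - x0 >= y1 - y0:  # split along the wider axis
        running, cut = 0, x0 + 1
        for x in range(x0, x1 - 1):
            running += sum(grid[y][x] for y in range(y0, y1))
            if running >= total / 2:
                cut = x + 1
                break
        split_by_load(grid, x0, y0, cut, y1, shards, depth - 1)
        split_by_load(grid, cut, y0, x1, y1, shards, depth - 1)
    else:
        running, cut = 0, y0 + 1
        for y in range(y0, y1 - 1):
            running += sum(grid[y][x] for x in range(x0, x1))
            if running >= total / 2:
                cut = y + 1
                break
        split_by_load(grid, x0, y0, x1, cut, shards, depth - 1)
        split_by_load(grid, x0, cut, x1, y1, shards, depth - 1)

# 4x4 toy heatmap: most of the load sits in the top-left corner
heat = [[8, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]]
shards = []
split_by_load(heat, 0, 0, 4, 4, shards, 2)
print(shards)  # 4 rectangular shards with roughly balanced load
```

The hot corner ends up in a tiny shard of its own while sparse regions get
merged into large ones, which is the point of load-based splitting.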

~~~
mbid
_However, using a Hilbert Curve for sharding doesn't seem like the best
approach._

Yes, that's also what I thought. Searching for "same size k-means" yields a
simple postprocessing step to even out the clusters produced by the usual
k-means algorithm.

EDIT: k-means is adapted directly here: [https://elki-project.github.io/tutorial/same-size_k_means](https://elki-project.github.io/tutorial/same-size_k_means)
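
A minimal sketch of the idea (a simplification, not a faithful port of the
ELKI algorithm): run plain k-means, then post-process by moving the
worst-fitting points out of over-full clusters into the nearest cluster that
still has room.

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def kmeans_same_size(points, k, iters=10):
    """Plain k-means plus a rebalancing pass so clusters end up equal-sized.

    Assumes len(points) is divisible by k."""
    rng = random.Random(0)
    centers = rng.sample(points, k)
    target = len(points) // k
    labels = [0] * len(points)
    for _ in range(iters):
        # ordinary nearest-center assignment
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        sizes = [labels.count(c) for c in range(k)]
        # rebalance: move the farthest points out of over-full clusters
        order = sorted(range(len(points)),
                       key=lambda i: -dist(points[i], centers[labels[i]]))
        for i in order:
            c = labels[i]
            if sizes[c] > target:
                for alt in sorted(range(k), key=lambda c2: dist(points[i], centers[c2])):
                    if sizes[alt] < target:
                        labels[i] = alt
                        sizes[c] -= 1
                        sizes[alt] += 1
                        break
        # recompute centers from the balanced clusters
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

pts = [(float(x), float(y)) for x in range(4) for y in range(4)]
labels = kmeans_same_size(pts, 4)
print([labels.count(c) for c in range(4)])  # every cluster holds 4 points
```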

~~~
n4r9
In my experience the linked algorithm behaves quite poorly when the dataset
and the number of desired clusters become large. I think the issue is that it
only allows for pairwise switching between clusters, which ends up with a
lot of points still being assigned to clearly the wrong cluster. Some day I
would like to try more complex neighbourhood searches with it and see if that
helps.

------
101km
Sometimes it feels like Groundhog Day, behold:
[https://medium.com/@buckhx/unwinding-uber-s-most-efficient-s...](https://medium.com/@buckhx/unwinding-uber-s-most-efficient-service-406413c5871d)

tl;dr - S2 is a good library.

There's also a joke in here somewhere about bikeshedding and yak shaving.

~~~
cornholio
I present to you _yak shedding_: pointless bickering about an irrelevant
detail that everyone has an opinion on, where the result of this decision
process has such fundamental implications that the whole project is re-
engineered from scratch to include a yak farm.

------
lainga
How big of a problem is the fact that shards constructed this way don't like
crossing the "big" boundaries in the S2 curve? Especially if a high-usage
location happens to straddle one of those boundaries (like, IIRC, Toronto)?

~~~
lixtra
You can compute which shard to use and if you’re in a border region you just
use all shards that overlap with your 100km radius.
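
A toy sketch of that idea (the tile size, shard count, and tile-to-shard
mapping here are all invented; the real system would use S2 cells): a query
near a border simply fans out to every shard whose tile intersects the search
circle's bounding box.

```python
import math

TILE_DEG = 1.0  # ~111 km of latitude per tile; illustrative only

def tile_of(lat, lon):
    return (math.floor(lat / TILE_DEG), math.floor(lon / TILE_DEG))

def shard_of(tile, n_shards=8):
    return hash(tile) % n_shards  # stand-in for the real tile->shard map

def shards_for_query(lat, lon, radius_km, n_shards=8):
    # approximate the search circle with its lat/lon bounding box
    dlat = radius_km / 111.0
    dlon = radius_km / (111.0 * max(math.cos(math.radians(lat)), 1e-6))
    t0 = tile_of(lat - dlat, lon - dlon)
    t1 = tile_of(lat + dlat, lon + dlon)
    shards = set()
    for ty in range(t0[0], t1[0] + 1):
        for tx in range(t0[1], t1[1] + 1):
            shards.add(shard_of((ty, tx), n_shards))
    return shards

# a 100 km query in a border region hits every overlapping shard
print(shards_for_query(43.65, -79.38, 100))  # Toronto-ish coordinates
```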

~~~
throwaway080383
Yeah it seems like the real bottleneck is that in these super dense areas,
there might be a ton of shards in a mid-size radius. This is inevitable since
in the middle of Manhattan, there are just a lot of people in a small area.

One solution would be to scale their recommendation radius with density; no
need to look farther than 5 miles for true love in Manhattan, but in Wyoming
you may have to drive a bit.
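
That density-scaled radius can be sketched in a few lines (the target count
and cap are invented numbers): pick the smallest radius whose area is
expected to contain some target number of candidates.

```python
import math

def radius_for(density_per_km2, target_candidates=500, max_km=160.0):
    """Smallest search radius expected to cover `target_candidates` people."""
    if density_per_km2 <= 0:
        return max_km  # nobody around: fall back to the cap
    r = math.sqrt(target_candidates / (math.pi * density_per_km2))
    return min(r, max_km)

print(round(radius_for(10000), 2))  # dense city: a fraction of a km
print(round(radius_for(2), 1))      # sparse area: several km
print(radius_for(0))                # empty area: the cap
```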

~~~
peteretep
The flip side though is that you don't need to show everyone to everyone, you
just need to make sure people have enough other people to swipe. This opens up
all sorts of other optimizations, and makes the use of an actual search engine
seem a bit weird to me.

~~~
lainga
That's tragic! What if the love of your life is on the other side of the
Hilbert curve, fated never to meet?

------
gregoriol
Is this an advertisement for S2? There is no description of alternatives, of
how the choice was made, no comparison to any other system or provider, ...

Redis has geo search, for example; why not use it?

~~~
realfun
geo-indexing is different from geo-sharding, search indexes normally have geo-
indexing, but it is still one big index handles every single search. The
alternative(geohash) is mentioned in article, which has high distortion around
earth poles.

------
EGreg
Let me share my experience building location based search at Qbix. All of it
is available now for free in the Places plugin:

[https://github.com/Qbix/Platform/tree/master/platform/plugin...](https://github.com/Qbix/Platform/tree/master/platform/plugins/Places)

First of all, we normally do our sharding by the hash of the primary key.
Typically it is the publisherId and the hash of the name of a “stream”, which
is our general dynamic data structure. What this does is essentially
distribute the load evenly between hashes without having to worry about the
“water” stuff mentioned above.

Ok now about location based search...

1. We use GeoHash for addressing the surface of the Earth. For lookups, we
calculate the actual corners of the lat-lng rectangle in Haversine distance
and then translate them to geohash.

2. Technically we could have just used a database index at this point and
been done with it. But we want to handle both PULL requests and PUSH
notifications when something happens nearby.

3. So given a radius the subscriber is interested in, we set up 4 streams for
SUBSCRIBERS that together (as rectangles) cover the circle surrounding the
location and radius.

4. Then we also set up streams for PUBLISHERS which correspond to all the
various drop-down values we have for the radius. This is complicated by the
fact that we have one set for the metric system and another for imperial. But
we just make one stream for each of them. The more radii available, the more
streams you have to post to (like 10-20), but posting doesn’t happen that
often and it’s better to precompute this stuff.

5. Relations between streams are one of the many features that streams do out
of the box. So basically the user can just subscribe to a certain location and
radius, interest, etc. and get either realtime updates or offline
notifications. The item being posted is itself a stream, and what’s being
inserted is relations between it and the category stream the person is
interested in.

Thus you can also see right away who is interested in what. In the future,
however, we plan to encrypt the data end to end so the servers won’t know this
info. Then the location search will be harder. (qbix.com/blog)
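
Step 1 above can be sketched roughly like this. The encoder is the standard
base-32 geohash algorithm; the bounding-box math is a flat-earth
simplification rather than the Haversine computation described, and the
radius and precision are made-up values.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Standard geohash: alternately bisect lon and lat, 5 bits per char."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, even, bits, ch = [], True, 0, 0
    while len(out) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = ch * 2 + (1 if val >= mid else 0)
        rng[0 if val >= mid else 1] = mid
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def corner_hashes(lat, lon, radius_km, precision=5):
    """Geohash the corners of the lat-lng box around (lat, lon)."""
    dlat = radius_km / 111.0
    dlon = radius_km / (111.0 * math.cos(math.radians(lat)))
    return {geohash(lat + dy, lon + dx, precision)
            for dy in (-dlat, dlat) for dx in (-dlon, dlon)}

print(corner_hashes(40.7, -74.0, 5))  # prefixes covering ~5 km around NYC
```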

~~~
realfun
Thanks for sharing EGreg.

Elasticsearch has geo-indexing as well (based on geohash internally), and by
default it does id hashing similar to what you said (murmurhash3); we actually
leverage that for location based searches.

The challenge addressed in the blog is not how to search/address (as said,
Elasticsearch handles that already); it is how to distribute the load so
calculation only happens on limited nodes, and to reduce the index size so it
can be more performant.

~~~
EGreg
Ah, the goal makes sense. I would suggest that it’s not so bad to have a
controller node fan out and fan in queries, as long as the database can handle
many concurrent queries. Essentially you’re distributing the work evenly
across nodes, but you don’t have affinity for a particular node. Yes, there is
more latency (it is as slow as the slowest connection), but it is endlessly
scalable. But I am sure I missed some benefits from localizing calculations
to only a node or two.

In the scheme above, by the way, it DOES localize searches on one shard.
Essentially all relations to a stream are on the same shard as the stream. And
each center+radius has one associated stream and therefore the search takes
place on one shard.
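
The fan-out/fan-in pattern described above looks roughly like this (shard
names and delays are stand-ins; the sleep simulates a network round trip).
Note the end-to-end latency tracks the slowest shard.

```python
import asyncio

async def query_shard(shard_id, delay):
    await asyncio.sleep(delay)  # stand-in for a network round trip
    return [f"result-from-shard-{shard_id}"]

async def fan_out(delays):
    # fan out: issue all shard queries concurrently
    results = await asyncio.gather(
        *(query_shard(i, d) for i, d in enumerate(delays)))
    # fan in: merge the per-shard result lists
    return [r for shard in results for r in shard]

merged = asyncio.run(fan_out([0.01, 0.03, 0.02]))
print(merged)
```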

------
eeZah7Ux
When implementing similar algorithms on a load-balancing reverse proxy,
cache-friendliness and lockless reads and updates of the "load scores" can be
way more important than optimizing for CPU cycles.

Another option would be to map coords onto very small 2D tiles and then look
up "tile -> shard number" with a simple, large array. A variable number of
tiles would land in the same shard. Fast and lockless reads and writes. I
wonder if it could be faster.
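
A sketch of that tile-array idea (tile size, shard count, and the toy mapping
are all invented): quantize coordinates into fixed tiles and resolve the
shard with a single array read, with no tree walk and no locks for readers if
updates swap the array atomically.

```python
TILES_X, TILES_Y = 360, 180  # 1-degree tiles; real tiles would be much smaller

# one shard id per tile; a variable number of tiles may share a shard
tile_to_shard = [(i * 7) % 16 for i in range(TILES_X * TILES_Y)]  # toy mapping

def shard_for(lat, lon):
    """One array index -> shard id; lat in [-90, 90], lon in [-180, 180)."""
    ty = min(int(lat + 90), TILES_Y - 1)
    tx = min(int(lon + 180), TILES_X - 1)
    return tile_to_shard[ty * TILES_X + tx]

print(shard_for(40.7, -74.0))
```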

------
realfun
Part 2: [https://tech.gotinder.com/geosharded-recommendations-part-2-...](https://tech.gotinder.com/geosharded-recommendations-part-2-architecture/)

------
Bedon292
Very interesting. I used to do something similar for a project I work on. We
used geohash-based sharding instead of complex curves like that. It worked
great, but with recent advances in ES it's no longer needed. They don't seem
to mention which version of ES they are on, and I would be curious how it has
changed over time in their experience.

~~~
realfun
The same strategy can be applied regardless of ES version. ES6 does have
index sorting and other optimizations, but they're not as good as pre-sharding
for location-based services in cases like Tinder.

------
bitL
Why not Z-order curve instead? Hilbert is more complex to compute...
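
For comparison, Z-order (Morton) encoding really is just bit interleaving,
which is why it's cheaper than a Hilbert index, but it has larger locality
"jumps": spatially adjacent points can land far apart on the curve. A small
sketch, not tied to the article's implementation:

```python
def morton(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

# neighbours across a power-of-two boundary land far apart on the curve
print(morton(3, 0), morton(4, 0))  # prints 5 16: adjacent cells, curve-distant
```

Hilbert avoids those big jumps at the cost of more per-point computation,
which matters for range queries that expand a spatial box into curve
intervals.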

~~~
aeroevan
They mentioned the distortion near the poles when using geohashing (which uses
z-order curves), but I doubt they have many users at the North or South Pole.

~~~
biggerfisch
They seem to have at least a few:
[https://www.theatlantic.com/technology/archive/2014/02/confi...](https://www.theatlantic.com/technology/archive/2014/02/confirmed-tinder-works-in-antarctica/283907/)

------
hidiegomariani
A bit off-topic, but it's funny that you use all this fancy technology behind
the scenes while your iOS app is buggy and slow, and always has problems with
login and with notifications that don't update.

TLDR; Awful

~~~
Lewton
Another upvote, it is by far the buggiest app I’ve ever used. It’s actually
really impressive in some weird way

~~~
psergeant
The impressive bit here is how network effects and brand recognition can save
you from God-awful tech

------
modells
I’m sorry, but Tinder and similar swipe-on-photos cargo-cult hookup apps have
de facto cannibalized the online dating scene with their entire focus on
appearance (and atrophying genuine communication), promoting a major
captological phenomenon of manipulating women into rating men more negatively
based on nothing other than appearances. It contributes to unrealistic
expectations, disposable interactions, the paradox of choice and a lack of
authenticity. If you’re not perfect and super photogenic with 4 professional
“candid” pics, you have zero chance on these anti-relationship apps. tldr:
it’s far phonier than it needs to be, encourages casual wasting of time and
leads to a society where people stop talking to/slam the door on the real
world. If you move to a new area, the solid, normal people aren’t going to be
on hookup apps, but psycho nymphos with herpes are most def on there.

~~~
pennaMan
I use an image of a strong interest I have as my pic, no personal info, not
even my real name. It acts as a strong filter, so far my experience has been
net positive.

I find it far more rewarding to try to hack the awful world around me instead
of complaining how awful the world is/is becoming.

------
ahakki
I’m pretty sure this was just upvoted because of the title.

~~~
smt88
My observation is that most upvotes are because of the title, and then a
minority of people will read and critique the article in the top-voted
comments.

It doesn't sound like a perfect system, but it actually does a wonderful job
of debunking very appealing-sounding articles.

~~~
Nition
The bigger problem is when people comment on and even discuss the title
without reading the article. It's extremely common on Reddit but luckily less
prevalent on Hacker News.

~~~
Theodores
...if the article is paywalled then at least you know some people tried to
read it, because they comment about the paywall.

Everyone else a) paid their way (as if), b) got a free pass as they don't
generally read articles, c) used that new trick for bypassing the paywall that
only people who read articles know about, or d) just blurted their opinion
without reading the article.

I know there have been changes on the internet lately and there is only so
much that one can 'stalk' in someone's browser, but is it possible to have a
comment system where you can only comment if the article is in the browser
history, with that page having been visited at least three minutes before the
comment is submitted?

If I go shopping online then I jump straight to the reviews and decide to buy
on that basis. The 'shoddy' product description sometimes doesn't get a
glance. On HN I find some articles are the same: the real insights are in the
comments, so a paywall does not deter me.

~~~
Nition
I don't mind when people only read the comments, honestly. As you say,
sometimes the comments are more insightful than the article. But I do think
that anyone _making_ a comment related to the article should have read the
article. Otherwise it's the blind leading the blind.

