

Locality-sensitive hashes are designed to cause collisions and are useful - gauravsc
https://github.com/andrewclegg/sketchy

======
tylerneylon
I created a useful LSH algorithm for real-valued vector data. Its strong suit
is that it is deterministic: it is guaranteed to find all close-enough
neighbors and to omit all far-enough neighbors. Most other LSH algorithms do
not provide both of these guarantees deterministically -- instead they strive
toward those goals probabilistically. It's also fast, conceptually simple, and
built on top of a very cool, non-obvious tessellation of arbitrarily high
dimensions.

Here are some slides on it:

[https://files.pbworks.com/download/dJjN51z5uR/hackerdojo/271...](https://files.pbworks.com/download/dJjN51z5uR/hackerdojo/27189819/lsh_hacker_dojo_talk.pdf)

This work was also published in SODA 10:

[https://www.siam.org/proceedings/soda/2010/SODA10_094_neylon...](https://www.siam.org/proceedings/soda/2010/SODA10_094_neylont.pdf)

------
marcusf
I just spent an inordinate amount of time reading up on LSH for images. In the
system we develop, people tend to upload images from their desktop time and
again instead of searching in the system. Part of it is a UI challenge: we
have to make search better. But when they upload the same image repeatedly, it
would be good to run some kind of LSH to check whether the image is already in
our database (like TinEye). I started out with LSH via random projection [1]
and got further down the rabbit hole from there.
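
As a rough illustration of LSH via random projection (a sketch of the general technique, not code taken from [1]): each bit of the signature records which side of a random hyperplane the feature vector falls on, so vectors pointing in similar directions tend to share most bits. The function names, the 16-bit signature length, and the toy 4-dimensional vector are assumptions made for the example.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """Pack one bit per hyperplane: 1 if vec is on the non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(dim=4, n_bits=16, seed=1)
v = [0.2, -1.0, 3.5, 0.7]
same_direction = [2 * x for x in v]   # scaled copy, same direction
opposite = [-x for x in v]            # points the other way

print(hamming(lsh_signature(v, planes), lsh_signature(same_direction, planes)))
print(hamming(lsh_signature(v, planes), lsh_signature(opposite, planes)))
```

The signature only depends on direction, not magnitude, so the hard part remains the one described here: choosing a feature vector for which "similar direction" actually means "same image".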

What stumped me was creating a good enough feature vector, balancing size with
information. We have a hack day tomorrow, might pick it up again.

[1]
[https://engineering.purdue.edu/~malcolm/yahoo/Slaney2008(LSH...](https://engineering.purdue.edu/~malcolm/yahoo/Slaney2008\(LSHTutorialDraft\).pdf)

------
gms
For anyone looking to use this, please note that the benefits of LSH rapidly
diminish if your nearest neighbours are in fact far away.
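
One way to see this, assuming the random-hyperplane (sign projection) family rather than any particular library: a single bit collides for two vectors at angle theta with probability 1 - theta/pi, so a bucket keyed on n independent bits matches with probability (1 - theta/pi)^n. The helper name and the sample angles below are made up for the illustration.

```python
import math


def collision_probability(angle_rad, n_bits):
    """Chance that two vectors at the given angle agree on all n_bits
    random-hyperplane bits, i.e. land in the same bucket."""
    return (1.0 - angle_rad / math.pi) ** n_bits


for degrees in (5, 30, 60, 90):
    p = collision_probability(math.radians(degrees), n_bits=16)
    print(f"{degrees:>3} degrees apart: P(same bucket) = {p:.6f}")
```

The probability falls off quickly with the angle, so when the true nearest neighbour is itself far away it is rarely in the same bucket as the query, and you need many more tables (or fewer bits) to find it.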

~~~
arnoldoMuller
My startup has a fast nearest-neighbor engine that also handles far matches:
<http://simmachines.com/Products/r01.html> We hope to be the "Berkeley DB" of
the big data era. Feedback would be greatly appreciated!

~~~
nknight
> _Feedback would be greatly appreciated!_

My main piece of feedback is that explicitly comparing your product to a
database engine that people love to hate is probably not a great way to market
it. Lots of us have horrible memories of badly-corrupted Berkeley DB
databases.

~~~
arnoldoMuller
Interesting point, thank you!

------
lclarkmichalek
Kind of related: I recently wrote a blog post on how to do fuzzy
location-aware matchmaking using Redis and geohash. It exploits similar
properties, namely that as you reduce the precision of the geohash, you get
more collisions. You can find it here:
[http://www.generictestdomain.net/Redis/2012/05/07/location-a...](http://www.generictestdomain.net/Redis/2012/05/07/location-aware-matchmaking-with-redis/)
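
A small sketch of the precision/collision trade-off, using a hand-rolled geohash encoder and an in-memory dict in place of Redis (this is not the code from the blog post; the player names and coordinates are invented for the example):

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Standard geohash: interleave longitude/latitude bisection bits,
    starting with longitude, and emit one base-32 character per 5 bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def bucket_players(players, precision):
    """Group player names by the geohash of their position."""
    buckets = defaultdict(list)
    for name, (lat, lon) in players.items():
        buckets[geohash_encode(lat, lon, precision)].append(name)
    return dict(buckets)


players = {"alice": (57.64911, 10.40744), "bob": (57.65, 10.41)}
print(bucket_players(players, precision=9))  # fine cells: separate buckets
print(bucket_players(players, precision=2))  # coarse cells: one shared bucket
```

Because a shorter geohash is always a prefix of the longer one, widening the search is just a matter of truncating the key. Two players on either side of a cell boundary can still end up in different buckets at every precision, so a real matcher would probably also check neighbouring cells.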

------
ndl
I recently had some ideas about how to use the concept of locality-sensitive
hashes with thread/worker pools that share locking resources. Basically, it's
useful when you want tasks that are going to take write locks on the same
resources to end up on the same thread, since otherwise you are just
needlessly tying up extra workers that wait for other workers to finish.
Also, if you have multiple actors with independent task queues, you can send
tasks that use a particular resource to the same queue, so that they are
processed in the order they were received.

There are probably better ways to do this in most cases, but I thought it an
interesting idea.
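
A minimal sketch of the single-resource case, where an ordinary hash of the resource key is enough to get the "deliberate collision" (a task touching several resources would need something closer to a real locality-sensitive scheme, which this does not attempt). The task tuples, resource names, and worker count are hypothetical.

```python
import zlib
from collections import deque


def queue_for(resource, n_workers):
    """Map a resource key to a worker index; the same key always maps to
    the same worker, so its write lock is never contended across workers."""
    return zlib.crc32(resource.encode("utf-8")) % n_workers


def dispatch(tasks, n_workers):
    """Append each (task_id, resource) pair to its worker's FIFO queue."""
    queues = [deque() for _ in range(n_workers)]
    for task_id, resource in tasks:
        queues[queue_for(resource, n_workers)].append((task_id, resource))
    return queues


tasks = [(1, "acct:42"), (2, "acct:7"), (3, "acct:42"), (4, "acct:42")]
for index, queue in enumerate(dispatch(tasks, n_workers=4)):
    print(index, list(queue))
```

Every task for "acct:42" sits in one queue in arrival order, so a single worker handles them serially instead of several workers blocking on the same lock.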

