
Parallel Locality Sensitive Hashing - espeed
http://istc-bigdata.org/plsh/
======
herrherr
I've been trying for weeks now to get a system running that can handle
larger-than-RAM datasets and return queries in an acceptable time. It's
running OK now, but it's far from optimal (the DB is ~100 GB and contains a
few hundred million entries).

Does anyone here have experience with any implementations (such as likelike,
lshkit, etc.) and can recommend something that handles larger sets? All the
implementations I have found were either unmaintained, outdated, broken, or
not suitable for production use.

Will definitely take a look at the paper, but unfortunately it's always a
very long way from a paper to an actual implementation (no code has been
published, as far as I can see).

~~~
espeed
Google's simhash paper shows how to handle 8 billion 64-bit fingerprints in memory:

Detecting Near-Duplicates for Web Crawling
([http://www.wwwconference.org/www2007/papers/paper215.pdf](http://www.wwwconference.org/www2007/papers/paper215.pdf))

SEOMoz has in-memory and DB-backed implementations of simhash in Python
([https://github.com/seomoz?query=simhash](https://github.com/seomoz?query=simhash))
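
For anyone who hasn't seen the scheme, here's a minimal sketch of
Charikar-style 64-bit simhash in Python. The whitespace tokenization and unit
feature weights are simplifying assumptions on my part, not the paper's exact
pipeline (the paper weights features, then finds near-duplicates by probing
permuted copies of the fingerprint table for small Hamming distance):

```python
import hashlib

def simhash64(tokens):
    """Fold each token's 64-bit hash into a signed bit histogram."""
    v = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the sign of its histogram cell.
    return sum(1 << i for i in range(64) if v[i] > 0)

def hamming(a, b):
    """Near-duplicates differ in only a few of the 64 fingerprint bits."""
    return bin(a ^ b).count("1")

print(hamming(simhash64("the quick brown fox".split()),
              simhash64("the quick brown foxes".split())))
```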

~~~
bunderbunder
Simhash is indeed wicked fast.

Unfortunately, it's also encumbered by a patent:
[http://www.google.com/patents/US7158961](http://www.google.com/patents/US7158961)

------
nkcmr
That's crazy. I was just diving into the code of the nilsimsa LSH utility
([http://ixazon.dynip.com/~cmeclax/nilsimsa.html](http://ixazon.dynip.com/~cmeclax/nilsimsa.html))
when I decided to take an HN break, and boom! LSH news.

Cool.

~~~
paulgb
This is the first time I've heard of nilsimsa. Looking through the docs for
the algorithm, I found the description below, but I'm still not clear on what
a "character separation" is in this context or which three characters are
chosen:

> Nilsimsa uses eight sets of character separations (adjacent characters,
> characters separated by one, and so on) and takes the three characters and
> performs a computation on them:
> (((tran[((a)+(n))&255]^tran[(b)]*((n)+(n)+1))+tran[(c)^tran[n]])&255), where
> a, b, and c are the characters, n is 0-7 indicating which separation, and
> tran is a permutation of [0-255]. The result is a byte, and nilsimsa throws
> all these bytes from all eight separations into one histogram and encodes
> the histogram.
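
For what it's worth, the quoted formula transcribes almost directly into
Python. Everything outside the formula below is a labeled stand-in; in
particular, the real tran table is a fixed permutation in cmeclax's source,
and how the trigrams are picked from the text isn't shown:

```python
import random

# Stand-in permutation table; the real tran table is a fixed
# permutation of 0..255 in cmeclax's source, this is NOT it.
tran = list(range(256))
random.Random(0).shuffle(tran)

def nilsimsa_byte(a, b, c, n):
    """The quoted formula: a, b, c are the byte values of the three
    characters; n (0-7) selects which separation."""
    return (((tran[(a + n) & 255] ^ tran[b] * (n + n + 1))
             + tran[c ^ tran[n]]) & 255)

# Per the quote, these bytes accumulate into a 256-bucket histogram,
# which is then encoded into the final code.
```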

I've found that in practice, shingling text and then minhashing it is
scary-good at finding similar documents.
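
As a rough sketch of what I mean (the shingle size, the salted-MD5 hash
family, and the signature length are all arbitrary choices here, nothing
canonical):

```python
import hashlib

def shingles(text, k=4):
    """Overlapping k-character shingles of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(shingle_set, num_hashes=128):
    """For each salted hash function, keep the minimum over all shingles."""
    return [min(int.from_bytes(
                hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
                for s in shingle_set)
            for seed in range(num_hashes)]

def similarity(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("locality sensitive hashing is scary-good"))
b = minhash(shingles("locality sensitive hashing is scary good"))
print(similarity(a, b))  # close to 1.0 for near-duplicate strings
```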

------
espeed
This is the paper linked at the bottom of the page:

"Streaming Similarity Search over one Billion Tweets using Parallel
Locality-Sensitive Hashing"
([http://istc-bigdata.org/plsh/docs/plsh_paper.pdf](http://istc-bigdata.org/plsh/docs/plsh_paper.pdf))

------
fiatmoney
For some context: locality-sensitive hashing is handy for quickly finding
"similar" data points in high-dimensional space. This comes up a lot in
collaborative filtering and recommendation problems (finding similar products
to hawk at your customers) and in topic modeling (finding text that is
semantically getting at the same thing).
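
To make that concrete, here's a toy sketch of one classic variant,
random-hyperplane (cosine) LSH. The dimensionality, plane count, and single
hash table are arbitrary choices for illustration; practical systems use
several tables to boost recall:

```python
from collections import defaultdict
import random

random.seed(0)
DIM, NUM_PLANES = 50, 16
planes = [[random.gauss(0, 1) for _ in range(DIM)]
          for _ in range(NUM_PLANES)]

def signature(vec):
    """One bit per hyperplane: which side of the plane the vector lies on.
    Nearby vectors (small angle apart) tend to get the same bits."""
    bits = 0
    for i, plane in enumerate(planes):
        if sum(p * v for p, v in zip(plane, vec)) >= 0:
            bits |= 1 << i
    return bits

# Toy dataset: bucket every point by its 16-bit signature.
dataset = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]
buckets = defaultdict(list)
for idx, vec in enumerate(dataset):
    buckets[signature(vec)].append(idx)

# A query is compared exactly only against points in its own bucket,
# instead of against the whole dataset.
print(buckets[signature(dataset[0])])
```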

