

Finding similar items using minhashing - sadiq
http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html

======
agazso
In practice there is a certain kind of minhashing that is used in key-value
stores, which is called a Bloom-filter.

Basically it is a probabilistic bitmap index for quickly telling if a certain
elem is in a set. It is used in BigTable, Hbase etc.

<http://en.wikipedia.org/wiki/Bloom_filter>

The Jaccard similarity (or Jaccard index) mentioned in the article is used for
finding similar items in a big set. E.g. Google News uses it for aggregating
articles on the same subject.

~~~
siculars
Indeed, bloom filters work but, as with all things, do have limitations in
practice. You have to make specific choices as to which type of bloom filter
you will implement. You have to optimize for variables like n, m or k or you
quickly begin to lose confidence in your responses. Also need to optimize for
space vs speed.

------
siculars
Sergei Vassilvitskii (of Yahoo! Research) went over this first class of "COMS
6998-12: Dealing with Massive Data", <http://www.cs.columbia.edu/~coms699812/>
just a few weeks ago. Very interesting stuff. The class is dealing
specifically with large data handling in both parallel processing and
streaming techniques.

It looks as if a major component of the class is mathematical distillation of
complex problems trading exact matching for acceptable probabilistic bounds.
When you are able to do these math tricks you can cut an O(n^2) or worse
problem down to O(n log n) or better problem which is the real win.

------
swanson
Good article. I like that it was more approachable than a journal paper -- as
a developer, I am interested in these kind of algorithms but sometimes I don't
even know where to begin searching. And as someone without a math PhD, I
appreciate the simplifications and real-world applications.

I don't know what you have planned for the next article in the series, but I
would recommend adding code snippets in a language of your choice. I would
have liked to see your example implemented in python or ruby to make it more
concrete.

~~~
sadiq
Thanks.

I'm planning to discuss a couple of different techniques (using shingling to
actually create the set elements from items is the next one) and then
demonstrate them on a large dataset.

I'll start including some Python in the follow up ones.

------
yakto
Looks really, REALLY, similar to this article:
<http://knol.google.com/k/simple-simhashing>

Hmmmm...

~~~
yakto
"Suppose you have a huge number of items that you would like to group together
by a fuzzy notion of similarity. Suppose the only tool available to you is a
key-value store. Suppose you only have the resources to consider each object
once. Never fear, simhashing is here!"

~~~
yakto
"It's been raining all day and you find yourself with a sufficiently large
pile of items (tweets, blog posts, cat pictures) and a key-value database. New
items are arriving every minute and you'd really like a way of finding similar
items that already exist in your dataset (either for duplication detection or
finding related items). Clearly we don't want to scan our entire existing
database of items every time we receive a new item but how do we avoid doing
so? Minhashing to the rescue!"

------
nreece
Worth reading: the open source library of Similarity metrics:
[http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetric...](http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html)
(comparison:
[http://staffwww.dcs.shef.ac.uk/people/S.Chapman/images/strin...](http://staffwww.dcs.shef.ac.uk/people/S.Chapman/images/string_metrics_comparison.jpg))

------
Wawl
You can also find similar items using neural networks. There was an awesome
talk about that at pycon 2009 :
<http://us.pycon.org/2009/conference/schedule/event/71/>

Neural network data mining can also sometimes survive missing data or
extrapolate from unknown instances.

------
sfphotoarts
You might want to take a look at <http://code.google.com/p/scipy-cluster/> for
some code in Python that allows systems like this to be built quickly.

------
jarin
Looking into it, I found a Ruby implementation:

<https://github.com/jmhodges/minhash>

Which depends on a Ruby implemenation of MurmurHash2:

<https://github.com/jmhodges/murmur_hash>

Anyone have any idea what the 23 is for?

    
    
      # 23 can be any unsigned 32-bit integer (i.e. from 0 to 2**32 - 1)
      hash_number = MurmurHash.murmur_hash("somestring", 23)

~~~
Shamiq
23 is the seed s/he's using for the hasing function it seems.

------
arnoldoMuller
There are also a number of similarity search indexes available:
<http://obsearch.net/> <http://www.sisap.org/Metric_Space_Library.html>
<http://lshkit.sourceforge.net/>

------
ajays
I'm sorry, but this article is poorly written. The Google Knol article,
<http://knol.google.com/k/simple-simhashing> , is much better.

It is clear that the author does not understand what a hashing function is. He
says, "Your programming language of choice's API will almost certainly have a
few acceptable ones within arm's reach", but that's not true. Firstly, what is
a programming language's "API" ? Secondly: most languages that I'm aware of
have no builtin hash functions; they are usually libraries (I know, I'm
getting pedantic here). And thirdly: most hash functions that you'd find in a
library don't return a floating point value that you can operate on with a
"min" function (or even an int value). They typically return a string of 40-,
128 or 256 bits (depending on the function). So you'll need to map the hash
returned by your hash function to a value that you can compare; for example,
by grabbing the last 4 bytes and treating them as an unsigned integer.

~~~
rlmw
Its obvious he means standard library by 'API'. Also the type returned doesn't
need to be an integer, or even a number. If you need to write a min function
then the type merely needs to form a partially ordered set. A string could
fulfill this requirement - simply compare an integer representation of the
first character if they're equal move on to the following character.
Alphabetical ordering, taught to children at all good primary schools, is a
restricted version of this comparison algorithm.

