Basically it is a probabilistic bitmap index for quickly telling if a certain elem is in a set. It is used in BigTable, Hbase etc.
The Jaccard similarity (or Jaccard index) mentioned in the article is used for finding similar items in a big set. E.g. Google News uses it for aggregating articles on the same subject.
It looks as if a major component of the class is mathematical distillation of complex problems trading exact matching for acceptable probabilistic bounds. When you are able to do these math tricks you can cut an O(n^2) or worse problem down to O(n log n) or better problem which is the real win.
I don't know what you have planned for the next article in the series, but I would recommend adding code snippets in a language of your choice. I would have liked to see your example implemented in python or ruby to make it more concrete.
I'm planning to discuss a couple of different techniques (using shingling to actually create the set elements from items is the next one) and then demonstrate them on a large dataset.
I'll start including some Python in the follow up ones.
The article came out of my reading the data mining slides from the Ullman - Stanford course, maybe there's common terminology? I know I tried to keep it consistent with the wikipedia Jaccard Index.
Neural network data mining can also sometimes survive missing data or extrapolate from unknown instances.
Which depends on a Ruby implemenation of MurmurHash2:
Anyone have any idea what the 23 is for?
# 23 can be any unsigned 32-bit integer (i.e. from 0 to 2**32 - 1)
hash_number = MurmurHash.murmur_hash("somestring", 23)
It is clear that the author does not understand what a hashing function is. He says, "Your programming language of choice's API will almost certainly have a few acceptable ones within arm's reach", but that's not true. Firstly, what is a programming language's "API" ? Secondly: most languages that I'm aware of have no builtin hash functions; they are usually libraries (I know, I'm getting pedantic here). And thirdly: most hash functions that you'd find in a library don't return a floating point value that you can operate on with a "min" function (or even an int value). They typically return a string of 40-, 128 or 256 bits (depending on the function). So you'll need to map the hash returned by your hash function to a value that you can compare; for example, by grabbing the last 4 bytes and treating them as an unsigned integer.
That's points 2 and 3 covered, I'll leave it up to you whether or not you want to split hairs on what's part of Java's "API" (a sensible reader would probably take that as the Java Class Library in the case of Java).