

Ask HN: Trying to pin down both algorithm and problem it solves ... - RiderOfGiraffes

OK, I don't expect this will make sense to anyone, but if there's any
community that can help me, it'll be this one. Either that, or please feel
free to suggest somewhere else I should be asking this question.

This is about some sort of string matching, or content matching, or other
language processing requirement. The idea is to use several hash algorithms,
and with each hash, compute the hash of each word, and then sort the results.
This gives a sorted list of hashes.

You do that for each item in a selection of items, and perform some comparison
between the lists of hashes. Then you do it again, but using a different hash.

I remember reading a paper about this, knowing that it was a problem I needed
to solve. I remember being completely perplexed by the explanation, but last
night it dawned on me what the paper might have been saying. So now I really,
really want to work out, or re-discover, what the whole thing is about.

Does it sound familiar to anyone? Please? Clues?

Thanks for any help, and thanks for reading.

*Edit: Think I've found it: MinHash. Magic terms are Locality-Sensitive
Hashing and Jaccard Index. Right. Off to bed to sleep on it - we apologize for
the inconvenience.*
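
For anyone landing here later, here is a minimal sketch of how the description above seems to map onto MinHash. It is a reading of the idea, not the code from the half-remembered paper. The assumptions: "several hash algorithms" is modelled as one hash (BLAKE2b) mixed with a seed, "sort and compare" is reduced to keeping only the smallest hash per seed, and the names `word_hash`, `minhash_signature`, `estimate_jaccard` and `exact_jaccard` are made up for illustration.

```python
import hashlib


def word_hash(word: str, seed: int) -> int:
    """One member of a family of hash functions, selected by `seed`.

    Assumption: prefixing the seed to the word stands in for
    "using a different hash algorithm" each round.
    """
    data = f"{seed}:{word}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the smallest hash over the item's words.

    Taking the minimum is the same as sorting the hashes and keeping
    the first entry. Assumes the text contains at least one word.
    """
    words = set(text.lower().split())
    if not words:
        raise ValueError("text must contain at least one word")
    return [min(word_hash(w, seed) for w in words) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of hash functions on which the two minima agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(text_a: str, text_b: str) -> float:
    """Exact Jaccard index of the two word sets, for comparison."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b)
```

The reason this works: for a well-mixed hash function, the probability that two word sets share the same minimum hash equals their Jaccard index, so the fraction of agreeing positions across many hash functions estimates it. Comparing `estimate_jaccard(minhash_signature(a), minhash_signature(b))` against `exact_jaccard(a, b)` on a couple of near-duplicate sentences should show the two numbers landing close together, with the gap shrinking as `num_hashes` grows.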
======
zardeh
Are you by any chance thinking of a
[Bloom filter](http://en.wikipedia.org/wiki/Bloom_filter)?

~~~
RiderOfGiraffes
Nice guess, but no - I know the Bloom filter, what it does, its limitations,
its strengths, how it works, and how to implement it. It's one of the tools I
have to hand already.

But I can see how that nearly matches my description, so it was a reasonable
guess. Thanks.

------
QuantumDoja
Reminds me of a server thing, where addresses are hashed, then sorted, I think
it was for storage, gah, now I can't even remember it!

