

Ask HN: Algorithm to identify the same events based on semantic similarity - btw0

I am working on a project that needs to identify the same event described in short phrases by different people. I think the workflow should be something like: 1. extract the meaningful words (nouns, verbs) from the phrases; 2. measure the semantic distances between these words; 3. cluster.

I am asking for your advice. Thank you.
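A minimal sketch of that three-step workflow, using plain token overlap as a stand-in for a real semantic distance (the stopword list, threshold, and example phrases are all made up for illustration):

```python
# Toy version of the proposed pipeline: tokenize, score pairwise
# similarity, then greedily cluster phrases above a threshold.
import re

STOPWORDS = {"the", "a", "an", "to", "of", "in", "at", "on"}

def tokens(phrase):
    """Step 1: crude 'meaningful word' extraction (lowercase, drop stopwords)."""
    return {w for w in re.findall(r"[a-z]+", phrase.lower()) if w not in STOPWORDS}

def similarity(a, b):
    """Step 2: overlap coefficient between token sets (stand-in for semantics)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def cluster(phrases, threshold=0.5):
    """Step 3: greedy single-pass clustering on the similarity score."""
    clusters = []
    for p in phrases:
        for c in clusters:
            if any(similarity(p, q) >= threshold for q in c):
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

events = [
    "fire breaks out in downtown warehouse",
    "warehouse fire downtown",
    "city marathon starts sunday",
]
groups = cluster(events)
```

The interesting (and hard) part is replacing `similarity` with something that actually captures meaning, which is what the answers below discuss.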
======
aristus
The current consensus is to use statistics without trying to write a program
that "understands" human language.

A common approach is to break your items down into a vector of "features"
(words, phrases, tokens, etc) and apply standard IR techniques like kNN,
singular value decomposition (SVD), tf/idf, etc.
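To make the tf/idf part concrete, here is a tiny hand-rolled version with no library dependencies (the corpus and whitespace tokenizer are just illustrative):

```python
# Hand-rolled tf-idf feature vectors plus cosine similarity -- the
# standard IR baseline: weight each term by how rare it is across
# the corpus, then compare documents as weighted vectors.
import math
from collections import Counter

docs = [
    "fire in downtown warehouse",
    "downtown warehouse fire injures two",
    "marathon starts sunday",
]

def tf_idf_vectors(docs):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tf_idf_vectors(docs)
```

A kNN classifier on top of this is just "rank the labeled items by cosine similarity and let the top k vote".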

One interesting thing about your problem is that people use different words
and phrasing for the same thing, which means you need a way to equate
different tokens with each other. SVD is pretty good because it finds degrees
of similarity based on context: 'Pikachu' becomes strongly associated with
'Pokemon' and 'Squirtle', and also with misspellings like 'Picachoo'. SVD is
tricky and expensive if you implement it wrong. The good news is that it
recently came off-patent so there are a lot of new libraries popping up.

When I worked on a keyword search engine prototype for Brazil, I found that a
mix of Tanimoto and Ferber similarity scores worked _very_ well for finding
similarity between short phrases, and I didn't even need to know how to read
Portuguese.
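Tanimoto similarity over token sets (it coincides with the Jaccard index that joshu mentions below) is a one-liner, which is part of why it works so nicely on short phrases:

```python
# Tanimoto / Jaccard similarity between the token sets of two short
# phrases: |intersection| / |union|.
def tanimoto(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

score = tanimoto("downtown warehouse fire", "warehouse fire injures two")
```

Here the two phrases share 2 of 5 distinct tokens, so the score is 0.4; phrases with no common token score 0.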

PressFlip is using support vector machines for a similar task of clustering
news items. The Vowpal Wabbit is a stellar library used by Yahoo. I think they
recently open-sourced it. allvoices.com is tackling the same area as well.
(disclaimer: I am a dev at allvoices, though I don't work on that part :)

~~~
ntoshev
Statistical language processing is a good way to go, but you need enough data
for it to be useful. I'd recommend starting by gathering the data.

~~~
abhikshah
Amazon's Mechanical Turk seems like a good way to gather training data, and
it might even be cheap enough to use instead of a trained model (at least at
first).

------
haidut
OK, there is a pretty "simple" way to do it, but I can't divulge all the
details because we are using it for my semantic news startup. However, I can
give you a hint about what you need to do. You need to find a way to reduce
different words to their "base" semantics and then compute a similarity
measure of sorts between the different event descriptions. Traditionally,
people have broken the text down into word vectors and then calculated a
Pearson, Tanimoto, or similar similarity measure between the vectors. The
problem with this approach is not the similarity measure but the fact that
(as aristus points out) different people use different words to describe the
same thing. So computing the Tanimoto score of two sentences like "Bush
visits Iraq to encourage soldiers" and "U.S. President arrives in Baghdad to
raise troop morale" will yield a very low similarity score using just the
literal words in both sentences, even though in reality the two sentences
are almost identical in meaning.

So let's say there was a way to reduce the semantics of both sentences down
to their "base". For sentence one you would have something like "US President
present in Middle Eastern country to increase spirit of soldiers", and for
sentence two "US President present in capital of Middle Eastern country to
increase spirit of soldiers". NOW, if you take the Tanimoto measure (or
actually any other decent similarity measure) between the semantic "base" of
both sentences, you get a very high score, which is what is needed. Like I
said, I can't divulge the method for reducing text to its semantic "base",
but there is definitely at least one way, which is what we are using. If you
are interested, send me an email at haidut@gmail.com, and when I launch my
startup within the next couple of weeks I can hopefully divulge more. Sorry
for the vagueness, but my lawyers will make mincemeat of me if I say
more :-)

Here is something I can divulge. You can use Pointwise Mutual Information
(PMI) to find the semantic relatedness of the words in both event
descriptions, and if the overall cross-word similarity is above a certain
threshold, you can consider both event descriptions similar. To find out
more about how to calculate PMI, and for more on how useful it is, read this
paper: <http://cogprints.org/1796/0/ECML2001.pdf>

Overall, what you are trying to do is simple in concept but not easy to
implement. For semantic similarity to be computed automatically by a
machine, you'd need a mapping of EVERY word in EVERY language to its
semantic "base", so you could build a table and compute this efficiently.
Such a mapping does not currently exist, and one reason is the constantly
changing nature of the English language itself: new words appear all the
time, and even if you could track them all, some of them don't have a
semantic "base". Using the example above, the fact that Bush is currently
President won't be true in a couple of months, so that huge word-to-meaning
mapping table would have to account for that. Mapping the English language
to a semantic base is a (very fast) moving target.
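The PMI part can be sketched in a few lines; the toy "contexts" below are invented stand-ins for sentences or sliding windows over a real corpus:

```python
# Pointwise mutual information from co-occurrence counts:
#   PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
# Word pairs that co-occur more often than chance get a positive score.
import math
from collections import Counter

# each context is the set of words seen together (sentence / window)
contexts = [
    {"president", "visits", "troops"},
    {"president", "troops", "morale"},
    {"president", "election"},
    {"recipe", "flour"},
]

n = len(contexts)
word = Counter(w for c in contexts for w in c)
pair = Counter(frozenset((a, b)) for c in contexts
               for a in c for b in c if a < b)

def pmi(a, b):
    p_ab = pair[frozenset((a, b))] / n
    if p_ab == 0:
        return float("-inf")   # never co-occurred
    return math.log(p_ab / ((word[a] / n) * (word[b] / n)))
```

'president' and 'troops' co-occur in half the contexts, so their PMI is positive; 'president' and 'recipe' never co-occur.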

That's all I have for now. Good luck in your endeavor.

------
sbt
What you are trying to do is very hard. Statistical NLP is the way to go for
unbounded domains. You might be able to do better semantic analysis for
smaller closed domains.

There is a great, relatively new book on NLP that I suggest you take a look
at. The chapters on semantics are particularly useful, and they should give
you an idea of how incredibly difficult what you're trying to do is.

Book: [http://www.amazon.com/Language-Processing-Prentice-
Artificia...](http://www.amazon.com/Language-Processing-Prentice-Artificial-
Intelligence/dp/0131873210/ref=sr_1_1?ie=UTF8&s=books&qid=1221233467&sr=1-1)

------
joshu
I recommend an IR and/or NLP book.

Check into Jaccard distance. <http://en.wikipedia.org/wiki/Jaccard_index>

You might consider looking at WordNet for generalizing words...

------
MaysonL
There was a great Google tech talk posted here a while ago: The Next
Generation of Neural Networks by Geoffrey Hinton
<http://www.youtube.com/watch?v=AyzOUbkUf3M> that might be applicable.

------
snprbob86
Will these people be mentioning the same URLs, by chance? :-)

