

NLP Challenge: Find semantically related terms over a large vocabulary (1M)? - bravura
http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/

======
palish
This doesn't make much sense....

From the website, here are some example "documents":

    
    
      roger heavens trinity cricketers willie sugg early cambridge cricket giles phillips cambridge university cricket
    
      roger hornetttom pridmore 020 7399 4270 collins stewartalan geeves 020 7523 8800 buchanan communicationscharles ryland /isabel podda 020 7466 5000
    
      roger hubbold aig .1 force-field navigation
    

Now, what is the point in trying to generate any kind of "meaning" from those
documents if they consist of completely meaningless gibberish?

As I was reading this challenge, I immediately thought of spam filtering /
youtube comment classification ("smartness" classification) / etc. as a
potentially _useful_ application of this technology.

For example, if each "document" is a youtube comment, then in theory you could
write an algorithm that examines each comment and outputs a "smartness guess"
for each. Then you (as in, you personally, by hand) would look at the results
and specify your own "smartness rating" for a few comments. Then you'd run an
algorithm that looks at the difference between your specified "smartness
rating" and the "smartness guess", and, using that difference, tweaks the
settings in the original algorithm until it outputs a "smartness guess" that
more closely fits your "smartness rating". If you repeat that process enough
times, and your original algorithm has enough knobs to tweak, then you might
wind up with an algorithm that can make a pretty good guess about whether any
given youtube comment is smart or not.
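
In case it's useful, here's a minimal sketch of that guess/compare/tweak loop
(the bag-of-words features, learning rate, and toy labels are all my own
assumptions, not anything from the challenge):

    # Minimal sketch of the guess/compare/tweak loop described above.
    # Bag-of-words features and the toy labels are hypothetical.
    def featurize(comment):
        # Bag-of-words: map each lowercase token to its count.
        feats = {}
        for tok in comment.lower().split():
            feats[tok] = feats.get(tok, 0) + 1
        return feats

    def smartness_guess(weights, feats):
        # The algorithm's "smartness guess": a weighted sum of features.
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())

    def train(labeled, epochs=50, lr=0.1):
        # Tweak the weights so guesses drift toward the hand-given ratings.
        weights = {}
        for _ in range(epochs):
            for comment, rating in labeled:
                feats = featurize(comment)
                error = rating - smartness_guess(weights, feats)
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + lr * error * v
        return weights

    # Your hand-specified "smartness ratings" (made-up examples).
    labeled = [("first lol", 0.0),
               ("interesting point about the proof", 1.0)]
    w = train(labeled)
    print(smartness_guess(w, featurize("nice proof")))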

And that's just one example of a practical application for this kind of thing.

That said, if the input "documents" are _completely and utterly meaningless_ ,
then there does not seem to be any point in trying to build "meaning" from
those inputs. (Garbage in, garbage out.)

~~~
puredemo
Semantically related != meaningful.

Also, they do note that "'Semantically related' is purposefully left vague."

~~~
rarestblog
I'm sorry, but "semantics" is "the study of meaning" (at least according to
Wikipedia). How is it possible that "semantically related != meaningful"?

~~~
Qz
You left out a word - 'related'. Consider:

Tennis Playpen

Semantically _related_ yes, meaningful no.

------
miguelpais
That's somewhat similar to the second mini-project of my NLP course (which I
handed in about a week ago).

What we had to do was first gather a large corpus (of Portuguese, in this
case), then produce the frequencies of each unigram and bigram, then process
those frequencies and output the top 10 and bottom 10 collocations of the
corpus. Collocation definition from Wikipedia: "Within the area of corpus
linguistics, collocation defines a sequence of words or terms that co-occur
more often than would be expected by chance".

The method used to calculate the collocations was based on a chi-square test.
The bigrams with the highest chi-square values were the most probable
collocations (if above the critical value needed to reject the null
hypothesis).

So, I actually have Ruby code done that I could change just a little bit for
this, but there are probably more clever methods.
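
For anyone curious, here's a rough Python sketch of that chi-square scoring
(the 2x2 contingency table is the textbook formulation; the toy corpus and the
cutoff-free ranking are my simplifications):

    # Rough sketch: chi-square collocation scoring over adjacent bigrams.
    from collections import Counter

    def chi_square_collocations(tokens, top=10):
        bigrams = Counter(zip(tokens, tokens[1:]))
        starts = Counter(tokens[:-1])   # how often each word begins a bigram
        ends = Counter(tokens[1:])      # how often each word ends a bigram
        n = sum(bigrams.values())
        scores = {}
        for (w1, w2), o11 in bigrams.items():
            o12 = starts[w1] - o11      # w1 followed by something else
            o21 = ends[w2] - o11        # w2 preceded by something else
            o22 = n - o11 - o12 - o21   # bigrams containing neither
            num = n * (o11 * o22 - o12 * o21) ** 2
            den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
            if den:
                scores[(w1, w2)] = num / den
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:top], ranked[-top:]   # top and bottom collocations

    tokens = "o gato preto dorme o gato branco dorme o gato preto corre".split()
    best, worst = chi_square_collocations(tokens, top=3)
    print(best, worst)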

------
nikster
This sounds interesting, but the main challenge here seems to be to define the
question: what is a semantically related word? I can think of three classes -
variations (singular/plural), synonyms (thesaurus), and contextual (occur in
the same sentence/paragraph). Once the definition of "semantically related" is
established, it should be easy to find a clever algorithm to produce results.
However, out of context the problem is a little boring - in order to find a
good solution I'd like to know the real-world problem the list of
"semantically related" words is going to solve.

~~~
bravura
There are two aspects of the problem that I consider particularly interesting:
a) defining the problem, because being able to define the problem and justify
your answer is half the puzzle in a lot of NLP work; and b) scaling to a
corpus of this size (vocabulary of 1M), because this scale is tricky but
useful in many web problems.

Semioticians typically distinguish between _paradigmatic_ and _syntagmatic_
axes of semantic relatedness. Paradigmatic means that two words occur with
similar other words, e.g. these two words typically have the same word
immediately to their left, like "blue" and "azure". Syntagmatic means that the
words typically co-occur in usage, like "blue" and "sky". Check out the image
on this page for another illustration of these axes:
<http://www.aber.ac.uk/media/Documents/S4B/sem03.html>

Regardless of whether you choose to do a paradigmatic or syntagmatic analysis,
it's interesting to see how you motivate your approach and if you can scale it
to 1M different vocabulary words.
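
To make the two axes concrete, here is a toy sketch (the four-document corpus
is made up; real data would need windowing and weighting):

    # Toy sketch: syntagmatic = words that co-occur directly;
    # paradigmatic = words with similar co-occurrence profiles.
    from collections import Counter, defaultdict
    import math

    docs = [["blue", "sky"], ["azure", "sky"],
            ["blue", "sea"], ["azure", "sea"]]

    cooc = defaultdict(Counter)
    for doc in docs:
        for w in doc:
            for v in doc:
                if v != w:
                    cooc[w][v] += 1

    def syntagmatic(a, b):
        # Direct co-occurrence: "blue" and "sky" appear together.
        return cooc[a][b]

    def paradigmatic(a, b):
        # Cosine similarity of context vectors: "blue" and "azure"
        # share neighbours even though they never co-occur here.
        dot = sum(cooc[a][w] * cooc[b][w] for w in cooc[a])
        na = math.sqrt(sum(c * c for c in cooc[a].values()))
        nb = math.sqrt(sum(c * c for c in cooc[b].values()))
        return dot / (na * nb) if na and nb else 0.0

    print(syntagmatic("blue", "sky"))       # 1: they co-occur
    print(paradigmatic("blue", "azure"))    # 1.0: identical contexts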

~~~
riffer
_scaling to a corpus of this size ... it's interesting to see how you motivate
your approach and if you can scale it to 1M different vocabulary words_

It looks like the sparsity of the matrix is going to be a much bigger
challenge than the scale.

I understand about the focus being primarily on the approach, that makes
sense; how are you intending to evaluate the results files?

~~~
bravura
_It looks like the sparsity of the matrix is going to be a much bigger
challenge than the scale._

Sparsity is good. The sparsity is the only reason that you can keep a matrix
with this many dimensions in memory.
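
For a rough sense of why (my numbers, not the challenge's): a dense 1M x 1M
matrix of 4-byte counts would be about 4 TB, while a sparse structure only
pays for pairs actually observed:

    # Sparse co-occurrence storage: zeros are implicit, so memory is
    # proportional to observed pairs, not to (vocab size) squared.
    from collections import defaultdict, Counter

    cooc = defaultdict(Counter)   # row word -> {column word: count}
    cooc["blue"]["sky"] += 1      # only cells you touch use memory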

 _I understand about the focus being primarily on the approach, that makes
sense; how are you intending to evaluate the results files?_

For any submission, I will post each entry's 10 related terms for a random
subset of vocab words. I'll then ask people to vote blind.

~~~
rarestblog
I think a fairer comparison would be to show THE SAME random subset for every
entry, i.e. the same X words for all result sets.

Otherwise it might happen that a superior entry just gets shown words that
don't even have good related terms, while an inferior one gets better-covered
words.

~~~
bravura
Yes, it will be the same random subset for all participants. Sorry if that
wasn't clear.

------
nadam
My first idea would be the following:

If I had enough time for the problem, I would first use a morphological and
syntactic analyzer to try (at least with high probability) to determine the
morphological structure of the words, and I would tag them with their
syntactic category (part of speech). So I would treat, for example, 'work' and
'worked' as the same word. On the other hand, I would treat 'can:aux' and
'can:noun' as different words.

Then I would completely eliminate all the words that are not nouns,
adjectives, or verbs (the, a, can:aux, may, etc.). These are 'asemantic'
words.

Maybe also eliminate extremely rare words (with 1 or 2 occurrences).

After this preprocessing now comes the semantical relatedness analysis.

A very primitive solution:

----

Compute the ten best B for A, where the following is maximal:

f(AB) / (f(A) * f(B))

f can be the frequency. f(AB) can be the frequency of occurring together in a
document or within a distance (like 5 words). But you can fine-tune it: weight
it by the actual distance.

You don't have to maintain the full matrix. As you go through the documents,
just maintain at most 100 words for each word for which this measure is
maximal, and at the end take the 10 best of those 100.
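
A rough sketch of this (document-level co-occurrence, a made-up corpus, no
distance weighting):

    # Rank B for A by f(AB) / (f(A) * f(B)) over a toy corpus.
    from collections import Counter, defaultdict

    docs = [["blue", "sky", "azure"], ["blue", "sea"], ["azure", "sky"]]

    f = Counter()                  # f(A): word frequencies
    fab = defaultdict(Counter)     # f(AB): co-occurrence within a document
    for doc in docs:
        f.update(doc)
        for a in doc:
            for b in doc:
                if a != b:
                    fab[a][b] += 1

    def best_related(a, k=10):
        # A real run would cap each word's candidate list at ~100
        # entries, then take the 10 best at the end, as described above.
        scored = {b: c / (f[a] * f[b]) for b, c in fab[a].items()}
        return sorted(scored, key=scored.get, reverse=True)[:k]

    print(best_related("blue"))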

----

A more sophisticated solution for the semantic relatedness analysis:

Let's put all the words onto a 2-dimensional map with totally random x,y
coordinates. Now, as you go through the documents, imagine that words which
occur near each other attract each other on the map, so their distance is
multiplied by 0.8. They gravitate. Imagine that the mass of a word is its
frequency: the small-frequency word moves a lot towards the big-frequency
word, while the big-frequency word moves just a little. Just like gravitation.
Of course, from time to time we change the scale: we blow the whole map up a
bit so that everything doesn't gravitate into one point. This is a 'global
algorithm' in the sense that it takes into account indirect relations like
synonyms, which are semantically really related even though they don't
actually occur in the same sentence. At the end, determining the 10 nearest
neighbours fast is easy: you can use a simple grid as 'space partitioning'.

Edit: A 2-dimensional map is not necessarily optimal; it might be too tight. I
would also experiment with setting the map dimension to 3, 4, 5, etc.
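
A toy sketch of the gravitation idea (the pull strength, rescale factor, and
corpus are all arbitrary choices of mine):

    # Co-occurring words pull together; heavier (more frequent) words
    # move less; the map is periodically blown up so it doesn't
    # collapse into a single point.
    import random
    from collections import Counter

    docs = [["blue", "sky"], ["azure", "sky"], ["blue", "sea"]]
    freq = Counter(w for doc in docs for w in doc)
    pos = {w: [random.random(), random.random()] for w in freq}

    for step in range(100):
        for doc in docs:
            for a in doc:
                for b in doc:
                    if a == b:
                        continue
                    # a moves toward b; higher-frequency a moves less.
                    pull = 0.2 * freq[b] / (freq[a] + freq[b])
                    for d in range(2):
                        pos[a][d] += pull * (pos[b][d] - pos[a][d])
        if step % 10 == 0:                  # blow the map up a bit
            for p in pos.values():
                p[0] *= 1.1
                p[1] *= 1.1

    print(pos)   # nearest neighbours in this map ~ related words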

~~~
sukuriant
try it and see :D

I think I'm going to be attempting a solution, too (though it's not the same
as yours)

------
tgflynn
I checked in a simple co-occurrence frequency based implementation for this
here <https://github.com/tgflynn/NLP-Challenge> with a short discussion on my
blog: <http://cogniception.com/wp/>.

------
sdrinf
Solution: <http://groups.google.com/group/metaoptimize-challenge-discuss/browse_thread/thread/48bc7de317e79576>

------
Tycho
Hmm. Without any thought given to computability: rearrange the words of every
document into alphabetical order, then sort the whole list of documents
alphabetically. Now start reading the documents, and for each one make a tree
with each word as a leaf, BUT if there is already a tree with a fragment of
the same phrase, then that tree is linked in as a node of the new tree.

So for the document 'christmas day' the tree

            d1
           /  \
    christmas  day

is formed. But for the document 'christmas day special', the tree

          d2
         /  \
       d1    special

is formed. Note: word order doesn't matter. For any given document, how will
it know if there's already a tree that uses some of the same words? When a
tree is created from a document, give it a hash for the full set of words,
plus hashes for sub-strings. These can be stored in a big lookup table, so new
documents know where to look for relevant trees (or -- use whatever solution
is well established for this sort of problem). You might also need to
algorithmically redefine older trees as you go along, i.e. divide a tree into
smaller phrases.

For each leaf, let it keep track of which trees it's part of (including
grandparents and above). Then, to find the top-k for each leaf object (if it's
in the vocabulary), print out all the other leaves of all the trees it's part
of. Like:

    if str(leaf) == 'day':
        for tree in leaf.trees:
            print(tree.leaves)

Then do a simple frequency count to find the top-k associated words.
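
(Something like this, where co_leaves is a hypothetical list of every co-leaf
collected above:)

    # co_leaves: every other leaf seen alongside 'day' (made-up data).
    from collections import Counter
    co_leaves = ["christmas", "special", "christmas"]
    print(Counter(co_leaves).most_common(10))   # top-k associated words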

(sorry for the n00b answer... IANACS)

------
blr_hack
One way of doing this:

1. Get a million or so tweets from the twitter API (each tweet is exactly like
what the article describes as a 'document').

2. After ignoring common words (e.g. 'a', 'an', 'the', ...) in a tweet, start
assigning some kind of relevance rank to all the other word pairs found in the
tweet. E.g. it's likely that 'Cricket' and 'Sachin' (or 'NBA' and <top NBA
player>) will both increase each other's relevance rank, with respect to each
other.

3. Process all the tweets like this, while maintaining the output of step 2 in
a suitable data structure. You would also need to start dropping word pairs
(to avoid the problem of storing million-choose-2 pairs!) based on some
logic/heuristic.

4. If we have good logic for keeping storage reasonably small (by avoiding the
million-choose-2 explosion), then what we have at the end of processing is a
simple lookup of a million words, where each word has its 'top' max_allowed(k)
semantic words.

PS: The most complex piece of the approach is coming up with a solution for
dropping word pairs (in step 3); see the sketch below.
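
One possible dropping heuristic, sketched (the cap of 100 candidates per word
and the amortized pruning are my own choices):

    # Keep at most `cap` co-occurrence counts per word; when a word's
    # candidate map overflows, drop its weakest pairs. Lossy, but it
    # bounds memory at vocab_size * cap instead of million-choose-2.
    from collections import defaultdict, Counter

    cap = 100
    pairs = defaultdict(Counter)

    def prune_to(counter, k):
        # Delete everything below the k strongest entries.
        for w, _ in counter.most_common()[k:]:
            del counter[w]

    def add_pair(a, b):
        pairs[a][b] += 1
        if len(pairs[a]) > 2 * cap:   # amortize the pruning work
            prune_to(pairs[a], cap)

    add_pair("cricket", "sachin")
    print(pairs["cricket"])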

~~~
spencertipping
I have a preliminary implementation of this, geared for map/reduce-style
parallelism in Perl, up at
<http://github.com/spencertipping/metaoptimize-challenge> (in the fcm
directory). It may be a start to solving step 3 -- using reduce-by-two on each
step and dropping low-relevance words (I think this is valid, though I'm not
100% certain).

~~~
blr_hack
Hi! Thanks for reading my suggested approach. I can understand how map-reduce
can be used to process things faster (by processing them in parallel and later
using reduce to aggregate the results?), but I can't imagine how the reduce
step (of map-reduce) can help with dropping word pairs (i.e. in step 3).

~~~
spencertipping
I think it's an imperfect solution, though it might still be solid. It depends
on your reduction strategy (basically, whether you break associativity). If
you do a fold-left, you'll accumulate such high relevance that you'll quickly
start discarding every word in the right-hand data set. On the other hand, if
you binary-split the folding process, you're more likely to be OK.

I used a binary split and, due to the sparseness of the data (and the
uninsightfulness of my algorithm), didn't run into too many spurious drop
cases. But the implementation I posted is very basic and shows my lack of
background in NLP-related pursuits -- I'm lucky it did anything useful at all
:)
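
To illustrate the difference (a toy sketch; merge_and_drop is a stand-in for
the real relevance logic, here just "keep the top two pairs per merge"):

    # Toy contrast of fold-left vs binary-split reduction.
    from collections import Counter
    from functools import reduce

    def merge_and_drop(a, b, k=2):
        # Stand-in dropping rule: keep only the k strongest pairs.
        return Counter(dict((a + b).most_common(k)))

    chunks = [Counter({"sky": 10, "sea": 9}),
              Counter({"sun": 8, "fog": 7}),
              Counter({"sun": 8, "fog": 7})]

    # Fold-left: the early accumulator dominates, so "sun" (true
    # count 16, the real winner) is dropped at every step.
    print(reduce(merge_and_drop, chunks))   # Counter({'sky': 10, 'sea': 9})

    # Binary split: similar-sized sides compete fairly before merging up.
    def fold_tree(cs):
        if len(cs) == 1:
            return cs[0]
        mid = len(cs) // 2
        return merge_and_drop(fold_tree(cs[:mid]), fold_tree(cs[mid:]))

    print(fold_tree(chunks))                # Counter({'sun': 16, 'fog': 14})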

------
pajarofrio
I won't bother with this challenge, since there is no clear good solution. I
would find it more interesting to work with semantically rich documents, where
your work can be tested to measure the quality of your solution. That is, good
work must always be able to gain enough recognition to be pursued.

------
robryan
Couple of approaches that I can think of from papers of which combining in
some way would probably mean decent results. Matching left and right fragments
of n length, adding POS tagging to the matching fragments and using wordnet
relation closeness for terms that exist in wordnet.
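
For the WordNet part, a quick sketch using NLTK (assumes nltk is installed and
the wordnet corpus downloaded; path_similarity is just one of several
closeness measures):

    # WordNet relation closeness via NLTK's path_similarity.
    # Setup: pip install nltk; then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    dog = wn.synsets('dog')[0]
    cat = wn.synsets('cat')[0]
    print(dog.path_similarity(cat))   # higher = more closely related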

------
jdrock
I encourage anyone to use Extractiv (<http://www.extractiv.com>) for this
challenge. We're rolling out an On-Demand (local document analysis similar to
Calais) Semantic Conversion service that could be helpful.

~~~
anigbrowl
Slow down the rollout a little: your 'take a tour' page's inline link to
Semantic On-Demand 404s, as does the inline link from your pricing page:

<http://www.extractiv.com/tour-semantic-on-deman.html> - missing letter in
'...deman.html'

<http://www.extractiv.com/tour-semantic-on-demand> - missing '.html'

Also, the pricing page says users need to buy a Plus or Premium subscription
for access to Semantic On-Demand, but confusingly says the basic plan includes
'On-Demand access to 1,000 documents / day' - I presume you meant to emphasize
the unlimited usage available with the premium plans.

~~~
riffraff
Also, the example from <http://www.extractiv.com/tour-semantic-on-demand.html>
does not work.

I second this - it seems interesting (and I would be really interested in
seeing what you produce as "relations").

