

Ask HN: Are there any NLP projects that use context? - iLoch

I was thinking about natural language processing today. I am by no means educated on the subject, but it seems like an interesting field. Most NLP libraries can parse text into its parts of speech, but they are often wrong on a word like "project", where the verb form is a completely different word from the noun form.

In all the libraries I looked at, "project" was interpreted as a noun despite the sentence being "We will project our earnings".

Could we not provide context to the parsing process? For instance, a database could store the frequency of parts of speech for a given word's N nearest neighbours (i.e. the n-1 neighbour of "project" is a definite article 90% of the time), as well as the frequency of each part of speech for the given word itself. So if the n-1 neighbouring word of "project" was not a definite article, it may indicate with a certain weight that "project" is not being used in its typical form either. Then you could do the reverse lookup (the n+1 neighbour of "will") and see that *that* neighbour is a verb 50% of the time. Putting those two pieces of information together gives you pretty good confidence that "project" probably isn't a noun like it usually is.

So is this just too computationally expensive, or is there something else that I'm missing?
======
wodenokoto
There are a lot of ways of creating context in NLP. When a normal person
thinks of context, they probably consider the meaning of the sentence and
figure out the POS from there.

For a simple POS tagger today you would probably use an HMM (hidden Markov
model). In simple terms, you look at how likely the next word (regardless of
what word it is) is to belong to a given category. Let's say the next word has
a 50/50 chance of being a verb or a noun. Then we look at the actual word
"project", which on its own has a 20/80 chance of being a verb or a noun, so
here it would be tagged as a noun.

In another context the next word might have a 90% chance of being a verb, and
then "project" would probably end up being tagged as such.
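
A rough sketch of that arithmetic in Python, using the numbers from this comment. It is a simplification and not a real HMM: a proper tagger would use transition and emission probabilities estimated from a tagged corpus and decode the whole sentence at once. The `combine` helper and the tag names are made up for illustration.

```python
def combine(context_probs, word_probs):
    """Multiply the context score and the word score for each tag,
    then normalise so the results sum to 1."""
    scores = {tag: context_probs[tag] * word_probs[tag] for tag in word_probs}
    total = sum(scores.values())
    return {tag: score / total for tag, score in scores.items()}

# From the comment: "project" on its own is a verb 20% of the time.
project = {"VERB": 0.2, "NOUN": 0.8}

# Illustrative contexts: one with no preference, one strongly favouring a verb.
neutral_context = {"VERB": 0.5, "NOUN": 0.5}
verb_context = {"VERB": 0.9, "NOUN": 0.1}

neutral = combine(neutral_context, project)
after_modal = combine(verb_context, project)

print(max(neutral, key=neutral.get))          # most likely tag, neutral context
print(max(after_modal, key=after_modal.get))  # most likely tag, verb-heavy context
```

With the 50/50 context the word's own 20/80 split decides the tag; with the 90/10 context the verb reading wins (0.9 * 0.2 = 0.18 against 0.1 * 0.8 = 0.08).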

You can also start looking at bigrams and trigrams (sequences of two or three
words) and so on.
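
For reference, a minimal sketch of pulling bigrams and trigrams out of a sentence in plain Python. The `ngrams` helper is hypothetical and written just for this example, and splitting on whitespace is a stand-in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "We will project our earnings".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

The pair ("will", "project") is the kind of context that would push "project" towards a verb reading.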

So what you describe is pretty much an actual thing.

~~~
iLoch
Oh interesting! Thanks for sharing.

------
sfrechtling
Yes, some NLP POS (part-of-speech) taggers do use context. I'm not entirely
sure if they work in the way that you describe. I believe that they use the
grammar of the sentence to derive the tags for the sentence. That is, they tag
what they can from what they know and then iterate over the sentence until
everything fits the grammar (or corpus) that they were trained on.

If you want more background, the NLTK book, in particular chapter five, is a
good place to start:
[http://www.nltk.org/book/ch05.html](http://www.nltk.org/book/ch05.html)

What library did you use? NLTK on Python 2.7 gave the following:

[('We', 'PRP'), ('will', 'MD'), ('project', 'VB'), ('our', 'PRP$'),
('earnings', 'NNS')]

~~~
iLoch
Interesting, I'll have to take a look at this Python lib. I was trying out
various Node.js libs.

