

Avoiding NLP At All Costs - bdfh42
http://teddziuba.com/2008/11/avoiding-nlp-at-all-costs.html

======
petercooper
Summary: NLP is hard. There are lots of complicated academic concepts involved
in parsing natural text that I don't understand. I have failed at using NLP in
some indeterminate way. Please don't use NLP because you probably can't do
anything with it either.

------
justindz
I thought this was going to be about Neuro-Linguistic Programming. Turns out
it was natural language processing, so the rest of it became tl;dr.

~~~
hugh
Oh, so that's what it was. I read the entire article to try and figure out
what NLP meant, and still didn't figure it out.

------
jimbokun
Off-the-shelf natural language parsing algorithms run in polynomial time
(generally n^3 or so).

<http://en.wikipedia.org/wiki/List_of_algorithms#Parsing>

In that list, CYK, Earley, GLR, and Inside-Outside are listed as suitable for
parsing natural languages (oversimplifying, the dividing line between parsers
for computer languages and human languages is that the former are
deterministic and the latter spectacularly ambiguous).
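For a concrete sense of where the n^3 comes from, here's a minimal CYK
recognizer over a toy grammar in Chomsky normal form (the grammar and lexicon
here are made up purely for illustration):

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration).
# Binary rules map (B, C) -> possible left-hand sides A for A -> B C.
binary_rules = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical_rules = {
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cyk_recognize(words, start="S"):
    n = len(words)
    # table[(i, j)] holds the nonterminals that derive words[i:j]
    table = defaultdict(set)
    for i, w in enumerate(words):
        table[(i, i + 1)] = set(lexical_rules.get(w, ()))
    for span in range(2, n + 1):           # O(n) span lengths
        for i in range(n - span + 1):      # O(n) start positions
            j = i + span
            for k in range(i + 1, j):      # O(n) split points -> O(n^3) total
                for b in table[(i, k)]:
                    for c in table[(k, j)]:
                        table[(i, j)] |= binary_rules.get((b, c), set())
    return start in table[(0, n)]

print(cyk_recognize("the dog saw the cat".split()))  # True
```

The three nested loops over span length, start position, and split point are
the n^3; ambiguity just means each cell can hold several nonterminals instead
of one.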

So, you want to process a bunch of text, with the time to parse each sentence
increasing as a polynomial of its length... I think you see where I'm going
with this. Obviously, lots of people are looking for shortcuts here, but that
probably means sacrificing some accuracy, at which point you need to question
whether you really want to be parsing anyway. Last I checked, state-of-the-art
parsing accuracy was in the low 90s, and that's if all your sentences come
from the Penn Treebank corpus.

So, when people say NLP is hard, they probably mean parsing. You could
describe much of applied NLP research as finding clever ways to answer the
question you want to answer without actually parsing. Or parsing just a
little bit, but not everything; you get the idea.

An interesting side question is just how do people parse natural language in
real time so accurately. Or do people really parse at all, in the algorithmic
sense?

~~~
dmoney
> An interesting side question is just how do people parse natural language in
> real time so accurately.

We have hardware support for massive neural nets :) I would guess that as we
read/listen we speculatively build several different parse trees that could
represent the sentence, eliminating each one when its probability (based on
the implied semantics) falls too low.
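That guess can be sketched as a toy beam pruner: keep several candidate
analyses alive as each word arrives, and drop any whose probability falls too
far below the best one. The two-way "head/modifier" ambiguity and the scores
below are made up purely for illustration:

```python
import math

# Idle-speculation sketch: incrementally extend candidate analyses and
# prune the improbable ones. All roles and probabilities are invented.
def extend(candidate, word):
    """Yield continuations of a (analysis, log_prob) candidate."""
    analysis, logp = candidate
    # Toy ambiguity: each word could attach as 'head' or 'modifier'.
    for role, delta in (("head", math.log(0.7)), ("modifier", math.log(0.3))):
        yield (analysis + [(word, role)], logp + delta)

def incremental_parse(words, margin=math.log(0.05)):
    beam = [([], 0.0)]
    for w in words:
        beam = [c for cand in beam for c in extend(cand, w)]
        best = max(logp for _, logp in beam)
        # Drop candidates whose probability fell too low relative to the best.
        beam = [(a, lp) for a, lp in beam if lp >= best + margin]
    return beam

survivors = incremental_parse("we parse as we read".split())
```

Without the pruning step the beam doubles at every word; with it, only the
analyses within a constant probability factor of the best survive.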

~~~
jimbokun
That sounds cool.

Do you have any references to research arguing this, or is it just your
personal idea?

~~~
dmoney
Just my idle speculation.

------
einarvollset
I totally agree with this, and the reason is simple: my US friends often fail
to get the dripping sarcasm of my UK ones, so I don't expect a software
package to.

From a practical perspective, it's remarkable how often you can cheat your way
out of needing NLP. Google is in a way a prime example of that.

------
huhtenberg
Just to clarify for everyone's benefit - the guy rants about Natural Language
Processing, not Neuro-Linguistic Programming, which is what the NLP acronym is
commonly used for and which, incidentally, is also quite relevant to the field
of press release crafting.

<http://en.wikipedia.org/wiki/Neuro-linguistic_programming>

------
omarish
<http://nltk.sourceforge.net/>

it's not that hard; you just have to know when to use it.

~~~
ntoshev
NLTK uses a statistical approach, which he apparently doesn't consider to be
"NLP". TF-IDF is also part of that statistical approach. He seems to be
talking about old-school rule-based natural language understanding.
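For reference, TF-IDF itself fits in a few lines without any library; this is
a bare-bones sketch with naive whitespace tokenization:

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each term in each doc by term frequency * inverse doc frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                      # in how many docs each term appears
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({t: (c / len(toks)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = ["the cat sat", "the dog sat", "the dog barked"]
scores = tfidf(docs)
# "the" appears in every doc, so its idf (and hence its score) is zero.
```

Terms that show up everywhere score zero; terms distinctive to one document
score highest, which is the whole trick.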

~~~
nihilocrat
If rule-based NLP is what he's truly referring to, he is actually very much on
the money. Rule-based NLP gets too complicated too quickly and is useless when
analyzing a language other than the target language or analyzing a piece of
text that is poorly-formed (bad grammar/spelling) or heavily peppered with
loan words and loan constructions.

Much like how Good Old-Fashioned AI has proven to be cumbersome and non-
analogous to how biological brains actually work, rule-based NLP is flawed at
its core. Statistical NLP has proven to be much more powerful, extensible, and
tolerant to near-misses.

