
What Happened to Old School NLP? - psygnisfive
http://languagengine.co/blog/what-happened-to-old-school-nlp/
======
danieldk
_We'll start to see the re-emergence of tools from old-school NLP, but now
augmented with the powerful statistical tools and data-oriented automation of
new-school NLP. IBM's Watson already does this to some extent._

This is not a new trend. As early as 1997, Steven Abney augmented
attribute-value grammars with discriminative modelling (maximum entropy
models, in this case) to form 'stochastic attribute-value grammars' [1].
There is a lot of work on efficiently extracting the best parse from packed
forests, etc. Most systems that rely on unification grammars (e.g. HPSG
grammars) already use stochastic models.

In the early to mid 2000s, when modelling association strengths using
structured or unstructured text became popular, old-school parsers started
adopting such techniques to learn selectional preferences that cannot be
learnt from the usually small hand-annotated treebanks. E.g. in languages
that normally have SVO (subject-verb-object) order in main clauses but also
permit OVS order, parsers trained on small hand-annotated treebanks would
often be set on the wrong path when the direct object was fronted (analyzing
the direct object as the subject). Techniques from association strength
modelling were used to learn selectional preferences such as 'bread is
usually the object of eat' from automatically annotated text [2].
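
Very roughly, the idea is something like this toy sketch (my own
illustration in Python, not the actual system described in [2]): mine
(verb, relation, noun) triples from automatically parsed text and turn
their co-occurrence counts into association scores that a disambiguation
model can consult.

    import math
    from collections import Counter

    def association_scores(triples):
        """Pointwise mutual information between a noun and a (verb, relation) slot."""
        triple_counts = Counter(triples)
        slot_counts = Counter((v, r) for v, r, _ in triples)
        noun_counts = Counter(n for _, _, n in triples)
        total = len(triples)
        return {
            (v, r, n): math.log((c / total) /
                                ((slot_counts[(v, r)] / total) *
                                 (noun_counts[n] / total)))
            for (v, r, n), c in triple_counts.items()
        }

    # Toy stand-in for an automatically annotated corpus; in practice the
    # triples would come from millions of parsed sentences.
    corpus = ([("eat", "obj", "bread")] * 50 + [("eat", "subj", "man")] * 40 +
              [("eat", "subj", "bread")] * 1 + [("buy", "obj", "bread")] * 30)
    scores = association_scores(corpus)

    # The disambiguation model can now prefer the object reading of a
    # fronted 'bread' over the (wrong) subject reading:
    print(scores[("eat", "obj", "bread")] > scores[("eat", "subj", "bread")])  # True

Real systems of course estimate this much more carefully (smoothing,
frequency thresholds, etc.), but that is the basic shape of it.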

In recent years, learning word vector representations using neural networks
has become popular. Again, not surprisingly, people have been integrating
such vectors as features in the disambiguation components of old-school NLP
parsers, in some cases with great success.
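
Just to make that concrete: a minimal, hypothetical sketch (the names and
tiny dimensions are made up) of dense word vectors sitting next to the
traditional sparse indicator features in a linear disambiguation model.

    import numpy as np

    # Stand-in for pre-trained word vectors (e.g. word2vec or GloVe);
    # three dimensions only to keep the example small.
    EMBEDDINGS = {
        "eat":   np.array([0.2, -0.1, 0.7]),
        "bread": np.array([0.5,  0.3, -0.2]),
    }
    UNK = np.zeros(3)

    def candidate_features(head, dep, relation, sparse_index):
        """Indicator features concatenated with head/dependent word vectors."""
        sparse = np.zeros(len(sparse_index))
        key = (head, dep, relation)
        if key in sparse_index:
            sparse[sparse_index[key]] = 1.0
        dense = np.concatenate([EMBEDDINGS.get(head, UNK),
                                EMBEDDINGS.get(dep, UNK)])
        return np.concatenate([sparse, dense])

    # The disambiguation model scores each candidate attachment with a
    # learned weight vector and keeps the highest-scoring analysis.
    sparse_index = {("eat", "bread", "obj"): 0, ("eat", "bread", "subj"): 1}
    weights = np.zeros(len(sparse_index) + 6)  # learned in practice, zeros here
    print(weights @ candidate_features("eat", "bread", "obj", sparse_index))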

tl;dr, the flow of ideas and tools from new-school NLP to old-school NLP has
been going on ever since the statistical NLP revolution started.

[1]
[http://ucrel.lancs.ac.uk/acl/J/J97/J97-4005.pdf](http://ucrel.lancs.ac.uk/acl/J/J97/J97-4005.pdf)

[2]
[http://www.let.rug.nl/vannoord/papers/iwptbook.pdf](http://www.let.rug.nl/vannoord/papers/iwptbook.pdf)

~~~
SixSigma
The problem is slippery though: people always reuse words to express
different ideas:

"We can't buy any bread because we haven't got any bread."

And it's not just English. In Chinese one is taught that "ni hao ma?" is the
greeting equivalent to "hello, how are you" but try it on a Chinese person and
it amuses them. My Chinese friend at Uni says that Chinese people use "Ni chi
ma?" which is literally "have you eaten?" (although we both appreciate that is
a bit of a generalisation for 1bn people).

~~~
danieldk
> The problem is slippery though: people always reuse words to express
> different ideas: "We can't buy any bread because we haven't got any
> bread."

As long as we are talking about syntactic parsing, this is not a problem,
provided the attachment is the same. In both cases 'bread' is the direct
object of the main verb.

Of course, there are cases where a particular word can be used _both_ as a
direct object and a subject of a particular verb. E.g.:

 _The man ate the pig._

 _The pig ate the apple._

Of course, what such systems are learning are not rules, but probability
distributions that combine information about the distributions of word
orders, association strengths between heads and dependents in a particular
dependency relation, configurations of dependent pairs, etc.
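
As a toy illustration of that (a maxent-style sketch of my own, not any
particular parser): each candidate analysis gets a feature set, the features
carry learned weights, and the exponentiated scores are normalized into a
distribution over analyses.

    import math

    # Hypothetical learned feature weights.
    WEIGHTS = {
        "order:preverbal_noun_as_subj": 1.2,   # word-order preference
        "assoc:eat_subj_pig":           0.9,   # head-dependent association strengths
        "assoc:eat_obj_pig":            1.4,
        "assoc:eat_obj_apple":          2.0,
    }

    def score(features):
        return sum(WEIGHTS.get(f, 0.0) for f in features)

    def distribution(candidates):
        """Normalize exponentiated scores into probabilities over analyses."""
        raw = {name: math.exp(score(feats)) for name, feats in candidates.items()}
        z = sum(raw.values())
        return {name: v / z for name, v in raw.items()}

    # Two competing analyses of "The pig ate the apple".
    candidates = {
        "pig=subj, apple=obj": ["order:preverbal_noun_as_subj",
                                "assoc:eat_subj_pig", "assoc:eat_obj_apple"],
        "pig=obj, apple=subj": ["assoc:eat_obj_pig"],
    }
    print(distribution(candidates))  # the first analysis wins, but only probabilistically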

~~~
psygnisfive
I think they're saying tho that in one case "bread" means a kind of food, and
in the other, it means money. If the different uses were more directly
connected to particular words that could disambiguate, it wouldn't be too hard
-- head features typically can do this -- but the disambiguation here is far
more conceptual in origin. You know that you need money to buy things, and
so you know that not having money is a good reason for not being able to buy
things. But bread-the-food is not needed to buy things, so not having it isn't
a good reason for not being able to buy things. So probably the second "bread"
is the bread-money version. This kind of disambiguation is super tricky
without broader world knowledge.

~~~
SixSigma
Ah, also I just remembered the thing I was "quoting"

Rick: Why haven't we got any bread?

Neil: Because we haven't got any bread.

~~~
beobab
The Young Ones?

~~~
SixSigma
yes

------
agentile
Reading this article, particularly the part about sentiment analysis, was
interesting to me because last year I did my thesis[1] on sentiment
classification using a somewhat mixed approach (albeit pretty simple), where
I factored in basic sentence structure in addition to word features to see
improvements in accuracy. I found it really neat to see various cases where
particular sentence structures like PRP RB VB DT NN would be much more
likely to show up for a positive sentiment, e.g. "I highly recommend this
product", vs a negative sentiment, e.g. "They totally misrepresent this
product".
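
For the curious, a rough sketch of that kind of mixed feature set (a toy
reconstruction of the idea, not the thesis code; the tiny hand-written
tagger only stands in for a real POS tagger):

    from collections import Counter, defaultdict

    # Stand-in tagger; a real system would use a trained POS tagger.
    TOY_TAGS = {"i": "PRP", "they": "PRP", "highly": "RB", "totally": "RB",
                "recommend": "VB", "misrepresent": "VB", "this": "DT",
                "product": "NN"}

    def features(sentence):
        words = sentence.lower().split()
        tags = [TOY_TAGS.get(w, "NN") for w in words]
        feats = [f"word={w}" for w in words]      # word features
        feats.append("pos=" + " ".join(tags))     # sentence-structure feature
        return feats

    def train(examples):
        counts = defaultdict(Counter)             # label -> feature counts
        for sentence, label in examples:
            counts[label].update(features(sentence))
        return counts

    def classify(counts, sentence):
        # Naive additive scoring with add-one smoothing.
        def s(label):
            total = sum(counts[label].values())
            return sum((counts[label][f] + 1) / (total + 1)
                       for f in features(sentence))
        return max(counts, key=s)

    model = train([("I highly recommend this product", "positive"),
                   ("They totally misrepresent this product", "negative")])
    print(classify(model, "I highly recommend this"))   # positive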

I get the impression that while it is true the computational side of
computational linguistics has seen more attention for lucrative reasons, now
that it is seeing some success there, more people are trying to incorporate
more from the linguistic side, where it doesn't incur a huge amount of
computational expense.

It doesn't seem like anything new, however, that business needs drive funding
for particular areas in academia. Sadly, more so than ever considering the
greed of the school systems (but that is another topic).

[1]
[https://digital.lib.washington.edu/researchworks/handle/1773...](https://digital.lib.washington.edu/researchworks/handle/1773/24983)

~~~
psygnisfive
Yep, I agree with what you say. Tho I would dispute calling a POS gloss much
of a structure. When I think structure, I think full syntax. Parse + formal
features. Or a full logical formula for the semantics! Now _that's_
structure!

~~~
agentile
Fair enough; I think in my thesis I referred to it as a sentence
representation. I think it may actually serve as a good example of some of
the compromises being made in application. Sometimes people need some of
that high-level useful information, but not within a deeper, more
comprehensive format/structure, for efficiency's sake, even if some
knowledge and subsequent accuracy is lost. My experience was that some of
the tools out there in NLP land didn't always make this an easy thing to do.

------
dsfsdfd
I think it's more likely that we will do machine learning to learn the
syntactic structure, rather than hand-craft these pieces of machinery. For a
long time we have tried to create intelligent machines by designing the
solution to a problem; now at last we are designing machines that solve
problems themselves, and we are finding success. I see no reason to imagine
that going back to the old school methods, with a layer of the new magic on
top, is going to be effective in the medium term - in the short term,
possibly, but briefly.

~~~
psygnisfive
I think it's going to be machine learning too, just over richer structures.
And actually, when you need parsing, that's what you put together. But few
people need real parsing right now. Some things, like meaning, will, I
think, have to be more old school. There's no good way to learn meanings
right now at all.

------
JacobiX
The beginning of the article reminds me of the quote: "Every time I fire a
linguist, the performance of our speech recognition system goes up." But
nowadays statistical NLP systems regularly use syntactic and semantic
information as features in the learning phase.

~~~
psygnisfive
It really depends on what you count as syntactic and semantic information. As
a linguist, to me syntactic information is tree structures, syntactic
categories, etc., and semantics is formulas in some (typically higher-order)
logic. But for a lot of the NLP that I see, "syntax" is pretty shallow stuff
like head words and POS tag contexts, and "semantics" is at best things like
word vectors or maybe dependency trees. These are very different. But maybe we're
thinking of different things! :)

~~~
danieldk
There have been some inflated claims, e.g. people calling their part-of-speech
tagger a shallow parser or their shallow parser (e.g. chunking plus some
rules) a parser :).

But I think in general computational linguists would say that dependency trees
are definitely syntax.

~~~
psygnisfive
It depends on the kind of dependency trees. I've seen plenty of semantic-y
dependency trees and plenty of syntactic-y dependency trees. I'm not sure how
common either really is, but on the semantics side, it's the best you
get, usually, and isn't all that good for semantics. It's fine for syntax,
more or less.

~~~
danieldk
I think that if you say: dependency treebank, most computational linguists
will expect syntactic dependencies. I agree that the notion is blurred
sometimes, e.g. by including word senses or preferring semantic attachments in
some cases over syntactic attachments. There may also be treebanks of the
semantic-y kind, but I haven't seen or used them often.

------
lazzlazzlazz
NetBase ([http://www.netbase.com](http://www.netbase.com)) uses this kind of
"old-school" NLP (with a large team of full-time linguistics PhDs) augmented
with statistical tools and increasingly sophisticated forms of automation.

The end product is more accurate and quicker to adapt than the industry is
used to.

Disclosure: I work in the engineering team at NetBase.

~~~
danieldk
Unfortunately, the website has marketing all over it. The only thing I could
find on Google scholar was about how Netbase is doing parsing of Chinese,
based on phrase chunking and then extracting dependency relations using phrase
chunks.

I wonder if this is true for all the languages, since this is usually
considered shallow parsing (and not deep parsing, a term that appears in
the white papers). Constructing an 'old-school' grammar for deep syntactic
analysis (I am thinking HPSG, CG, or LFG-style here) of 42 languages is a
_very_ tedious task if you want to get any decent coverage.

~~~
lazzlazzlazz
Not all languages receive the "deep" treatment. Eight, going on nine, have
full-time linguists working on developing grammars.

------
ryanmim
This is a pretty good explanation of why almost all practical applications of
NLP are now accomplished by statistics rather than fancy linguistic grammar
models you might have read about in a Chomsky book.

Old school NLP has always fascinated me though, and I'm pretty excited about
what might be possible in the future by using more than purely statistical
methods for accomplishing NLP tasks. Maybe the author could have speculated
more wildly in his prognostication ;)

~~~
sqrt17
It's important to make a distinction between (i) Chomskyan linguistics, (ii)
90s style symbolic systems, (iii) 90s/early 2000s style statistical systems
and (iv) 2010s style statistical systems.

Chomskyan linguistics assumes that statistics and related stuff is not
relevant at all, and that instead you need to find the god-given (or at least
innate) Universal Grammar and then everything will be great. 90s style
symbolic systems adopt a more realistic approach, relying on lots of
heuristics that kind of work but aim at good performance rather than
unattainable perfection; 90s style statistical models give up some of the
insights in these heuristics to construct tractable statistical models.

If you look at 2010s style statistical models, you'll notice that machine
learning has become more powerful and you can use a greater variety of
information, either using good linguistic intuitions (which help even more
with better learning algorithms, but require a certain expressivity as well as
some degree of matching between the way of constructing the features and the
classification) or unsupervised/deep-NN learning, which constructs
generalizations over features.

The main reason that you won't ever see people talking about systems with
great machine learning and great linguistic intuitions is that you normally
want to treat one of them as fixed and focus on improving the other, i.e.,
it's more a practical/cultural difference than an actual limitation.

~~~
psygnisfive
Actually this isn't true, wrt Chomsky. Chomskyan linguistics assumes
statistics is very important (and this has been noted by Chomsky himself since
at least the early 1960s). Chomsky simply argues that statistics is
insufficient on its own. And in truth, most NLPers believe this, but they
rarely admit it. Most/all NLP projects have some form of "universal grammar",
tho usually it's something like a regular grammar (~ a Markov chain) or at best
a probabilistic CFG (PCFG). I suspect the reason is that, to some extent,
hierarchical structures like this seem so natural that it's hard to imagine
what else you could do, so there's a tendency to treat CFGs as not even a
grammar choice, but it is one. There are other kinds of grammars (such as pregroup
grammars) which lack these notions of hierarchy but work perfectly well for
the same domains as CFGs, just in very different ways.

~~~
sqrt17
Quoting a symposium with Chomsky talking about statistical AI:
[http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html](http://languagelog.ldc.upenn.edu/myl/PinkerChomskyMIT.html)

"I think there have been some successes, but a lot of failures. The successes,
to my knowledge at least, are those that integrate statistical analysis with
some universal grammar properties, some fundamental properties of language;
when they're integrated, you sometimes get results [...] On the other hand
there's a lot work which tries to do sophisticated statistical analysis, you
know bayesian and so on and so forth, without any concern for the uh actual
structure of language, as far as I'm aware that only achieves success in a
very odd sense of success. There is a notion of success which has developed in
computational cognitive science in recent years which I think is novel in the
history of science. It interprets success as approximating unanalyzed data."

The model that has become dominant in statistical AI (positing a basic
grammar that is strongly underconstrained and eliminating spurious analyses
not through "universal grammar", i.e. presupposed innate structures, but
through learned parameters) is something that Chomsky has been very much
against. Simultaneously, work that models grammar with enough precision that
you could derive predictions from it (e.g. Ed Stabler's grammar
implementation) is seen as nice-to-have but not central to the undertaking
of generative grammar.

And I think Chomsky put his thumb right on the difference in goals: Chomsky
defines progress in linguistics as work that posits the right ("universal")
structures, and argues that these are cognitively real and innate, whereas
statistical AI is more interested in predicting useful things with structures
that may or may not correspond to anything that is cognitively real.

To people nowadays, the whole notion of constrained "universal" models with
few statistics versus underconstrained "statistical" models seems to be a very
minor one, since today's statistical models have a lot of structure, and
people doing generative grammar aren't totally opposed to using statistics or
optimality theory to select most-plausible structures. But, back in the day,
when the most expressive statistical models people used were HMMs [hidden
Markov model - a probabilistic regular grammar] and PCFGs [probabilistic
context-free grammars], the gap was much wider, whereas nowadays the models
are a bit more similar while the goals are still different.

------
VLM
There is a greater economic lesson here: tech does not necessarily have the
driver's seat in the economy.

"but with the advent of computers, it became possible to monetize NLP and the
priority shifted to making products people would buy, rather than a system
that was scientifically correct."

The competition for an NLP computer program is not another NLP computer
program, but call centers in India, the Philippines, onshore prison labor,
that kind of "support".

------
sdoering
I really liked the article, and some of the blog headlines seemed
interesting as well. But try as I might, I was not able to find an
RSS/Atom/XML feed for plugging this resource into my feed reader. Sadly, I
will probably miss upcoming interesting posts.

~~~
psygnisfive
I'm working on it, have no fear! I'm hand coding the blog right now because I
don't want to use a CMS; I really don't like the options I have. I might just
hand-code an RSS feed.

~~~
zhenjl
Try a static blog generator like hugo or octopress...

~~~
psygnisfive
yeah, people have suggested a variety of things. i'll add those to my list of
static generators to investigate, thanks :)

------
dschiptsov
The subtle ideas from the original "Structure Of Magic" books about how we
construct our internal representations of reality depending on the wording
we use have been replaced by an industry of coaches and consultants.

The ideas, by the way, have been studied by mainstream psychology as the
framing effect and the priming effect.

In short, our minds do lexical analysis and decomposition subconsciously, so
we can be influenced by specially crafted sentences. We also leak details of
our internal representation of some aspects of reality in the way we
unconsciously construct sentences.

~~~
danieldk
Note: NLP here means Natural Language Processing, not Neurolinguistic
Programming.

~~~
dschiptsov
Oh, sorry. Too old school.)

