
How to Get into Natural Language Processing - craigcannon
https://blog.ycombinator.com/how-to-get-into-natural-language-processing/
======
d_burfoot
> Why is NLP Hard? ... Language is highly ambiguous - it relies on subtle cues
> and contexts to convey meaning.

This is true, but it is only part of the answer.

Another part of the answer is what I call the Long Tail of Grammar. It turns
out that if you try to write down all the rules of grammar, you will not get
40 or 60 rules, but something more like 100s or maybe even 1000s of rules.
Most of those rules are obscure, rare, archaic, or useable only in specific
contexts or with specific words. However, they are part of the language, a
native speaker will be able to use them and comprehend them without
difficulty, and an NLP system must be able to "understand" them in order to
extract the correct meaning from a sentence.

As just a minor example off the top of my head, compare the phrase "peeled
peach" with "hairy-peeled peach". The former phrase means a peach without a
peel, while the latter means a peach with a hairy peel. So a good NLP system
must not only recognize the existence of the two grammatical rules, but also
be able to disambiguate them correctly.
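
If you want to see what an off-the-shelf parser actually does with the two phrases, a few lines with something like spaCy (not mentioned in the article; "en_core_web_sm" is just its standard small English model) let you inspect the attachments:

    # Sketch: inspect how a statistical parser attaches the modifiers
    # in each phrase. Requires a spaCy English model, e.g.
    #   python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for phrase in ["a peeled peach", "a hairy-peeled peach"]:
        doc = nlp(phrase)
        print(phrase, [(t.text, t.dep_, t.head.text) for t in doc])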

~~~
chch
> or useable only in specific contexts or with specific words.

A good example of this is the Winograd Schema. You might think you can figure
out a good algorithm for anaphoric resolution (i.e. If you see "Sally called
and she said hello.", who is "she"?) that just relies on the structure of a
sentence, without considering semantics.

But here's a counterexample:

"The city councilmen refused the demonstrators a permit because they feared
violence."

Who are 'they'?

"The city councilmen refused the demonstrators a permit because they advocated
violence."

Now who are 'they'?

If you're like most people, even though only the verb changed, the binding of
'they' based on the deeper semantic meaning also changed.

These sentence pairs are called Winograd schemas[1], and there are plenty more
like them.

[1]
[https://en.wikipedia.org/wiki/Winograd_Schema_Challenge](https://en.wikipedia.org/wiki/Winograd_Schema_Challenge)
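
To make this concrete, here is a deliberately naive, structure-only resolver (plain Python, candidate nouns hard-coded for the example) that binds the pronoun to the nearest preceding candidate. It gives the same answer for both sentences, which is exactly the failure mode:

    # Naive anaphora resolution: bind "they" to the nearest preceding
    # candidate noun, using only word order -- no semantics at all.
    CANDIDATES = {"councilmen", "demonstrators"}

    def naive_resolve(sentence, pronoun="they"):
        tokens = sentence.lower().rstrip(".").split()
        idx = tokens.index(pronoun)
        for tok in reversed(tokens[:idx]):
            if tok in CANDIDATES:
                return tok
        return None

    feared = ("The city councilmen refused the demonstrators a permit "
              "because they feared violence.")
    advocated = ("The city councilmen refused the demonstrators a permit "
                 "because they advocated violence.")

    print(naive_resolve(feared))     # demonstrators (humans say: councilmen)
    print(naive_resolve(advocated))  # demonstrators (humans agree here)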

~~~
ice109
Re the council people sentences: I don't understand the problem. They're ill-
defined sentences. We use heuristics to parse them, but those heuristics can
fail (the council denied the demonstrators a permit because they feared
violence... and the council was obliging). Just teach the computer the
heuristics like we learn them.

~~~
chch
That's exactly the issue. The way we learn them is through world experience,
which is sometimes hard to transfer into a computer.

Example:

"I dropped the egg on my glass living room table and it broke!"

"I dropped my hammer on my glass living room table and it broke!"

These are both ill-defined semantically, but if you asked most native English
speakers "what broke" for each sentence, they'd probably say "egg" for the
first and "table" for the second. It could be the other, but it would be
surprising. So, to solve just the "Dropped X on Y, Z broke" problem, we'd need
to teach the computer to understand the effect of the relative 'fragility
scores' of each object. Personally, I never sat down and memorized a chart of
these as a human. You could perhaps use machine learning to derive the data by
analyzing a large corpus of text[1], and match humans most of the time, but
then that's just one sentence type solved, out of any number of other tricky
constructions. So the long tail of semantic understanding quickly becomes a
very fun set of problems to solve, for certain definitions of fun. :)
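
As a toy version of that corpus-mining idea (a sketch, not what the project in [1] actually does), you could count how often each noun shows up as the thing that "broke" and treat that as a crude fragility score:

    # Toy "fragility scores": count how often each noun is the subject of
    # "broke" in a corpus, then guess that the more fragile object broke.
    import re
    from collections import Counter

    corpus = [
        "the egg broke when it hit the floor",
        "the egg broke in the carton",
        "the glass broke into pieces",
        "the table broke under the weight",
        "the hammer dented the floor",
    ]

    fragility = Counter()
    for sentence in corpus:
        for noun in re.findall(r"the (\w+) broke", sentence):
            fragility[noun] += 1

    def what_broke(dropped, landed_on):
        return dropped if fragility[dropped] >= fragility[landed_on] else landed_on

    print(what_broke("egg", "table"))     # egg
    print(what_broke("hammer", "table"))  # table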

A few more examples to consider how you would teach a computer to understand,
from a Winograd Schema corpus[2]:

John couldn't see the stage with Billy in front of him because he is so
[short/tall]. Who is so [short/tall]?

The sculpture rolled off the shelf because it wasn't [anchored/level]. What
wasn't [anchored/level]?

The older students were bullying the younger ones, so we [rescued/punished]
them. Whom did we [rescue/punish]?

I tried to paint a picture of an orchard, with lemons in the lemon trees, but
they came out looking more like [light bulbs / telephone poles]. What looked
like [light bulbs / telephone poles]?

[1] e.g.
[http://cs.rochester.edu/research/lore/](http://cs.rochester.edu/research/lore/)

[2]
[http://www.cs.nyu.edu/faculty/davise/papers/WinogradSchemas/...](http://www.cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.html)

~~~
posterboy
Context is everything.

"A violent mob requested a demonstration from the councilmen. The councilmen
refused the permit, because they feared the violence."

I suspect grammar begets normalization, with primary and secondary keys just
like in relational databases. People are just not very good at it. E.g., I'd
contest the consistency of those _1000 grammar rules_. Case in point: the word
"violence" needs the definite article, because violence is an abstract concept
(which the parent missed). All the while, the indefinite and definite articles
serve other purposes, e.g. the quantifiers from logic (for all, there exists),
which are at odds with the naive countability of violence.

So language is ambiguous, NLP is done probabilistically, and thus it is hard,
with at least exponential complexity.

Edit: What I mean is that the problem here is a contraction omitting context.
Of course databases worked before relational databases, but sometimes you
really want the third normal form.

~~~
yongjik
> Case in point: the word "violence" needs the definite article, because
> violence is an abstract concept (which the parent missed).

No, searching for "inciting violence"/"fearing violence" in Google gives
thousands of hits. (Also in Google Books, if you want to claim all these
websites are wrong.)

It is perfectly OK to use "violence" without an article.

~~~
posterboy
OK, I'm not a native speaker and admittedly didn't double-check my claim before
posting. Anywho, headline and newspaper speak is not exactly the proper
grammar I am talking about, right?

Wouldn't omitting the _the_ in my sentence mean they feared violence in
general? Sure, _broad contracts_ are welcome, but then there couldn't be a
specific answer to who _they_ are. I guess that's in agreement with what you
said.

------
blcArmadillo
I think this is a good idea for a series, although more detail needs to be
given on the actual path, which is, after all, the purpose of the series. Most
of this article seemed to describe what NLP is and why it's hard. That isn't
bad, and some attention should be given to it, but people looking for a path
into NLP will already be familiar with most of this information. I was
expecting a bit more of a syllabus-type format. There was mention of needing
some college-level algebra and statistics; I would have liked more detail in
this area, with links to more resources (classes, articles, datasets, etc.).
Keep up the good work!

~~~
mswen
Agreed that more substantive detail is needed. I was surprised at the lack of
mention of many of the basic techniques and domains that an interested person
should consider learning about.

The following are all germane but not mentioned: text analysis/mining,
controlled vocabularies, indexing, taxonomies, ontologies, the semantic web,
latent semantic analysis, latent Dirichlet allocation, corpus analysis,
document similarity analysis, tf-idf, n-grams, and skip-grams, just to mention
a few.
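
For anyone who hasn't met those terms, here is a quick taste of two of them, tf-idf and latent Dirichlet allocation, using scikit-learn on toy documents (just to show the shape of the APIs, nothing more):

    # tf-idf weights terms by how distinctive they are per document;
    # LDA fits an unsupervised topic model over raw term counts.
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "stock prices fell on weak earnings",
        "the market rallied after the earnings report",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    print(tfidf.shape)  # (4, number of distinct terms)

    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    print(lda.transform(counts))  # per-document topic mixtures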

In general the article is a good idea, but there needs to be more of a
description of the domain landscape and then "paths" plotted through that
landscape that lead to interesting and useful competency.

~~~
vincentschen
That's a great point. I wonder if there would be a better way to introduce
meaningful, actionable topics of study to an introductory-level audience of
people who may have never heard of NLP.

------
andrewtbham
I researched deep learning for NLP for a year and compiled this list of papers
and articles about some of the most interesting topics.

[https://github.com/andrewt3000/DL4NLP/blob/master/README.md](https://github.com/andrewt3000/DL4NLP/blob/master/README.md)

~~~
p1esk
Have you built anything interesting?

~~~
andrewtbham
I did some kaggle contests using image stuff.

With regards to NLP, I have a site that uses a spider to collect headlines for
stocks, and I have been working on clustering, sentiment analysis, and text
summarization. But I haven't completed it yet.

[http://www.teslanewsfeed.com/](http://www.teslanewsfeed.com/)

------
hobofan
I like the idea of the Paths series, though some of the points in this first
article read like they could be written about most "emerging technologies".
Anyway, I'm looking forward to the next one!

The two questions about PhDs do feel a little bit misplaced for a startup
audience. Who here stops and thinks "Am I supposed to have a PhD to do that?"
when setting out to start something new? (<insert theranos reference here>)

~~~
fauigerzigerk
I think a better question would be: How much math do I need to approach NLP in
a way that enables me to be among the best?

A PhD is just an academic title, and as such it is neither a necessary nor a
sufficient prerequisite for approaching NLP from a mathematical angle.

~~~
hobofan
> How much math do I need to approach NLP in a way that enables me to be among
> the best?

The best in what? If you mean pushing the boundaries of research, then yes,
your path there will likely involve a PhD. If you mean building the best
technology products, then being able to read, understand, and implement the
biggest recent advances is enough, and that usually requires far less
mathematical knowledge.

(That was intentionally written generically since I think that it applies to
more than just NLP)

~~~
fauigerzigerk
_> If you mean pushing the boundaries of research, yes, then your path there
will likely involve a PhD._

Yes, I mean pushing the boundaries, but I think it is important to stress that
a PhD is not a prerequisite for doing that. Anyone who is talented enough can
learn the necessary math.

Getting paid for research work is a different story of course. A PhD
undoubtedly helps with that, but this is Hacker News. People might figure
something out.

~~~
vincentschen
Perhaps the distinction between "analysts" and "builders" would be better
context for discussing math background in startups than PhDs.

------
deegles
There are a ton of libraries and tools available for NLP, so I feel that side
is relatively mature.

What I want are more tools for Natural Language Generation. Can anyone
recommend some good ones? (beyond what's on Wikipedia)

~~~
will_pseudonym
I'm not sure what methods they use, but the "single sentence reply
suggestions" created by Google's Inbox are the highest quality natural
language generation that I have come across.

~~~
tyingq
I believe this is the paper that describes the approach they are using:
[http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf](http://www.kdd.org/kdd2016/papers/files/Paper_1069.pdf)

Looks less like NLG, and more like picking existing responses from a (probably
huge) corpus using ML. Hard to replicate unless you have access to the kind of
data Google has.

~~~
will_pseudonym
If you think about it, all natural language generation is simply picking from
a list of possible words to go next. They just do that at a sentence level.
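
For illustration only, here is a toy version of "picking the next word": a bigram table over a tiny made-up corpus, sampled one word at a time (Smart Reply obviously does something far more sophisticated):

    # Toy next-word picker: build a bigram table from a tiny corpus and
    # sample one word at a time until a sentence-ending token appears.
    import random
    from collections import defaultdict

    corpus = ("thanks for the update . thanks for the help . "
              "see you soon . sounds good to me .").split()

    nexts = defaultdict(list)
    for prev, cur in zip(corpus, corpus[1:]):
        nexts[prev].append(cur)

    def generate(start="thanks", max_len=10):
        out = [start]
        while len(out) < max_len and out[-1] in nexts and out[-1] != ".":
            out.append(random.choice(nexts[out[-1]]))
        return " ".join(out)

    print(generate())  # e.g. "thanks for the update ."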

~~~
tyingq
Heh. I suppose, if you have enough sentences to choose from for the various
combinations of words, tenses, conjugations, prepositions, and so forth. Can't
imagine there are many entities that would have that much data.

------
JoeDaDude
I'm a little surprised that GATE [1], the General Architecture for Text
Engineering tool, is not mentioned. It is incredibly flexible, open source,
and has a very long track record as a research and prototyping tool.

[1] [https://gate.ac.uk/](https://gate.ac.uk/)

------
visarga
If you want to play with NLP, then just try Gensim, sklearn, and Keras. If
you're serious about NLP, it's hard stuff; you need a PhD in the field.

In a way, vision is easier. Instead of discrete symbols (words), it's a
continuous signal, which is much easier to interpret and generate with neural
networks. By comparison, the best language models are behind the best image
generation models (2-3 years behind, in my estimation).

For example, there are few applications of GANs to text, and many applications
to images, GANs being the hottest thing in deep learning now. So you have to
keep in mind that NLP is by and large still not solved. There is no decent
conversational chat bot yet. We can reason over small pieces of text but that
is far from full understanding. NLP at this level is hard.

What you can easily do now is classify text, detect sentiment and entities,
compute word vectors, do grammatical parsing, and do summarization. All of
that is low-level stuff.
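
For example, text classification with the libraries mentioned above is only a handful of lines; a minimal sketch with scikit-learn on toy data and labels:

    # One of those "low level" tasks: bag-of-words text classification.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "great movie, loved it",
        "terrible plot, waste of time",
        "wonderful acting and a touching story",
        "boring and way too long",
    ]
    train_labels = ["pos", "neg", "pos", "neg"]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)
    print(clf.predict(["loved the acting", "what a waste"]))  # predicted labels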

~~~
amelius
So what you are saying is that a computer would be able to understand sign-
language more easily than spoken language?

~~~
visarga
Computers can transcribe spoken language into text with great accuracy, but
they can't yet understand the meaning of text at the same level of accuracy.
Meaning is much harder than simple transcription. Voice recognition is to
speech what OCR is to print. What we want is to speak to computers and have
them understand what we mean, the way humans do. Such an AI would be able to
carry a conversation, extract data from and reason over documents, or perform
complex actions based on verbal commands. It would need a good physical and
conceptual understanding of the world; otherwise it could not reason.

------
p1esk
NLP right now looks like computer vision did 5 years ago: DL methods are
starting to work really well, so a lot of "traditional" methods for processing
text might soon become obsolete.

The goal is to just feed gigabytes of raw text to a huge, complex neural
network, and hope it will extract relevant features.
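
A minimal sketch of that recipe, assuming something like Keras (tf.keras style; toy sizes, and you still have to turn the text into integer ids yourself):

    # The end-to-end flavor in miniature: integer-encoded text in, label out,
    # and the network learns its own features along the way.
    from keras.models import Sequential
    from keras.layers import Input, Embedding, LSTM, Dense

    vocab_size, seq_len = 20000, 100

    model = Sequential([
        Input(shape=(seq_len,)),          # integer word ids
        Embedding(vocab_size, 128),       # learned word vectors
        LSTM(64),                         # learned sequence features
        Dense(1, activation="sigmoid"),   # e.g. sentiment
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    # model.fit(x_train, y_train, ...) once you have integer-encoded text.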

~~~
demonshalo
The problem is datasets. How can you distinguish a good result from a bad
result? In some cases, depending on the user, it could be both at the same
time.

Most advancements in ML are not accomplished by some new super algorithm.
Rather, advancements are reached when new datasets are presented!

------
ktRolster
This book was really helpful for me when I was getting started with natural
language processing: [http://www.nltk.org/book/](http://www.nltk.org/book/)

It's practical, readable, and it's free.
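
For a taste of what the early chapters cover, assuming you've fetched the tokenizer and tagger models via nltk.download():

    # Tokenize and part-of-speech tag a sentence. Needs the 'punkt' and
    # 'averaged_perceptron_tagger' resources from nltk.download().
    import nltk

    tokens = nltk.word_tokenize("I love flying planes.")
    print(nltk.pos_tag(tokens))
    # e.g. [('I', 'PRP'), ('love', 'VBP'), ('flying', 'VBG'), ('planes', 'NNS'), ('.', '.')]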

------
danso
Love the concept of this "How to" series. Seems like it'd be a good
opportunity to spotlight the interesting HN threads on any given topic.

e.g. for NLP:

\-
[https://news.ycombinator.com/item?id=11686029](https://news.ycombinator.com/item?id=11686029)

\-
[https://news.ycombinator.com/item?id=11690212](https://news.ycombinator.com/item?id=11690212)

\-
[https://news.ycombinator.com/item?id=1839611](https://news.ycombinator.com/item?id=1839611)

------
posterboy
> Take this simple example: “I love flying planes.”

> Do I enjoy participating in the act of piloting an aircraft? Or am I
> expressing an appreciation for man-made vehicles engaged in movement through
> the air on wings

Clearly the latter, as the former begets the infinitive, "I love to fly ...".

Maybe I am wrong (going by the American usage of the gerund, I clearly am),
but then "I want going flying" sounds ridiculous in any case. Maybe I am
missing the difference, so as a second-language speaker, I'd love to be
corrected.

~~~
posterboy
note to self: I read, it might not be the gerund in this case, but the past
participle.

------
rstuart
If you are interested in working in NLP, feel free to reach out to Kapiche.
The website, Twitter or hello at kapiche dot com are all good options.

------
_spoonman
I think "Paths" is a terrific idea. There have been times where I've wanted to
do a "first principles" look at a topic but don't want to go back through my
HN upvotes. "Paths" allows for a curated and practical advice-driven jumping
off point. Looking forward to more content. Best of luck with it!

~~~
vincentschen
So glad that you enjoyed the post! What are other topics that you've wanted to
hear about?

------
elchief
[http://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf](http://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf) is fun
and easy. Text analysis with bash.

------
beders
Fun problem: Write a parser for the English language. See it fail at tweets :)

~~~
divbit
I dont to need to write a program to do that.

------
fnl
> text summarization are examples of NLP in real-world products

Can someone point me to a satisfying demo of professional text summarization
software?

~~~
whodunser
The autotldr bot on reddit gets a lot of praise:

[http://smmry.com](http://smmry.com) (demo here)

[https://np.reddit.com/r/autotldr/comments/31bfht/theory_auto...](https://np.reddit.com/r/autotldr/comments/31bfht/theory_autotldr_concept/)

~~~
rsivapr
Worth mentioning here: the text summarization engines are pretty much only
usable on news articles or other text with clean paragraph structure (yes,
non-tech literature works too). Popular text summarization tools, as they
stand now, fail on anything else.
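
You can see why from even a toy extractive summarizer (a sketch, not how any particular tool works): everything hinges on splitting the text into sentences and scoring them, so messy structure breaks it immediately.

    # Toy extractive summarizer: split into sentences, score each sentence
    # by the corpus frequency of its words, keep the top n in original order.
    import re
    from collections import Counter

    def summarize(text, n=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))

        def score(sentence):
            return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

        top = set(sorted(sentences, key=score, reverse=True)[:n])
        return " ".join(s for s in sentences if s in top)

    article = ("The council met on Tuesday. The council discussed the new permit "
               "rules for demonstrations. Residents raised concerns about traffic. "
               "The meeting ended after two hours.")
    print(summarize(article))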

~~~
fnl
I actually disagree with both of you. SMMRY/AutoTLDR does an acceptable job
when you paste in URLs of the latest news, but it's not something I would
actually want to consume. It's more of a showcase that summarization AI has
made some huge progress in recent years, but it's still not at a point where
I'd pay for it as a service.

------
demonshalo
How? Just get started working on a fun problem. A good place to start is
keyword extraction. You don't need a PhD or expensive tools. All you need is
some free time and a willingness to read some cool stuff.

Copy a few articles into text files and get working on implementing some of
these methods until you have enough of an understanding to construct your own
methods for the fun of it.
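
As a first baseline before the fancier methods below, tf-idf over your folder of articles already gets you surprisingly far. A minimal sketch with scikit-learn, where "articles/*.txt" is just wherever you saved your text files:

    # Crude keyword extraction: score each word in an article by tf-idf
    # against the other articles in the folder, keep the top few.
    import glob
    from sklearn.feature_extraction.text import TfidfVectorizer

    paths = sorted(glob.glob("articles/*.txt"))  # your copied articles
    docs = [open(p, encoding="utf-8").read() for p in paths]

    vec = TfidfVectorizer(stop_words="english")
    weights = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()

    for path, row in zip(paths, weights):
        row = row.toarray().ravel()
        keywords = [terms[i] for i in row.argsort()[::-1][:5]]
        print(path, keywords)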

Here's some good reading material:

[https://www.facebook.com/notes/facebook-engineering/under-the-hood-the-entities-graph/10151490531588920/](https://www.facebook.com/notes/facebook-engineering/under-the-hood-the-entities-graph/10151490531588920/)

[https://www.researchgate.net/profile/Stuart_Rose/publication...](https://www.researchgate.net/profile/Stuart_Rose/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents/links/55071c570cf27e990e04c8bb.pdf)

[http://cdn.intechopen.com/pdfs/5338.pdf](http://cdn.intechopen.com/pdfs/5338.pdf)

[https://arxiv.org/pdf/1603.03827v1.pdf](https://arxiv.org/pdf/1603.03827v1.pdf)

[https://www.quora.com/Sentiment-Analysis-What-are-the-good-ways-to-extract-topics-keywords-from-a-text-paragraph-article](https://www.quora.com/Sentiment-Analysis-What-are-the-good-ways-to-extract-topics-keywords-from-a-text-paragraph-article)

[http://hrcak.srce.hr/file/207669](http://hrcak.srce.hr/file/207669)

[http://nlp.stanford.edu/fsnlp/promo/colloc.pdf](http://nlp.stanford.edu/fsnlp/promo/colloc.pdf)

[https://arxiv.org/ftp/cs/papers/0410/0410062.pdf](https://arxiv.org/ftp/cs/papers/0410/0410062.pdf)

[http://delivery.acm.org/10.1145/1120000/1119383/p216-hulth.p...](http://delivery.acm.org/10.1145/1120000/1119383/p216-hulth.pdf?ip=90.152.24.166&id=1119383&acc=OPEN&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E6D218144511F3437&CFID=890263035&CFTOKEN=90663898&__acm__=1484938921_1ba7ae06773c926d1f8efae701324613)

Edit: Don't get deterred by the math formulas in these papers. They look far
more complicated than they actually are.

~~~
garysieling
Another fun thing is to paste article text into some API, like the Watson
demo, so you can see what kinds of things are possible:

[https://alchemy-language-demo.mybluemix.net/](https://alchemy-language-demo.mybluemix.net/)

I played around with this a bit to develop
[https://www.findlectures.com](https://www.findlectures.com), so knowing what
works/doesn't work there I'm developing some NLP scripts to support my use
cases.

~~~
demonshalo
I never thought about this particular use-case. The subtitles for TED talks
should be an ocean of info for you to extract keywords from :D Pretty neat
site you've got there. I will be using it. Thanks!

------
edblarney
I strongly recommend Stanford's YouTube mini-course by Dan Jurafsky & Chris
Manning:

[https://www.youtube.com/watch?v=nfoudtpBV68](https://www.youtube.com/watch?v=nfoudtpBV68)

------
mankash666
Just noting that the article would be better titled "Why to get into ..."

------
earthly10x
One of the best places to start is reading this patent from Berkeley Lab/DOE,
on which word2vec was based:
[https://www.google.com/patents/US7987191](https://www.google.com/patents/US7987191)

~~~
AlexCoventry
What insight does the patent offer that you won't get from reading the
word2vec papers?

