
A Guide to Natural Language Processing - ftomassetti
https://tomassetti.me/guide-natural-language-processing/
======
bhaak
> Essentially, when dealing with natural languages hacking a solution is the
> suggested way of doing things, since nobody can figure out how to do it
> properly.

That's really the TL;DR I also got from the computational linguistic courses I
attended.

There's probably the Pareto principle at work. Having no solution is worse
than having an 80% solution that works well enough, especially when the 100%
solution is much harder to achieve (and some of the problems not even humans
can solve properly).

~~~
cvs268
Couldn't agree more!

Recently I wrote a web-extension for Firefox that displays funny "Deep
thought" quotes.

I wanted to analyse the quote text and fetch relevant images to animate in the
background of the quote text. After reading several NLP tutorials, guess what
I did as a first PoC: pick the 3 longest words in the quote text and run an
image search with those 3 words.

6 lines of plain JavaScript code that can be run anywhere almost instantly:
[https://github.com/TheCodeArtist/deep-thought-
tabs/blob/mast...](https://github.com/TheCodeArtist/deep-thought-
tabs/blob/master/addon-src/deepThoughts.js#L87)

I get relevant images in the search results 99/100 times. The quirks of
searching often mean the image adds to the funniness of the "Deep Thought" on
display.

It's so effective that I ended up publishing the "Deep Thought Tabs" web-extension with this approach as-is: [https://addons.mozilla.org/en-US/android/addon/deep-thought-...](https://addons.mozilla.org/en-US/android/addon/deep-thought-tabs/)

Later I tried using the nlp-compromise js library to identify "topics" of
interest within a quote text - typically nouns, verbs, and adjectives.
Comparing the results with my "3-longest-words" approach, I found that the
longest words were almost always the "topic" words that NLP identified for any
given quote text.
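The 3-longest-words heuristic is easy to sketch; a rough Python equivalent of the JavaScript linked above (the function and variable names here are mine, not from the extension):

```python
import re

def top_longest_words(quote, n=3):
    """Pick the n longest words of a quote to use as image-search keywords."""
    words = re.findall(r"[A-Za-z']+", quote)
    # sorted() is stable, so ties keep their original order in the quote.
    return sorted(words, key=len, reverse=True)[:n]

keywords = top_longest_words(
    "The answer to the ultimate question of life, the universe, "
    "and everything is 42")
```

For the quote above this returns `['everything', 'ultimate', 'question']`, which are plausible search terms.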

~~~
vvanders
That's pretty awesome.

Back in games we'd do all sorts of tricks in networking to make it look like
things were happening (sound effects, decals, etc.) in response to local
events, until the server could provide the definitive call on some game state.

Most players thought we had a much higher-fidelity sim than we actually did.
It's a pretty common technique across a lot of games. You can get away with
quite a bit by being smart about what you "fake" and what you actually make
work end-to-end.

------
nl
Ha, there's a whole section on clones of the summarizer from Classifier4J.

I wrote that in 2003 (I think?) based on @pg's "A Plan for Spam" essay, and
then "invented" the summarization approach (I'm sure others had done it
before, but I thought it up myself anyway).

Turns out it was rather well tuned. The 2003 implementation, presumably
downloaded from sourceforge(!) still wins comparisons on datasets which didn't
even exist when I wrote it[1].

I much prefer the Python implementation though[2], which I hadn't seen before.

Also, Textacy on top of Spacy is awesome for any kind of text work.

[1]
[https://dl.acm.org/citation.cfm?id=2797081](https://dl.acm.org/citation.cfm?id=2797081)

[2]
[https://github.com/thavelick/summarize/blob/master/summarize...](https://github.com/thavelick/summarize/blob/master/summarize.py)
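For reference, the Classifier4J-style summarizer boils down to scoring each sentence by the frequency of the words it contains and keeping the top few in document order; a stdlib-only sketch of that idea (names are mine, not the library's):

```python
import re
from collections import Counter

def summarize(text, num_sentences=2):
    """Naive extractive summarizer: score each sentence by the corpus-wide
    frequency of its words, then return the top sentences in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    scored = [(sum(freq[w] for w in re.findall(r'\w+', s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:num_sentences]
    # Re-sort the winners by position so the summary reads in document order.
    return ' '.join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

Real implementations normalize for sentence length and drop stopwords, but the core idea is just this.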

------
amelius
There are a few applications missing:

\- Answering a question by returning a search result from a large body of
texts. E.g. "How do I change the background color of a page in Javascript?"

\- Improving the readability of a text. The article only mentions
"understanding how difficult to read is a text".

\- Establishing relationships between entities in a body of text. E.g. we
could build a fact-graph from sentences like "Burning coal increases CO2", and
"CO2 increase induces global warming". Useful also in medical literature where
there are millions of pathways.

\- Answering a question, using a large body of facts. Like search, but now it
gives a precise answer.

\- Finding and correcting spelling/grammatical errors.

~~~
BjoernKW
> \- Establishing relationships between entities in a body of text. E.g. we
> could build a fact-graph from sentences like "Burning coal increases CO2",
> and "CO2 increase induces global warming". Useful also in medical literature
> where there are millions of pathways.

That's a simple example because with 'CO2' you at least have the same string
that can serve as a keyword connecting those two facts. Usually in natural
language we make frequent use of anaphora to refer to people, objects and
concepts previously mentioned in the text by name.

Anaphora resolution is one of the really hard problems, not only in NLP but in
linguistics in general. The simplest anaphoric device in languages like
English is the pronoun, and even with pronouns it can be quite difficult to
determine what a 'he' or 'she' refers to in context.
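To see why even pronouns are hard, here is a deliberately naive heuristic (resolve each pronoun to the nearest preceding capitalized word); everything below is a toy of my own, not a real resolver:

```python
import re

PRONOUNS = {'he', 'she', 'it', 'him', 'her', 'they', 'them'}

def resolve_pronouns(text):
    """Toy anaphora heuristic: map each pronoun to the nearest preceding
    capitalized word (a crude stand-in for the last-mentioned entity).
    Real resolvers need syntax, gender/number agreement, world knowledge."""
    last_entity = None
    resolved = []
    for tok in re.findall(r"\w+", text):
        if tok.lower() in PRONOUNS and last_entity:
            resolved.append((tok, last_entity))
        elif tok[0].isupper():
            last_entity = tok
    return resolved
```

On "Alice met Bob. She greeted him." this resolves both "She" and "him" to Bob, because it has no notion of gender or syntax; that failure on a trivial sentence is exactly why the real problem is hard.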

~~~
boxy310
>Anaphora resolution is one of the really hard problems not only in NLP but in
linguistics in general.

This was one of the most frustrating parts of studying Latin rhetoric. The
speakers would keep referring to "That thing I was talking about," and it's a
noun from a subordinate clause 2 and a half paragraphs ago.

~~~
kuschku
That's actually very common in most languages. English is one of the few
Western languages that doesn't do this, which makes it quite complicated for
some people to write English sentences, since in their native language such
distant backreferences and long run-on sentences may be much more common.

------
kinow
A lot to review, read, and learn. Thanks a lot for sharing this. Any plans to
extend it, or to write another one covering even more, like Natural Language
Generation (not limited to bots; we are using it in weather forecasting) and
coreference?

~~~
umilegenio
Thanks. Well, there are interesting things that we had to cut because they
were too advanced for an introductory article. We were thinking about making a
new article for them in a few months. And Natural Language Generation would be
another great topic to talk about.

However, if you already have experience in the topic we would be happy if you
would like to write a guest post for us.

------
fnl
I'm always astonished how little mention gensim gets, considering that it can
basically be used for all the listed tasks, including parsing, if you combine
it with your favorite deep learning library (DyNet, anyone?).

~~~
rpedela
gensim is one of the best libraries for word vectors and summarization. For
parsing and NER, Stanford CoreNLP works best in my experience.

~~~
fnl
Well, a model you fine tune to your specific corpus/domain works even (in
fact: much) better... And gensim there gives you the tools to build the best
possible embeddings.

But you do need a use case and an economic reward to justify the substantial
increase in cost over a pre-trained, vanilla, off-the-shelf parser (model).
Yet, if your domain is technical enough (pharma, finance, law, ...
essentially everything but news, blogs, and tweets) it might be the only way
to get an NLP system that really works.

------
pencilcode
Regarding finding similar documents what is the state of the art nowadays,
LDA, word2vec, something else? What do you normally use?

~~~
zintinio5
Like everything else, it depends on your use case. I have personally used
TF-IDF vectors and token sets with cosine and Jaccard distances in practice.

Some examples of use-cases: are you searching for "semantically similar", or
"near duplicate"? You can compare documents under different metrics and
different _representations_. Some representations are: LSA, PLSA, LDA, TF-IDF,
and Set representations, along with metrics such as Jaccard Distance, Cosine
Distance, Euclidean distance, etc.

Doc2vec is the Word2vec analog for documents.
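As a concrete illustration of the metrics mentioned above, a stdlib-only sketch of Jaccard similarity on token sets and pairwise cosine similarity over TF-IDF vectors (the helper names and the simplified IDF weighting are my own):

```python
import math
import re
from collections import Counter

def tokenize(doc):
    return re.findall(r'\w+', doc.lower())

def jaccard(a, b):
    """Jaccard similarity on token sets: useful for near-duplicate detection."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb)

def tfidf_cosine(docs):
    """Pairwise cosine similarity over TF-IDF vectors.
    Note: with this plain IDF, a word in every document gets weight 0."""
    n = len(docs)
    tokened = [tokenize(d) for d in docs]
    df = Counter(w for toks in tokened for w in set(toks))
    vecs = []
    for toks in tokened:
        tf = Counter(toks)
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})

    def cos(u, v):
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    return [[cos(u, v) for v in vecs] for u in vecs]
```

For example, `tfidf_cosine(["the cat sat", "the cat ran", "stock market news"])` scores the first two documents as similar and the third as unrelated to them.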

------
visarga
This is the first time I've seen reading time and readability scores
mentioned together with NLP.
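For context, a readability score such as Flesch reading ease is just arithmetic over word, sentence, and syllable counts, and reading time is the word count divided by an assumed words-per-minute rate; a rough sketch (the syllable counter is a crude vowel-group heuristic):

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def readability(text, wpm=200):
    """Reading time in minutes plus the standard Flesch reading-ease score:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    flesch = (206.835
              - 1.015 * (len(words) / len(sentences))
              - 84.6 * (syllables / len(words)))
    return {'minutes': len(words) / wpm, 'flesch': flesch}
```

Production tools use dictionary-based syllable counts, but the formula itself really is this simple.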

------
bane
Was hoping for some discussion about word vectors like word2vec. I keep
reading about them, but don't really understand what they're useful for.

~~~
matt4077
Let me try:

Take the famous example of [king] and [queen] being close neighbors in vector
space after generating the word vectors ("embedding"). If you then use these
vectors to represent the words in your text, a sentence about kings will also
add information about the concept of queens, and vice versa. To a far lesser
degree, such a sentence will also add to your knowledge of [ceo], and, further
down, [mechanical engineer]. But it will not change the system's knowledge of
[stereo].
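The neighborhood intuition above can be made concrete with cosine similarity; the vectors below are hand-made toys for illustration (real embeddings are learned, e.g. by word2vec, and have hundreds of dimensions):

```python
import math

# Toy 3-d "embeddings", invented for illustration only.
EMB = {
    'king':   [0.9, 0.8, 0.1],
    'queen':  [0.9, 0.7, 0.2],
    'ceo':    [0.6, 0.5, 0.3],
    'stereo': [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

sims = {w: cosine(EMB['king'], EMB[w]) for w in EMB if w != 'king'}
```

Under these toy vectors, [king] is closest to [queen], then [ceo], and far from [stereo], mirroring the ordering described above.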

~~~
bane
Thanks, yeah I get that, but I think I'm suffering from a lack of imagination
about how to build something useful and user-friendly out of it.

------
d23
My experience with your site on mobile:
[https://m.imgur.com/5vLrEJH](https://m.imgur.com/5vLrEJH)

Can't get it to go away, can't read the article.

------
arcanus
Is there an equivalent to MNIST for NLP? I've always wanted to play around in
this space but I don't know a good, and simple, database to start with.

~~~
jimsmart
There are a few different datasets that might be of use, depending on what
you're playing with:-

\- bAbI
[https://research.fb.com/downloads/babi/](https://research.fb.com/downloads/babi/)
and [https://github.com/facebook/bAbI-tasks](https://github.com/facebook/bAbI-
tasks)

\- SQuAD [https://rajpurkar.github.io/SQuAD-
explorer/](https://rajpurkar.github.io/SQuAD-explorer/)

\- WebQuestions [https://github.com/brmson/dataset-factoid-
webquestions](https://github.com/brmson/dataset-factoid-webquestions)

Edit: there's also a great list of datasets on the ParlAI project page
[https://github.com/facebookresearch/ParlAI](https://github.com/facebookresearch/ParlAI)

------
betageek
Your 'send me a PDF' popup has the background fade div above the form so it's
impossible to fill in the form (without opening dev tools).

~~~
umilegenio
Thanks for your comment! We have now fixed the issue.

~~~
paultopia
FYI, still a glitch: email form for pdf doesn't work right on mobile Safari
for me---the cursor shows up in strange places unrelated to the form fields,
have to click in random places to go from editing the name field to the email
field.

~~~
umilegenio
Thanks for your comment. We are going to look into it.

------
rpedela
Using Chrome on both a Chromebook and Galaxy S5, the right sidebar is screwed
up. On the phone, it completely blocks the content.

------
Boothroid
Quite an obnoxious website on my phone. Anyway I came here to point to GATE as
a mature FLOSS option: [https://gate.ac.uk/](https://gate.ac.uk/)

------
alexasmyths
Recommend Dan Jurafsky and Chris Manning's Stanford online course:

[https://www.youtube.com/watch?v=nfoudtpBV68](https://www.youtube.com/watch?v=nfoudtpBV68)

