A Guide to Natural Language Processing (tomassetti.me)
418 points by ftomassetti 9 months ago | 52 comments

> Essentially, when dealing with natural languages hacking a solution is the suggested way of doing things, since nobody can figure out how to do it properly.

That's really the TL;DR I also got from the computational linguistic courses I attended.

The Pareto principle is probably at work here. Having no solution is worse than having an 80% solution that works well enough when the 100% solution is much harder to achieve (and some of the problems not even humans would be able to solve properly).

Couldn't agree more!

Recently I wrote a web-extension for Firefox that displays funny "Deep thought" quotes.

I wanted to analyse the quote text and fetch relevant images to animate in the background of the quote text. After reading several NLP tutorials, guess what I did as a first PoC: pick the 3 longest words in a quote text and run an image search with those 3 words.

6 lines of plain javascript code that can be run anywhere almost instantly https://github.com/TheCodeArtist/deep-thought-tabs/blob/mast...
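A rough Python sketch of the "3-longest-words" idea described above. This is not the extension's actual code (which is JavaScript); the function name `longest_words`, the regex tokenizer, and the example sentence are assumptions made for illustration.

```python
import re

def longest_words(quote, n=3):
    """Return the n longest distinct words in `quote`, longest first."""
    # Simplified tokenizer: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", quote)
    # Deduplicate case-insensitively, keeping the first spelling seen.
    seen = {}
    for w in words:
        seen.setdefault(w.lower(), w)
    # sorted() is stable, so words of equal length keep their original order.
    return sorted(seen.values(), key=len, reverse=True)[:n]

# The joined result would be the image-search query.
query = " ".join(longest_words("The elephant quietly contemplated an enormous watermelon"))
print(query)
```

Short function words ("the", "an") never make the cut, which is the whole trick.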

I get relevant images in the search results 99/100 times. The quirks of searching often result in the image adding to the funny-ness of the "Deep Thought" on display.

It's so effective that I ended up publishing the "Deep Thought Tabs" web-extension with this approach itself: https://addons.mozilla.org/en-US/android/addon/deep-thought-...

Later I tried using the nlp-compromise js library to identify "topics" of interest within a quote text, typically nouns, verbs, and adjectives. Comparing the results with my "3-longest-words" approach, I found that the longest words were almost always the "topic" words that NLP identified for any given quote text.

Neat observation. The "3-longest-words" approach probably works well because grammatical words tend to be elided down to as short of an implementation as possible, while longer words tend to be more demonstrative of the actual topic at hand, rather than grammatical structure.

That's pretty awesome.

Back in games we'd do all sorts of tricks in networking to make it look like things were happening (sound effects, decals, etc.) in response to local events until we could have the server provide the definitive call on some game state.

Most players thought we had a much higher fidelity sim than we actually did. It's a pretty common technique across a lot of games. You can get away with quite a bit by being smart about what you "fake" and what you actually make work end-to-end.

You could use the same argument against pretty much any discipline that's undergoing active research. Of course no-one knows (yet) how to do it properly, or else there would be no research going on. Image understanding, robotics, even non-computational disciplines such as medicine... Staying with the latter, take HIV for example: no-one knows how to cure it, but I'm sure a lot of people are very grateful for the 80% solutions that prolong lives today.

So, in summary, your point is not wrong. But it's no reason for bashing computational linguistics. It is common across many disciplines to use not-yet-perfect solutions as long as you don't know how to do better.

That said, I don't fully agree with the notion that "hacking a solution" is the suggested way of doing things. Computational linguistics is a pretty wild field with a lot of sub-disciplines. In a lot of those, the state of the art consists of quite sophisticated approaches that are the result of years of research. Take speech recognition, for instance. Currently, deep learning approaches take the cake, but there is also a plethora of insights that have been gained from improving the traditional methods over decades.

I think a more nuanced point of view is called for here.

I didn't intend to bash computational linguistics. Those were some of my favorite courses. I wouldn't have attended more than I needed if I hadn't liked the topic and gotten something out of it.

It's surprising how often you can get very far with imperfect solutions. ELIZA is the classic example. A simple program with very little code could convince people that they were talking to another human, or at least a machine with an understanding of their feelings.
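To make the "very little code" point concrete, here is a minimal ELIZA-style sketch in Python. The rules and the pronoun table are invented for illustration and are not ELIZA's original script; the sketch only shows the pattern-match-and-reflect trick.

```python
import re

# Hypothetical rules: (pattern, response template). Not ELIZA's original script.
RULES = [
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.I), "Tell me more about your {0}."),
]

# Swap first and second person so the echoed fragment reads naturally.
REFLECT = {"my": "your", "me": "you", "i": "you", "am": "are"}

def reflect(fragment):
    return " ".join(REFLECT.get(w.lower(), w) for w in fragment.split())

def reply(utterance):
    # Drop trailing punctuation so it is not echoed back inside the question.
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."  # Fallback when no rule matches.

print(reply("I feel ignored by my boss."))
```

There is no model of meaning anywhere in this: the first matching rule wins and the user's own words are handed back.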

ELIZA was coded completely by humans. Of course, nowadays we have more sophisticated ways of doing that. We can throw a few topic tagged example sentences with connected replies at a computer and it will mostly reply with the right answers to similar sentences. This is only possible because computational linguistics provided the foundation for that.

Still, many solutions are hacky to this day, but that is because computational linguistics is more concerned with interaction with imperfect humans than most other disciplines in computer science.

Eh, that really comes down to applied theory vs. pure theory. There's no one Grand Unifying Theory of Natural Language Processing, and not likely to be a strong candidate for a while yet. Until then, there can still be a lot of good problem-solving that can be used with either traditional NLP or with neural networks, or even a hacked-together hybrid approach, and both application and research will feed into each other to refine the processes.

Yeah, when you look at some of the SemEval contest winners or top 3, many use fairly simple methods combined into a powerful solution (except when LSTM with attention grabs the throne).

Ha, there's a whole section on clones of the summarizer from Classifier4J.

I wrote that in 2003 (I think?) based on @pg's "A plan for spam" essay, and then "invented" the summarization approach (I'm sure others had done similar, but I thought it up myself anyway).

Turns out it was rather well tuned. The 2003 implementation, presumably downloaded from sourceforge(!) still wins comparisons on datasets which didn't even exist when I wrote it[1].

I much prefer the Python implementation though[2], which I hadn't seen before.

Also, Textacy on top of Spacy is awesome for any kind of text work.

[1] https://dl.acm.org/citation.cfm?id=2797081

[2] https://github.com/thavelick/summarize/blob/master/summarize...

There are a few applications missing:

- Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"

- Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".

- Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

- Answering a question, using a large body of facts. Like search, but now it gives a precise answer.

- Finding and correcting spelling/grammatical errors.

> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

That's a simple example because with 'CO2' you at least have the same string that can serve as a keyword connecting those two facts. Usually in natural language we make frequent use of anaphora to refer to people, objects and concepts previously mentioned in the text by name.

Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general. The simplest anaphoric device in languages like English is the pronoun, and even with pronouns it can be quite difficult to determine what a 'he' or 'she' refers to in context.

> Anaphora resolution is one of the really hard problems not only in NLP but in linguistics in general.

This was one of the most frustrating parts of studying Latin rhetoric. The speakers would keep referring to "That thing I was talking about," and it's a noun from a subordinate clause 2 and a half paragraphs ago.

That’s actually very common in most languages. English is one of the few western languages that doesn’t do this, which makes it quite complicated for some people to write sentences in it, as in their native language such far backreferences, and long run-on sentences may be a lot more common.

> - Answering a question by returning a search result from a large body of texts. E.g. "How do I change the background color of a page in Javascript?"

> - Answering a question, using a large body of facts. Like search, but now it gives a precise answer.

That is essentially a Natural Language Interface. There are simple ways to implement one for bots that receive simple commands[1]. The problem is that it quickly becomes very hard if you are trying to do something more open-ended than a bot. So, there was simply no room to include it.

> - Improving the readability of a text. The article only mentions "understanding how difficult to read is a text".

The issue is that the formulas to measure the readability of a text cannot really be used to suggest improvements. That's because the user ends up focusing on improving the score instead of improving the text. To suggest improvements you need a much more sophisticated system.
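As one concrete instance of such a formula, the Automated Readability Index needs only character, word, and sentence counts. The article may well use a different formula; ARI is picked here as an assumption because it is easy to compute, and the naive tokenization below is also a simplification.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive sentence split on ., ! and ? (abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Count only letters and digits as characters.
    chars = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

simple = automated_readability_index("The cat sat. The dog ran.")
dense = automated_readability_index(
    "Computational linguistics investigates sophisticated representations."
)
print(simple < dense)
```

The score only sees word length and sentence length, so chopping sentences in half lowers it whether or not the text became clearer.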

> - Establishing relationships between entities in a body of text. E.g. we could build a fact-graph from sentences like "Burning coal increases CO2", and "CO2 increase induces global warming". Useful also in medical literature where there are millions of pathways.

This is one of the things that were axed, because in some sense it is simple if you just want to link together concepts without any causality, i.e. stuff that happens together. To do that you could combine named entity recognition (to find entities) with a simple way to find a relationship between words (i.e., they occur in the same phrase, therefore they are related). However, a more sophisticated form of the process, like the one that results in the Knowledge Graph[2], would be quite hard to do.
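A minimal sketch of the "simple" version described above: entities that appear in the same sentence are treated as related. A fixed entity list stands in for a real named entity recognizer, and the function name `cooccurrence_graph` is hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(text, entities):
    """Map each unordered entity pair to the number of sentences containing both."""
    graph = defaultdict(int)
    for sentence in re.split(r"[.!?]+", text):
        lowered = sentence.lower()
        # Stand-in for NER: plain substring lookup of known entity names.
        found = sorted(e for e in entities if e.lower() in lowered)
        for a, b in combinations(found, 2):
            graph[(a, b)] += 1
    return dict(graph)

text = "Burning coal increases CO2. CO2 increase induces global warming."
edges = cooccurrence_graph(text, ["coal", "CO2", "global warming"])
print(edges)
```

The edges carry no direction and no causality: the graph links coal to CO2, but it does not know which one increases the other.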

> - Finding and correcting spelling/grammatical errors.

That's a great idea, we will add how to detect spelling errors.

[1] https://medium.com/swlh/a-natural-language-user-interface-is...

[2] https://en.wikipedia.org/wiki/Knowledge_Graph

The fact that those things are hard is exactly why a guide on them would be valuable.

That's true up to a point. We wrote the article for programmers that had no previous knowledge, so we avoided stuff that is too hard. To such people stuff that is too advanced would look cool, but it would also be impractical to use.

However, we are thinking about creating a more advanced article at a later date.

Author profiling comes to mind as well

- Text generation and dialogue systems

A lot to review, read, learn. Thanks a lot for sharing this. Any plans to extend it or have another one including even more, like Natural Language Generation (not limited to bots, we are using it in weather forecasts), and co-reference?

Thanks. Well, there are interesting things that we had to cut because they were too advanced for an introductory article. We were thinking about making a new article for them in a few months. And Natural Language Generation would be another great topic to talk about.

However, if you already have experience in the topic we would be happy if you would like to write a guest post for us.

I'm always astonished how little mention gensim gets, considering that it can basically be used for all the listed tasks, including parsing, if you combine it with your favorite deep learning library (DyNet, anyone?).

gensim is one of the best libraries for word vectors and summarization. For parsing and NER, Stanford CoreNLP works best in my experience.

Well, a model you fine tune to your specific corpus/domain works even (in fact: much) better... And gensim there gives you the tools to build the best possible embeddings.

But you do need a use case and an economic reward that justify the substantial increase in cost over a pre-trained, vanilla, off-the-shelf parser (model). Yet, if your domain is technical enough (pharma, finance, law, ... - essentially, all but parsing news, blogs, and tweets...) it might be the only way to get an NLP system that really works.

Regarding finding similar documents what is the state of the art nowadays, LDA, word2vec, something else? What do you normally use?

Like everything else, depends on your use-case. I have personally used TF-IDF vectors and token sets with Cosine and Jaccard distances in practice.

Some examples of use-cases: are you searching for "semantically similar", or "near duplicate"? You can compare documents under different metrics and different _representations_. Some representations are: LSA, PLSA, LDA, TF-IDF, and Set representations, along with metrics such as Jaccard Distance, Cosine Distance, Euclidean distance, etc.
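The token-set and TF-IDF comparisons mentioned above can be sketched in plain Python. This is an illustration under assumptions: whitespace tokenization, and a smoothed IDF that is just one common variant. A real project would more likely use a library.

```python
import math
from collections import Counter

def tokens(doc):
    return doc.lower().split()

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the two documents' token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def tfidf_vectors(docs):
    """One sparse {term: weight} dict per document."""
    tokenized = [tokens(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF: log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine_distance(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0

docs = ["the cat sat on the mat", "the cat lay on the mat", "stock prices fell sharply"]
vecs = tfidf_vectors(docs)
print(jaccard_distance(docs[0], docs[1]), cosine_distance(vecs[0], vecs[1]))
```

Jaccard over token sets suits "near duplicate" checks, while TF-IDF with cosine down-weights words shared by every document; neither captures meaning beyond shared vocabulary.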

Doc2vec is the Word2vec analog for documents.

Word Mover Distance on Word2Vec vectors.

There is an implementation in Textacy.

Have you heard of word mover’s distance? It works really well!

First time I've seen reading time and readability score mentioned together with NLP.

Was hoping for some discussion about word vectors like word2vec. I keep reading about them, but don't really understand what they're useful for.

The interesting thing about word2vec is that it is an unsupervised method that builds vectors to represent each word in a way that makes it easy to find relationships between them.

There is a video by the creator of Gensim on word2vec and friends: https://www.youtube.com/watch?v=wTp3P2UnTfQ

We didn't include it, simply because it relies on machine learning and we wanted to show simpler methods.

Yes, I agree that the applications of word vectors are not made as clear as they should be. One direct application is as the first layer of a neural network [1], which could be part of either a 1-dimensional convolution or a recurrent neural network. Using pre-trained word vectors is a form of transfer learning and allows for much more predictive models with smaller amounts of training data.

[1] https://blog.keras.io/using-pre-trained-word-embeddings-in-a...

Let me try:

Take the famous example of [king] and [queen] being close neighbors in vector space after generating the word vectors ("embedding"). If you then use these vectors to represent the words in your text, a sentence about kings will also add information about the concept of queens, and vice versa. To a far lesser degree, such a sentence will also add to your knowledge of [ceo], and, further down, [mechanical engineer]. But it will not change the system's knowledge of [stereo].

Thanks, yeah I get that, but I think I'm having a lack of imagination about what to do with that in terms of how to build something useful and user friendly out of it.

Essentially they are useful for comparing the semantic similarity of pieces of text. The text could be a word, phrase, sentence, paragraph, or document. One practical use case is semantic keyword search where the vectors can be used to automatically find a keyword's synonyms. Another is recommendation engines that recommend other documents based on semantic similarity.
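A toy illustration of the keyword-search use case: rank vocabulary words by cosine similarity to a query word. The 3-dimensional vectors below are hand-made for the example and are not real word2vec output; trained embeddings usually have hundreds of dimensions.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not trained embeddings).
VECTORS = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "ceo":    [0.6, 0.2, 0.5],
    "stereo": [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, k=2):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    others = [w for w in VECTORS if w != word]
    return sorted(
        others,
        key=lambda w: cosine_similarity(VECTORS[word], VECTORS[w]),
        reverse=True,
    )[:k]

print(most_similar("king"))
```

A search box could expand a query with its nearest neighbours this way, and a recommender could do the same lookup over document vectors instead of word vectors.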

Are you sure it allows you to guess synonyms? I was under the impression that word2vec only tells you how similar words are, which is different from being synonyms. E.g. red is like blue in the word2vec sense, but not a synonym.

Technically yes. It will find words which are used in similar contexts such as synonyms, antonyms, etc. However in practice, word2vec and clustering do a good job of finding synonyms [1].

1. https://www.slideshare.net/mobile/lucidworks/implementing-co...

Was very pleased to find this out when I first started studying word embeddings (the abstract principles of word2vec). Essentially it comes down to words having similar verbs and objects that come up most frequently together, so they end up being semantically close.

My experience with your site on mobile: https://m.imgur.com/5vLrEJH

Can't get it to go away, can't read the article.

Is there an equivalent to MNIST for NLP? I've always wanted to play around in this space but I don't know a good, and simple, database to start with.

There are a few different datasets that might be of use, depending on what you're playing with:

- bAbI https://research.fb.com/downloads/babi/ and https://github.com/facebook/bAbI-tasks

- SQuAD https://rajpurkar.github.io/SQuAD-explorer/

- WebQuestions https://github.com/brmson/dataset-factoid-webquestions

Edit: there's also a great list of datasets on the ParlAI project page https://github.com/facebookresearch/ParlAI

I worked with NLP for my research, and I used to build my corpora from wikipedia documents. Here's a tool that I've built to do it: https://github.com/joaoventura/WikiCorpusExtractor

Well there's word2vec, which, while it isn't quite the same (its whole point is the vector classification it already embodies), I think is actually the kind of thing you were asking for.

Depends on what you want to try. NLTK has built in datasets. 20 Newsgroups is useful for trying lots of things.

Your 'send me a PDF' popup has the background fade div above the form so it's impossible to fill in the form (without opening dev tools).

Thanks for your comment! Now, we have fixed the issue.

FYI, still a glitch: email form for pdf doesn't work right on mobile Safari for me---the cursor shows up in strange places unrelated to the form fields, have to click in random places to go from editing the name field to the email field.

Thanks for your comment. We are going to look into it.

The 'send me the PDF' pop-up can not be closed on my iPhone. Had to close the page.

Hmm, worked fine for me.

Using Chrome on both a Chromebook and Galaxy S5, the right sidebar is screwed up. On the phone, it completely blocks the content.

Quite an obnoxious website on my phone. Anyway I came here to point to GATE as a mature FLOSS option: https://gate.ac.uk/

Recommend Dan Jurafsky and Chris Manning @ Stanford online course:

