Hacker News
How to solve most NLP problems: a step-by-step guide (insightdatascience.com)
461 points by e_ameisen on Jan 24, 2018 | hide | past | favorite | 74 comments

Word2Vec and bag-of-words/tf-idf are somewhat obsolete in 2018 for modeling. For classification tasks, fasttext (https://github.com/facebookresearch/fastText) performs better and faster.

Fasttext is also available in the popular NLP Python library gensim, with a good demo notebook: https://radimrehurek.com/gensim/models/fasttext.html

And of course, if you have a GPU, recurrent neural networks (or other deep learning architectures) are the endgame for the remaining 10% of problems (a good example is SpaCy's DL implementation: https://spacy.io/). Or use those libraries to incorporate fasttext for text encoding, which has worked well in my use cases.

Author here. One of the reasons to start with something even simpler than embeddings such as fastText/Word2Vec is ease of explainability. As this blog post alludes to, the majority of ML work is data work, and in order to work on your data efficiently, understanding what your model gets right/wrong is critical. This is why we favor starting with approaches that might seem simple or outdated. Once you see how these approaches fail, you can better inform your choice of a complex model.

Having worked on modeling data and sharing results with customers, I agree simpler solutions can be better for ease of explaining alone. Additionally, the loss of x percent accuracy with lower-tech solutions can sometimes be worth it because they are easier to train and reason about. This is particularly the case when you are just looking for loose directional indicators and correlations.

If a simple solution gets 77% accuracy and a complex solution yields 80%, I would wager that most of the time you should just stick with the simple solution.

One specific example that comes to mind is sentiment analysis, where you can achieve "sufficiently" high accuracy with well-tailored Bayesian approaches. They are super fast to train and reason about. If a customer/consumer wants to know exactly why a piece of text is positive or negative, the n-gram probability matrix is extremely easy to inspect. Subsequent re-training and fixing is also much easier than re-training a large neural network, SVM, etc.
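To make the inspectability point concrete, here is a minimal pure-Python sketch of a unigram Naive Bayes sentiment classifier on a made-up toy corpus (all data and names below are hypothetical, just for illustration). The per-word probability table is the whole model, so for any prediction you can list each word's contribution to each class:

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical data for illustration).
docs = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("awful plot terrible acting", "neg"),
]

# Count unigrams per class.
counts = {"pos": Counter(), "neg": Counter()}
for text, label in docs:
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def log_prob(word, label):
    # Add-one (Laplace) smoothed log P(word | label).
    c = counts[label]
    return math.log((c[word] + 1) / (sum(c.values()) + len(vocab)))

def classify(text):
    scores = {label: sum(log_prob(w, label) for w in text.split())
              for label in counts}
    return max(scores, key=scores.get)

# Inspectable: each word's per-class log-probability explains the call.
print(classify("great plot"))                                      # pos
print(log_prob("terrible", "neg") > log_prob("terrible", "pos"))   # True
```

Re-training after a bad prediction is just updating the counters for the offending n-grams, which is the "easy fixing" property the comment describes.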

Can you explain how you reduced sentences to two-dimensional entities using PCA?
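For what it's worth, the usual recipe (a sketch with random made-up data, not necessarily the article's exact code) is to build one feature vector per sentence, e.g. bag-of-words counts or averaged word vectors, then project every sentence onto the top two principal components:

```python
import numpy as np

# Hypothetical sentence embeddings (e.g. averaged word vectors or
# bag-of-words counts), one row per sentence.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 300))   # 20 sentences, 300-dim features

# PCA via SVD: center the data, decompose, keep the top two right
# singular vectors, and project every sentence onto them.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2d = Xc @ Vt[:2].T              # shape (20, 2), ready for a scatter plot

print(X2d.shape)  # (20, 2)
```

The resulting two columns are what get scatter-plotted, colored by class, to eyeball whether the classes separate.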

First, FastText is very similar to (a superset of) Word2Vec. Second, RNNs are far from being the endgame for anything. In fact, recent research has been moving away from them towards more predictable and parallelizable methods (see, e.g., Google's Transformer, aka "Attention Is All You Need", or FAIR's approach to translation using CNNs).

Added a note about other DL architectures.

You may be painting with too broad a brush. I've worked with a number of NLP people to try to beat our tfidf model using deep methods and various word vectors. The tfidf still performs better. I think the reasons are: many out-of-vocab words, and small amounts of labelled data.

Coming a bit late to the discussion, but out-of-vocab words and small amounts of labelled data are the exact kind of problem that word embedding vectors are meant to solve, and they do so exceedingly well. Of course, you need a good, large source of (unlabelled) data to train the embeddings, preferably approaching the gold standard of "everything that has ever been written about your problem domain", and that takes some effort. But then out-of-vocab words must be exceedingly rare (i.e. only a genuinely new word that has never been mentioned before, or a very rare term that you've discarded), and even that can be solved well with character-based embeddings.
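The character-based trick is roughly what fastText's subword model does: a word's vector is assembled from its character n-gram vectors, so even an unseen word gets a representation. Here is a toy pure-Python sketch (the hashing, dimensions, and pseudo-embeddings are made up for illustration; a real model learns the n-gram vectors during training):

```python
import hashlib

DIM = 8          # toy embedding size
BUCKETS = 1000   # hashed n-gram table size (fastText-style bucketing)

def char_ngrams(word, n=3):
    # fastText wraps words in boundary markers before extracting n-grams.
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def ngram_vector(ngram):
    # Deterministic pseudo-embedding per hashed n-gram, standing in for
    # a learned vector.
    h = int(hashlib.md5(ngram.encode()).hexdigest(), 16) % BUCKETS
    return [((h * (i + 1)) % 97) / 97.0 for i in range(DIM)]

def word_vector(word):
    # A word's vector is the average of its character n-gram vectors,
    # so out-of-vocabulary words still get a (hopefully useful) vector.
    vecs = [ngram_vector(g) for g in char_ngrams(word)]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

print(char_ngrams("cat"))          # ['<ca', 'cat', 'at>']
print(len(word_vector("unseenword")))  # 8
```

Morphologically related words share n-grams and therefore end up with similar vectors, which is why this helps with rare and unseen words.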

I built a Docker image exposing spacy.io (with their largest English model) as REST endpoints.


I'm not an expert but I'd be surprised if you could declare any approach "endgame" at this point.

My interpretation is that they meant something like "last resort" there.

Fasttext is not that different from word2vec

Fasttext's ability to handle OOVs, however naïve, cannot be overstated. I'm not sure about the literature on this, but I think generalizing based on character n-gram word vectors is probably better than assigning a shared OOV vector learned via dropout.

Evolution for sure; heck, they even have the same author. But fastText is the same idea as word2vec: the objective is identical.

Reading the guide for fastText, I am not sure how a non-NLP researcher would use it.

"This library can also be used to train supervised text classifiers" - https://github.com/facebookresearch/fastText#text-classifica...

Looks like you can use it for gensim. https://radimrehurek.com/gensim/models/fasttext.html

You could probably throw DNA, RNA or protein sequences at it.

A lot of people do. Here are the academic articles that cite Gensim and contain the word "protein" (4 pages of them):


I think you'd be surprised to find that for a pretty large share of classification problems, tf-idf and naive Bayes classifiers do as well, or almost as well, as deep nets, but with less variance.

Also, AFAIK fastText is just a faster way to build embeddings that are mostly equivalent to word2vec.

All in all, I'm not sure you're enough of an expert to be commenting on this.

1) No, they don’t. E.g., on the current Toxic Comment Classification challenge, TF-IDF and Bayes models are about 20% worse than RNN models.

2) FastText is also a classification algorithm, a command line classifier and a set of pretrained embeddings.

3) Yes, he does. I’ve learnt from his notebooks before, and I am confident I’m qualified to judge. I don’t completely agree with his statement here, but he is well qualified to comment.

1) I'm saying it depends on the size of the dataset.

One of fastText's main features is a set of pre-trained word2vec-style embeddings; you can hardly call word2vec obsolete. Bag-of-words has been obsolete for a long time, of course.

Yeah, fasttext/spacy/gensim are some of the biggest open source NLP libraries these days. However I've found that they aren't very performant - I've done prototyping in them before but wouldn't use them for a finished product. I'm sure most big tech companies use custom implementations to do their heavy lifting

It's hard to quantify "most", but I can attest that some of the largest companies in the world do in fact use Gensim, in part because of its good performance and robust battle-tested implementations.

I'm honored that you would respond to my comment! I meant that in no way as a dig against Gensim. Really I'm referring more to the pythonic environment itself - the larger an NLP application grows within Python, the more overhead baggage it picks up (due to logic written in Python - plus Cython is far from a silver bullet). That is not to mention all the difficulties involved in parallelization within a Python environment.

For what it's worth, when I was at a large company recently, I did use Gensim too - I was originally referring more to the case where you take a prototype for an NLP application and ask "ok, how do we scale this across millions of users"

The Gensim version of FastText doesn’t support subword embeddings without additional work.

FastText itself (the command line program and the pretrained embeddings together) is a decent starting point for classification though.

The latest gensim does support fasttext .bin files and subwords (certainly using it with out of vocabulary words right now.) It doesn't support fasttext classification though.

Oh that's awesome! I've been pre-processing and generating separate embedding files.

Bag-of-words/tf-idf is still a pretty good go-to for working out "significant terms" in "search" tasks. It's simple, old, and reasonably effective, which is what makes it good.
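The whole technique fits in a few lines, which is part of the appeal. A toy pure-Python sketch (the corpus below is made up for illustration): terms that appear in every document get an idf of zero, so "significant terms" are just the highest tf-idf scorers per document.

```python
import math

# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    "the fire spread across the forest".split(),
    "the concert was on fire last night".split(),
    "the forest trail was quiet".split(),
]

def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                 # term frequency
    df = sum(1 for d in corpus if term in d)        # document frequency
    idf = math.log(len(corpus) / df)                # inverse doc frequency
    return tf * idf

# "Significant terms" for a document: highest tf-idf first. Ubiquitous
# words like "the" score exactly zero and sink to the bottom.
doc = docs[0]
ranked = sorted(set(doc), key=lambda t: tfidf(t, doc, docs), reverse=True)
print(ranked[0] in {"spread", "across"})  # True
```

In practice you'd use a library implementation with sublinear tf and smoothing, but the ranking intuition is the same.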

FastText is faster but does not beat VDCNN and other deep architectures. It all depends on the use case, really.

When you say GPU + RNNs address the remaining 10% of problems... what problems fall under that 10%?

In my case, problems where text is not the only input into the model (but the text is a highly-important feature), the text document is only a couple characters/words (or empty), and/or regressions instead of classifications. (although most of those are not specific to text encoding, but are difficult to implement and get good results without tools like Keras regardless)

I am not sure how many people have an issue with this, but it seems to me that computer science, just over the relatively short time I've been paying attention, is becoming more-and-more abstract in a lot of ways.

You can code something incredibly complex that works great without understanding any of the math underneath. Understanding the math arguably makes you a better engineer overall, but isn't required to solve many of these problems.

I think it's pretty cool, but I'm sure a lot of people have a big issue with the "just TRUST the library!" approach.

Layers are good. They allow our limited mind to deal with more and more complex problems. The issue forms when layers are inadequately described, or blackbox things they shouldn't. The worst form of this issue is when people start writing introductory documentation as if it was a marketing copy.

A fair introduction to a library would be like this: "This library lets you make X, Y and Z from A, B and C. It does so using mathematical methods 1, 2, 3 - therefore, it will be great for this-and-that type of A-B-C, but will not perform well for some-other-type." Such a description will tell you where the limits of applicability are, and what to look for if you want to understand more.

(Also, neural network libs should come with a big, bold caveat: "this is magic, only 10 people on the planet know how the whole stack works; the rest of us just perform a ritual on a cluster of GPUs and pray that thus summoned entity will do sorta ok-ish job with our problem (and if it doesn't, you're on your own)".)

I understand what you are saying and I agree with it too. What you are worried about is what happens in the short term. But in the long term, as the technology matures, this is what should be happening. It will become an easy-to-use black box which works for almost 99.99% of cases and gracefully handles the remaining ones. You will no longer need PhDs to use the black box, only to develop it. For example, look at numerical analysis toolboxes, like automatic integrators. Even if you do not understand the underlying system of differential equations, the libraries just work or throw a nice, legible error. They encapsulate a ton of ugliness inside, but that is all right as long as the solutions are predictable.

Really interesting viewpoint. When ease of use improves along with the performance of the methods, we worry less about not understanding how the methods work.

In the broad sense, isn't this an example of how we trust?

The key is building the right abstraction layers - clean separation of responsibilities into black boxes. Outside of security, no one complains about trusting the heap of abstractions that interpreted languages depend on... interpreters, compilers, operating systems, hardware... clear layers.

I don't think machine learning and NLP have reached that point, largely because the field is based on probabilities and tunable parameters, which are hard to expose without complexity and hard to trust when they aren't exposed.

This has been the case throughout the history of computers and many other things.

We build layers on top of layers on top of layers.

Occasionally this causes problems. Often it solves them.

Surely this is a good thing, though. Computers now span every field and it's not feasible to know the math behind all types of software. It's better to go deep in a few areas and let other people take care of the rest.

Well, the 80% accuracy the author reaches isn't anywhere near what you need in production for text classification. Imagine classifying all of Twitter with that... I don't want to make this look bad: the author did a great job at this "introduction to text mining" article. But the results are a far cry from production-ready, as I hope is obvious: anything but a tiny false positive rate will just produce infinite false alarms (or miss all relevant cases).

> it seems to me that computer science, just over the relatively short time I've been paying attention, is becoming more-and-more abstract in a lot of ways.

But can we still call it computer science? It seems to me more like linguistics / mathematics / statistics, with the CS only to do the low-level computations and accounting of data.

Of course, if you build a global-scale efficient search engine with it, the role of CS may become bigger.

As a computer scientist I think we benefit as a group from having NLP fit under computer science, so yes I think we should call it computer science. The overarching question is: why does it matter what we classify NLP as?

NLP is a domain specific problem. Of course it shouldn't be under computer science. Pure computer scientists are much less useful than linguists for these things. That's like arguing that building physics simulations is under the realm of computer science. You'd want your team to be mostly linguists, some which specialize in computational linguistics. A vanilla computer scientist, to be frank, is almost useless, especially at the PhD level.

I think it matters because it determines what kind of people you would hire to do the R&D.

Yes, in the new world we will need shitty engineers and good engineers... and we will not need ditch diggers or cashiers as much.

But we will make the shitty engineering jobs as easy as being a cashier.

and life goes on...

NLP is one of the most challenging areas of research, and nothing in this article will help solve even 0.009% of those challenges

Example of the wisdom herein:

> Remove words that are not relevant, such as “@” twitter mentions or urls

Wow. That's like saying a person's name isn't important. That's like saying Latin words aren't relevant, or Spanish words aren't relevant for processing English. To explain why this is bad, since if the author is confused (see edit), others may be as well: Twitter mentions are contextual data. They're either addressing someone by name, or they're being used in a cheeky way to replace another word.

I could quite easily say "I'm going to @chicago this weekend to see @taylor_swift". Now see if this sentence makes any sense: "I'm going to this weekend to see". Nope. What you need to do is translate them, like any other word. Sure it's not an English word, but you wouldn't ignore the word "hola" just because it's not an English word. Now, if your NLP application doesn't rely on this data, sure, throw it away. But if you're looking at Twitter and throwing away any mentions, you're not really processing natural language, are you?

Sure, it's really hard to translate. Am I talking about Chicago the city or Chicago the band? Maybe Chicago the movie?

Well that's why NLP is one of the most challenging areas of research, and nothing in this article will help solve even 0.009% of those challenges.

(edit) Maybe confused isn't the right word... they may just have other priorities for their use case. However, that doesn't excuse the title.

>Now, if your NLP application doesn't rely on this data, sure, throw it away. But if you're looking at Twitter and throwing away any mentions, you're not really processing natural language, are you?

The data is natural language, and your system is processing it. NLP doesn't mean "using computation to model exactly and completely the entire human language faculty". NLP is distinct from text processing in that often text files may contain forms of unstructured information other than natural language. The data being processed here is natural language.

The approach I've seen to handling Twitter mentions is to replace all mentions with a "<userid>" token, which handles that potential loss of context, although not perfectly.
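A minimal sketch of that normalization step (the placeholder names and the `normalize` helper are illustrative, not from any particular library): substitute mentions and URLs with fixed tokens rather than deleting them, so the model still sees that *something* occupied that slot.

```python
import re

# Replace @mentions and URLs with placeholder tokens instead of
# deleting them, preserving sentence structure and the fact that a
# user/link was referenced.
MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

def normalize(tweet):
    tweet = MENTION.sub("<userid>", tweet)
    tweet = URL.sub("<url>", tweet)
    return tweet

print(normalize("I'm going to @chicago this weekend to see @taylor_swift"))
# I'm going to <userid> this weekend to see <userid>
```

The sentence stays grammatically intact, which is exactly the loss the grandparent comment was worried about with plain deletion.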

This comment is funny, and also unfortunate. Overall, the article gives a broad overview of a typical NLP pipeline, and demonstrates the concepts with a neat example. Sure it could be improved, but it seems that you interpreted the fact that it can be improved as a sign that it's almost entirely unhelpful. In what world is that mindset useful?

For the example task in the article (classifying whether a tweet is about disasters), it would be genuinely surprising if `@` mentions were meaningful. Sure, this would be something you would investigate, but the general idea of `removing words that are not relevant` as a pre-processing step is definitely not bad advice.


Author here, happy to answer questions and share our vision of it. Many problems require more complex approaches, and we definitely have Fellows tackle some of those (https://blog.insightdatascience.com/entity2vec-dad368c5b830).

That being said, when it comes to the volume of practical applications that come up for the many teams that we work with, the vast majority can be solved by the techniques outlined in the post. These techniques are simpler, but often overlooked. We believe that they should often be a starting point, and most of the time they end up being good enough for the job.

A better title for your article might be “How to solve most text-processing problems”, because you’re really not talking about NLP.

I'd agree - and also, it should be made clear that a Twitter classifier with 80% accuracy is probably a long way from something that you would apply to the actual Twitter firehose.

But the article is a very nice introduction to text mining (alas, not NLP)!

Sarcasm can be fun, but it would seem odd to leave such a thing off a list of basic things to do to clean your data.

I think a large share of the text-based problems that companies actually have can be solved with the approaches in this article. I'd actually be surprised if many need to go all the way through to the end, which gets up to word vectors and convolutional neural networks.

My thoughts exactly. While the content is not bad, the title is clickbait. It is just about how people deal with text classification.

Yes, that title isn't just pedantically incorrect... it's extremely disingenuous.

Where did the 90% even come from?

Every point in the list of data cleaning tips removes contextual information that may actually help models learn. I'd be careful with applying this advice blindly.

I took this article more at a level of DS online coursework as opposed to useable in industry.

Bag of words is the death of comprehensible NLP

Agreed. There are also some things that you can't do with "bag-of-words" methods, such as handling multi-word expressions, because for those the order of words is relevant.

A bag of n-grams can help here, but yes, you will still hit a wall quickly if you require phrase understanding or context.

That has its own set of problems, namely data sparsity and the resulting high-dimensional vectors, although like you say it does give you a little bit of context. I'm not sure that a bag of anything is a good solution.

The thing that jumped out at me in this article was the use of Lime to explain models - I hadn't heard of it before.


For NLP tasks, it looks like what it does is selectively delete words from the input and check the classifier output. This way it determines which words have the biggest effect on the output without needing to know anything about how your model works.
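The leave-one-word-out idea can be sketched in a few lines (the `toy_score` classifier below is entirely made up to stand in for a real model's predict function; this is the intuition behind LIME's text explanations, not its actual implementation, which fits a local linear model over many perturbed samples):

```python
def toy_score(text):
    # Hypothetical stand-in classifier: "disaster" probability rises
    # with certain keywords. In practice this would be model.predict.
    keywords = {"fire": 0.4, "explosion": 0.5, "collapsed": 0.3}
    return min(1.0, sum(keywords.get(w, 0.0) for w in text.split()))

def word_importance(text, score_fn):
    # Drop each word in turn and record how far the score falls:
    # a big drop means the word mattered to the prediction.
    words = text.split()
    base = score_fn(text)
    importance = {}
    for i, w in enumerate(words):
        without = " ".join(words[:i] + words[i + 1:])
        importance[w] = base - score_fn(without)
    return importance

imp = word_importance("the building collapsed after the fire", toy_score)
print(max(imp, key=imp.get))  # fire
```

The appeal, as the parent notes, is that this treats the classifier as a pure black box: it only needs the ability to score perturbed inputs.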

I think this might be the first blog post I've read to actually explain how to use word vectors as features---good for the author!

A question to the NLP experts out there -- is it possible to automatically detect various pre-defined attributes about a person by automatically analyzing relevant texts? For example, finding out whether a person is anti-capitalist by scanning his blog posts related to economics. I'm not even sure how to approach such a problem.

What you're describing is just a classifier, and you can certainly train something to do this. However, I'm quite skeptical it would work well.

I suppose if you were Google and had all their data, perhaps. Putting different data types together (like text, pictures, locations) adds a lot of difficulty.

Author here. If you have the means to label some data, this becomes quite an easy classification problem. Get multiple examples of posts, and labels (capitalist, libertarian, ...) for a group of users that represents your total population well, and train a classifier.

thank you!

The title of the blog is too ambitious.

This was an interesting read, but when I read sentences like "The words it picked up look much more relevant!", I'm reminded of the XKCD explanation of machine learning: https://xkcd.com/1838/

I know natural language processing predates Neuro-linguistic programming, but I still can’t see ‘NLP’ without the little hairs on the back of my neck standing up.

Bad title (this is all about text classification/mining, not NLP), but a very nice introduction at that. Maybe a tad optimistic - I'd never even consider applying a classifier with 80% accuracy to the Twitter firehose (unless extremely noisy performance were a non-issue - but it never is ... :-)).

Good intro, but the approaches used here are quite basic and outdated for 2018. Not sure this solves 90% of NLP problems.

Granted, even if the approaches are outdated, they can still be used to solve NLP problems.

Any suggestions for good resources on more advanced and current approaches?

I think

  def sanitize_characters(raw, clean):    
      for line in input_file:
          out = line
  sanitize_characters(input_file, output_file)
should be

  def sanitize_characters(raw, clean):    
      for line in raw:
          out = line
  sanitize_characters(input_file, output_file)
in your notebook: https://github.com/hundredblocks/concrete_NLP_tutorial/blob/...

Or am I mistaken?
