
How to solve 90% of NLP problems: a step-by-step guide - e_ameisen
https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e
======
minimaxir
Word2Vec and bag-of-words/tf-idf are somewhat obsolete in 2018 for modeling.
For classification tasks, fasttext
([https://github.com/facebookresearch/fastText](https://github.com/facebookresearch/fastText))
performs better and faster.

fastText is also available in the popular NLP Python library gensim, with a
good demo notebook:
[https://radimrehurek.com/gensim/models/fasttext.html](https://radimrehurek.com/gensim/models/fasttext.html)

And of course, if you have a GPU, recurrent neural networks (or other deep
learning architectures) are the endgame for the remaining 10% of problems (a
good example is SpaCy's DL implementation:
[https://spacy.io/](https://spacy.io/)). Or use those libraries to incorporate
fasttext for text encoding, which has worked well in my use cases.
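For what it's worth, a big part of why fastText handles noisy text well is its subword trick: every word is broken into character n-grams (wrapped in `<`/`>` boundary markers), so rare or misspelled words still share features with their stems. A minimal pure-Python sketch of that extraction (the 3-to-6 range mirrors fastText's defaults; the function name is mine):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams fastText would hash for a word.

    fastText wraps each word in '<' and '>' boundary markers and
    collects every n-gram of length min_n..max_n, plus the full
    wrapped word itself as one extra feature.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # the whole word is also a feature
    return grams

# Even an unseen or misspelled word shares most n-grams with its stem:
print(sorted(char_ngrams("where", max_n=3)))
# -> ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

With `max_n=3` this reproduces the `<wh, whe, her, ere, re>` example from the fastText paper.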

~~~
e_ameisen
Author here. One of the reasons to start with something even simpler than
embeddings such as fastText/Word2Vec is ease of explainability. As this blog
post alludes to, the majority of ML work is data work, and to work on your
data efficiently, understanding what your model gets right and wrong is
critical. This is why we favor starting with approaches that might seem
simple or outdated. Once you see how these approaches fail, you can better
inform your choice of a complex model.

~~~
KennyCason
Having worked on modeling data and sharing results with customers, I agree
simpler solutions can be better for ease of explaining alone. Additionally,
the loss of x percent accuracy with lower-tech solutions can sometimes be
worth it because they are easier to train and reason about. This is
particularly the case when you are just looking for loose directional
indicators and correlations.

If a simple solution gets 77% accuracy and a complex solution yields 80%, I
would wager that most of the time you should just stick with the simple one.

One specific example that comes to mind is sentiment analysis, where you can
achieve "sufficiently" high accuracy with well-tailored Bayesian approaches.
They are super fast to train and reason about. If a customer/consumer wants to
know exactly why a piece of text is positive or negative, the n-gram
probability matrix is extremely easy to inspect. Subsequent re-training and
fixing is also much easier than re-training a large neural network, SVM, etc.
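As a toy illustration of how inspectable this is, here is a from-scratch Naive Bayes sketch with invented example data, using unigrams with Laplace smoothing as a stand-in for the full n-gram probability matrix:

```python
import math
from collections import Counter

def train_nb(docs):
    """Train a toy Naive Bayes sentiment model from (text, label) pairs.

    Returns per-class Laplace-smoothed word log-probabilities -- the
    'probability matrix' that makes every prediction easy to inspect.
    """
    counts, priors = {}, Counter()
    for text, label in docs:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for c in counts.values() for w in c}
    logprob = {label: {w: math.log((c[w] + 1) / (sum(c.values()) + len(vocab)))
                       for w in vocab}
               for label, c in counts.items()}
    return logprob, priors

def predict(logprob, priors, text):
    """Score each class as log prior + sum of word log-probabilities."""
    total = sum(priors.values())
    scores = {label: math.log(priors[label] / total)
              + sum(lp.get(w, 0.0) for w in text.lower().split())
              for label, lp in logprob.items()}
    return max(scores, key=scores.get)

docs = [("great movie loved it", "pos"),
        ("terrible movie hated it", "neg")]
logprob, priors = train_nb(docs)
print(predict(logprob, priors, "loved this great film"))  # -> pos
```

Every verdict can be traced back to the entries of `logprob` by hand, which is exactly the auditability argument above.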

------
odonnellryan
I am not sure how many people have an issue with this, but it seems to me that
computer science, just over the relatively short time I've been paying
attention, is becoming more and more abstract in a lot of ways.

You can code something incredibly complex that works great without
understanding any of the math underneath. Understanding the math arguably
makes you a better engineer overall, but isn't required to solve many of these
problems.

I think it's pretty cool, but I'm sure a lot of people have a big issue with
the "just TRUST the library!" approach.

~~~
zzz95
I understand what you are saying and I agree with it too. What you are worried
about is what happens in the short term. But in the long term, as the
technology matures, this is what should happen. It will become an easy-to-use
black box that works for 99.99% of cases and gracefully handles the remaining
ones. You will no longer need PhDs to use the black box, only to develop it.
For example, look at numerical analysis toolboxes, like automatic integrators.
Even if you do not understand the underlying system of differential equations,
the libraries just work or throw a nice legible error. They encapsulate a ton
of ugliness inside, but that is all right as long as the solutions are
predictable.

~~~
has2k1
Really interesting viewpoint. When the convenience of use improves along with
the performance of the methods, we worry less about not understanding how the
methods work.

In the broad sense, isn't this an example of how trust works?

------
paulsutter
NLP is one of the most challenging areas of research, and nothing in this
article will help solve even 0.009% of those challenges.

Example of the wisdom herein:

> Remove words that are not relevant, such as “@” twitter mentions or urls

~~~
freehunter
Wow. That's like saying a person's name isn't important. That's like saying
Latin words aren't relevant or Spanish words aren't relevant for processing
English. To explain why this is bad, since the author may be confused (see
edit) and others may be as well: Twitter mentions are contextual data. They're
either addressing someone by name or they're being used in a cheeky way to
replace another word.

I could quite easily say "I'm going to @chicago this weekend to see
@taylor_swift". Now see if this sentence makes any sense: "I'm going to this
weekend to see". Nope. What you need to do is translate them, like any other
word. Sure it's not an English word, but you wouldn't ignore the word "hola"
just because it's not an English word. Now, if your NLP application doesn't
rely on this data, sure, throw it away. But if you're looking at Twitter and
throwing away any mentions, you're not really processing natural language, are
you?
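To make the "translate them" idea concrete, one cheap middle ground between dropping mentions and fully resolving them is to normalize each one to a resolved name when you have it, or a placeholder token otherwise. A rough sketch; the lookup table here is invented for illustration, and real entity resolution is the hard part:

```python
import re

# A hand-made lookup standing in for real entity resolution;
# the entries are invented for illustration.
KNOWN_MENTIONS = {
    "@taylor_swift": "Taylor Swift",
    "@chicago": "Chicago",
}

def translate_mentions(text):
    """Replace @mentions with a resolved name when known, otherwise
    with a generic <mention> token, instead of deleting them."""
    def repl(match):
        return KNOWN_MENTIONS.get(match.group(0).lower(), "<mention>")
    return re.sub(r"@\w+", repl, text)

print(translate_mentions("I'm going to @chicago this weekend to see @taylor_swift"))
# -> I'm going to Chicago this weekend to see Taylor Swift
```

The sentence stays grammatical for downstream processing, and the `<mention>` fallback at least preserves the slot a name occupied.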

Sure, it's really hard to translate. Am I talking about Chicago the city or
Chicago the band? Maybe Chicago the movie?

Well that's why NLP is one of the most challenging areas of research, and
nothing in this article will help solve even 0.009% of those challenges.

(edit) Maybe confused isn't the right word... they may just have other
priorities for their use case. However, that doesn't excuse the title.

~~~
ppod
>Now, if your NLP application doesn't rely on this data, sure, throw it away.
But if you're looking at Twitter and throwing away any mentions, you're not
really processing natural language, are you?

The data is natural language, and your system is processing it. NLP doesn't
mean "using computation to model exactly and completely the entire human
language faculty". NLP is distinct from text processing in that often text
files may contain forms of unstructured information other than natural
language. The data being processed here is natural language.

------
Rickasaurus
Bag of words is the death of comprehensible NLP

~~~
jventura
Agree. There are also some things you can't do with "bag-of-words" methods,
such as handling multi-word expressions, because for those the order of words
is relevant.

~~~
samfriedman
A bag of n-grams can help here, but yes, you will still hit a wall quickly if
you require phrase understanding or context.
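For concreteness, a bag of word bigrams is just this (the function name is mine):

```python
def bag_of_ngrams(text, n=2):
    """Word n-gram features: 'not good' and 'good' become distinct,
    so a little word order survives the bag-of-words collapse."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(bag_of_ngrams("this movie was not good"))
# -> ['this movie', 'movie was', 'was not', 'not good']
```

The downside raised below follows directly: the feature space is roughly the vocabulary size raised to the power n, so most n-grams are seen once or never.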

~~~
aglionby
That has its own set of problems, namely data sparsity and the resulting
high-dimensionality vectors, although as you say it does give you a little
context. I'm not sure that a bag of anything is a good solution.

------
polm23
The thing that jumped out at me in this article was the use of Lime to explain
models - I hadn't heard of it before.

[https://github.com/marcotcr/lime](https://github.com/marcotcr/lime)

For NLP tasks, it looks like what it does is selectively delete words from the
input and check the classifier output. This way it determines which words have
the biggest effect on the output without needing to know anything about how
your model works.
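That leave-one-word-out idea is simple enough to sketch without the library itself; the toy classifier below is invented just to have a black box to probe:

```python
def word_importance(text, predict_proba):
    """Leave-one-word-out attribution, the core idea LIME applies to
    text: delete each word, re-score with the black-box classifier,
    and rank words by how much the positive-class probability drops."""
    words = text.split()
    base = predict_proba(text)
    drops = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        drops.append((w, base - predict_proba(reduced)))
    return sorted(drops, key=lambda d: -d[1])

# A stand-in "model" keyed on a single word, just to have something to probe:
def toy_model(text):
    return 0.9 if "disaster" in text.split() else 0.1

print(word_importance("what a disaster today", toy_model))
```

Because it only queries the classifier, the same function works unchanged on a bag-of-words model or a deep network. (LIME proper fits a local linear surrogate over many such perturbations rather than one deletion per word.)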

------
paultopia
I think this might be the first blog post I've read that actually explains how
to use word vectors as features -- good for the author!

------
mikevm
A question to the NLP experts out there -- is it possible to automatically
detect various pre-defined attributes about a person by automatically
analyzing relevant texts? For example, finding out whether a person is anti-
capitalist by scanning his blog posts related to economics. I'm not even sure
how to approach such a problem.

~~~
e_ameisen
Author here. If you have the means to label some data, this becomes quite an
easy classification problem. Get multiple examples of posts, and labels
(capitalist, libertarian, ...) for a group of users that represents your total
population well, and train a classifier.
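A sketch of that pipeline in scikit-learn, in the spirit of the article; the miniature dataset and labels are invented for illustration, and the real work is collecting representative labeled posts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented miniature dataset standing in for real labeled blog posts:
posts = [
    "markets should be free of state interference",
    "deregulation and private enterprise drive growth",
    "workers must seize the means of production",
    "capitalism exploits labor for private profit",
]
labels = ["capitalist", "capitalist", "anti-capitalist", "anti-capitalist"]

# TF-IDF bag-of-words features plus a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["private enterprise and free markets"])[0])
```

With a linear model you can also read the learned coefficients per word, which keeps the explainability benefits discussed elsewhere in the thread.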

~~~
mikevm
thank you!

------
master_yoda_1
The title of the blog is too ambitious.

------
CGamesPlay
This was an interesting read, but when I read sentences like "The words it
picked up look much more relevant!", I'm reminded of the XKCD explanation of
machine learning: [https://xkcd.com/1838/](https://xkcd.com/1838/)

------
hinkley
I know natural language processing predates Neuro-linguistic programming, but
I still can’t see ‘NLP’ without the little hairs on the back of my neck
standing up.

------
fnl
Bad title (this is all about text classification/mining, not NLP), but a very
nice introduction at that. Maybe a tad optimistic - I'd never even consider
applying a classifier with 80% accuracy to the Twitter firehose (unless
extremely noisy performance were a non-issue - but it never is ... :-)).

------
code4tee
Good intro, but the approaches used here are quite basic and outdated for
2018. Not sure this solves 90% of NLP problems.

~~~
minimaxir
Granted, even if the approaches are outdated, they can still be used to
_solve_ NLP problems.

------
phijFTW
I think

    
    
      def sanitize_characters(raw, clean):    
          for line in input_file:
              out = line
              output_file.write(line)
      sanitize_characters(input_file, output_file)
    

should be

    
    
      def sanitize_characters(raw, clean):    
          for line in raw:
              out = line
              clean.write(line)
      sanitize_characters(input_file, output_file)
    

in your notebook:
[https://github.com/hundredblocks/concrete_NLP_tutorial/blob/...](https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb)

Or am I mistaken?

