
Getting Started in Natural Language Processing - wordvector
https://monkeylearn.com/blog/getting-started-in-natural-language-processing-nlp/
======
caiobegotti
As a linguist and software engineer, I can't imagine someone doing serious NLP
without ever having studied [concrete] syntax trees and such. It is easy to
impress people with some tokenization, but it's n-grams that are really useful
in the real world, as is understanding syntax trees and all the
interconnections possible inside them, so you can NLP the shit out of real-
world text/speech instead of simple examples with a tagger (and a set of
tagged training data carefully crafted for a demo, like Apple's). This is a
good summary article with very good links nonetheless.
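
For the unfamiliar: n-grams are just sliding windows over a token sequence. A
minimal sketch in plain Python:

    def ngrams(tokens, n):
        # Slide a window of size n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the quick brown fox jumps".split()
    print(ngrams(tokens, 2))
    # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]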

~~~
riku_iki
> as is understanding syntax trees and all the interconnections possible
> inside them so you can NLP the shit out of real world text/speech

But don't modern DL approaches (e.g. SQuAD models, translation models) defy
this approach? They train DL models on labeled data without knowing anything
about syntax trees, and let the NN do all the magic...

~~~
yorwba
> They train DL models on labeled data without knowing anything about syntax
> trees, and let the NN do all the magic.

Sure, but that doesn't mean that knowing about syntax won't improve the
result. If you're training a translation model on a huge database of labeled
examples, it might discover syntactic relationships from scratch. But if you
don't have so much data, you're probably better off using all the auxiliary
information you can get.
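
One cheap way to inject that auxiliary information is to tag tokens with their
part of speech before training, so the model sees syntactic hints alongside
the surface forms. A sketch using spaCy (the token_TAG scheme is just an
illustrative choice, not a standard API, and output tags will vary by model):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def tag_augmented_tokens(text):
        # Append each token's coarse POS tag to its surface form.
        return [f"{tok.text}_{tok.pos_}" for tok in nlp(text)]

    print(tag_augmented_tokens("Time flies like an arrow"))
    # e.g. ['Time_NOUN', 'flies_VERB', 'like_ADP', 'an_DET', 'arrow_NOUN']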

~~~
riku_iki
> But that doesn't mean that knowing about syntax won't improve the result

That is likely correct: additional high-quality input will likely improve
performance. But creating such input for 200 modern human languages requires
significant effort, much larger than letting the NN solve the problem. That's
why researchers invest that effort in NN improvement, not in syntax-tree
creation tooling.

~~~
hcorreasuarez
All of us have to strike the right balance between stubbornly believing that
NNs will solve all problems in NLP (which, imho, underestimates what
linguistics can do for this field of study) and stubbornly insisting that only
linguistic knowledge will lead to better results (which, imho, underestimates
the outstanding work on deep learning techniques done so far). As you said,
creating high-quality input is likely to be costly, and I'm not talking about
syntax-tree creation tooling only here. But isn't it worth the while? I
believe it is.

------
imh
I really love that this getting started guide is "do lots of studying and
practice, here are the canonical textbooks, papers, conferences, tools, and
problems" instead of "spend a few hours on this superficial toy problem." I'd
love to see more guides like this.

~~~
stared
I strongly disagree.

It's easy to list a lot of books and papers (and drown newcomers in them)
without pointing to actual step-by-step starting points. Sure, doing
superficial problems is only the first step (and it's foolish to think that it
is the last step). Yet you can read all the books in the world, but unless you
are able to prove theorems or write code, you know less than someone who wrote
a small script to predict names.

Additionally, it's weird that they recommend NLTK (no, please not) and SpaCy
(cool and very useful, but high-level), but not Gensim or PyTorch (or at
least Keras). As a side note, PyTorch has readable implementations of
classical ML techniques, such as word2vec (vide
[https://adoni.github.io/2017/11/08/word2vec-pytorch/](https://adoni.github.io/2017/11/08/word2vec-pytorch/)).
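
To give a flavor (this is not the linked implementation, just a minimal sketch
of the skip-gram idea in PyTorch; vocabulary size and word ids are made up):

    import torch
    import torch.nn as nn

    class SkipGram(nn.Module):
        # Skip-gram: predict a context word from a center word.
        def __init__(self, vocab_size, dim=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)  # the word vectors
            self.out = nn.Linear(dim, vocab_size)       # context predictor

        def forward(self, center_ids):
            return self.out(self.embed(center_ids))     # logits over the vocab

    model = SkipGram(vocab_size=10000)
    loss_fn = nn.CrossEntropyLoss()
    center = torch.tensor([42])   # hypothetical center-word id
    context = torch.tensor([7])   # hypothetical context-word id
    loss_fn(model(center), context).backward()  # gradients reach the embeddings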

There are some good recommendations linked there (I really like "Speech and
Language Processing" by Dan Jurafsky and James H. Martin,
[https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/),
and recommended it myself in
[http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)).

~~~
autokad
I finished in the top 25 on Kaggle using NLTK and sklearn. word2vec is thrown
around like the gospel in NLP, but simple techniques usually do a lot better,
because: #1 there isn't that much data in most cases, and, most importantly,
#2 the corpus differs substantially from the one word2vec was fit on. I am
really flabbergasted by how many people start with word2vec and LSTMs, come up
with really over-fit models, and never even try the simple things.

Using n-grams (1- and 2-grams on words, and 3-5-character n-grams with
truncated SVD) gets you really far.
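
Roughly like this in sklearn (parameters are illustrative guesses, not my
exact setup):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion, make_pipeline

    # Word 1-2-grams, plus character 3-5-grams compressed with truncated SVD.
    features = FeatureUnion([
        ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("chars", make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
            TruncatedSVD(n_components=300),
        )),
    ])

    # X = features.fit_transform(docs)  # docs: a list of raw text strings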

------
YeGoblynQueenne
What, no Charniak? Tut tut:

[https://mitpress.mit.edu/books/statistical-language-learning](https://mitpress.mit.edu/books/statistical-language-learning)

------
alexott
Besides the NLP course by Jurafsky, the course “Introduction to Natural
Language Processing” by Dragomir Radev is quite good; it covers some topics
not covered in Jurafsky's course.

