
Ask HN: Need advice on learning NLP - navyad
I'm just starting to learn NLP through the book Natural Language Processing with Python.

I don't want to finish the book without understanding its essential parts. It would be great if you could point out which concepts are the important ones to grasp, so I can put extra effort into learning and experimenting with them.

Looking for advice from folks who have learned NLP or have some experience with it.

Bonus: point out sample projects to work on.
======
jventura
I would suggest starting simple and manually, to get some feeling for the
problems in the field. No frameworks, no tools, just you and Python!

Do a simple experiment: get some texts, split them into words on spaces (e.g.
line.split(" ")) and use a dict to count the frequency of the words. Sort the
words by frequency, look at them, and you will eventually reach the same
conclusion as figure 1 of Luhn's 1958 paper, written while he was at IBM
([http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.pdf](http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.pdf))
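In plain Python, that experiment is only a few lines (the sample text here is made up for illustration):

```python
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat watched the dog"
)

# Split on whitespace and count word frequencies
counts = Counter(text.split())

# Sort words by descending frequency, as in Luhn's figure 1
for word, freq in counts.most_common(5):
    print(word, freq)
```

With real texts you will see the same shape Luhn describes: a few very frequent function words ("the", "of", "and") at the top and a long tail of rare words.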

There are lots of corpora out there in the wild, but if you need to roll your
own from Wikipedia texts you can use this tool I made:
[https://github.com/joaoventura/WikiCorpusExtractor](https://github.com/joaoventura/WikiCorpusExtractor)

From this experiment, and depending on whether you like statistics, you can
play a bit with the numbers. For instance, you can use tf-idf
([https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf))
to extract potential keywords from documents. Check the formula: it only uses
the frequencies of occurrence of words in documents.
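A minimal sketch of that formula on a made-up toy corpus, using raw term frequency times the log inverse document frequency:

```python
import math

# Toy corpus: each document is a list of tokens (made up for illustration)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def tf_idf(term, doc, docs):
    # Term frequency: occurrences of the term in this document
    tf = doc.count(term)
    # Inverse document frequency: log(N / number of docs containing the term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df)

# "the" appears in every document, so its tf-idf is 0;
# "cat" is more distinctive of the first document
print(tf_idf("the", docs[0], docs))  # 0.0
print(tf_idf("cat", docs[0], docs))
```

Words that appear everywhere score zero, which is exactly why tf-idf surfaces potential keywords rather than function words.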

Only use tools such as deep neural networks if you decide later that they are
essential for what you need. I did an entire PhD in this area with just Python
and playing with frequencies, no frameworks at all (an example of my work can
be found at
[http://www.sciencedirect.com/science/article/pii/S1877050912001251](http://www.sciencedirect.com/science/article/pii/S1877050912001251)).

Good luck!

~~~
arvinsim
Thanks for posting this. Good to know that using just vanilla Python is as
viable for learning as using specialized frameworks.

~~~
syllogism
You might find these posts interesting:

[https://explosion.ai/blog/part-of-speech-pos-tagger-in-python](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python)

[https://explosion.ai/blog/parsing-english-in-python](https://explosion.ai/blog/parsing-english-in-python)

These days I would say those articles are better seen as examples of solving
NLP problems with linear models -- tagging and parsing are less important than
they used to be. Here's how I think about doing NLP with current neural
network techniques:
[https://explosion.ai/blog/deep-learning-formula-nlp](https://explosion.ai/blog/deep-learning-formula-nlp)

------
hiddencost
NLP for what purpose?

- Academic -- want results? deep learning [0], data munging [1,2]. Want to
understand "why" / context? Jurafsky and Martin [1]

- Professional -- the data is easy to get and clean? deep learning [0]. You
need to do a lot of work to get the signal? [2]

- Personal --
[http://karpathy.github.io/2015/05/21/rnn-effectiveness/](http://karpathy.github.io/2015/05/21/rnn-effectiveness/),
[http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

(Andrej Karpathy and Chris Olah are some of my favorite writers)

[0] [http://www.deeplearningbook.org/](http://www.deeplearningbook.org/)
[1] [https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)
[2] [http://nlp.stanford.edu/IR-book/](http://nlp.stanford.edu/IR-book/)

------
deepGem
Start with Machine Learning by Andrew Ng on Coursera. Once you get the hang of
neural networks (chapter 4 of the course, I think), jump to Stanford's CS224n.
It's helpful to complete Andrew's course as well.

[http://web.stanford.edu/class/cs224n/](http://web.stanford.edu/class/cs224n/)

CS224n is not easy. Of course, you can learn NLP without deep learning, but
today it makes sense to pursue this path. During CS224n you'll pick up some
project ideas, as they discuss a ton of papers and the latest work.

~~~
rmchugh
I think deep learning is a pretty hefty starting point for learning NLP.
Cutting edge NLP seems to be more and more based on deep learning, but it's a
rather steep learning curve for a beginner. I would have thought starting with
the basics (like the NLTK book) was more useful. Once those concepts are
mastered, one can progress to see what deep learning brings to the field.

------
mericsson
Some good advice here:
[https://blog.ycombinator.com/how-to-get-into-natural-language-processing/](https://blog.ycombinator.com/how-to-get-into-natural-language-processing/)

~~~
navyad
Didn't know about this; very helpful, thanks.

------
haidrali
Keep reading and practicing with this book:
[http://www.nltk.org/book_1ed/](http://www.nltk.org/book_1ed/). When you
complete it you will have a good understanding of NLP. Suggested sample
projects to work on:

Implement a classifier. For details, see chapter 13 of
[http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf](http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf)

Cover topics like sentiment analysis, document summarisation, etc.
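To give a feel for what that chapter covers, here is a minimal multinomial Naive Bayes sketch with add-one smoothing (the tiny training set is made up for illustration):

```python
import math
from collections import Counter, defaultdict

# Toy labeled training data (made up)
train = [
    ("good great fun", "pos"),
    ("great happy good", "pos"),
    ("bad awful boring", "neg"),
    ("boring bad sad", "neg"),
]

# Count word occurrences per class, documents per class, and the vocabulary
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    tokens = text.split()
    word_counts[label].update(tokens)
    class_counts[label] += 1
    vocab.update(tokens)

def classify(text):
    scores = {}
    for label in class_counts:
        # Log prior plus log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for token in text.split():
            score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("good fun"))   # pos
print(classify("awful sad"))  # neg
```

The real thing adds proper tokenization and feature selection, but the structure (priors, smoothed likelihoods, argmax over log scores) is the same.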

~~~
tu7001
The information retrieval book is a great read. I'm going through it and
implementing the algorithms; I'm learning a lot.

~~~
haidrali
I implemented these two algorithms back in 2013; do check it out:
[https://github.com/wonderer007/Naive-Bayes-classifier](https://github.com/wonderer007/Naive-Bayes-classifier)

------
kyrre
No point wasting your time on NLTK:

cs224d (videos, lecture notes, assignments)

a similar course:
[https://github.com/oxford-cs-deepnlp-2017/lectures](https://github.com/oxford-cs-deepnlp-2017/lectures)

good paper: [https://arxiv.org/abs/1103.0398](https://arxiv.org/abs/1103.0398)
"Natural Language Processing (almost) from Scratch"

------
gtani
I think you want to understand the computational linguistics viewpoint:
parsers, PoS tagging, dependency analysis, syntax trees;

and the machine learning perspective: embeddings in, say, 100-200 dimensional
space (word2vec, GloVe), topic modelling/LDA, and latent semantic analysis
from the 90's. Then you can read about feeding embeddings into LSTMs, GRUs,
content-addressable memory/attention mechanisms, etc., which are being
furiously introduced (you can scan the ICLR submissions and
[http://aclweb.org/anthology/](http://aclweb.org/anthology/)).
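To make the embedding idea concrete, here is a toy sketch with made-up 4-dimensional vectors (real word2vec/GloVe vectors have 100-300 dimensions learned from data, but similarity works the same way):

```python
import math

# Toy "embeddings" (made up): similar words get similar vectors
vectors = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "cat" should be closer to "dog" than to "car"
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["car"]))
```

This nearest-neighbour view of embeddings is the starting point for everything downstream, whether the consumer is LDA-style topic features or an LSTM.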

_____________________

The Jurafsky/Martin draft 3rd ed is a good starting point, they've got about
2/3 of chapters drafted:
[https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)
as well as the Stanford, Oxford, etc. courses on NLP and comp linguistics:
Klein's
[https://people.eecs.berkeley.edu/~klein/cs288/fa14/](https://people.eecs.berkeley.edu/~klein/cs288/fa14/),
Collins'
[http://www.cs.columbia.edu/~cs4705/](http://www.cs.columbia.edu/~cs4705/),
and other courses at MIT, CMU, UIUC, etc.

Also, try out the various standard benchmark datasets and tasks:
[https://arxiv.org/abs/1702.01923](https://arxiv.org/abs/1702.01923)

________________

Last time I checked, this state-of-the-art page wasn't up to date or very well
summarized, but it will give you lots of project ideas:
[http://www.aclweb.org/aclwiki/index.php?title=State_of_the_art](http://www.aclweb.org/aclwiki/index.php?title=State_of_the_art)

------
sainib
This is one of the best resources for learning NLP using Python:
[https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL](https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0QuDf2JswnfiGkliBInZnIC4HL)

Step by step, one concept at a time, in short videos of just a few minutes each.

------
sprobertson
For the deep learning angle, I'm starting a project-based tutorial series on
using neural networks (specifically RNNs) for NLP, in PyTorch:
[https://github.com/spro/practical-pytorch](https://github.com/spro/practical-pytorch)

So far it covers using RNNs for sequence classification and generation, and
combining those for seq2seq translation. Next up is using recursive neural
networks for structured intent parsing.

PS: To anyone who has searched for NLP tutorials, what tutorial have you
wanted that you couldn't find?

------
stared
See the links here:
[http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html).
Especially:

- Python packages: Gensim, spaCy

- book:
[https://web.stanford.edu/~jurafsky/slp3/](https://web.stanford.edu/~jurafsky/slp3/)

------
demonshalo
I think the best way to start is tackling a specific problem, e.g. try
building a summarizer for any given piece of text.

Start with traditional statistical methods first, in order to understand what
works and what doesn't. From there, you can go on to build an ML solution to
the same problem and see the actual difference between the two approaches in
terms of output.
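One traditional statistical baseline for such a summarizer: score each sentence by the total frequency of its words and keep the top scorer (a toy sketch; the sample text is made up):

```python
from collections import Counter

text = (
    "Python is great for NLP. Many great NLP libraries exist for Python. "
    "The weather was nice."
)

# Naive sentence split on periods (fine for this toy text)
sentences = [s.strip() for s in text.split(".") if s.strip()]

# Word frequencies over the whole text
freqs = Counter(text.lower().replace(".", "").split())

def score(sentence):
    # A sentence is "important" if its words are frequent overall
    return sum(freqs[w] for w in sentence.lower().split())

# One-sentence "summary": the highest-scoring sentence
summary = max(sentences, key=score)
print(summary)
```

It's crude (no stopword removal, no length normalization), which is exactly the point: the failure modes you see here motivate the fancier methods.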

------
navyad
It's also helpful to read some research papers:

[https://research.google.com/pubs/NaturalLanguageProcessing.h...](https://research.google.com/pubs/NaturalLanguageProcessing.html)

------
amirouche
What is the book you are reading?

~~~
navyad
I am following [http://www.nltk.org/book_1ed/](http://www.nltk.org/book_1ed/)

~~~
xiphias
It looks very old; try something that uses deep learning, like this:

[https://github.com/rouseguy/DeepLearningNLP_Py](https://github.com/rouseguy/DeepLearningNLP_Py)

~~~
botexpert
It's not old. It has most of the background you need. Deep-learning NLP is not
necessary for most common tasks.

------
zump
I also need help; can someone point me to the latest results with NLP?

I want to build an AI powered note-taker.

------
jm547ster
[https://www.amazon.co.uk/Introducing-Neuro-Linguistic-Programming-Joseph-OConnor/dp/1855383446](https://www.amazon.co.uk/Introducing-Neuro-Linguistic-Programming-Joseph-OConnor/dp/1855383446)

