Ask HN: Need advice on learning NLP
127 points by navyad on Feb 20, 2017 | 28 comments
I'm just starting to learn NLP through the book Natural Language Processing with Python.

I don't want to finish the book without having understood its essential parts. It would be great if you could point out which concepts are the important ones to grasp, so I can put extra effort into learning and experimenting with them.

Looking for advice from folks who have learned the NLP concepts or have some kind of experience in NLP.

Bonus: point out sample projects to work on.




I would suggest starting simple and doing things manually, to get a feeling for the problems in the field. No frameworks, no tools, just you and Python!

Do a simple experiment: get some texts, split words on spaces (e.g. line.split(" ")), and use a dict to count the frequency of each word. Sort the words by frequency and look at them, and you will eventually reach the same conclusion as figure 1 of the paper Luhn wrote while working for IBM in 1958 (http://courses.ischool.berkeley.edu/i256/f06/papers/luhn58.p...)
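The experiment above fits in a few lines. A minimal sketch (the function takes any iterable of strings, so you can pass it an open file handle):

```python
# Split each line on whitespace, count word occurrences with a
# Counter (a dict subclass), and sort by frequency.
from collections import Counter

def word_frequencies(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common()  # (word, count) pairs, most frequent first

# Feed it open("corpus.txt") or any list of strings; the top of the
# list will be dominated by function words ("the", "of", ...), which
# is the shape Luhn's figure 1 shows.
```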

There are lots of corpora out there in the wild, but if you need to roll your own from Wikipedia texts you can use this tool I made: https://github.com/joaoventura/WikiCorpusExtractor

From this experiment, and depending on whether you like statistics or not, you can play a bit with the numbers. For instance, you can use tf-idf (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to extract potential keywords from documents. Check the formula; it only uses the frequency of occurrence of words in documents.
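A minimal tf-idf sketch over a list of tokenized documents. Note this uses one common formulation (raw term frequency times log inverse document frequency); the Wikipedia page lists several variants:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: score} dict per doc."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: in how many docs each word appears
    scores = []
    for doc in docs:
        tf = Counter(doc)    # term frequency within this document
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores
```

Words that appear in every document get a score of zero (log of 1), so the highest-scoring words are the ones frequent in one document but rare elsewhere, i.e. keyword candidates.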

Only use tools such as deep neural networks if you decide later that they are essential for what you need. I did an entire PhD in this area with just Python and playing with frequencies, no frameworks at all (an example of my work can be found at http://www.sciencedirect.com/science/article/pii/S1877050912...).

Good luck!


Thanks for posting this. Good to know that using just vanilla Python is as viable for learning as using specialized frameworks.


You might find these posts interesting:

https://explosion.ai/blog/part-of-speech-pos-tagger-in-pytho...

https://explosion.ai/blog/parsing-english-in-python

These days I would say these articles are better indications of solving NLP problems with linear models -- tagging and parsing are less important than they used to be. Here's how I think about doing NLP with current neural network techniques: https://explosion.ai/blog/deep-learning-formula-nlp


Oh, you definitely don't need much more than vanilla Python for playing around and testing things. The only thing you may need is a good tokenizer (i.e., to know where you should break the words), but the regex in [0] was good enough for what I needed. I was working with texts in three different languages (EN, DE and PT), and that is the reason I recommend statistical approaches, as they tend to be language-independent.

[0] https://github.com/joaoventura/WikiCorpusExtractor/blob/mast...
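For illustration, a regex-based tokenizer can be as small as this (this is an illustrative pattern, not the exact regex from [0]):

```python
import re

def tokenize(text):
    # Keep runs of word characters (letters, digits, underscore),
    # splitting on whitespace and punctuation.
    return re.findall(r"\w+", text.lower())
```

A pattern this naive splits contractions ("don't" becomes "don", "t"), which is one reason real tokenizers are more elaborate.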


Thanks for the input. Right now I'm playing with the built-in corpora and texts of the NLTK library.


NLP for what purpose?

- Academic -- want results? deep learning [0], data munging [1,2] -- want to understand "why" / context? Jurafsky and Martin [1]

- Professional -- the data is easy to get and clean? deep learning [0] -- you need to do a lot of work to get the signal? [2]

- Personal -- http://karpathy.github.io/2015/05/21/rnn-effectiveness/ -- http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...

(Andrej Karpathy and Chris Olah are some of my favorite writers)

[0] http://www.deeplearningbook.org/ [1] https://web.stanford.edu/~jurafsky/slp3/ [2] http://nlp.stanford.edu/IR-book/


Start with Machine Learning by Andrew Ng on Coursera. Once you get the hang of neural networks (chapter 4 in the course, I think), jump to Stanford's CS224n. It's helpful to complete Andrew's course as well.

http://web.stanford.edu/class/cs224n/

cs224n is not easy. Of course, you can learn NLP without deep learning, but today it makes sense to pursue this path. During the course of CS224n you'll get some project ideas as they discuss a ton of papers and the latest stuff.


I think deep learning is a pretty hefty starting point for learning NLP. Cutting edge NLP seems to be more and more based on deep learning, but it's a rather steep learning curve for a beginner. I would have thought starting with the basics (like the NLTK book) was more useful. Once those concepts are mastered, one can progress to see what deep learning brings to the field.


thanks.



Didn't know of this, highly helpful, thanks.


Keep reading and practicing with this book: http://www.nltk.org/book_1ed/. When you complete it you will have a good understanding of NLP. Suggested sample projects to work on:

Implementing a classifier. For details, see chapter 13 of http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Cover topics like sentiment analysis, document summarisation, etc.
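Chapter 13 of the IR book covers Naive Bayes text classification. A minimal multinomial Naive Bayes classifier with add-one smoothing, a sketch along those lines rather than the book's exact pseudocode:

```python
import math
from collections import Counter

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of token lists; labels: one class label per doc."""
        self.classes = set(labels)
        self.prior = Counter(labels)           # class document counts
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
            self.vocab.update(doc)
        self.total = {c: sum(self.word_counts[c].values()) for c in self.classes}
        self.n_docs = len(docs)
        return self

    def predict(self, doc):
        def log_prob(c):
            lp = math.log(self.prior[c] / self.n_docs)
            for w in doc:
                # Add-one (Laplace) smoothing over the vocabulary so
                # unseen words don't zero out the probability.
                lp += math.log((self.word_counts[c][w] + 1) /
                               (self.total[c] + len(self.vocab)))
            return lp
        return max(self.classes, key=log_prob)
```

Trained on a handful of labeled token lists (e.g. "pos"/"neg" movie reviews), this already makes a workable sentiment classifier for a first project.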


The information retrieval book is a great read. I'm going through it and implementing the algorithms, and learning a lot.


I implemented these two algorithms back in 2013; do check it out: https://github.com/wonderer007/Naive-Bayes-classifier


no point wasting your time on nltk:

cs224d (videos, lecture notes, assignments)

a similar course: https://github.com/oxford-cs-deepnlp-2017/lectures

good paper: https://arxiv.org/abs/1103.0398 "Natural Language Processing (almost) from Scratch"


I think you want to understand the computational linguistics viewpoint: parsers, PoS tagging, dependency analysis, syntax trees;

and the machine learning perspective: embeddings in, say, 100-200 dimensional space (word2vec, GloVe), topic modelling/LDA, and latent semantic analysis from the '90s. Then you can read about feeding embedding datasets into LSTMs, GRUs, content-addressable memory/attention mechanisms, etc. that are being furiously introduced (you can scan the ICLR submissions and http://aclweb.org/anthology/).
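The basic operation behind embedding lookups (word2vec, GloVe) is cosine similarity between word vectors. The 3-dimensional vectors below are toy values chosen for illustration; real embeddings are 100-300 dimensional and learned from a corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: "king" and "queen" point in similar directions,
# "apple" points elsewhere.
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}
```

With real embeddings, nearest neighbours under this measure tend to be semantically related words, which is what makes them useful as input features for LSTMs and the rest.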

_____________________

The Jurafsky/Martin draft 3rd ed is a good starting point, they've got about 2/3 of chapters drafted: https://web.stanford.edu/~jurafsky/slp3/ as well as the Stanford, Oxford, etc courses on NLP and comp linguistics, and Klein's https://people.eecs.berkeley.edu/~klein/cs288/fa14/ , Collins: http://www.cs.columbia.edu/~cs4705/ and other courses at MIT, CMU, UIUC etc

Also, try out the various standard benchmark datasets and tasks: https://arxiv.org/abs/1702.01923

________________

Last time I checked, this state-of-the-art page wasn't up to date and not very well summarized, but it will give you lots of project ideas: http://www.aclweb.org/aclwiki/index.php?title=State_of_the_a...


This is one of the best resources for learning NLP using Python - https://www.youtube.com/watch?v=FLZvOKSCkxY&list=PLQVvvaa0Qu...

Step by step, one concept at a time, in short videos of just a few minutes each.


For the deep learning angle, I'm starting a project-based tutorial series on using neural networks (specifically RNNs) for NLP, in PyTorch: https://github.com/spro/practical-pytorch

So far it covers using RNNs for sequence classification and generation, and combining those for seq2seq translation. Next up is using recursive neural networks for structured intent parsing.

PS: To anyone who has searched for NLP tutorials, what tutorial have you wanted that you couldn't find?


See links in here: http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html. Especially:

- Python packages: Gensim, spaCy

- book: https://web.stanford.edu/~jurafsky/slp3/


I think the best way to start is by tackling a specific problem, e.g. building a summarizer for any given piece of text.

Start with traditional statistical methods first in order to understand what works and what doesn't. From there, you can work on an ML solution to the same problem to see the actual difference between the two approaches in terms of output.
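A frequency-based extractive summarizer is one such traditional baseline: score each sentence by the average corpus frequency of its words and keep the top-scoring ones. A minimal sketch (the sentence splitter here is deliberately naive):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    # Naive sentence split on terminal punctuation followed by space.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)
```

Comparing this baseline's output against an ML summarizer on the same texts makes the trade-offs between the two approaches concrete.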


It's also helpful to read some research papers:

https://research.google.com/pubs/NaturalLanguageProcessing.h...


What is the book you are reading?



It looks very old, try something that uses deep learning, like this:

https://github.com/rouseguy/DeepLearningNLP_Py


It's not old; it has most of the background you need. Deep-learning NLP is not necessary for most common tasks.


Thanks, I will certainly have a look.


I also need help; can someone point me to the latest results with NLP?

I want to build an AI powered note-taker.




