It's a great resource, but don't expect to get started quickly.
On the flip side, the Python in the book is fairly straightforward, so developers should be able to dive right in with the code. In fact, experienced developers will probably find themselves skipping over many sections that are essentially lessons on Python.
Overall, this book is a good introduction to NLP and NLTK. That said, if you're at all familiar with NLP, you'd probably do just as well to dive right in and start working with the toolkit. NLTK is very well-documented and easy to work with. It's also amazingly complete: tokenizers, stemmers, POS taggers, classifiers, and so on. It also includes a number of algorithms that commonly fall under ML, data mining, and IR. If you work with Python and do any kind of work with unstructured text, you need to check NLTK out.
> It's also amazingly complete: tokenizers, stemmers, POS taggers, classifiers, and so on.
Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis. It is hard to beat Stanford Parser for that. Check out https://github.com/wavii/pfp which has Python bindings.
For most of the ML stuff, you would be better off going directly to a specialist library like scikits.learn. They are faster and the implementations are more accurate. (I found some of the implementations in NLTK not quite correct. For example, the Naive Bayes classifier, which a lot of first-time users reach for. The difference in results may not be much in practice, but it is still incorrect.)
It is definitely a very good place to start but better alternatives exist for many of the pieces.
Conceded and agreed. This is the one major gap. But I still maintain it's a remarkably complete toolkit. Plus you get to work in Python, which is a big advantage for me.
What's wrong with the Naive Bayes classifier? Did you submit a patch?
Likewise, I totally agree with you that there are faster/more accurate/more efficient implementations of many of the tools in NLTK. If performance is a must, then you're better off prototyping in NLTK and then moving to a specialized library. But in terms of completeness and ease of use, NLTK is very strong.
EDIT: I'm not sure why abhaga is being downvoted. There was nothing disrespectful in his response to me. Disagreement is an important part of intelligent discussion. Upvoting to counter the downvote(s).
The problem I found is that it mixes up the binomial and multinomial event models for naive Bayes (see http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pd... for reference). It computes the probabilities per the binomial event model but doesn't include the probabilities of the missing events. This was my understanding from reading the source code.
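For anyone curious what the mixup looks like concretely, here's a toy, self-contained sketch of the two event models (the corpus, function names, and smoothing choices are mine, not NLTK's code). The key difference: the binomial/Bernoulli model must multiply in a factor for every vocabulary word that is *absent* from the document, which is exactly the term being dropped.

```python
import math

# Toy training corpus: class -> list of documents (token lists).
train = {
    "pos": [["good", "great"], ["good", "good", "fun"]],
    "neg": [["bad", "boring"], ["bad", "bad", "dull"]],
}
vocab = sorted({w for docs in train.values() for d in docs for w in d})

def bernoulli_log_prob(doc, cls):
    """Binomial/Bernoulli event model: EVERY vocabulary word contributes.
    Present words contribute log P(w|c); absent words contribute
    log (1 - P(w|c)). Dropping the absent-word factors while still using
    document-presence probabilities is the inconsistency described above."""
    docs = train[cls]
    present = set(doc)
    logp = 0.0
    for w in vocab:
        # Laplace-smoothed fraction of class documents containing w.
        p = (sum(w in d for d in docs) + 1) / (len(docs) + 2)
        logp += math.log(p) if w in present else math.log(1 - p)
    return logp

def multinomial_log_prob(doc, cls):
    """Multinomial event model: only tokens that actually occur contribute,
    once per occurrence, with probabilities from token counts."""
    docs = train[cls]
    counts, total = {}, 0
    for d in docs:
        for w in d:
            counts[w] = counts.get(w, 0) + 1
            total += 1
    return sum(
        math.log((counts.get(w, 0) + 1) / (total + len(vocab))) for w in doc
    )

doc = ["good", "fun"]
print(bernoulli_log_prob(doc, "pos"), bernoulli_log_prob(doc, "neg"))
print(multinomial_log_prob(doc, "pos"), multinomial_log_prob(doc, "neg"))
```

Both models rank "pos" above "neg" here, which matches the observation that the results often don't differ much in practice even when the math is mixed up.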
> Plus you get to work in Python, which is a big advantage for me.
Indeed. I so wish someone would build a dependency parser on top of pfp so that I can ditch Stanford parser. I have used https://github.com/dasmith/stanford-corenlp-python for interfacing with Stanford toolkit but it is somewhat brittle.
Me too. I'm saying the book (like almost all technical documentation) should be task-based. Let people read the chapter that addresses their problem; don't ask them to read a whole list of solutions to work out which one is applicable. E.g.:
- Title: 'Preparing text for analysis'
  (explain what tokenizing is and why it's necessary)
- Title: 'Gauging opinion of a topic'
  (explain what sentiment analysis is)
- Title: 'Identifying relationships'
  (explain logical inference)
Task-based documentation doesn't seek to avoid terminology. It seeks to explain why those concepts exist and how they are useful (which encourages people to learn, as they become aware of the practical application of their knowledge).
My response is task based. The titles are tasks you would like to perform.
Compare with the TOC of the NLTK book:
'2. Accessing Text Corpora and Lexical Resources (extras)'
What is this? What does it help me do?
If it's prepare a document for analysis, then the title should be 'Preparing text for analysis'
'3. Processing Raw Text'
'Process' is a meaningless word, like 'System' or 'Data'.
4. Writing Structured Programs (extras)
That sounds like a coding guidelines document. What does it have to do with language? Why do I want to do this?
7. Extracting Information from Text
8. Analyzing Sentence Structure (extras)
9. Building Feature Based Grammars
Why do I want to do that?
10. Analyzing the Meaning of Sentences (extras)
Good. Better would be 'Determining the meaning of sentences'
11. Managing Linguistic Data
'Managing' is another meaningless word. Are you going to process a managed data system now? No? Perhaps there's some practical advice regarding handling large volumes of text here. Cool, then my task, and the title, should be 'Handling large volumes of text'.
I was mostly interested in tagging and chunking and found it easy to reason about and use the relevant NLTK features with little auxiliary reading.
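For anyone who hasn't tried chunking: it just groups POS-tagged tokens into phrases. NLTK's RegexpParser builds a chunker from a grammar string like "NP: {<DT>?<JJ>*<NN.*>+}"; the sketch below hand-rolls that same noun-phrase pattern over pre-tagged tokens (so it runs without NLTK or any downloaded tagger models; the example sentence and tags are mine).

```python
# Pre-tagged tokens, as nltk.pos_tag would produce them (Penn Treebank tags).
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN")]

def np_chunks(tokens):
    """Toy NP chunker: match an optional determiner, any number of
    adjectives, then one or more nouns -- the pattern <DT>?<JJ>*<NN.*>+."""
    chunks, i = [], 0
    while i < len(tokens):
        j = i
        if j < len(tokens) and tokens[j][1] == "DT":   # optional determiner
            j += 1
        while j < len(tokens) and tokens[j][1] == "JJ":  # adjectives
            j += 1
        k = j
        while k < len(tokens) and tokens[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:  # at least one noun: emit the whole chunk
            chunks.append([w for w, _ in tokens[i:k]])
            i = k
        else:      # no noun here: move on one token and retry
            i += 1
    return chunks

print(np_chunks(tagged))
# -> [['the', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]
```

With NLTK itself, the equivalent is two lines: tag with nltk.pos_tag, then parse with nltk.RegexpParser using the grammar above.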
The only issues I found were with NLTK itself, e.g., Python version support, loading time, packages.
It is integrated into and maintained by the authors of AbiWord. There is a talk on this other library at PyCon this year: https://us.pycon.org/2012/schedule/presentation/187/
Also, The Language Instinct was written in the early 1990s, so although new editions have been released, it's a bit dated. It's a classic read for linguists and people interested in language, but I wouldn't recommend it as an introduction to NLP.
Anyway, in case anyone reading this missed it, the Stanford NLP class taught by Chris Manning and Dan Jurafsky starting next week (Jan 23rd) will allow programming assignments to be submitted using Python and NLTK, which is really good news.
So now's a good time to get familiar with NLTK, or to get a refresher for those of us already acquainted with it.
It includes sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition.
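To illustrate just one of those pieces, here's a toy suffix-stripping stemmer (entirely my own sketch, with a made-up suffix list, not NLTK's PorterStemmer, which implements the real algorithm with many more rules and conditions):

```python
# Toy suffix-stripping stemmer: strip the longest matching suffix,
# but only if a stem of at least 3 characters remains. Real stemmers
# like Porter/Snowball apply ordered rule sets with measure conditions.
SUFFIXES = ("ization", "ational", "fulness", "ousness",
            "ations", "ingly", "ation", "ness", "ing", "ed", "ly", "s")

def toy_stem(word):
    for suf in SUFFIXES:  # longer suffixes are listed (and tried) first
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([toy_stem(w) for w in ["tokenizing", "tagged", "quickly", "chunks"]])
# -> ['tokeniz', 'tagg', 'quick', 'chunk']
```

Note that stems need not be dictionary words ('tokeniz', 'tagg'); that's the difference from lemmatization, which NLTK also provides via its WordNet lemmatizer.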
Seriously - text mining made fun and exploratory with open source tools.
One of the text sources I like to use is the Launchpad tickets for the Ubuntu project, since they get a good amount of traffic from international end users, a professional interest of mine.
It would be great to hear about some other interesting open data sets that people have found.
I see the book as a way to sell their API for a service in which everything (algorithms and methods) is obfuscated server-side.
I expected real insight to help me with a project, but I got less information and more noise than what I already knew.
The entirety of NLTK is FOSS, originally produced by a university. Even if NLTK had some sort of hosted version (and I'm not aware that it does, though your post seems to claim so), you could always download the whole source (https://github.com/nltk/nltk) and use it as you see fit (it's even Apache-licensed).