Natural Language Processing with Python, free online (nltk.org)
255 points by motter on Jan 17, 2012 | 31 comments

I own the paper version of this - just a heads up, it's written primarily for language scientists rather than developers. You may find yourself constantly pausing to look up concepts and their meanings / practical uses on linguistics websites as you work through the book.

It's a great resource, but don't expect to get started quickly.

I'm not sure that any book on the NLTK, even one written specifically for developers, could avoid the concepts you're referring to. These terms and concepts are inherent to the study of natural language processing. So while I agree that there are a significant number of NLP-related terms used, I think that has less to do with the book's audience and more to do with the book's subject matter. And the use of computational linguistic terminology is not gratuitous -- it's important to learn the jargon so you can look for the right type of help when you don't know how to do something.

On the flip side, the Python in the book is fairly straightforward, so developers should be able to dive right in with the code. In fact, experienced developers will probably find themselves skipping over many sections that are essentially lessons on Python.

Overall, this book is a good introduction to NLP and NLTK. That said, if you're at all familiar with NLP, you'd probably do just as well to dive right in and start working with the toolkit. NLTK is very well-documented and easy to work with. It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifiers, etc. It also includes a number of algorithms that commonly fall under ML, data mining, and IR. If you work with Python and do any kind of work with unstructured text, you need to check NLTK out.
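To give a flavor of that completeness, here's a quick tour of two of the pieces mentioned (a sketch assuming NLTK is installed; these particular classes are pure Python and need no extra corpus downloads):

```python
import nltk
from nltk.stem import PorterStemmer

# Tokenization: wordpunct_tokenize is a simple regexp-based tokenizer.
tokens = nltk.wordpunct_tokenize("NLTK is amazingly complete.")
print(tokens)  # ['NLTK', 'is', 'amazingly', 'complete', '.']

# Stemming: reduce inflected forms to a common stem.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in ["running", "cats", "stemmers"]])
```

The POS taggers and classifiers follow the same pattern of small, composable objects, though some require downloading model data first.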

> NLTK is very well-documented and easy to work with.


> It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifiers, etc.

Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis. It is hard to beat Stanford Parser for that. Check out https://github.com/wavii/pfp which has Python bindings.

For most of the ML stuff, you would be better off going directly to a specialist library like scikit-learn. They are faster and the implementations are more accurate. (I found some of the implementations in NLTK not quite correct -- for example, the Naive Bayes classifier, which a lot of first-time users reach for. The difference in results may not be much in practice, but it is still incorrect.)

It is definitely a very good place to start but better alternatives exist for many of the pieces.
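For reference, going to scikit-learn directly for a text-classification task looks something like this (a minimal sketch; the toy documents and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy sentiment data: 1 = positive, 0 = negative.
docs = ["great fantastic good", "awful terrible bad",
        "good great movie", "bad awful film"]
labels = [1, 0, 1, 0]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
prediction = model.predict(["fantastic good film"])
print(prediction)
```

The pipeline keeps the vectorizer and classifier together, so the same preprocessing is applied at training and prediction time.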

> Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis.

Conceded and agreed. This is the one major gap. But I still maintain it's a remarkably complete toolkit. Plus you get to work in Python, which is a big advantage for me.

What's wrong with the Naive Bayes classifier? Did you submit a patch?

Likewise, I totally agree with you that there are faster/more accurate/more efficient implementations of many of the tools in the NLTK. If performance is a must, then you're better off prototyping in NLTK and then moving to a specialized library. But in terms of completeness and ease of use, NLTK is very strong.

EDIT: I'm not sure why abhaga is being downvoted. There was nothing disrespectful in his response to me. Disagreement is an important part of intelligent discussion. Upvoting to counter the downvote(s).

> What's wrong with the Naive Bayes classifier?

The problem I found is that it mixes up the binomial and the multinomial event models for Naive Bayes (see http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pd... for reference). It computes the probabilities per the binomial event model but doesn't include the probabilities of missing events. This was my understanding from reading the source code.
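The distinction can be sketched in plain Python (toy numbers for a two-word vocabulary; this is an illustration of the two event models, not NLTK's actual code):

```python
import math

# P(word present | class) for a toy two-word vocabulary.
vocab = {"good": 0.9, "bad": 0.2}

def bernoulli_log_likelihood(doc_words):
    """Binomial/Bernoulli event model: every vocabulary word contributes,
    including a (1 - p) factor for words *absent* from the document."""
    ll = 0.0
    for w, p in vocab.items():
        ll += math.log(p) if w in doc_words else math.log(1.0 - p)
    return ll

def multinomial_log_likelihood(doc_tokens):
    """Multinomial event model: only tokens that actually occur contribute,
    once per occurrence; absent words are ignored."""
    return sum(math.log(vocab[w]) for w in doc_tokens)

doc = {"good"}
# Bernoulli scores the absence of "bad" too: log(0.9) + log(1 - 0.2)
print(bernoulli_log_likelihood(doc))
# Multinomial scores only the observed token: log(0.9)
print(multinomial_log_likelihood(["good"]))
```

Computing probabilities per the Bernoulli model while dropping the absent-word factors gives a score that belongs to neither model, which is the mix-up being described.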

> Plus you get to work in Python, which is a big advantage for me.

Indeed. I so wish someone would build a dependency parser on top of pfp so that I can ditch Stanford parser. I have used https://github.com/dasmith/stanford-corenlp-python for interfacing with Stanford toolkit but it is somewhat brittle.

No SVM support either. I could try to add it I guess; libSVM has Python bindings already.

As far as I know, NLTK has no C dependencies other than its general dependency on NumPy. I think they are keeping the toolkit in pure Python on purpose (but I may be wrong about that). That said, there are SVM implementations in pure Python -- PyMVPA, for one.

> I'm not sure that any book on the NLTK could avoid the concepts you're referring to.

Me neither. I'm saying the book (like almost all technical documentation) should be task-based. Let people read the chapter with their problem - don't ask them to read a whole list of solutions to work out which one is applicable. E.g.:

- Title: Preparing Text for Analysis

(explain what tokenizing is and why it's necessary)

- Title: 'Gauging Opinion of a Topic'

(explain what sentiment analysis is)

- Title: 'Identifying Relationships'

(explain logical inference)

Task-based documentation doesn't seek to avoid terminology. It seeks to explain why those concepts exist and how they are useful (which encourages people to learn, as they become aware of the practical application of their knowledge).
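For instance, a 'Preparing Text for Analysis' chapter might open with a concrete task like this (a toy regexp tokenizer for illustration, not NLTK's):

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokenizing isn't hard."))
# ['Tokenizing', 'isn', "'", 't', 'hard', '.']
```

Seeing the task first (turn raw text into tokens you can count and tag) motivates the terminology, rather than the other way around.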

Thanks for your response. Can you give me an example of how you would frame the material in your response using task-based documentation? It seems to me that the NLTK book does exactly what you're describing in most sections, but perhaps if you gave a counterexample I would better understand.

> Can you give me an example of how you would frame the material in your response using task-based documentation?

My response is task based. The titles are tasks you would like to perform.

Compare with the TOC of the NLTK book:

'Accessing Text Corpora and Lexical Resources (extras)'

What is this? What does it help me do?

If it's preparing a document for analysis, then the title should be 'Preparing Text for Analysis'

'3. Processing Raw Text'

'Process' is a meaningless word, like 'System' or 'Data'

4. Writing Structured Programs (extras)

That sounds like a coding guidelines document. What does it have to do with language? Why do I want to do this?

7. Extracting Information from Text


8. Analyzing Sentence Structure (extras)

9. Building Feature Based Grammars

Why do I want to do that?

10. Analyzing the Meaning of Sentences (extras)

Good. Better would be 'Determining the meaning of sentences'

11. Managing Linguistic Data

'Managing' is another meaningless word. Are you going to process a managed data system now? No? Perhaps there's some practical advice here about handling large volumes of text. Cool, then my task, and the title, should be 'Handling Large Volumes of Text'.

When looking into NLTK to support a Twitter classifier, even though I had no NLP experience, the NLTK syntax/interface was sufficiently self-contained.

I was mostly interested in tagging and chunking and found it easy to reason about and use the relevant NLTK features with little auxiliary reading.

The only issues I found were with NLTK itself, e.g., Python version support, loading time, packages.
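The chunking part really is easy to reason about; here's a sketch using NLTK's RegexpParser with hand-tagged tokens (so no tagger models need downloading; the sentence and chunk rule are made up):

```python
import nltk

# A single chunk rule: a noun phrase (NP) is an optional determiner,
# any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

# Hand-tagged sentence, so no tagger model is required.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
tree = chunker.parse(tagged)
print(tree)  # (S (NP the/DT little/JJ dog/NN) barked/VBD)
```

In practice you'd feed the chunker output from nltk.pos_tag instead of hand-tagged tokens, which does require downloading the tagger data first.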

LinkGrammar: http://www.link.cs.cmu.edu/link/ is another NLP toolkit usable from Python; it may suit those who would be intimidated by the need to write their own grammar.

It is integrated into Abiword and maintained by Abiword's authors. There is a talk on this library at PyCon this year: https://us.pycon.org/2012/schedule/presentation/187/

LinkGrammar looks very interesting and useful. But I believe that its scope is somewhat smaller than NLTK's. Aside from sentence parsing, NLTK also gives you a bunch of tools for document tokenization, statistical analysis, and classification, and some easily accessible corpora to try it all out on. I'd almost call it a natural language prototyping environment; it's a huge time-saver for those dabbling in the field or who need to quickly experiment with different techniques.

LinkGrammar is not really a toolkit and it is not written in Python. It is written in C.

Yup. My bad. There are python bindings, though.

If you're interested in natural language processing (NLP), but don't have a linguistics background, I would suggest reading Steven Pinker's The Language Instinct. It will introduce you to the necessary terminology and concepts for NLP in an easy-to-digest way. (The NLTK book has been free online for quite some time as well.)

The Language Instinct is a great book, but unless the content of newer editions has changed significantly, it's more of an overview of linguistics in general, and language acquisition in particular. There's not much -- if any -- practical NLP. For example, looking at Amazon's statistically-improbable phrases (SIPs), I see nothing related to NLP, nor do I see any terms related to practical NLP during a quick glance of the book's index. The index includes references to a few pages on "statistics of language", but I honestly don't remember what those were about.

Also, the Language Instinct was written in the early 1990s, so although new editions have been released, it's a bit dated. It's a classic read for linguists and people interested in language, but I wouldn't recommend it as an introduction to NLP.

Oh no, I didn't intend The Language Instinct as an introduction to NLP, but as a basic introduction to the fundamentals of linguistics. I mainly had in mind terms like morpheme, phoneme, scope, etc. A basic understanding of these concepts will make reading the NLTK book much easier, although it isn't necessary.

Ok, that makes more sense. I read your comment as being a recommendation of The Language Instinct to learn the fundamentals of NLP. But point taken, it's a good book about language.

The book and software are both a great resource. I usually "roll my own" NLP software, but I have used NLTK for small customer text mining tasks. Definitely "batteries included." I bought the book years ago, but now, the online edition may be more current.

Most of NLTK runs fine in PyPy, too, resulting in significant performance enhancements.

Probably up there amongst the most useful Python libraries IMO. Hasn't it been available for free online for a long time now though?

Anyway, in case anyone reading this missed it, the Stanford NLP class taught by Chris Manning and Dan Jurafsky starting next week (Jan 23rd) will allow programming assignments to be submitted using Python and NLTK, which is really good news.

So now's a good time to get familiar with the NLTK, or for a refresher for those of us already acquainted with it.

We're using this library in class (Foundations of Language Technology) at the TU Darmstadt. From my point of view as a student who hasn't done much Python in the past, it's pretty easy to use and works well for learning about NLP, hiding many implementation details and letting me focus on solving fairly complex tasks on the data/algorithm level.

There is a free NLTK cloud API: http://www.mashape.com/apis/Text-Processing

It includes sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition.

I came across Natural Language Processing for the Working Programmer[1] recently. It's released under a creative commons license (CC-BY). It's still a work in progress, but might be interesting anyhow.

[1]: http://nlpwp.org/book/

ipython+nltk+networkx+pytables+numpy+matplotlib=full of win

Seriously - text mining made fun and exploratory with open source tools.

One of the text sources I like to use is the Launchpad tickets for the Ubuntu project, since they get a good amount of traffic from international end users, a professional interest of mine.

It would be great to hear about some other interesting open data sets that people have found.

Oooooh, this has just made my day!

Same here. Something I've been wanting to learn for a while, being fascinated by computational linguistics.

Only the REST API is open source.

I see the book as a way to sell their API, a service whose internals (algorithms and methods) are hidden server-side.

I expected real insight to help me with a project, but I found less information and more noise than what I already knew.

What on earth do you mean?

The entirety of NLTK is FOSS originally produced by a university. Even if NLTK had some sort of hosted version (and I was unaware they do, if that is what your post is claiming), you could always go download the whole source (https://github.com/nltk/nltk) and use it as you see fit (it's even Apache licensed).

