Natural Language Processing with Python (nltk.org)
101 points by danso on Sept 29, 2011 | 15 comments



FYI, there's an offshoot page that lists projects (ongoing and suggested) that can be undertaken with the Natural Language Toolkit: http://ourproject.org/moin/projects/nltk/ProjectIdeas


Here is a good blog about NLTK: http://streamhacker.com/

The blogger is also the author of the book "Python Text Processing with NLTK 2.0 Cookbook"


Great book! Don't want to spam, but I made a project, www.whatrapperareyou.com, by programming along the lines of Chapter 6 on Naive Bayes classifiers.

Chapter 6: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
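
For anyone curious what that chapter's approach looks like, here's a minimal sketch using NLTK's NaiveBayesClassifier; the bag-of-words feature extractor and the tiny labelled training set below are made up purely for illustration, not taken from the site.

    import nltk

    # Hypothetical (text, label) training pairs standing in for labelled lyrics.
    train = [
        ("cash money in the bank", "rapper_a"),
        ("love and heartbreak on repeat", "rapper_b"),
        ("stacks and chains and fame", "rapper_a"),
        ("tears on my pillow tonight", "rapper_b"),
    ]

    def word_features(text):
        # Simple bag-of-words features, along the lines of the book's
        # document_features() example.
        return {word: True for word in text.lower().split()}

    train_set = [(word_features(text), label) for text, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(word_features("money and fame")))
    classifier.show_most_informative_features(5)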


I usually use my own NLP code that I have written over 12+ years in Lisp, Java, and Ruby. That said, I have used NLTK on a few projects (some personal, some for a data mining customer) and the "everything included" (including useful data sources) aspect of NLTK is a real time saver. I recommend it, especially so if you mostly work in Python.


NLTK is great for _learning_ NLP, but Python is much too slow for scalable deep NLP (by which I mean tagging and parsing, as opposed to TF-IDF etc). Also parallelization can become a problem because of the GIL. It's a real shame they chose Python actually, because otherwise it's a superbly structured, documented, and maintained project.


Hmm, I think Python was an excellent choice; what other platform would you suggest? IMO being "superbly structured, documented and maintained" is not a magical property acquired by luck, but rather connected to the platform of choice.

Btw for performance, whenever pure Python is indeed "much too slow" (profile?), there's the option of C extension modules. The NumPy or SciPy libraries are good examples: used in hardcore numerical computing aka the epitome of I-NEED-IT-TO-RUN-FAST!, but still Python.

And not to nitpick ;) but GIL only affects multi-threading; other modes of "parallelization" are reasonably straightforward and some even built-in (import multiprocessing).
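
As a rough sketch of that built-in option (and assuming the relevant tokenizer and tagger data have already been fetched with nltk.download()), you can POS-tag documents in parallel with a multiprocessing.Pool, since each worker process gets its own interpreter and its own GIL; the document list here is invented for illustration.

    import multiprocessing

    import nltk

    def tag_document(text):
        # Each worker process tokenizes and POS-tags one document independently.
        return nltk.pos_tag(nltk.word_tokenize(text))

    if __name__ == "__main__":
        documents = [
            "NLTK ships with a lot of useful data.",
            "The GIL only constrains threads, not processes.",
        ]
        with multiprocessing.Pool() as pool:
            tagged = pool.map(tag_document, documents)
        for sentence in tagged:
            print(sentence)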


Yes, that is how you make fast libraries in Python. But NLTK isn't written using C extension modules; all of its NLP is done in pure Python. You could rewrite what needs to be fast with C extensions, but then what's the point of using NLTK in the first place?

NLTK was never intended to be a way to do production-grade natural language processing. Its primary objective has been to teach users natural language processing with clear, well-commented code and documentation. If that isn't your situation, please use something else.


What's the point? That half of your code base has already been written for you. Rewriting the performance-critical parts is a lot of work, and not having to rewrite a corpus reader, tree transformations, or an evaluation procedure is an advantage, aside from NLTK being an excellent prototyping platform. With Cython you can seamlessly combine Python code such as NLTK's with your own optimized code. This was indeed never the intention of NLTK, but I have found the general approach of combining arbitrary Python code with optimized Cython code to work very well. The end result is a much less bloated code base than something like Java or C++.


OpenNLP and Stanford NLP are both Java libraries that might offer better performance.


I think you should check out PyPy. It has a JIT which significantly improves performance for many use cases.


There are some timing comparisons of PyPy vs. CPython with NLTK that show improvements:

http://groups.google.com/group/nltk-dev/browse_thread/thread...
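
For anyone who wants to reproduce that kind of comparison, a minimal benchmarking sketch is below: save it to a file and run it under both interpreters (e.g. python bench.py and pypy bench.py), then compare the wall-clock times. It assumes NLTK and its punkt tokenizer data are installed for both interpreters; the corpus text and repetition count are placeholders.

    import time

    import nltk

    def benchmark(repetitions=1000):
        # Tokenize and count the same text many times; this code path is pure
        # Python inside NLTK, so PyPy's JIT has a chance to help.
        text = "Natural language processing with Python is fun. " * 20
        start = time.time()
        for _ in range(repetitions):
            tokens = nltk.word_tokenize(text)
            nltk.FreqDist(tokens)
        return time.time() - start

    if __name__ == "__main__":
        print("%.2f seconds" % benchmark())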


It's an awesome book and project. I found out about it in Mining the Social Web (another fantastic book).


I'm glad to see this NLP book available online for free. Some great knowledge in there.


Does anyone know if there is a distributed framework to run NLTK?


Good book for an intro to NLP. NLTK is a cool library, but when is it going to become Python 3 compatible?



