Natural Language Processing with Python (nltk.org)
101 points by danso on Sept 29, 2011 | 15 comments



FYI, there's an offshoot page that lists projects (ongoing and suggested) that can be undertaken with the Natural Language Toolkit: http://ourproject.org/moin/projects/nltk/ProjectIdeas


Here is a good blog about NLTK: http://streamhacker.com/

The blogger is also the author of the book "Python Text Processing with NLTK 2.0 Cookbook"


Great book! Don't want to spam, but I made a project, www.whatrapperareyou.com, by programming along the lines of Chapter 6 on Naive Bayes classifiers.

Chapter 6: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
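
For anyone curious what that chapter's approach looks like, here's a minimal sketch using NLTK's NaiveBayesClassifier; the bag-of-words feature extractor and the tiny labelled training set below are made up purely for illustration, not taken from the site.

    import nltk

    # Hypothetical (text, label) training pairs standing in for labelled lyrics.
    train = [
        ("cash money in the bank", "rapper_a"),
        ("love and heartbreak on repeat", "rapper_b"),
        ("stacks and chains and fame", "rapper_a"),
        ("tears on my pillow tonight", "rapper_b"),
    ]

    def word_features(text):
        # Simple bag-of-words features, along the lines of the book's
        # document_features() example.
        return {word: True for word in text.lower().split()}

    train_set = [(word_features(text), label) for text, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(word_features("money and fame")))
    classifier.show_most_informative_features(5)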


I usually use my own NLP code that I have written over 12+ years in Lisp, Java, and Ruby. That said, I have used NLTK on a few projects (some personal, some for a data mining customer) and the "everything included" (including useful data sources) aspect of NLTK is a real time saver. I recommend it, especially so if you mostly work in Python.


NLTK is great for _learning_ NLP, but Python is much too slow for scalable deep NLP (by which I mean tagging and parsing, as opposed to TF-IDF etc). Also parallelization can become a problem because of the GIL. It's a real shame they chose Python actually, because otherwise it's a superbly structured, documented, and maintained project.


Hmm, I think Python was an excellent choice; what other platform would you suggest? IMO being "superbly structured, documented and maintained" is not a magical property acquired by luck, but rather connected to the platform of choice.

Btw for performance, whenever pure Python is indeed "much too slow" (profile?), there's the option of C extension modules. The NumPy or SciPy libraries are good examples: used in hardcore numerical computing aka the epitome of I-NEED-IT-TO-RUN-FAST!, but still Python.

And not to nitpick ;) but GIL only affects multi-threading; other modes of "parallelization" are reasonably straightforward and some even built-in (import multiprocessing).
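
As a rough sketch of that built-in option (and assuming the relevant tokenizer and tagger data have already been fetched with nltk.download()), you can POS-tag documents in parallel with a multiprocessing.Pool, since each worker process gets its own interpreter and its own GIL; the document list here is invented for illustration.

    import multiprocessing

    import nltk

    def tag_document(text):
        # Each worker process tokenizes and POS-tags one document independently.
        return nltk.pos_tag(nltk.word_tokenize(text))

    if __name__ == "__main__":
        documents = [
            "NLTK ships with a lot of useful data.",
            "The GIL only constrains threads, not processes.",
        ]
        with multiprocessing.Pool() as pool:
            tagged = pool.map(tag_document, documents)
        for sentence in tagged:
            print(sentence)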


Yes, that is how you make fast libraries in Python. But NLTK isn't written using C extension modules; all of its NLP is done in pure Python. You could rewrite what needs to be fast with C extensions, but then what's the point of using NLTK in the first place?

NLTK was never intended to be a way to do production-grade natural language processing. Its primary objective has been to teach users natural language processing with clear, well-commented code and documentation. If that isn't your situation, please use something else.


What's the point? That half of your code base has already been written for you. Rewriting the performance-critical parts is a lot of work, and not having to rewrite a corpus reader, tree transformations, or an evaluation procedure is an advantage, aside from NLTK being an excellent prototyping platform. With Cython you can seamlessly combine Python code such as NLTK's with your own optimized code. This was indeed never the intention of NLTK, but I have found the general approach of combining arbitrary Python code with optimized Cython code to work very well. The end result is a much less bloated code base than something like Java or C++.


OpenNLP and Stanford NLP are both Java libraries that might offer better performance.


I think you should check out PyPy. It has a JIT which significantly improves performance for many use cases.


There are some timing comparisons of PyPy vs. CPython with NLTK that show improvements:

http://groups.google.com/group/nltk-dev/browse_thread/thread...
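
For anyone who wants to reproduce that kind of comparison, a minimal benchmarking sketch is below: save it to a file and run it under both interpreters (e.g. python bench.py and pypy bench.py), then compare the wall-clock times. It assumes NLTK and its punkt tokenizer data are installed for both interpreters; the corpus text and repetition count are placeholders.

    import time

    import nltk

    def benchmark(repetitions=1000):
        # Tokenize and count the same text many times; this code path is pure
        # Python inside NLTK, so PyPy's JIT has a chance to help.
        text = "Natural language processing with Python is fun. " * 20
        start = time.time()
        for _ in range(repetitions):
            tokens = nltk.word_tokenize(text)
            nltk.FreqDist(tokens)
        return time.time() - start

    if __name__ == "__main__":
        print("%.2f seconds" % benchmark())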


It's an awesome book and project. I found out about it in Mining the Social Web (another fantastic book).


I'm glad to see this NLP book available online for free. Some great knowledge in there.


Does anyone know if there is a distributed framework to run NLTK?


Good book for an intro to NLP. NLTK is a cool library, but when is it going to become Python 3 compatible?



