

Natural Language Processing with Python, free online - motter
http://www.nltk.org/book?=

======
nailer
I own the paper version of this - just a heads up, it's written primarily for
language scientists rather than developers. You may find yourself constantly
pausing to look up concepts and their meanings and practical uses on
linguistics websites when returning to the book.

It's a great resource, but don't expect to get started quickly.

~~~
jnbiche
I'm not sure that any book on the NLTK, even one written specifically for
developers, could avoid the concepts you're referring to. These terms and
concepts are inherent to the study of natural language processing. So while I
agree that there are a significant number of NLP-related terms used, I think
that has less to do with the book's audience and more to do with the book's
subject matter. And the use of computational linguistic terminology is not
gratuitous -- it's important to learn the jargon so you can look for the right
type of help when you don't know how to do something.

On the flip side, the Python in the book is fairly straightforward, so
developers should be able to dive right in with the code. In fact, experienced
developers will probably find themselves skipping over many sections that are
essentially lessons on Python.

Overall, this book is a good introduction to NLP and NLTK. That said, if
you're at all familiar with NLP, you'd probably do just as well to dive right
in and start working with the toolkit. NLTK is very well-documented and easy
to work with. It's also amazingly complete -- tokenizers, stemmers, POS
taggers, classifiers, etc. It also includes a number of algorithms
that commonly fall under ML, data mining, and IR. If you work with Python and
do any kind of work with unstructured text, you need to check NLTK out.
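As a toy illustration of the kinds of components the comment above lists (this is a hand-rolled sketch, not NLTK's actual tokenizers or stemmers, which are far more robust):

```python
import re

def tokenize(text):
    # Crude word tokenizer; NLTK ships much more careful ones.
    return re.findall(r"[A-Za-z']+", text.lower())

def crude_stem(token):
    # Toy suffix stripping; NLTK's PorterStemmer implements the real algorithm.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The taggers tagged the tokenized sentences.")
stems = [crude_stem(t) for t in tokens]
print(stems)
# → ['the', 'tagger', 'tagg', 'the', 'tokeniz', 'sentence']
```

The point of the library is exactly that you don't write these by hand: it gives you tested implementations of each stage out of the box.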

~~~
abhaga
> NLTK is very well-documented and easy to work with.

Agreed.

> It's also amazingly complete -- tokenizers, stemmers, POS taggers,
> classifiers, etc.

Don't agree. The biggest missing piece is a statistical parser which forms the
basis for a lot of further linguistic analysis. It is hard to beat Stanford
Parser for that. Check out <https://github.com/wavii/pfp> which has Python
bindings.

For most of the ML stuff, you would be better off going to a specialist
library like scikits.learn directly. Those implementations are faster and
more accurate. (I found some of the implementations in NLTK not quite
correct. For example, the Naive Bayes classifier, which a lot of first-time
users use. The difference in results may not be much in practice, but it is
still incorrect.)

It is definitely a very good place to start but better alternatives exist for
many of the pieces.

~~~
jnbiche
> Don't agree. The biggest missing piece is a statistical parser which forms
> the basis for a lot of further linguistic analysis.

Conceded and agreed. This is the one major gap. But I still maintain it's a
remarkably complete toolkit. Plus you get to work in Python, which is a big
advantage for me.

What's wrong with the Naive Bayes classifier? Did you submit a patch?

Likewise, I totally agree with you that there are faster, more accurate, and
more efficient implementations of many of the tools in NLTK. If performance
is a must, then you're better off prototyping in NLTK and then moving to a
specialized library. But in terms of completeness and ease of use, NLTK is
very strong.

EDIT: I'm not sure why abhaga is being downvoted. There was nothing
disrespectful in his response to me. Disagreement is an important part of
intelligent discussion. Upvoting to counter the downvote(s).

~~~
abhaga
> What's wrong with the Naive Bayes classifier?

The problem I found is that it mixes up the binomial and the multinomial
event models for Naive Bayes (see
<http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf> for
reference). It computes the probabilities as per the binomial event model but
doesn't include the probabilities of missing events. This was my
understanding from reading the source code.
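To illustrate the distinction the comment above draws (with made-up probabilities, not NLTK's code): under the Bernoulli ("binomial") event model every vocabulary word contributes a factor, including a (1 - p) factor for each absent word, while the multinomial model multiplies per-token probabilities, so repeated words count repeatedly. Dropping the absent-word factors gives neither model.

```python
import math

# Hypothetical per-class word probabilities, P(word | class), for illustration.
vocab = ["good", "bad", "film"]
p_word = {"good": 0.6, "bad": 0.1, "film": 0.5}

doc = ["good", "film", "film"]  # observed tokens

# Multinomial event model: one draw per token position,
# so the repeated "film" contributes twice.
log_multinomial = sum(math.log(p_word[w]) for w in doc)

# Bernoulli event model: each vocabulary word either occurs or not,
# so the ABSENT word "bad" must contribute a (1 - p) factor.
present = set(doc)
log_bernoulli = sum(
    math.log(p_word[w]) if w in present else math.log(1 - p_word[w])
    for w in vocab
)

print(log_multinomial, log_bernoulli)
```

Computing present-word probabilities Bernoulli-style but omitting the (1 - p) terms for missing words is the inconsistency being described.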

> Plus you get to work in Python, which is a big advantage for me.

Indeed. I so wish someone would build a dependency parser on top of pfp so
that I could ditch the Stanford Parser. I have used
<https://github.com/dasmith/stanford-corenlp-python> for interfacing with the
Stanford toolkit, but it is somewhat brittle.

------
scorpion032
LinkGrammar: <http://www.link.cs.cmu.edu/link/> is the other NLP toolkit in
Python, and may be suited to those who would be intimidated by the need to
write their own grammar.

It is integrated into AbiWord and maintained by its authors. There is a talk
on this library at PyCon this year:
<https://us.pycon.org/2012/schedule/presentation/187/>

~~~
daviddaviddavid
LinkGrammar is not really a toolkit and it is not written in Python. It is
written in C.

~~~
scorpion032
Yup. My bad. There are python bindings, though.

------
ethanpoole
If you're interested in natural language processing (NLP), but don't have a
linguistics background, I would suggest reading Steven Pinker's The Language
Instinct. It will introduce you to the necessary terminology and concepts for
NLP in an easy-to-digest way. (The NLTK book has been free online for quite
some time as well.)

~~~
jnbiche
The Language Instinct is a great book, but unless the content of newer
editions has changed significantly, it's more of an overview of linguistics
in general, and language acquisition in particular. There's not much -- if
any -- practical NLP. For example, looking at Amazon's
statistically improbable phrases (SIPs) for the book, I see nothing related
to NLP, nor do I see any practical NLP terms in a quick glance at the book's
index. The index includes references to a few pages on "statistics of
language", but I honestly don't remember what those were about.

Also, The Language Instinct was written in the early 1990s, so although new
editions have been released, it's a bit dated. It's a classic read for
linguists and people interested in language, but I wouldn't recommend it as an
introduction to NLP.

~~~
ethanpoole
Oh no, I didn't intend The Language Instinct as an introduction to NLP, but
as a basic introduction to the fundamentals of linguistics. I mainly had in
mind terms like morpheme, phoneme, scope, etc. A basic understanding of these
concepts will make reading the NLTK book much easier, although it isn't
strictly necessary.

~~~
jnbiche
Ok, that makes more sense. I read your comment as being a recommendation of
The Language Instinct to learn the fundamentals of NLP. But point taken, it's
a good book about language.

------
mark_l_watson
The book and software are both a great resource. I usually "roll my own" NLP
software, but I have used NLTK for small customer text-mining tasks.
Definitely "batteries included." I bought the book years ago, but by now the
online edition may be more current.

------
jnbiche
Most of NLTK runs fine in PyPy, too, resulting in significant performance
improvements.

------
NnamdiJr
Probably up there amongst the most useful Python libraries IMO. Hasn't it been
available for free online for a long time now though?

Anyway, in case anyone reading this missed it, the Stanford NLP class taught
by Chris Manning and Dan Jurafsky starting next week (Jan 23rd) will allow
programming assignments to be submitted using Python and NLTK, which is really
good news.

So now's a good time to get familiar with the NLTK, or for a refresher for
those of us already acquainted with it.

------
herTTTz
Some opinions <http://news.ycombinator.com/item?id=3052540>

------
olex
We're using this library in class (Foundations of Language Technology) at TU
Darmstadt. From my point of view as a student who hasn't done much Python in
the past, it's pretty easy to use and works well for learning about NLP: it
hides many implementation details and lets me focus on solving fairly complex
tasks at the data and algorithm levels.

------
sinzone
There is a free NLTK cloud API: <http://www.mashape.com/apis/Text-Processing>

It includes sentiment analysis, stemming and lemmatization, part-of-speech
tagging and chunking, phrase extraction and named entity recognition.

------
tikhonj
I came across _Natural Language Processing for the Working Programmer_ [1]
recently. It's released under a creative commons license (CC-BY). It's still a
work in progress, but might be interesting anyhow.

[1]: <http://nlpwp.org/book/>

------
capttwinky
ipython+nltk+networkx+pytables+numpy+matplotlib=full of win

Seriously - text mining made fun and exploratory with open source tools.
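A minimal sketch of the kind of exploratory counting this stack invites, using made-up ticket titles in place of a real corpus:

```python
from collections import Counter
import re

# Hypothetical bug-ticket titles, standing in for a real data set
# such as the Launchpad tickets mentioned below.
tickets = [
    "Unicode error when saving file",
    "Crash on startup with unicode locale",
    "File save dialog crash",
]

# Lowercase, tokenize, and tally word frequencies across all titles.
tokens = [t for title in tickets for t in re.findall(r"[a-z]+", title.lower())]
freq = Counter(tokens)
print(freq.most_common(3))
```

From a frequency table like this it's a short step to plotting distributions with matplotlib or building co-occurrence graphs with networkx.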

One of the text sources I like to use is the Launchpad tickets for the
Ubuntu project, since they get a good amount of traffic from international
end users, which is a professional interest of mine.

It would be great to hear about some other interesting open data sets that
people have found.

------
drstrangevibes
oooooh , this has just made my day!

~~~
ObnoxiousJul
Only the REST API is open source.

I see the book as a way to sell their API, a service for which everything
(algorithms and methods) is obfuscated (server-side).

I expected more insight to help me with a project; instead I got less
information and more noise than what I already knew.

~~~
knowtheory
What on earth do you mean?

The entirety of NLTK is FOSS, originally produced by a university. Even if
NLTK had some sort of hosted version (and I wasn't aware that it does, as
your post seems to claim), you could always go download the whole source
(<https://github.com/nltk/nltk>) and use it as you see fit (it's even
Apache-licensed).

