I own the paper version of this - just a heads up, it's written primarily for language scientists rather than developers. Expect to pause frequently to look up linguistics concepts, their meanings, and their practical uses on linguistics websites as you work through the book.

It's a great resource, but don't expect to get started quickly.




I'm not sure that any book on the NLTK, even one written specifically for developers, could avoid the concepts you're referring to. These terms and concepts are inherent to the study of natural language processing. So while I agree that there are a significant number of NLP-related terms used, I think that has less to do with the book's audience and more to do with the book's subject matter. And the use of computational linguistic terminology is not gratuitous -- it's important to learn the jargon so you can look for the right type of help when you don't know how to do something.

On the flip side, the Python in the book is fairly straightforward, so developers should be able to dive right in with the code. In fact, experienced developers will probably find themselves skipping over many sections that are essentially lessons on Python.

Overall, this book is a good introduction to NLP and NLTK. That said, if you're at all familiar with NLP, you'd probably do just as well to dive right in and start working with the toolkit. NLTK is very well-documented and easy to work with. It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifiers, etc. It also includes a number of algorithms that commonly fall under ML, data mining, and IR. If you work with Python and do any kind of work with unstructured text, you need to check NLTK out.
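For a flavor of that completeness, here's a minimal sketch of a tokenize/tag/stem pipeline (assuming the required NLTK data models have already been fetched with nltk.download(); the example sentence is made up):

    import nltk

    # Tokenize, POS-tag, and stem a sentence with NLTK's built-in tools.
    text = "NLTK bundles tokenizers, taggers, and stemmers in one package."
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # e.g. [('NLTK', 'NNP'), ...]

    stemmer = nltk.PorterStemmer()
    stems = [stemmer.stem(t.lower()) for t in tokens]

    print(tagged)
    print(stems)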


> NLTK is very well-documented and easy to work with.

Agreed.

> It's also amazingly complete -- tokenizers, stemmers, POS taggers, classifiers, etc.

Don't agree. The biggest missing piece is a statistical parser, which forms the basis for a lot of further linguistic analysis. It is hard to beat the Stanford Parser for that. Check out https://github.com/wavii/pfp, which has Python bindings.

For most of the ML stuff, you would be better off going directly to a specialist library like scikits.learn. It is faster, and the implementations are more accurate. (I found some of the implementations in NLTK not quite correct -- for example, the Naive Bayes classifier, which a lot of first-time users reach for. The difference in results may not be much in practice, but it is still incorrect.)
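For reference, the specialist-library route looks roughly like this. A minimal sketch with toy data, using scikit-learn's current module names (the project was still called scikits.learn at the time):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy corpus with binary sentiment labels.
    docs = ["great movie", "terrible plot", "great acting", "terrible movie"]
    labels = [1, 0, 1, 0]

    # Bag-of-words counts feeding a multinomial Naive Bayes classifier.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    clf = MultinomialNB()
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["great plot"])))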

It is definitely a very good place to start but better alternatives exist for many of the pieces.


> Don't agree. The biggest missing piece is a statistical parser which forms the basis for a lot of further linguistic analysis.

Conceded and agreed. This is the one major gap. But I still maintain it's a remarkably complete toolkit. Plus you get to work in Python, which is a big advantage for me.

What's wrong with the Naive Bayes classifier? Did you submit a patch?

Likewise, I totally agree with you that there are faster/more accurate/more efficient implementations of many of the tools in NLTK. If performance is a must, you're better off prototyping in NLTK and then switching to a specialized library. But in terms of completeness and ease of use, NLTK is very strong.

EDIT: I'm not sure why abhaga is being downvoted. There was nothing disrespectful in his response to me. Disagreement is an important part of intelligent discussion. Upvoting to counter the downvote(s).


> What's wrong with the Naive Bayes classifier?

The problem I found is that it mixes up the multi-variate Bernoulli and the multinomial event models for Naive Bayes (see http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pd... for reference). It computes the probabilities as per the Bernoulli event model but doesn't include the probabilities of absent events. This was my understanding from reading the source code.
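To illustrate the distinction with toy numbers (this is not NLTK's actual code): under the multi-variate Bernoulli model, every vocabulary word contributes to the document likelihood, with absent words contributing a (1 - p) factor. Dropping those absent-word factors yields a different model:

    import math

    vocab = ["good", "bad", "boring"]
    # Hypothetical P(word present | class) for some class.
    p = {"good": 0.7, "bad": 0.2, "boring": 0.1}
    doc = {"good"}  # words present in the document

    # Bernoulli likelihood: absent words contribute (1 - p) factors.
    log_bernoulli = sum(
        math.log(p[w]) if w in doc else math.log(1.0 - p[w]) for w in vocab
    )

    # Dropping the absent-word factors -- the kind of mix-up described
    # above -- computes a different quantity.
    log_truncated = sum(math.log(p[w]) for w in doc)

    print(log_bernoulli, log_truncated)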

> Plus you get to work in Python, which is a big advantage for me.

Indeed. I really wish someone would build a dependency parser on top of pfp so that I can ditch the Stanford parser. I have used https://github.com/dasmith/stanford-corenlp-python for interfacing with the Stanford toolkit, but it is somewhat brittle.


No SVM support either. I could try to add it, I guess; libSVM has Python bindings already.
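For anyone curious, the bundled libSVM bindings (svmutil) are straightforward to drive. A minimal sketch with made-up labels and sparse features:

    from svmutil import svm_train, svm_predict

    # Toy training data: labels plus sparse feature dicts (index -> value).
    labels = [1, -1, 1, -1]
    features = [{1: 1.0, 2: 0.5}, {1: -1.0}, {1: 0.8, 2: 0.7}, {2: -0.9}]

    model = svm_train(labels, features, "-t 0 -c 1")  # linear kernel
    p_labels, p_acc, p_vals = svm_predict(labels, features, model)
    print(p_labels)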


As far as I know, NLTK has no C dependencies other than its general dependency on NumPy. I think they are keeping the toolkit in pure Python on purpose (but I may be wrong about that). That said, there are SVM implementations in pure Python -- PyMVPA, for one.


> I'm not sure that any book on the NLTK could avoid the concepts you're referring to.

Neither am I. I'm saying the book (like almost all technical documentation) should be task-based. Let people read the chapter that addresses their problem - don't ask them to read a whole list of solutions to work out which one is applicable. E.g.:

- Title: 'Preparing text for analysis'

(explain what tokenizing is and why it's necessary)

- Title: 'Gauging opinion of a topic'

(explain what sentiment analysis is)

- Title: 'Identifying relationships'

(explain logical inference)

Task-based documentation doesn't seek to avoid terminology. It seeks to explain why those concepts exist and how they are useful (which encourages people to learn, as they become aware of the practical application of their knowledge).


Thanks for your response. Can you give me an example of how you would frame the material in your response using task-based documentation? It seems to me that the NLTK book does exactly what you're describing in most sections, but perhaps if you gave a counterexample I would better understand.


> Can you give me an example of how you would frame the material in your response using task-based documentation?

My response is task-based. The titles are tasks you would like to perform.

Compare with the TOC of the NLTK book:

'Accessing Text Corpora and Lexical Resources (extras)'

What is this? What does it help me do?

If it helps me prepare a document for analysis, then the title should be 'Preparing text for analysis'.

'3. Processing Raw Text'

'Process' is a meaningless word, like 'System' or 'Data'.

'4. Writing Structured Programs (extras)'

That sounds like a coding guidelines document. What does it have to do with language? Why do I want to do this?

'7. Extracting Information from Text'

Better.

'8. Analyzing Sentence Structure (extras)'

'9. Building Feature Based Grammars'

Why do I want to do that?

'10. Analyzing the Meaning of Sentences (extras)'

Good. Better would be 'Determining the meaning of sentences'.

'11. Managing Linguistic Data'

'Managing' is another meaningless word. Are you going to process a managed data system now? No? Perhaps there's some practical advice here about handling large volumes of text. Cool - then my task, and the title, should be 'Handling large volumes of text'.


When I looked into NLTK to support a Twitter classifier, I found its syntax/interface sufficiently self-contained even though I had no NLP experience.

I was mostly interested in tagging and chunking and found it easy to reason about and use the relevant NLTK features with little auxiliary reading.
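A minimal sketch of the kind of tag-and-chunk pipeline I mean (assuming the relevant NLTK data models are already downloaded; the tweet and the chunk grammar are made up):

    import nltk

    tweet = "The new phone camera takes amazing low light photos"
    tagged = nltk.pos_tag(nltk.word_tokenize(tweet))

    # Chunk simple noun phrases with a regexp grammar.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    print(chunker.parse(tagged))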

The only issues I found were with NLTK itself, e.g., Python version support, loading time, and packaging.



