
Learning Python while processing raw text: The NLTK book - ColinWright
http://nltk.org/book/ch03.html
======
donretag
Here is another freely available book, Text Processing in Python:
<http://gnosis.cx/TPiP/>

Plain text files and not tied to a library.

------
ColinWright
I know this is Chapter 3, and hence jumping into the middle of the book, but
lots of people here will have enough knowledge and experience to start from
here and check up on unfamiliar terms. You may find you need to backtrack a
little, but it seems to me this is a good place to start.

~~~
RRRA
That is going to be very useful to me, thanks. I'm doing 2 classes right now,
one where I have to present Python and the other where I have to explain NLTK! :P

------
waterside81
For anyone who's interested in text classification (along the lines of what
Chapter 6 in this book covers), check out our service:
<https://www.repustate.com/predictive-analytics-machine-learning/>

It's machine learning as a service: simple API calls to train, cross-validate
& classify your data. We'll also be at PyCon this week in Santa Clara so come
on by.

Currently in private beta, but we're ramping things up quickly.

~~~
bromang
Do you really see any sort of market for providing operations that can be
implemented very easily using python itself?

~~~
waterside81
The Python examples given in the book are very rudimentary. The service
behind our API is "real" machine learning (e.g. SVMs, RBMs, deep neural
networks & the like). This is all transparent to the user - you just submit
your data via our API.

------
denzil_correa
In terms of Machine Learning for text data, Chapter 6 is highly recommended.

<http://nltk.org/book/ch06.html>

------
zissou
I've learned a lot from the NLTK library, but unfortunately NLTK is terribly
slow. Nevertheless, it is a fantastic place for beginners to start with text
processing, as the documentation is superb. Eventually, though, anyone who
plans to process "big" textual datasets may want to dig into the NLTK source
and rewrite the necessary functions using multiprocessing.
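
For what it's worth, a minimal sketch of the multiprocessing idea, not NLTK's
own code. `count_tokens` is a hypothetical stand-in for whatever slow
per-document step you would actually rewrite, and the regex tokenizer is an
assumption chosen only to keep the example self-contained:

```python
import multiprocessing as mp
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def count_tokens(text):
    """Stand-in for a slow per-document step (hypothetical, not an NLTK function)."""
    return Counter(TOKEN_RE.findall(text.lower()))


def count_tokens_parallel(documents, workers=2):
    """Run count_tokens over documents in worker processes and merge the counts."""
    # Assumption: prefer the "fork" start method where the platform offers it;
    # elsewhere fall back to the default and call this under a __main__ guard.
    methods = mp.get_all_start_methods()
    ctx = mp.get_context("fork") if "fork" in methods else mp.get_context()
    with ctx.Pool(processes=workers) as pool:
        partial_counts = pool.map(count_tokens, documents)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


if __name__ == "__main__":
    docs = ["The cat sat on the mat.", "The dog sat too."]
    print(count_tokens_parallel(docs).most_common(2))
```

Whether this pays off depends on how expensive the per-document step is:
for cheap functions the cost of shipping text to the workers can outweigh
the gain.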

------
danso
Processing text is a great way to learn any programming language, but I would
think there's more interesting and varied practice to be found in web scraping,
not to mention that it's a whole lot easier.

~~~
ColinWright
Forgive me if I'm mistaken, but that comment feels like you've read the title
of this submission, but not actually read the chapter. This isn't just about
chopping and slicing strings, this is an entry point into a comprehensive book
about Natural Language Processing, and its associated techniques as
exemplified in Python.

~~~
danso
The chapter contains HTML processing, but that's a small subset of what it
covers. You don't need to learn word stems to do really interesting things
with structured HTML. Also, web scraping involves more than text processing:
it includes the programmatic navigation of websites, which does add some
complexity but is pretty manageable with the libraries out there.
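
As a rough illustration of the structured-HTML side, here is a minimal
sketch using only the standard library's `html.parser`. The class and
function names (`TextAndLinks`, `extract`) are made up for this example, and
a real scraper would more likely lean on a dedicated library:

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text fragments and link targets from an HTML string."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag that has one.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep non-empty text, trimmed of surrounding whitespace.
        if data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain_text, links) for the given HTML fragment."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

The links list is what you would follow to navigate from page to page; the
text is what you would hand to any later processing.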

Edit: I'm obviously not saying NLP isn't useful, just that web scraping is
more _immediately_ useful. With NLP, besides learning the concepts, you have
to find a source of raw text that's been unprocessed and yet contains
something of real world value. With web text, you just have to collect what
someone already thought was valuable to publish and find insights through the
aggregation. It seems to me that the latter scenario is easier to grasp, with
NLP being useful for going beyond what others have gathered and published.

~~~
ColinWright
Thing is, this is a book about NLP, not a book about web scraping, so while
what you say may be true (although personally I find more value learning about
NLP than WS) it seems a little misplaced.

But there is value in both, depending on your objectives. I find web scraping
trivial, and mining the text hard, hence my interest in NLP and machine
learning.

------
seanlinehan
I actually read this book a few weeks ago. I'm pretty new to Python as a whole
(4 months with the language), so I picked up a few small tricks that have saved
me quite a bit of time. It does not assume that you know either Python or
linguistics very well, so I was certainly pleased that I was able to have my
hand held through some of that. I recommend it!

------
nailer
This is an excellent resource, which I own, but it seems to be focused on
language scientists who are unfamiliar with Python rather than on developers
who need to process text, who I suspect are a large portion of the audience.

~~~
agibsonccc
What simpler use cases do you see? In my case, I'm the language guy this is
targeted at and have no clue what most web devs would want this stuff for.
Spam and text classification of some kind maybe? MAYBE certain kinds of named
entity recognition?

~~~
sdoering
Well, as a news distributor you could try to build a tagging machine:
something that takes texts from a sports news agency, for example, and
enriches them with meaningful tags/keywords, data from your statistics section
(and so on), so that you can later match other related content, images, or
anything like that. It is something you could transmit with the original texts
to make life easier for your customers by letting them sort and manage these
texts automatically inside their content management systems.
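
A very crude sketch of such a tagger, just to make the idea concrete. It
assumes plain frequency counting with a tiny made-up stopword list; the
function names are hypothetical, and a real system would use proper NLP
(stemming, named entity recognition and so on):

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = frozenset(
    "a an and are as at but by for from in is it of on the to was with".split()
)


def suggest_tags(text, max_tags=3):
    """Return the most frequent non-stopword terms in text as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]


def tag_overlap(tags_a, tags_b):
    """Jaccard overlap of two tag lists, a crude way to match related content."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Texts whose tag lists overlap strongly would be candidates for "related
content"; the tags themselves are what you would ship alongside the article.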

[edit] This is coming from a text guy who recently started down the path of
Python and is hooked ;-)

------
forgotAgain
Cover and TOC

<http://nltk.org/book/>

------
tootie
I read it and their code examples are not great. Too many abbreviated and
meaningless variable names.

------
dunham
The "pattern" library is also worth checking out:

<http://www.clips.ua.ac.be/pages/pattern>

