
Natural Language Processing for the Working Programmer - ColinWright
http://nlpwp.org/book/
======
nubela
I was working with NLP and its various toolkits (python-nltk, I'm looking at
you). The thing about NLP is, there just aren't enough libraries (for humans)
that let you simply plug NLP into use. Even nltk, the premier Python library
for NLP, seems to be a core library for building NLP solutions, rather than
for building NLP-powered apps. It also seems extremely unpythonic.

Is there a missing link there? I don't know. So I took a few days, built an
extremely simple NER (named entity recognition) engine, and made it easy for
any programmer to start using.

See <http://blog.nerily.com/howto-train-your-own-modelset-for-your-custom>.
We'll see NLP become more accessible as better tools arrive over time.

~~~
bane
yeah, for the vast majority of uses, most people _really_ just want to do a
fairly small set of things fairly well.

NER comes to mind: there are lots and lots of toolkits for building up to NER,
but very few that let me submit English text and get back a list of people,
places and things (something like the sketch below) _without_ having to
virtually build my own NER system from scratch anyway.

Give me NER, entity relationships (ER) and a couple of kinds of sentiment
analysis scoring (SA) (which can be jump-started with decent NER) and I've
pretty much exhausted 95% of what I'd ever want to do.

I really, really _really_ don't need yet another library to do sentence
tokenization, term tokenization, tf counting and stemming. If I were building
a free-text indexer or a Bayesian filter or some such, those might be useful,
but I'm probably not, and there are far better solutions in those domains than
I'm likely to come up with. There _aren't_ for NER, ER and SA.

~~~
saryant
I've used the Stanford NLP library extensively for NER; I made heavy use of it
in my senior thesis project.

It's pretty straightforward to use their library to read a document and output
an XML file containing NER data (and lots of other fun stuff).
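
Concretely, the whole pipeline is a single java invocation; a rough sketch of
driving it from Python (the jar path is an assumption, adjust to wherever the
CoreNLP jars live):

    import subprocess

    # Run the CoreNLP pipeline over a document; it writes input.txt.xml
    # (tokens, POS, lemmas, NER, parses, coreference) next to the input.
    subprocess.check_call([
        "java", "-Xmx2g",
        "-cp", "stanford-corenlp/*",  # the JVM expands wildcard classpaths
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref",
        "-file", "input.txt",
        "-outputFormat", "xml",
    ])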

For instance, from the sentence:

> World War II, or the Second World War (often abbreviated as WWII or WW2),
> was a global military conflict lasting from 1939 to 1945, which involved
> most of the world's nations, including all of the great powers, eventually
> forming two opposing military alliances, the Allies and Axis.

Stanford NLP NER will output the following entities:

"World War II" - MISC "Second World War" - MISC "1939 to 1945" - DATE -
NORMALIZED 1939/1945 "Axis" - MISC

You can view the output of Stanford's CoreNLP library (NER + dependency
grammar + coreference resolution + some other stuff) for the Wikipedia article
on World War II in my github repo:

<https://raw.github.com/ryantanner/thesis/master/data/ww2sample.txt.xml>

edit: I should add that the real fun (for me) came from combining NER with
dependency grammars and coreference resolution. That combination makes it easy
to turn Stanford NLP's output into a knowledge graph spanning a large number
of documents.
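
To give a flavour of that step, here's a rough sketch that pulls
entity-to-entity dependency edges out of the XML above (element and attribute
names follow CoreNLP's XML output; coreference merging is left out, and it
naively walks every dependency variant):

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # Collect (entity, relation, entity) edges: keep dependency arcs
    # whose governor and dependent were both tagged by the NER stage.
    doc = ET.parse("ww2sample.txt.xml")
    graph = defaultdict(list)
    for sentence in doc.iter("sentence"):
        ner = {tok.get("id"): (tok.findtext("word"), tok.findtext("NER"))
               for tok in sentence.iter("token")}
        for dep in sentence.iter("dep"):
            gov = ner.get(dep.find("governor").get("idx"))
            dpt = ner.get(dep.find("dependent").get("idx"))
            if gov and dpt and gov[1] != "O" and dpt[1] != "O":
                graph[gov[0]].append((dep.get("type"), dpt[0]))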

~~~
agibsonccc
For those who want to play around with dynamic output:
<http://nlp.stanford.edu:8080/parser/>

This is a bit more human-friendly.

~~~
ninjin
Here is the whole Stanford CoreNLP suite with visualisation output:
<http://nlp.stanford.edu:8080/corenlp/> It helps me greatly when it comes to
interpreting the dependency structure.

If you need an example sentence: "Stanford University is located in
California. It is a great university."

I also know that Microsoft Research has a demo of their NLP tools online:
<http://msrsplatdemo.cloudapp.net/> (Silverlight required). I don't think you
can download the tools, though they do offer to provide you with an API token
to call their service in their cloud.

Potential conflict of interest: I wrote parts of the CoreNLP visualiser.

 _Edit:_ Added example sentence.

------
danieldk
One of the authors here: we wrote this during the Pragmatic Programmer's
writing month in 2010 and some more in 2011. Then I got caught up writing my
PhD thesis, and now I have a new job (as an NLP engineer, but in Java ;)).

So, the book is basically frozen. We hope to have more time in the future to
continue the writing...

~~~
phantom-scald
Nice endeavor, but it ended up like most endeavors - unfinished. :)

It was the first book on NLP (and the only one so far) that I've read. I've
been interested in both NLP and Haskell, so in that respect it fit well.
Thanks!

A few points of criticism: for the frequency list one should use multisets,
not dictionaries (there are a few multiset packages on Hackage). Suffix arrays
are badly explained; monads, very badly. The tagging chapter gave the
impression it could be explained more simply.
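
(To illustrate the multiset point in Python terms, since the thread started
there: collections.Counter is Python's multiset, and a Hackage package like
Data.MultiSet plays the same role in Haskell. Counting becomes a one-liner
instead of hand-rolled dictionary bookkeeping:)

    from collections import Counter

    # A frequency list is just a multiset of tokens.
    freqs = Counter("to be or not to be".split())
    print(freqs.most_common())
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]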

Many things are announced but never touched on. The book isn't really a book
yet, in fact; it's more like an article. Perhaps reconsider it in that light?
But OK, hopefully you will find time to continue it as a book.

In the meantime, can you perhaps recommend some other book to continue reading
on NLP?

~~~
danieldk
Speech and Language Processing by Jurafsky and Martin

Foundations of Statistical Natural Language Processing by Manning and Schütze

I have to say, comments like yours are not really encouraging for continuing
to write ;).

~~~
phantom-scald
I'm sorry about that! All the other sections were written nicely or OK, and I
appreciate what I picked up from the book. I just wanted to point out some
places that would need reworking in case you continue.

Being in industry myself, I know how hard, nearly impossible, it is to find
time for anything beyond work and family. And a decent book requires
approximately the same amount of effort as finishing a PhD. Perhaps that was
my frustration from the projects I've had to abandon coming out. :(

Thanks for the refs!

------
danso
I'm not that familiar with Haskell, and the past week's HN frontpage articles
on monads were just confusing... but what is it about Haskell that makes it
more useful for NLP than, say, Python?

~~~
yummyfajitas
Not much. It's a more expressive and cleaner language, but on the other hand
python has the NLTK + scipy community.

Scala (or Java) is another great NLP language. It's got decent libraries
(OpenNLP, Mallet, Mahout) plus Hadoop, and Scala is almost as nice as Haskell.

~~~
KirinDave
> Not much. It's a more expressive and cleaner language, but on the other hand
> python has the NLTK + scipy community.

Haskell's mechanisms for defining parsers, lexers, and other pattern-matching
tools are so good they probably cross the line from "pretty" to "objectively
better".

A lot of people who need to lex and parse data and then act on it turn to
Haskell. It has some really remarkable and efficient libraries, and even for
"common" target languages it's reasonable to write extremely fast parsers.
With tuning, a project like Aeson is among the fastest JSON parsers and
writers out there (only a few projects in ANY runtime exceed its speed and
resource efficiency).

~~~
robrenaud
I am guessing you might be conflating parsing natural language with parsing
something that has a rigid, well-defined grammar (like a programming
language). NLP is a whole different beast.

~~~
KirinDave
> NLP is a whole different beast.

The very same patterns that define "packrat-like" parsers (which bear a strong
relationship to monadic and "arrow-adic" parsers) can be extended to define
things like DFAs and semantic pattern matching. And languages with support for
rich, somewhat lazy pattern matching, like Haskell and Prolog, wipe the floor
with eager languages that lack it (e.g., C); that richness is ideal for
semantic analysis.

While not an "authority" on the subject, I've spent a lot of time working with
some very skilled folks in the fields of NLP and linguistics. Most tools they
used (in our case licensed from X/PARC) had C underpinnings for performance,
but ultimately consumed specifications that were very much like Prolog or
Haskell in character. Talking to some of the linguists who wrote those tools,
I gathered that had GHC existed (or had Allegro or a fast Prolog been
cheaper), the tools would have been much easier to write in those languages.

~~~
yummyfajitas
Do you have more info on this? I'd love to read more.

~~~
KirinDave
I'm afraid I can't say much more beyond what I have without talking out of my
rear. But you can read about X/PARC's XLE project here:
<http://www2.parc.com/isl/groups/nltt/xle/>

------
ceautery
I think promoting this should have waited until the section on Natural
Language Processing said more than "stub".

------
myth_drannon
Just FYI: there is an NLP class starting soon on Coursera:
<https://class.coursera.org/nlangp-001>

------
superbobry
I wonder if a PDF or EPUB version is available? I couldn't find one on the
site.

~~~
Ellos
Here you go <http://nlpwp.org/nlpwp.pdf>

~~~
superbobry
Great, thank you!

------
MojoJolo
I've always been fascinated by NLP. My undergrad work revolved around it, and
I'm currently doing research for my MS degree on a variant of automatic
summarization, in which I extract the most important sentences from an
article. I'll open an API for it soon. :) If you're interested, just contact
me by email (check my profile for it).

In the meantime, here's a preview of what it can do. <http://goo.gl/Lz7Vr>
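
For context, the classic frequency-scoring baseline for extractive
summarization (not necessarily what my research does) looks roughly like this:

    import re
    from collections import Counter

    def summarize(text, n=3):
        """Score sentences by average word frequency; keep the top n."""
        sents = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'[a-z]+', text.lower()))
        def score(sent):
            words = re.findall(r'[a-z]+', sent.lower())
            return sum(freq[w] for w in words) / (len(words) or 1)
        top = set(sorted(sents, key=score, reverse=True)[:n])
        return [s for s in sents if s in top]  # preserve article order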

------
mark_l_watson
Not open source, but I provide a free web service endpoint for my NLP system:
<http://kbsportal.com/>. Use the demo form page to try it interactively.

------
pdat
Maluuba provides an API for natural language processing:
<http://dev.maluuba.com/>

------
ExpiredLink
> _1.2. What is natural language processing?_

> _Stub_

