
Natural Language Processing for the Working Programmer - r11t
http://nlpwp.org/book/
======
danieldk
We started just a week and a half ago, when we joined the Pragmatic
Programmer's writing month. We thought a 'release early, release often'
approach would be best, which is why there are only a few in-progress chapters.

We will keep you posted, and thanks for the encouragement!

~~~
roel_v
You seem to know a lot about NLP. I've asked this question in various places
and never found anyone who knew even a little, so I hope you don't mind a
small question about whether my problem can be solved with NLP at all.

I'm looking for a way to extract addresses from web pages, where these
addresses are immediately recognizable as such by people but are not in a
standard format (zip codes before city or after, no zip codes at all, p/o box
instead of street name, ...). All in text format (no graphics, no OCR problem)
but inside html tags, in various forms (as a row in a table, inside one or
multiple <div>'s, as a <ul>, etc).

- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about
  understanding free-flowing texts of all sorts of subjects. I'm looking for
  98% solutions in what I think is a restricted problem space. Is this a
  reasonable expectation?

~~~
_corbett
This could be an NLP problem, although if you can find an adequate solution
with a regular expression or a context-free grammar, that's the easier route.

A lot of modern NLP is based on statistical methods and driven by training
data, so if you went one of those routes the starting place would be a
training corpus of example addresses identified within the context of these
webpages. You might start by looking up some academic papers in this area to
see whether it's been done and the methods published.
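
To make the regular-expression route concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the address shape ("street number", then either "zip city" or "city zip", with a five-digit zip and a one-word city), the sample HTML, and the names `TextExtractor`, `ADDRESS` and `extract_addresses`. It flattens the markup to text first so the same pattern works whether the address sits in a table row or a list. Real pages will need a richer pattern (p/o boxes, missing zip codes), and that is the point where a statistical approach may start to pay off.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of a page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Assumed shape (illustration only): "<street> <number>, <zip> <city>"
# or "<street> <number>, <city> <zip>", five-digit zip, one-word city.
ADDRESS = re.compile(
    r"(?P<street>[A-Z][A-Za-z. ]+?) (?P<number>\d+[a-z]?), "
    r"(?:(?P<zip_a>\d{5}) (?P<city_a>[A-Z][a-z]+)"
    r"|(?P<city_b>[A-Z][a-z]+) (?P<zip_b>\d{5}))"
)


def extract_addresses(html):
    """Return (street, number, zip, city) tuples found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with ", " so table cells and list items read like
    # one comma-separated line of text.
    text = ", ".join(parser.chunks)
    results = []
    for m in ADDRESS.finditer(text):
        results.append((
            m.group("street"),
            m.group("number"),
            m.group("zip_a") or m.group("zip_b"),
            m.group("city_a") or m.group("city_b"),
        ))
    return results


# Hypothetical sample page: one address in a table, one in a list.
SAMPLE_HTML = """
<table><tr><td>Main Street 12</td><td>12345 Springfield</td></tr></table>
<ul><li>Oak Lane 7b</li><li>Shelbyville 54321</li></ul>
<p>Contact us for more information.</p>
"""

for address in extract_addresses(SAMPLE_HTML):
    print(address)
```

Normalizing both zip orders into the same tuple keeps the downstream code simple. Each new address variant then becomes another alternative in the pattern, which is manageable for a handful of formats but grows unwieldy quickly.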

~~~
nervechannel
Just for information, CFGs used for processing natural language are almost
invariably statistical too these days, because natural language is inherently
ambiguous and probabilistic.

"Fruit flies like bananas" can be grammatically parsed in [at least] two ways,
but one is a much more likely interpretation.
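
A rough Python sketch of what that looks like with a probabilistic CFG. The grammar and every probability below are invented for illustration (a real system would estimate them from a treebank), and the parser is a naive exhaustive enumeration over spans, not an optimized CKY implementation. The names `BINARY`, `LEXICAL` and `parses` are hypothetical.

```python
from functools import lru_cache

# Toy grammar in Chomsky normal form. All probabilities are made up.
BINARY = {  # (parent, left child, right child): rule probability
    ("S", "NP", "VP"): 1.0,
    ("NP", "N", "N"): 0.2,   # compound noun, e.g. "fruit flies"
    ("VP", "V", "NP"): 0.7,
    ("VP", "V", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (symbol, word): rule probability
    ("NP", "fruit"): 0.3,
    ("NP", "bananas"): 0.5,
    ("N", "fruit"): 0.5,
    ("N", "flies"): 0.5,
    ("V", "like"): 0.9,
    ("V", "flies"): 0.1,     # "flies" as a verb is assumed to be rare
    ("P", "like"): 1.0,
}


def parses(words, symbol="S"):
    """Return every (tree, probability) for the sentence, best first."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def span(sym, i, j):
        """All (tree, probability) pairs deriving words[i:j] from sym."""
        found = []
        if j - i == 1:
            p = LEXICAL.get((sym, words[i]))
            if p is not None:
                found.append(((sym, words[i]), p))
        for (parent, left, right), p in BINARY.items():
            if parent != sym:
                continue
            for k in range(i + 1, j):  # every split point of the span
                for ltree, lp in span(left, i, k):
                    for rtree, rp in span(right, k, j):
                        found.append(((sym, ltree, rtree), p * lp * rp))
        return tuple(found)

    return sorted(span(symbol, 0, len(words)),
                  key=lambda pair: pair[1], reverse=True)


for tree, prob in parses("fruit flies like bananas".split()):
    print(f"{prob:.5f}  {tree}")
```

With these invented numbers the grammar finds two parses and ranks the "insects that enjoy bananas" reading above the "fruit travels through the air like bananas do" reading. A non-statistical CFG would return both trees with nothing to choose between them.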

~~~
_corbett
Yeah, for sure. My advice just meant that a non-statistical solution might be
a good place to start.

------
hvs
It's an interesting paper that I intend to dig into more carefully, but I kind
of wish that a paper "for the Working Programmer" used a language like Python
rather than Haskell. I'm aware that Haskell has a very nice type system for
doing things like this -- and I'm a language nerd myself, so it's not that --
but it just seems like it would be more _practical_ in something more
"mainstream."

That said, it is interesting from what I've read so far.

~~~
1331
The title is likely a hat tip to a famous functional textbook: _ML for the
Working Programmer_.

<http://www.cl.cam.ac.uk/~lp15/MLbook/>

~~~
silentbicycle
...which is in turn a nod to _Categories for the Working Mathematician_ by
Saunders Mac Lane (as is Clocksin's _Clause and Effect: Prolog Programming for
the Working Programmer_, and probably several other books.)

------
waterside81
I've posted this link before, but these NLP posts keep popping up on HN, so
I'll keep posting.

Over at <http://www.repustate.com>, we're taking the more common functions
that NLTK performs (and the ones it should) and porting them over as web
services. NLTK is kind of buggy here & there, and it's not too great if you're
dealing with big data sets. Our API, with the obvious handicap of network
latency, is lightning fast because we ported many NLTK functions down to raw
C.

Our API is free so have at it, let me know if you want to see us add anything.

~~~
riffraff
While I'm sure something nice will come out of it, you may wish to temporarily
disable the NER feature, because it seems to amount to "select capitalized
words", at least on the few pages I tried (wikipedia, bitcoin, nytimes).

It is blazing fast though :)
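
For reference, this is roughly what a "select capitalized words" baseline looks like in Python. It is only a sketch of the heuristic I mean, with a hypothetical function name, and not a claim about how the service is actually implemented.

```python
import re


def capitalized_spans(text):
    """Return maximal runs of capitalized words as candidate entities."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)


sample = "Yesterday Barack Obama visited Paris. The trip was short."
print(capitalized_spans(sample))
```

On the sample sentence it picks up "Paris", but it also glues "Yesterday" onto the name and reports "The" as an entity, simply because sentence-initial words are capitalized too.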

~~~
waterside81
Yeah, the NER call is not functioning as well as it can. We're aiming at
improving that. Thanks for checking things out.

------
LeBlanc
I'll have to put this on my 'to read' list; it looks really interesting. I
think natural language processing/understanding may become one of those next
'big things' like mobile and social media, simply because understanding what a
user is trying to do will become very important.

If anyone is interested in playing around with a robust natural language
processing tool, I built an API for the Stanford Parser.
<http://nlp.naturalparsing.com/browserparser/parse>

------
mark_l_watson
Thanks Daniël, this is cool!

I am not a very good Haskell programmer, but I spend an occasional evening with
it, and I am interested in NLP also (have been working off and on on NLP since
the early 1980s).

From skimming through the book, it looks like a nice read, and it just went on
my reading list.

------
samratjp
This is pretty neat. At the risk of sounding childish, here I go -- I wish
books like these could be given life like tryruby.org where you could try out
examples and learn along the way. That would be wicked cool.

For now, OpenStudy will do the trick. I created a "StudyPad" if anyone wants
to go through this book together:
<http://openstudy.com/studypads/Natural-Language-Processing-for-The-Working-Programmer-4ce2253e59fe3a7ffe5f6778>

------
jasonjei
It's interesting to note that a lot of natural language processing is English-
centered. It's clear that English natural language processing is way ahead of
the curve, but based on the quality of Chinese results on Google Translate, I
take it Asian languages don't do so hot when it comes to natural language
processing?

~~~
syllogism
Translation is a lot harder for language pairs that are less related. Most of
the European languages are fairly close cousins, so translation between, say,
English and French isn't that hard.

That said, it's generally true that for most NLP tasks, we're doing much
better on languages similar to English.

