We will keep you posted, and thanks for the encouragement!
If I could make one request, could you remove the mouseover from the paragraph text that shows the topic heading? It is really distracting for those of us who like to use our mouse pointer as a finger when reading.
I'm looking for a way to extract addresses from web pages, where these addresses are immediately recognizable as such by people but are not in a standard format (zip codes before or after the city, no zip codes at all, P.O. box instead of street name, ...). All are in text format (no graphics, no OCR problem) but inside HTML tags, in various forms (as a row in a table, inside one or multiple <div>s, as a <ul>, etc.).
- Is this an NLP problem?
- If so, where do I start reading/learning? Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?
Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.
A lot of modern NLP is based on statistical methods and driven by training data, meaning that a training corpus of example addresses, identified within the context of these web pages, would be the starting place if you went one of those routes. You might start by looking up some academic papers in this area to see whether it's been done and the methods published.
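For what it's worth, here is a rough sketch of how one might collect candidate address blocks to hand-label for such a corpus. It is a heuristic baseline, not a statistical method: it splits the page text at a few block-level tags and flags blocks that match simple patterns. The tag set, the US-style zip pattern, the street-suffix list, and all function names are my own assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: text is split into candidate blocks at these tags only.
# <td> and <br> are deliberately left out so a table row stays in one block.
BLOCK_TAGS = {"div", "p", "li", "tr", "table", "ul", "ol"}

# Assumption: rough, US-centric patterns; real pages will need more.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PO_BOX_RE = re.compile(r"\bP\.?\s*/?\s*O\.?\s*Box\b", re.IGNORECASE)
STREET_RE = re.compile(
    r"\b\d+\s+(?:\w+\s+){1,3}(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Lane|Ln|Dr|Drive)\b",
    re.IGNORECASE,
)


class BlockTextExtractor(HTMLParser):
    """Collects visible text, split into blocks at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text and normalize whitespace.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def address_score(text):
    """Number of address-like patterns (0 to 3) found in a text block."""
    return sum(bool(rx.search(text)) for rx in (ZIP_RE, PO_BOX_RE, STREET_RE))


def candidate_addresses(html, min_score=1):
    """Return the text blocks that look at least a little like an address."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if address_score(b) >= min_score]
```

Calling `candidate_addresses(page_html)` returns the flagged blocks as plain strings. An address spread over several sibling <div>s would still be split apart here, so treat the output as material to label by hand rather than as a finished extractor.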
"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one if a much more likely interpretation.
I actually remember reading your Slackware book a few years back. I've no doubt that the quality of this text will be as superb as that one's! Cheers!
That said, it is interesting from what I've read so far.
http://ocw.mit.edu/courses/electrical-engineering-and-comput... is excellent, as is http://metaoptimize.com/qa/ (a Stack Overflow-style site that is active, albeit specialized).
Maybe you should try to do the same with Python? :)
It'd be easy enough to rewrite most of the examples in another language anyway (I'd hope), even if elegance is lost in the process...
However, "translating" Python examples into relatively idiomatic Ruby shouldn't be too much of a problem if you really want to go down that path.
Really good point -- thanks for that.
Over at http://www.repustate.com, we're taking the more common functions that NLTK performs (and the ones it should) and porting them over as web services. NLTK is kind of buggy here & there, and it's not too great if you're dealing with big data sets. Our API, with the obvious handicap of network latency, is lightning fast because we ported many NLTK functions down to raw C.
Our API is free, so have at it, and let me know if you want to see us add anything.
It is blazing fast though :)
If anyone is interested in playing around with a robust natural language processing tool, I built an API for the Stanford Parser. http://nlp.naturalparsing.com/browserparser/parse
I am not a very good Haskell programmer, but I spend an occasional evening with it, and I am also interested in NLP (I have been working off and on on NLP since the early 1980s).
From skimming through the book, it looks like a nice read, and it just went on my reading list.
For now, OpenStudy will do the trick. I created a "StudyPad" if anyone wants to go through this book together.
That said, it's generally true that for most NLP tasks, we're doing much better on languages similar to English.