You seem to know a lot about NLP and I've asked this question in various places ...

danieldk · on Nov 16, 2010

Without knowing all the details, this is probably something you could do with a regular language (such as regular expressions).

Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.

_corbett · on Nov 16, 2010

this could be an NLP problem, although if you can find an adequate solution with a regular expression or context-free grammar, that's the easier route.

a lot of modern NLP is based on statistical methods and driven by training data, meaning a training corpus of example addresses identified within the context of these webpages would be the starting place if you went one of those routes. you might start by looking up some academic papers in this area to see whether it's been done and the methods published.
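
For the simpler regular-expression route, a minimal sketch might look like the following. It assumes US-style street addresses, and the pattern, the suffix list, and the `find_addresses` helper are all hypothetical choices made for illustration, not something from the original question. Real pages would need a much broader pattern.

```python
import re

# Hypothetical pattern for US-style street addresses only.
# The list of street suffixes is an assumption and far from complete.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+"               # house number
    r"(?:[A-Z][a-z]+\s+){1,3}"    # one to three capitalized street-name words
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Lane|Ln|Drive|Dr)\b\.?"
)


def find_addresses(text):
    """Return every substring of text that matches ADDRESS_RE."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]


sample = "Visit us at 221 Baker Street or write to 1600 Pennsylvania Ave. for details."
print(find_addresses(sample))
```

If a pattern like this misses too many cases or produces too many false positives on real pages, that is probably the point at which a labeled corpus and a statistical model become worth the effort.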

nervechannel · on Nov 16, 2010

Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.

"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.
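
To make that concrete, here is a toy sketch of how a probabilistic CFG would rank the two parses. The grammar and every probability in it are invented for illustration (a real system would estimate them from a treebank), and `RULES`, `tree_prob`, and the two tree variables are hypothetical names.

```python
from math import prod

# Toy rule probabilities, invented for illustration (not estimated from a corpus).
# Keys are (left-hand side, right-hand side); for each left-hand side they sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N", "N")): 0.3,   # compound noun, e.g. "fruit flies"
    ("NP", ("N",)): 0.7,
    ("VP", ("V", "NP")): 0.8,
    ("VP", ("V", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("N", ("fruit",)): 0.4,
    ("N", ("flies",)): 0.2,
    ("N", ("bananas",)): 0.4,
    ("V", ("like",)): 0.95,
    ("V", ("flies",)): 0.05,
    ("P", ("like",)): 1.0,
}


def tree_prob(tree):
    """Probability of a parse tree: the product of the probabilities of its rules."""
    label, *children = tree
    # A child is either a subtree (tuple) or a word (str).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    sub = prod(tree_prob(c) for c in children if not isinstance(c, str))
    return RULES[(label, rhs)] * sub


# Reading 1: the insects called fruit flies are fond of bananas.
insects_like = ("S",
                ("NP", ("N", "fruit"), ("N", "flies")),
                ("VP", ("V", "like"), ("NP", ("N", "bananas"))))

# Reading 2: fruit travels through the air the way bananas do.
fruit_flies_verb = ("S",
                    ("NP", ("N", "fruit")),
                    ("VP", ("V", "flies"),
                     ("PP", ("P", "like"), ("NP", ("N", "bananas")))))

best = max([insects_like, fruit_flies_verb], key=tree_prob)
print(tree_prob(insects_like), tree_prob(fruit_flies_verb))
```

Both trees are valid under the grammar. Only the rule probabilities separate them, and with these made-up numbers the first reading scores higher.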

_corbett · on Nov 17, 2010

yea for sure, I meant in my advice that a non-statistical solution might be good to start out with.