
You seem to know a lot about NLP. I've asked this question in various places and never found anyone who knew even a little, so I hope you don't mind a small question: can my problem be solved with NLP at all?

I'm looking for a way to extract addresses from web pages. People recognize these addresses immediately, but they are not in a standard format (zip code before or after the city, no zip code at all, a P.O. box instead of a street name, ...). Everything is text (no graphics, so no OCR problem) but sits inside HTML tags in various forms (as a row in a table, inside one or multiple <div>s, as a <ul>, etc.).

- Is this an NLP problem?
- If so, where do I start reading/learning?

Most NLP seems to be about understanding free-flowing texts of all sorts of subjects. I'm looking for 98% solutions in what I think is a restricted problem space. Is this a reasonable expectation?




Without knowing all the details, this is probably something you could do with a regular language (for example, with regular expressions).
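
A minimal sketch of the regular-expression route in Python, assuming US-style addresses. The pattern, the list of street suffixes, and the names `TextExtractor`, `html_to_text`, and `find_addresses` are all hypothetical and only cover a couple of the variations mentioned above. It strips the tags with the standard library's `html.parser`, so a table row, nested divs, and a list all collapse to the same plain text, then runs one pattern over the result.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a page, dropping the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page text with every text node joined by one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical pattern: a street line or P.O. box, then a city, then an
# optional 5-digit zip code. Real pages will need more alternatives.
ADDRESS_RE = re.compile(
    r"""
    (?P<street>
        P\.?\s?O\.?\s?Box\s+\d+
      | \d+\s+(?:[A-Z][a-z]+\s+)+(?:St|Street|Ave|Avenue|Rd|Road)\.?
    )
    [,\s]+
    (?P<city>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)
    (?:[,\s]+(?P<zip>\d{5}))?
    """,
    re.VERBOSE,
)


def find_addresses(html):
    """Return one dict (street, city, zip) per address-like match."""
    text = html_to_text(html)
    return [m.groupdict() for m in ADDRESS_RE.finditer(text)]
```

Calling `find_addresses(page_source)` gives a list of dicts whose `zip` entry is `None` when no zip code follows the city. Flattening the HTML first is a design choice: it keeps the pattern independent of the markup, at the cost of losing layout hints such as table cell boundaries.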

Since this book is for the 'working programmer' (rather than the 'working scientist'), it seems reasonable to assume that the book will provide techniques that can be used in domain-specific problems.


This could be an NLP problem, although if you can find an adequate solution with a regular expression or a context-free grammar, that's the easier route.

A lot of modern NLP is based on statistical methods driven by training data, so if you went one of those routes, the starting place would be a training corpus of example addresses identified within the context of these web pages. You might start by looking up some academic papers in this area to see whether it's been done and the methods published.
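
To make "training corpus" concrete, here is a rough sketch in Python. The labeled sentences, the `ADDR`/`O` tag names, and the functions `shape`, `train`, and `tag` are all made up for illustration. The model is a deliberately crude most-frequent-tag baseline over token shapes, not one of the published methods; it only shows how hand-labeled examples become a tagger.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labeled corpus: (token, tag) pairs, where ADDR marks
# tokens inside an address and O marks everything else.
TRAINING = [
    [("Visit", "O"), ("us", "O"), ("at", "O"), ("12", "ADDR"),
     ("Main", "ADDR"), ("St", "ADDR"), ("Springfield", "ADDR"),
     ("62704", "ADDR"), ("today", "O")],
    [("Write", "O"), ("to", "O"), ("PO", "ADDR"), ("Box", "ADDR"),
     ("45", "ADDR"), ("Portland", "ADDR"), ("for", "O"), ("details", "O")],
]


def shape(token):
    """Map a token to a coarse shape so unseen words can still be tagged."""
    if token.isdigit():
        return "digits"
    if token.isupper():
        return "UPPER"
    if token[0].isupper():
        return "Capitalized"
    return "lower"


def train(corpus):
    """Learn the most frequent tag for each token shape."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for token, label in sentence:
            counts[shape(token)][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def tag(model, tokens, default="O"):
    """Tag each token with the label learned for its shape."""
    return [(t, model.get(shape(t), default)) for t in tokens]


model = train(TRAINING)
```

With so little data, this baseline tags every capitalized word as part of an address, including the first word of a sentence. A real system would need far more labeled pages and features that look at neighboring tokens.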


Just for information, CFGs used for processing natural language are almost invariably statistical too these days, because natural language is inherently ambiguous and probabilistic.

"Fruit flies like bananas" can be grammatically parsed in [at least] two ways, but one is a much more likely interpretation.

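A toy illustration of how a probabilistic CFG ranks the two parses, sketched in Python. The grammar and every rule probability below are invented for the example (a real system would estimate them from a treebank), and the names `RULES`, `tree_prob`, `INSECTS`, and `FLYING` are hypothetical. A parse's probability is the product of the probabilities of the rules it uses.

```python
# Hypothetical rule probabilities; for each left-hand side they sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N", "N")): 0.3,
    ("NP", ("N",)): 0.7,
    ("VP", ("V", "NP")): 0.8,
    ("VP", ("V", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("N", ("fruit",)): 0.4,
    ("N", ("flies",)): 0.2,
    ("N", ("bananas",)): 0.4,
    ("V", ("like",)): 0.95,
    ("V", ("flies",)): 0.05,
    ("P", ("like",)): 1.0,
}


def tree_prob(tree):
    """Multiply the probabilities of every rule used in a parse tree.

    A tree is a tuple (label, child, ...); a child is a subtree or a word.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p


# Reading 1: the insects called fruit flies enjoy bananas.
INSECTS = ("S",
           ("NP", ("N", "fruit"), ("N", "flies")),
           ("VP", ("V", "like"), ("NP", ("N", "bananas"))))

# Reading 2: fruit travels through the air the way bananas do.
FLYING = ("S",
          ("NP", ("N", "fruit")),
          ("VP", ("V", "flies"),
           ("PP", ("P", "like"), ("NP", ("N", "bananas")))))

parses = {"insects": INSECTS, "flying": FLYING}
best = max(parses, key=lambda name: tree_prob(parses[name]))
```

Under these made-up numbers `best` is the "insects" reading, mainly because "flies" is rarely a verb in this toy grammar. Different probabilities would give a different ranking, which is exactly why they are learned from data.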

Yeah, for sure. My advice was only that a non-statistical solution might be a good place to start.



