I wanted to do this same thing a while ago and have done a lot of research and reading in this area. Here are some search terms that will likely help you:
* automatic wrapper generation
* information extraction
* removing noisy information from Web pages
* template detection
* wrapper induction
"Wrapper" is a fancy computer-science term for "scraper."
I wrote some Python code that does this: given a few sample documents, it detects the differences between them and automatically creates a scraper tailored to those documents. I released the first version as open source. It's called templatemaker: http://code.google.com/p/templatemaker/
But that version of templatemaker is quite brittle, because it was designed to work on plain text as much as on HTML. I've since written an HTML-aware version of templatemaker that is really frikkin' awesome (if I may say!) and beats the pants off the old one. I don't know if I'm going to open-source it, as it's quite valuable to my own startup.
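To make the "detect the differences between sample documents" idea concrete, here is a minimal sketch of my own. It is not templatemaker's actual code or API. It assumes only two samples, compares them character by character with the standard library's `difflib`, and treats whatever differs as a "hole" to extract. The names `learn_template`, `extract`, and `HOLE` are made up for this illustration.

```python
import re
from difflib import SequenceMatcher

HOLE = None  # hypothetical marker for a variable region in the template


def learn_template(a, b):
    """Return the literal text shared by a and b, with HOLE where they differ."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    pos_a = pos_b = 0
    for block in matcher.get_matching_blocks():
        # Text skipped in either sample since the last match is variable data.
        if block.a > pos_a or block.b > pos_b:
            template.append(HOLE)
        if block.size:
            template.append(a[block.a:block.a + block.size])
        pos_a = block.a + block.size
        pos_b = block.b + block.size
    return template


def extract(template, document):
    """Return the text filling each hole, or None if the document does not fit."""
    pattern = "".join(
        "(.*?)" if part is HOLE else re.escape(part) for part in template
    )
    match = re.fullmatch(pattern, document, flags=re.DOTALL)
    return match.groups() if match else None


template = learn_template(
    "<b>Alice</b> is 30 years old",
    "<b>Bob</b> is 25 years old",
)
print(template)
# ['<b>', None, '</b> is ', None, ' years old']
print(extract(template, "<b>Carol</b> is 41 years old"))
# ('Carol', '41')
```

Character-level matching like this is fragile: if two values happen to share a character, that character ends up in the template as literal text. That is one reason a plain-text approach is brittle on real HTML, and why more samples and markup-aware comparison help.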
Hope this helps!