
I wanted to do this same thing a while ago and have done a lot of research and reading in this area. Here are some search terms that will likely help you:

* automatic wrapper generation

* information extraction

* removing noisy information from Web pages

* template detection

* wrapper induction

"Wrapper" is a fancy computer-science term for "scraper."

I wrote some Python code that does this: given several sample documents, it detects the differences between them and automatically creates a scraper tailored to those documents. I released the first version as open source. It's called templatemaker: http://code.google.com/p/templatemaker/ .

But that version of templatemaker is quite brittle, because it was designed to work on plain text as much as on HTML. I've since written an HTML-aware version of templatemaker that is really frikkin' awesome (if I may say!) and beats the pants off the old one. I don't know if I'm going to open-source it, as it's quite valuable to my own startup.
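To make the idea concrete, here is a rough sketch of the "learn a template from samples, then extract the varying parts" approach. It is not templatemaker's actual code or API. It swaps in Python's standard difflib for the diffing step, and the names learn_template, template_to_regex, extract and HOLE are made up for illustration. Like the first templatemaker, it treats documents as plain text, so expect it to be brittle on real HTML.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sentinel marking a span that varies between samples.
HOLE = object()

def tokenize(text):
    # Words stay whole; every other character is its own token.
    return re.findall(r"\w+|\W", text)

def learn_template(samples):
    """Keep the tokens shared by all samples; replace the rest with HOLE."""
    template = tokenize(samples[0])
    for sample in samples[1:]:
        tokens = tokenize(sample)
        matcher = SequenceMatcher(None, template, tokens, autojunk=False)
        merged = []
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            if tag == "equal":
                merged.extend(template[i1:i2])
            elif not merged or merged[-1] is not HOLE:
                # Anything that differs collapses into a single hole.
                merged.append(HOLE)
        template = merged
    return template

def template_to_regex(template):
    # Fixed text is matched literally; each hole becomes a capture group.
    parts = ["(.*?)" if tok is HOLE else re.escape(tok) for tok in template]
    return re.compile("".join(parts), re.DOTALL)

def extract(template, document):
    """Return the text that fills each hole, or None if the page doesn't fit."""
    match = template_to_regex(template).fullmatch(document)
    return match.groups() if match else None

if __name__ == "__main__":
    template = learn_template(["<b>this and that</b>", "<b>alex and sue</b>"])
    print(extract(template, "<b>larry and curly</b>"))
```

Each extra sample can only turn more of the template into holes, so feeding it more pages makes the resulting scraper more general. A document that doesn't fit the learned template simply returns None.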

Hope this helps!


