

Ask HN: What's the best set of tools to structure crawled web pages? - lucasrp

Hello everybody,<p>I have to scrape ~1k news sources (among other types of content) on the web and extract data like title, author, date, news body, etc.<p>Right now we use horrible in-house code (and Jsoup) to parse it. The problem is that we rely on regular expressions and CSS selectors to do it. As you can imagine, the maintenance cost is very high, because every time a source changes its template, we have to redo the work by hand.<p>We are interested in redoing the whole thing from scratch, and I would like to know which tool, or set of tools, would be better for a more intelligent approach. I've had a nice experience with ANTLR building a date parser, for example.<p>Any suggestions?
======
palidanx
I use the Mechanize gem for Ruby on Rails:

[http://mechanize.rubyforge.org/](http://mechanize.rubyforge.org/)
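
Fetching is only half of the problem. For the extraction step, one way to depend less on per-site selectors is to read the metadata that many news pages already publish in `<meta>` tags (Open Graph properties such as `og:title` and `article:published_time`). Below is a minimal Python sketch using only the standard library. It assumes the sources expose such tags, which will not hold for every site, and `FIELD_KEYS`, `MetaExtractor` and `extract_fields` are hypothetical names made up for illustration, not part of Mechanize or Jsoup.

```python
from html.parser import HTMLParser

# Hypothetical mapping: output field -> metadata keys to try, in priority order.
FIELD_KEYS = {
    "title": ["og:title", "twitter:title"],
    "author": ["author", "article:author"],
    "date": ["article:published_time", "date"],
}


class MetaExtractor(HTMLParser):
    """Collects <meta> name/property -> content pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_text = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_fields(html):
    """Return a dict with title, author and date (None when not found)."""
    parser = MetaExtractor()
    parser.feed(html)
    result = {}
    for field, keys in FIELD_KEYS.items():
        result[field] = next(
            (parser.meta[k] for k in keys if k in parser.meta), None
        )
    # Fall back to the <title> element when no metadata title exists.
    if result["title"] is None and parser.title_text.strip():
        result["title"] = parser.title_text.strip()
    return result
```

Fields that come back as `None` mark the sources that still need a hand-written rule, so the per-site selectors become the fallback rather than the default. The article body is not covered here; it usually needs a separate content-extraction heuristic.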

