I do all my screen scraping with PHP, curl, and some regex. Previously I used pl...

m0nty · on April 9, 2008

"Also AJAX certainly has changed the way a lot of screen scraping is done."

I'd be interested in how you tackle this one. I've always used something like Perl/Curl/wget etc for scraping, but (like you say) JavaScript messes that up. I've had moderate success using GreaseMonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)

alex_c · on April 9, 2008

Even though it's actually a testing tool, you might have some luck with Canoo Webtest + Groovy (http://webtest.canoo.com). Webtest uses HtmlUnit which has pretty good Javascript support, and means you don't have to mess with regexps to get around the document structure, and Groovy lets you use an actual programming language rather than the awkward Ant-based syntax of Webtest. It takes some getting used to, and I haven't used it for web scraping, but it's a pretty powerful combination.

m0nty · on April 9, 2008

Thanks, I'll give it a try. I'm collaborating on a project which involves getting info from online financial markets, btw, but it's getting held up because of this scraping problem. So new ideas might help get it moving again.