

Ask HN: What web crawler should I use? - MarkMc

I need to run a web crawler to 'screen scrape' a few websites.  Ideally this crawler can:
1. Deal with forms
2. Deal with javascript
3. Deal with non-pure HTML to extract bits of data

A few years ago I had a bit of success with Mozilla Parser (http://mozillaparser.sourceforge.net/) but the project seems to have gone cold.

Can anyone recommend a web crawler they have used?

Thanks
======
nostrademons
What's wrong with Mechanize + BeautifulSoup? Or even urllib2 + html5lib, if
you want to stay close to the standard library?
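
As a rough illustration of the fetch-and-parse approach, here is a minimal
sketch that uses only the Python 3 standard library. It swaps in html.parser
and urllib.request for the libraries named above, so it is not the Mechanize
or BeautifulSoup API. The names FormAndLinkExtractor, extract, and fetch are
made up for this example. It pulls link targets and form fields out of sloppy
HTML and does not execute any JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class FormAndLinkExtractor(HTMLParser):
    """Collect link targets and form fields from possibly malformed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags, in document order
        self.forms = []        # one dict per <form>: action, method, fields
        self._in_form = False  # True between <form> and </form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self._in_form = True
            self.forms.append({
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            })
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Only named inputs are sent when a form is submitted.
            self.forms[-1]["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def extract(html):
    """Return (links, forms) found in an HTML string."""
    parser = FormAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.forms


def fetch(url):
    """Download a page as text (hypothetical helper, no error handling)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling extract(fetch(url)) gives you the links to follow and, for each form,
the action, method, and default field values you would need to submit it.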

Executing JavaScript is a bit tougher. Maybe you can extract the scripts with
html5lib and then hook them up to V8 to execute, although that doesn't really
account for the intricacies of the feedback cycle between JS execution and
HTML parsing.

------
madhouse
I would recommend PhantomJS: <http://www.phantomjs.org/>

It's not a crawler per se, but you can fairly easily build one based on it. It
supports everything that modern browsers do - it's a headless WebKit, driven
from JavaScript, after all.
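
To give an idea of what "building one" involves, here is a minimal sketch of
the crawl loop itself. It is in Python rather than a PhantomJS script, and it
assumes a hypothetical fetch_links(url) callable that loads a page (for
example through a headless browser) and returns the href values found on it.
The names crawl, fetch_links, and max_pages are made up for this example.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch_links, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    fetch_links is a hypothetical callable: it takes a URL and returns an
    iterable of href strings found on that page.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}          # URLs already queued, to avoid revisits
    queue = deque([start_url])  # frontier of pages still to load
    visited = []                # pages actually loaded, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

The page loader is passed in as a parameter so the loop stays the same
whether pages come from a plain HTTP fetch or from a headless browser.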

