Scraping the web with Node.io (coderholic.com)
70 points by coderholic 2266 days ago | 10 comments

Hey Ben, thanks for the write-up on my framework. Firstly, the lack of documentation for more advanced scraping is something I plan on getting around to. You can incorporate proxies, scrape pages behind logins, etc. If anyone needs help in the meantime, send me a message on GitHub.

The main thing I'd like to point out is that I built it primarily as a command-line tool. "By implementing an input method there’s no way to specify a search term from the command line" - so leave/comment the input method out! The default input method is to read lines from STDIN, just like the default output method is to write to STDOUT.

Try commenting out the input line in that Google example and running it with a list of words in a file:

    node.io google_keywords < input.txt

Or you could feed the results into another node.io job:

    cat input.txt | node.io google_keywords | node.io someotherjob
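
To make the STDIN/STDOUT convention concrete, here is a minimal sketch in plain Node.js of a job-like filter that reads one input per line and writes one result per line. It does not use node.io itself; `pipeLines` and its `transform` argument are hypothetical names made up for illustration.

```javascript
const readline = require('readline');

// Hypothetical helper (not part of node.io): read lines from `input`,
// apply `transform` to each non-blank line, and write the result to `output`.
function pipeLines(input, output, transform) {
  return new Promise((resolve, reject) => {
    const rl = readline.createInterface({ input, crlfDelay: Infinity });
    rl.on('line', (line) => {
      if (line.trim() === '') return; // skip blank lines
      output.write(transform(line) + '\n');
    });
    rl.on('close', resolve); // fires once the input stream has ended
    input.on('error', reject);
  });
}
```

Calling `pipeLines(process.stdin, process.stdout, (word) => word.toUpperCase())` in a script would let it sit in a shell pipeline in the same way as the commands above.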

The article doesn't mention this, but there are probably more reasons why this is a good idea beyond JS selectors being a natural fit for page content. Because everything is asynchronous, there are probably concurrency benefits too: a slow-responding server in your list won't hold up the processing of the other sites you're scraping.
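
The concurrency point can be sketched in plain Node.js. This is only an illustration under stated assumptions: `fakeFetch` is a hypothetical stand-in that simulates a server's response time with a timer, and `scrapeAll` is an invented name, not a node.io function.

```javascript
// Hypothetical stand-in for an HTTP request: resolves after `delayMs`.
function fakeFetch(site, delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve(site), delayMs));
}

// Start every request at once and record the order in which they finish.
async function scrapeAll(sites) {
  const finished = [];
  await Promise.all(
    sites.map(({ name, delayMs }) =>
      fakeFetch(name, delayMs).then((site) => finished.push(site))
    )
  );
  return finished;
}
```

With one slow site listed first and two fast ones after it, the fast sites finish first, and the total time is roughly that of the slowest request rather than the sum of all of them.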

Node.js seemed like a perfect fit for a few reasons:

1. JS selectors make scraping _very_ easy.

2. Asynchronous I/O is fast as it is, but the page is also parsed as it's received - contrast this with other scraping solutions, where you need to download the whole page and parse it once it's complete.

3. With asynchronous scraping it's trivial to handle failures, timeouts, retries, nested requests, recursing similar URLs, concurrent requests, etc. - just add one of the many options (https://github.com/chriso/node.io/wiki/API---Job-Options)
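
As a rough illustration of why timeouts and retries compose easily in asynchronous code, here is a generic sketch in plain Node.js. It is not node.io's implementation and does not use its job options; `withTimeout`, `withRetries`, and their parameters are hypothetical names chosen for this example.

```javascript
// Reject if `promise` has not settled within `timeoutMs` milliseconds.
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timed out')), timeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run `task` (a function returning a promise) up to `retries + 1` times,
// giving each attempt at most `timeoutMs` to finish.
async function withRetries(task, { retries = 2, timeoutMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await withTimeout(task(attempt), timeoutMs);
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

A request function can be wrapped as `withRetries(() => fetchPage(url), { retries: 3, timeoutMs: 5000 })`, where `fetchPage` is whatever promise-returning request helper is in use.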

Does it cope with dynamically generated (i.e. by JS) pages? That'd be a big win for me.

Yes, it's something I'm experimenting with at the moment. You can select JSDOM (https://github.com/tmpvar/jsdom) - which has the ability to handle JS - as an alternative parser. Set the following two options:

    jsdom: 1
    external_resources: 1

Watch this space: https://github.com/chriso/node.io/blob/master/lib/node.io/do...

Thanks for the info.

Interesting, but I think I'll stick to more mature frameworks like Scrapy for the moment.

I use HtmlUnit as a headless browser; I believe it's more mature than this for DOM handling, etc.?

Is it a full blown renderer? i.e., does it fetch JS/CSS and execute them?

It's not the default mode, but you can select JSDOM as an alternative parser and interact with / scrape pages as a headless browser.
