

Scraping the web with Node.io - coderholic
http://www.coderholic.com/scraping-the-web-with-node-io/

======
chrisohara
Hey Ben, thanks for the write up on my framework. Firstly, the lack of
documentation for more advanced scraping is something I plan on getting around
to. You can use proxies, scrape pages behind logins, etc. If anyone needs help
in the meantime, send me a message on GitHub.

The main thing I'd like to point out is that I built it primarily as a command
line tool. "By implementing an input method there’s no way to specify a search
term from the command line" - so leave the input method out, or comment it
out! The default input method is to read lines from STDIN, just as the default
output method is to write to STDOUT.

Try commenting out the input line on that google example and running it with a
list of words in a file:

    
    
        node.io google_keywords < input.txt
    

Or you could feed the results into another node.io job:

    cat input.txt | node.io google_keywords | node.io someotherjob

------
mxavier
The article fails to mention this, but there are probably more reasons why
this might be a good idea besides the fact that using JS selectors on page
content is a natural fit. Because everything is asynchronous, I suppose there
are some concurrency benefits too: a slow-responding server in your list won't
hold up the processing of the other sites you're scraping.

~~~
chrisohara
Node.js seemed like a perfect fit for a few reasons:

1. JS selectors make scraping _very_ easy.

2. Asynchronous scraping is fast as it is, but the page is actually parsed as
it's received - contrast this with other scraping solutions where you need to
download the whole page before parsing it.

3. With asynchronous scraping it's trivial to handle failures, timeouts,
retries, nested requests, recursing similar URLs, concurrent requests, etc. -
just add one of the many options
(<https://github.com/chriso/node.io/wiki/API---Job-Options>)
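The timeout-and-retry pattern behind those options can be sketched in plain Node-style JavaScript. This is not node.io's actual API - `fetchWithRetries`, its option names, and the `flakyFetch` fetcher are all hypothetical, for illustration only:

```javascript
// Sketch of the retry/timeout pattern (plain JavaScript, not
// node.io's actual API): each request is raced against a timeout
// and retried a fixed number of times before giving up.

function withTimeout(promise, ms) {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error('timeout')), ms);
    });
    // Clear the timer so a late rejection can't fire after the race settles
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function fetchWithRetries(fetchFn, url, { retries = 3, timeout = 1000 } = {}) {
    let lastErr;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await withTimeout(fetchFn(url), timeout);
        } catch (err) {
            lastErr = err; // failed or timed out - try again
        }
    }
    throw lastErr;
}

// Hypothetical fetcher that fails twice before succeeding
let calls = 0;
const flakyFetch = (url) => new Promise((resolve, reject) => {
    calls += 1;
    if (calls < 3) reject(new Error('connection reset'));
    else resolve('<html>' + url + '</html>');
});

fetchWithRetries(flakyFetch, 'http://example.com').then((html) => {
    console.log(html); // succeeds on the third attempt
});
```

Because each URL's retry loop is an independent promise chain, many of these can run concurrently and a slow or flaky server only delays its own chain.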

------
rasur
Does it cope with dynamically generated (i.e. by JS) pages? That'd be a big
win for me.

~~~
chrisohara
Yes, it's something I'm experimenting with at the moment. You can select JSDOM
(<https://github.com/tmpvar/jsdom>) - which can execute JS - as an alternative
parser. Set the following two options:

    jsdom: 1
    external_resources: 1

Watch this space:
<https://github.com/chriso/node.io/blob/master/lib/node.io/dom.js#L47-66>

~~~
rasur
thanks for the info.

------
cdr
Interesting, but I think I'll stick to more mature frameworks like Scrapy for
the moment.

------
wslh
I use htmlunit as a headless browser; I believe it's more mature than this for
DOM handling, etc.?

------
xtacy
Is it a full-blown renderer? I.e., does it fetch JS/CSS and execute them?

~~~
chrisohara
It's not the default mode, but you can select JSDOM as an alternative parser
and interact with / scrape pages as a headless browser.

