Hey Ben, thanks for the write-up on my framework. First, the lack of documentation for more advanced scraping is something I plan to get around to. You can incorporate proxies, scrape pages behind logins, etc. If anyone needs help in the meantime, send me a message on GitHub.
The main thing I'd like to point out is that I built it primarily as a command-line tool. "By implementing an input method there’s no way to specify a search term from the command line" - so leave the input method out, or comment it out! The default input method reads lines from STDIN, just as the default output method writes to STDOUT.
Try commenting out the input line in that google example and running it with a list of words in a file:
node.io google_keywords < input.txt
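For reference, the skeleton of such a job with the input method commented out might look roughly like this. This is only a sketch assuming node.io's Job API; the article's actual google_keywords example may differ, and the URL and selector handling here are illustrative:

```javascript
var nodeio = require('node.io');

exports.job = new nodeio.Job({timeout: 10}, {
    // input: ['some', 'search', 'terms'],  // commented out => read lines from STDIN
    run: function (keyword) {
        var self = this;
        this.getHtml('http://www.google.com/search?q=' + encodeURIComponent(keyword), function (err, $) {
            if (err) return self.skip();
            // emit() writes to STDOUT by default
            self.emit(keyword);
        });
    }
});
```

With the input method absent, each line of input.txt becomes a `keyword` passed to `run`.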
Or you could feed the results into another node.io job:
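Since jobs read STDIN and write STDOUT by default, chaining is just a shell pipe. Something along these lines, where `filter_results` is a hypothetical second job:

```shell
node.io google_keywords < input.txt | node.io filter_results
```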
The article doesn't mention it, but there are likely more reasons why this is a good idea besides the fact that using JS selectors on page content is a natural fit. Because everything is asynchronous, there are probably concurrency benefits too: a slow-responding server in your list won't hold up the processing of the other sites you're scraping.
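A toy illustration of that concurrency point, in plain Node rather than node.io itself (which handles this for you): three simulated requests run in parallel, so the total time is roughly that of the slowest one, not the sum of all three. The site names and delays are made up.

```javascript
// Simulate an HTTP request that takes delayMs to respond.
function simulatedFetch(name, delayMs) {
    return new Promise(function (resolve) {
        setTimeout(function () { resolve(name); }, delayMs);
    });
}

var start = Date.now();

// All three "requests" are in flight at once; the slow one
// doesn't block the fast ones.
var resultPromise = Promise.all([
    simulatedFetch('fast-site', 50),
    simulatedFetch('slow-site', 300),
    simulatedFetch('another-fast-site', 50)
]).then(function (results) {
    var elapsed = Date.now() - start;
    // elapsed is ~300ms (the slowest request), not ~400ms (the sum).
    return { results: results, elapsed: elapsed };
});
```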
Yes, it's something I'm experimenting with at the moment. You can select JSDOM (https://github.com/tmpvar/jsdom), which is able to handle JS, as an alternative parser. Set the following two options: