

Use Node.js to Extract Data from the Web - johnrobinsn
http://www.storminthecastle.com/2013/08/25/use-node-js-to-extract-data-from-the-web-for-fun-and-profit/

======
STRML
Don't forget streams, the more `node.js` way to parse HTML:

    
    
        var request = require('request');
        var tr = require('trumpet')();
    
        // Stream the inner HTML of matching elements out as the page streams in:
        tr.createReadStream('article > span').pipe(process.stdout);
        // Pipe the raw HTML *into* trumpet, not into the selector stream.
        request.get('http://www.echojs.com').pipe(tr);
    
    
    

That's it! See [https://github.com/substack/node-trumpet](https://github.com/substack/node-trumpet) and their tests for more.

~~~
kanzure
And then there's hyperquest because maybe you want to do more than five
simultaneous requests:

[https://github.com/substack/hyperquest](https://github.com/substack/hyperquest)

~~~
ssafejava
True - you can also disable the globalAgent or change the number of pooled
connections. Connection pooling was generally a bad idea (tm) in Node and
afaik will be removed in the near future.

------
zenocon
I've done a considerable amount of scraping; if you're poking around at nicely
designed web pages, node/cheerio will be nice, but if you need to scrape data
out of a DOM mess with quirks and iframes w/in iframes and forms buried 6
posts deep (inside iframes with quirks), I'd use PhantomJS + CasperJS. Having
a real browser sometimes makes a difference.

~~~
techaddict009
Does this help with scraping websites which provide data via jQuery? I mean,
does it render the JavaScript on the page?

~~~
klibertp
Yes. It interprets and executes JS like a real browser would. Which is nice.
For Python: [http://jeanphix.me/Ghost.py/](http://jeanphix.me/Ghost.py/)

------
nodesocket
Have you played around with node.io?
[https://github.com/chriso/node.io](https://github.com/chriso/node.io)

Encapsulates all this functionality in an easy-to-use interface.

~~~
httpteapot
Last commit 3 months ago. Do you know if this project is still alive?

~~~
nacs
Haven't used node.io but 3 months isn't that old.

Also, if you check the issues page for the project
([https://github.com/chriso/node.io/issues](https://github.com/chriso/node.io/issues)),
the author seems to be responding to open issues; his latest comment was a
month ago.

------
nostrademons
There're also Node.js bindings for Gumbo if folks want HTML5 compliance:

[https://github.com/karlwestin/node-gumbo-parser](https://github.com/karlwestin/node-gumbo-parser)

It might be interesting if someone were to implement a Cheerio-like API on top
of that, as Cheerio has a nicer API but Gumbo's parser is more spec-compliant.

------
aroman
Cheerio is really, really awesome. I've used it to build a fairly
sophisticated web-scraping backend that wraps my school's homework website and
re-exposes/augments it via node/mongo/backbone/websockets.

There are definitely some bugs in cheerio if you're looking to do some really
fancy selector queries, but for the most part it's extremely performant and
pleasant to use.

If anyone is interested in seeing what a sophisticated, parallelized usage of
cheerio looks like, feel free to browse through the app I mentioned above
-- it's open source:
[https://github.com/aroman/keeba/blob/master/jbha.coffee](https://github.com/aroman/keeba/blob/master/jbha.coffee)

------
victorhooi
Hmm, interesting.

I'm also looking at doing a web-scraping project with Node.js.

I was going to go with CasperJS
([http://casperjs.org/](http://casperjs.org/)), which seems fairly active and
is based on PhantomJS.

Their quickstart guide is actually creating a scraper:

[http://docs.casperjs.org/en/latest/quickstart.html](http://docs.casperjs.org/en/latest/quickstart.html)

However, I'm wondering how this (Cheerio) compares - anybody have any
experiences?

------
premasagar
See also [http://noodlejs.com](http://noodlejs.com) for a Node-based web
scraper that also handles JSON and other file formats.

It was initially built as a hack project to replace a core subset of YQL. (I
helped to guide an intern at my company Dharmafly, Aaron Acerboni, when he
built it).

------
dfrodriguez143
I like to use the readability API so I don't need to deal with the HTML of
every single site. I made an example here:
[http://danielfrg.github.io/blog/2013/08/20/relevant-content-blog-crawler/](http://danielfrg.github.io/blog/2013/08/20/relevant-content-blog-crawler/)

------
chatman
Isn't scrapy easier to use than this?

~~~
hackula1
Cheerio is really easy for anyone familiar with jQuery (most node.js devs I
would imagine).

------
mholt
This is cool... if the content is structured. (Ever tried finding addresses in
arbitrary text? Much harder: [http://smartystreets.com/products/liveaddress-api/extract](http://smartystreets.com/products/liveaddress-api/extract))

~~~
babby
Come on, that's not really a scraping problem; it's more of a text-parsing
problem coupled with an API lookup or scrape to verify the address.

Though, I'd probably just google for some good address regexes, match them
against pages, then throw each address into something like
maps.google.com/?q=[address] and try to scrape whatever normally pops up for
a valid result. It also helps if you're expecting addresses to be in a certain
country.
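
A rough sketch of that regex-matching step (the pattern below is illustrative
only; it catches just the tidiest US-style addresses, and real pages are far
messier):

```javascript
// Matches "<number> <words> <street suffix>" -- nowhere near exhaustive.
var addressRe = /\b\d+\s+(?:[A-Za-z]+\s+)+(?:St|Street|Ave|Avenue|Rd|Road|Blvd|Dr|Ln|Lane)\b/g;

var text = 'Visit us at 123 Main St or write to 456 Oak Avenue, Springfield.';
var found = text.match(addressRe);

console.log(found.join(' | ')); // "123 Main St | 456 Oak Avenue"
```

Each hit would then be fed to the maps lookup to filter out false positives.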

------
greenido
Similar to what I wrote a week ago:
[http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-with-nodejs/](http://greenido.wordpress.com/2013/08/21/yahoo-finance-api-with-nodejs/) :)

------
tommoor
I run an API that could help with this type of thing where the page includes
microformats (a surprising number of pages do):
[http://pagemunch.com](http://pagemunch.com)

------
shospes
We also used cheerio and node.js and built a click & extract interface around
it: [http://www.site2mobile.com/](http://www.site2mobile.com/).

~~~
garyjob
Interesting. I encountered the same set of problems last year when working on
two side projects, and ended up building a web-scraping service with a
point-and-click interface on top: [https://krake.io](https://krake.io)

------
level09
Here is how I like to do it:

    
    
      from pyquery import PyQuery as pq
      # Fetch and parse the page, then query it with CSS selectors.
      doc = pq(url='http://google.com')
      print(doc('#hplogo'))

------
tectonic
Remember to use SelectorGadget
([http://selectorgadget.com](http://selectorgadget.com)) to help generate your
CSS selectors.

------
zerni
nice!

I wrote a web crawler with node.js myself last year. It's only a quick
attempt, but you can find the worker class here:
[https://gist.github.com/zerni/6337067](https://gist.github.com/zerni/6337067)

Unfortunately jsdom had a memory leak, so the crawler died after a while...

~~~
cheeaun
If you want to fix the memory leak, I remember you need to do `window.close()`
after the job is done.

~~~
zerni
thanks mate!

