

Pro scraping with Node.JS - chrisohara
https://github.com/chriso/node.io

======
simonw
I was intrigued to see what CSS selector engine it was using...

<https://github.com/chriso/node.io> uses <https://github.com/harryf/node-soupselect>

<https://github.com/harryf/node-soupselect> is a port of my
<https://github.com/simonw/soupselect> library for Python

<https://github.com/simonw/soupselect> is a port of my getElementsBySelector
function for JavaScript:
<http://simonwillison.net/2003/Mar/25/getElementsBySelector/>

I'm always surprised to see that code still being used - it's the least
complete selector library out there by a long way.

~~~
chrisohara
Hi Simon, great libs - thanks! There have been many improvements added to
node-soupselect and node.io, though - the API is here:
[https://github.com/chriso/node.io/wiki/API---CSS-Selectors-a...](https://github.com/chriso/node.io/wiki/API---CSS-Selectors-and-Traversal-methods)

~~~
simonw
Oh nice - the .rawtext and .striptags methods are particularly useful.

------
marcusramberg
<http://mojolicio.us> is way better for this kind of stuff. Here's the
synopsis example redone using Mojo:

    $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'

~~~
chrisohara
The one liner is cool, but I guarantee that node.js's non-blocking IO will
outperform Perl any day of the week. Try scraping thousands of pages at once
using Perl..
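A minimal, self-contained sketch of the non-blocking IO argument (this is not node.io's or Mojolicious's actual API; `fakeFetch`, `scrapeAll`, and the delays are made up for illustration): when requests don't block each other, scraping N slow pages takes roughly the time of the slowest request, not the sum of all of them.

```javascript
// Stand-in for an HTTP GET that resolves after `delayMs` milliseconds.
function fakeFetch(url, delayMs) {
  return new Promise((resolve) =>
    setTimeout(() => resolve('body of ' + url), delayMs));
}

// All requests are in flight at once; none blocks the others,
// so total wall time is ~50ms here, not ~150ms.
async function scrapeAll(urls) {
  return Promise.all(urls.map((u) => fakeFetch(u, 50)));
}

scrapeAll(['http://a.example', 'http://b.example', 'http://c.example'])
  .then((bodies) => console.log(bodies.length)); // prints 3
```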

~~~
marcusramberg
mojolicious is using a non-blocking async runloop as well =)

~~~
harryf
The problem you'd have with anything that represents a page as some kind of
graph is you have to construct the whole tree before you can start doing
anything with it. The API largely precludes streams. Callbacks would be
possible but some of the conditional CSS selectors need a complete knowledge
of the page before they can be resolved.

So while GET-ting pages to scrape can benefit from async IO, you're
effectively "blocked" while scraping pieces out of the page itself.
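This point can be illustrated with a toy sketch (the `matchLastChild` helper is hypothetical, not from any of the libraries discussed): a streaming matcher for a conditional selector like `:last-child` can never confirm a match mid-stream, because each newly arriving sibling invalidates the previous candidate.

```javascript
// Siblings arrive one at a time, as they would from a streaming parser.
// A candidate for `:last-child` is only confirmed once the whole list
// has arrived - i.e. once the parent element closes.
function matchLastChild(siblings) {
  let candidate = null;
  for (const el of siblings) {
    candidate = el; // each later sibling invalidates the previous candidate
  }
  return candidate; // known only after the stream of siblings ends
}

console.log(matchLastChild(['<li>a</li>', '<li>b</li>', '<li>c</li>']));
// prints '<li>c</li>'
```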

------
thibaut_barrere
Really interesting, thanks! This will probably be the first thing I use for
real projects in node.js.

Does anyone know how it compares to, say, Nokogiri or Hpricot, both in terms
of speed and in terms of ability to handle crappy HTML?

------
chrisohara
This is in response to all the node/jsdom/jquery scraping posts that are
popular lately. JSDom is hopeless for scraping - try parsing some slightly
malformed HTML..

~~~
DTrejo
Hey Chris, I was just trying to share a few things I'd learned. I know I
haven't done as much scraping as others (like yourself and richcollins). Glad
I've helped get some discussion going :)

~~~
chrisohara
Hey David, it wasn't a stab at your blog - your post was great - anything that
builds some more interest in node.js is positive :) I just hate reading about
people having trouble with JSDom and putting it down to the node platform.
JSDom is an excellent parser, it just fails miserably when you feed it
malformed HTML, and as we know, a majority of the internet falls in to this
category. I needed a framework that could scrape anything on the web so I
built it myself

~~~
tmpvar
just to be clear, jsdom is not a parser. By default it uses node-htmlparser,
which is not very lenient.

Have you tried using Aria's html5 parser? I hear it works better with
malformed markup.

