

Scraping Web Pages With jQuery, Node.js and Jsdom - liamk
http://liamkaufman.com/blog/2012/03/08/scraping-web-pages-with-jquery-nodejs-and-jsdom/

======
sophacles
Somewhat tangential, but here is something I have long thought would be a very
useful project, though unfortunately I haven't had the time to build it:

It is a scraping/crawling tool suite, based on WebKit, with good scriptable
plugin support (not just JS, but exposing the DOM to other languages too). It
would consist of a few main parts.

1) What I call the Scrape-builder. This is essentially a fancy web browser,
but it has a rich UI that can be used to select portions of a web page and
expose the appropriate DOM elements, along with how to find those elements in
the page. By expose, I mean put into some sort of editor/IDE - it could be raw
HTML, or some sort of description language. In the editor, the elements one
wants to scrape can then be selected and put into some sort of object for
later processing. This can include some form of scripting to mangle the data
as needed. It can also include interactions with the JavaScript on the page,
recording click macros (well, event firing and such). The point of this
component is to allow content experts and non- or novice programmers to easily
arrange for the "interesting" data to be selected for scraping.

2) The second component of the suite is a scraping engine. It uses the
description + macros + scripts from the Scrape-builder to actually pull data
from the pages and turn them into data objects. These objects can then be put
on a queue for later processing by backend systems/code. The scraping engine
is basically a stripped-down WebKit without the rendering/layout/display bits
compiled in. It just builds the DOM and executes the page's JavaScript to
ultimately scrape the bits selected. It is driven by the spidering engine.

3) The spidering engine is what determines which pages to point the scraping
engine at. It can be fed by external code, or by scripts from the scraping
engine as a feedback mechanism (some links on a page may be part of a single
scraping, some may just be fodder for a later scraping). It can be thought of
as a work queue for the scraping engines.

The main use cases I see for this are specialized search and aggregation
engines, which want to get at the interesting bits of sites that don't expose
a good API, or where the data may be formatted but is hard to semantically
infer without human intervention. Sure, it wouldn't be as efficient from a
code-execution point of view as, say, custom scraping scripts, but it would
allow for much faster response times to page changes and better use of
programmer time, by taking care of a lot of the boilerplate or
almost-boilerplate parts of scraping scenarios.
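
A rough sketch of the spidering/scraping loop in Node, just to illustrate the
feedback mechanism - the queue, seed URL and scrapePage() are all hypothetical
names, and scrapePage() stands in for the whole scraping engine:

    // Hypothetical skeleton: a work queue feeding scrape jobs, where each
    // job can push newly discovered URLs back onto the queue.
    var queue = ['http://example.com/start'];   // seeded by external code
    var seen = {};

    function scrapePage(url, done) {
      // Placeholder for the scraping engine: fetch the page, build the DOM,
      // run the Scrape-builder's selectors/macros, return data + new links.
      done(null, { data: {}, links: [] });
    }

    function next() {
      var url = queue.shift();
      if (!url) return;                 // queue drained
      if (seen[url]) return next();     // skip pages we've already visited
      seen[url] = true;

      scrapePage(url, function (err, result) {
        if (!err) {
          // Hand result.data to the backend, and feed discovered links
          // back into the spidering queue.
          result.links.forEach(function (link) { queue.push(link); });
        }
        next();
      });
    }

    next();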

~~~
baudehlo
You just described Kapow. Or Selenium if you want a cheaper alternative.

~~~
sophacles
I didn't know about Kapow, thanks for the pointer! I think Selenium could be
worked into this, but it is not all the way there just yet...

------
lancefisher
The problem I've had with using jsdom for scraping web pages is that it is
not very forgiving of bad HTML. There are so many pages in the wild with
malformed HTML, and jsdom just pukes on them. I started using Apricot [1],
which uses HtmlParser [2], and that has been better. I'd like to hear what
others are using to scrape bad web pages.

[1] <https://github.com/silentrob/Apricot>

[2] <https://github.com/tautologistics/node-htmlparser>
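
For reference, the tolerant-parsing route with node-htmlparser [2] looks
roughly like this (the sloppy markup string is just a made-up example):

    var htmlparser = require('htmlparser');

    var rawHtml = '<p>Unclosed <b>tags and other sloppy markup';

    // DefaultHandler builds a DOM-like tree and tolerates bad markup
    // instead of throwing.
    var handler = new htmlparser.DefaultHandler(function (err, dom) {
      if (err) {
        console.error(err);
      } else {
        console.log(require('util').inspect(dom, false, null));
      }
    });

    var parser = new htmlparser.Parser(handler);
    parser.parseComplete(rawHtml);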

~~~
gikrauss
A couple of months ago I posted a very similar article that uses node.js,
request and jQuery to achieve the same goal. In this case, as long as jQuery
can handle the response you shouldn't have any problem with malformed HTML...
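
For anyone who hasn't seen that pattern, the request + jsdom + jQuery combo
looks roughly like this (the URL, jQuery version and selector are only
illustrative):

    var request = require('request'),
        jsdom = require('jsdom');

    request({ uri: 'http://example.com' }, function (err, response, body) {
      if (err || response.statusCode !== 200) return console.error(err);

      // Hand the raw HTML to jsdom and inject jQuery so we get selectors.
      jsdom.env(body, ['http://code.jquery.com/jquery-1.7.1.min.js'],
        function (errors, window) {
          if (errors) return console.error(errors);
          var $ = window.jQuery;
          $('a').each(function () {
            console.log($(this).attr('href'));
          });
        });
    });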

~~~
gikrauss
I forgot the link: <http://blog.devartis.com/2012/01/05/scraping-websites-having-fun-with-node-js-and-jquery/> :S

------
lopatin
Good overview. I've been a fan of node.io for Node scraping for a while. Lots
of stuff is built in, and you don't lose your jQuery selectors.

~~~
liamk
Thanks! Node.io looks excellent! As a commenter on the parent article pointed
out, it handles errors, is multi-threaded and lets you use jQuery.

------
nchuhoai
<http://nokogiri.org/>

Perfectly fine use of CSS selectors, and more.

------
tcarnell
Thanks for sharing. This is a bit off-topic, but if you are interested in
scraping web pages, you might find <http://cQuery.com> an interesting
solution; it uses CSS selectors (much like jQuery) as its mechanism to
extract content from live web pages.

------
prestonparris
I created a dumb little script using this technique that lets you read Hacker
News in the terminal and then opens the story in your browser.

<https://github.com/prestonparris/node-hackernews>

~~~
liamk
It's worth noting that Hacker News seems to temporarily block IP addresses if
too many requests are made. I'm not sure whether the limit is per minute or
per hour, but my IP was blocked three times while playing around with a
similar script.
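
If you hit that, spacing the requests out is usually enough. A trivial sketch
with a fixed pause (the 5-second delay is a guess, not HN's actual limit):

    var request = require('request');

    var urls = [
      'http://news.ycombinator.com/',
      'http://news.ycombinator.com/newest'
    ];

    // Fetch one URL at a time with a pause in between, instead of firing
    // every request at once.
    function fetchNext(i) {
      if (i >= urls.length) return;
      request(urls[i], function (err, res, body) {
        if (err) console.error(err);
        else console.log(urls[i], res.statusCode);
        setTimeout(function () { fetchNext(i + 1); }, 5000);
      });
    }

    fetchNext(0);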

------
MatthewPhillips
I'm confused - is the window object from jsdom a live object? Can I scrape
interactive sites?

Isn't PhantomJS already perfect for scraping? What is the advantage of this,
exactly?

~~~
baudehlo
PhantomJS has much larger overhead (it runs a full browser). Plus it doesn't
give you access to the full Node.js ecosystem (e.g. access to databases,
etc.). You can use Node's PhantomJS driver, but it spawns a child process to
do the work, and how all the interactions fit together seems a little
complicated.

And of course you can scrape interactive sites - interactive sites still
basically just use HTTP to request data. Just watch the network window in
Chrome's developer tools, figure out which of the HTTP requests the site makes
are the ones you're interested in, and then code them into request() calls.
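
In other words, once you've spotted the endpoint in the network tab you can
hit it directly. A sketch with a made-up endpoint and parameters:

    var request = require('request');

    // Hypothetical JSON endpoint seen in Chrome's network window.
    request({
      url: 'http://example.com/api/items',
      qs: { page: 1, sort: 'new' },  // query parameters the page itself sends
      json: true                     // ask request to parse the JSON body
    }, function (err, response, body) {
      if (err) return console.error(err);
      // body is already a parsed object here - no DOM or jQuery needed.
      console.log(body);
    });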

~~~
MatthewPhillips
Phantom doesn't need access to Node APIs. Separation of concerns: run your
Phantom script to scrape, then pipe the results to your Node script for
processing. I've worked on two scraping projects using this method and it
works great.
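
A minimal sketch of that split - a hypothetical scrape.js run under PhantomJS,
with its output piped into whatever Node script does the processing (e.g.
phantomjs scrape.js | node process.js):

    // scrape.js -- run with: phantomjs scrape.js
    var page = require('webpage').create();

    page.open('http://example.com', function (status) {
      if (status !== 'success') {
        console.log('failed to load page');
        return phantom.exit(1);
      }

      // evaluate() runs inside the page, after the site's own JS has run.
      var titles = page.evaluate(function () {
        return Array.prototype.map.call(
          document.querySelectorAll('h2'),
          function (el) { return el.textContent; }
        );
      });

      // Write JSON to stdout; the Node side just reads and parses it.
      console.log(JSON.stringify(titles));
      phantom.exit();
    });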

~~~
baudehlo
Sure, it can work great if your needs are simple. But if you're posting data
to forms - data you need to pull from a database, without knowing until
runtime what you'll need from the DB - it can get a bit more complex.

Also worth noting that jsdom is quite slow, and has a strict HTML parser. If
you want something faster that will cope with more web pages, look up cheerio.
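
For reference, cheerio keeps the jQuery-style selectors but skips the full DOM
and script execution, which is where the speed comes from. A tiny sketch (the
markup is made up):

    var cheerio = require('cheerio');

    // cheerio parses forgivingly and never executes scripts.
    var $ = cheerio.load('<ul><li class="item">one<li class="item">two</ul>');

    $('li.item').each(function () {
      console.log($(this).text());
    });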

~~~
MatthewPhillips
Remember that Phantom just runs a web browser. So writing a Phantom script is
like writing an app on top of another app. This means you can use
XMLHttpRequest or WebSockets to communicate with a back-end if that's
necessary.

~~~
baudehlo
Yes that's what we ended up doing with Kapow. Just felt... dirty :)

~~~
tszming
Kapow? You mean this company:
<http://kapowsoftware.com/solutions/content-migration/index.php>? Mind sharing
your experience?

------
mistercow
I just did this recently. It works great except when the pages you're scraping
have JS errors on them.

------
danso
OK, one thing I'm confused about... what's the advantage of scraping with
Node/jQuery over a traditional scripting language like Ruby with Nokogiri or
Mechanize?

It's true that this process won't render an AJAX-heavy page the way your
browser will, but I've found that if you do some web inspection of the page to
determine the addresses and parameters of the backend scripts, then you don't
even have to pull HTML at all. You just hit the scripts directly and feed them
parameters (or use Mechanize, if cookie/state tracking is involved).

~~~
baudehlo
Partly the advantage is that you have CSS selectors to examine your document
(I don't know if Ruby/Mechanize does that - I'm just saying what is good about
Node + jQuery), and you have a language that all web developers know. So it's
about minimising the friction of moving from front-end web work to scraping
work. At my company this gives us a financial advantage - we can hire basic
jQuery web developers to work on our scrapers.

