
Web scraping with Node.js - dandrewsen
http://baudehlo.wordpress.com/2012/06/05/web-scraping-with-node-js/
======
chrissnell
This makes me just a bit nervous. You're scraping bank websites using a
headless WebKit browser, which is presumably vulnerable to future exploits.
You have my username and password (and probably verification questions) either
stored on or accessible from that same server. Who's to say that one of the
sites you crawl won't get compromised and used as a vector to compromise your
crawler box and--potentially--your customers' banking credentials?

------
niggler
I found casperjs (<http://casperjs.org/>) to be a pleasant framework to work
with.

~~~
mistercow
Even just PhantomJS is a dream compared to Node.js

~~~
captainobv
You're kidding, right?

I've got heaps of experience with both, and I cringe every time I have to
touch the phantomjs API.

It feels like a half-assed imitation of node's, and for the most part isn't
even internally consistent. For example, you can render to a file, or to a
Base64 string. But heavens no, you can't render to stdout -- the file type is
decided by the file name, so /dev/stdout is out of the question and the only
workaround is making a pointless symlink. That's not to mention the
showstopper bugs with it being literally impossible to exit() the process from
inside a script in certain cases.

Don't get me wrong; phantom is awesome and it's great at what it does. But
it's not "a dream compared to Node.js".

~~~
mistercow
Keep in mind that I am talking specifically about the context of web scraping.
For any other application, I would choose node in a heartbeat.

------
lopatin
This article should have mentioned node.io
(<https://github.com/chriso/node.io>) for completeness. It hasn't been updated
in a while and I'm not sure if other frameworks have popped up since, but it's
been a pleasure to use for some big scraping tasks.

------
runningbread
I wonder how they get around the two-factor authentication problem? Even if I
give my password to the scraper, an extra credential would be required. How do
you work around that?

------
ilaksh
Isn't it almost always against the terms of service to scrape content off of
websites?

~~~
mistercow
There are plenty of legitimate uses for web scraping. For example, say you
have a client with a hundred static HTML pages that need to be converted and
imported into a new CMS. You _could_ go and copy the content by hand from each
page.

Or you could write a script that scrapes the site and pulls the content out
automatically. That will probably save you time right off the bat. And if the
client realizes that they want another piece of information pulled from each
page, you just make a minor tweak to your script and rerun it.
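
A minimal sketch of that kind of batch-conversion script (the markup, the
selector patterns, and the `extractContent` helper are all hypothetical; a
real project would use an HTML parser such as cheerio rather than regexes):

```javascript
// Pull the title and body out of a static page. The page structure
// assumed here (an <h1> title and a <div id="content"> body) is made up
// for illustration -- adjust the patterns to the client's actual markup.
function extractContent(html) {
  var title = /<h1[^>]*>([\s\S]*?)<\/h1>/i.exec(html);
  var body = /<div[^>]*id="content"[^>]*>([\s\S]*?)<\/div>/i.exec(html);
  return {
    title: title ? title[1].trim() : null,
    body: body ? body[1].trim() : null
  };
}

// Example input standing in for one of the hundred static pages:
var page = '<html><body>' +
           '<h1>About Us</h1>' +
           '<div id="content"><p>Hello.</p></div>' +
           '</body></html>';

console.log(extractContent(page));
// → { title: 'About Us', body: '<p>Hello.</p>' }
```

The point of the comment above is exactly this shape: when the client asks
for one more field per page, you add one more pattern and rerun the script
over all hundred files.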

~~~
mahmud
"Scraping" implies remote, possibly "unauthorized" access. For your example of
static HTML files, converting them to some other format might not be scraping
as we understand it; plenty of editors and browsers have options to dump the
data in some other format.

Once you remove the requirement for remote, unauthorized access, every data
transformation process becomes "scraping".

------
salmanapk
>There’s no way to download resources with phantomjs – the only thing you can
do is create a snapshot of the page as a png or pdf. That’s useful but meant
we had to resort back to request() for the PDF download.

That's not a "problem": you shouldn't be using WebKit to download files.

