Hacker News
Web scraping with Node.js (baudehlo.wordpress.com)
34 points by dandrewsen 1755 days ago | 11 comments

This makes me just a bit nervous. You're scraping bank websites using a headless WebKit browser, which is presumably vulnerable to future exploits. You have my username and password (and probably verification questions) either stored on or accessible from that same server. Who's to say that one of the sites you crawl won't get compromised and used as a vector to compromise your crawler box and--potentially--your customers' banking credentials?

I found casperjs (http://casperjs.org/) to be a pleasant framework to work with.

Even just PhantomJS is a dream compared to Node.js

You're kidding, right?

I've got heaps of experience with both, and I cringe every time I have to touch the phantomjs API.

It feels like a half-assed imitation of node's, and for the most part isn't even internally consistent. For example, you can render to a file, or to a Base64 string. But heavens no, you can't render to stdout -- the file type is decided by the file name, so /dev/stdout is out of the question and the only workaround is making a pointless symlink. That's not to mention the showstopper bugs with it being literally impossible to exit() the process from inside a script in certain cases.
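The symlink workaround described above can be sketched like this (the `render.js` script name and its `page.render('snapshot.png')` call are assumed for illustration; the phantomjs invocation is left commented out):

```shell
# phantomjs picks the output format from the file name's extension, so
# you can't pass /dev/stdout directly. Workaround: make a symlink whose
# name carries the right extension but which points at stdout.
ln -sf /dev/stdout snapshot.png

# Then, assuming render.js calls page.render('snapshot.png'), the PNG
# bytes land on stdout and can be redirected or piped:
# phantomjs render.js > capture.png
```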

Don't get me wrong; phantom is awesome and it's great at what it does. But it's not "a dream compared to Node.js".

Keep in mind that I am talking specifically about the context of web scraping. For any other application, I would choose node in a heartbeat.

This article should have mentioned node.io (https://github.com/chriso/node.io) for completeness. It hasn't been updated in a while and I'm not sure if other frameworks have popped up, but it's been a pleasure to use for some big scraping tasks.

I wonder how they get around the two-factor authentication problem. Even if I give my password to the scraper, an extra credential would still be required. How do you work around that?

Isn't it almost always against the terms of service to scrape content off of websites?

There are plenty of legitimate uses for web scraping. For example, say you have a client with a hundred static HTML pages whose content needs to be migrated into a new CMS. You could go and copy the content by hand from each page.

Or you could write a script that scrapes the site and pulls the content out automatically. That will probably save you time right off the bat. And if the client realizes that they want another piece of information pulled from each page, you just make a minor tweak to your script and rerun it.
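A minimal sketch of that batch-conversion idea: pull the title and the contents of a known wrapper element out of each static page. The markup shape (a `<div id="content">` wrapper) is an assumption for illustration; a real site would need its own selectors, and a proper parser like cheerio would be more robust than regexes.

```javascript
// Extract the pieces we want to migrate from one static HTML page.
// Assumes each page wraps its main copy in <div id="content"> -- a
// hypothetical convention for this sketch.
function extractContent(html) {
  var title = (html.match(/<title>([\s\S]*?)<\/title>/i) || [])[1] || '';
  var body = (html.match(/<div id="content">([\s\S]*?)<\/div>/i) || [])[1] || '';
  return { title: title.trim(), body: body.trim() };
}
```

If the client later wants another field pulled from each page, you add one more pattern to this function and rerun it over the whole directory.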

"Scraping" implies remote, possibly "unauthorized" access. For your example of static HTML files, converting them to another format isn't really scraping as we understand it; plenty of editors and browsers can already dump a page's data in other formats.

Once you remove the requirement for remote, unauthorized access, every data transformation process becomes "scraping".

>There’s no way to download resources with phantomjs – the only thing you can do is create a snapshot of the page as a png or pdf. That’s useful but meant we had to resort back to request() for the PDF download.

That's not a "problem"; you shouldn't be using WebKit to download files.
