
Pjscrape: A web-scraping framework written in JS using PhantomJS and jQuery - jamesjyu
http://nrabinowitz.github.com/pjscrape/?utm_source=twitterfeed&utm_medium=twitter
======
weego
I use PhantomJS + jQuery for my own scraping/testing engine, so I thought I
would chuck in some extra information for those not familiar with PhantomJS.

The one thing that sets PhantomJS apart is that it is a full headless WebKit
browser rather than just an HTML parsing engine, which is what most other
solutions are. The big win is that you can scrape and test Comet or
JavaScript-heavy apps without having to mock the polling or the
submits/responses.
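
To illustrate the difference, here is a minimal sketch (not from the original comment) using Python's standard-library `html.parser`. The page markup, the `ListItemCollector` class, and the `static_list_items` helper are all made up for illustration. It shows what a parse-only tool sees: just the markup the server sent, not anything a script would add once the page runs in a real browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the second list item only exists after the script
# has run in a real browser engine.
PAGE = """
<ul id="items">
  <li>static item</li>
</ul>
<script>
  var li = document.createElement("li");
  li.textContent = "item added by JavaScript";
  document.getElementById("items").appendChild(li);
</script>
"""


class ListItemCollector(HTMLParser):
    """Collects the text of every <li> element present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Script bodies also arrive here, but never while inside an <li>.
        if self.in_li:
            self.items[-1] += data


def static_list_items(html):
    """Return the <li> texts a parse-only scraper would find."""
    collector = ListItemCollector()
    collector.feed(html)
    return [item.strip() for item in collector.items]


print(static_list_items(PAGE))
```

The parser never executes the script, so the JavaScript-added item is missing from the result. A headless browser would run the script first and expose the resulting DOM.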

I run it like a bot controlled by Node.js, with NowJS sending commands to it
and it returning the results of tests, though I believe there are plans to get
process-to-process communication working, which would make controlling it and
pushing data out easier.

~~~
robterrell
I, too, use a Node.js server to control multiple PhantomJS processes. There's a
patch that lets your script read from stdin -- last weekend I modified it to
support my platform's preferred line ending. I also added commands for
mousemove/mousedown/mouseup; they stuff actual mouse events into the Qt event
queue, so you don't have to worry about the edge cases where JavaScript-faked
mouse events fail.

<https://github.com/robterrell/phantomjs>
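
For readers unfamiliar with the pattern, here is a rough sketch of a controller driving a worker process over stdin, one command per line. It is written in Python rather than Node.js, and the worker is a stand-in Python script, not PhantomJS. The command strings, the `ok` reply format, and the `run_commands` helper are all hypothetical and are not the protocol used by the patch above.

```python
import subprocess
import sys

# Stand-in worker (an assumption for this sketch): it acknowledges each
# command line it reads. A real setup would launch a PhantomJS script
# that reads commands from stdin instead.
WORKER_SOURCE = r"""
import sys
for line in sys.stdin:
    command = line.strip()
    if command == "quit":
        break
    print("ok " + command, flush=True)
"""


def run_commands(commands):
    """Send one command per line to the worker and collect one reply each."""
    worker = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    replies = []
    try:
        for command in commands:
            worker.stdin.write(command + "\n")
            worker.stdin.flush()  # make sure the worker sees the line now
            replies.append(worker.stdout.readline().strip())
        worker.stdin.write("quit\n")
        worker.stdin.flush()
    finally:
        worker.stdin.close()
        worker.wait()
    return replies


if __name__ == "__main__":
    # Hypothetical command names, loosely modelled on the mouse commands
    # mentioned in the comment.
    for reply in run_commands(["mousemove 10 20", "mousedown", "mouseup"]):
        print(reply)
```

Keeping the exchange strictly one line in, one line out means the controller never blocks waiting for a reply the worker has not flushed.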

------
bryanh
While this is awesome, anyone who needs to do much the same thing on a Python
stack should look at pyquery as an alternative.

------
davej
Has anybody used PhantomJS with a client-side testing framework? I'd be very
interested in hearing experiences.

------
ma2rten
I am also working on a jQuery/JS scraping framework of my own. I think this is
the way to go, because no library is used more than jQuery to extract data
from HTML. It also enables you to scrape JS code on the page.

I have used node + jsdom so far. I will have a look at PhantomJS.

------
camwest
How does Pjscrape handle logins, SSL, and redirects?

~~~
jqueryin
PhantomJS recently closed a pull request on some basic patching to support
SSL.

There was also a fix for self-signed or invalid certs:

<https://github.com/ariya/phantomjs/pull/40>

------
AltIvan
If it does what you guys say it does... you are full of awesome!

