
Web Scraping with Node.js and Chimera - dandrewsen
http://www.deanmao.com/2012/08/13/enter-chimera/
======
kanzure
Great project. The biggest question for me when I'm using phantomjs is why
phantomjs is trying to replicate nodejs infrastructure. For example, phantomjs
has an HTTP server feature for processing incoming requests. This doesn't make
sense to me because a browser shouldn't be a server. If you need to get
information out of the worker, you should POST it somewhere. The proclivity of
phantomjs users to prefer stdout is astounding. It's definitely the #1
question or issue that I field in #phantomjs on freenode.

For example, for POSTing and reading from redis/resque I wrote this (proof of
concept, not what's in production):

<https://gist.github.com/000037f472b72d9490a6>
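
Separately from that gist (its contents are not reproduced here), the "POST it somewhere" idea can be sketched in plain Node.js. This is only an illustration: the collector URL, the payload shape, and the `buildPost`/`postResult` names are all hypothetical, and a phantomjs worker would use its own APIs rather than Node's `http` module.

```javascript
// Sketch only: a worker reports its result by POSTing JSON to a collector
// instead of printing to stdout. The collector URL and the payload shape
// are hypothetical.
const http = require('http');
const { URL } = require('url');

// Build the request options and body for a JSON POST (pure, easy to test).
function buildPost(collectorUrl, result) {
  const url = new URL(collectorUrl);
  const body = JSON.stringify(result);
  return {
    options: {
      method: 'POST',
      hostname: url.hostname,
      port: url.port || 80,
      path: url.pathname + url.search,
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(body),
      },
    },
    body,
  };
}

// Send the result; the callback receives (err, statusCode).
function postResult(collectorUrl, result, callback) {
  const { options, body } = buildPost(collectorUrl, result);
  const req = http.request(options, (res) => {
    res.resume(); // drain the response so the socket is released
    res.on('end', () => callback(null, res.statusCode));
  });
  req.on('error', callback);
  req.end(body);
}
```

A worker would call something like `postResult('http://localhost:8080/results', { title: '...' }, cb)` once it has scraped what it needs, so the consumer never has to parse a stdout stream.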

A few thoughts..

> There are similar "glues" like phantomjs-node that integrate phantomjs by
> spawning a process, and processing the stdout stream, but it is limited by
> what can be done via the command line of phantomjs. If you really want direct
> api access to the browser, the best way is via direct integration.

This seems like a lot of overhead on top of a phantomjs (or even just a
generic webkit) worker. Substack's approach was to just put a proxy in front
of a browser that injects a <script> tag into the page to boss the browser
around:

<https://github.com/substack/schoolbus>
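
To make the injection step concrete, here is a rough sketch of the HTML rewrite such a proxy might apply to each page before handing it to the browser. It is not taken from schoolbus; `injectScript` and the control-script URL are hypothetical names, and a real proxy would also have to handle streaming, compression, and content types.

```javascript
// Sketch only: the HTML rewrite a script-injecting proxy might apply.
// The control-script URL passed in is hypothetical and is not escaped here.
function injectScript(html, scriptSrc) {
  const tag = '<script src="' + scriptSrc + '"></script>';
  const headClose = /<\/head\s*>/i;
  const bodyClose = /<\/body\s*>/i;
  // Prefer inserting just before </head>, then </body>, else append.
  // A replacer function is used so "$" in the tag is never treated specially.
  if (headClose.test(html)) return html.replace(headClose, (m) => tag + m);
  if (bodyClose.test(html)) return html.replace(bodyClose, (m) => tag + m);
  return html + tag;
}
```

The injected script is then what "bosses the browser around", which is why the choice of browser behind the proxy matters less with this approach.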

Supposedly the actual browser client shouldn't matter, as long as your fleet
of workers are up and running. I bet chimera's approach will end up with more
access to npm modules in the long run compared to phantomjs.

Also, the link wasn't in the article:
<https://github.com/deanmao/node-chimera>

For the python equivalent of this project, there's
<https://github.com/kanzure/pyphantomjs>

------
niggler
Did we really reach the point where demonstration code can be presented in
coffeescript without an equivalent javascript demo?

~~~
chc
People have been doing that for a surprisingly long time. I think it was
encouraged by the fact that CoffeeScript is immediately intelligible to a lot
of JavaScript coders, and authors reckon that anyone who has trouble can just
compile it and get fairly idiomatic JavaScript.

I actually wrote a bookmarklet a while back that would look for CoffeeScript
snippets on a page and translate them for people who find it troublesome, but
didn't end up doing much with it because I didn't feel like there was much
interest.

~~~
niggler
criticisms of coffeescript aside (e.g. interstitial whitespace sensitivity),
the code can't directly be applied in node (as far as i can tell, you can't
tell node to run a coffeescript file directly using the `node <script>` syntax
-- you have to use a framework or compile to js).

~~~
funkiee
How is that really different toolchain-wise from having to compile source code
in a language such as Java?

~~~
niggler
remember: java is to javascript as pain is to painting.

it's not an apt comparison. You can directly write javascript and run it in
your browser or in node (or test it out interactively in the node REPL).

I like to play with modules in node interactively (when relevant) because it's
easier to see what's going on and much easier to iterate (esp. in conjunction
with the .load REPL command)
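
As a small illustration of that workflow (the file name `scratch.js` and the `slugify` function are made up for this example), you keep a snippet in a file, edit it, and re-load it into a running REPL session:

```javascript
// scratch.js (hypothetical file name): a snippet to iterate on in the REPL.
function slugify(title) {
  const lowered = title.toLowerCase();
  // Collapse every run of non-alphanumeric characters into a single dash.
  const dashed = lowered.replace(/[^a-z0-9]+/g, '-');
  // Trim any leading or trailing dashes left over.
  return dashed.replace(/^-+|-+$/g, '');
}

// In the Node REPL:
//   > .load scratch.js
//   > slugify('Enter Chimera!')
```

After each edit to the file, running `.load scratch.js` again redefines the function in the same session.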

------
fruchtose
Great work! This might even have potential for browser-based testing, since
mocha-phantomjs runs from an executable; I'd prefer a code-based solution,
like Chimera integrated with Mocha.

------
lancefisher
This is a great idea! phantomjs-node works okay, but it is such a hack. A
nifty hack, but still.
<https://github.com/sgentle/phantomjs-node#how-does-it-work>

If you want to parse the DOM for the internet at large, you need a real
browser. There are simply too many sites with really bad HTML to be parsed
reliably with anything else.

~~~
catch23
If you need a great html parser, I also wrote a library for one of those:
<https://github.com/deanmao/node-hubbub>

It's merely an integration library for the netsurf browser. It's the html
parser for a real browser. I considered using the parser from other browsers
like firefox or webkit, but netsurf had the fewest external dependencies.

------
seanlinehan
This looks really great. I would love to see a bit more of an in-depth example
in a follow-up post!

Is more documentation to come?

~~~
catch23
Sure, I can make more. This post caught me by surprise. It's odd to see your
own blog post from months ago on HN.

I sorta gave up on the project seeing as nobody other than myself used it. (I
have zero watchers on github)

~~~
detst
A README is the absolute bare minimum for a GitHub project. Even a lack of
code can get some interest in an idea. Someone landing on the GitHub page will
reflexively look for the back button if they haven't seen the blog post.

Great work!

~~~
catch23
I've added a readme now. Will add more soon.

------
Trindaz
Has anyone actually gotten this working? I've tried installing on Mac OS X and
Ubuntu, both with various problems. The precompiled binaries don't work, the
qt build scripts fail, etc. etc.

------
rco8786
Is there more documentation (or source) available somewhere?

~~~
KaoruAoiShiho
It's on github, but with 6 months since the last commit and no documentation...

~~~
catch23
I can make more. After about 3 months with no github watchers, I figured the
only person reading the docs would be myself.

~~~
KaoruAoiShiho
Thanks for adding docs, this should be a better alternative than phantom or
jsdom.

------
mcantelon
Similar project: <https://github.com/LearnBoost/tobi>

~~~
CoffeeDregs
Sorta. The joy of PhantomJS (and of Chimera) is that they use a real browser
to run the JS/CSS/HTML/whatever. No simulated DOM; no simulated cookies; etc.
Just a real [headless] WebKit browser with all of its quirks and tricks. You
can even take screenshots of a real, rendered webpage (which is great for
debugging).

~~~
mcantelon
>You can even take screenshots of a real, rendered webpage (which is great for
debugging).

That is pretty cool.

------
nodemaker
Testing (<http://bartaz.github.com/impress.js/#/bored>)

------
goldfeld
How does it differ from ZombieJS?

~~~
foxbarrington
Zombie doesn't use a "real" browser, but a simulated DOM Window using jsdom.

------
booz
wow this is exactly what I need, thanks!

