

Spidering the web with CasperJS - morphics
http://planzero.org/blog/2013/03/07/spidering_the_web_with_casperjs

======
kolektiv
I'm trying to work out what you'd want this for. I suppose as a quick "yes
this does actually work" but I can't think of many less efficient ways of
spidering things - whole headless browsers? Ouch.

If you needed to do this, you'd use something that already exists - if you
just wanted to do it in Node, for some particular reason, why not just use
plain requests and an HTML parser?

I don't mean this to be overly critical, but just in case anyone's thinking
this would be a good idea - there are better ones. Just to learn from, fair
enough.
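
For anyone comparing the two approaches, here is a minimal sketch of the
"plain requests plus a parser" idea, written in Python with only the standard
library. It is not from the linked article. The names `LinkExtractor`,
`extract_links` and `fetch` are hypothetical, and it assumes you only need the
links in static HTML (no script execution, which is exactly what a headless
browser adds).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch(url):
    """Download a page as text; assumes UTF-8, which real pages may not honour."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A spider would call `extract_links(fetch(url), url)` in a loop over a queue of
unvisited URLs. Nothing here runs page scripts, so it only sees what the server
sends in the initial HTML.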

~~~
morphics
Good questions!

For this particular project I needed access to more than just the page itself
- I needed to interact with scripts and have access to the page resources on a
level which matched that of a browser. What you're seeing is only the tip of a
much larger iceberg, for which efficiency wasn't the main objective.

You're correct that you can write more efficient spiders if you don't want to
do anything fancy (indeed, I've written simple scrapers and spiders in various
languages), but this project lent itself better to a full browser
environment.

~~~
kolektiv
Ah! Thanks for clarifying, that makes significantly more sense. I hope you
didn't take my querying as too unkind!

------
ankimal
Can you provide some insight into performance? In my experience, a single
page can take between 5 and 10 seconds to scrape.

~~~
BaconJuice
Is this really scraping the data or just checking for links and getting
status?

