

Web crawling and downloading ebooks with phantomJS - gillyb
http://debuggerstepthrough.blogspot.co.il/2012/06/having-fun-web-crawling-with-phantomjs.html

======
inDesperateZone
What is the benefit of using phantomJS in this case? I understand that it is
very useful if content is dependent on JS running.

But that doesn't seem to be the case here. With Python I would have used a
parser like lxml or BeautifulSoup (and I'm sure there is something comparable
for JS) coupled with Requests async methods. That would probably not only end
up with shorter and more concise code, but also be a lot faster.
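For pages that don't need JS, that pipeline can be sketched with nothing but the Python standard library (html.parser standing in here for lxml/BeautifulSoup, and a thread pool standing in for Requests' async methods; the URLs and the stubbed fetch() are made up so the sketch runs offline):

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Grabs the text of the page's <title> element."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def fetch(url):
    # Stubbed out for the example; real code would do something like
    # urllib.request.urlopen(url).read().decode()
    return "<html><head><title>Page for %s</title></head></html>" % url

def scrape(url):
    parser = TitleParser()
    parser.feed(fetch(url))
    return parser.title

urls = ["/book/1", "/book/2", "/book/3"]
# The thread pool gives cheap concurrency across many pages,
# which is where most of the speedup over a headless browser comes from.
with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(scrape, urls)))  # ['Page for /book/1', ...]
```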

~~~
mistercow
The script appears to rely on jQuery (which is presumably already included in
the pages being scraped in this case). If you're already familiar with using
jQuery for DOM manipulation, then using it for scraping is incredibly easy.

One advantage is that it's not always instantly obvious if you'll need JS to
execute before you can scrape a page. If you start out with a simple HTML
parser and then find out that you needed the JS to run first, you're going to
have to start over. If you start out using phantomJS and then find out that
you don't need any of the original JS to run, your script still works.

~~~
rb2k_
When you say 'rely on jQuery', I think it would be more precise to say that it
relies on CSS selectors. Most libraries will actually provide you a way to
access an HTML parse tree in that way.

I guess phantomjs is as good a tool as any, but there is really no need to
evaluate JavaScript for a bit of plain HTTP+HTML parsing.

~~~
mistercow
It looks to me like it's using jQuery specific functions. It could be done
with a simpler selector engine, but in this case, it looks pretty clear that
it's either jQuery or a compatible library like Zepto.

~~~
rb2k_
As far as I see the only line with selectors is:

> return [ $($('h2 a')[0]).attr('title'), $($('h2 a')[1]).attr('title') ];

which is two CSS selectors plus picking an element. That is pretty much
covered by all of the available HTML parser libraries.
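To illustrate: the quoted line selects all 'h2 a' elements, takes the first two, and reads their title attributes. That behavior can be reproduced without a JS engine at all; here is a rough sketch using only Python's stdlib html.parser (in practice a library like lxml's cssselect or BeautifulSoup's select would accept the CSS selector string directly):

```python
from html.parser import HTMLParser

class H2AnchorTitles(HTMLParser):
    """Rough stand-in for $('h2 a'): collects the title attribute
    of every <a> nested inside an <h2>."""
    def __init__(self):
        super().__init__()
        self._h2_depth = 0
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._h2_depth += 1
        elif tag == "a" and self._h2_depth:
            self.titles.append(dict(attrs).get("title"))
    def handle_endtag(self, tag):
        if tag == "h2" and self._h2_depth:
            self._h2_depth -= 1

def first_two_titles(html):
    p = H2AnchorTitles()
    p.feed(html)
    return p.titles[:2]   # like indexing [0] and [1] in the jQuery version

html = ('<h2><a title="Next chapter">next</a></h2>'
        '<p>body text</p>'
        '<h2><a title="Previous chapter">prev</a></h2>')
print(first_two_titles(html))  # ['Next chapter', 'Previous chapter']
```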

~~~
mistercow
It's CSS selectors and then wrapping the DOM element again in a function to
give it an `attr` method, which is jQuery style. Other libraries may use that
syntax too, but I'm pretty sure it started with jQuery (and if not, was
certainly popularized by it).

------
smoyer
This technique is certainly useful in a variety of instances and I've done the
same thing with both HTMLUnit and JWebUnit in Java. The "great site you know
of" appears to be filled with books that are copyrighted and for-profit so I'm
not sure you'd really want to publicize what you're doing on your blog.

~~~
lince
In my country there is no problem in sharing copyrighted content as long as
you do not receive economic profit. Also, I found the example very practical
as an introduction.

~~~
smoyer
"In my country there is no problem in sharing copyrighted ..."

I guess that means that each e-book only has to be sold once in your country?
The U.S. laws might be overly protective of IP, but that's an interesting
problem for publishers who in theory need to earn a profit if they're going to
continue as entities as well as for authors who need to feed their families.

This obviously wasn't a problem when the books were printed on dead trees,
because you'd only share copies that had been purchased, and if your friend
was reading your book you no longer had access to it. Curiously, I could rent
my copy of a book to you in the U.S. without violating copyright laws.

~~~
lince
"I guess that means that each e-book only has to be sold once in your
country?"

No.

It means that you can read a book and, if you really like it, you can buy it.
It means that you can discover new authors, topics and so on without a huge
investment.

This can sound demagogic: I have never had enough money to buy the books I
wanted, nor to waste it trying to discover new books and topics. But with
downloaded books I learned about tech and other fields. Eventually, I bought
more books (a lot from the U.S.) than if I had not discovered these topics.

Allowing private sharing (as long as there is no profit) and supporting
authors are not in direct confrontation. In my humble opinion and personal
experience, they are correlated.

~~~
smoyer
I was not passing judgement on either you or your country's laws ... And I
think the ability to try a book out before you buy it is important to the
market. I think there are a lot of us who spend time in bookstores simply for
this reason.

"Private sharing" and especially recommendations are also my favorite ways to
find worthwhile books.

~~~
lince
No problem, smoyer. I answered in the first person because I thought that my
personal experience could be an interesting answer.

Recommendations are great once you both have some favorite books in common.
Because of that, I always check Amazon's "Other people also bought"
section.

------
veverkap
You could also use the CasperJS wrapper and have the script automatically
download those files for you.

See <http://casperjs.org/api.html#casper.download>

~~~
gillyb
Wow! I just looked at the CasperJS API, and it looks amazing!! Tons of great
utilities to help you work with the DOM. This could be great, since I would
like to implement the JSON wire protocol for PhantomJS, and using CasperJS
will be of great help for that task! :)

------
malandrew
If you like PhantomJS, be sure to also check out CasperJS. I use it with
jQuery, Underscore and Underscore.string.

I just wish that jQuery had support for XPath style selectors as well.
Chainable XPath would be hella sweet.

------
zdwalter
PhantomJS + CasperJS make crawling easy. I built <http://sp.iderman.info> to
make scraping easier.

------
radagaisus
Phantom is awesome. I tried to use it for testing, but it's too slow (10
seconds for one test). Anyone else tried it? Any tips?

~~~
wslh
Is using Google Chrome faster?

~~~
radagaisus
What do you mean? QUnit testing? Yes, always.

------
er354yerty
Wouldn't Node be helpful here?

~~~
petsounds
<https://github.com/chriso/node.io>

looking forward to being able to distribute jobs across multiple machines

