

Using Perl to scrape the web - geoscripting
http://ssscripting.blogspot.com/2009/12/using-perl-to-scrape-web.html

======
kunley
Mechanize for Perl & Ruby is quite cool.

However, it's not able to execute JavaScript. The only library I found that
does so with a reasonable subset of JS is HttpUnit, in Java. Though it has a
kind of ugly interface IMO, I use it with success. Driving it from a Clojure
REPL makes it quite a handy tool for web scripting.
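
For what it's worth, a minimal WWW::Mechanize sketch of the kind of scraping
being discussed (this assumes the CPAN module is installed; the URL and the
/next/ link pattern are only illustrations, not from the article):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;   # CPAN module, not core Perl

# autocheck => 1 makes every request die on HTTP errors.
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://example.com/');      # fetch the page
print $mech->title, "\n";               # contents of <title>

# Collect the absolute URLs of links whose text matches /next/i
# (example.com has none, so this loop prints nothing there):
for my $link ( $mech->find_all_links( text_regex => qr/next/i ) ) {
    print $link->url_abs, "\n";
}
```

Note that, as the comment says, none of this executes JavaScript: Mechanize
only sees the HTML the server sends.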

~~~
draegtun
To use JavaScript, try the CPAN module WWW::Selenium
(<http://search.cpan.org/dist/Test-WWW-Selenium/>).

~~~
kunley
Yeah, I considered it, but Selenium uses Firefox, doesn't it? And I needed a
self-contained script without such dependencies.

~~~
draegtun
No, Selenium works with all major browsers.

------
Freebytes
Perl would certainly be my language of choice for screen scraping. And some
people see it almost as stealing. I know of people who look at Google
negatively for its news indexing method. I have skimmed through a book (though
I do not remember the name) that seemed to claim that Google profits only from
the work of others. (I think they add value in their aggregation of
information, though.) Nonetheless, you must be careful not to create a
backlash (or legal issue) with screen scraping. The irony is that one of the
best targets for screen scraping content for your own benefit may be Google
itself... however, it almost seems like they encourage it. (They want you to
use their APIs instead, though.)

Spidering has been around for a long time, and people act like screen scraping
is new. It is really the same thing that has existed for years. If you are
going to do it, though, Perl is certainly the way to go. It is fast,
efficient, and robust.

~~~
pbhjpbhj
Theft requires that the taking denies the current owner access to or use of
whatever was taken. Copyright infringement appears to be what you're referring
to.

Google IMO is more of a symbiont than a parasite.

~~~
Freebytes
You are correct, and I agree. It is not theft, and Google is really a huge
collection of mitochondria... helping the fledgling Internet become something
more by combining it with an intellectual powerhouse. I could not have said it
any better myself.

------
mahmud
Perl has been the language of choice for web spidering since 1999, when it
replaced REBOL for that purpose. Hint: the LWP module made such a dent in the
industry that nobody was able to replace it until the last year or two, when
other things started popping up.
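
A sketch of the kind of fetch-and-grep task LWP made routine (the URL is
illustrative, and the title regex is deliberately quick and dirty):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);   # part of the libwww-perl (LWP) distribution

# get() returns the response body on success, undef on failure.
my $html = get('http://example.com/') or die "fetch failed";

# Quick-and-dirty title extraction; for real work reach for a parser.
my ($title) = $html =~ m{<title>(.*?)</title>}is;
print "$title\n";
```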

------
martian
This line seems problematic:

    use strict;

Most of the web is messy. Beautiful Soup and its ilk would seem like a better
choice for parsing.

~~~
gloob
Ahem.

<http://www.perl.com/doc/manual/html/lib/strict.html>
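
To spell out the confusion: use strict governs Perl's own compile-time
discipline (undeclared variables, symbolic references, barewords); it says
nothing about how strictly the HTML you scrape must be parsed. A core-Perl
illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A my-declared variable compiles fine under strict:
my $ok = eval 'my $x = 1; $x + 1';
print defined $ok ? "declared: ok ($ok)\n" : "declared: error\n";

# An undeclared global is a compile-time error under strict, so the
# string eval returns undef and leaves the complaint in $@:
my $bad = eval 'use strict; $undeclared + 1';
print defined $bad ? "undeclared: ok\n" : "undeclared: error\n";
```

The first eval prints "declared: ok (2)"; the second prints
"undeclared: error", because strict refuses the undeclared global.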

~~~
martian
Ouch, should have RTFM. Thanks for the pointer.

