

Web Scraping with Modern Perl (Part 2 - Speed Edition) - creaktive
http://blogs.perl.org/users/stas/2013/02/web-scraping-with-modern-perl-part-2---speed-edition.html

======
wslh
Are you a hardcore web scraper? Then check <https://blog.databigbang.com> for
articles on topics such as scraping sites with JavaScript, browserless OAuth,
and implementing your own rotating proxies.

Disclosure: I am the author, but the site is helping thousands of people and
saving them time with its code and examples.

~~~
mguterl
Your site never seems to load for me.

~~~
creaktive
Same here :(

~~~
wslh
You mean it's slow? It is on AWS.

~~~
creaktive
Figured it out; the problem was the HTTPS. <http://blog.databigbang.com/> is
fine :)

~~~
wslh
Sorry, it was my fault. Too many HTTPS sites lately.

------
kimmel
Here are some resource lists:

20 Perl libraries for fetching web content -
<http://neilb.org/reviews/http-requesters.html>

An SO wiki page for HTML scraping -
<http://stackoverflow.com/questions/2861/options-for-html-scraping>

~~~
creaktive
Also, YADA, the concurrent fetcher featured in the article, has extensive
benchmarks against many Perl WWW user agent libraries:
<https://metacpan.org/module/AnyEvent::Net::Curl::Queued#BENCHMARK>
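For anyone curious what the queue-and-wait style looks like, here is a minimal
sketch along the lines of YADA's synopsis (the URLs are placeholders, and the
exact callback methods are as I recall them from the docs, so double-check
against the metacpan page above):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use YADA;  # sugar on top of AnyEvent::Net::Curl::Queued

# Queue a batch of URLs and fetch them concurrently over libcurl;
# the callback fires as each transfer completes.
YADA->new->append(
    [qw[
        http://www.cpan.org/
        http://www.perl.org/
    ]] => sub {
        my ($self) = @_;
        print $self->final_url, " done\n";
    },
)->wait;
```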

------
devcom
Shameless on-topic plug here. Some readers might be more familiar with Ruby.

Multi-threaded web scraping over the Tor network with Ruby -

<http://devcomsystems.com.au/2013/01/multi-thread-mechanize-using-multiple-tor-circuits-for-web-scraping/>

------
calufa
Take a look at Tales.

Tales is a block-tolerant web scraper that runs on top of AWS and Rackspace.
Tales is designed to be easy to deploy, configure, and manage.

<https://github.com/calufa/tales-core>

~~~
hercynium
I dunno... the tales install script seems to want to take over whatever
account it's run as, going so far as to _modify ~/.ssh/config_. That _alone_
gives me pause... then it requires you have a github account?

And it needs mysql, redis and mongo??!?

Oh, and of course, I'll need an aws/rackspace account...

If I need all that for a web-scraper it better be for a _big_ project.

The CPAN modules in the linked article can all be installed and run as a non-
privileged user (via either local::lib or perlbrew, etc.). And there are no
daemons, nothing running as root, nothing listening on any ports, no
configuration or tuning to think about, and it'll work everywhere from my
macbook to my dev-server running linux or BSD or Solaris or _whatever_.

BTW, Mojolicious (<http://mojolicio.us>) is really great stuff. Outside of
having a reasonably up-to-date version of Perl (5.10.1 or higher), it's got no
external dependencies, not even other CPAN modules. It's fast, flexible, easy
to use, and easy to deploy just about anywhere. sri++
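To illustrate the zero-dependency point: Mojolicious ships its own DOM parser
with CSS selectors, so a quick scrape needs nothing beyond the one distribution.
A small sketch using Mojo::DOM on an inline HTML fragment (no network involved;
the fragment is made up):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;  # bundled with Mojolicious, no other CPAN deps

# Parse a small HTML fragment and pull out link targets
my $html = '<ul><li><a href="/part1">Part 1</a></li>'
         . '<li><a href="/part2">Part 2</a></li></ul>';
my $dom = Mojo::DOM->new($html);

# CSS selectors, jQuery-style chaining
my @links = $dom->find('a[href]')->map(attr => 'href')->each;
print "$_\n" for @links;  # prints /part1 then /part2
```

Swapping Mojo::DOM for Mojo::UserAgent gives the same selector API against a
live page, which is what Part 1 of the article demonstrates.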

~~~
creaktive
Mojolicious is featured more extensively in Part 1 of my article:
<http://news.ycombinator.com/item?id=5159452> ;)

