

Scraping made easy with jQuery and SelectorGadget (and Node.js!) - DTrejo
http://blog.dtrejo.com/scraping-made-easy-with-jquery-and-selectorga

======
lamby
It's neat using jQuery for this, but I've found the arduous part of scraping
isn't the actual parsing and extraction of data from your target page, but
rather the post-processing: working around incomplete data on the page,
handling errors gracefully, keeping on top of layout/URL/data changes to the
target site, not hitting your target site too often, logging into the target
site if necessary, respecting robots.txt, keeping users informed of scraping,
sane parallelisation of requests, and the general problems associated with
long-running background processes.

All tractable problems with standard solutions, but it's difficult to accept
the claim that the idea of using jQuery—which is still pretty neat IMO—now
makes scraping easy.
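Two of the concerns above (retrying gracefully, and not hitting the target
site too often) can be sketched in a few lines of Node.js. This is a
hypothetical illustration, not code from the article: `politeFetch` and its
options are invented names, and the real HTTP client is injected as
`doRequest`.

```javascript
const lastHit = {}; // host -> timestamp of the most recent request

// Rate-limits requests per host and retries failures with exponential
// backoff. The real HTTP client is passed in as doRequest(url).
async function politeFetch(url, doRequest, opts = {}) {
  const { minDelayMs = 1000, retries = 3 } = opts;
  const host = new URL(url).host;
  // wait until at least minDelayMs has passed since the last hit to this host
  const wait = Math.max(0, (lastHit[host] || 0) + minDelayMs - Date.now());
  if (wait > 0) await new Promise(r => setTimeout(r, wait));
  lastHit[host] = Date.now();
  for (let attempt = 0; ; attempt++) {
    try {
      return await doRequest(url);
    } catch (err) {
      if (attempt >= retries) throw err; // give up after N retries
      // exponential backoff before the next attempt
      await new Promise(r => setTimeout(r, minDelayMs * 2 ** attempt));
    }
  }
}
```

A robots.txt check or per-host request queue would slot into the same wrapper.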

------
mmaunder

      perl -MLWP::UserAgent -e 'map { $_ =~ s/<a href="([^"]+)">([^<]+)<\/a><span class="comhead">([^<]+)<.+?<span id=[^>]*>(\d+ points)/print "$1 $2 $3 $4\n" if($i++ < 3)/ge } LWP::UserAgent->new->get("http://news.ycombinator.com/")->content;'

~~~
xtacy
It would be much more difficult to write complex scrapers using just regexes,
which is why methods like the one posted above scale well with complexity.

For example, if you wanted to scrape comments on HN and get a tree-like data
structure, regexes would be much more difficult to write and maintain!
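To make the tree-structure point concrete: once some parse step has yielded
each comment as a flat (depth, text) pair, nesting them is a simple stack
walk. A hypothetical Node.js sketch follows; the flat input shape is assumed
for illustration, not HN's real markup.

```javascript
// Turn a flat list of { depth, text } records into a nested comment tree.
// Each node's parent is the nearest earlier record with a smaller depth.
function buildTree(flat) {
  const root = { depth: -1, text: null, children: [] };
  const stack = [root];
  for (const { depth, text } of flat) {
    const node = { depth, text, children: [] };
    // pop until the top of the stack is this node's parent
    while (stack[stack.length - 1].depth >= depth) stack.pop();
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children; // the top-level comments
}
```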

~~~
mmaunder
It scales. I ran WorkZoo.com (Time Mag top-50 website of 2005; sold it the
same year), where we scraped over 500 job boards and aggregated the jobs into
a search engine. A team of devs developed and maintained the regex for each
board; I managed them and wrote the dev tools they used to develop the regex
for each site we scraped. It was incredibly effective and maintainable.

Incidentally, the dev tools I wrote were in JavaScript, and they created
regexes that we'd test in JavaScript and deploy in Perl. The two regex
engines are compatible enough that this worked.

Scraper abstraction is for people too lazy to learn regex. Get a good book on
regex, and learn how to use Perl's s/// regex with the 'e' modifier. It'll
change your life.
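For readers who haven't met it: s///e evaluates the replacement side of a
Perl substitution as code. The closest JavaScript analogue is
String.prototype.replace with a function callback, which is roughly how
JavaScript dev tools could exercise regexes destined for Perl. A small sketch
on canned HTML (invented sample data, not HN's real markup):

```javascript
// The replacement "side" of replace() can be a function, so arbitrary
// code runs on each match's captures, much like Perl's /e modifier.
const html =
  '<a href="http://a.example/">First</a>' +
  '<a href="http://b.example/">Second</a>';

const links = [];
html.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, (match, href, text) => {
  links.push({ href, text }); // side effect, like the code block in s///e
  return match;               // leave the string itself unchanged
});
```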

~~~
mishoo
... and later you begin to discover different, much better ways to solve
problems that you used to do with regexps, and your life will get back to
normality and happiness. ;-)

For a solid scraper in Perl I'd use HTML::TreeBuilder / HTML::Element.
Perhaps slower than regexes, but it does real parsing and understands
tag-soup HTML.

~~~
parasctr
These modules are very resource-intensive and become a bottleneck if you are
scraping at high volume; I had to stop using them in one of my tools for that
reason. For smaller jobs, however, they are awesome and much easier to use
and understand.

------
parasctr
I do a lot of scraping using Perl. Web::Scraper is an awesome tool:
<http://search.cpan.org/~miyagawa/Web-Scraper-0.32/lib/Web/Scraper.pm>

------
weixiyen
I also like this python scrape library: <http://arshaw.com/scrapemark/>

~~~
fmw
I use Scrapy: <http://scrapy.org/>. Another Python option:
<http://twill.idyll.org/>. For Ruby: <http://nokogiri.org/>, and for Perl:
<http://wwwsearch.sourceforge.net/mechanize/>.

------
wahnfrieden
Python's lxml can do CSS selectors. I've used lxml for scraping and find it
quite nice.

------
dazzla
I've tried PHP scraping many different ways and settled on
<http://simplehtmldom.sourceforge.net> . It uses jQuery style selectors as
well.

------
richcollins
I've been doing a _lot_ of scraping in node. I've had much better luck using
YUI + jsdom than jQuery + jsdom. Many pages would fail using jQuery, and it
also leaked memory like crazy.

~~~
chrisohara
I suggest taking a look at node.io (<https://github.com/chriso/node.io>), a
scraping framework written for Node.js. It uses htmlparser rather than jsdom
and scales nicely. It also has support for handling timeouts, retries, etc.

------
JoshCole
People using Clojure can get selector-based scraping with Enlive instead of
jQuery [1]. It also doubles as a templating library. I use it for templating
on my website and to scrape Hacker News, though the project the scraping is
for isn't ready for launch yet [2].

1: <https://github.com/cgrand/enlive>

2: <https://github.com/jColeChanged/mysite>

~~~
DTrejo
And here's a truly awesome tutorial for enlive (swannodette is a regular here
as well):

<http://github.com/swannodette/enlive-tutorial>

