

How to Scrape Websites in Ruby on Rails using scRUBYt - dmix
http://dmix.ca/2008/09/how-to-scrape-websites-in-ruby-on-rails-using-scrubyt/

======
ejs
When I did <http://zerodaydeals.com> (not mine anymore) the biggest pain was
scraping all the sites, since there were 50+ of them... and people would
constantly change layouts, breaking the scrapers.

I tried using plugins but ended up resorting to a pile of regular expressions
for each site. I wonder if this would be better, as I don't think it was around
at the time.
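
A minimal sketch of what a per-site "pile of regular expressions" can look
like, assuming one pattern per site stored in a hash. The site names, the
markup, and the `extract_price` helper are all made up for illustration; none
of them come from the original scraper.

```ruby
# Hypothetical per-site patterns. Site names and markup are invented.
SITE_PATTERNS = {
  "example-deals" => /<span class="price">\$(?<price>\d+\.\d{2})<\/span>/,
  "example-shop"  => /Price:\s*<b>\$(?<price>\d+\.\d{2})<\/b>/
}.freeze

# Returns the price as a Float, or nil when the site's pattern no longer
# matches (for example, after a layout change).
def extract_price(site, html)
  match = SITE_PATTERNS.fetch(site).match(html)
  match && match[:price].to_f
end

extract_price("example-deals", '<span class="price">$19.99</span>') # a Float
extract_price("example-deals", '<div class="cost">$19.99</div>')    # nil
```

Returning nil instead of raising makes a broken site easy to detect and skip,
which matters when 50+ patterns can each break independently.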

~~~
nreece
[ _shameless plug_ ] Our startup - Feedity - (<http://feedity.com>) provides
custom RSS feeds for virtually any webpage, which helps many small and
medium-sized online services with data integration.

------
jfarmer
So, when I was creating Adonomics I considered using scRUBYt for scraping
Facebook. Here's why I didn't go with it:

1\. It was hard to get scRUBYt to learn the "correct" rules. It tends to be
over-specific or over-broad.

2\. It was slow. Really slow. Using Ruby Mechanize was at least 2-3x faster,
and even that was pretty slow.

3\. The learner doesn't like bad HTML, but as a practical matter you have to
deal with poor markup all the time. scRUBYt makes it hard to get to the guts
of the system.

YMMV.

~~~
lethain
I wrote a similar tutorial a while back using Python and BeautifulSoup
([http://lethain.com/entry/2008/aug/10/an-introduction-to-comp...](http://lethain.com/entry/2008/aug/10/an-introduction-to-compassionate-screenscraping/)).
BeautifulSoup doesn't learn in any sense of the word, but it plays very nicely
with malformed (even extraordinarily malformed) HTML, and you can usually do
things in a way that is resistant to changes (a combination of tag and
id|class is usually fairly resistant to non-drastic changes).
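
A rough sketch of why tag plus id|class holds up better than a positional
path. This is not BeautifulSoup: it uses Ruby's bundled REXML, which (unlike
BeautifulSoup) needs well-formed markup, so treat it purely as an illustration
of the selector idea. The two page versions and the `text_at` helper are
invented.

```ruby
require "rexml/document"

# Two made-up versions of the same page. The redesign adds a banner div
# and wraps the price in a <p>, which shifts every positional path.
OLD_PAGE = <<~HTML
  <html><body>
    <div id="deal"><span class="price">19.99</span></div>
  </body></html>
HTML

NEW_PAGE = <<~HTML
  <html><body>
    <div id="banner"><span>Sale!</span></div>
    <div id="deal"><p><span class="price">19.99</span></p></div>
  </body></html>
HTML

# Returns the text of the first node matching xpath, or nil if none matches.
def text_at(page, xpath)
  node = REXML::XPath.first(REXML::Document.new(page), xpath)
  node && node.text
end

POSITIONAL       = "/html/body/div[1]/span[1]" # depends on exact layout
BY_TAG_AND_CLASS = "//span[@class='price']"    # depends on tag + class only

text_at(OLD_PAGE, POSITIONAL)       # finds the price
text_at(NEW_PAGE, POSITIONAL)       # now lands on the banner text instead
text_at(NEW_PAGE, BY_TAG_AND_CLASS) # still finds the price
```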

------
michaelneale
I took a look at scRUBYt and it looked nice, but I ended up just using
Hpricot - fast, and pretty easy. I would just have one screen with the site
open in Firebug, grab the XPath expressions from it, slap them in a Ruby
string and then put in placeholders for the parameters.

Only minutes of work.
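
A minimal sketch of the "XPath in a string with placeholders" approach. It
uses Ruby's bundled REXML rather than Hpricot so it runs without extra gems;
the markup, the template, and the `lookup` helper are assumptions made up for
illustration, not the code described above.

```ruby
require "rexml/document"

# XPath as it might be copied from a browser inspector, with named
# placeholders where the per-request parameters go (hypothetical template).
FIELD_XPATH = "//div[@id='%{section}']/span[@class='%{field}']"

# Invented sample page; REXML requires it to be well-formed.
DOC = REXML::Document.new(<<~HTML)
  <html><body>
    <div id="deal-1"><span class="title">Widget</span><span class="price">19.99</span></div>
    <div id="deal-2"><span class="title">Gadget</span><span class="price">4.50</span></div>
  </body></html>
HTML

# Fills the placeholders and returns the matching node's text, or nil.
# Values are inserted unescaped, so they must not contain quote characters.
def lookup(doc, section:, field:)
  xpath = format(FIELD_XPATH, section: section, field: field)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

lookup(DOC, section: "deal-2", field: "price") # text of the second deal's price
lookup(DOC, section: "deal-9", field: "price") # nil, no such section
```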

------
nickvn7
Nice work, the script is a lot simpler than I thought.

