

Ask HN: How to aggregate product info from other websites - jreilly

Any input on the best way to aggregate product information from various websites would be much appreciated. Most of the websites I would like to aggregate lack any APIs that I could use to track prices and things of that nature.

I have zero experience with web scraping of any kind, so any direction would be helpful. Before I start digging, I figured HNers might have some invaluable advice.
======
astrec
Python & Beautiful Soup (<http://www.crummy.com/software/BeautifulSoup/>) are
your friends here.
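
If it helps, here's roughly what that looks like (a Python 2-era sketch to match the stdlib of the day; the URL and the "price" span are made up, so you'd inspect the real page to find the right element):

    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Fetch the page and let Beautiful Soup build a parse tree from it
    html = urllib2.urlopen('http://example.com/products/widget').read()
    soup = BeautifulSoup(html)

    # find() returns the first matching tag, or None if there isn't one
    price_tag = soup.find('span', {'class': 'price'})
    if price_tag:
        print price_tag.string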

~~~
jreilly
Anyone know of a similar library for Rails that works well?

~~~
rgrieselhuber
I'm pretty sure this (Hpricot) is the one most people use in Ruby:

<http://code.whytheluckystiff.net/hpricot/>

~~~
abijlani
Here's a great Hpricot tutorial:

<http://www.igvita.com/2007/02/04/ruby-screen-scraper-in-60-seconds/>

------
shabda
You might also want to: 1. Think about copyright law. 2. Make sure you do
not hit the site so often that you show up in their logs as a bandwidth hog
and get blocked.
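
On the second point, something as simple as sleeping between requests goes a
long way (a sketch; the five-second pacing is an arbitrary example, and you
should still check each site's terms and robots.txt):

    import time
    import urllib2

    urls = ['http://example.com/products/1',
            'http://example.com/products/2']
    for url in urls:
        html = urllib2.urlopen(url).read()
        # ... parse html here ...
        time.sleep(5)  # wait a bit so we don't hammer the server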

------
DenisM
I use Python to write scripts of this nature (one script so far :)).

Python has an SGML parser with a SAX-style event interface, and since HTML is
an SGML application it can be used here. Better than regexps any day.

Python's HTTP client library also supports cookies, so you can pretend to
have a "session" with your target website.

EDIT: the libraries are urllib2, sgmllib, cookielib
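
Roughly like this, if you want a sketch of both pieces together (the URL is a
placeholder, and the parser just collects link text to show the shape of it):

    import urllib2, cookielib, sgmllib

    class LinkTextParser(sgmllib.SGMLParser):
        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            self.in_link = False
            self.links = []

        def start_a(self, attrs):  # attrs is a list of (name, value) pairs
            self.in_link = True

        def end_a(self):
            self.in_link = False

        def handle_data(self, data):
            if self.in_link:
                self.links.append(data)

    # A cookie-aware opener lets the site set session cookies on us
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

    html = opener.open('http://example.com/').read()
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    print parser.links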

~~~
olegp
Very few pages have well-formed markup. The few large scraping projects I've
seen started out with a markup-based approach and then switched to regular
expressions.
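
For what it's worth, the regex route tends to look like this (a sketch with a
made-up page and pattern; regexes tolerate broken markup, but they break when
the layout changes, so keep the patterns narrow):

    import re
    import urllib2

    html = urllib2.urlopen('http://example.com/products/widget').read()
    # Grab the number that follows a class="price" attribute
    match = re.search(r'class="price"[^>]*>\s*\$?([\d,.]+)', html)
    if match:
        print match.group(1)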

What experiences has everyone else had?

~~~
DenisM
Uhm. I thought any valid HTML is also valid SGML? Are you sure you're not
confusing it with XML markup?

~~~
ryanwaggoner
I think he's saying that the HTML-parsing approach only works when the HTML
is well-formed, and for most sites it isn't.

------
thwarted
See if the sites in question are part of an affiliate network, like Commission
Junction or LinkShare. They often provide plain-text feeds to affiliates
through these programs, and many of their terms of service allow you to set
up these kinds of services (although some have restrictions on mixing their
data with data from their competitors). However, I've found that even this
data isn't all that great, cleanliness-wise (sometimes you can't trust the
name of the product, the price, the link, or the SKU to even match the
website), and it isn't updated very often (product availability in particular
goes stale). But it's a hell of a lot easier than writing a custom parser for
each site's HTML (although when I was working on a project like this, I had
to write a custom parser for each feed in order to put them in a more
consistent format).
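
The per-feed parsers were basically small adapters that mapped each network's
columns onto one common record shape (a sketch; the file layouts and field
names here are invented, since every network's feed is different):

    import csv

    def parse_network_a_feed(path):
        # Hypothetical tab-delimited feed from one network
        for row in csv.DictReader(open(path), delimiter='\t'):
            yield {'sku': row['SKU'],
                   'name': row['NAME'],
                   'price': float(row['SALEPRICE'] or row['PRICE'])}

    def parse_network_b_feed(path):
        # Hypothetical pipe-delimited feed from another network
        for row in csv.DictReader(open(path), delimiter='|'):
            yield {'sku': row['product_id'],
                   'name': row['product_name'],
                   'price': float(row['retail_price'])}

    # Downstream code only ever sees the common shape
    for product in parse_network_a_feed('network_a.txt'):
        print product['sku'], product['price']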

------
petercooper
For Ruby, consider Scrubyt: <http://scrubyt.org/>

If you're wondering why, well, consider this script that "learns" how to
scrape Google results (from one supplied example of output data):

    google_data = Scrubyt::Extractor.define do
      fetch 'http://www.google.com/ncr'
      fill_textfield 'q', 'ruby'
      submit

      link "Ruby Programming Language" do
        url "href", :type => :attribute
      end

      next_page "Next", :limit => 2
    end

    puts google_data.to_xml

Reads almost like English in the scraping part!

------
aneesh
Perl's WWW::Mechanize module is a good choice for scraping & automating
website interactions.

~~~
qhoxie
Mechanize also has a Ruby port, since you are working with Rails.

------
tocomment
What are you trying to do exactly? It depends a lot on the type of data you're
trying to gather.

~~~
jreilly
I am basically trying to track the prices of certain products so I do not
have to check them by hand every once in a while.
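
In case it helps, that can stay pretty small: fetch each page, pull out the
price, and compare it to the last run (a sketch; the URL, the price pattern,
and the shelve file are all placeholder choices):

    import re
    import time
    import urllib2
    import shelve

    PRODUCTS = {'widget': 'http://example.com/products/widget'}

    db = shelve.open('prices.db')  # remembers prices between runs
    for name, url in PRODUCTS.items():
        html = urllib2.urlopen(url).read()
        match = re.search(r'class="price"[^>]*>\s*\$?([\d,.]+)', html)
        if match and db.get(name) != match.group(1):
            print '%s changed: %s -> %s' % (name, db.get(name), match.group(1))
            db[name] = match.group(1)
        time.sleep(5)  # be polite between requests
    db.close()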

~~~
ks
Have you considered looking at the price comparison sites? Some of them have
an API. Unless you plan to compete with them directly, you will save a lot of
time.

Example: <http://developer.yahoo.com/shopping/>
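
The shape of it is just an HTTP call and a few field lookups; this sketch
uses a completely hypothetical endpoint and response format, so check the
actual Yahoo docs for the real parameters:

    import json
    import urllib2

    # Hypothetical comparison-site API; the real query parameters
    # and response fields will differ
    url = ('http://api.example-shopping.com/v1/products'
           '?query=nikon+d80&appid=YOUR_APP_ID')
    data = json.loads(urllib2.urlopen(url).read())
    for product in data['products']:
        print product['name'], product['price']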

