

Ask HN: I want to scrape a web page... - bdclimber14

I'm a RoR developer and have a project where I need to scrape a webpage, specifically a university's course list page, to populate a database. I've never scraped a webpage before. Are there any pointers or guides anyone can recommend? I don't even know where to start.
======
DanielStraight
If you only need to scrape a single course list page from one university to
populate one database, your best bet is probably copy-paste with some tidying
up in your text editor of choice. If you need to do more, I would look into
Hpricot or Nokogiri, both Ruby libraries for parsing HTML. Then just download
the page content, pass it to your parser of choice, and go wild.
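Something like this is the basic shape with Net::HTTP and Nokogiri (an
untested sketch; the URL and CSS selector are made-up placeholders for
whatever the real page uses):

      require 'net/http'
      require 'uri'
      require 'nokogiri'

      # download the raw HTML (URL is a placeholder)
      html = Net::HTTP.get(URI.parse("http://www.example.edu/courses"))

      # parse it and pull out whatever elements hold the course data;
      # the selector is a placeholder too
      doc = Nokogiri::HTML(html)
      doc.css("li.course").each do |course|
        puts course.text.strip
      end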

~~~
bdclimber14
Well... it's a big university... the largest in the country, actually, so
there are thousands and thousands of classes. I think copying and pasting
would be time-prohibitive.

Is it possible to set up a crawler that would iterate through what could be
hundreds of pages with the same HTML structure, and download/parse each one?

~~~
janj
Yes, that is exactly what the crawler should do. I had no crawler experience a
few months ago but wanted to do the same thing. I didn't know Ruby, so I went
the Java route. Now it's trivially easy: just last night I wrote a crawler to
pull a bunch of info off hundreds of pages for my next app.

First, start with how you yourself would get at each page of data. If it all
starts from a head page, figure out how to get to each link, follow those
links, and repeat until you reach the data pages; then point your crawler at
the head page and code in those patterns. If the pages are not all linked from
a head page but share a URL pattern, figure out how to pull down the info
needed to fill in the URL pieces, then visit each page, filling in the pattern
with the data you now have.
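I can't speak to the Ruby specifics, but in rough Ruby the pattern looks
something like this (the URLs, link pattern, and selectors are all invented
placeholders):

      require 'net/http'
      require 'uri'
      require 'nokogiri'

      # approach 1: start at a head page and follow its links
      base = URI.parse("http://www.example.edu/departments")  # placeholder
      head = Nokogiri::HTML(Net::HTTP.get(base))

      head.css("a.department").each do |link|        # placeholder selector
        page_uri = URI.join(base.to_s, link["href"])
        doc = Nokogiri::HTML(Net::HTTP.get(page_uri))
        doc.css("li.course").each { |li| puts li.text.strip }
        sleep 1  # be polite to the server
      end

      # approach 2: no head page, but the URLs follow a pattern,
      # so just fill the pattern in directly
      (1..50).each do |page|
        uri = URI.parse("http://www.example.edu/courses?page=#{page}")
        doc = Nokogiri::HTML(Net::HTTP.get(uri))
        # ...same per-page parsing as above...
      end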

Beyond a sketch like that I can't give any real Ruby guidance, but if you have
general questions on web scraping you can shoot me an email.

~~~
bdclimber14
This is awesome... time to dive into hpricot!

------
mistermann
Most every option is documented here:
<http://stackoverflow.com/questions/2861/options-for-html-scraping>

You'll likely find this tool handy once you choose your framework:
<http://www.selectorgadget.com/>

Another quick and dirty way:
<http://googlemapsmania.blogspot.com/2008/10/data-scaping-wikipedia.html>

------
drats
I see people are saying to use Hpricot or Nokogiri, and you have gone on to
choose Hpricot. Not a good choice. Hpricot was the work of the hacker _why,
who has now disappeared, and even before he disappeared Nokogiri had overtaken
Hpricot in performance. He even tweeted "caller asks, “should i use hpricot or
nokogiri?” if you're NOT me: use nokogiri. and if you're me: well cut it out,
stop being me".

So please use Nokogiri; it's a great library and the only thing I really miss
from Ruby-land in Python.

------
tocomment
Just write yourself a Ruby script that downloads a webpage. Then parse the
downloaded content with Ruby's equivalent of BeautifulSoup (a forgiving HTML
parsing library).

That's pretty much all there is to it. You'll want to manually inspect the
HTML of the page you're scraping to find patterns: for example, all the
courses might be in <li> tags.
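For instance, once you've found the pattern, the extraction step might look
like this (the URL, selectors, and the Course model are all hypothetical;
adjust to the real markup and your schema):

      require 'net/http'
      require 'uri'
      require 'nokogiri'

      html = Net::HTTP.get(URI.parse("http://www.example.edu/courses"))  # placeholder URL
      doc  = Nokogiri::HTML(html)

      # selectors are placeholders; inspect the real page to find them
      doc.css("ul#course-list li").each do |li|
        number = li.at_css(".course-number")&.text
        title  = li.at_css(".course-title")&.text
        puts "#{number}  #{title}"
        # in a Rails app you'd instead save a record, something like:
        # Course.create!(number: number, title: title)
      end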

Let me know if you need more help, but I think the first step is to write a
program that downloads a web page.

~~~
bdclimber14
I've never written anything like this... Is there a function like cURL that
downloads the page as a string from a URL?

~~~
chrisa

      require 'net/http'
      require 'uri'

      # fetches the URL and returns the response body as a String
      output = Net::HTTP.get(URI.parse("http://www.google.com"))

~~~
seanmccann
curb is a lot faster than net/http:

      require 'curb'

      c = Curl::Easy.perform("http://www.google.com")
      puts c.body_str

~~~
bdclimber14
So apparently accessing a secure URL isn't straightforward... No login is
required, it's just HTTPS, but it won't load in curb.

~~~
bdclimber14
It wasn't an SSL problem but a cookie problem...
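For anyone else who hits this: curb can accept and resend cookies between
requests. A sketch, assuming curb's cookie options (the URL is a placeholder):

      require 'curb'

      c = Curl::Easy.new("https://www.example.edu/courses")  # placeholder URL
      c.enable_cookies = true       # accept and resend cookies automatically
      c.cookiejar  = "cookies.txt"  # libcurl writes cookies here
      c.cookiefile = "cookies.txt"  # and reads them back in
      c.perform
      puts c.body_str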

------
SIK
I recently did something similar, using Anemone to crawl the website and
Hpricot to scrape each individual page and add the results to the database.

Anemone is great because it can focus your crawl on only the URLs that match a
certain pattern, which really helps when you need to traverse a small portion
of a larger website (like a university site). You can also run specific
actions on pages that match a certain pattern.

For scraping, Anemone natively supports Nokogiri, so since you're starting
from a blank slate, it might be easiest to learn Nokogiri. Before discovering
Anemone, I had already written what needed to be done on each page in Hpricot,
so my code is a bit messy, but it's not that difficult to get Anemone and
Hpricot to work together.
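A minimal sketch of the Anemone-plus-Nokogiri route (the domain, the URL
pattern, and the selector are all placeholders):

      require 'anemone'

      Anemone.crawl("http://www.example.edu/") do |anemone|
        # only follow links that look like course pages (pattern is made up)
        anemone.focus_crawl do |page|
          page.links.select { |link| link.to_s =~ /courses/ }
        end

        # scrape every page whose URL matches the pattern;
        # page.doc is the Nokogiri document Anemone already parsed
        anemone.on_pages_like(/courses/) do |page|
          page.doc.css("li.course").each { |li| puts li.text.strip }
        end
      end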

