

Ask YC: What do you scrape? How do you scrape? - schaaf

Theory-wise, there's regular languages, context-free grammars, and combinatory categorial grammars ( http://openccg.sf.net ). But regular + lists seems adequate for most tasks.

What sorts of scraping do you find yourself doing?

What are your biggest frustrations?

What's the coolest hack you've encountered while scraping?

My cofounder and I have been working on a domain-specific language to make scraping quick and easy, so that you can write, say, 100 different website scrapers in less time -- http://dartbanks.com/simplescrape . We'd love feedback on this approach.
======
foodawg
I do all my screen scraping with PHP, curl, and some regex. Previously I used
plain PHP.

I use it to scrape television listing data (<http://ktyp.com/rss/tv/> was my
old site, and <http://code.google.com/p/listocracy/>) and more recently to
scrape resume data from job posting websites for a (YC-rejected :P ) side
project I'm working on.

The hardest part I've encountered with scraping is odd login and form setups.
For example, Monster.com uses an outside script to try to foil scraping. A
couple of other sites use bizarre redirects across pages. Also, AJAX has
certainly changed the way a lot of screen scraping is done.

Finally, the most useful tool I've used is LiveHTTPHeaders
(<http://livehttpheaders.mozdev.org/>) which is great for following how a site
operates.

Edit: For PHP, another interesting tool for scraping is htmlSQL
(<http://www.jonasjohn.de/lab/htmlsql.htm>) which allows HTML to be searched
using SQL like syntax.

~~~
m0nty
"Also AJAX certainly has changed the way a lot of screen scraping is done."

I'd be interested in how you tackle this one. I've always used something like
Perl/curl/wget, etc. for scraping, but (like you say) JavaScript messes that
up. I've had moderate success using GreaseMonkey and regexps in JavaScript
code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery,
since that should allow me to select DOM elements very easily. But if you
have a better way, please share :)

~~~
alex_c
Even though it's actually a testing tool, you might have some luck with Canoo
WebTest + Groovy (<http://webtest.canoo.com>). WebTest uses HtmlUnit, which
has pretty good Javascript support and means you don't have to mess with
regexps to navigate the document structure, and Groovy lets you use an actual
programming language rather than the awkward Ant-based syntax of WebTest. It
takes some getting used to, and I haven't used it for web scraping, but it's
a pretty powerful combination.

~~~
m0nty
Thanks, I'll give it a try. I'm collaborating on a project which involves
getting info from online financial markets, btw, but it's getting held up
because of this scraping problem. So new ideas might help get it moving again.

------
henning
If you really want to go bonkers on scraping, there are books on this.

<http://nostarch.com/frameset.php?startat=webbots>

<http://www.oreilly.com/catalog/spiderhks/>

Those probably cover the topic of scraping pretty exhaustively.

------
inovica
We are working on a lot of scraping and analysis and here are a few links that
you might be interested in if you are using Python:

<http://pyro.sourceforge.net/>

<http://pyprocessing.berlios.de/>

<http://www.sqlalchemy.org/>

<http://codespeak.net/lxml/>

<http://nltk.org/index.php/Main_Page>

The biggest hurdle is in understanding how to navigate through a complex site
- such as a forum, a real estate site, etc. We have created a visual tool for
this, though there are other methods. Look at dapper.net, as it is useful.

I am wondering if there could be some collaborative effort from the minds on
this site to create something unique and groundbreaking.

------
hooande
I've done my share of screen scraping, gathering all different kinds of data.
Movies, sports, finance, you name it. Here are three things I can tell you:

1. Take the time to get very familiar with regular expressions. If you think
you know your regex pretty well, go to the docs or get a book and find three
things you don't understand and understand them fully. Then find three more.
(A small example follows below.)

2. The data doesn't have to be perfect. In most cases you can clean it after
you've stored it. It's generally better to get more than you think you might
need (in terms of data or html/formatting around the data) and then go back
and clean it later.

3. Generally, my most successful data mining algorithms involve a lot of
hacks. There are very few clean formulas... usually I have to play with the
data for a while and fix a lot of one-offs and special cases, and then it
ends up coming out OK.
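
For point 1, a quick Python illustration of the kind of features worth
drilling - non-greedy quantifiers and named groups (the HTML here is made
up):

    import re

    html = '<td class="price">$12.50</td><td class="price">$8.99</td>'

    # The non-greedy .*? stops at the first closing tag instead of the last,
    # and the named group makes the extraction self-documenting.
    pattern = re.compile(r'<td class="price">\$(?P<price>.*?)</td>')
    prices = [m.group('price') for m in pattern.finditer(html)]
    # prices == ['12.50', '8.99']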

------
jraines
I use Ruby, with its nice regex support and libraries (open-uri, REXML) and
the hpricot and mechanize rubygems.

Yahoo Pipes is also fun to play with, and Firebug is the scraper's best
friend.

Right now I'm working on scraping _public_ LinkedIn data. In the past I've
done Craigslist and Twitter. I haven't done anything really hard, though --
mostly things that can be read as XML.

Here are a few cool links if you're interested in scraping with Ruby:
<http://del.icio.us/jeremyraines/scraping>

------
vikram
I'm working on something similar. Turns out scraping is a small part of the
problem. I don't use BeautifulSoup. Turns out you can transform the HTML of a
page into a list, which can easily be scraped.

Now that I have used it to extract data out of many different types of pages,
I'm looking to turn it into a DSL, so that the code looks natural. Currently
it's just functions which search for tags in the HTML. You can then easily
filter some or others. Here is an example:

(extract-all page [(and (tagp _ :a) (classp _ "jdtd4"))])
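
In Python terms the idea is roughly this (a sketch, not the real
implementation):

    from html.parser import HTMLParser

    # Flatten the page into a list of (tag, attrs) events...
    class TagLister(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tags = []

        def handle_starttag(self, tag, attrs):
            self.tags.append((tag, dict(attrs)))

    # ...then "scraping" is just filtering that list with plain predicates.
    def extract_all(page, pred):
        lister = TagLister()
        lister.feed(page)
        return [t for t in lister.tags if pred(t)]

    # Equivalent of (and (tagp _ :a) (classp _ "jdtd4")):
    page_html = '<a class="jdtd4" href="#">hi</a>'  # stands in for a real page
    links = extract_all(page_html,
                        lambda t: t[0] == 'a' and t[1].get('class') == 'jdtd4')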

------
petercooper
The most powerful, general-purpose scraping tool I've come across lately has
been ScRUBYt: <http://scrubyt.org/> .. although I admit I don't have much
occasion to use it often.

It lets you specify which items on an initial/prototype page you want to
scrape, and then it builds up a set of rules that then work on future similar
instances of that page. Good for scraping eBay, Google, stuff like that.

------
thorax
I use BeautifulSoup when needed for simple scraping.

My biggest frustrations, right now, are really around getting data from lots
of different websites in subtly varied forms. This is a tough problem to
automate. I certainly haven't found any tools that make it simple.

I'd be happy with a 50% correctness rate, looking for very loose patterns. I
just haven't found a tool and, while I have some ideas for how to do it, it's
a major project in itself to produce something that can do this.

For example, imagine writing a scraper that would parse out every food recipe
online, whether it's in forums, blogs, etc. That's the sort of scraping I'm
looking for, and the best idea I have is putting together a neural network or
other system that I can train against human-provided data. Unfortunately,
getting such a system to partition the text down to just the recipe would be
difficult.

~~~
fallentimes
Getting just the recipe would be the hardest part, but it's still doable. Once
you figure out that you're currently parsing a recipe (via keywords, close
matching, whatever) you could fan out and look for common start/end tags like
<p>, <div>, etc. If you use something like Beautiful Soup you could do this
pre-parse instead of post-parse and eliminate a lot of extra stuff (no recipes
in the <head> tag, etc.)

After that it just becomes an issue of removing the cruft around the recipe.
I would start with common stuff: splitting things up by <br> or inner <p>,
since if someone is gonna have something before/after their recipe (say, on a
forum) it'll be split up with blank lines somehow (well, usually). This will
be another place to use things like close matching and teaching the algorithm
what it gets right/wrong, so it can weigh things as recipe/not-recipe better
in the future.

If you do all this and add more specific edge cases as time goes on, I think
you'd be able to maintain a 50% correctness rate pretty easily.
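
A rough Python/Beautiful Soup sketch of the fan-out idea (the keyword list
and function name are made up, and newer versions spell it find_all):

    from bs4 import BeautifulSoup

    RECIPE_WORDS = ('ingredients', 'preheat', 'tablespoon')  # made-up keywords

    def candidate_blocks(html):
        soup = BeautifulSoup(html, 'html.parser')
        if soup.head:
            soup.head.decompose()  # no recipes in the <head> tag
        # find text nodes that smell like a recipe...
        hits = soup.find_all(string=lambda s: s and any(
            w in s.lower() for w in RECIPE_WORDS))
        # ...and fan out to the nearest enclosing <p> or <div>
        blocks = set()
        for hit in hits:
            block = hit.find_parent(['p', 'div'])
            if block is not None:
                blocks.add(block)
        return blocks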

Edit: And it'd be much cheaper than a neural network ;)

------
imrobotmaker
I use curl, wget and links to retrieve data from sites, and then I filter it
with good old sed and grep.

I created a mashup of AIM + Flickr. If you use AIM 6 or AIM Lite, send a
message to MyPictureBuddy and enjoy. Basically, you type a keyword and it
goes to Flickr and retrieves image information to display pictures right
inside your AIM chat session.

I also have another bot that parses the Hacker News XML and then displays it
in the chat session. That bot's name is HackerNewsYC.

------
jharrison
I used Mechanize and Hpricot on a project recently to create a sort of
poor-man's API. My client is a performing arts organization that wanted a new
website, but they already had a (dreadful) internally-hosted site for selling
tickets.

In order to keep website users from having two accounts, I created an
interface that scrapes the sign-in, sign-up, lost password, change password,
and a couple of other screens of the internal system. So when users come to
the website and "log in," they're actually logging in to the internal system,
and I just record their session from the internal system so I can masquerade
as them as they go about their business.
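
The mechanics are roughly this (a Python sketch with made-up URLs and field
names; the real thing is Mechanize/Hpricot in Ruby):

    import requests

    def login_as(username, password):
        s = requests.Session()
        # Submit the internal system's sign-in form on the user's behalf.
        s.post('https://tickets.example.org/login',        # hypothetical URL
               data={'user': username, 'pass': password})  # hypothetical fields
        # Keep the internal system's session cookie; from here on we can
        # masquerade as the user for purchases, password changes, etc.
        return s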

It's not going to support 100s of connections per second but it gets the job
done for their traffic levels (36,000 views the first day of launch).

------
nreece
Our startup, Feedity - <http://feedity.com> - generates RSS web feeds from
virtually any webpage, for the purpose of content tracking and mashup data
reuse.

We scrape public webpages (with an option for content owners to restrict
access), and we use the .NET Framework's built-in socket implementation (the
System.Net namespace) for fetching remote content.

Our biggest frustration was dealing with invalid charset/content encodings in
the source webpages, but we resolved it using a custom module. Now everything
we parse is Unicode (UTF-8)!

The coolest hack we've encountered while scraping is utilizing Conditional
GET behavior using the HTTP If-Modified-Since header.
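
In Python the same trick looks roughly like this (an illustration only, since
we're on .NET; the URL and date are made up):

    import requests

    url = 'http://example.com/feed-source'             # hypothetical page
    last_modified = 'Sat, 01 Mar 2008 12:00:00 GMT'    # saved from last fetch

    r = requests.get(url, headers={'If-Modified-Since': last_modified})
    if r.status_code == 304:
        pass  # not modified: reuse the cached copy, skip re-parsing
    else:
        last_modified = r.headers.get('Last-Modified', last_modified)
        # ...re-parse r.text and refresh the cache...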

------
mosburger
Either Beautiful Soup, or Yahoo Pipes... I have a website that parses RSS
feeds, and some sites don't have feeds yet! Or if they do, they aren't usable.
So I use Pipes to scrape a page and turn it into a feed using their regexp
operator, then my site uses that feed.

------
jauco
I'm surprised nobody has mentioned Dapper yet (www.dapper.net). It's a really
nice approach to turning websites into structured content.

------
herdrick
HtmlPrag turns any HTML into nice s-expressions. It's a Scheme library.
<http://www.neilvandyke.org/htmlprag/>

I've used it a lot - it's really great.

------
aquateen
I used Hpricot to scrape web.archive and reddit to make <http://reredd.com>.

I plan on scraping past Billboard charts to let people listen to the radio
back in time.

------
dangoldin
I come from a Perl background, so I've been using HTML::TreeBuilder and
XML::TreeBuilder to do my parsing. They basically load an HTML/XML file into
their own tree structure and give you an easy way to go through it. By
knowing how each site names its divs/classes, I am able to scrape.

I took a quick glimpse at beautiful soup and it seems to be doing something
similar - someone let me know if this is correct.

~~~
mrtron
Yes. You can even regex search through the tree. Weeeeee!

BeautifulSoup is nothing unique, but it can handle malformed data, which
saves you a ton of hassle.
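
e.g., with the newer Beautiful Soup API (older versions spell it findAll):

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')  # html fetched elsewhere
    # match attribute values with a regex instead of an exact string
    links = soup.find_all('a', href=re.compile(r'^/item\?id=\d+'))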

------
friism
Scraping EU public procurement contracts from the "Tenders Electronic Daily"
database (<http://ted.europa.eu/>). There are more than a million documents,
with each document requiring up to two requests. Been at it for several weeks
with a multithreaded scraper and we're almost through. Using Solvent
(simile.mit.edu/solvent/) to generate XPath expressions and HtmlAgilityPack
(www.codeplex.com/htmlagilitypack) to run the XPath on the downloaded HTML,
with regexps as the topping. They're a match made in heaven
(<http://www.itu.dk/~friism/blog/?p=40>).
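
For the Python folks, the same XPath-plus-regex pipeline looks roughly like
this with lxml (our actual code is C#, and the XPath and regex here are made
up):

    import re
    from lxml import html

    doc = html.fromstring(page_source)  # page_source downloaded elsewhere
    # XPath gets you to the right node; the regex cleans up what's inside.
    for cell in doc.xpath('//td[@class="doc-title"]'):  # hypothetical XPath
        m = re.search(r'Contract notice:\s*(.+)', cell.text_content())
        if m:
            print(m.group(1))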

The login procedure is gothic and took a lot of wiresharking to figure out.
.NET has pretty good scraping support in the WebClient and HttpWebRequest
classes found in the System.Net namespace.

Will publish results soon... :-)

~~~
inovica
Be careful here. The content is actually copyrighted. Whilst you can scrape
it, their T&Cs expressly forbid it. They sell licenses to access this
information - the license is NOT expensive, and they provide direct access to
all the data in XML.

~~~
friism
[http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&Template...](http://ted.europa.eu/Exec?DataFlow=ShowPage.dfl&Template=TED/important_legal_notice)

Quote: "Reproduction is authorised provided the source is acknowledged.
However, to prevent disruptions in service to our normal users from bulk
downloads of TED data, we reserve the right to check for, and block, attempts
to download excessive quantities of documents, particularly using automated or
robot-like tools."

... they apparently chose not to exercise that right in this case; the scrape
completed last night (all 18 GB of it).

------
jdvolz
I've written a lot of this sort of program over the last 18 months. This is
something that people are in need of all the time. I would say that there
isn't yet a tool which does this to the level that customers want.

I use Mechanize, both in its Ruby and Python forms (I prefer Ruby), and plain
old regular expressions to get the information that I want. Oftentimes I will
use a divide-and-conquer strategy, removing part of the web page (for
example, the <head>) and successively paring it down to what I really want.

Javascript can be a problem. What I normally do is actually read the
Javascript on the page and then recreate that behavior in my Ruby code.
Oftentimes this means simply setting some form values (usually hidden) and
then submitting the form.
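
In the Python version that usually comes down to something like this (the
URL, form, and field names are made up):

    import mechanize

    br = mechanize.Browser()
    br.open('http://example.com/search')  # hypothetical URL
    br.select_form(name='searchForm')     # hypothetical form name
    br.form.set_all_readonly(False)       # hidden inputs are often read-only
    br['page'] = '2'                      # set the value their JS would set
    response = br.submit()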

------
bkrausz
Scraping HN for fnid's to auto-post xkcd comics :-P. It was a hack so I just
did string searches.

------
3KWA
web scrapping with beautiful soup (is this old school already?) - parsing the
Sydney Future Exchange for data back in 2003 (still running)

------
fallentimes
We're using a general-use multi-threaded crawler to get the pages and then
using Beautiful Soup and a bit of regex to parse them. Though we are scraping
multiple sites, they are all in the same "category" so to speak, so there are
a lot of generic parsing methods that are simply overridden when necessary.
PyParsing was played with for a while, but since data comes in so many
slightly varied forms I was ending up with rules that were miles and miles
long just to find a simple price or date/time on a page that would work for
the largest number of sites possible.
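
The override pattern itself is nothing fancy - a sketch (the tag and class
names are made up):

    class BaseParser(object):
        """Generic parsing that works for most sites in the category."""
        def parse_price(self, soup):
            return soup.find('span', 'price').string

    class SiteXParser(BaseParser):
        """Site X buries its prices differently, so only this method changes."""
        def parse_price(self, soup):
            return soup.find('td', 'cost').string.strip('$')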

------
sheriff
My startup, <http://www.FuseCal.com> (previously discussed at
<http://news.ycombinator.com/item?id=146134>), scrapes calendar events out of
web pages and into your personal calendar. In the general case, we don't know
anything about the layout of the page before trying to extract the events, so
there's something of a classification problem first.

------
johnb
I'm a big fan of using Hpricot + Ruby. I'd say which sites I had been
scraping, but I doubt my old client wants that to come out :|

To get the most bang for my buck (developer-time-wise), I would visit each
site with Firebug in inspect mode and hover over the data I want to extract.
From there I figure out how I would style that element, and because Hpricot
supports CSS selectors, I straight away have a method for pulling that data
out of the page.
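
The same workflow carries over to Python now that Beautiful Soup has CSS
selector support; e.g. (the selector is made up):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')  # html fetched elsewhere
    # use the selector that Firebug-style inspection suggests, directly
    for cell in soup.select('table.results td.price'):
        print(cell.get_text())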

------
mk
This sounds redundant already, but I scrape using beautiful soup. Right now
I'm scraping a lot of news sites and feeds for a project I am working on.

------
ivrokv
This can be very useful. I use pyparsing with custom python code for scraping.

------
mrtron
I always do custom stuff in beautiful soup, but this looks somewhat cool.

Maybe have it so you can edit the sample text and language and see the results
all on a web page?

------
blinks
<http://gatherer.wizards.com> with BeautifulSoup, the only parser I've found
that can deal with this @$%^! HTML.

------
andrew311
Do any of you who scrape fear retaliation from the sites you scrape? Maybe you
are violating a ToS or scraping copyrighted text, and they cut off your IP.
Thoughts?

~~~
inovica
I think you have to take into consideration the ToS, copyright, and also
robots.txt. If you ignore these, then it's well within the site owner's
rights to do something about it - blocking you or worse. We always look at
the robots.txt file first and use that as our benchmark in terms of what the
site wishes robots/crawlers to look at.
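
Checking is cheap, e.g. with Python's bundled parser (the URLs and user-agent
are made up):

    from urllib import robotparser  # plain robotparser in Python 2

    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.com/robots.txt')
    rp.read()
    if rp.can_fetch('MyCrawler', 'http://example.com/forum/'):
        print('OK to crawl')
    else:
        print('robots.txt asks crawlers to stay out; respect it')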

------
glasner
I have a similar DSL built in Ruby that can be run by either Mechanize or
Watir. I highly recommend Watir if you need to scrape AJAX.

------
misterbwong
I've been using C# with the HTMLAgilityPack. Probably not as fast as it could
be, but C# is what I know best.

------
bct
Yum, declarative!

I can't say I'm crazy about the syntax, but I'll give this a try when I get
home.

------
bprater
Firebug can be helpful for finding elements you want to regex on!

------
michaelneale
Anyone that complains about HTML scraping is a pussy. Seriously, it's trivial
compared to what we had to do in the past. I like Hpricot for Ruby.

------
latone
Longest Common Subsequences are quite useful as well.

~~~
schaaf
Do you mean for adding a little resilience to your rigid model, or something
funkier?

------
yters
Anyone have success with Emacs and w3? I haven't given it a shot yet, but it
seems like its interactive nature might be useful.

------
ashu
What: Banks. With: libwww-perl.

