Hacker News
Ask YC: What do you scrape? How do you scrape?
51 points by schaaf 3238 days ago | 46 comments
Theory-wise, there are regular languages, context-free grammars, and combinatory categorial grammars ( http://openccg.sf.net ). But regular languages plus lists seem adequate for most tasks.

What sorts of scraping do you find yourself doing?

What are your biggest frustrations?

What's the coolest hack you've encountered while scraping?

My cofounder and I have been working on a domain-specific language to make scraping quick and easy, so that you can write, say, 100 different website scrapers in less time -- http://dartbanks.com/simplescrape . We'd love feedback on this approach.

I do all my screen scraping with PHP, curl, and some regex. Previously I used plain PHP.

I use it to scrape television listing data (http://ktyp.com/rss/tv/ was my old site, and http://code.google.com/p/listocracy/) and more recently to scrape resume data from job posting websites for a (YC-rejected :P ) side project I'm working on.

The hardest part I've encountered with scraping is odd login and form setups. For example, Monster.com uses an outside script to attempt to foil scraping. A couple of other sites use bizarre redirects across pages. Also AJAX certainly has changed the way a lot of screen scraping is done.

Finally, the most useful tool I've used is LiveHTTPHeaders (http://livehttpheaders.mozdev.org/) which is great for following how a site operates.

Edit: For PHP, another interesting scraping tool is htmlSQL (http://www.jonasjohn.de/lab/htmlsql.htm), which allows HTML to be searched using SQL-like syntax.

"Also AJAX certainly has changed the way a lot of screen scraping is done."

I'd be interested in how you tackle this one. I've always used something like Perl/Curl/wget etc for scraping, but (like you say) JavaScript messes that up. I've had moderate success using GreaseMonkey and regexps in JavaScript code, but it's a bit fragile. I'm thinking of using GreaseMonkey + jQuery, since that should allow me to select DOM elements very easily. But if you have a better way, please share :)

Even though it's actually a testing tool, you might have some luck with Canoo Webtest + Groovy (http://webtest.canoo.com). Webtest uses HtmlUnit which has pretty good Javascript support, and means you don't have to mess with regexps to get around the document structure, and Groovy lets you use an actual programming language rather than the awkward Ant-based syntax of Webtest. It takes some getting used to, and I haven't used it for web scraping, but it's a pretty powerful combination.

Thanks, I'll give it a try. I'm collaborating on a project which involves getting info from online financial markets, btw, but it's getting held up because of this scraping problem. So new ideas might help get it moving again.

If you really want to go bonkers on scraping, there are books on this.



That probably covers the topic of scraping pretty exhaustively.

I've done my share of screen scraping, gathering all different kinds of data. Movies, sports, finance, you name it. Here are three things I can tell you:

1. Take the time to get very familiar with regular expressions. If you think you know your regex pretty well, go to the docs or get a book and find three things you don't understand and understand them fully. Then find three more.

2. The data doesn't have to be perfect. In most cases you can clean it after you've stored it. It's generally better to grab more than you think you might need (in terms of data or the HTML/formatting around the data) and then go back and clean it later.

3. Generally, my most successful data-mining algorithms involve a lot of hacks. There are very few clean formulas; usually I have to play with the data for a while and fix a lot of one-offs and special cases, and then it ends up coming out OK.
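To make point 1 concrete, here is a minimal Python sketch (the HTML snippet and class name are invented) of two regex features worth knowing cold for scraping work: named groups and re.VERBOSE.

```python
import re

# Invented listing snippet, just for illustration.
html = '<td class="price">$1,299.00</td><td class="price">$45.50</td>'

# A named group plus re.VERBOSE keeps a scraping pattern readable:
# you can comment the pattern itself and refer to captures by name.
price_re = re.compile(r"""
    class="price">\$            # anchor on the attribute, not the tag soup
    (?P<amount>[\d,]+\.\d{2})   # named group for the dollar amount
""", re.VERBOSE)

prices = [m.group("amount") for m in price_re.finditer(html)]
print(prices)  # ['1,299.00', '45.50']
```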

We are working on a lot of scraping and analysis and here are a few links that you might be interested in if you are using Python:






The biggest hurdle is understanding how to navigate through a complex site, such as a forum or a real estate listing site. We have created a visual tool for this, though there are other methods. Look at dapper.net, as it is useful.

I am wondering if there could be some collaborative effort from the minds on this site to create something unique and groundbreaking.

I use Ruby, with its nice regex support and libraries (open-uri, REXML) and the hpricot and mechanize rubygems.

Yahoo Pipes is also fun to play with; and Firebug is the scraper's best friend.

Right now I'm working on scraping public LinkedIn data. In the past I've done Craigslist and Twitter. I haven't done anything really hard, though -- mostly things that can be read as XML.

Here's a few cool links if you're interested in scraping with Ruby: http://del.icio.us/jeremyraines/scraping

I'm working on something similar. It turns out scraping is a small part of the problem. I don't use BeautifulSoup; it turns out you can transform the HTML of a page into a list, which can easily be scraped.

Now that I have used it to extract data out of many different types of pages, I'm looking to turn it into a DSL so that the code looks natural. Currently it's just functions which search for tags in the HTML; you can then easily filter some out. Here is an example:

(extract-all page [(and (tagp _ :a) (classp _ "jdtd4"))])

The most powerful, general-purpose scraping tool I've come across lately has been ScRUBYt : http://scrubyt.org/ .. although I admit I don't have much occasion to use it often.

It lets you specify which items on an initial/prototype page you want to scrape, and then it builds up a set of rules that then work on future similar instances of that page. Good for scraping eBay, Google, stuff like that.

I use BeautifulSoup when needed for simple scraping.

My biggest frustrations, right now, are really around getting data from lots of different websites in subtly varied forms. This is a tough problem to automate. I certainly haven't found any tools that make it simple.

I'd be happy with a 50% correctness rate, looking for very loose patterns. I just haven't found a tool and, while I have some ideas for how to do it, it's a major project in itself to produce something that can do this.

For example, imagine writing a scraper that would parse out every food recipe online, whether in forums, blogs, etc. That's the sort of scraping I'm looking for, and the best I have is putting together a neural network or other system that I can train against human-provided data. Unfortunately, getting such a system to partition the text down to just the recipe would be difficult.

Getting just the recipe would be the hardest part, but it's still doable. Once you figure out that you're currently parsing a recipe (via keywords, close matching, whatever) you could fan out and look for common start/end tags like <p>, <div>, etc. If you use something like Beautiful Soup you could do this pre-parse instead of post-parse and eliminate a lot of extra stuff (no recipes in the <head> tag, etc.)

After that it just becomes an issue of removing the cruft around the recipe. I would start with common stuff: splitting things up by <br> or inner <p> since if someone is gonna have something before / after their recipe (say, on a forum) it'll be split up with blank lines somehow (well, usually). This will be another time to use things like close matching and teaching the algorithm what it gets right/wrong so it can weigh things as recipe/not better in the future.

If you do all this and add more specific edge cases as time goes on, I think you'd be able to maintain a 50% correctness rate pretty easily.

Edit: And it'd be much cheaper than a neural network ;)
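A rough Python sketch of the keyword-scoring idea above, assuming nothing beyond the standard library (the page snippet, keyword list, and threshold are all invented):

```python
import re

# Score each <p> block by recipe-ish keywords; keep blocks above
# a threshold. Real pages would need close matching and feedback.
KEYWORDS = ("cup", "tablespoon", "teaspoon", "preheat", "bake", "stir")

page = """<p>My trip to Rome was lovely.</p>
<p>Preheat the oven. Stir in one cup of flour and bake.</p>
<p>Thanks for reading the blog!</p>"""

def recipe_blocks(html, threshold=2):
    blocks = re.findall(r"<p>(.*?)</p>", html, re.DOTALL)
    return [b for b in blocks
            if sum(b.lower().count(k) for k in KEYWORDS) >= threshold]

print(recipe_blocks(page))
# ['Preheat the oven. Stir in one cup of flour and bake.']
```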

*nod* I've thought about this a fair amount, too. You can do a lot to, say, figure out which pages contain recipes, even identify the structured information like ingredient lists (they're just lists full of foodstuffs and quantities). But IME it all falls apart when you need to find a block of text - like the descriptive part of the recipe. That's rarely marked up very clearly, and tends to blend into the rest of the text. So you either miss parts of the recipe, or pick up chunks of junk from the rest of the page.

That said, it's likely do-able, as long as you don't need perfect results. There are plenty of sites around that seem to be doing things along these lines - but AFAIK none of them have open-sourced their code.

Meanwhile, I've been a coward and stuck to beautiful soup for my scraping projects. In the short term, it works out faster than trying to be too clever.

I use curl, wget and links to retrieve data from sites and then I filter it with old sed and grep.

I created a mashup of AIM + Flickr.

If you use AIM 6 or AIM lite send a message to MyPictureBuddy

then send a message and enjoy.

Basically, you type a keyword and it goes to Flickr, retrieves image information, and displays pictures right inside your AIM chat session.

I also have another bot that parses the Hacker News XML and then displays it in the chat session. The bot name is


I used Mechanize and Hpricot on a project recently to create a sort of poor-man's API. My client is a performing arts organization that wanted a new website but they already had a (dreadful) internally-hosted site for selling tickets.

In order to keep website users from having two accounts, I created an interface that scrapes the sign-in, sign-up, lost-password, change-password, and a couple of other screens of the internal system. So when users come to the website and "log in", they're actually logging in to the internal system, and I just record their session from the internal system so I can masquerade as them as they go about their business.

It's not going to support 100s of connections per second but it gets the job done for their traffic levels (36,000 views the first day of launch).
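A rough Python sketch of the masquerading setup (the form HTML, field names, and helper are invented; the comment above used Ruby's Mechanize and Hpricot): scrape the internal system's sign-in form, carry over its hidden fields, then replay the user's credentials against the form.

```python
import re

# Invented stand-in for the internal ticketing system's sign-in page.
login_page = """<form action="/internal/signin" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="user">
  <input type="password" name="pass">
</form>"""

def build_login_payload(html, user, password):
    # Copy every hidden field verbatim so the backend sees a "real" browser.
    payload = dict(re.findall(
        r'<input type="hidden" name="(\w+)" value="(\w+)">', html))
    payload["user"] = user
    payload["pass"] = password
    return payload

print(build_login_payload(login_page, "alice", "s3cret"))
# With urllib.request + http.cookiejar you would POST this payload and
# keep the returned session cookie to act as the user afterwards.
```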

Either Beautiful Soup, or Yahoo Pipes... I have a website that parses RSS feeds, and some sites don't have feeds yet! Or if they do, they aren't usable. So I use Pipes to scrape a page and turn it into a feed using their regexp operator, then my site uses that feed.

Our startup, Feedity - http://feedity.com , generates RSS web feeds from virtually any webpage, for the purposes of content tracking and mashup data reuse.

We scrape public webpages (with an option for content owners to restrict access), and we use the .NET Framework's built-in socket implementation (the System.Net namespace) for fetching remote content.

Our biggest frustration was dealing with invalid charset/content encoding in the source webpages. But we resolved it with a custom module. Now everything we parse is Unicode (UTF-8)!

The coolest hack we've encountered while scraping is utilizing Conditional GET behavior via the HTTP If-Modified-Since header.
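For reference, a minimal Python sketch of a conditional GET (the URL and timestamp are placeholders): send If-Modified-Since with the time of the last fetch, and an unchanged page comes back as a cheap 304 instead of the full body.

```python
import urllib.request
from email.utils import formatdate

# Placeholder: Unix time of the previous successful fetch
# (2008-01-01 00:00:00 UTC).
last_fetch = 1199145600

req = urllib.request.Request(
    "http://example.com/feed",
    headers={"If-Modified-Since": formatdate(last_fetch, usegmt=True)})

print(req.get_header("If-modified-since"))
# urllib.request.urlopen(req) would typically raise HTTPError 304
# if the resource is unchanged, sparing you the re-download.
```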

I'm surprised nobody has mentioned Dapper yet (www.dapper.net); it's a really nice approach to turning websites into structured content.

HtmlPrag turns any HTML into nice s-expressions. It's a Scheme library. http://www.neilvandyke.org/htmlprag/

I've used it a lot - it's really great.

I used Hpricot to scrape web.archive and reddit to make http://reredd.com.

I plan on scraping past Billboard charts to let people listen to the radio back in time.

I come from a Perl background, so I've been using HTML::TreeBuilder and XML::TreeBuilder to do my parsing. They basically load an HTML/XML file into their own tree structure and give you an easy way to go through it. By knowing how each site names its divs/classes, I am able to scrape.

I took a quick glimpse at beautiful soup and it seems to be doing something similar - someone let me know if this is correct.

Yes. You can even regex search through the tree. Weeeeee!

BeautifulSoup is nothing unique, but it can handle malformed data, which saves you a ton of hassle.
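A rough standard-library Python sketch of that tree-then-regex pattern (the markup and class name are invented; Beautiful Soup or HTML::TreeBuilder do the tree part for you, and more robustly):

```python
import re
from html.parser import HTMLParser

# Collect the text inside any element with a given class, then regex
# through it. Void tags like <br> would confuse this toy depth counter.
class ClassScraper(HTMLParser):
    def __init__(self, wanted_class):
        super().__init__()
        self.wanted = wanted_class
        self.depth = 0          # >0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth or dict(attrs).get("class") == self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data)

scraper = ClassScraper("score")
scraper.feed('<div class="score">Final: 3-1</div><div>ignored</div>')
result = re.search(r"(\d+)-(\d+)", " ".join(scraper.texts))
print(result.groups())  # ('3', '1')
```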

Scraping EU public procurement contracts from the "Tenders Electronic Daily" database (http://ted.europa.eu/). There are more than a million documents, with each document requiring up to two requests. We've been at it for several weeks with a multithreaded scraper and we're almost through. We're using Solvent (simile.mit.edu/solvent/) to generate XPath expressions and the HtmlAgilityPack (www.codeplex.com/htmlagilitypack) to run the XPath on the downloaded HTML, with regexps as the topping. They're a match made in heaven (http://www.itu.dk/~friism/blog/?p=40).

The login procedure is gothic and took a lot of wiresharking to figure out. .NET has pretty good scraping support in the WebClient and HttpWebRequest classes found in the System.Net namespace.
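That XPath-then-regex combination, sketched in Python with the standard library (the document below is an invented, well-formed stand-in; lxml or the HtmlAgilityPack would tolerate real-world tag soup):

```python
import re
import xml.etree.ElementTree as ET

# Invented contract notice, just for illustration.
doc = ET.fromstring("""<notice>
  <section id="value">Contract value: EUR 1 250 000</section>
  <section id="misc">Published 2008</section>
</notice>""")

# XPath narrows the search to the right node...
text = doc.find(".//section[@id='value']").text
# ...and a regex picks the number out of the free text.
amount = re.search(r"EUR\s([\d\s]+\d)", text).group(1)
print(amount)  # '1 250 000'
```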

Will publish results soon... :-)

Be careful here. The content is actually copyrighted. Whilst you can scrape it, their T&Cs expressly forbid it. They sell licenses to access this information - the license is NOT expensive, and they provide direct access to all the data in XML.


Quote: "Reproduction is authorised provided the source is acknowledged. However, to prevent disruptions in service to our normal users from bulk downloads of TED data, we reserve the right to check for, and block, attempts to download excessive quantities of documents, particularly using automated or robot-like tools."

... they apparently chose not to exercise that right in this case, the scrape completed last night (all 18 GB of it).

I've written a lot of this sort of program over the last 18 months. This is something that people are in need of all the time. I would say that there isn't yet a tool which does this to the level that customers want.

I use Mechanize, in both its Ruby and Python forms (I prefer Ruby), and plain old regular expressions to get the information that I want. Oftentimes I will use a divide-and-conquer strategy, removing part of the web page (for example, the <head>) and successively paring it down to what I really want.

Javascript can be a problem. What I normally do is read the Javascript on the page and then recreate that behavior in my Ruby code. Oftentimes this means simply setting some form values (usually hidden) and then submitting the form.
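A toy Python sketch of re-implementing a page's Javascript by hand (the form, the hidden field, and the "reverse the form id" rule are all invented stand-ins for whatever the real script does):

```python
import re

# Invented page: suppose its script sets a hidden "token" field to the
# form id reversed before submitting. We reproduce that server-side.
form_html = '<form id="qry42"><input type="hidden" name="token" value=""></form>'

form_id = re.search(r'<form id="(\w+)"', form_html).group(1)
payload = {"token": form_id[::-1]}   # what the page's JS would have done
print(payload)  # {'token': '24yrq'}
```

The point is just that most scraping-hostile Javascript boils down to a small transformation you can replicate in a few lines once you have read the script.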

Scraping HN for fnid's to auto-post xkcd comics :-P. It was a hack so I just did string searches.

Web scraping with Beautiful Soup (is this old school already?) - parsing the Sydney Futures Exchange for data back in 2003 (still running).

We're using a general-purpose multithreaded crawler to get the pages, and then Beautiful Soup and a bit of regex to parse them. Though we are scraping multiple sites, they are all in the same "category", so to speak, so there are a lot of generic parsing methods that are simply overridden when necessary. I played with PyParsing for a while, but since data comes in so many slightly varied forms, I was ending up with rules that were miles and miles long just to find a simple price or date/time on a page in a way that would work for the largest number of sites possible.
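One way to sketch the "generic parsers, overridden when necessary" structure (class names, methods, and sample strings are all invented):

```python
import re

# Base class holds the common extraction logic; a per-site subclass
# overrides only what differs for that site.
class BaseParser:
    def extract_price(self, html):
        return re.search(r"\$([\d.]+)", html).group(1)

class SiteBParser(BaseParser):
    # Hypothetical site that writes "USD 12.50" instead of "$12.50".
    def extract_price(self, html):
        return re.search(r"USD ([\d.]+)", html).group(1)

print(BaseParser().extract_price("Now only $9.99!"))    # 9.99
print(SiteBParser().extract_price("Price: USD 12.50"))  # 12.50
```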

My startup, http://www.FuseCal.com (previously discussed at http://news.ycombinator.com/item?id=146134), scrapes calendar events out of web pages and into your personal calendar. In the general case, we don't know anything about the layout of the page before trying to extract the events, so there's something of a classification problem first.

I'm a big fan of using Hpricot + Ruby. I'd name the sites I had been scraping, but I doubt my old client wants that to come out :|

To get the most bang for my buck (developer-time-wise), I would visit each site with Firebug in inspect mode and hover over the data I want to extract. From there I figure out how I would style that element, and because Hpricot supports CSS selectors, I straight away have a method for pulling that data out of the page.

This sounds redundant already, but I scrape using beautiful soup. Right now I'm scraping a lot of news sites and feeds for a project I am working on.

This can be very useful. I use pyparsing with custom python code for scraping.

http://gatherer.wizards.com with BeautifulSoup, the only parser I've found that can deal with this @$%^! HTML.

Do any of you who scrape fear retaliation from the sites you scrape? Maybe you are violating a ToS or scraping copyrighted text, and they cut off your IP. Thoughts?

I think you have to take into consideration the ToS, copyright, and also robots.txt. If you ignore these, then it's well within the site owner's rights to do something about it - blocking you, or going further. We always look at the robots.txt file first and use that as our benchmark in terms of what they (the site) wish robots/crawlers to look at.
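For what it's worth, Python ships a robots.txt parser. A self-contained sketch (the rules are fed in directly here instead of fetched, so the example runs offline; normally you'd use set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

# Check robots.txt before crawling, as suggested above.
rp = RobotFileParser()
rp.parse("""User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("MyScraper", "http://example.com/listings"))   # True
print(rp.can_fetch("MyScraper", "http://example.com/private/x"))  # False
```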

I always do custom stuff in beautiful soup, but this looks somewhat cool.

Maybe have it so you can edit the sample text and language and see the results all on a web page?

I have a similar DSL built in Ruby that can be run by either Mechanize or Watir. I highly recommend Watir if you need to scrape AJAX.

I've been using C# with the HtmlAgilityPack. It's probably not as fast as it could be, but C# is what I know best.

Yum, declarative!

I can't say I'm crazy about the syntax, but I'll give this a try when I get home.

Firebug can be helpful for finding elements you want to regex on!

Anyone who complains about HTML scraping is a pussy. Seriously, it's trivial compared to what we had to do in the past. I like hpricot for Ruby.

Longest Common Subsequences are quite useful as well.
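For instance, Python's difflib gives you LCS-style matching between two page fragments out of the box (the fragments here are invented), which helps locate the stable "template" around the changing data:

```python
from difflib import SequenceMatcher

# Two renderings of the same template with different data.
a = "<b>Price:</b> $10 <i>today</i>"
b = "<b>Price:</b> $25 <i>today</i>"

# The matching blocks are the template; the gaps are the data.
common = [a[m.a:m.a + m.size]
          for m in SequenceMatcher(None, a, b).get_matching_blocks()
          if m.size]
print(common)  # ['<b>Price:</b> $', ' <i>today</i>']
```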

Do you mean for adding a little resilience to your rigid model, or something funkier?

Anyone have success with Emacs and w3? I haven't given it a shot yet, but it seems like its interactive nature might be useful.

What: Banks. With: lib-www-perl.
