Ask PG: Polite ways to scrape HN?
71 points by siculars on Sept 23, 2010 | 29 comments
Hi, I've been thinking about doing an HN-related project and was wondering if the topic of crawling HN has been covered before. A search for "Ask PG" didn't bring up anything relevant, nor did the guidelines and FAQ links. I'm also unaware of any API being available. There seem to be new HN side projects popping up every few weeks or so. How are you getting your HN data?

Obviously the basic rules apply, i.e. don't hammer my server, duh. Assuming sequential article IDs, currently clocking in at ~1,721,000+, it would take a fair amount of time to slurp down HN even at 1k pages/day. Maybe look into pulling down the Google cache. Eh, I dunno. Don't want to get any of my IP addrs banned or worse (hehe). Just looking for some guidance... Thanks in advance.
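
As a rough check on that estimate, here is a back-of-the-envelope sketch. It assumes one page per item ID and uses the ~1,721,000 figure and the 1k pages/day rate from the post; `days_to_crawl` is a hypothetical helper name, not part of any HN tooling.

```python
def days_to_crawl(total_items: int, pages_per_day: int) -> float:
    """Days needed to fetch every item once, assuming one page per item ID."""
    if pages_per_day <= 0:
        raise ValueError("pages_per_day must be positive")
    return total_items / pages_per_day


TOTAL_ITEMS = 1_721_000  # approximate highest item ID quoted in the post
PAGES_PER_DAY = 1_000    # the crawl rate floated in the post

days = days_to_crawl(TOTAL_ITEMS, PAGES_PER_DAY)
print(f"{days:.0f} days, about {days / 365:.1f} years")
```

That works out to 1721 days, about 4.7 years, and it is a lower bound, since new items keep arriving while the crawl runs.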

You can crawl so long as you respect robots.txt and don't retrieve more than a couple pages a minute. (Conservative, I know, but we're serving 800k pages a day off one server, using an app written in a slow language.)
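
A minimal sketch of those two rules in code, for anyone who wants a starting point. It is not pg's or anyone's official client: the class name `PoliteFetcher`, the `example-bot` user agent, and the 30-second spacing (one reading of "a couple pages a minute") are all assumptions. The robots.txt lines are passed in already downloaded, and `opener` is whatever function you use to actually fetch a URL.

```python
import time
import urllib.robotparser


class PoliteFetcher:
    """Gate requests on robots.txt rules and a minimum delay between fetches."""

    def __init__(self, robots_lines, user_agent="example-bot",
                 min_interval=30.0, clock=time.monotonic, sleep=time.sleep):
        # robots_lines: the site's robots.txt, already downloaded, as a list of lines.
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_lines)
        self.user_agent = user_agent
        self.min_interval = min_interval  # 30 s between requests = 2 pages/minute
        self.clock = clock                # injectable so the limiter can be tested
        self.sleep = sleep
        self.last_fetch = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch url."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until at least min_interval has passed since the last request."""
        now = self.clock()
        if self.last_fetch is not None:
            remaining = self.min_interval - (now - self.last_fetch)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_fetch = now

    def fetch(self, url, opener):
        """Call opener(url) only if robots.txt allows it; return None otherwise."""
        if not self.allowed(url):
            return None
        self.wait_turn()
        return opener(url)
```

The clock and sleep functions are parameters only so the spacing logic can be exercised without waiting in real time; in normal use the defaults are fine.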

Paul, have you considered asking the community to volunteer resources?

You've mentioned several times that the site is on one server and that performance is an issue at times. I would donate time/money to make those problems go away, and I suspect others would as well.

Proposal: Post a new thread, "somebody build me a HN server farm", and include the software and hardware prerequisites for an Arc webserver. I bet within 48 hours you would have 3-4 servers and a load balancer at your disposal.

I think the issue is that news.arc doesn't use a distributable storage system. It is as far from "shared nothing" as you get.

Hm, interesting.

You can scale anything, though. For example, a pair of reverse proxies in front of the single "share nothing" app server could reduce the load on that one machine. I'm sure the collective brilliance of the HN readership could come up with solutions.

Simple: buy a bigger server. It's amazing how powerful one machine can be these days.

Assuming nothing has changed, it's running on a "3.0 GHz Core whatever, 12 GB RAM, 64-bit FreeBSD 7.1."

Upgraded on 4/19/09 from a 2.4 GHz Pentium 4, 4 GB RAM, 32-bit FreeBSD 5.3.

rtm's comment: http://news.ycombinator.com/item?id=516122

Yeah, well, this is a middle of the road server these days. I have a 16-core server with 64G of RAM as a dev server at work. They get even more powerful than this, too.

Not saying pg should spend any money on this, though. A redesign of news.arc will stretch the hardware a lot further.

Thank you!

Are you calling Arc Lisp slow?

Here you go: http://api.ihackernews.com/

It's an unofficial API though and it might not do everything you need.

I also made a YQL table for HackerNews a while back which should be as polite to the site as possible.


YQL implements a bunch of caching layers so it should tread as lightly as possible.

Disclaimer: I do work for Yahoo!

Just curious, especially after the deal with Microsoft: Has Yahoo made any long term commitments to support this? Would love to use this for some projects.

Yes. It's been explicitly stated on our blog and on the YQL page that it's not going away.

Specifically we've promised a minimum of 6mo notice if we ever did decide to shut it down.

Is this condoned by PG?

He's aware of it because he responded to an AskHN question about it: http://news.ycombinator.com/item?id=1702399

He didn't say not to do it :)

For now, the answer seems to be yes.

I'm inclined to agree. Remember, pg doesn't provide an RSS feed for his essays.


Any words on why there is no API for HN?

It might be a philosophical thing. Once APIs became popular, scraping started to be viewed as breaking the rules. I think this has discouraged many from writing search engines. Crawlers are just scrapers on a big scale.

I know I'd like to see more serious attempts at vertical search engines. I wonder if getting backlash, in the form of people complaining to hosting companies, would be the hardest part of writing a search engine for ruby blog posts that understands regular expressions. I could certainly use such a tool.

Time that could be spent working on some other, higher-marginal-utility thing, I would think.

This works great: http://api.ihackernews.com/

You can start with http://hotfile.com/dl/71319191/96579fe/ycombinator-news20080... - a mirror of the old 'Y Combinator Dataset Of Posts Version 1.7' from http://news.ycombinator.com/item?id=296919 .

Also, could you please post the final dataset somewhere when done?

Why not crawl the google text cache instead?

It indexes HN within a few minutes of any post.

It's a good question, and api.ihackernews.com doesn't look bad.

You could monitor http://hackerne.ws/newest so you wouldn't need to spider too many pages, if you only wanted the latest content.
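
One way this could look, as a sketch only: pull the item IDs out of each copy of the listing page and keep the ones not seen before. It assumes the page links to items as `item?id=<number>` (the same shape as the item URLs quoted in this thread); the real markup may differ. `new_item_ids` is a hypothetical helper, and fetching the page itself is left to whatever rate-limited client you use.

```python
import re

# Assumed link shape, based on item URLs such as .../item?id=516122
ITEM_LINK = re.compile(r"item\?id=(\d+)")


def new_item_ids(html, seen):
    """Return item IDs found in html that are not yet in seen, and record them."""
    fresh = []
    for match in ITEM_LINK.finditer(html):
        item_id = int(match.group(1))
        if item_id not in seen:
            seen.add(item_id)
            fresh.append(item_id)
    return fresh
```

Keeping `seen` across polls means each item page needs to be requested at most once, which keeps the total request count close to the rate at which new posts appear.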

Just curious, what kind of analysis are you thinking of doing?

That's what interested me as well. What could you possibly want to scrape HN for? I suppose it would be kind of fun to see who the most active users are. Or who has the most karma. Maybe incorporate additional game mechanics?

The only site outside of HN that uses HN data that I use regularly is ihackernews.com. All of the others I have just registered and the sites lie dormant.

Some of what you want is here: http://news.ycombinator.com/lists

If I ever get something worth looking at together I'll be sure to post here on HN. But I have some content analysis in mind.
