
Ask PG: Polite ways to scrape HN? - siculars
Hi, I've been thinking about doing an HN-related project and was wondering if the topic of crawling HN has been covered before. A search for "Ask PG" didn't bring up anything relevant, nor did the guidelines and FAQ links. I'm also unaware of any API availability. There seem to be new HN side projects popping up every few weeks or so. How are you getting your HN data?

Obviously the basic rules apply, i.e. don't hammer the server, duh. Assuming sequential article IDs, currently clocking in at ~1,721,000+, it would take a fair amount of time to slurp down HN even at 1k pages/day (roughly 1,721 days, or about 4.7 years). Maybe look into pulling down the Google cache. Eh, I dunno. Don't want to get any of my IP addresses banned or worse (hehe). Just looking for some guidance... Thanks in advance.
======
pg
You can crawl so long as you respect robots.txt and don't retrieve more than a
couple pages a minute. (Conservative, I know, but we're serving 800k pages a
day off one server, using an app written in a slow language.)
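
As a rough sketch of what staying inside that limit might look like (unofficial; the 30-second sleep, wildcard user agent, and ID range below are illustrative assumptions, not anything HN publishes):

```python
# Minimal polite-crawler sketch: honor robots.txt and stay at or below
# ~2 pages/minute. The ID range and sleep interval are illustrative.
import time
import urllib.request
import urllib.robotparser

BASE = "http://news.ycombinator.com"

robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

def fetch_item(item_id):
    url = "%s/item?id=%d" % (BASE, item_id)
    if not robots.can_fetch("*", url):
        return None  # disallowed by robots.txt, skip it
    with urllib.request.urlopen(url) as resp:
        return resp.read()

for item_id in range(1, 101):  # hypothetical ID range
    html = fetch_item(item_id)
    # ... parse/store html here ...
    time.sleep(30)  # ~2 requests/minute, within the stated limit
```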

~~~
portman
Paul, have you considered asking the community to volunteer resources?

You've mentioned several times that the site is on one server and that
performance is an issue at times. I would donate time/money to make those
problems go away, and I suspect others would as well.

Proposal: Post a new thread, "somebody build me an HN server farm", and include
the software and hardware prerequisites for an Arc webserver. I bet within 48
hours you would have 3-4 servers and a load balancer at your disposal.

~~~
jrockway
I think the issue is that news.arc doesn't use a distributable storage system.
It is as far from "shared nothing" as you can get.

~~~
portman
Hm, interesting.

You can scale _anything_, though. For example, a pair of reverse proxies in
front of the single "shared nothing" app server could reduce the load on that
one machine. I'm sure the collective brilliance of the HN readership could
come up with solutions.

~~~
staunch
Simple: buy a bigger server. It's amazing how powerful one machine can be
these days.

~~~
grinich
Assuming nothing has changed, it's running on a "3.0 GHz Core whatever, 12 GB
RAM, 64-bit FreeBSD 7.1."

Upgraded on 4/19/09 from a 2.4 GHz Pentium 4, 4 GB RAM, 32-bit FreeBSD 5.3.

rtm's comment: <http://news.ycombinator.com/item?id=516122>

~~~
jrockway
Yeah, well, this is a middle-of-the-road server these days. I have a 16-core
server with 64G of RAM as a dev server at work. They get even more powerful
than this, too.

Not saying pg should spend any money on this, though. A redesign of news.arc
will stretch the hardware a lot further.

------
bobds
Here you go: <http://api.ihackernews.com/>

It's an unofficial API though and it might not do everything you need.
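
Something like this might pull the front page as JSON (a sketch; the /page route and the "items"/"title"/"points" fields are assumptions about that unofficial API, so check its docs before relying on them):

```python
# Hedged sketch of calling the unofficial ihackernews API. The /page
# route and the response fields are assumptions; verify against the
# service's own documentation.
import json
import urllib.request

with urllib.request.urlopen("http://api.ihackernews.com/page") as resp:
    front_page = json.load(resp)

for item in front_page.get("items", []):
    print(item.get("title"), "-", item.get("points"), "points")
```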

~~~
dustingetz
is this condoned by PG?

~~~
sahillavingia
For now, the answer seems to be _yes_.

~~~
benatkin
I'm inclined to agree. Remember, pg doesn't provide an RSS feed for his
essays.

<http://www.paulgraham.com/rss.html>

~~~
riobard
Any word on why there is no API for HN?

~~~
benatkin
It might be a philosophical thing. Once APIs became popular, scraping started
to be viewed as breaking the rules. I think this has discouraged many from
writing search engines. Crawlers are just scrapers on a big scale.

I know I'd like to see more serious attempts at vertical search engines. I
wonder if getting backlash, in the form of people complaining to hosting
companies, would be the hardest part of writing a search engine for Ruby blog
posts that understands regular expressions. I could certainly use such a tool.

------
sahillavingia
This works great: <http://api.ihackernews.com/>

------
lnp
You can start with <http://hotfile.com/dl/71319191/96579fe/ycombinator-news20080906.tar.gz.html> - a mirror of the old 'Y Combinator Dataset Of Posts
Version 1.7' from <http://news.ycombinator.com/item?id=296919>.

Also, could you please post the final dataset somewhere when you're done?

------
ck2
Why not crawl the Google text cache instead?

It indexes HN within a few minutes of any post.
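
Something along these lines might work (untested sketch; the cache URL format is an assumption based on how Google's cache links looked, and Google may throttle or block automated requests):

```python
# Untested sketch: fetch HN pages from Google's text cache instead of
# hitting HN directly. The cache URL format is an assumption, and
# Google may throttle or block automated clients.
import urllib.parse
import urllib.request

def fetch_cached(url):
    cache_url = ("http://webcache.googleusercontent.com/search?q=cache:"
                 + urllib.parse.quote(url, safe=""))
    req = urllib.request.Request(cache_url,
                                 headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

page = fetch_cached("http://news.ycombinator.com/item?id=1721000")
```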

------
SteveMorin
It's a good question, and api.ihackernews.com doesn't look bad.

------
deutronium
You could monitor <http://hackerne.ws/newest> so you wouldn't need to spider
too many pages, if you only want the latest content.
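
A rough sketch of that approach (the item-link regex is an assumption about the page's current HTML, not a stable interface):

```python
# Poll the "newest" page on a slow interval and diff against item IDs
# already seen. The regex is an assumption about the HTML markup.
import re
import time
import urllib.request

seen = set()
while True:
    with urllib.request.urlopen("http://news.ycombinator.com/newest") as resp:
        html = resp.read().decode("utf-8", "replace")
    for item_id in re.findall(r"item\?id=(\d+)", html):
        if item_id not in seen:
            seen.add(item_id)
            print("new item:", item_id)
    time.sleep(60)  # one request/minute, under pg's stated limit
```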

------
charlief
Just curious, what kind of analysis are you thinking of doing?

~~~
jonpaul
That's what interested me as well. What could you possibly want to scrape HN
for? I suppose it would be kind of fun to see who the most active users are,
or who has the most karma. Maybe incorporate additional game mechanics?

The only site outside of HN that uses HN data that I use regularly is
ihackernews.com. All of the others I have just registered for, and the sites
lie dormant.

~~~
nitrogen
Some of what you want is here: <http://news.ycombinator.com/lists>

