Ask PG: Polite ways to scrape HN?
71 points by siculars on Sept 23, 2010 | 29 comments
Hi, I've been thinking about doing an HN-related project and was wondering whether the topic of crawling HN has been covered before. A search for "Ask PG" didn't bring up anything relevant, nor did the guidelines and FAQ links. I'm also unaware of any API. New HN side projects seem to pop up every few weeks or so. How are you getting your HN data?
Obviously the basic rules apply, i.e. "don't hammer my server", duh. Assuming sequential article IDs, currently clocking in at ~1721000+, it would take a fair amount of time to slurp down HN even at 1k pages/day. Maybe look into pulling down the Google cache. Eh, I dunno. Don't want to get any of my IP addresses banned or worse (hehe). Just looking for some guidance... Thanks in advance.
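
To put rough numbers on the estimate above, here is a minimal Python sketch. It assumes exactly the figures from the post (about 1,721,000 sequential IDs, 1k pages/day) and nothing about HN's actual limits. The names `crawl_days`, `seconds_between_requests`, and `throttled` are made up for illustration, and the actual page fetch is deliberately left out.

```python
import time
from typing import Callable, Iterable, Iterator


def crawl_days(total_items: int, pages_per_day: int) -> float:
    """Days needed to fetch every item at a fixed daily page budget."""
    return total_items / pages_per_day


def seconds_between_requests(pages_per_day: int) -> float:
    """Even spacing between requests that uses up the daily budget."""
    return 86_400 / pages_per_day  # 86,400 seconds in a day


def throttled(
    ids: Iterable[int],
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[int]:
    """Yield IDs one at a time, pausing delay_s between consecutive IDs.

    The caller does the (hypothetical) fetch for each yielded ID, so the
    pause always separates two requests. `sleep` is injectable for testing.
    """
    first = True
    for item_id in ids:
        if not first:
            sleep(delay_s)
        first = False
        yield item_id


if __name__ == "__main__":
    days = crawl_days(1_721_000, 1_000)
    print(f"{days:.0f} days, about {days / 365:.1f} years")
    print(f"one request every {seconds_between_requests(1_000):.1f} s")
```

At 1k pages/day that is 1,721 days, a bit under five years, with one request every 86.4 seconds. So the daily budget, not the spacing between requests, is what makes a full crawl slow.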