

Ask HN: Does HN have an API and if not what's the etiquette for scraping? - robeastham

I haven't seen any mention of an API, and I want my new resume app, aimed at tech types, to have some very basic HN integration. I want to offer users the option to supply their HN username and have their karma appear on the resume my app generates, along with how long they have been an HN member. I'd like the figure to be up to date, so ideally I would scrape it every time a resume is requested. Would this be acceptable, or is it bad form? If it's bad form, what would be a polite scraping frequency?

I guess this question could open a more general discussion on the etiquette of screen scraping, so feel free to answer in general terms too.
======
hardik988
Hacker News has an unofficial API at <http://api.ihackernews.com/> , and as
far as I know, it is running with PG's permission and under certain limits.

I've worked with the API and it does a fantastic job. For what you're looking
for, I think this API call will do the job:

<http://api.ihackernews.com/profile/{userID}>
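
A minimal sketch of calling that endpoint from Python. The helper names are made up for illustration, and the `karma` field name is an assumption about the unofficial API's JSON response, so check the actual payload before relying on it:

```python
import json
from urllib.request import urlopen

API_BASE = "http://api.ihackernews.com"

def profile_url(user_id):
    # Build the profile endpoint URL for a given HN username.
    return "{0}/profile/{1}".format(API_BASE, user_id)

def extract_karma(profile):
    # "karma" is an assumed field name in the JSON response;
    # inspect a real response to confirm it.
    return profile.get("karma")

# Network call, commented out so the sketch stays self-contained:
# with urlopen(profile_url("pg")) as resp:
#     print(extract_karma(json.load(resp)))
```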

------
VBprogrammer
<http://news.ycombinator.com/robots.txt>

    
    
      User-Agent: *
      Disallow: /x?
      Disallow: /vote?
      Disallow: /reply?
      Disallow: /submitted?
      Disallow: /submitlink?
      Disallow: /threads?
      Crawl-delay: 30
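
As a sketch, the stdlib `urllib.robotparser` can check those rules programmatically, including the 30-second crawl delay (note the parser is lenient about the trailing `?` in the Disallow patterns):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, fed to the parser as literal lines
# so no network request is needed.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# User profile pages are not disallowed, voting URLs are, and
# requests should be spaced at least 30 seconds apart.
print(rp.can_fetch("*", "http://news.ycombinator.com/user?id=pg"))  # True
print(rp.crawl_delay("*"))                                          # 30
```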

------
coderdude
Another user posted a similar question a few days ago, asking whether it would
be alright to crawl HN and whether Paul would block him if he tried. I told
him: "There's enough of a load on HN as it is without every Tom, Dick, and
Harry testing out his proof of concept by crawling this site."

I think a year ago this wouldn't have been an issue if you spaced your
requests out by several seconds, but during the daytime (United States
daytime, that is) this site is hammered. Yesterday the site was brought to a
crawl (no pun intended) just from everyone commenting on that DDG/Google
article.

If you spaced your requests out by 3 or 4 seconds and only crawled during the
US night time then the servers shouldn't be affected too much. That doesn't
account for the data transfer you use up though. Someone has to pay for that
and there are already several sites (that I can think of off the top of my
head) that crawl HN on a regular basis.

------
komlenic
For what you're looking to do, once a day is plenty often enough. (How much
does karma change in a day and does it matter if your site is a few points
behind?)

If it were me, I'd cache the result of an API call to HN along with a last-
retrieved timestamp, and only hit the API again when a request comes in more
than 24 hours after that timestamp.
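
That cache-and-timestamp idea can be sketched like this; `fetch_fn` is a stand-in for whatever actually retrieves the karma (an API call or a scrape), injected so the caching logic itself needs no network:

```python
import time

CACHE_TTL = 24 * 60 * 60  # refresh at most once per day

def get_karma(username, fetch_fn, cache, ttl=CACHE_TTL, now=time.time):
    """Return cached karma, refetching only when the entry is stale."""
    entry = cache.get(username)
    if entry is not None:
        karma, fetched_at = entry
        if now() - fetched_at < ttl:
            return karma  # still fresh: serve from cache
    # Stale or missing: fetch and record the retrieval time.
    karma = fetch_fn(username)
    cache[username] = (karma, now())
    return karma
```

Passing `now` in as a parameter is just a convenience that makes the expiry logic easy to test without waiting a day.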

~~~
robeastham
Agreed. I guess it doesn't really matter if my site is a few points behind.
Timestamp strategy makes sense too. Thanks!

I like the look of <http://api.ihackernews.com/> and so will poll this rather
than scrape - thanks for the tip hardik988 and for all the other comments too.

