Ask HN: Hacker News rate limits and robots policy?
10 points by jkarneges on Nov 5, 2013 | 8 comments
I'm writing a program that checks Hacker News for updates and I've discovered that it's pretty easy to get my IP address blocked, even at a modest access rate. Is there an official HN policy about robots or at least some explanation about the limits?

I notice that https://news.ycombinator.com/robots.txt declares a Crawl-delay of 30, meaning a remote client should make no more than one request every 30 seconds. However, I've managed to get myself blocked even with much longer delays than that. I'd think this discrepancy would cause problems for search engine crawlers, unless HN grants them special exemptions.
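
For reference, here's roughly how I check the declared value (a minimal Node sketch; it assumes a Node build with a global fetch, and it only matches a bare "Crawl-delay: N" line rather than parsing the full robots.txt grammar):

    // Fetch HN's robots.txt and report the declared Crawl-delay.
    // Sketch only: assumes a global fetch is available, and matches a
    // simple "Crawl-delay: N" line rather than parsing per-agent rules.
    async function getCrawlDelay(host) {
      const res = await fetch(`https://${host}/robots.txt`);
      const text = await res.text();
      const match = text.match(/^Crawl-delay:\s*(\d+)/im);
      return match ? parseInt(match[1], 10) : null;
    }

    getCrawlDelay('news.ycombinator.com')
      .then(delay => console.log(`Declared Crawl-delay: ${delay} seconds`));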

FWIW, my observation is that HN seems to block my IP address after roughly 1000 requests, even when those requests are spread over a very long period. In my latest test, polling at a 2-minute interval got me blocked after 34 hours. Maybe others can share their experiences?
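
For anyone who wants to reproduce this, my test loop is roughly the following (a sketch, again assuming a Node build with a global fetch; treating a non-200 status or a connection error as the block is an assumption on my part, since the exact blocked response may differ):

    // Poll the front page at a fixed interval and count requests, to see
    // roughly how many get through before the block kicks in.
    // Assumption: the block shows up as a non-200 status or a dropped
    // connection; the exact blocked response may look different.
    const TARGET = 'https://news.ycombinator.com/';
    const INTERVAL_MS = 2 * 60 * 1000; // the 2-minute interval from the test above

    let count = 0;
    async function poll() {
      count += 1;
      try {
        const res = await fetch(TARGET);
        console.log(`request ${count}: HTTP ${res.status}`);
      } catch (err) {
        console.log(`request ${count}: ${err.message}`);
      }
    }
    setInterval(poll, INTERVAL_MS);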



There's an official API that's much easier to work with and doesn't have aggressive rate limiting in place: http://hnsearch.com/api


I wouldn't necessarily say it's much easier to work with. I couldn't manage to pull down anything representing the front page, and from what I could tell after a lot of searching, no one else could either. So yes, you have access to the data, but for 99% of uses it's either too old or not representative of what you'd see in a browser.

It's an awesome resource for querying and running analytics, but not so much for an HN client.


Ah yes, I saw this. According to a post on their forums, the ThriftDB index runs about 15 minutes behind HN itself. Not great for my particular application (which checks for updates), but no doubt useful for other things.


BTW: there is a limit of 1000 results per search.
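
Concretely, paging stops producing anything new once you've pulled ~1000 items for a query, no matter how many total hits there are. Something like this shows it (a sketch; the start/limit parameter names and the { "results": [...] } response shape are from memory of the docs, so treat them as assumptions):

    // Page through search results for one query; the API caps the total
    // at ~1000 regardless of how many hits exist.
    // Assumptions: the "start"/"limit" parameter names and a JSON
    // response shaped like { results: [...] }; from memory, not verified.
    const BASE = 'http://api.thriftdb.com/api.hnsearch.com/items/_search';

    async function pageAll(query) {
      const items = [];
      for (let start = 0; start < 1000; start += 100) {
        const res = await fetch(`${BASE}?q=${encodeURIComponent(query)}&start=${start}&limit=100`);
        const data = await res.json();
        if (!data.results || data.results.length === 0) break;
        items.push(...data.results);
      }
      return items;
    }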


You can mitigate that by spidering users and searching domains to build a more complete database -

http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

I wrote a small NodeJS spider if anyone's interested -

https://github.com/benlowry/hnsubmitterstats
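
The core of it is just a per-user search, roughly like this (a sketch; the filter[fields][username] parameter is an assumption reconstructed from the truncated URLs above, so check the repo for the real query strings):

    // Fetch one user's items from the search API; the building block of
    // the spider. Assumption: filter[fields][username] is the right
    // parameter, reconstructed from the truncated URLs above.
    const BASE = 'http://api.thriftdb.com/api.hnsearch.com/items/_search';

    async function itemsByUser(username) {
      const url = `${BASE}?filter[fields][username]=${encodeURIComponent(username)}&limit=100`;
      const res = await fetch(url);
      const data = await res.json();
      return data.results || [];
    }

    // Seed with one user, then queue every new username seen in the results.
    itemsByUser('jkarneges').then(items => console.log(items.length, 'items'));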


Coincidentally, I was looking into that option as well, and it seems perfect for my needs :) Thanks!


In my experience, the actual allowed crawl interval is somewhere between 120 and 180 seconds.


I'm actually trying a test with a 180-second interval now. We'll see if it lasts longer than 51 hours, which is when the ~1000-request pattern above would predict a block at this rate.



