

Ask HN: Hacker News rate limits and robots policy? - jkarneges

I'm writing a program that checks Hacker News for updates and I've discovered that it's pretty easy to get my IP address blocked, even at a modest access rate. Is there an official HN policy about robots, or at least some explanation of the limits?

I notice that https://news.ycombinator.com/robots.txt declares Crawl-delay to be 30, meaning that a remote entity should not make a request more than once every 30 seconds. However, I've managed to get myself blocked even with much longer delays than this. I'd think this misinformation would cause problems with search engine crawlers, unless HN gives them special exemptions.

FWIW, my observation is that HN seems to block my IP address after it has made ~1000 requests, even if this is over a very long period of time. In my latest test, polling on a 2-minute interval got me blocked after 34 hours. Maybe others can share their experiences?
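For reference, a crawler that wants to honor the advertised Crawl-delay can read it with Python's standard library. A minimal sketch, with the robots.txt body hardcoded to match what the post reports (the live file may differ) rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# Mirrors the directive the post reports seeing at
# https://news.ycombinator.com/robots.txt (an assumption; check the live file).
robots_txt = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Seconds a polite crawler should wait between requests, or None if
# no Crawl-delay applies to this user agent.
delay = parser.crawl_delay("AnyBot")
print(delay)
```

In a real poller you would call `parser.set_url(...)` and `parser.read()` to fetch the file, then sleep at least `delay` seconds between requests.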
======
benologist
There's an official API that's much easier to work with and doesn't have
aggressive precautions in place -
[http://hnsearch.com/api](http://hnsearch.com/api)

~~~
minimaxir
BTW: there is a limit of 1000 results per search.

~~~
benologist
You can mitigate that by spidering users and searching domains to build a more
complete database -

[http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...](http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][type]=submission&filter[fields][username]=minimaxir)

[http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...](http://api.thriftdb.com/api.hnsearch.com/items/_search?filter[fields][type]=submission&q=arstechnica.com)

I wrote a small NodeJS spider if anyone's interested -

[https://github.com/benlowry/hnsubmitterstats](https://github.com/benlowry/hnsubmitterstats)
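The workaround described here amounts to enumerating many narrow queries (one per user, one per domain), each of which fits under the 1000-result cap. A sketch of building those request URLs, using only the endpoint and parameter names visible in the links above (the hnsearch/ThriftDB API may no longer respond this way):

```python
from urllib.parse import urlencode

BASE = "http://api.thriftdb.com/api.hnsearch.com/items/_search"

def submissions_by_user(username):
    # Mirrors the first link above: all submissions from one user.
    params = {
        "filter[fields][type]": "submission",
        "filter[fields][username]": username,
    }
    return BASE + "?" + urlencode(params)

def submissions_for_domain(domain):
    # Mirrors the second link above: full-text search for a domain.
    params = {
        "filter[fields][type]": "submission",
        "q": domain,
    }
    return BASE + "?" + urlencode(params)

print(submissions_by_user("minimaxir"))
print(submissions_for_domain("arstechnica.com"))
```

A spider would iterate these URLs over a list of usernames or domains collected elsewhere, merging results into a local database.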

~~~
minimaxir
Coincidentally, I was looking into that option as well, and that seems perfect
for my needs :) thanks!

------
raquo
From my experience the actual allowed crawl interval is somewhere between 120
and 180 seconds.
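If those numbers hold, a poller that wants to stay on the safe side could sleep a fixed floor plus random jitter between requests, so polls don't land on an exact cadence. A hypothetical sketch; the 180-second floor comes from the observation above, not from any official policy:

```python
import random

# Conservative floor based on the interval reported in this thread
# (an observation, not an official limit).
MIN_INTERVAL = 180.0

def next_delay(min_interval=MIN_INTERVAL, jitter=30.0):
    """Seconds to sleep before the next request: the floor plus a
    random amount of jitter in [0, jitter]."""
    return min_interval + random.uniform(0.0, jitter)

# A real poller would loop: fetch the page, then time.sleep(next_delay()).
```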

~~~
jkarneges
I'm actually trying a test with a 180-second interval now. We'll see if this
lasts longer than 51 hours.

