I'm writing a program that checks Hacker News for updates, and I've discovered that it's pretty easy to get my IP address blocked, even at a modest request rate. Is there an official HN policy about robots, or at least some explanation of the limits?
I notice that https://news.ycombinator.com/robots.txt declares a Crawl-delay of 30, meaning a remote client should not make more than one request every 30 seconds. However, I've managed to get myself blocked with delays much longer than that. If the declared value doesn't reflect the actual limit, I'd expect that to cause problems for search-engine crawlers, unless HN gives them special exemptions.
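For reference, the declared value is easy to check programmatically. Here's a minimal Python sketch using the standard urllib.robotparser module; the "*" user agent and the /newest path are just examples, not anything specific to my program:

    # Minimal sketch (Python 3.6+): read HN's robots.txt and report the
    # declared Crawl-delay. The "*" user agent and /newest path are examples.
    from urllib import robotparser

    rp = robotparser.RobotFileParser("https://news.ycombinator.com/robots.txt")
    rp.read()

    print("Crawl-delay:", rp.crawl_delay("*"))  # prints 30 as of this writing
    print("Can fetch /newest:",
          rp.can_fetch("*", "https://news.ycombinator.com/newest"))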
FWIW, my observation is that HN seems to block my IP address after roughly 1,000 requests, even when those are spread over a very long period. In my latest test, polling at a 2-minute interval got me blocked after 34 hours, which works out to about 1,020 requests and fits that threshold. Maybe others can share their experiences?
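For anyone who wants to reproduce this, below is a minimal sketch of the kind of polling loop I mean, not my exact program. The User-Agent string, the /newest URL, and treating any HTTP error as "probably blocked" are my own assumptions, not documented HN behavior:

    # Sketch of a polite poller: one request every 2 minutes (well above the
    # declared 30s Crawl-delay), with a long back-off whenever a request is
    # refused. Treating an HTTP error as "probably blocked" is an assumption.
    import time
    import urllib.error
    import urllib.request

    POLL_SECONDS = 120  # 2-minute interval
    URL = "https://news.ycombinator.com/newest"

    def fetch_once(url):
        req = urllib.request.Request(url, headers={"User-Agent": "hn-update-checker"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read()

    while True:
        try:
            status, body = fetch_once(URL)
            print(time.strftime("%H:%M:%S"), "HTTP", status, len(body), "bytes")
        except urllib.error.HTTPError as err:
            print("request refused with HTTP", err.code, "- backing off for an hour")
            time.sleep(3600)
        time.sleep(POLL_SECONDS)

Even a loop this conservative is what got blocked after ~34 hours in my test.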