Ask HN: Why am I not permitted to crawl HN?
13 points by frostnovazzz on May 20, 2013 | 5 comments
I was trying to crawl all the articles on HN to do some data analysis, just for fun. But it seems my IP was banned after I had crawled a few hundred pages (at a very low rate of one page every few seconds).

I believe it would be a good thing if HN allowed people to get its data for fun side projects, or even better, made it publicly available. It would encourage the hacker culture of doing fun and innovative things.

What do you guys think?




Firstly, read the robots.txt page (https://news.ycombinator.com/robots.txt). It states a crawl-delay of 30 seconds, so don't crawl faster than that. Secondly, obey the robots.txt page: don't crawl pages listed in its Disallow rules.
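To make the advice above concrete, here is a minimal sketch of checking a robots.txt file before crawling, using Python's standard urllib.robotparser. The sample robots.txt text, the example.com URLs, the user agent name, and the helper names build_parser and plan_crawl are all assumptions for illustration; the real HN file may contain different rules, so fetch and parse the live file rather than relying on this sample.

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only.
# The real file at news.ycombinator.com/robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def plan_crawl(parser, urls, user_agent="my-hobby-crawler", default_delay=30):
    """Return (delay_in_seconds, allowed_urls) for the given user agent.

    Falls back to default_delay when the file declares no Crawl-delay.
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return delay, allowed


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    delay, allowed = plan_crawl(
        parser,
        ["https://example.com/item?id=1", "https://example.com/private/page"],
    )
    print(delay, allowed)
```

In a real crawler you would sleep for at least the returned delay (for example with time.sleep(delay)) between requests, and only request the URLs in the allowed list.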


TIL robots.txt has a crawl-delay directive. Thank you.



I think you should understand things from the POV of YC. Letting people crawl the site at will would force PG to scale the site to a bigger size. For what? So you can get your free data?


Here's a guide on how to get HN data fast without being banned: http://jcla1.com/blog/2013/05/13/crawling-hackernews/ Disclaimer: It's my own blog.



