

Ask HN: Why am I not permitted to crawl HN? - frostnovazzz

I was trying to crawl all the articles on HN to do some data analysis just for fun. But it seems my IP was banned after I'd crawled a few hundred pages (at a very low rate of a few seconds per page).<p>I believe it would be a good thing if HN allowed people to get its data for fun side projects or, even better, made it publicly available. It would encourage the hacker culture of doing fun and innovative things.<p>What do you guys think?
======
VierScar
Firstly, read the robots.txt page (<https://news.ycombinator.com/robots.txt>).
It specifies a crawl-delay of _30 seconds_, so don't crawl faster than that.
Secondly, obey the robots.txt page. Don't crawl pages listed in their Disallow
rules.
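
To make both points concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt text below is an illustrative sample, not a copy of HN's actual file; the `/private/` path, the `example.com` URLs, and the helper names (`build_parser`, `allowed_urls`, `polite_delay`, `crawl`) are all made up for the example. The 30-second fallback is just the value quoted above.

```python
import time
import urllib.robotparser

# Illustrative robots.txt content (an assumption, NOT the real HN file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the Disallow rules permit."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent="*", default=30.0):
    """Return the Crawl-delay in seconds, or a default if none is set."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(parser, urls, fetch, user_agent="*", sleep=time.sleep):
    """Fetch each permitted URL, pausing for the crawl delay between requests.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns whatever the caller wants to keep for that page.
    """
    delay = polite_delay(parser, user_agent)
    results = []
    for index, url in enumerate(allowed_urls(parser, urls, user_agent)):
        if index > 0:
            sleep(delay)  # wait between requests, not before the first one
        results.append(fetch(url))
    return results
```

To use it against a live site you would download the site's robots.txt first (for example with `RobotFileParser.set_url` followed by `read`) instead of parsing a hard-coded string, and pass your own `fetch` function.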

~~~
mooism2
TIL robots.txt has a crawl-delay directive. Thank you.

------
rcsorensen
<https://news.ycombinator.com/item?id=4694308>

pg says that <http://hnsearch.com> is the place to get data.

------
orangethirty
I think you should look at this from YC's point of view. Letting people crawl
the site at will would force PG to scale the site up. For what?
So you can get your free data?

------
jcla1
Here's a guide on how to get HN data quickly without being banned:
<http://jcla1.com/blog/2013/05/13/crawling-hackernews/> Disclaimer: It's my
own blog.

