

Ask HN: Indexing/Crawling the web? - smallegan

Are there any hackers here who have experience indexing/crawling the web? I am wondering if anyone has any data on what the bandwidth costs. Even a text-only crawl seems like it could be costly, and I have a couple of ideas that would involve playing with data extracted from a crawl. I have no experience in this area, so any information would be greatly appreciated! Thanks!
======
allenp
Not sure if this helps, but it could save you some time / money:
<http://www.80legs.com>. You should be able to set up a sample crawl for free
to start getting a feel for the size of the dataset.
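
If it helps, here is a rough sketch of how a sample crawl could be turned into a bandwidth estimate. It just scales the average bytes per page from the sample up to a bigger crawl. The function name `estimate_crawl_transfer` is made up for illustration, and every number in the usage lines is a placeholder, not a measurement or a real price.

```python
def estimate_crawl_transfer(sample_pages, sample_bytes, target_pages, price_per_gb):
    """Extrapolate transfer size and cost from a sample crawl.

    Returns (estimated_gb, estimated_cost), using 1 GB = 10**9 bytes.
    Assumes the sample's average page size holds for the larger crawl,
    which may not be true in practice.
    """
    if sample_pages <= 0:
        raise ValueError("sample_pages must be positive")
    avg_bytes_per_page = sample_bytes / sample_pages
    estimated_gb = avg_bytes_per_page * target_pages / 1e9
    return estimated_gb, estimated_gb * price_per_gb


# Placeholder inputs only: a 1,000-page sample that transferred 50 MB,
# scaled to 1,000,000 pages at a made-up price of 0.10 per GB.
gb, cost = estimate_crawl_transfer(1_000, 50_000_000, 1_000_000, 0.10)
print(f"about {gb:.1f} GB, costing about {cost:.2f}")
```

Plug in whatever your own sample crawl and your provider's pricing actually say; the estimate is only as good as the sample is representative.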

------
Travis
It sounds like you're pretty green in this area, so I'll give you a few tips
from what I've read.

* You may not be allowed to crawl everything, and your use of that data may not always be in line with sites' terms of service. Just be aware.
* It is indeed an ungodly task to crawl the web. To do the whole thing, you'd probably need thousands of servers (to crawl in a reasonable amount of time).
* 80legs offers you their crawling infrastructure using cloud based crawlers. Check them out.

What are you wanting to crawl, and what is the ultimate use?

------
NonEUCitizen
Make sure you respect robots.txt, including Crawl-delay:

<http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive>
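
For what it's worth, a minimal sketch of checking both things with Python's standard `urllib.robotparser` might look like the following. The robots.txt text, the `example-crawler` user-agent, the example.com URLs, and the helper names `load_rules` and `polite_delay` are all assumptions for illustration. A real crawler would download each site's own robots.txt and actually sleep between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so this sketch
# runs without any network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "example-crawler"  # hypothetical user-agent name


def load_rules(robots_text):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, user_agent, default=1.0):
    """Seconds to wait between requests; falls back to an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = load_rules(ROBOTS_TXT)
for url in ["http://example.com/index.html", "http://example.com/private/x"]:
    if rules.can_fetch(USER_AGENT, url):
        wait = polite_delay(rules, USER_AGENT)
        # A real crawler would fetch here, then time.sleep(wait).
        print(f"would fetch {url}, then wait {wait} s")
    else:
        print(f"skipping {url} (disallowed by robots.txt)")
```

The 1-second fallback when no Crawl-delay is given is just a guess at something polite, not a standard value.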

