
Ask HN: How do I crawl responsibly? - edward_rolf
I've been developing an in-process search engine for a while. Now it's time to experiment with distributing it over many machines and also serve up a public GUI, but I am wary because I have never crawled the web before.

To start with, I'm just going to index as much data as I can fit on an entry-level cloud machine, and because I am very poor I shall be asking for donations to further the scope of the index.

Say I start with Wikipedia, Project Gutenberg, and a couple of news sites. The first two will be easy: they have dumps of their data, and I also don't think Wikipedia would mind at all if I put a tiny amount of pressure on their servers for the good cause of building a free, anonymous and open web search. But what about the rest of the internet? Will they mind?

People crawl and scrape the web all the time for different purposes. I'm looking for some advice so that I don't piss anyone off with my crawler. What tools/strategies do you suggest I use?

Cheers!
======
Petrakis
Check the site's robots.txt:

[https://en.wikipedia.org/wiki/Robots_exclusion_standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)
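Python's standard library has a parser for this format. A minimal sketch (parsing an inline robots.txt body for illustration; a real crawler would point `set_url()` at `https://example.com/robots.txt` and call `read()`):

```python
import urllib.robotparser

# A sample robots.txt body; in practice you would fetch the real one.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask before you fetch, and honor any Crawl-delay the site declares.
print(rp.can_fetch("MyCrawler/0.1", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyCrawler/0.1", "https://example.com/index.html"))    # True
print(rp.crawl_delay("MyCrawler/0.1"))                                    # 5
```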

But basically scraping is a loophole: you are not doing anything wrong, because
everything is accessible with a browser. Still, web admins get upset.

For tools, I currently use Selenium with PhantomJS, but you can also use Scrapy.
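If you go the Scrapy route, politeness is mostly a matter of settings. A possible `settings.py` fragment (the bot name and contact URL are placeholders; the setting names themselves are Scrapy's documented ones):

```python
# Polite-crawler settings for a Scrapy project (settings.py).
BOT_NAME = "mycrawler"  # hypothetical project name

# Identify yourself and give admins a way to reach you.
USER_AGENT = "MyCrawler/0.1 (+https://example.com/bot-info)"

ROBOTSTXT_OBEY = True                # honor robots.txt automatically
DOWNLOAD_DELAY = 2                   # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # never hammer one host in parallel
AUTOTHROTTLE_ENABLED = True          # back off when responses slow down
```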

As long as you don't claim a lot of bandwidth that other users could use, you
should be fine.
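One simple way to keep your bandwidth footprint small is a per-domain throttle. A sketch (the `DomainThrottle` class is a hypothetical helper, not part of any library):

```python
import time

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_request = {}  # domain -> monotonic timestamp of last fetch

    def wait(self, domain):
        """Sleep just long enough so min_delay has passed for this domain."""
        elapsed = time.monotonic() - self.last_request.get(domain, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_request[domain] = time.monotonic()

# Usage: call throttle.wait(domain) before each fetch.
throttle = DomainThrottle(min_delay=2.0)
```

Requests to different domains are not delayed against each other, so the crawler stays fast overall while each individual site sees at most one request every couple of seconds.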

~~~
edward_rolf
Thx!

