I mean, they state that their crawl average is 1 req/sec and can be modified if you contact them. The problem is that other people can modify the crawl rate (somebody set ours to 4/sec).
And remember, that's just the average. The distributed nature of their system means that it's ridiculously spiky. Furthermore, they don't respect the nofollow tags we put in place to keep bots out of bad areas, so their crawlers got into very unoptimized areas of the site.
Overall, though, we turned it into a positive by ramping up our analysis tools and creating a much stronger robots.txt. Still... I really don't like 80legs and bristle when people reference them as someone doing 'good'.
1. We identify ourselves as user-agent 008. The proper way to track a bot is through user-agent, not IP address.
2. We never say that the average rate for all sites is 1 req/sec. Many sites have higher rates. We make a best guess at what load a site can handle based on a mix of Alexa/Quantcast/Compete stats.
3. We always respond to requests from webmasters to reduce the crawl rate. Within 5 minutes of getting your message, we did this. What other bot does this? Oh yeah, none of them.
4. The nofollow tag is not a standard way to tell bots not to follow a link. That is what robots.txt is for. Nofollow was implemented for Google specifically and has caught on as a (less common than you think) tag. Robots.txt is the right way to talk to bots.
5. You know what happens when people don't use 80legs? They figure out some way to get the information they want and soak up your bandwidth anyway. A webmaster is fooling himself if he thinks he can stop people from crawling a site. It's going to happen. At least with 80legs the webmaster has a responsible company with real people to talk to so that the crawl can be managed.
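For what it's worth, the robots.txt route discussed in point 4 can address 80legs' crawler specifically via the "008" user-agent it announces. A minimal sketch (Crawl-delay is a de-facto extension that not every bot honors, and whether 008 does is an assumption here; the /search/ path is a hypothetical "unoptimized" area):

```
# robots.txt at the site root
User-agent: 008
Crawl-delay: 10      # ask for at most one request every 10 seconds (de-facto extension)
Disallow: /search/   # hypothetical expensive area to keep the bot out of

User-agent: *
Disallow:
```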
2. You should try reading your own FAQ. http://80legs.pbworks.com/FAQ - see Rate limiting section.
3/4/5. The only bot I care about is Googlebot. And they do use nofollow that way, fortunately the other major bots do as well... so things have been working pretty darn great for everybody. Every other bot I don't care for, don't need, and don't want. If someone legit wants access to our data they can use our totally open and free API that actually gives a lot better information.
In principle I think 80legs can be used for good. However, it's like gunpowder: I question the legitimacy of real-world practical use simply because it can be abused so easily. I could very well be totally off base here, but that's what lack of information + bad experience gets you.
edit: And shouldn't the fact that you "...usually get comments like this on webmaster forums..." be a hint that maybe there's a real issue with your service?
2. Per another commenter, I can see how the language may be confusing; we'll change it.
3. Webmasters tend to rush to grab their pitchforks instead of thinking things through. We follow all the standard, accepted rules.
4. Some people may not want to use your API because they are interested in more than just your site. Crawling is a one-stop solution to get information from a ton of sites, instead of implementing an API for a single site that is just one data point on the web.
However, something that may have a bigger effect on your business is the seemingly dismissive and patronizing attitude you have towards webmasters. At this stage of the game you need us way more than we need you. I'm sure in this exchange there's defensiveness on your end because I attacked your company, and you've probably dealt with many PO'd guys like me (patience wears thin easily).
However, that's the nature of the web. Us webmasters grab our pitchforks and bitch and moan. I imagine if a large enough contingent of webmasters doesn't like you guys it WILL start to have an effect on your overall business.
In which case, it's probably smart business to either: A) play nice, even when we're not, and/or B) demonstrate a clear business case why it's better for us to expose our data to your crawlers.
Edit: I should note the response would have been even faster if you had emailed us from the contact page that's on the website we link to in our request header. Tweets go to us marketing/biz-dev guys. The contact form goes to everyone in the company. I'm not sure how much nicer we can be beyond doing everything we can to be reachable and identifiable.
Are you going to reimburse us the damages? Obviously not.
I want you to realize that the timeliness of your response is completely irrelevant to the matter at hand.
Proper net etiquette would be to have a reasonable limit of an ACTUAL 1 req/sec, not an average. Furthermore, allow only the webmaster of the site to increase the rate.
Also, while nofollow sculpting is not the original intent of the tag, MANY people use it as such, so respect that directive.
Also, you can have your system be aware when a site is slow to respond and adjust crawl rates/times as necessary.
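The adjust-to-server-load idea can be sketched in a few lines. This is a toy illustration of the suggestion, not 80legs' actual implementation; the class name, thresholds, and back-off factors are all made up for the example:

```python
class AdaptiveCrawler:
    """Toy sketch: back off when a site responds slowly, ease back when it
    recovers. All numbers here are illustrative assumptions."""

    def __init__(self, base_delay=1.0, max_delay=30.0):
        self.base_delay = base_delay   # target seconds between requests
        self.max_delay = max_delay     # never back off beyond this
        self.delay = base_delay

    def record_response(self, response_seconds):
        if response_seconds > 2.0:
            # Server is struggling: double the delay between requests.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Server is healthy: drift back toward the base rate.
            self.delay = max(self.delay * 0.9, self.base_delay)


crawler = AdaptiveCrawler()
crawler.record_response(5.0)   # slow response: delay doubles from 1.0 to 2.0
crawler.record_response(0.2)   # fast response: delay eases back toward 1.0
print(crawler.delay)
```

The point is simply that the crawler, not the webmaster, has the information needed to notice a struggling server and react.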
Also, user-agent protections are not enough. Many webservers have built-in automatic IP-based attack detection/prevention, but as far as I know, most don't have user-agent-based systems. The web is distributed now and webservers have to react to that new reality, but the onus is on you to be aware that many people are still running legacy systems.
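That said, a front-end rule can add user-agent filtering even on a legacy stack. A minimal nginx sketch, assuming the bot really does send the "008" token in its User-Agent header (the loose substring match is an assumption and would catch any UA containing it, so tighten the regex before using it for real):

```nginx
# Inside a server block: deny requests whose User-Agent contains "008".
if ($http_user_agent ~ "008") {
    return 403;
}
```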
If you guys did that from the beginning there wouldn't be an issue. And I'm sure all those complaints you see on other webmaster forums would largely go away as well.
There are many things you can/could've done to prevent this problem. There are things we could have done as well, but placing the blame on us is akin to blaming a robbed family for not locking the door. Sure, said family should have locked the door, but that in no way excuses the robber of their actions.
I know it's a lot easier to ask for forgiveness than permission, but when you're performing the equivalent of a DDoS attack and causing immediate and direct financial damages to a company... that doesn't fly.
First of all, one hit per second, or even 4, is ridiculously low, and most other search engines crawl at a much higher rate.
Second, you really should learn how to use robots.txt and what nofollow is really for.
As for 80legs, they should start caching pages so that one client crawling a page means their other clients won't revisit that same page for a while.
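The caching idea amounts to a shared fetch cache with a time-to-live: later clients get the stored copy instead of generating a second hit on the origin site. A minimal sketch under that assumption (not 80legs' actual design; names and the TTL are made up):

```python
import time

class CrawlCache:
    """Toy shared cache: one real fetch per URL per TTL window."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # url -> (fetched_at, body)

    def get(self, url, fetch):
        now = time.time()
        entry = self.store.get(url)
        if entry and now - entry[0] < self.ttl:
            return entry[1]          # cache hit: no request to the origin site
        body = fetch(url)            # cache miss: one real fetch
        self.store[url] = (now, body)
        return body


calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>ok</html>"

cache = CrawlCache(ttl_seconds=3600)
cache.get("http://example.com/", fake_fetch)
cache.get("http://example.com/", fake_fetch)  # second client: served from cache
print(len(calls))  # the origin site was fetched only once
```

Staleness is the trade-off: clients wanting fresh data would need a shorter TTL or a bypass, which is presumably why a crawler vendor would make this tunable.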
If a relatively low-volume crawler is already a DDoS attack in your book, then I hope you are never subjected to a real one, because it will be curtains for your site in a heartbeat.
User-agent is the accepted and standardized way to communicate with bots. Getting angry at us for following robots.txt is like getting angry at someone for showing you her ID when she goes through airport security.
From our perspective, your house put up a sign that said "Hey everyone come in 4 at a time".
80legs defaults to an average rate limit of 1 page per domain per second, but this rate limit may be increased over time for certain domains.
And then it goes on to mention that people can contact you to increase the limit. This can be interpreted many ways, and the parent poster interpreted it differently than you do.