Don't get me started on 80legs. They took down our servers earlier this week. We couldn't even identify them by IP address because their distributed system runs on ordinary computers. If anything, 80legs is the ultimate DDoS machine.
I mean, they state that their crawl average is 1 req/sec and can be modified if you contact them. The problem is that other people can modify the crawl rate (somebody set ours to 4/sec).
And remember, that's just the average. The distributed nature of their system means that it's ridiculously spiky. Furthermore, they don't respect the nofollow tags that we put in place to keep bots out of bad areas, so their crawlers got into very unoptimized areas of the site.
Overall, though, we turned it into a positive by ramping up our analysis tools and creating a much stronger robots.txt. Still... I really don't like 80legs and bristle when people reference them as someone doing 'good'.
So.. we usually get comments like this on webmaster forums, and I usually just let them go, but since HN is "my" forum, I feel like saying something.
1. We identify ourselves as user-agent 008. The proper way to track a bot is through user-agent, not IP address.
2. We never say that the average rate for all sites is 1 req/sec. Many sites have higher rates. We do a best guess on what load a site can handle based on a mix of Alexa/Quantcast/Compete stats.
3. We always respond to requests from webmasters to reduce the crawl rate. Within 5 minutes of getting your message, we did this. What other bot does this? Oh yeah, none of them.
4. The nofollow tag is not a standard way to tell bots to not go to a link. That is what robots.txt is for. Nofollow was implemented for Google specifically and has caught on as a (not as common as you think) tag. Robots.txt is the right way to talk to bots.
5. You know what happens when people don't use 80legs? They figure out some way to get the information they want and soak up your bandwidth anyway. A webmaster is fooling himself if he thinks he can stop people from crawling a site. It's going to happen. At least with 80legs the webmaster has a responsible company with real people to talk to so that the crawl can be managed.
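To make point 4 concrete, here is a minimal sketch of how a crawler-side robots.txt check might look, using Python's standard `urllib.robotparser`. The rules, paths, and example.com URLs are made up for illustration; only the 008 user-agent name comes from this thread. Note that `Crawl-delay` is a non-standard extension, and nothing here says which bots actually honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the 008 user-agent out of one expensive
# directory, and ask every other crawler to pause between requests.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /search/

User-agent: *
Crawl-delay: 1
Disallow: /admin/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The 008 group blocks /search/ for that bot only.
blocked_for_008 = not parser.can_fetch("008", "http://example.com/search/foo")
# Other bots fall through to the "*" group, which only blocks /admin/.
allowed_for_others = parser.can_fetch("Googlebot", "http://example.com/search/foo")
```

A bot with its own group (008 here) ignores the `*` group entirely, so any rule meant for every crawler has to be repeated inside the bot-specific group.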
1. This requires prior knowledge of your existence. It wasn't until we did proper analysis that we were able to identify the culprit behind the "attack". This did expose a weakness in our monitoring systems, so we've obviously learned from this experience, but I can tell you that there are MANY less savvy webmasters out there whose servers can easily be taken down by your service.
3/4/5. The only bot I care about is Googlebot. And they do use nofollow that way. Fortunately, the other major bots do as well... so things have been working pretty darn great for everybody. Every other bot I don't care for, don't need, and don't want. If someone legit wants access to our data, they can use our totally open and free API, which actually gives much better information.
In principle I think 80legs can be used for good. However, it's like gunpowder: I question the legitimacy of its real-world practical use simply because it can be abused so, so easily. I could very well be totally off base here, but that's what lack of information + bad experience gets you.
edit: And shouldn't the fact that you "...usually get comments like this on webmaster forums..." be a hint that maybe there's a real issue with your service?
1. We identify ourselves very clearly in our request header through our user-agent, including a link to our website. It's not "savvy" to look for this. It's basic webmaster knowledge.
2. Per another commenter, I can see how the language may be confusing; we'll change it.
3. Webmasters tend to rush to grab their pitchforks instead of thinking things through. We follow all the standard, accepted rules.
4. Some people may not want to use your API because they are interested in more than just your site. Crawling is a one-stop solution to get information from a ton of sites, instead of implementing an API for a single site that is just one data point on the web.
At the end of the day users of your service won't be accessing our data, which means they'll be missing out on some data, and we'll be missing out on being included in potentially some cool applications. This spat isn't going to have a real effect on either of our businesses in the long run.
However, something that may have a bigger effect on your business is the seemingly dismissive and patronizing attitude you have towards webmasters. At this stage of the game you need us way more than we need you. I'm sure in this exchange there's defensiveness on your end because I attacked your company, and you've probably dealt with many PO'd guys like me (patience wears thin easily).
However, that's the nature of the web. Us webmasters grab our pitchforks and bitch and moan. I imagine if a large enough contingent of webmasters doesn't like you guys it WILL start to have an effect on your overall business.
In which case, it's probably smart business to either: A) play nice, even when we're not, and/or B) demonstrate a clear business case why it's better for us to expose our data to your crawlers.
The thing is I was playing nice - responding immediately to your Twitter message, changing the rate asap. But then I come onto HN and see a complaint from you without mentioning we responded quickly. Not cool in my book. You're saying people should play nice when you're just telling one side of the story.
Edit: I should note the response would have been even faster if you had emailed us from the contact page that's on the website we link to in our request header. Tweets go to us marketing/biz-dev guys. The contact form goes to everyone in the company. I'm not sure how much nicer we can be beyond doing everything we can to be reachable and identifiable.
By the time we identified you as the problem the damage was already done. In fact, when combined with the potential performance penalty that Google may apply (could be 5%, 10%, for one week, two weeks... who knows), the actual amount of damage in lost traffic/business to us could easily run into the thousands.
Are you going to reimburse us for the damages? Obviously not.
I want you to realize that the timeliness of your response is completely irrelevant to the matter at hand.
Proper net etiquette would be to have a reasonable limit of an ACTUAL 1 req/sec, not an average. Furthermore, allow only the webmaster of the site to increase the rate.
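To show why an average and an actual cap are different things, here is a rough Python sketch. The function names and the sample timestamps are invented for illustration and are not anything 80legs is known to use. It measures the peak rate of a bursty schedule that still averages 1 req/sec, then spaces the same requests out under a hard minimum interval.

```python
def average_rate(timestamps, duration):
    """Requests per second averaged over the whole observation period."""
    return len(timestamps) / duration


def peak_rate(timestamps, window=1.0):
    """Largest number of requests that fall inside any `window`-second span."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(ts):
        # Drop requests that are a full window (or more) older than this one.
        while t - ts[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def enforce_hard_cap(timestamps, min_interval=1.0):
    """Delay requests so consecutive ones are at least `min_interval` apart."""
    scheduled = []
    for t in sorted(timestamps):
        if scheduled and t < scheduled[-1] + min_interval:
            t = scheduled[-1] + min_interval
        scheduled.append(t)
    return scheduled


# Hypothetical bursty crawl: two bursts of four requests within 8 seconds.
burst = [0.0, 0.25, 0.5, 0.75, 4.0, 4.25, 4.5, 4.75]
capped = enforce_hard_cap(burst)
```

Over 8 seconds the bursty schedule averages exactly 1 req/sec, yet it peaks at 4 requests in a single second. The capped schedule has the same average and never exceeds 1.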
Also, while nofollow sculpting is not the original intention of the metatag, MANY people use it as such, so respect that directive.
Also, you can have your system be aware when a site is slow to respond and adjust crawl rates/times as necessary.
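As a sketch of what that kind of awareness could look like, the hypothetical function below picks the next crawl delay from the last response time: back off quickly when the site is slow, and recover gradually when it is fast. The threshold, step, and bounds are arbitrary assumptions, not values from any real crawler.

```python
def next_delay(current_delay, response_seconds,
               slow_threshold=2.0, step=0.5,
               min_delay=1.0, max_delay=60.0):
    """Return the delay (in seconds) to wait before the next request.

    A slow response doubles the delay (capped at max_delay); a fast one
    shortens it by `step` (never going below min_delay).
    """
    if response_seconds > slow_threshold:
        return min(current_delay * 2, max_delay)
    return max(current_delay - step, min_delay)
```

Doubling on slowness and easing off in small steps is one common choice because it gets out of the way of a struggling server fast while only creeping back toward the baseline rate.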
Also, user-agent protections are not enough. Many webservers have built-in automatic IP-based attack detection/prevention, but as far as I know, most don't have user-agent based systems. The new virtual land is distributed and webservers have to react to that new reality, but the onus is on you to be aware that many people are still running legacy systems.
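For comparison, a user-agent based limit on the server side does not have to be complicated. The class below is a minimal, hypothetical sketch (not a feature of any particular webserver) of a sliding-window counter keyed by user-agent string instead of IP address. It assumes the caller passes in the current time in seconds.

```python
from collections import defaultdict, deque


class UserAgentLimiter:
    """Allow at most `max_requests` per `window_seconds` for each user-agent."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user-agent -> recent request times

    def allow(self, user_agent, now):
        hits = self._hits[user_agent]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = UserAgentLimiter(max_requests=2, window_seconds=1.0)
```

Because the key is the user-agent, requests arriving from many different IP addresses still count against the same budget. The obvious limitation is that this only works for bots that identify themselves honestly.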
If you guys did that from the beginning there wouldn't be an issue. And I'm sure all those complaints you see on other webmaster forums would largely go away as well.
There are many things you can/could've done to prevent this problem. There are things we could have done as well, but placing the blame on us is akin to blaming a robbed family for not locking the door. Sure, said family should have locked the door, but that in no way excuses the robber's actions.
I know it's a lot easier to ask for forgiveness than permission, but when you're performing the equivalent of a DDoS attack and causing immediate and direct financial damages to a company... that doesn't fly.
The reason the request rate was at 4 was that at some point your site responded to 4 req/sec without problems.
User-agent is the accepted and standardized way to communicate with bots. Getting angry at us for following robots.txt is like getting angry at someone for showing you her ID when she goes through airport security.
From our perspective, your house put up a sign that said "Hey everyone come in 4 at a time".