
MSNBOT must die - alrex021
http://blogs.perl.org/users/cpan_testers/2010/01/msnbot-must-die.html
======
gojomo
They should try to contact the crawling team directly. It's obviously a bug.

While it's satisfying to fix it with a block, or vent in a blog post or tweet,
actually getting to the bottom of the issue with the MS crawl team can fix the
problem for other sites, too -- so that's the most civil and community-spirited
response.

~~~
pierrefar
MSNBot has had many issues over the years. Many times MS folks denied them all
until someone actually caused a big fuss. Once it was me causing the fuss.

When you think about it, bing/live/whatever it's called today sends very
little traffic, and the crawler behind that traffic has a long history of what
you call bugs. The cost-benefit analysis IMHO points to a lot of cost for
essentially zero benefit.

That's why many webmasters block MSNBot.

------
ecaron
Their bots have just become incredibly greedy, while Google is becoming very
tidy. Despite growth in content on LinkUp.com, Google has gone from consuming
8 gigs of my bandwidth in September 2009 to being on track to use only 4 gigs
this month. Over the same period, MSNBot has gone from using 4 gigs to being
on track for 15 gigs this month.

Sure Bing may be on a growth spurt (<http://www.adotas.com/2010/01/bing-hits-
query-growth-spurt/>), but I don't see any correlation between that and pwning
our servers...

~~~
jacquesm
Those two are not related; you could crawl all day long and never serve up a
single query.

Also, bing being on a 'growth spurt' is something that I can't find any hard
proof of from other sources.

They're pretty steady according to:

<http://www.alexa.com/siteinfo/bing.com>

and

<http://siteanalytics.compete.com/bing.com/>

------
docgnome
It's just straight up evil to run an indexing service that ignores robots.txt,
and to run more than one of these things when they don't talk to each other.
Even if the latter was stupidity, the former is just plain evil. It's not like
they have an excuse of ignorance. This isn't some guy who just learned to
program hacking something together. These are supposed to be "professional"
programmers.

~~~
bockris
And it's nothing new. A couple of years ago I had a domain that I put a
blanket disallow on. Live search indexed it, but neither Yahoo nor Google did.

------
petercooper
It sucks for them, but I doubt any of perl.org really _needs_ to be indexed by
Bing anyway (though it's always nice to be indexed).

It's not an entirely fair or scientific comparison but my general Ruby site
had ~55k visitors from Google in the last month. 280 from Bing. I'd suspect
their market share in the Perl world is as bad or worse.

~~~
jacquesm
The problem really is that every two-bit search engine crawler (twiceler,
yahoo, bing and so on) will use just as much bandwidth as google does (or
more) but will only return a fraction of the number of users that google does.

Search engine vendors ought to do something like 80legs does and stop crawling
the web as inefficiently as they do today.

~~~
OmarIsmail
Don't get me started on 80legs. They took down our servers earlier this week.
We couldn't even identify them by IP address, because their distributed system
runs on regular people's computers. If anything, 80legs is the ultimate DDoS
machine.

I mean, they state that their average crawl rate is 1 req/sec and can be
modified if you contact them. The problem is that other people can modify the
crawl rate (somebody set ours to 4/sec).

And remember, that's just the average. The distributed nature of their system
means that it's ridiculously spiky. Furthermore, they don't respect the
nofollow tags we put in place to keep bots out of bad areas, so their crawlers
got into very unoptimized areas of the site.

Overall though we turned it into a positive by ramping up our analysis tools,
and creating a much stronger robots.txt. Still... I really don't like 80legs
and bristle when people reference them as someone doing 'good'.

~~~
jdrock
So.. we usually get comments like this on webmaster forums, and I usually just
let them go, but since HN is "my" forum, I feel like saying something.

1. We identify ourselves as user-agent 008. The proper way to track a bot is
through user-agent, not IP address.

2. We never say that the average rate for all sites is 1 req/sec. Many sites
have higher rates. We make a best guess at what load a site can handle based
on a mix of Alexa/Quantcast/Compete stats.

3. We always respond to requests from webmasters to reduce the crawl rate.
Within 5 minutes of getting your message, we did this. What other bot does
this? Oh yeah, _none_ of them.

4. The nofollow tag is not a standard way to tell bots not to follow a link.
That is what robots.txt is for. Nofollow was implemented for Google
specifically and has caught on as a (not as common as you think) tag.
Robots.txt is the _right_ way to talk to bots (see the sketch after this
list).

5. You know what happens when people don't use 80legs? They figure out some
way to get the information they want and soak up your bandwidth anyway. A
webmaster is fooling himself if he thinks he can stop people from crawling a
site. It's going to happen. At least with 80legs the webmaster has a
responsible company with real people to talk to so that the crawl can be
_managed_.
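
To be concrete about what "talking to bots through robots.txt" means, here's a
minimal sketch of what a polite fetcher does before touching a URL, using
Python's standard robotparser (the site URL and paths are placeholders; 008 is
the user-agent we crawl under):

    import time
    import urllib.robotparser
    import urllib.request
    
    SITE = "http://example.com"   # placeholder site
    USER_AGENT = "008"            # the user-agent 80legs crawls under
    
    # Fetch and parse the site's robots.txt once, then reuse it.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()
    
    def polite_fetch(url):
        """Fetch a URL only if robots.txt allows it for our user-agent."""
        if not rp.can_fetch(USER_AGENT, url):
            return None           # disallowed by robots.txt: skip it entirely
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    
    for path in ("/", "/some/heavy/page"):
        page = polite_fetch(SITE + path)
        time.sleep(1)             # and keep the request rate down while you're at it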

~~~
OmarIsmail
1. This requires prior knowledge of your existence; it wasn't until we did
proper analysis that we were able to identify the culprit of the "attack".
This did expose a weakness in our monitoring systems, so we've obviously
learned from this experience, but I can tell you that there are MANY
webmasters out there who aren't as savvy and could easily get their servers
taken down by your service.

2. You should try reading your own FAQ. <http://80legs.pbworks.com/FAQ> - see
the Rate limiting section.

3/4/5. The only bot I care about is Googlebot. And they do use nofollow that
way; fortunately the other major bots do as well... so things have been
working pretty darn great for everybody. Every other bot I don't care for,
don't need, and don't want. If someone legit wants access to our data they can
use our totally open and free API, which actually gives much better
information.

In principle I think 80legs can be used for good. However, it's like
gunpowder: I question whether it has a legitimate real-world practical use,
simply because it can be abused so, so easily. I could very well be totally
off base here, but that's what lack of information + a bad experience gets
you.

edit: And shouldn't the fact that you "...usually get comments like this on
webmaster forums..." be a hint that maybe there's a real issue with your
service?

~~~
jdrock
1. We identify ourselves very clearly in our request header through our
user-agent, including a link to our website. It's not "savvy" to look for
this. It's basic webmaster knowledge.

2. Per another commenter, I can see how the language may be confusing; we'll
change it.

3. Webmasters tend to rush to grab their pitchforks instead of thinking
things through. We follow all the standard, accepted rules.

4. Some people may not want to use your API because they are interested in
more than just your site. Crawling is a one-stop solution to get information
from a ton of sites, instead of implementing an API for a single site that is
just one data point on the web.

~~~
OmarIsmail
At the end of the day users of your service won't be accessing our data, which
means they'll be missing out on some data, and we'll be missing out on being
included in potentially some cool applications. This spat isn't going to have
a real effect on either of our businesses in the long run.

However, something that may have a bigger effect on your business is the
seemingly dismissive and patronizing attitude you have towards webmasters. At
this stage of the game you need us way more than we need you. I'm sure in this
exchange there's defensiveness on your end because I attacked your company,
and you've probably dealt with many PO'd guys like me (patience wears thin
easily).

However, that's the nature of the web. We webmasters grab our pitchforks and
bitch and moan. I imagine if a large enough contingent of webmasters doesn't
like you guys, it WILL start to have an effect on your overall business.

In which case, it's probably smart business to either: A) play nice, even when
we're not, and/or B) Demonstrate a clear business case why it's better for us
to expose our data to your crawlers.

~~~
jdrock
The thing is I was playing nice - responding immediately to your Twitter
message, changing the rate asap. But then I come onto HN and see a complaint
from you without mentioning we responded quickly. Not cool in my book. You're
saying people should play nice when you're just telling one side of the story.

Edit: I should note the response would have been even faster if you had
emailed us from the contact page that's on the website we link to in our
request header. Tweets go to us marketing/biz-dev guys. The contact form goes
to everyone in the company. I'm not sure how much nicer we can be beyond doing
everything we can to be reachable and identifiable.

~~~
OmarIsmail
By the time we identified you as the problem, the damage was already done. In
fact, when combined with the potential performance penalty that Google may
apply (could be 5%, 10%, for one week, two weeks... who knows), the actual
damage in lost traffic/business to us could easily run into the thousands.

Are you going to reimburse us the damages? Obviously not.

I want you to realize that the timeliness of your response is completely
irrelevant to the matter at hand.

Proper net etiquette would be to have a reasonable limit of an ACTUAL 1
req/sec, not an average. Furthermore, allow only the webmaster of the site to
increase the rate.

Also, while nofollow sculpting is not the original intention of the metatag,
MANY people use it as such, so respect that directive.

Also, you can have your system be aware when a site is slow to respond and
adjust crawl rates/times as necessary.

Also, user-agent protections are not enough. Many webservers have built-in
automatic IP-based attack detection/prevention, but as far as I know, most
don't have user-agent based systems. The new virtual land is distributed and
webservers have to react to that new reality, but the onus is on you to be
aware that many people are still running legacy systems.
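
To be concrete about what user-agent based protection would look like, here's
a rough sketch of a throttle at the application layer. It's purely
illustrative (WSGI middleware with limits I made up), not something any
webserver ships with:

    import time
    from collections import defaultdict, deque
    
    class UserAgentThrottle:
        """WSGI middleware: reject clients whose user-agent exceeds a request budget."""
    
        def __init__(self, app, max_requests=60, window_seconds=60):
            self.app = app
            self.max_requests = max_requests   # illustrative budget, not a recommendation
            self.window = window_seconds
            self.hits = defaultdict(deque)     # user-agent -> timestamps of recent hits
    
        def __call__(self, environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "unknown")
            now = time.time()
            recent = self.hits[ua]
            # Drop timestamps that have fallen out of the window.
            while recent and now - recent[0] > self.window:
                recent.popleft()
            recent.append(now)
            if len(recent) > self.max_requests:
                start_response("429 Too Many Requests", [("Content-Type", "text/plain")])
                return [b"Crawl rate too high for this user-agent.\n"]
            return self.app(environ, start_response)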

If you guys did that from the beginning there wouldn't be an issue. And I'm
sure all those complaints you see on other webmaster forums would largely go
away as well.

There are many things you can/could've done to prevent this problem. There are
things we could have done as well, but placing the blame on us is akin to
blaming a robbed family for not locking the door. Sure, said family should
have locked the door, but that in no way excuses the robber's actions.

I know it's a lot easier to ask for forgiveness than permission, but when
you're performing the equivalent of a DDoS attack and causing immediate and
direct financial damages to a company... that doesn't fly.

~~~
jacquesm
Sorry, but I think you are in some weird kind of 'damage control' mode here.

First of all, one hit per second, or even 4, is ridiculously low, and most
other search engines crawl at a much higher rate.

Second, you really should learn how to use robots.txt and what nofollow is
really for.

As for 80legs, they should start caching, so that once one of their clients
has crawled a page, their other clients won't revisit that same page for a
while.
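
Something along these lines would do it -- a shared, time-limited cache in
front of their fetchers, so the first client to grab a page serves everyone
else for a while (sketch only; the fetch function and TTL are placeholders):

    import time
    
    class SharedCrawlCache:
        """Serve a page from cache if any client fetched it recently."""
    
        def __init__(self, fetch, ttl_seconds=24 * 3600):
            self.fetch = fetch       # the real download function, supplied by the crawler
            self.ttl = ttl_seconds
            self.pages = {}          # url -> (fetched_at, body)
    
        def get(self, url):
            entry = self.pages.get(url)
            if entry and time.time() - entry[0] < self.ttl:
                return entry[1]      # another client already pulled this page recently
            body = self.fetch(url)   # only hit the origin site when the cache is stale
            self.pages[url] = (time.time(), body)
            return body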

If a relatively low-volume crawler is already a DDoS attack in your book, then
I hope you will never be subjected to a real one, because it will be curtains
for your site in a heartbeat.

------
mrduncan
I have to wonder if there have been some recent changes which are causing all
kinds of issues. Just a few days ago, Ryan Tomayko twittered about it
completely ignoring Github's robots.txt -
<http://twitter.com/rtomayko/statuses/7685967826>

~~~
pjhyett
Our MySQL boxes were hammered thanks to this bug. Pages that weren't supposed
to be crawled because they're resource intensive were all getting hit. Very
frustrating; we ended up blocking all msnbot traffic in Nginx.

------
bshep
Why 403s instead of just denying the connection with hosts.deny?

I'm thinking more about the bandwidth/CPU savings than about sending correct
responses to their crawler...

------
aguynamedben
Yep, MSNBOT does the same type of bursty crawling here too. Not graceful at
all.

------
akkartik
I mostly sympathize, but this is crazy:

 _"I'll now be denying access to anything with the IP matching
/^65\\.55\\.(106|107|207)/. If you discover you fall into that pattern, and
are a real person, please let me know."_

How hard is it to look up DNS?

~~~
carbocation
I agree with your sentiment.

My first thought, however, was that it is extremely costly to dynamically look
up DNS for every page request. Even if a DNS lookup were only performed when
the IP matched that range, you'd still end up doing a lookup for every MSNBot
hit. If they're calling way too many pages (which is the reason for the block
in the first place), this will thrash your server.

Now, if you followed the above but then added the IP to a static blacklist if
it resolved to an MSFT host - and a whitelist if not - then the system should
work without too much of a performance hit (except a small one upfront).

Of course, at that rate, it would probably be easier to just submit a batch
job to generate IPs in that range for blacklist and whitelist.
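
A rough sketch of the lookup-and-cache approach above (the search.msn.com
suffix is my assumption about what MSNBot's reverse DNS looks like; the regex
is the range from the post). Since verdicts are cached, the per-request cost
disappears after the first hit from each address:

    import re
    import socket
    
    SUSPECT = re.compile(r"^65\.55\.(106|107|207)")   # range from the blog post
    blacklist, whitelist = set(), set()               # cached verdicts, one lookup per IP
    
    def is_msnbot(ip):
        """True if this IP reverse-resolves to an MSNBot host (forward-confirmed)."""
        if ip in blacklist:
            return True
        if ip in whitelist or not SUSPECT.match(ip):
            return False
        try:
            host = socket.gethostbyaddr(ip)[0]
            # Forward-confirm: the claimed hostname must resolve back to the same IP.
            verified = host.endswith(".search.msn.com") and ip in socket.gethostbyname_ex(host)[2]
        except (socket.herror, socket.gaierror):
            verified = False
        (blacklist if verified else whitelist).add(ip)
        return verified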

Anyways, your rhetorical point still stands :-)

~~~
netdog
I just block that entire Microsoft subnet at the firewall (using pf):

    # drop inbound web traffic from Microsoft's address range
    microsoft = "65.52.0.0/14"

    block in on $ext_if proto tcp from $microsoft to any port { http, https }
