
Bing search results showing up in Google - ZeroMinx
http://jacquesmattheij.com/Bing+search+results+showing+up+in+Google
======
jedsmith
With respect to the author, the conclusion here is very flawed.

If you search for Bing in Google, you get Bing all over page 1. If you search
for Google in Bing, you get Google all over page 1. That's not the result of
Google capturing click stream data from Google Chrome and copying Bing's
results, nor is it the result of Microsoft capturing click stream data from
IE8 and copying Google's results. That's just the nature of indexing.

As for robots.txt disallowing those URLs, there is _no_ standard for
robots.txt behavior. I have observed some user agents treat it as case
insensitive, and others treat it as case sensitive.

Honestly, this isn't even in the same ballpark as the Google accusations made
earlier this week, and it smacks of just _looking_ for things to accuse Google
of in response to the "Binggate" (ugh, I typed it) drama. Can't we go back to
more productive things?

~~~
unp3rs0n
I don't understand why everyone is using the term "copying the results". I
think what Bing did was very smart: they incorporated user clickstream data.
One could accuse this method of walking a thin line morally, but I suspect
that Google's accusation wouldn't have held any water as a lawsuit.

~~~
luigi
Because by incorporating clickstream data from Google, they're effectively
copying Google search results. Bing should blacklist Google from its
clickstream data.

~~~
unp3rs0n
Let's say tomorrow DDG is the search engine with the largest market share.
Then Bing would be getting all the clickstream data from DDG. I hope you do
realize that this "algorithm" is not Google specific. It's just a novel ranking
technique that incorporates a human user feedback loop and is a pretty well
known technique in the information retrieval field.

~~~
moultano
It would be equally unethical to be copying DDG's results in this fashion.

~~~
nostrademons
Highly ironic, though, as DDG uses Yahoo as a backend, which uses Bing, which
uses Google, which would use...DDG? I think there's a cycle in that list
somewhere...

------
Matt_Cutts
The two major issues in this article were:

- Google can see and return links to pages without crawling them. I made a
video and a blog post about this a while ago:
<http://www.mattcutts.com/blog/robots-txt-remove-url/>

- URL paths are case sensitive. Bing blocks /search in its robots.txt but not
/Search. That's how the /Search urls got crawled.
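A minimal sketch of the second point, using Python's standard `urllib.robotparser` rather than any real crawler. The robots.txt body and the `ExampleBot` user agent are assumptions made up for illustration; only the `/search` versus `/Search` distinction comes from the discussion above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body modelled on the rule described above;
# this is not a copy of Bing's actual file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Ask the parser whether a made-up bot may fetch `path`."""
    return parser.can_fetch("ExampleBot", "http://m.bing.com" + path)


# The rule is compared against the path as a case-sensitive prefix,
# so the two spellings get different answers.
print("/search allowed:", allowed("/search"))
print("/Search allowed:", allowed("/Search"))
```

With this parser the lowercase path is blocked and the capitalised one is not, which is the same shape of mismatch described above.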

In a later edit, the author suggests "It would be fairly trivial for bots to
test if the server is IIS (if the server identifies itself as such of course)
or to try to retrieve Robots.txt and robots.txt, if those come up as equal
then the server can be assumed to be case insensitive."

The issue of case sensitivity in robots.txt is a long, very nuanced topic.
Here's just one example to get you started: at least back in 2007 when we were
talking about this amongst ourselves at Google, the web server for
developer.apple.com was case-insensitive, but their robots.txt had lines like
this:

Disallow: /documentation/quicktime/

Disallow: /documentation/Quicktime/

Disallow: /documentation/QUICKTIME/

Disallow: /documentation/macosx/

Why would they do that? Apparently because Apple wanted the canonical link to
be /documentation/QuickTime . Back then, at least 21M robots.txt files on the
web had mixed-case paths. If Google started interpreting robots.txt files from
servers that claimed to be IIS differently... well, I'll leave it as an
exercise to the reader to come up with some of the unexpected bugs and
behavior that could result.

I know it's really tempting to write a headline like "Bing search results
showing up in Google," but I wish the author had done more research instead of
going for a gotcha. Any SEO worth his/her salt could have explained what was
going on here.

~~~
mc32
>"...but I wish the author had done more research instead of going for a
gotcha."

That's a bit of irony right there, isn't it?

--wow, Matt, I guess it cuts both ways, eh?

~~~
Matt_Cutts
Anything in particular you're talking about? I've said a lot of stuff online
over the last 10 years. :)

~~~
Natsu
I don't know, but I would assume that they're one of the people who don't
think that what Bing did is a big deal. If I understand the arguments
properly, most either believe that the clickstream data belongs to the user
(who gave it away), or they believe that you haven't actually proven that
they're singling out Google in their clickstream data.

Anyhow, I've already posted in this story how anyone who wants to could go do
their own experiment (it appears that Bing's data isn't too hard to fake, that
goes double for anyone who can reverse engineer the toolbar). Similarly, I'm
not convinced that Bing doing this is going to ruin search any time soon,
mostly because I can see spammers/blackhat SEO types renting botnets to feed
Bing all the bogus clickstream data they want.

It appears to be a simple http request with some time zones, link text, and an
identifier or two that can be harvested from actual toolbars. If they don't
bother to spam them, it's because they believe that Bing is irrelevant. And
when I say "irrelevant" I mean "even less relevant than a low-traffic wiki for
a free game that's currently under a massive assault from spambots."

Which, incidentally, might be one good reason for your team to look more at
keeping wiki-spam out of your index. Some spam results from that wiki (in
spite of having rel=no-follow) were seen in Google's index and stayed there
until the admins caught on and cleaned things up.

------
floatingatoll
Dear Microsoft, case-sensitivity is important. [1]

m.bing.com/robots.txt says "/search", not "/Search". All of the crawled [2]
urls are "/Search" or "/~/search".

Also,

wap.bing.com/robots.txt explicitly "Allow:"s several search pages, which are
indexed by google.

[1] EDIT: Case insensitivity is often important. Above comment notes that some
robots are case-insensitive. I suspect Google is not, based on the results.

[2] EDIT: I said indexed, a reply corrects to crawled. Good point, thanks.

~~~
amalcon
Naturally, Bing is hosted on a Windows server, which inherits the Windows
filesystem eccentricities. Case insensitivity is among those. Because "search"
has six letters, that would mean that the robots.txt would need to have 64
entries to completely exclude this directory. That's not even including the
tilde thing or any other paths to that directory. And that's for one
directory.
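The count of 64 is just 2^6, one upper/lower choice per letter. A small sketch that enumerates the spellings (the helper name `case_variants` is made up for illustration):

```python
from itertools import product


def case_variants(word: str) -> list[str]:
    """Return every upper/lower-case spelling of `word`."""
    # Each letter contributes two choices: its lowercase and uppercase form.
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


variants = case_variants("search")
print(len(variants))  # one entry per spelling: 2**6

# The first few Disallow lines such a robots.txt would need.
for spelling in variants[:4]:
    print(f"Disallow: /{spelling}")
```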

Lame? Yes. Google's fault? Not in the slightest. But it brings up an
interesting question: if MS clickstream gathering included an opt-out
mechanism that happened to be impractical for Google, would that change the
ethics of any of this? Say, by having the Bing toolbar identify itself in
user-agent so that Google could block it if they wanted?

I wouldn't think that would materially change the situation. If Google really
wanted to, they could probably "block" this now by encrypting their existing
URL redirects, thus hiding the URL from the Bing toolbar entirely, at least
until the user is out of the Google system.

~~~
xilun0
What is the value of processing robots.txt in a case-sensitive way? If URLs
are to have a different status when only their case changes, then the site
structure is just broken. Plus, processing robots.txt in a case-sensitive way
has already resulted in lots of errors, this one included, and will result in
even more in the future. Plus, HTTP is not mandated to use case-sensitive URLs
(though it's recommended). I can't think of any argument for why robots.txt
should be processed in a case-sensitive way (I mean for a good reason --
obviously search engines have a very "good" incentive to handle it that way:
the possibility to cheat and index more than they should, with an excuse when
they are caught). On the other hand, I can think of many for case-insensitive
processing...

~~~
moultano
Webmasters shoot themselves in the foot _a lot_. "My site isn't showing up in
Google" is a frequent complaint on webmaster help forums, and typically the
problem is robots.txt or meta noindex. From that standpoint, since most sites
do want to be indexed, it makes sense to follow the standard as strictly as
possible. It should be hard to remove your site by accident, which case
insensitivity would make somewhat easier. Google states explicitly in their
robots.txt policies that it is handled in a case-sensitive way.

------
jsnell
> never mind that Bing only used its toolbar as a url discovery device

That is obviously untrue, and shows that the author does not understand the
issue even superficially. The Google experiment showed that Bing was
associating urls to search terms for no reason other than that Google had done
so. You know, like making a search for mbzrxpgjys return rim.com, a URL which
we can safely assume Bing was already quite aware of.

~~~
ajays
_the author does not understand the issue even superficially._

That, in a nutshell, is it. I don't know why we're spending so much time on
this post, as the author has no idea how search engines work or what
robots.txt is. If he had just looked at Bing's robots.txt and the URLs in
Google's results, he would have seen that each and every one of them passed
robots.txt. Since he site-restricted the search to ".bing.com", naturally you
_will_ get only Bing results!

------
ashleyw
Aren't /search and /Search considered two different directories when it comes
to robots.txt?

~~~
jfr
Yes. RFC 3986, sections 6.2.2.2 and 6.2.3.

~~~
nostrademons
I think you got the wrong RFC number:

<http://tools.ietf.org/rfc/rfc3938.txt>

No mention of robots.txt, nor a section 6.

~~~
jfr
Sorry, the correct number was 3986.

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

------
haberman
Never underestimate the ability of a human being to rationalize.

If I was in the Microsoft camp, I'm sure I would also be grasping at straws to
explain why it's totally fine for Bing to use Google's search results. It's
human nature to rationalize.

The bottom line is that Bing's index contains associations that it could never
have figured out if Google hadn't figured them out first. How many there are,
we cannot know. There's no way around the fact that Bing is piggybacking on
the work of Google's search engineers.

Is it "good for the customer?" In the short term, it's good for the customer
if they can buy $1 bootlegged DVDs. In the long term, it's bad for the
consumer if the money goes to bootleggers instead of the people who are doing
the actual work.

Think I'm exaggerating the effect of just "1 out of 1000 signals?" This
argument would be extremely easy to refute. Stop using Google's results. If it
really isn't that significant, then why should it be a problem to stop using
it? Just turn it off and let everyone observe that the quality is 99.9% as
good as it used to be, and avoid any accusation of copying.

By refusing to turn it off, Microsoft makes it clear that it _is_ an important
part of their index, and that they have no qualms about having an important
part of their index ripped off wholesale from their biggest competitor. Maybe
it's a smart business move. But if that's the case, spare us the outrage about
being called "copyists."

~~~
jfager
The thing that bothers me about all this drama is that the actual offense
Google wants everyone to be so worked up about is that Bing doesn't filter
Google from its clickstream data.

Bing wrote code that works across the whole web. The whole web includes
Google. As a result, Bing gets some info from Google. But they didn't get that
info because they _copied_ Google, they got it because they didn't filter
Google out - or, said another way, because they _ignored_ Google as someone
they needed to special case for clickstream analysis.

I don't work in search, but the idea that you're supposed to special-case your
competitors when writing general-purpose tools sounds an awful lot like a
unilaterally recognized gentleman's agreement. If it's not illegal, and it
doesn't hurt end users, why shouldn't it be considered fair game?

I also think it's odd that throughout this whole thing, nobody has really
noticed that the only possible way Google could have spotted this issue is if
they're keeping very close tabs on Bing's search results. It's another
arbitrary line that Google seems to have unilaterally drawn: it's clearly fine
to monitor your competitors' results closely, which presumably is going to have
an effect on your own results; it's only out of bounds when that effect is
directly measurable.

~~~
nostrademons
I thought part of the point is that whatever Bing is doing _doesn't_ work
across the whole web. They need to associate the URL with a query, and most
websites don't have queries. It's not just that Bing has recorded a click on
Miley Cyrus's webpage; it's that they've done that _and_ associated it with
the query [kecgxjpgqoe].

~~~
jfager
People have made that point, but I don't understand it. 'kecgx...' shows up as
a parameter in the url of the Google query. Tons of sites include relevant
information in parameterized urls; why is it unexpected that Bing would use
that information across the whole web? Other people have said that implies
that Bing has to have special Google url-parsing code, but that's not true at
all - query parameters in urls are standardized. You would have to have
special code to understand the specific semantics of Google's query urls, but
there's no reason to think Bing needs or wants parameter semantics, they could
easily just be interested in making probabilistic associations.

~~~
nostrademons
"You would have to have special code to understand the specific semantics of
Google's query urls"

That's why people say that it's special-cased. There's no web standard that
says the 'q' parameter means that the page is a search engine and the
parameter is the query. That was something AltaVista did a long time back
(possibly for byte-saving reasons, or possibly because they were lazy) and
Google et al copied. Many other search engines use a different system, eg. DDG
puts it in the request path, InfoSeek used qt=, Excite used search=.

~~~
jfager
Why do you think Bing cares or has to know that "q=" means a Google query
term? My point is that they don't have to have any semantic information about
parameter keys to be able to derive probabilistic associations between
parameter values and clicks.

If you consistently see pages with 'foo' as a parameter value to _any_
parameter key, and clicks on those pages consistently go to site bar, it's
completely reasonable to start associating foo with bar, regardless of what
the parameter keys are.
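A rough sketch of that idea, purely hypothetical and not a description of what Bing does: count how often each parameter value in the referring URL is followed by a click to a given host, ignoring the parameter keys entirely. The clickstream pairs, the `record` helper and the `.example` hosts are invented for illustration.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def record(counts: Counter, referrer_url: str, clicked_url: str) -> None:
    """Count each (parameter value, clicked host) pair; keys are ignored."""
    clicked_host = urlsplit(clicked_url).hostname
    for _key, value in parse_qsl(urlsplit(referrer_url).query):
        counts[(value, clicked_host)] += 1


# Invented (page the user was on, page the user clicked through to) pairs.
clickstream = [
    ("http://www.google.com/search?q=foo&ie=utf-8", "http://bar.example/"),
    ("http://search.example/find?terms=foo", "http://bar.example/page"),
    ("http://www.google.com/search?q=baz&ie=utf-8", "http://qux.example/"),
]

counts = Counter()
for referrer, clicked in clickstream:
    record(counts, referrer, clicked)

# 'foo' is tied to bar.example twice, under two different parameter keys.
print(counts[("foo", "bar.example")])
```

Nothing in `record` knows that `q` is a query term; the association falls out of the counts alone.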

~~~
Natsu
> Why do you think Bing cares or has to know that "q=" means a Google query
> term?

If that's true, then they should also be associating the sites linked with all
the other weird parameter values in a search query, which would spam them to
heck. Here are all the params from a search I just did on google:

q=test&ie=utf-8&oe=utf-8&aq=t&rls={moz:distributionID}:{moz:locale}:{moz:official}&client=firefox

They'd start making a lot of strange associations between random sites and
"utf-8" if that were true, because that parameter shows up in just about every
Google search done in English.

It's also a perfectly normal thing for programmers to search for, so they'd
clutter up their index with millions of sites that had nothing whatsoever to
do with utf-8.

So to make any real use out of that, they had to understand what the
parameters in there actually mean, rather than associating ALL of them with
whatever site was next in the clickstream.

Though I grant you, that does not disprove the alternate hypothesis that they
were dumb and polluted their index with loads of irrelevant crap.

And I admit that I found weak support for that hypothesis by trying to see if
utf-8 was linked to rim.com (one of the test sites, if memory serves):

[http://www.bing.com/search?q=utf-8+rim&go=&form=QBRE](http://www.bing.com/search?q=utf-8+rim&go=&form=QBRE)

Those results appear to be crap, though I'm not sure that any sensible results
exist that they _could_ return.

~~~
jfager
_If that's true, then they should also be associating the sites linked with
all the other weird parameter values in a search query_

If it's a parameter value people will ever actually use Bing to search for
('utf-8'), there's probably plenty of other signal to help them figure out
which results to return. If it's not ('kecgxjpgqoe'), we already know they
sometimes return crap, thanks to Google's little experiment.

 _They'd start making a lot of strange associations between random sites and
"utf-8" if that were true, because that parameter shows up in just about every
Google search done in English._

If it shows up as a parameter for every search, why do you think Bing's
algorithm would decide it was a good indicator for a particular url?
P(foo.com|utf-8) wouldn't be any different from P(bar.com|utf-8), making
'utf-8' basically worthless as a discriminator. I'm pretty sure the folks at
Bing understand the concept of conditional probability.
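To make the conditional-probability point concrete, here is a toy calculation over invented observations (the numbers are assumptions, chosen only to show the shape of the argument). A value that accompanies clicks to every site equally tells you nothing about which site to rank, while a value seen with only one site does.

```python
# Invented (parameter value, clicked site) observations.
observations = [
    ("utf-8", "foo.com"),
    ("utf-8", "bar.com"),
    ("utf-8", "foo.com"),
    ("utf-8", "bar.com"),
    ("kecgxjpgqoe", "foo.com"),
    ("kecgxjpgqoe", "foo.com"),
]


def conditional(site: str, value: str) -> float:
    """Estimate P(site | value) as a simple ratio of counts."""
    with_value = [s for v, s in observations if v == value]
    return with_value.count(site) / len(with_value)


# 'utf-8' does not discriminate between the two sites...
print(conditional("foo.com", "utf-8"), conditional("bar.com", "utf-8"))
# ...while the nonsense token points at exactly one of them.
print(conditional("foo.com", "kecgxjpgqoe"), conditional("bar.com", "kecgxjpgqoe"))
```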

 _It's also a perfectly normal thing for programmers to search for, so they'd
clutter up their index with millions of sites that had nothing whatsoever to
do with utf-8_

I see no justification for the idea that url parameters are somehow a more
difficult challenge in this regard than the mass of crap that is content on
the web, which we know they crawl and index at a massive scale.

~~~
Natsu
It would only show up as a parameter "relevant" to whatever sites were next in
the clickstream. Remember, not every page transition is Google -> other site.

They'd also be gobbling up tons of random things from forums and whatnot
(which should appear on long tail searches, if we knew where to look), most of
which spam the heck out of you with random parameters, forum names, and
whatnot.

> I see no justification for the idea that url parameters are somehow a more
> difficult challenge in this regard than the mass of crap that is content on
> the web, which we know they crawl and index at a massive scale.

Which is why I see no reason to assume that they don't understand or attempt
to understand the actual meaning of parameters passed to one of the biggest
sites on the internet.

~~~
jfager
You said 'utf-8' shows up as a parameter in almost every English Google
search, and suggested that this would cause weird associations between 'utf-8'
and random pages.

I pointed out that there's no reason to expect this to be true, because Bing
engineers are likely smart enough to realize that if the probability of
clicking on to foo.com is not significantly different than the probability of
clicking on to bar.com given the presence of the 'utf-8' parameter, then
'utf-8' is a pretty poor discriminator between foo.com and bar.com, and
probably shouldn't be used to determine search results.

It doesn't matter that not every page transition is Google -> another site.
You still wouldn't need to special-case Google to determine an association
with a parameter is useful or useless - the same code could build that model
for any site with params.

 _They'd also be gobbling up tons of random things from forums and whatnot
(which should appear on long tail searches, if we knew where to look), most of
which spam the heck out of you with random parameters, forum names, and
whatnot._

How do you know that they don't? Google pointed out some longtail results that
look bad, and you yourself pointed some out in a previous comment.

 _Which is why I see no reason to assume that they don't understand or attempt
to understand the actual meaning of parameters passed to one of the biggest
sites on the internet._

You're misunderstanding what I'm saying. I don't assume that they don't; I'm
just saying there's no evidence that they do, and that assertions of
wrongdoing based on the belief that they do are just irresponsible
speculation. I have no knowledge of what Bing actually does, but neither do
the vast majority of the people on the internet who are talking about this,
many of whom assume the worst based on a mistaken notion of what's technically
necessary to see the results that Google demonstrated.

~~~
Natsu
> the presence of the 'utf-8' parameter, then 'utf-8' is a pretty poor
> discriminator between foo.com and bar.com, and probably shouldn't be used to
> determine search results.

And yet, it will link random text to websites even if they appear only in
Google's URLs. I realize you're talking about discrimination (as in, "what's
the better result for utf-8?"), but if the code is generic, it ought to be
generic in this respect as well. After all, it linked up random nonsense to
random sites given nothing more than Google's say-so, even though there's
plenty of information out there about, say, rim.com that would tend to
indicate that nobody except Google thinks that random text is relevant to an
otherwise well-known site.

> How do you know that they don't? Google pointed out some longtail results
> that look bad, and you yourself pointed some out in a previous comment.

Indeed, I do not know. I know that it would be dumb to link those things to
random sites, but you are correct that I do not know if they're doing things
that dumb.

> I don't assume that they don't; I'm just saying there's no evidence that
> they do, and that assertions of wrongdoing based on the belief that they do
> are just irresponsible speculation.

Well, for one, I'm not really asserting "wrongdoing" here. That is, I don't
particularly think that it's wrong of them to do things this way. My interest
is mainly technical, so I'm more interested in figuring out exactly what
they're doing rather than blaming them for it. As such, I'm going for the most
likely explanations I can find, rather than worrying about whether it's been
proven to such an extent that they can be blamed for it (as I'm not really
going to blame them anyhow).

You may have seen where I pointed out that I don't think it will "ruin search"
in the end because they should expect a crapflood from spammers now that it's
clear that they use clickstream data to rank sites. After all, there's a large
spam attack right now on a tiny wiki for a game I play. I have to think Bing
is more of a target than that. I can't prove that, true, I'm just playing the
odds here.

------
mukyu
We need to get over this partisan "gotcha journalism". No one really benefits
from everyone making low content blog posts with any random accusations that
make their side 'right' (which just happens to be ad hominem anyways).

------
moultano
This looks like Microsoft is assuming robots.txt is case insensitive?

~~~
ajays
It is. Only domain names are case-insensitive.

This document explains how Google handles robots.txt :
[http://code.google.com/web/controlcrawlindex/docs/robots_txt...](http://code.google.com/web/controlcrawlindex/docs/robots_txt.html)

~~~
Natsu
Are we reading the same file? Under where it describes matching paths (so
replace /fish and /Fish with /search and /Search if you like):

==Example path matches==

[path] /fish

Matches: /fish /fish.html /fish/salmon.html /fishheads /fishheads/yummy.html
/fish.php?id=anything

Does not match: /Fish.asp /catfish /?id=fish

Comments: Note the case-sensitive matching.
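The table above is consistent with a plain case-sensitive prefix test. A minimal sketch, assuming no `*` or `$` wildcards in the rule (the function name `rule_matches` is made up):

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    """Case-sensitive prefix match; wildcards are deliberately not handled."""
    return url_path.startswith(rule_path)


should_match = [
    "/fish", "/fish.html", "/fish/salmon.html",
    "/fishheads", "/fishheads/yummy.html", "/fish.php?id=anything",
]
should_not_match = ["/Fish.asp", "/catfish", "/?id=fish"]

for path in should_match + should_not_match:
    print(path, rule_matches("/fish", path))
```

Swap in `/search` and `/Search` and the same test explains why a `/search` rule does not cover `/Search`.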

------
illdave
If you look at <http://wap.bing.com/robots.txt>, the URLs that Google is
returning are actually all set to 'allow', not disallow.

It also looks like m.bing.com/robots.txt blocks /search while their actual
URLs are /Search - I guess Googlebot treats robots.txt as case-sensitive.

------
aristidb
robots.txt applies to the source of the links, not the target.

So if, say, <http://www.paulgraham.com/> links to <http://m.bing.com/search>,
then <http://m.bing.com/robots.txt> does not apply to that.

EDIT: If you think this is wrong, please explain it instead of just downvoting
me, because I think it is pretty unfair that I lose karma for explaining my
interpretation.

~~~
benologist
Why would someone with no authority over your site linking to your site
override your robots.txt?

Edit: I didn't downvote you, but I think you're wrong because it makes no
sense - if I link to your site that shouldn't give search engines a free pass
to ignore your wishes and do whatever they want with your content.

~~~
aristidb
I understand a robots.txt "Disallow: /foo" to mean that it must not crawl that
page, i.e. look at the links _inside_ that page.

~~~
benologist
I've always interpreted it to mean they're explicitly not allowed to touch it
- no exceptions (unless you actually specified exceptions for them which you
can see on the 3rd last example at <http://www.robotstxt.org/robotstxt.html>).

~~~
moultano
One counterintuitive thing: they are allowed to link to it in search results
(using links that point to it to rank it, for instance) but can't use the
content of the page.

------
dminor
Google has likely indexed links to Bing found on _other pages_ , rather than
on Bing itself. That doesn't mean it followed the links (and it wouldn't, if
excluded by robots.txt).

------
Herring
> _never mind that Bing only used its toolbar as a url discovery device, not
> to 'copy search results'_

Yeah, they just happened to discover high quality urls on google. What are the
chances?

~~~
ajays
URL discovery is one thing; ranking that discovered URL at number 1 without
any other signals is another. I don't think anyone cares that much about how
Bing does URL discovery (unless, of course, the URL is supposed to be private
and exchanged via email). Given that they have a new URL, what made them rank
it #1?

------
pmb
robots.txt disallows (or did until recently) only "/search". The results shown
have "/Search" in the url. Bing screwed up.

------
mwg66
Bit different.

------
yaix
Shouldn't the headline be "Bing explicitly allowing some results pages to
show up in other search engines"?

The wap.bing.com/robots.txt blocks all /search/ and then explicitly allows a
few, whatever the reason for that is.
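A sketch of how a broad Disallow plus a few narrower Allow lines can play out, assuming the longest-matching-rule precedence that Google documents for its crawler (other crawlers may simply take the first matching line). The paths are hypothetical and are not taken from wap.bing.com's actual file.

```python
# (is_allow, path_prefix) pairs; the paths are invented for illustration.
RULES = [
    (False, "/search/"),
    (True, "/search/example-allowed"),
]


def is_allowed(rules: list[tuple[bool, str]], path: str) -> bool:
    """Pick the longest matching prefix; on a tie, prefer Allow."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for allow, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, best_allow = len(prefix), allow
    return best_allow


for path in ["/search/example-allowed?q=x", "/search/other", "/about"]:
    print(path, is_allowed(RULES, path))
```

Under this precedence the narrower Allow wins for its own paths, so those pages are fair game for a crawler even though the rest of /search/ is blocked.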

Very weak article, IMHO.

------
maeon3
Microsoft is way out of line. Google figures out what content is good by
crawling every page and doing the leg work, and Bing copies Google data and
displays what Google displays.

Google proved it with the Bing sting. There is absolutely NO reason why Bing
should have linked to those documents, other than that they copied off of
Google's exam paper.

When students do this, it is called plagiarizing. The smoke getting thrown by
MS is just to distract and divert while they scramble to hide what they did.

~~~
benologist
They proved that Microsoft uses clickstream data to rank websites, and in 7%
of manufactured cases that's _all_ the data they have?

I think this is exactly as petty and silly as last week's news, and now _they_
get to spend a week explaining how this occurred and that they _do_ obey
robots.txt.

~~~
moultano
Everyone else in this thread has already explained that Microsoft is not
properly using robots.txt by having case-insensitive URLs, which is why these
URLs were indexed.

~~~
benologist
The Bing context might suck for you guys, but this is your problem - case
insensitivity is _every_ Windows server, not just the Bing website.

Why would anyone hosted on Windows have to specify every possible spelling
variation to keep search engines out of a folder or file?

Here's another example:

<http://www.ifma.org/robots.txt>

These guys are disallowing /pv/

[http://www.google.com/search?sourceid=chrome&ie=UTF-8...](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=site:ifma.org/pv/)

You guys are indexing /PV/

~~~
andrewcooke
Maybe I'm misunderstanding your argument, but I think you're confusing the
Windows file system with the URLs that a web service provides.

~~~
benologist
The problem exists when that web service is _on_ Windows - ASP.NET, Cold
Fusion, static html sites, probably a negligible percent of PHP sites etc -

/pv is /PV is /pV is /Pv

------
shareme
robots.txt excludes /search, not /Search. Big difference, as 99.99% of the
returned results are ../Search*

It's MS's mistake in the robots.txt file, not Google's.

------
jamesjyu
I agree with the author here. I think that Google will come away from this
looking combative and childish.

