Bing search results showing up in Google (jacquesmattheij.com)
156 points by ZeroMinx on Feb 5, 2011 | 122 comments



With respect to the author, the conclusion here is very flawed.

If you search for Bing in Google, you get Bing all over page 1. If you search for Google in Bing, you get Google all over page 1. That's not the result of Google capturing click stream data from Google Chrome and copying Bing's results, nor is it the result of Microsoft capturing click stream data from IE8 and copying Google's results. That's just the nature of indexing.

As for robots.txt disallowing those URLs, there is no standard for robots.txt behavior. I have observed some user agents treat it as case insensitive, and others treat it as case sensitive.

Honestly, this isn't even in the same ballpark as the Google accusations made earlier this week, and it smacks of just looking for things to accuse Google of in response to the "Binggate" (ugh, I typed it) drama. Can't we go back to more productive things?


Except for the scheme and host parts (which are not part of robots.txt anyway), URLs are case sensitive (ref: RFC 3986, sections 6.2.2.1 and 6.2.3).

The problem here is that Microsoft's servers respond to /search, /Search and /SeaRCh without distinction, yet these are all distinct URLs. If that is the intended behavior (stupid, but understandable, coming from Microsoft), then robots.txt should contain every capitalization variant of each path. A better solution would be to force a 301 redirect to a canonical path, and have that path in robots.txt. Google would then work as expected.
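To illustrate the canonical-redirect idea, here's a minimal sketch (using Flask purely as an example; this is not how Bing's servers are actually set up):

    # Hypothetical sketch: 301 every mixed-case path to its lowercase
    # canonical form, so robots.txt only needs the canonical spelling.
    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.before_request
    def canonicalize_path():
        path = request.path
        if path != path.lower():
            qs = request.query_string.decode()
            return redirect(path.lower() + ("?" + qs if qs else ""), code=301)

    @app.route("/search")
    def search():
        return "results page"  # placeholder

With that in place, /Search and /SeaRCh both collapse to /search, and a single "Disallow: /search" line covers everything.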

The original article is totally bogus. I can't imagine how it has over 90 votes.


I don't understand why everyone is using the term "copying the results". I think what Bing did was very smart: they incorporated user clickstream data. One could accuse this method of walking a thin line morally, but I suspect that Google's accusation wouldn't have held any water as a lawsuit.


Because by incorporating clickstream data from Google, they're effectively copying Google search results. Bing should blacklist Google from its clickstream data.


Let's say tomorrow DDG is the search engine with the largest market share. Then Bing would be getting all the clickstream data from DDG. I hope you do realize that this "algorithm" is not Google specific. It's just a novel ranking technique that incorporates a human user feedback loop and is a pretty well known technique in the information retrieval field.


It would be equally unethical to be copying DDG's results in this fashion.


Highly ironic, though, as DDG uses Yahoo as a backend, which uses Bing, which uses Google, which would use...DDG? I think there's a cycle in that list somewhere...


Equally unethical, sure (x = y), but there are differing opinions on whether that (x) is ethical or unethical. Personally, I see nothing wrong with it.


Not necessarily. It is possible that Google incorporates clickstream data too.

The problem that Google's little experiment highlighted was: given the utter lack of any other signal, Bing uses the fact that that URL was ranked #1 by Google's search engine and clicked on by a user.

Having said that: had I been doing this experiment at Google, I would have also added the following variations:

- for some search terms, rank the honeypot URL #1 but don't click on it

- for some search terms, rank the honeypot URL #1 on some _other_ search engine's list and click on it. How can they do that? There are search engines out there which use Google in the backend.

Experiment #1 would have shown more blatant copying. Experiment #2 would have shown whether it's just Google, or any other search engine.


Google has said they do not use clickstream data for ranking from Google toolbar.

They did have variations of their tests. Cutts mentioned this during the bigthink panel. Sometimes they went to Bing first, sometimes not; sometimes they clicked on the links, sometimes not; and they varied other things.


Google has said they do not use clickstream data for ranking from Google toolbar.

Please provide a reference. I've been looking for this statement and haven't found it. When Googlers are directly asked, they pointedly don't answer or say they don't know.

Amit Singhal's statement was carefully worded to be ambiguous on this matter, and Google has apparently confirmed that page-load-time data (at the very least) from the Toolbar does affect rankings.

Such use by search is definitely allowed by Google's written privacy policy. The confirm dialog a user passes when installing the Toolbar refers to that privacy policy.

It'd be very easy for an official Google spokesperson to say clearly that Toolbar data doesn't drive search rankings, if that were true. That they haven't strongly suggests it is used.

Search expert Danny Sullivan made the same observation in his 'Bing: Why Google’s Wrong In Its Accusations' article:

As For The Google Toolbar

Meanwhile, I’m on my third day of waiting to hear back from Google about just what exactly it does with its own toolbar. Now that the company has fired off accusations against Bing about data collection, Google loses the right to stay as tight-lipped as it has been in the past about how the toolbar may be used in search results.

Google’s initial denial that it has never used toolbar data “to put any results on Google’s results pages” immediately took a blow given that site speed measurements done by the toolbar DO play a role in this. So what else might the toolbar do?

http://searchengineland.com/bing-why-googles-wrong-in-its-ac...


You are being disingenuous. That you cannot find them explicitly denying something does not therefore make it true nor provide any evidence that it is true.

“Absolutely not. The PageRank feature sends back URLs, but we’ve never used those URLs or data to put any results on Google’s results page. We do not do that, and we will not do that,” said Singhal.

http://www.toprankblog.com/2006/04/matt-cutts-on-toolbar-dat...

In this one, Matt Cutts all but explicitly says that they do not use it.


"Put any results" is vague, perhaps intentionally finessed, language – as I (and the Sullivan quote) already highlighted in the grandparent comment. It could mean, especially in the context of the Bing allegations, that Google Toolbar data never adds a new URL to the index or a result-set, but is still used for relative ranking of already-known URLs.

In the link you provide, Matt Cutts says: "I’m not going to say definitively that Google doesn’t/won’t use toolbar data (or other signals) in ranking." And: "I’m not going to say whether Google uses a particular signal in our ranking." Cutts simply says Toolbar data could be problematic because it could be gamed. Well, links can too – didn't stop Google from building its empire on that impure signal. This seems to me more of the same finesse that creates the impression of a denial without a denial.

Further, it's clear from previous Google statements that page-load-timing from the Toolbar is used to affect rankings. That alone invalidates the 'strong' interpretation of Amit Singhal's statement. So Singhal means something other than "Toolbar data never affects search rankings". What does he mean? Just requoting that vague statement doesn't clear anything up.

And since everything Google does with this data is a closely-guarded secret, how can we be sure of anything, short of awaiting (and then trusting) definitive Google statements? And I can't yet find any clear statement about Toolbar data usage – even though lots of people seem to think they have seen them. (I think general warm feelings towards Google are creating this mistaken impression.)

I don't expect a clear statement; I believe Toolbar clicktrails are a big part of Google's secret sauce. But it means Google could be using equivalent techniques to Bing's, and simply have better filters against the blatant dominance of a single website or only 20 clickers on any result set.


I do not think it would matter if Sergey and Page personally delivered a stone slab engraved with a statement that Google does not use their toolbar's clicks. As you say, you would have to rely on trusting their statements. You seem to require extraordinary evidence that they do not use it, have no evidence to back up that they actually do, and yet cling to the belief that they do.

Matt spent 5 paragraphs on the subject, and yet you have laser focus on one statement that is not even contradicted by other evidence and try to derive whatever you want from it.

Here are some other comments from the panel at bigthink, but I'm sure you will find enough wiggle in them to assert that Google faked the moon landing.

Matt Cutts: "I'm not sure that users realize...when they search on Google and click... those results appear--those clicks appear to be encrypted and sent to Microsoft which then appear to be used in Google's [sic] rankings?"

[stuff about EULAs]

Harry Shum: "Everyone does this Matt you know.."

Matt: "Google... I want to categorically deny that Google does this."

Harry: "That Google does what?"

Matt: "We don't use clicks on Bing's users in Google's rankings."

http://bigthink.com/series/62#!selected_item=4845 around 25:00


It's a subtle point, but the whole point of this issue is that Bing actually does copy more than just "clickstream data". Think about what "clickstream" data is - user clicks a link, the link and the text of the link get sent to Microsoft. But in Google's test the search term was not in the link text, nor was it in the target page the user went to. So then how does the search term end up in Bing's index? They have to get it from somewhere - where? The logical conclusion is they get it from the URL of the page the click is performed on, or from the referrer header when the link is clicked (essentially these are the same thing). But how do you do that generically? The search URL contains the query terms in a format that is unique to Google. The only way Bing can be seeing it is if they deliberately are parsing out the Google search terms by specifically targeting how Google encodes them.

So rather than just generically "incorporating clickstream data" this is actually using a special procedure to extract search terms from clicks that are determined to be google searches and then putting those search terms into the bing index.
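To make the distinction concrete, here's a rough sketch (entirely hypothetical code; the function names and the assumption that the toolbar reports referrer/clicked-URL pairs are mine, not anything known about Bing's implementation):

    from urllib.parse import urlparse, parse_qs

    def generic_signal(referrer_url, clicked_url):
        # Generic: just note that a click on some page led to clicked_url.
        return {"source_host": urlparse(referrer_url).netloc,
                "clicked": clicked_url}

    def google_specific_signal(referrer_url, clicked_url):
        # Site-specific: knowing that Google carries the query in the "q"
        # parameter of /search, extract the terms and pair them with the click.
        parsed = urlparse(referrer_url)
        if parsed.netloc.endswith("google.com") and parsed.path == "/search":
            query = parse_qs(parsed.query).get("q", [""])[0]
            return {"query": query, "clicked": clicked_url}
        return None

    print(google_specific_signal(
        "http://www.google.com/search?q=kecgxjpgqoe", "http://example.com/"))

The first function needs no knowledge of Google; the second only works because it knows where Google puts the search terms.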


"What Bing did is very smart!" It actually was, but not crediting Google there just makes it look like a cheap shot.

You wouldn't quote someone without citing their name, now would you?

Sadly you are right, they won't be able to push a lawsuit; there aren't enough grounds for it. However, Bing should acknowledge what they did and are doing, and credit Google.

If Google hadn't caught them, we would all be thinking Bing did it on its own, which really isn't the case.


Taking the liberty to concoct a scenario: if Walmart asked shoppers to take a photograph of the product layout on display at their favorite shop (which, say, happens to be Target because it's the most popular in town) and used that to make small modifications to its own layout, would you say Walmart needs to credit Target, or that it's copying Target? This is an arrangement between Walmart and its shoppers and there is nothing Target can do about it other than making a brouhaha. I don't see why things have to be different in the digital world. We all know how user interfaces historically have been blatantly ripped off.


Actually, Bing should credit the users providing all that clickstream data and all the websites they are using. It just happens that Google is one of many... what's your point?


It's possible the author may have intended that:

1. since search results are only accessed through a search action, and not links in the wild,

2. since even if links in the wild are followed, the Google bot should have respected the "Disallow: /search" rule (which may have no standard, but Google usually respects the format used in that robots.txt, and results are in both cases as of this writing),

3. therefore, they got to know such pages by using clickstream data.

But I'm just guessing.


Hmm, I'm not sure you RTFA. He didn't search for the term "bing", he searched for the term "site:.bing.com/search". So it IS kinda a big deal that they're disrespecting the robots.txt and listing those pages anyway.

I have also seen Google listing one of my domains for which I specifically disallowed all spiders (Bing doesn't show those domains FYI). My feeble attempt at separating my personal and professional personas was defeated by frigging Google's blatant disregard for Internet etiquette.

Since I'm trying to be anon here, I won't be able to list the search term. Sorry.


There's a bit of a weird myth with robots.txt and the idea that it prevents pages from showing up in search engine indexes. Robots.txt means that the search engine cannot crawl the page - it can still include it in the search results if it sees enough sites linking to it. It can take a guess at what the page title might be, but there's usually no descriptive snippet because it's unable to see what's on the page.

If you don't want the page to be included in the index at all, you can use the meta noindex tag, by putting this in the head:

    <meta name="robots" content="noindex" />

Pro tip: you need to also unblock that page from robots.txt - if Google isn't allowed to crawl the page, it can't see the meta noindex tag, which means it would stay indexed.


Thanks. I think this may be the reason.

IMHO, Bing is being more reasonable here.

So, I have to let Google see any page which I want to prevent it from showing to others. If I want Google to show NONE of my pages to others, I have to show them ALL my pages. Conveniently, there is no wildcarded noindex, is there? Nonsensical, but since Google has more power here I'll probably have to bend to their whim.


They are not disrespecting robots.txt: http://news.ycombinator.com/item?id=2183519

> I have also seen Google listing one of my domains for which I specifically disallowed all spiders [...] Since I'm trying to be anon here, I won't be able to list the search term. Sorry.

Unsubstantiated accusations. If you truly think Google is disregarding robots.txt but you don't want to divulge the original site, you should set up an experiment as Google has. Otherwise we have no way of determining if your accusations are true, and therefore they should not be taken seriously.


Fine. The reason I stated my point was so that others could chime in if they've had a similar experience.

Let me ask you and others this: Is the following robots.txt supposed to exclude all pages from my domain from showing up in Google results? Am I missing something? According to http://www.robotstxt.org/robotstxt.html I think I'm doing the right thing. Same file is returned for www.<domain>.com/robots.txt and <domain>.com/robots.txt. Google lists <domain>.com/<subdir> in results. I don't think it should be.

    User-agent: *
    Disallow: /


> Let me ask you and others this: Is the following robots.txt supposed to exclude all pages from my domain from showing up in Google results?

I believe that robots.txt is a way to prevent your site from being crawled by a robot, but it is not a blacklist against your site appearing in Google search results if it finds a link to your page on a site that does allow robots.
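For what it's worth, here is what a crawler that follows the standard does with that exact file, using Python's urllib.robotparser (just an illustration of the crawling side; it says nothing about whether a URL can still be listed based on external links):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: /"])

    print(rp.can_fetch("Googlebot", "http://example.com/"))          # False
    print(rp.can_fetch("Googlebot", "http://example.com/anything"))  # False

So the file does block crawling of every page; it just isn't a guarantee against the URLs themselves being listed.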

Check out this page: http://www.google.com/support/webmasters/bin/answer.py?hl=en...

Specifically check out the section "I want to completely remove a page from search results." It appears that if you use the "noindex" meta tag, you can prevent the site from showing up in search results even if other pages link to it. The noindex meta tag is documented here: http://www.google.com/support/webmasters/bin/answer.py?answe...


So it appears the only way to not appear is to let them crawl to see your NOINDEX tag.

And while it's clear NOINDEX prevents a page from appearing in results, it's not clear that it excludes the page contents from analysis by any of Google's algorithms, once collected. (Is it still used to train the spell-checker, for example?)


Great. This link helps. Google's site basically says that in addition to specifically excluding their crawler via robots.txt, I also have to MANUALLY submit a request to them. As I said in a reply below, this is nonsensical and Bing is being more reasonable, but I'll grit my teeth and do it since Google has more power here.

It definitely seems like Google is exploiting a loophole in the spirit of the definition of robots.txt. Robots.txt is an ancient standard, and I don't think it was anticipated at that time that search engines would gain enough confidence about pages' relevance to list them even if they had not indexed/crawled them.


As I recall, the spirit of robots.txt was not about appropriateness of search results so much as "this URL space can generate an unbounded graph, please don't DoS my server by trying to exhaustively traverse it."


No. It prevents pages from being crawled, but references to uncrawled pages can still be shown. See http://www.mattcutts.com/blog/robots-txt-remove-url/ for how it works.


> So it IS kinda a big deal that they're disrespecting the robots.txt and listing those pages anyway.

Did you RMFC? The robots.txt doesn't match what is indexed. As has been exhaustively pointed out in this thread by more than a few commenters, "Search" != "search". On that very search results page, we see #1 which is:

> OLAC search - Bing

OLAC is an unrelated site, and at one point it apparently linked to "www.bing.com/search" with the anchor text "OLAC search - Bing", which Google faithfully indexed but absolutely did not fetch, as you notice there is no description. What else would "OLAC search" mean? This gives us insight into how Google indexes and that they actually respect robots.txt case-sensitively.

The second URL, m.bing.com/Search/Results.aspx, was indexed and fetched because "Search" != "search".


> "Search" != "search"

This is highly ridiculous. HTTP URLs are not mandated to be case-sensitive (though it's recommended), and clearly lots of sites use them in a case-insensitive way. Robots should consider robots.txt in a case-insensitive way, or handle it even more cleverly (even if I'm aware that a lot of them, including major ones, currently don't do that, which is precisely what I consider to be a problem, and this is supported by what happened here -- where Google risks coming across as a fool). The following article has perfectly good arguments in favor of case-insensitivity or even more clever handling: http://www.slicksurface.com/blog/2007-04/be-careful-robotstx...

Also: failing to respect the conditions of use of a service because an automatic process is not safe enough is not a completely exonerating excuse for the operator of such a process...


An HTTP client must always assume URL paths and query strings are case-sensitive. You can't rely on the bing.com web servers always returning the same resource for http://www.bing.com/search?q=test and http://www.bing.com/Search?q=test just because the default filesystem for their OS is case-insensitive today. Actually it's unlikely the main entry point for their search engine is a file named "search" in some top-level directory. The only way for an author to tell you that many URLs are equivalent is to use "301 Moved Permanently" redirects from all the others to the canonical one, and Bing doesn't do that.
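If you want to see what a particular server actually does with case variants, you have to look at the responses; a rough sketch (the URLs and expected behavior are placeholders, not a claim about what bing.com returns today):

    # Request two case variants without following redirects and compare.
    import http.client

    def status_of(host, path):
        conn = http.client.HTTPConnection(host, timeout=10)
        conn.request("GET", path)
        resp = conn.getresponse()
        status, location = resp.status, resp.getheader("Location")
        conn.close()
        return status, location

    for path in ("/search?q=test", "/Search?q=test"):
        print(path, status_of("www.bing.com", path))

A 301 pointing both variants at the same Location would signal one canonical URL; two plain 200s means the server treats the spellings as interchangeable.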


The two major issues in this article were:

- Google can see and return links to pages without crawling them. I made a video and a blog post about this a while ago: http://www.mattcutts.com/blog/robots-txt-remove-url/

- URL paths are case sensitive. Bing blocks /search in its robots.txt but not /Search. That's how the /Search urls got crawled.

In a later edit, the author suggests "It would be fairly trivial for bots to test if the server is IIS (if the server identifies itself as such of course) or to try to retrieve Robots.txt and robots.txt, if those come up as equal then the server can be assumed to be case insensitive."

The issue of case sensitivity in robots.txt is a long, very nuanced topic. Here's just one example to get you started: at least back in 2007 when we were talking about this amongst ourselves at Google, the web server for developer.apple.com was case-insensitive, but their robots.txt had lines like this:

    Disallow: /documentation/quicktime/
    Disallow: /documentation/Quicktime/
    Disallow: /documentation/QUICKTIME/
    Disallow: /documentation/macosx/

Why would they do that? Apparently because Apple wanted the canonical link to be /documentation/QuickTime . Back then, at least 21M robots.txt files on the web had mixed-case paths. If Google started interpreting robots.txt files from servers that claimed to be IIS differently... well, I'll leave it as an exercise to the reader to come up with some of the unexpected bugs and behavior that could result.

I know it's really tempting to write a headline like "Bing search results showing up in Google," but I wish the author had done more research instead of going for a gotcha. Any SEO worth his/her salt could have explained what was going on here.


>"...but I wish the author had done more research instead of going for a gotcha."

That's a bit of irony right there, isn't it?

--wow, Matt, I guess it cuts both ways, eh?


Anything in particular you're talking about? I've said a lot of stuff online over the last 10 years. :)


I don't know, but I would assume that they're one of the people who don't think that what Bing did is a big deal. If I understand the arguments properly, most either believe that the clickstream data belongs to the user (who gave it away), or they believe that you haven't actually proven that they're singling out Google in their clickstream data.

Anyhow, I've already posted in this story how anyone who wants to could go do their own experiment (it appears that Bing's data isn't too hard to fake, that goes double for anyone who can reverse engineer the toolbar). Similarly, I'm not convinced that Bing doing this is going to ruin search any time soon, mostly because I can see spammers/blackhat SEO types renting botnets to feed Bing all the bogus clickstream data they want.

It appears to be a simple http request with some time zones, link text, and an identifier or two that can be harvested from actual toolbars. If they don't bother to spam them, it's because they believe that Bing is irrelevant. And when I say "irrelevant" I mean "even less relevant than a low-traffic wiki for a free game that's currently under a massive assault from spambots."

Which, incidentally, might be one good reason for your team to look more at keeping wiki-spam out of your index. Some spam results from that wiki (in spite of having rel=nofollow) were seen in Google's index and stayed there until the admins caught on and cleaned things up.


It's no secret that a fair number of commenters throughout the controversy have taken either Google's or Bing's side.

Some people have thought that the initial story came to a premature conclusion (where, incidentally, Bing cried foul). Now we have Jacques, it appears, arriving at a premature conclusion as well, and we have Google crying foul. To me, that's the essence of irony.

I'm sure both Google Search and Bing have their reasons for the way the dirty laundry has been brought out in public.


Dear Microsoft, case-sensitivity is important. [1]

m.bing.com/robots.txt says "/search", not "/Search". All of the crawled [2] urls are "/Search" or "/~/search".

Also,

wap.bing.com/robots.txt explicitly "Allow:"s several search pages, which are indexed by google.

[1] EDIT: Case insensitivity is often important. Above comment notes that some robots are case-insensitive. I suspect Google is not, based on the results.

[2] EDIT: I said indexed, a reply corrects to crawled. Good point, thanks.


Small correction: All crawled urls are "/Search" or "/~/search". Note that www.bing.com/search/ is indexed (link was found on another page), but not crawled (result has no snippet).


Naturally, Bing is hosted on a Windows server, which inherits the Windows filesystem eccentricities. Case insensitivity is among those. Because "search" has six letters, that would mean that the robots.txt would need to have 64 entries to completely exclude this directory. That's not even including the tilde thing or any other paths to that directory. And that's for one directory.
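(For the curious, the variants are easy to enumerate -- two choices per letter, six letters:)

    # All case variants of "search": 2**6 = 64 robots.txt entries.
    from itertools import product

    variants = {"".join(chars) for chars in
                product(*[(c.lower(), c.upper()) for c in "search"])}
    print(len(variants))  # 64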

Lame? Yes. Google's fault? Not in the slightest. But it brings up an interesting question: if MS clickstream gathering included an opt-out mechanism that happened to be impractical for Google, would that change the ethics of any of this? Say, by having the Bing toolbar identify itself in user-agent so that Google could block it if they wanted?

I wouldn't think that would materially change the situation. If Google really wanted to, they could probably "block" this now by encrypting their existing URL redirects, thus hiding the URL from the Bing toolbar entirely, at least until the user is out of the Google system.
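That last idea could be as simple as putting an opaque token in the visible link and only resolving it server-side. A hypothetical sketch (using the third-party cryptography package; this says nothing about how Google's actual /url redirects work):

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()      # held server-side by the search engine
    f = Fernet(key)

    def make_redirect_link(target_url):
        token = f.encrypt(target_url.encode()).decode()
        return "/url?tok=" + token   # all a toolbar would see in the page

    def resolve_redirect(token):
        return f.decrypt(token.encode()).decode()   # server-side 302 target

    link = make_redirect_link("http://example.com/landing-page")
    print(link)
    print(resolve_redirect(link.split("tok=", 1)[1]))

Of course the toolbar still sees the final URL once the redirect lands, which is why this only hides the association until the user leaves Google's pages.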


What is the value of processing robots.txt in a case-sensitive way? If URLs are to have different status depending on case, then the site structure is just broken. Plus, considering robots.txt in a case-sensitive way has already resulted in lots of errors, this one included, and will result in even more in the future. Plus, HTTP is not mandated to use case-sensitive URLs (though it's recommended). I can't think of any argument for why robots.txt should be processed in a case-sensitive way (I mean for a good reason -- obviously search engines have a very "good" incentive to handle it that way: the possibility to cheat and index more than they should, with an excuse when they are caught); on the other hand I can think of many for case-insensitive processing...


Webmasters shoot themselves in the foot a lot. "My site isn't showing up in google" is a frequent complaint on webmaster help forums, and typically the problem is robots.txt or meta noindex. From that standpoint, since most sites do want to be indexed, it makes sense to follow the standard as strictly as possible. It should be hard to remove your site by accident, which case insensitivity would make somewhat easier. Google states explicitly in their robots.txt policies that it is handled in a case-sensitive way.


> never mind that Bing only used its toolbar as a url discovery device

That is obviously untrue, and shows that the author does not understand the issue even superficially. The Google experiment showed that Bing was associating urls to search terms for no reason other than that Google had done so. You know, like making a search for mbzrxpgjys return rim.com, a URL which we can safely assume Bing was already quite aware of.


the author does not understand the issue even superficially.

That, in a nutshell, is it. I don't know why we're spending so much time on this post, as the author has no idea how search engines work or what robots.txt is. If he had just looked at Bing's robots.txt and the URLs in Google's results, he would have seen that each and every one of them passed robots.txt. Since he site-restricted the search to ".bing.com", naturally you _will_ get only Bing results!


Aren't /search and /Search considered two different directories when it comes to robots.txt?


Yes. RFC 3986, sections 6.2.2.1 and 6.2.3.


I think you got the wrong RFC number:

http://tools.ietf.org/rfc/rfc3938.txt

No mention of robots.txt, nor a section 6.


Sorry, the correct number was 3986.

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax


Never underestimate the ability of a human being to rationalize.

If I was in the Microsoft camp, I'm sure I would also be grasping at straws to explain why it's totally fine for Bing to use Google's search results. It's human nature to rationalize.

The bottom line is that Bing's index contains associations that it could never have figured out if Google hadn't figured them out first. How many there are, we cannot know. There's no way around the fact that Bing is piggybacking on the work of Google's search engineers.

Is it "good for the customer?" In the short term, it's good for the customer if they can buy $1 bootlegged DVDs. In the long term, it's bad for the consumer if the money goes to bootleggers instead of the people who are doing the actual work.

Think I'm exaggerating the effect of just "1 out of 1000 signals?" This argument would be extremely easy to refute. Stop using Google's results. If it really isn't that significant, then why should it be a problem to stop using it? Just turn it off and let everyone observe that the quality is 99.9% as good as it used to be, and avoid any accusation of copying.

By refusing to turn it off, Microsoft makes it clear that it is an important part of their index, and that they have no qualms about having an important part of their index ripped off wholesale from their biggest competitor. Maybe it's a smart business move. But if that's the case, spare us the outrage about being called "copyists."


The thing that bothers me about all this drama is that the actual offense Google wants everyone to be so worked up about is that Bing doesn't filter Google from its clickstream data.

Bing wrote code that works across the whole web. The whole web includes Google. As a result, Bing gets some info from Google. But they didn't get that info because they copied Google, they got it because they didn't filter Google out - or, said another way, because they ignored Google as someone they needed to special case for clickstream analysis.

I don't work in search, but the idea that you're supposed to special-case your competitors when writing general-purpose tools sounds an awful lot like a unilaterally recognized gentleman's agreement. If it's not illegal, and it doesn't hurt end users, why shouldn't it be considered fair game?

I also think it's odd that throughout this whole thing, nobody has really noticed that the only possible way Google could have spotted this issue is if they're keeping very close tabs on Bing's search results. It's another arbitrary line that Google seems to have unilaterally drawn: it's clearly fine to monitor your competitors results closely, which presumably is going to have an effect on your own results; it's only out of bounds when that effect is directly measurable.


I thought part of the point is that whatever Bing is doing doesn't work across the whole web. They need to associate the URL with a query, and most websites don't have queries. It's not just that Bing has recorded a click on Miley Cyrus's webpage; it's that they've done that and associated it with the query [kecgxjpgqoe].


People have made that point, but I don't understand it. 'kecgx...' shows up as a parameter in the url of the Google query. Tons of sites include relevant information in parameterized urls; why is it unexpected that Bing would use that information across the whole web? Other people have said that implies that Bing has to have special Google url-parsing code, but that's not true at all - query parameters in urls are standardized. You would have to have special code to understand the specific semantics of Google's query urls, but there's no reason to think Bing needs or wants parameter semantics, they could easily just be interested in making probabilistic associations.


"You would have to have special code to understand the specific semantics of Google's query urls"

That's why people say that it's special-cased. There's no web standard that says the 'q' parameter means that the page is a search engine and the parameter is the query. That was something AltaVista did a long time back (possibly for byte-saving reasons, or possibly because they were lazy) and Google et al copied. Many other search engines use a different system, eg. DDG puts it in the request path, InfoSeek used qt=, Excite used search=.


Why do you think Bing cares or has to know that "q=" means a Google query term? My point is that they don't have to have any semantic information about parameter keys to be able to derive probabilistic associations between parameter values and clicks.

If you consistently see pages with 'foo' as a parameter value to any parameter key, and clicks on those pages consistently go to site bar, it's completely reasonable to start associating foo with bar, regardless of what the parameter keys are.
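A toy version of that kind of generic association, purely illustrative (no claim about how Bing actually weighs its clickstream data):

    # Associate query-string parameter *values* from any site's URLs with the
    # URL the user clicked through to; no per-site logic anywhere.
    from collections import Counter, defaultdict
    from urllib.parse import urlparse, parse_qsl

    clicks_by_value = defaultdict(Counter)

    def record_click(page_url, clicked_url):
        for _key, value in parse_qsl(urlparse(page_url).query):
            clicks_by_value[value][clicked_url] += 1

    # Toy clickstream: the same code handles Google, DDG, a forum, anything.
    record_click("http://www.google.com/search?q=kecgxjpgqoe&ie=utf-8",
                 "http://example.com/some-page")
    record_click("http://someforum.example/thread?topic=utf-8",
                 "http://example.org/encoding-faq")

    print(clicks_by_value["kecgxjpgqoe"].most_common(1))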


> Why do you think Bing cares or has to know that "q=" means a Google query term?

If that's true, then they should also be associating the sites linked with all the other weird parameter values in a search query, which would spam them to heck. Here are all the params from a search I just did on google:

    q=test&ie=utf-8&oe=utf-8&aq=t&rls={moz:distributionID}:{moz:locale}:{moz:official}&client=firefox

They'd start making a lot of strange associations between random sites and "utf-8" if that were true, because that parameter shows up in just about every Google search done in English.

It's also a perfectly normal thing for programmers to search for, so they'd clutter up their index with millions of sites that had nothing whatsoever to do with utf-8.

So to make any real use out of that, they had to understand what the parameters in there actually mean, rather than associating ALL of them with whatever site was next in the clickstream.

Though I grant you, that does not disprove the alternate hypothesis that they were dumb and polluted their index with loads of irrelevant crap.

And I admit that I found weak support for that hypothesis by trying to see if utf-8 was linked to rim.com (one of the test sites, if memory serves):

http://www.bing.com/search?q=utf-8+rim&go=&form=QBRE

Those results appear to be crap, though I'm not sure that any sensible results exist that they could return.


If that's true, then they should also be associating the sites linked with all the other weird parameter values in a search query

If it's a parameter value people will ever actually use Bing to search for ('utf-8'), there's probably plenty of other signal to help them figure out which results to return. If it's not ('kecgxjpgqoe'), we already know they sometimes return crap, thanks to Google's little experiment.

They'd start making a lot of strange associations between random sites and "utf-8" if that were true, because that parameter shows up in just about every Google search done in English.

If it shows up as a parameter for every search, why do you think Bing's algorithm would decide it was a good indicator for a particular url? P(foo.com|utf-8) wouldn't be any different from P(bar.com|utf-8), making 'utf-8' basically worthless as a discriminator. I'm pretty sure the folks at Bing understand the concept of conditional probability.

It's also a perfectly normal thing for programmers to search for, so they'd clutter up their index with millions of sites that had nothing whatsoever to do with utf-8

I see no justification for the idea that url parameters are somehow a more difficult challenge in this regard than the mass of crap that is content on the web, which we know they crawl and index at a massive scale.
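To put a number on the discriminator point: a toy check of how informative a parameter value is (again just illustrative, made-up counts, not anyone's actual ranking code):

    # A value seen evenly across many clicked sites (like a charset flag)
    # carries little information; a value seen with only one site is a
    # strong signal for that site.
    from collections import Counter

    def p_site_given_value(counts, site):
        total = sum(counts.values())
        return counts[site] / total if total else 0.0

    utf8 = Counter({"foo.com": 500, "bar.com": 480, "baz.com": 510})
    honeypot = Counter({"rim.com": 20})

    print(p_site_given_value(utf8, "foo.com"))       # ~0.34, barely above chance
    print(p_site_given_value(honeypot, "rim.com"))   # 1.0, a strong association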


It would only show up as a parameter "relevant" to whatever sites were next in the clickstream. Remember, not every page transition is Google -> other site.

They'd also be gobbling up tons of random things from forums and whatnot (which should appear on long tail searches, if we knew where to look), most of which spam the heck out of you with random parameters, forum names, and whatnot.

> I see no justification for the idea that url parameters are somehow a more difficult challenge in this regard than the mass of crap that is content on the web, which we know they crawl and index at a massive scale.

Which is why I see no reason to assume that they don't understand or attempt to understand the actual meaning of parameters passed to one of the biggest sites on the internet.


You said 'utf-8' shows up as a parameter in almost every English Google search, and suggested that this would cause weird associations between 'utf-8' and random pages.

I pointed out that there's no reason to expect this to be true, because Bing engineers are likely smart enough to realize that if the probability of clicking on to foo.com is not significantly different than the probability of clicking on to bar.com given the presence of the 'utf-8' parameter, then 'utf-8' is a pretty poor discriminator between foo.com and bar.com, and probably shouldn't be used to determine search results.

It doesn't matter that not every page transition is Google -> another site. You still wouldn't need to special-case Google to determine an association with a parameter is useful or useless - the same code could build that model for any site with params.

They'd also be gobbling up tons of random things from forums and whatnot (which should appear on long tail searches, if we knew where to look), most of which spam the heck out of you with random parameters, forum names, and whatnot.

How do you know that they don't? Google pointed out some longtail results that look bad, and you yourself pointed some out in a previous comment.

Which is why I see no reason to assume that they don't understand or attempt to understand the actual meaning of parameters passed to one of the biggest sites on the internet.

You're misunderstanding what I'm saying. I don't assume that they don't; I'm just saying there's no evidence that they do, and that assertions of wrongdoing based on the belief that they do are just irresponsible speculation. I have no knowledge of what Bing actually does, but neither do the vast majority of the people on the internet who are talking about this, many of whom assume the worst based on a mistaken notion of what's technically necessary to see the results that Google demonstrated.


> the presence of the 'utf-8' parameter, then 'utf-8' is a pretty poor discriminator between foo.com and bar.com, and probably shouldn't be used to determine search results.

And yet, it will link random text to websites even if they appear only in Google's URLs. I realize you're talking about discrimination (as in, "what's the better result for utf-8?"), but if the code is generic, it ought to be generic in this respect as well. After all, it linked up random nonsense to random sites given nothing more than Google's say-so, even though there's plenty of information out there about, say, rim.com that would tend to indicate that nobody except Google thinks that random text is relevant to an otherwise well-known site.

> How do you know that they don't? Google pointed out some longtail results that look bad, and you yourself pointed some out in a previous comment.

Indeed, I do not know. I know that it would be dumb to link those things to random sites, but you are correct that I do not know if they're doing things that dumb.

> I don't assume that they don't; I'm just saying there's no evidence that they do, and that assertions of wrongdoing based on the belief that they do are just irresponsible speculation.

Well, for one, I'm not really asserting "wrongdoing" here. That is, I don't particularly think that it's wrong of them to do things this way. My interest is mainly technical, so I'm more interested in figuring out exactly what they're doing rather than blaming them for it. As such, I'm going for the most likely explanations I can find, rather than worrying about whether it's been proven to such an extent that they can be blamed for it (as I'm not really going to blame them anyhow).

You may have seen where I pointed out that I don't think it will "ruin search" in the end because they should expect a crapflood from spammers now that it's clear that they use clickstream data to rank sites. After all, there's a large spam attack right now on a tiny wiki for a game I play. I have to think Bing is more of a target than that. I can't prove that, true, I'm just playing the odds here.


By that standard, Google should also quit trying to integrate invite-your-fb-friends feature, right? Cuz it's essentially facebook who has figured out how to get massive user signups with their real data and it's google trying to leech.

Instead of not trying to integrate fb, Google is accusing fb of not opening up the data. So when it's convenient, you want data opened up. When it's not convenient, you scream "copycat!".


Google was freely sharing contact data with Facebook, so there's no hypocrisy here. Contact data is not the same as a search engine.


Freely sharing the data in the sphere where you're trying to catch up, while zealously guarding the data in the sphere where you've got a massive lead, is the alleged hypocrisy.

Hard-won bits is hard-won bits, at a suitably abstract level of analysis.


Contact data is not the same as a search engine.

Elaborate? Bing didn't steal Google's source code. They intercepted some clickstream data with the user's permission.


A search engine performs the task of narrowing 1 trillion URLs into the 10 that are most likely to be what you are looking for. Associating a query with those 10 results takes teams of software engineers and datacenters full of computers. A user could not reasonably do this on their own.

Contact data is a list of the people you know. I had that list in my brain before I ever visited Facebook. If I am on different social networking sites, that list is mostly the same, because it is information about me that I already knew. I didn't come to Facebook saying "hey Facebook, find for me the 10 people in the world who would be the best possible friends for me." Instead, I come to Facebook and start telling them the names of my friends.

But forget all that for a second. Imagine that contact data and search engines really were the same thing. Google does not access Facebook's social graph, because Facebook has not allowed them to. For Google's behavior to be like Microsoft's, Google would have to work around that by using the "clickstream" of Chrome users who are on Facebook. Who knows, maybe Microsoft is doing this already, since they seem to think that "clickstream" is a valid way to mine the internal databases of other companies.


A search engine performs the task of narrowing 1 trillion URLs into the 10 that are most likely to be what you are looking for. Associating a query with those 10 results takes teams of software engineers and datacenters full of computers. A user could not reasonably do this on their own.

Quite irrelevant to the discussion here. A user could suggest specific pages be indexed by Bing, no? If so, why can't a user let Bing automatically listen in on the pages they are visiting and index them if Bing wishes to? That's what Bing did.

Contact data is a list of the people you know. I had that list in my brain before I ever visited Facebook. If I am on different social networking sites, that list is mostly the same, because it is information about me that I already knew.

Again, pretty irrelevant. If we are talking about data ownership, both your contacts data and your clickstream data belong to you. You can do what you want with it. Some users choose to share it with Bing. That's their prerogative. I don't think even Google disagrees with that.

They just don't see the hypocrisy between this and their stance on Facebook.

Google does not access Facebook's social graph, because Facebook has not allowed them to.

Correct, and yet, Google's stance in the Facebook episode has been that Facebook should open up that data because it does not belong to them, it belongs to the users. To be consistent with their stance on Bing, Google should never have created a big fuss about Facebook not opening up the data.

Similarly, clickstream data does not belong to Google. It belongs to the users.


Even if absolutely everything you say is true, even if "clickstream data belongs to the users" and the users can "share" it with Microsoft, it's still the case that Microsoft is riding on the backs of Google's engineers.

And as I said in my original message, maybe that's even a smart business move. But let's call a spade a spade. Microsoft is copying Google search results.

Is that really what you want to be sticking up for?


Microsoft is copying Google search results.

Clickstream data != copying search results.


This is the lie that you keep telling yourselves. Rationalize it however you want, but show a normal person the screenshots on this page and ask them if there is copying going on: http://www.mattcutts.com/blog/google-bing/

Maybe you call it "clickstream," maybe you think this has something to do with the user (even though the users had no idea they were aiding in this), this web of rationalizations cannot get around this fact:

Bing puts things in its search index that came directly from Google's algorithm.


...that came directly from Google's algorithm

That is a lie.

1. User searches for a specific string.

2. Google pulls up the results.

3. User clicks on the result.

Bing makes a correlation between the user's search string and the URL they end up on and takes a note of it. There is very little algorithm involved here because there is only one listing for that string. The algorithm is primarily used to rank pages. In order for Google to convince me that Bing is copying them, they would need to show a consistent before and after of Bing copying the search results and their order. Emphasis on resultS--in plural.

Neither are present in the Google honeypot. There is only one search result returned; and there is no question of the order of search results cuz of that. Two items central to the algorithm are entirely missing.

Going back to what Microsoft does copy:

1. it copies the user's input--imo this belongs to the user

2. it does not copy the order of search results nor scrape the search results returned by Google

3. it copies the url the user ends up on, which is effectively a user's browsing history that the user has opted in to share with Bing.


"The bottom line is that Bing's index contains associations that it could never have figured out if Google hadn't figured them out first."

Actually, it is the users who are figuring out what is relevant, not Google, since a) Bing are collecting data on the links users click regardless of where Google ranks them (could be on the 20th page of Google search results for all Bing care) and b) Bing are collecting this data from all sites; there has been no evidence that Google search results are treated any differently to any other site.


> Actually, it is the users who are figuring out what is relevant not Google

When you do a Google search, Google chooses ten URLs from over 1 trillion that are in its index. Creating this list takes teams of full-time engineers and data-centers full of servers that crawl the internet constantly.

The user spends, on average, less than ten seconds choosing 1 out of those 10 URLs. In a small minority of cases, they'll click to further pages, in which case maybe they were choosing 1 out of 100.

So Google narrowed the search down from 1 trillion URLs to 10, and the user narrowed it down from 10 URLs to 1. And yet you think it's the user who did the hard work?


Individually, it's negligible. Cumulatively, the time spent by the many thousands (millions?) of users who are clicking links they deem relevant for queries all over the web (Bing are collecting data from all sites) is huge...

My point is that the assertion that Google are doing all the work in creating the data used by Bing is clearly false.


> My point is that the assertion that Google are doing all the work in creating the data used by Bing is clearly false.

Clearly it's not 100%, but what percentage is it?

The user ends up visiting 1 out of 1T possible URLs.

Google narrowed it down from 1T to 10 URLs (11 orders of magnitude).

The user narrowed it down from 10 to 1 URLs (1 order of magnitude).

Even if you measure this logarithmically, Google did 91% of the work, and the user did 9%. And the Microsoft people are arguing that it's totally justified because of that 9%.

Or you could measure it linearly, and argue that Google did 99.999999999% of the work because that's the percentage of the results they eliminated.

Either way, this is an extremely weak justification for taking a result of which 91% came from your direct competitor (when measured in the most charitable way).


Those millions of users are all searching on platforms where 99%+ of the work was already done by the site's own engineers. C'mon, this isn't rocket science.


I cannot agree more with your second paragraph.

Yes, the Bing toolbar just relies on people clicking links. But without Google doing all the hard work here, users would never know about such a link; that's why they went to Google. And there would not be any such link to feed into Bing's system.


You could also argue that the user is doing the hard work by going to Google (or any other site, since this is not specific to Google) and deciding which links are relevant by clicking on them. The user has opted to share this work they have done with Microsoft (by installing the toolbar and agreeing to the terms) to help them get better results on Bing.


I think you're trying too hard to rationalize Google's claims.

With regard to the 1-in-1000 signals, clearly they have that many signals because not all 1000 signals are strong for every query. In Google's case, they "Googlebombed" Bing, but instead of using the "anchor" signal (which most people use to alter Google's results), they chose the "click stream" signal. Google specifically chose scenarios where the other 999 signals weren't being used.

If I had a toolbar installed on millions of machines where people opted in to send me their click activity, I could start a search engine that used only click stream data to rank results. You'd be seeing the ranking affected by all sorts of sites including Facebook, Google, Amazon, Wikipedia, etc... As it should, the ranking would be affected by the click activity that users perform on the most popular websites. Effectively I'd have crowd-sourced my ranking. If Google is returning a result for a particular person's name, but Facebook's click stream activity is returning a fan page for that person's name which is clicked on more frequently than the Google result, then the Facebook result would show up in my search engine for that person's name.

All Google did here was choose a scenario where every other signal wasn't being used by Bing, effectively turning Bing into the click-stream search engine described above. Furthermore, Google chose a scenario where even within this single signal, they were the only input to it. All Google did was googlebomb Bing. Any well trafficked website could do the same.

It's ridiculous the lengths people will go to while trying to discredit Microsoft. Bing has built a generalized system that partially learns rankings through behaviors observed on high-reputation websites. That's it.

And no they shouldn't remove Google from the signal because that'd imply they've done something wrong.


>As it should, the ranking would be affected by the click activity that users perform on the most popular websites. Effectively I'd have crowd-sourced my ranking.

Not exactly. Those sites work hard to present relevant links to the crowd. If the sites have competent engineers, your record of clicks cannot add much value -- if only because you only have a subset of all the clicks on that site. So you see Google saying they're seeing increasing overlap with Bing over time.

Also see http://news.ycombinator.com/item?id=2184025


We need to get over this partisan "gotcha journalism". No one really benefits from everyone making low content blog posts with any random accusations that make their side 'right' (which just happens to be ad hominem anyways).


This looks like microsoft is assuming robots.txt is case insensitive?


It is. Only domain names are case-insensitive.

This document explains how Google handles robots.txt : http://code.google.com/web/controlcrawlindex/docs/robots_txt...


Are we reading the same file? Under where it describes matching paths (so replace /fish and /Fish with /search and /Search if you like):

    Example path matches

    [path]          /fish
    Matches:        /fish, /fish.html, /fish/salmon.html, /fishheads,
                    /fishheads/yummy.html, /fish.php?id=anything
    Does not match: /Fish.asp, /catfish, /?id=fish
    Comments:       Note the case-sensitive matching.


If you look at http://wap.bing.com/robots.txt, the URLs that Google is returning are actually all set to 'allow', not disallow.

It also looks like m.bing.com/robots.txt blocks /search while their actual URLs are /Search - I guess Googlebot treats robots.txt as case-sensitive.
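You can reproduce that with a stock parser. A quick illustration with Python's urllib.robotparser, feeding it the one rule in question:

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: /search"])

    print(rp.can_fetch("*", "http://m.bing.com/search?q=test"))   # False
    print(rp.can_fetch("*", "http://m.bing.com/Search?q=test"))   # True

The matching is a plain case-sensitive prefix check, so /Search sails right through.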


robots.txt applies to the source of the links, not the target.

So if, say, http://www.paulgraham.com/ links to http://m.bing.com/search, then http://m.bing.com/robots.txt does not apply to that.

EDIT: If you think this is wrong, please explain it instead of just downvoting me, because I think it is pretty unfair that I lose karma for explaining my interpretation.


Why would someone with no authority over your site linking to your site override your robots.txt?

Edit: I didn't downvote you, but I think you're wrong because it makes no sense - if I link to your site that shouldn't give search engines a free pass to ignore your wishes and do whatever they want with your content.


I understand a robots.txt "Disallow: /foo" to mean that it must not crawl that page, i.e. look at the links _inside_ that page.


I've always interpreted it to mean they're explicitly not allowed to touch it - no exceptions (unless you actually specified exceptions for them which you can see on the 3rd last example at http://www.robotstxt.org/robotstxt.html).


One counterintuitive thing: they are allowed to link to it in search results (using links that point to it to rank it, for instance) but can't use the content of the page.


Interesting, and I've always interpreted it the other way. Looks like you are right. But the language on www.robotstxt.org is pretty vague, too. (What does it mean to "visit" an URL?)

The "RFC" on http://www.robotstxt.org/norobots-rfc.txt (Section 3.2.2) states:

"These lines indicate whether accessing a URL that matches the corresponding path is allowed or disallowed. Note that these instructions apply to any HTTP method on a URL."

And I think both interpretations are in theory (but probably not in practice) valid, at least if a search engine is willing to add a site to its index without accessing it (which is unlikely, but not impossible).


I think you're confusing robots.txt with the nofollow attribute value: http://en.wikipedia.org/wiki/Nofollow


Yup. Anchor text is fair game. Keywords taken from anchor text are usually a good indication of what the linked page is about.

A page can have a high page rank without ever being crawled - given that other high ranked pages have linked to it with good keywords in the anchor text. Such pages can often be identified by the lack of snippet texts.

Edit: I may have misunderstood you, you can only use the data found in links, not follow them.


If a page is not crawlable, but listed in the search results based on anchor text (or the URL), then it will not have any snippets. You can show snippets only if you crawl the page itself.


It just means a crawler cannot retrieve the page; in other words, a crawler cannot GET a URL for which there's an exclusion pattern in robots.txt.


That's not true.


Google has likely indexed links to Bing found on other pages, rather than on Bing itself. That doesn't mean it followed the links (and it wouldn't, if excluded by robots.txt).


>never mind that Bing only used its toolbar as a url discovery device, not to 'copy search results'

Yeah, they just happened to discover high quality urls on google. What are the chances?


URL discovery is one thing; ranking that discovered URL at number 1 without any other signals is another. I don't think anyone cares that much about how Bing does URL discovery (unless, of course, the URL is supposed to be private and exchanged via email). Given that they have a new URL, what made them rank it #1?


robots.txt disallows (or did until recently) only "/search". The results shown have "/Search" in the url. Bing screwed up.


Bit different.


Shouldn't the headline be "Bing explicitly allowing some results pages to show up in other search engines"?

The wap.bing.com/robots.txt blocks all of /search/ and then explicitly allows a few pages. Whatever the reason is for that.

Very weak article, IMHO.


Microsoft is way out of line. Google figures out what content is good by crawling every page and doing the leg work, and Bing copies Google data and displays what google displays.

Google proved it with the Bing sting. There is absolutely NO reason why Bing should have linked to those documents, other than that they copied off of Google's exam paper.

When students do this, it is called plagiarism. The smoke being thrown up by MS is just to distract and divert while they scramble to hide what they did.


They proved that Microsoft uses clickstream data to rank websites, and in 7% of manufactured cases that's all the data they have?

I think this is exactly as petty and silly as last week's news, and now they get to spend a week explaining how this occurred and that they do obey robots.txt.


Everyone else in this thread has already explained that Microsoft is not using robots.txt properly, given its case-insensitive URLs, which is why these URLs were indexed.


"Be strict in what you emit, liberal in what you accept."

Bing failed the first part, Google the second. Both should fix that.


The Bing context might suck for you guys, but this is your problem - case insensitivity applies to every Windows server, not just the Bing website.

Why would anyone hosted on Windows have to specify every possible spelling variation to keep search engines out of a folder or file?

Here's another example:

http://www.ifma.org/robots.txt

These guys are disallowing /pv/

http://www.google.com/search?sourceid=chrome&ie=UTF-8...

You guys are indexing /PV/


Maybe I'm misunderstanding your argument, but I think you're confusing the Windows file system with the URLs that a web service provides.


The problem exists when that web service is on Windows - ASP.NET, ColdFusion, static HTML sites, probably a negligible percentage of PHP sites, etc.:

/pv is /PV is /pV is /Pv
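To put a number on the "every possible spelling variation" point from above: with a case-sensitive matcher you would need one Disallow line per upper/lower combination (a throwaway sketch):

    from itertools import product

    def case_variants(path):
        # every upper/lower combination of the alphabetic characters in the path
        choices = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in path]
        return ["".join(combo) for combo in product(*choices)]

    print(len(case_variants("/pv")))      # 4 lines for a two-letter folder
    print(len(case_variants("/search")))  # 64 lines for a six-letter one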


Except that this time it isn't Microsoft themselves crying foul.


> there is absolutly NO reason why bing should have linked to those documents

There is absolutely a reason. A user queried for a string, then followed a link. Biasing Bing's search results towards the followed links is a signal that improves their search.

> When students do this, it is called plagiarizing. The smoke getting thrown by MS is just to distract and divert while they scramble to hide what they did.

When I took exams, I wasn't allowed to consult with fellow students, read the internet, or open a textbook. My life as a developer would certainly be a lot different if I weren't able to do any of those things now because they were "cheating" back when I was in school. It's an awful analogy and, as you've regurgitated it here, it seems to be an effective smokescreen on Google's end.
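For what it's worth, the kind of signal being described is easy to picture - something like this toy aggregation, where every (query, clicked URL) pair nudges the ranking regardless of which site the click happened on (the function names and the weight are invented for illustration):

    from collections import Counter, defaultdict

    clicks = defaultdict(Counter)  # query -> Counter of clicked URLs

    def record_click(query, clicked_url):
        # fed by toolbar clickstream events, whatever page the user was on
        clicks[query.lower()][clicked_url] += 1

    def click_boost(query, url, weight=0.1):
        # toy ranking tweak: more observed clicks for this (query, url) pair, bigger boost;
        # a real ranker would normalize this and filter out spammy or forged data
        return weight * clicks[query.lower()][url]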


There is absolutely a reason. A user queried for a string, then followed a link. Biasing Bing's search results towards the followed links is a signal that improves their search.

Yes, but the actual effect is that they've just copied Google's results, rather than extracted valuable insight from their own users. And I don't see how Bing could have scraped the search terms and results from a Google session unless they are using code specifically tailored for Google's site. Given that, the clickstream excuse rings awfully hollow.

Plagiarism is going to be a judgement call in a developing field like search. Google's judgement is that Bing's scraping of their site, in this case, is fairly underhanded. I would agree.


Yes, nothing exists on the internet if it's not on Google. When people click on things on Reddit, if Google hasn't yet indexed it, then it doesn't exist.


Google proved it with the Bing sting

They didn't prove anything. Every experiment needs a control. Where's Google's?


"Control" what? Where's the control for "gravity pulled the apple down" ?

It's called an existential proof. If you want to prove that a black swan exists, all you need is 1 example of a black swan.


It's called an existential proof. If you want to prove that a black swan exists, all you need is 1 example of a black swan.

Strictly speaking, yes, non-controlled experiments exist. However without a control you cannot eliminate alternate explanations. Since we know that an alternate explanation exists here, the experiment doesn't show anything.

In the case of the black swan there's no credible explanation for that thing you see being anything but a black swan. The control is essentially Occam's razor.


Google's hypothesis was "Does Bing use Google data in its rankings?" That hypothesis was proven (because there was no other way to get those crazy links except for clickstream data showing users going from a Google search for those nonsense terms to those sites).

If you want to explore a very different hypothesis, namely "Does Bing single out Google in its weightings of clickstream data?" then I suggest you go here:

http://projectgus.com/files/googlebing/seaport-trace.txt

That's a packet capture of some clickstream data. That should be more than enough to forge as much data as you like. You can then make up a nonsense search term, like "doesb1ngtrustgmorethananyoneelse", that should get zero results, then forge clickstreams going from a google.com search to "yes.com", as well as an equal number of searches for that term going from some other search engine to "no.com", and explore the weights as much as you want.

That said, the fact that they weight Google highly enough to take its word that a clearly irrelevant term should be mapped to some site is strong evidence, I think.

Mind you, I don't think it's "wrong" exactly for Bing to do this. I'm not worried about it destroying search, either. The spammers/SEO types will make it useless soon enough.


Good points. I'm the one who took the packet capture you've linked, and I'd be really interested to see what happens if someone runs the test you describe.

That said, the fact that they weight Google highly enough to take its word that a clearly irrelevant term should be mapped to some site is strong evidence, I think.

The evidence was that this occurred in 9% of the cases they tested, for terms that don't exist anywhere else on the net (i.e. the only place they could have been found anywhere in Bing's dataset was in the Google query URL).

Bing didn't weight Google ahead of a single other source of information, so I don't think this really constitutes evidence of high weighting at all. To make that claim, someone needs to run a controlled test like the one you describe here.

spammers/SEO types will make it useless soon enough

Yeah, one thing that surprised me greatly is how easy it looks to inject this clickstream data. Although I don't really know what half the fields do, so there could be something clever going on in there.


> Bing didn't weight Google ahead of a single other source of information, so I don't think this really constitutes evidence of high weighting at all. To make that claim, someone needs to run a controlled test like the one you describe here.

Strictly speaking, I think you're right, but it was interesting that it would accept mappings that didn't exist elsewhere, rather than requiring the same mapping from at least one other source. But you're right that a more complete test is warranted. And Bing may have already adjusted things to prevent this, so we may never know.

> Although I don't really know what half the fields do, so there could be something clever going on in there.

True, but the mysterious parts I see appear to be constants (I may have overlooked something, though, so feel free to point out any mysterious dynamic bits I'm ignoring). Those should be out-and-out harvestable from actual Bing toolbars. With constants, I strongly suspect that we only need to harvest valid data, rather than figure out how to generate our own.

Worse for Bing, my experience in removing viruses says that "people who install lots of toolbars" and "people who get viruses/malware/joined to a botnet" are sets that overlap to a significant degree, so it might become hard to separate actual user clickstream data from forged clickstream data when both are coming from the same infected computer.

Finally, I'm sorry I forgot to give you credit. If it's any consolation, I think I did give you credit last time I posted your link. Thank you for doing a better technical analysis than I've seen from anyone else on this story.


Finally, I'm sorry I forgot to give you credit.

Ah, that's no problem. If you'd copied it to your domain and linked it there I might have minded, but if you're posting a link to my site I consider that plenty of credit.


Google's hypothesis was "Does Bing use Google data in its rankings?"

I see. I would contend that that hypothesis is misleading at best, in bad faith at worst.


I think the problem wasn't so much the hypothesis, but the spin on the conclusions.


robots.txt excludes /search, not /Search - a big difference, as 99.99% of the returned results are .../Search*

MS's mistake in its robots.txt file, not Google's.


I agree with the author here. I think that Google will come away from this looking combative and childish.



