
The article suggests that the Bing toolbar monitors what its users click and uses that information to improve Bing search results. Is that what you have conclusively proved?

I'm interested in another experiment. If you set up a honeypot, search for the term, but never click on the link, does the honeypot start showing up in Bing? The article doesn't say whether you tried this. Did you try it? Is Bing scraping your results from the page, or only tracking its users' clicks?

Anyone can test that Microsoft's software sends the clicks back to Microsoft, although I believe Microsoft sends the data back over SSL, so it's harder to verify even that than you'd expect.

Google's search results are blocked in robots.txt, so I don't believe Bing has been able to crawl our search results directly. All the evidence points to users' clicks on Google, which are then sent to Microsoft.

Microsoft has (so far) declined to say whether our allegation is true. Getting them to talk about exactly what they do and what software they use or don't use would be the easiest way to settle this. I'd like them to confirm or deny, which is why I wanted to go to this search panel later today and ask them.

> so I don't believe Bing has been able to crawl our search results directly

Isn't compliance with robots.txt more of a voluntary thing?

I'm not accusing MS of ignoring it when convenient, but if you/we/someone is accusing them of acting unethically wrt search results in the first place, telling the crawler to ignore robots.txt wouldn't be that far away, would it? (And likewise faking the user-agent, etc.)

For better or for worse, UA identification, robots.txt compliance - all those things are voluntary. I'm not suggesting they shouldn't be, but it certainly makes a difference in terms of whether something's possible or not. (And, if you ask me, it places an even higher obligation on the actors to behave ethically, lest trust completely evaporate and the whole thing go to hell in a handbasket.)

I am not a lawyer, but as I understand it there is some precedent in the US of intentionally ignoring robots.txt being unauthorized computer access, exposing you to all the liability that entails (possibly criminal).

I'd like to see an actual case reference for this. I've never heard of ignoring robots.txt resulting in any kind of legal action.

It would take a pretty big leap to go from "robots.txt is advisory" to "ignoring it constitutes a criminal act."

Internet Archive was sued unsuccessfully. As I understand it a lawsuit is still in process against Google on the topic. So I guess the precedent is weaker than I thought, but still: tread carefully.

Matt, don't Google Toolbar and the Chrome Browser similarly send information to Google for use in improving their services?

If you read the article and other comments here it's been made perfectly clear that the Google toolbar and Chrome browser are not sending similar data back to Google.

Ah, at least the Google Toolbar does. If you enable PageRank on the Google Toolbar, it sends back all the URLs you visit, just like the Bing toolbar.

From the toolbar privacy policy: "Toolbar's enhanced features, such as PageRank and Sidewiki, operate by sending Google the addresses and other information about sites at the time you visit them."

Google has managed to demonstrate one way MS appears to be using the data. What does Google do with its trove of data? That's a lot of data to collect and not do anything with.

If they want to make it perfectly clear, they should add it to their privacy policies and EULAs.

Yes, absolutely. I don't think anyone in this thread or in the article denied that the Google Toolbar sends data to Google. And you are absolutely right that Google's use of the data collected should be clearly stated in a privacy policy and EULA. It might be; I haven't read them.

But the article clearly covers the available public statements on this issue and patio11 dug up a post from Matt Cutts in his comment below that directly addresses this: http://www.mattcutts.com/blog/toolbar-indexing-debunk-post/.

I did not say "similar data" because "similar" is a bit too slippery a word in a technical context. There's too much plausible deniability. What I am asking is whether Google's tools send data back to the Googleplex to be mined for the sake of search engine improvements.

Then what use is the word "similarly" in your comment? Similarly send? As in via HTTP requests? I think that's either obvious or irrelevant or both.

Again, if you actually read the article, you will come across the section titled "What About The Google Toolbar & Chrome?" I encourage you to read it.

[edit] Also, see this comment and patio11's subcomment further down the page, both of which were written an hour before yours: http://news.ycombinator.com/item?id=2165469#score_2165578.

Quote from the article: "In fact, Google stressed that the only information that flows back at all from Chrome is what people are searching for from within the browser, if they are using Google as their search engine."

I'm pretty positive that's not true. If you run Fiddler when browsing with Chrome you will see constant hits to toolbarqueries.clients.google.com whether you're using Google or not. I could be browsing some MS site and toolbarqueries.clients.google.com gets hit. Chromium doesn't do this.

Edit: You can uncheck everything under privacy and it will still send those requests.

Edit2: What it sends back looks something like this:

<?xml version="1.0" encoding="UTF-8"?><autofillquery clientversion="6.1.1715.1442/en (GGLL)"><form signature="8551191143090325242"><field signature="620769395"/><field signature="2995202485"/><field signature="2175865763"/><field signature="904516291"/><field signature="2953051246"/><field signature="2649047790"/><field signature="2308153337"/><field signature="1003471793"/><field signature="3255484099"/><field signature="1305698505"/><field signature="3676143819"/><field signature="1275502930"/></form></autofillquery>

Looks like auto-fill data, but this happens when I click around a site, NOT when searching Google or typing something in the address bar. For some sites (interestingly, not all) it sends 3 requests for each page load.
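For anyone who wants to look at a captured request body without reading the raw XML, here is a minimal Python sketch that pulls out the client version and the form and field signatures. The payload is the one quoted above, shortened to three fields; the function name `summarize_autofill_query` is made up for illustration, and nothing here says what the signatures mean on Google's side.

```python
import xml.etree.ElementTree as ET

# Request body as quoted in the comment above, shortened to three fields.
PAYLOAD = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<autofillquery clientversion="6.1.1715.1442/en (GGLL)">'
    b'<form signature="8551191143090325242">'
    b'<field signature="620769395"/>'
    b'<field signature="2995202485"/>'
    b'<field signature="2175865763"/>'
    b'</form></autofillquery>'
)


def summarize_autofill_query(payload):
    """Return (client_version, [(form_signature, [field_signatures])])."""
    root = ET.fromstring(payload)
    forms = []
    for form in root.findall("form"):
        field_sigs = [field.get("signature") for field in form.findall("field")]
        forms.append((form.get("signature"), field_sigs))
    return root.get("clientversion"), forms


client_version, forms = summarize_autofill_query(PAYLOAD)
print("client version:", client_version)
for form_sig, field_sigs in forms:
    print("form", form_sig, "with", len(field_sigs), "field signatures")
```

The only things in the body are numeric signatures and a client version string; there is no page URL and no typed text in the sample.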

That's troubling. I'd be very interested in seeing a response from Google about this. Are you aware of any? Also, can you use Fiddler to inspect the content of the requests? I'm not familiar with the tool.

I see this too, if I have autofill enabled, and at least one autofill address entry.

I would guess that Chrome is sending a hash of the <form> (perhaps URL + method?), plus a hash of each of the <input> tags, and Google returns some sort of information about what kind of form it is?

If so, it would mean it's pretty easy for Google to determine which sites you're on from the pattern of hashes sent for each site. e.g. I see this data sent in the clear for pretty much every page on https://www.facebook.com/
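To make that concern concrete, here is a rough Python sketch of how a pattern of hashes could identify a site. Everything in it is an assumption: the guess above that signatures are hashes of the form and its inputs, the use of CRC32 as the hash, and the two example forms and hostnames. It is not Chrome's actual algorithm.

```python
import zlib


def field_signature(name, input_type):
    # Assumption: a 32-bit hash over the field's name and type.
    return zlib.crc32(f"{name}&{input_type}".encode("utf-8"))


def form_signature(host, field_names):
    # Assumption: a hash over the form's host plus its field names.
    return zlib.crc32("&".join([host, *field_names]).encode("utf-8"))


def fingerprint(host, fields):
    """Combine the form hash and per-field hashes into one hashable tuple."""
    names = [name for name, _ in fields]
    field_sigs = tuple(field_signature(n, t) for n, t in fields)
    return (form_signature(host, names), field_sigs)


# Hypothetical forms an observer might have catalogued in advance.
KNOWN_FORMS = {
    "social.example": [("email", "text"), ("pass", "password")],
    "shop.example": [("username", "text"), ("password", "password"),
                     ("remember", "checkbox")],
}

# Lookup table from fingerprint back to the site it came from.
lookup = {fingerprint(host, fields): host for host, fields in KNOWN_FORMS.items()}

# A request containing only hashes still maps back to a known site.
observed = fingerprint("social.example", KNOWN_FORMS["social.example"])
print(lookup.get(observed))
```

The point is only that hashing does not hide much when the set of popular forms is small enough to precompute.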

Is this malicious site detection by any chance, or does that use a different mechanism?

>I believe Microsoft sends the data back by SSL, so it's harder to verify even that than you'd expect.

Please. Adding my own SSL cert to my own laptop is not harder than I'd expect. Certainly not harder than many other things you did in setting up this experiment.

Are you claiming that Google never scrapes Bing search results pages? Or any other search results pages?

poacher69, we crawl the public web. Anyone who blocks us out with robots.txt, we won't crawl. If you check bing.com/robots.txt, it has "Disallow: /search". So no, we won't crawl Bing's search results pages. If anything, users tend to complain when search results from Lycos or wherever show up in Google.

http://www.bing.com/robots.txt

    User-agent: *
    Disallow: /search

Funny thing: http://www.google.com/search?q=site%3Abing.com%2Fsearch%2F

I was gonna call out Matt for crawling Bing's search results, but I'm guessing Microsoft hasn't realized they return results from the /Search/ folder. ;)

Once again Microsoft is bitten by expecting case insensitivity.
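The case-sensitivity point is easy to check with Python's standard `urllib.robotparser`, which matches paths as case-sensitive prefixes. This is a small sketch using the two robots.txt lines quoted above; it shows how one compliant parser reads the rule, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted from bing.com/robots.txt in this thread.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /search",
])

# Lowercase path matches the Disallow prefix, so a compliant crawler skips it.
print(rules.can_fetch("Googlebot", "http://www.bing.com/search"))   # False

# Capitalized path does not match "/search", so it counts as allowed.
print(rules.can_fetch("Googlebot", "http://www.bing.com/Search/"))  # True
```

Note that `can_fetch` only answers the question; nothing stops a client from fetching the URL anyway, which is the "voluntary" part discussed earlier in the thread.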

Matt, how does Google do competitive relevance evaluations without scraping Bing?

In my experience, Googlebot doesn't crawl pages that are blocked in robots.txt files. Check out Bing's robots.txt: http://bing.com/robots.txt - notice how /search is disallowed. That typically means that Googlebot isn't able to access that page. The same goes for the other search engines: it comes down to whether they specify (through robots.txt) that Googlebot isn't allowed to crawl those results.
