Nate Silver: How Much Does Bing Borrow From Google? (nytimes.com)
91 points by moultano on Feb 2, 2011 | 49 comments


I read this article because it was written by Nate Silver, so I expected something substantial. There isn't much here.

In a great many words, he essentially says the following:

1. Microsoft's defense contains no information because they don't say how they weight the data learned from Google searches.

2. It's hard to say how Microsoft is actually using the data learned from Google searches.

For the crowd here this is essentially common-sense.

(edit: in point two I changed 'impossible' to 'hard')


This is classic Nate Silver. He will happily write a 1000 word essay explaining why, in great detail, with very high certainty, and despite the fact that pundits across the world are flapping their jaws about it, there is simply nothing to say about a topic.


That's the sort of pundit we need more of. Pretty much everything I've read on this story, from Google, from Bing, from various commenters on HN, has been pretty nonsensical. I'll take 1000 words of "hold your horses" over 100 words of nonsense any day of the week.

Though I admit I may only read the first couple hundred and skim about halfway through before closing the tab.


Where are you getting point 2 from?

In many cases where they have no other data, they clearly weight Google's single data point high enough to create a whole SERP around it. The claim that 'this is just one input among many' implies that some kind of cross-referencing or corroboration takes place. In this experiment that obviously didn't happen.


> they clearly weight Google's single data point high enough to create a whole SERP around it

This really doesn't tell us anything about how highly it is weighted relative to other data though, since obviously there will be no other data to weigh in on a nonsense search term.


Google managed to inject "7 to 9" of their honeypots into Bing. In the other 91+ out of 100 cases, there was apparently enough to outweigh the Google results on gibberish terms, despite the efforts of 20 Google engineers between December 17 and December 31.


What were the terms of the other 91 cases? You make it sound like the Google engineers were slaving away for two weeks. My impression was that they ran the experiment during that time.


Google doesn't say how long they were working on it or anything at all about the methods they used. Indeed, they don't even seem to be certain how many honeypots they injected.

On the other hand, they issued twenty engineers laptops, which is more consistent with an ongoing effort than a single afternoon of Bing and beer - particularly when you consider that if there was a single method applied and it was known beforehand, one engineer could have automated the whole exercise.

The inability to accurately measure honeypot injections is a bit odd. As pure speculation, it may be that they were not sure if the methods they used to get Bing to show honeypots number 8 and 9 were legitimate.


While we are all speculating on certain bits of what happened in this she said/he said, I would presume that in many or most of those 91 cases what happened wasn't that other inputs outweighed Google's, but rather that no user running the Bing toolbar (and who had it set to the phone-home defaults) took the action that would create the Google-related input to begin with.


My point was that the algorithmic conclusion 'oh well, let's just print it anyway, since it's from Google' is pretty weighty on its own.


While we have no idea why this result is good, Google seems to think it's fine. If it's good enough for Google, it's good enough for us.


"Since it's from Google" is an unproven assertion. They may do this for other data sources as well in similar circumstances.


I didn't assert that they don't. I asserted that they only had one data source in that case.

Whether they throw up new SERPs in reaction to data from other single sources might be interesting. But if true, it's also not very complimentary.


From the article's conclusion: "How much value Microsoft’s engineers are ultimately adding is hard to say. Both they and Google are extremely circumspect about revealing any detail about their algorithms."

It's common sense that they are using data from Google. That has pretty much been proven conclusively. Nate Silver telling us this again is not interesting; the possibility that he might add something new was.


So you paraphrase 'hard' as 'impossible'.

I actually think Silver makes a useful contribution, mainly because of his background and gift for illustrating data perception problems. If you're looking for hard facts and (more) smoking guns then you would naturally be disappointed.


Of course search engines don't like to reveal algorithms and weighting details; it would be a gold mine for rivals and SEO riggers. Why should Bing release such critical information to Google over 8 out of 100 search queries?


The key question seems to be: if n is the information content in the page (i.e. the number of bits needed to distinguish it from a random search engine results page), what fraction of n would not be there without Google's ranking algorithm?
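
Writing that out (my own rough notation, nothing from the article): if n_0 is the number of bits Bing could have produced from its non-Google signals alone, then the Google-attributable fraction is

  f = (n - n_0) / n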

In the case of the honeypot pages, clearly the number is 100%.

In the cases of other results pages, we merely have Google's claim that the similarity has been rising over time, and that they attribute it to info extracted from Google.

Using information that people provide about where they click (hopefully with some kind of informed consent) in order to improve your algorithm seems reasonable enough. If a large part of the info ended up coming from Google's algorithm, it starts to seem sketchy, and it's legitimate for Google to whinge.

Not that whinging ever stopped any other company, particularly Microsoft, from trying to take advantage of other companies' IP.


A common counter-argument seems to be 'But this effect only happens when there are no other sources of relevance data'.

Well...

  'over the next few months we noticed that URLs from 
   Google search results would later appear in Bing with 
   increasing frequency for all kinds of queries: 
   *popular queries*, rare or unusual queries and 
   misspelled queries.'
      -- http://goo.gl/Bi0JH [googleblog]
Now it's possible that Bing really did have no data for these 'popular queries' but I don't think that's an argument anyone would like to make. The alternative interpretation is that top-ranked pages in Google get added prominently to Bing after some interaction involving the toolbar, even when there are lots of other possible matches.

You can still say that this would represent just a single data-point among many but unless you think Google is lying, you can't say that it only happens with rare searches.

BTW: I'm not arguing that there is direct evidence that the mentioned effect on popular queries is related to the toolbar. But the effect is there.


The problem is that Google is doing the wrong test to prove their point.

They should search for Justin Bieber and add some really horrible link into the Bieber search results.

MS should have a lot of relevant info for Justin Bieber, so the Google data should be rated lowly. If this nonsensical result makes the first page then you can begin to assume that they're weighting it heavily. But when your query is "erftqnvpwedf" -- that just means that the only relevant info is coming from the toolbar. Bing would show that result even if the toolbar only accounted for 1/(2^50) of the total relevance.
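
As a toy sketch of that arithmetic (the signal names and weights are invented; this is not Bing's actual ranking code):

  # A toy linear ranker: score = sum(weight * signal) for each candidate URL.
  def score(signals, weights):
      return sum(weights[name] * value for name, value in signals.items())

  # Hypothetical weights; the clickstream weight is deliberately minuscule.
  weights = {"links": 1.0, "anchor_text": 1.0, "clickstream": 2 ** -50}

  # Popular query: strong native signals dwarf the planted clickstream signal.
  real = score({"links": 0.9, "anchor_text": 0.8, "clickstream": 0.0}, weights)
  planted = score({"links": 0.0, "anchor_text": 0.0, "clickstream": 1.0}, weights)
  print(real > planted)   # True: the planted link loses

  # Gibberish query ("erftqnvpwedf"): the clickstream signal is the only
  # non-zero input, so the planted result wins despite its tiny weight.
  planted_only = score({"links": 0.0, "anchor_text": 0.0, "clickstream": 1.0}, weights)
  anything_else = score({"links": 0.0, "anchor_text": 0.0, "clickstream": 0.0}, weights)
  print(planted_only > anything_else)   # True, even at weight 2**-50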

I suspect Google did this test and has nothing to report.


The test you suggest doesn't address the actual problem that Google is complaining about - long-tail, rare searches, especially ones that contain misspellings.

This is the very thing that Google is saying that Bing is stealing with their clickstream data, and it's also the one that you're agreeing would occur - "that the only relevant info is coming from the toolbar" for highly unlikely queries.

So it seems to me, looking at your argument, that you actually agree with Google's point.


My comment is in response to a comment about "popular queries", and really to the meta issue of how much it is weighted in general.

Google is trying to argue a huge PR point by saying, "Bing copies" with the ramification being that when you search on Bing you're really searching on Google. If Google said this, "For extremely rare searches where Bing has few, if any, good signals, clickthroughs from Google will be weighed in such a way that these results may make the first page".

I'd buy that 100%. The Bing team may even buy it. Google seems to want to start with that thesis, but then try to shove the whole camel in too.


They're talking about their top-ranking result being used. Are you suggesting that Google place garbage as the top-ranking Bieber result for visitors who use the toolbar?


No. Just for visitors that are coming from a specific IP address (their engineers). The only people that would see the bad results are their engineers. But the URL and data that the toolbar gets looks just like any other one.

This is pretty straightforward and I'd be surprised if Google didn't have the infrastructure in place to do this today (like literally today).


But then their bad results would be outweighed by the hundreds or thousands of other Bing toolbar users doing Bieber searches getting legitimate results fed back to Bing.


I address this later (although probably after you started writing this comment). :-)


OK, I understand what you mean now. :)


And to be clear (if anyone from Google is reading) you don't just want to do 'Justin Bieber' since I suspect a lot of people with the toolbar actually do that search.

Also do things like 'radix-2 fft', something where there are a lot of additional signals (I suspect), yet something that other toolbar users probably aren't searching for. So the toolbar data MS gets is strongly skewed to your results.


But think about that claim: a result that shows up in Google's results for a popular query also shows up in Bing's results for a popular query.

Maybe because it's a good result? Possibly Bing arrived at it the same way Google did? Or has Google been slipping paper streets into their queries for longer than they're telling us?

In short: how did they know that it was "URLs from Google search results" that they were seeing in the Bing results?


"Even search results that we would consider mistakes of our algorithms started showing up on Bing."

If it was a popular query, and both google and bing made the same "mistake", doesn't that seem a tad suspicious to you?


There are multiple mechanisms by which Google's rankings will diffuse out to other services.

For example, lots of SEO and normal research studies the top Google results, or keywords revealed from referrer info and other logs, and then mimics those results elsewhere on the web, including outlinking to top results found via Google.

So anything prominent in Google's results will be echoed elsewhere in crawlable form, and probably associated with the same terms by which it was found at Google. Then those once-unique results will 'later appear' in other resources.

You can't be the dominating giant of the industry – the "single point labeled G connected to 10 billion destination pages" (as Blekko's Skrenta calls Google) – and not have this leakage happen. It goes with the territory.


Rankings analysis always starts with keywords of interest, even though Google can be used to identify keywords by relation. Referrer logs seeding outside sources with gibberish keywords that literally nobody is interested in? That's a stretch.


I'm not saying this is specifically how the gibberish Google probes landed in Bing; I'm saying this is a way that, even for popular queries, Google's ranking judgements and unique results are going to slowly diffuse out over time, and be echoed elsewhere, even without URL-trail mining (which shortcuts/supercharges the process).


Fair enough. I agree that naive indexers will make lots of crude and meaningless associations. I don't think Bing falls into that category though.


A better analogy using Italian restaurants than his would be something like this: you are standing on the street, watching people go into businesses. You observe a man and a woman walk up, and the woman says, "Which of these restaurants should we dine in?" The man pulls out a well-known restaurant guide and clearly looks up both. Then they walk into Mario Batali's restaurant.

Later, someone comes up to you and asks which is the best Italian restaurant on that street. You recall that the couple with the restaurant guide picked Batali's, and so you point them to it.

Google would say you are ripping off the restaurant guide. Microsoft would say you are just using the observed behavior of the first couple to guide you.


And you don't bother to consider the other 999 aspects that are screaming at you to look up and see that the restaurant doesn't even exist.


Disclosure: you are also a restaurant guide.


Fuller disclosure: you are in direct competition with the first restaurant guide and have released public statements explaining why you are better than the first.


Here's the problem I have with Google's claims about Bing: Google suggests that Bing is cheating but hasn't demonstrated that Bing's behavior somehow deviates from what is optimal in a machine-learning sense. Thus, the real question is whether Bing should cripple its machine-learning algorithms if they infer (correctly) that Google-suggested results are likely to be relevant.

That is, if any learning system is observing click-stream behavior from users and mining it for relevance evidence, I'd expect it to ultimately home in on the true weight of each piece of evidence in that click-stream data. Since Google's contributions to that data are likely to be highly relevant, any good machine learning system is going to learn that they are relevant and start recommending them.
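
As a minimal sketch of that "homing in", under loudly made-up assumptions (synthetic click data, invented feature names, and plain logistic regression rather than whatever Bing actually runs):

  import math, random

  # Each example is a (query, url) pair: a bias term, whether the url was seen
  # being chosen on a Google results page (via the clickstream), two generic
  # relevance signals, and whether the user ultimately clicked it.
  random.seed(0)
  data = []
  for _ in range(2000):
      from_google = 1.0 if random.random() < 0.2 else 0.0
      link, anchor = random.random(), random.random()
      # Synthetic "truth": Google-suggested urls really do get clicked more.
      p = 1 / (1 + math.exp(-(2.0 * from_google + 1.0 * link + 0.5 * anchor - 1.5)))
      data.append(([1.0, from_google, link, anchor], 1 if random.random() < p else 0))

  # Fit logistic-regression weights by stochastic gradient ascent.
  w = [0.0, 0.0, 0.0, 0.0]
  for _ in range(50):
      for x, y in data:
          p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
          w = [wi + 0.05 * (y - p) * xi for wi, xi in zip(w, x)]

  print(w)  # the "from_google" weight comes out large and positive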

In effect, then, what Google is arguing is that if Bing's machine-learning algorithms are correctly inferring that results that happen to come from Google are highly relevant, Bing should blind itself to that knowledge.

I'm not sure that's good for anyone except Google.

(For a completely uninformed guess about why Google might be interested in raising copycat claims about Bing, I go out on a limb here: http://blog.moertel.com/articles/2011/02/02/the-google-micro...)


But that means that you have not created an actual search program. You are just using Google's data.

Or to put it another way: if Google did not exist, would you still have a (good) search program? If the answer is no, then there is no reason for them to exist.


> But that means that you have not created an actual search program. You are just using Google's data.

No, if your learning system is working properly you should be using Google's data only to the extent that it is legitimately observable and more relevant than anything else you're feeding your system. And, for lots of searches, Google's data leave much room for other sources to be more relevant. In the limiting case, when you're feeding your system everything that Google is feeding its own, you should almost never return Google-derived knowledge, because your system should almost always be able to come up with greater relevance from knowledge derived from primary sources.

Your final question is on the right track, but I'd suggest a small tweak: If Google didn't exist, would the system still offer highly relevant results and, if Google did exist, would the system be able to learn from Google-supplied knowledge, to the extent allowed by law and terms of service and so forth, to offer results at least as relevant?


Well, from what I read elsewhere, Microsoft is saying that Google admitted to using the Bing toolbar during their testing, and that plays into the Bing results as it associates soon-after visited pages with the queries they used... actually kind of a smooth way to harness user interaction to power search relevance, if you ask me.


This is like trying to find the answer to an unsolvably hard problem on a test, where the answer should be "None of the above". Then you peek at your neighbor (who has a reputation for scoring well) and, on realizing that his answer is different, change yours to match his. A plain lack of confidence in your own ability.


A test is a controlled experiment. It's nothing like product development. Analogies between the two make no sense.


I still don't understand how this is not cheating.

Microsoft has lifted Google's SERPs verbatim for a particular search ... what else is there to discuss?


The problem is that Google's "experiment" didn't include a control group. Is Bing copying Google's results, or just measuring click data?

The trivial addition to their experiment would be to add those nonsense words to a few Wikipedia pages, and click the links with the Bing toolbar installed, and see if those pages show up in the Bing results.


Measuring click data that gets turned into a correlation between the query string and the result on Google's SERP is copying Google's result, because that's the upshot of it - that is its effect, in these rare long-tail searches.


If the toolbar is simply sending "they clicked link X from a page with text 'A B C'" and they are using that in their search engine to mean "If someone searches for A, B, or C, they might be looking for X" then that is, in fact, NOT copying Google's results, it is a general measurement of how users react to ALL web pages.
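
For what it's worth, here is a crude sketch of that reading of the toolbar data (the record format and URLs are invented; this is not Microsoft's documented pipeline):

  from collections import defaultdict
  from urllib.parse import urlparse, parse_qs

  # Hypothetical toolbar events: the page the user was on and the link they
  # clicked. Nothing is Google-specific; any site that puts the query in its
  # URL feeds the same table.
  events = [
      {"page": "https://www.google.com/search?q=erftqnvpwedf",
       "clicked": "http://example.com/planted-page"},
      {"page": "https://shop.example.com/search?q=red+shoes",
       "clicked": "http://shoes.example.org/red"},
  ]

  # term -> clicked url -> click count
  associations = defaultdict(lambda: defaultdict(int))
  for event in events:
      query = parse_qs(urlparse(event["page"]).query).get("q", [""])[0]
      for term in query.lower().split():
          associations[term][event["clicked"]] += 1

  print(dict(associations["erftqnvpwedf"]))  # {'http://example.com/planted-page': 1}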


You have very little actual knowledge of what their experiments entailed.


What are you talking about? They wrote an entire blog post about what their experiments entailed.



