
Nate Silver: How Much Does Bing Borrow From Google? - moultano
http://fivethirtyeight.blogs.nytimes.com/2011/02/02/how-much-does-bing-borrow-from-google/
======
JoelSutherland
I read this article because it was written by Nate Silver, so I expected
something substantial. There isn't anything.

At considerable length, he essentially says the following:

1\. Microsoft's defense contains no information because they don't say how
they weight the data learned from Google searches.

2\. It's hard to say how Microsoft is actually using the data learned from
Google searches.

For the crowd here, this is essentially common sense.

(edit: in point two I changed 'impossible' to 'hard')

~~~
noibl
Where are you getting point 2 from?

In many cases where they have no other data, they clearly weight Google's
single data point high enough to create a whole SERP around it. The claim that
'this is just one input among many' implies that some kind of cross-
referencing or corroboration takes place. In this experiment that obviously
didn't happen.

~~~
jhamburger
> they clearly weight Google's single data point high enough to create a whole
> SERP around it

This really doesn't tell us anything about how highly it is weighted relative
to other data though, since obviously there will be no other data to weigh in
on a nonsense search term.

~~~
brudgers
Google managed to inject "7 to 9" of their honeypots into Bing. In the other
91+ of 100 cases, there was apparently enough to outweigh the Google results
_on gibberish terms_, despite the efforts of 20 Google engineers between
December 17 and December 31.

~~~
stumm
What were the terms of the other 91 cases? You make it sound like the Google
engineers were slaving away for two weeks. My impression was that they ran the
experiment during that time.

~~~
brudgers
Google doesn't say how long they were working on it or anything at all about
the methods they used. Indeed, they don't even seem to be certain how many
honeypots they injected.

On the other hand, they issued twenty engineers laptops, which is more
consistent with an ongoing effort than a single afternoon of Bing and beer -
particularly when you consider that if there were a single method, known
beforehand, one engineer could have automated the whole exercise.

The inability to accurately count the honeypot injections is a bit odd. As
pure speculation, it may be that they were not sure whether the methods they
used to get Bing to show honeypots 8 and 9 were legitimate.

------
noibl
A common counter-argument seems to be 'But this effect only happens when there
are no other sources of relevance data'.

Well...

      'over the next few months we noticed that URLs from
       Google search results would later appear in Bing with
       increasing frequency for all kinds of queries:
       *popular queries*, rare or unusual queries and
       misspelled queries.'
          -- http://goo.gl/Bi0JH [googleblog]

Now it's possible that Bing really did have no data for these 'popular
queries', but I don't think that's an argument anyone would like to make. The
alternative interpretation is that top-ranked pages in Google get added
prominently to Bing after some interaction involving the toolbar, even when
there are lots of other possible matches.

You can still say that this would represent just a single data point among
many, but unless you think Google is lying, you can't say that it only happens
with rare searches.

BTW: I'm not arguing that there is direct evidence that the mentioned effect
on popular queries is related to the toolbar. But the effect is there.

~~~
kenjackson
The problem is that Google is doing the wrong test to prove their point.

They should search for Justin Bieber and add some really horrible link to the
Bieber search results.

MS should have a lot of relevant info for Justin Bieber, so the Google data
should be weighted lightly. If this nonsensical result makes the first page,
then you can begin to assume that they're weighting it heavily. But when your
query is "erftqnvpwedf" -- that just means that the only relevant info is
coming from the toolbar. Bing would show that result even if the toolbar only
accounted for 1/(2^50) of the total relevance.
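
To see why, here's a minimal sketch of a weighted-sum ranker, assuming
invented signal names and weights (an illustration, not Bing's actual
system):

    # Hypothetical weighted-sum ranker; signal names and weights
    # are illustrative assumptions, not Bing's actual system.
    WEIGHTS = {
        "link_graph":   0.6,
        "anchor_text":  0.4,
        "click_stream": 2 ** -50,  # toolbar signal, nearly zero weight
    }

    def score(signals):
        # signals: signal name -> strength in [0, 1] for one document
        return sum(w * signals.get(name, 0.0)
                   for name, w in WEIGHTS.items())

    # For "erftqnvpwedf" the toolbar is the only nonzero signal,
    # so the honeypot page wins by default:
    honeypot = {"click_stream": 1.0}
    any_other_page = {}
    assert score(honeypot) > score(any_other_page)

Every competing score is exactly zero, so even a weight of 1/(2^50) puts the
honeypot on top.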

I suspect Google did this test and has nothing to report.

~~~
barrkel
The test you suggest doesn't address the actual problem that Google is
complaining about - long-tail, rare searches, especially ones that contain
misspellings.

This is the very thing that Google is saying that Bing is stealing with their
clickstream data, and it's also the one that you're agreeing would occur -
"that the only relevant info is coming from the toolbar" for highly unlikely
queries.

So it seems to me, looking at your argument, that you actually _agree_ with
Google's point.

~~~
kenjackson
My comment is in response to a comment about "popular queries", and really to
the meta issue of how much the Google data is weighted in general.

Google is trying to score a huge PR point by saying "Bing copies," with the
ramification being that when you search on Bing you're really searching on
Google. Suppose Google had instead said, "For extremely rare searches where
Bing has few, if any, good signals, clickthroughs from Google will be weighted
in such a way that these results may make the first page."

I'd buy that 100%. The Bing team may even buy it. Google seems to want to
start with that thesis, but then try to shove the whole camel in too.

------
tzs
A better Italian-restaurant analogy than his would be something like this. You
are standing on the street, watching people go into businesses. You observe a
man and woman walk up, and the woman asks, "Which of these restaurants should
we dine in?" The man pulls out a well-known restaurant guide and clearly looks
up both. Then they walk into Mario Batali's restaurant.

Later, someone comes up to you and asks which is the best Italian restaurant
on that street. You recall that the couple with the restaurant guide picked
Batali's, and so you point them to it.

Google would say you are ripping off the restaurant guide. Microsoft would say
you are just using the observed behavior of the first couple to guide you.

~~~
noibl
Disclosure: you are also a restaurant guide.

~~~
burgerbrain
Fuller disclosure: you are in direct competition with the first restaurant
guide and have released public statements explaining why you are better than
the first.

------
tmoertel
Here's the problem I have with Google's claims about Bing: Google suggests
that Bing is cheating but hasn't demonstrated that Bing's behavior somehow
deviates from what is optimal in a machine-learning sense. Thus, the real
question is whether Bing should cripple its machine-learning algorithms if
they infer (correctly) that Google-suggested results are likely to be
relevant.

That is, if any learning system is observing click-stream behavior from users
and mining it for relevance evidence, I'd expect it to ultimately home in on
the true weight of each piece of evidence in that click-stream data. Since
Google's contributions to that data are likely to be highly relevant, any good
machine learning system is going to _learn_ that they are relevant and start
recommending them.
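
As a minimal sketch of what that homing-in might look like, assuming a toy
online logistic-regression learner (the feature names and update rule are my
own illustration, not anything Bing has disclosed):

    # Toy online learner; feature names and update rule are
    # illustrative assumptions, not Bing's disclosed method.
    import math

    weights = {"google_serp_click": 0.0, "link_graph": 0.0}

    def predict(features):
        z = sum(weights[f] * v for f, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))  # P(result satisfies user)

    def update(features, satisfied, lr=0.1):
        # Gradient step: evidence that reliably precedes satisfied
        # clicks accumulates weight; evidence that doesn't loses it.
        err = satisfied - predict(features)
        for f, v in features.items():
            weights[f] += lr * err * v

    # If Google-derived clicks keep predicting satisfaction, the
    # learner raises that feature's weight toward its true value.
    for _ in range(1000):
        update({"google_serp_click": 1.0, "link_graph": 0.2}, satisfied=1)

Nothing in such a system singles Google out; the weight on Google-derived
evidence is just whatever the click data earns it.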

In effect, then, what Google is arguing is that if Bing's machine-learning
algorithms are correctly inferring that results that happen to come from
Google are highly relevant, Bing should blind itself to that knowledge.

I'm not sure that's good for anyone except Google.

(For a completely uninformed guess about why Google might be interested in
raising copycat claims about Bing, I go out on a limb here:
[http://blog.moertel.com/articles/2011/02/02/the-google-
micro...](http://blog.moertel.com/articles/2011/02/02/the-google-microsoft-
squabble-over-bing-results-some-completely-uninformed-speculation))

~~~
ars
But that means that you have not created an actual search program. You are
just using Google's data.

Or to put it another way: if Google did not exist, would you still have a
(good) search program? If the answer is no, then there is no reason for your
program to exist.

~~~
tmoertel
> But that means that you have not created an actual search program. You are
> just using Google's data.

No, if your learning system is working properly, you should be using Google's
data only to the extent that it is legitimately observable and more relevant
than anything else you're feeding your system. And for lots of searches,
Google's data leave much room for other sources to be more relevant. In the
limiting case, when you're feeding your system everything that Google is
feeding its own, you should almost never return Google-derived knowledge,
because your system should almost always be able to come up with greater
relevance from knowledge derived from primary sources.

Your final question is on the right track, but I'd suggest a small tweak: if
Google didn't exist, would the system still offer highly relevant results?
And if Google did exist, would the system be able to learn from
Google-supplied knowledge, to the extent allowed by law and terms of service
and so forth, to offer results at least as relevant?

------
xenocom
Well, from what I read elsewhere, Microsoft is saying that Google admitted to
using the Bing toolbar during their testing, and that feeds into the Bing
results, since the toolbar associates pages visited soon afterward with the
queries that were used... actually kind of a smooth way to harness user
interaction to power search relevance, if you ask me.

------
msravi
This is like hitting an unsolvably hard problem on a test, one whose answer
should be "None of the above". You peek at your neighbor (who has a reputation
for scoring well), and on realizing that his answer is different, you change
yours to match his. Plain lack of confidence in your own ability.

~~~
contextfree
a test is a controlled experiment. it's nothing like product development.
analogies between the two make no sense.

------
trustfundbaby
I still don't understand how this is not cheating.

Microsoft has lifted Google's SERPs _verbatim_ for a particular search ...
what else is there to discuss?

------
erikpukinskis
The problem is that Google's "experiment" didn't include a control group. Is
Bing copying Google's results, or just measuring click data?

The trivial addition to their experiment would be to add those nonsense words
to a few Wikipedia pages, click the links with the Bing toolbar installed, and
see if those pages show up in the Bing results.

~~~
barrkel
Measuring click data that gets turned into a correlation between the query
string and the result on Google's SERP _is_ copying Google's result, because
that is precisely its effect in these rare long-tail searches.

~~~
erikpukinskis
If the toolbar is simply sending "they clicked link X from a page with text 'A
B C'", and Bing is using that in its search engine to mean "if someone
searches for A, B, or C, they might be looking for X", then that is, in fact,
NOT copying Google's results; it is a general measurement of how users react
to ALL web pages.
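
For concreteness, here is roughly the pipeline I'm describing, assuming
hypothetical record fields (Microsoft hasn't published the toolbar's actual
schema):

    # Hypothetical toolbar click record; nothing in it is
    # Google-specific, so the same record is produced for a
    # click on any page, Google SERP or otherwise.
    from collections import defaultdict

    click_event = {
        "page_text": ["A", "B", "C"],          # terms seen on the page
        "clicked_url": "http://example.com/X",
    }

    # Index it as "users who saw these terms chose this URL":
    term_to_urls = defaultdict(set)
    for term in click_event["page_text"]:
        term_to_urls[term].add(click_event["clicked_url"])

    # A later search for "A" can now surface X as a candidate result.
    assert "http://example.com/X" in term_to_urls["A"]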

