
What on earth are Google thinking? (re: Binggate) - user24
http://www.puremango.co.uk/2011/02/what-on-earth-are-google-doing/
======
haberman
> At first I thought there could be three possible explanations to Google’s
> handling of this situation:

I think he forgot option 5: whether Bing intentionally copied Google's search
results or not, if Bing continues this practice then their index will contain
a _de facto_ copy of Google's index for tricky queries. Bing will inherit
Google's spelling correction, its long tail, its top results, and any other
enhancements Google develops to return more relevant results.

Google invests so much into these things; if Bing absorbs these improvements
without lifting a finger, Google loses its ability to stay ahead through
technical merit and innovation.

I think Amit is telling the truth: "And to those who have asked what we want
out of all this, the answer is simple: we'd like for this practice to stop."
[http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-...](http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html)

~~~
ecopoesis
It's nice that Google would like Microsoft to stop copying them. Everyone else
who generates unique content (news site articles, TripAdvisor reviews) would
like Google to stop copying their content into Google News and Google Places.

Apparently Google only cares about sites copying other sites' content when
it's their own stuff getting copied.

~~~
joe_the_user
Any site that wants Google to stop copying their content to Google news can
ask Google news to stop doing so. It's very simple.

~~~
pyre
The real issue is that they don't want their content in Google News, but they
also don't want to kill the golden goose that is directing traffic to their
site. I.e. Google News is generating traffic for them, but they don't want the
content to be on Google News either. They want to have their cake and eat it
too.

edit: 'they' here refers to the old media newspapers, more specifically to Mr.
Murdoch.

------
moultano
If it were just click data, how would they get the terms?

They're either parsing the query out of the url, or violating robots.txt to
fetch the result page, almost certainly the former. This seems like a pretty
clear indication that they've special-cased clicks from google. It's
theoretically plausible that they are treating all query parameters the same
for all sites, but very unlikely given how much noise that would introduce
into their results. Even so, they would have to know that most clicks with
meaningful query parameters come from Google. This isn't something that's
going to happen by accident.
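
For illustration, the "parse the query out of the url" option could look roughly like the sketch below. This is only a guess at the general technique: the function name `query_from_url` and the default parameter name `q` are assumptions made for the example, and nothing here is known about what the Bing toolbar actually does.

```python
from urllib.parse import urlsplit, parse_qs


def query_from_url(url, param="q"):
    """Return the first value of one query-string parameter, or None if absent."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


# A search result page URL carries the user's terms in the query string.
print(query_from_url("http://www.google.com/search?q=binggate"))
# A typical non-search URL has no such parameter.
print(query_from_url("http://news.ycombinator.com/item?id=2169690"))
```

Note that the caller has to know which parameter holds the terms, which is why treating every site's query parameters the same way would be noisy.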

~~~
user24
> parsing the query out of the url

Or parsing any words found in the URL. Not unreasonable to believe.

~~~
earl
Actually, unless I'm mistaken, parsing the URL may well yield good search
terms, but the query string is much less likely. Look at eg HN -- you'd parse
out id = some long number. I may be wrong, but I don't recall many sites that
would have good information for a search engine in that string.

What seems far more likely is bing is using click tracking on G's results.
This was explicitly _not_ denied by their VP on the search engine panel today.
If not many sites except search engines have useful keywords in the query
string, that pretty much validates Google's complaint.

In fact, if you go to Google's blogpost [1], the bing toolbar specifically
calls out monitoring "the searches you do, the websites you visit, [...]" [1].
And the MS guy doesn't deny using clicks on G's search results [2]. In fact,
he pretty much just says they copy G on long tail searches.

[1] [http://searchengineland.com/google-bing-is-cheating-copying-...](http://searchengineland.com/google-bing-is-cheating-copying-our-search-results-62914)

[2]
[http://www.bing.com/community/site_blogs/b/search/archive/20...](http://www.bing.com/community/site_blogs/b/search/archive/2011/02/01/thoughts-on-search-quality.aspx)

~~~
user24
> I don't recall many sites that would have good information for a search
> engine in that string.

Well, Google is one such site - as in, if you have bing toolbar installed and
are on google.com/search?q=keyword and click on example.com, then Bing can
easily extract "google com search q keyword" and associate it with example.com
- without anything explicitly or intentionally relating to Google in their
code.

They may also be looking at referrer info.
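
A minimal sketch of that kind of Google-agnostic word extraction, assuming the simplest possible tokenizer (the function name `url_words` and the "split on anything non-alphanumeric" rule are made up for the example, not taken from any real toolbar):

```python
import re


def url_words(url):
    """Split a URL into lowercase alphanumeric tokens, ignoring the scheme."""
    without_scheme = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    return re.findall(r"[a-z0-9]+", without_scheme)


# The page the user was on when they clicked through to example.com.
print(url_words("http://google.com/search?q=keyword"))
# ['google', 'com', 'search', 'q', 'keyword']
```

Nothing in this code mentions Google, yet the search term still ends up in the token list next to the clicked site.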

------
neutronicus
Google's outrage seems silly to me. They made a fortune harvesting _other
people's_ judgments on relevant webpages. That's what PageRank is.

So Microsoft takes their search results - it's all in the game! And Google
wrote the rules!

~~~
luigi
But this is a matter of core functionality. Bing should not be using
clickstream data from Google searches as a signal. The core offering of Google
search is getting used by its competitor, without any citation. Yes, it's a
side effect of the Bing bar's clickstream tracking, but it's still an
observable, quantifiable effect.

It’s like if Developer A takes Developer B's code and passes it off as his own
work without any citation. Does it matter if it was malicious or incidental?
Not really, it's still wrong. And if Developer A gets caught doing it, he
should own up to his mistake and fix it.

~~~
neutronicus
The way I see it, it's like if B builds entire applications by pasting
snippets from Stack Overflow without citation, and then pitches a holy fit
upon realizing that one of his/her amalgamations of Stack Overflow snippets
has in turn been cut and pasted.

The boundary just seems arbitrary to me, which makes the whole thing seem
hypocritical.

------
contextfree
OK, out of curiosity I actually tried installing the Bing bar (on IE9 beta,
which, btw, makes you manually activate add-ins after they're installed). The
installer for the add-in itself is pretty upfront about it sending your click
data and stuff - it's right there on the one and only options page, next to
one of three checkboxes - though the box is checked by default, which I think
is dubious.

I haven't been able to influence the Bing search results (no surprise there,
since I've only spent a few minutes on it and not weeks like the Google folks)
but one thing I did find _very_ interesting was that if you search for
something on another site, the Bing bar actually lights up and populates its
own search field with your query so that with another click you can search for
it on Bing.

As far as I can tell, the bar doesn't seem to be using any heuristic to tell
what is a search query but just has a list of sites/URL patterns it knows
about. Besides Google and Bing itself, these include Wikipedia, Yahoo,
Ask.com, Amazon, Facebook, eBay, YouTube, MSDN and IMDB, but not Twitter or
DuckDuckGo. This doesn't prove anything about how it's feeding search results
of course but it does at least suggest that the Bing bar is very interested
specifically in search queries, though not only on Google.
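
To make the "list of sites/URL patterns" idea concrete, here is a rough sketch of what such a lookup could look like. The table contents, the parameter names and the function name `known_site_query` are all hypothetical; this only shows the shape of the approach, not the Bing bar's real list.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical table: host suffix -> query-string parameter holding the terms.
KNOWN_SEARCH_SITES = {
    "google.com": "q",
    "bing.com": "q",
    "search.yahoo.com": "p",
}


def known_site_query(url):
    """Return the search terms if the URL belongs to a listed site, else None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    for site, param in KNOWN_SEARCH_SITES.items():
        # Match the listed host itself or any subdomain of it.
        if host == site or host.endswith("." + site):
            values = parse_qs(parts.query).get(param)
            return values[0] if values else None
    return None


print(known_site_query("http://www.google.com/search?q=binggate"))
# A site that is not in the table is ignored, even with an obvious ?q= parameter.
print(known_site_query("http://duckduckgo.com/?q=binggate"))
```

With a table like this, a site is either explicitly listed or invisible, which would match the Twitter/DuckDuckGo behaviour described above.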

------
zacharypinter
I think the key to all of this lies in the code for the bing toolbar (and the
code that parses its data).

If I search twitter for "binggate" then click on a link, the referrer would
be:

<http://search.twitter.com/search?q=Binggate>

It wouldn't be hard to write a generic parser to detect URLs that look like
search queries, and it'd be a novel way to gain a lot of information from
private indexes (stack overflow, twitter, lucene/solr setups, reddit, etc).

If this is what Bing is doing, then kudos to them for clever thinking, no foul
play.

However, if the toolbar has any logic to specifically parse Google's URL
syntax, or if they're filtering and correlating google.com URLs against their
own algorithms, then it's copying and foul play.

The surprising thing here is how much publicity Google's giving the issue
knowing that Microsoft could shoot them down in no time with a few code
snippets (if there's no foul play going on).
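
As a rough sketch of the "generic parser" case, assuming a simple heuristic based on common parameter names (the set `LIKELY_QUERY_PARAMS` and the function name `guess_search_terms` are guesses for the sake of the example):

```python
from urllib.parse import urlsplit, parse_qsl

# Guessed list of parameter names that often carry search terms.
LIKELY_QUERY_PARAMS = {"q", "query", "search", "s", "p"}


def guess_search_terms(url):
    """Return the first value that looks like free-text search terms, or None."""
    for name, value in parse_qsl(urlsplit(url).query):
        # Skip purely numeric values, which are more likely IDs or page numbers.
        if name.lower() in LIKELY_QUERY_PARAMS and not value.isdigit():
            return value
    return None


print(guess_search_terms("http://search.twitter.com/search?q=Binggate"))
print(guess_search_terms("http://news.ycombinator.com/item?id=2169690"))
```

A parser like this contains no site names at all, so it would treat Twitter, reddit and Google referrers identically.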

~~~
drivebyacct2
> It wouldn't be hard to write a generic parser to detect URLs that look like
> search queries

vs

> However, if the toolbar has any logic to specifically parse Google's URL
> syntax

So they're either generically copying from ANY search engine, or they're
specifically copying from Google. Either way, the perception and actual
outcome is the same. I'm not sure either is inherently good or bad, but I'm
surprised that there is such a delineation in people's minds.

~~~
zacharypinter
Interesting.

Take a site-specific search index like the one that reddit uses. Such an index
can prioritize by votes, comments, users, fine-tuned spam filters, and so on.
A search engine specifically optimized for reddit has a lot more information
available when returning links than a generic web crawler ever could.

If Bing came up with a generic way to leverage site-specific search engines
and help drive traffic to those sites, is that still perceived with negative
connotations?

------
aaronsw
This piece is so bogus. If, as he suggests, Google's query data is
accidentally getting caught up in some larger Bing project, then why doesn't
Bing just say that in their numerous posts and tweets about the scandal?
Instead, they've just thrown mud at Google.

------
ig1
I think the biggest question here is, if a site's robots.txt file prohibits bots
from gathering data, should that also prohibit bots that piggyback on real
user sessions?

~~~
pbhjpbhj
>if a site robots.txt file prohibits bots from gathering data,

Pages in a robots.txt file still get put into Google's index; they just don't
get parsed. Surprised me to find that out:
<http://www.seomoz.org/learn-seo/robotstxt> (see "Why Meta Robots is better
than robots.txt").
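
For reference, this is roughly what a well-behaved crawler checks before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs and the `ExampleBot` user agent are invented for the sketch. The check only governs fetching, which is consistent with a blocked URL still being known to an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks crawlers from search result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite bot asks before fetching; a toolbar riding on a user's session never asks.
blocked = parser.can_fetch("ExampleBot", "http://example.com/search?q=binggate")
allowed = parser.can_fetch("ExampleBot", "http://example.com/about")
print(blocked, allowed)
```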

------
javanix
Can we please, please, PLEASE not call this Binggate?

Not everything needs a pithy one-word term of endearment.

~~~
paulrademacher
Bingapocalypse?

Oh wait, that's still one word.

------
jdp23
Very good post. I'd like to suggest another explanation: they really _do_
see what Microsoft is doing as cheating, and expect others to share their
outrage. When you're "in the bubble", talking only with people who share your
perspective, it's easy to believe everybody thinks the way you do. And
Google's known for being smarter than everybody else when it comes to search,
so at some level people there probably believe that the only way anybody could
get results as good as they are is by cheating.

~~~
uxnjmad
A bubble like Microsoft Research for instance?

The comments on this subject seem very polarised and it is interesting to look
at the background of the commenters.

~~~
jdp23
Yes, although in some ways MSR is less in the bubble than the rest of the
company -- people go to conferences and are aware of what's happening
elsewhere. Microsoft's bubble applies just as much here as Google's: I'm sure
a lot of folks there can't understand why Google or anybody else would think
there's anything wrong with what they did here. When bubbles collide.

My charter back in 2006-7 was "game-changing strategies", which meant getting
people to think outside the bubble. So we did a lot of work analyzing
Google's, Yahoo's, and Microsoft's corporate culture. One of the things that's
core to Google's identity is preferring algorithms to anything that has to do
with people, and that's very much on display with Binggate.

------
pak
My first thought was that it was click data, and not outright copying. If a
bunch of Google employees with Bing toolbar start clicking on links to some
made-up term, that should spike the data enough to change results within a few
weeks. How can Google positively rule that out without internal knowledge of
how Bing works?

~~~
user24
With a control. Launch a fake search engine with similar spiked results but
with no traceable connection to Google. If the results only show up on Bing
from google-clicks and not from clicks on the control, then it's a good
indicator that Bing have Google-specific code. Then run the test a few more
times with greater than 100 queries (which in SE land is a minuscule test
set).

From what they've said, it seems like they only tested against fake clicks on
google.com. That tells us Bing are using click data but nothing more. This is
a pretty simple debugging technique, which is why I'm shocked if Google didn't
think to do that. I really wish I wasn't the one saying this, wish I didn't
have these doubts. But I can come to no other conclusion than the ones I've
outlined in the post.
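
To sketch how the control comparison might be scored: count how many spiked queries surface on Bing from each arm and ask how likely that split is if both arms behave alike. The example below uses a one-sided Fisher exact test written from scratch. The function name is made up, and the counts in the usage lines are entirely hypothetical, not figures from any real experiment.

```python
from math import comb


def one_sided_fisher(hits_a, n_a, hits_b, n_b):
    """P(arm A gets at least hits_a of all hits) if both arms have the same rate."""
    total_hits = hits_a + hits_b
    denominator = comb(n_a + n_b, total_hits)
    upper = min(n_a, total_hits)
    return sum(
        comb(n_a, k) * comb(n_b, total_hits - k)
        for k in range(hits_a, upper + 1)
    ) / denominator


# Hypothetical counts: spiked queries that surfaced, out of queries tried per arm.
p_skewed = one_sided_fisher(hits_a=9, n_a=100, hits_b=0, n_b=100)
p_even = one_sided_fisher(hits_a=5, n_a=100, hits_b=5, n_b=100)
print(p_skewed, p_even)
```

A small value for the skewed case would point towards Google-specific handling; similar hit rates in both arms would point towards a generic mechanism.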

~~~
jsnell
If you can't come up with any other conclusion, I'm not sure you're really
sorry about having those doubts. Here's a few alternatives off the top of my
head:

0\. There might or might not have been a control group, but its results didn't
matter since:

0a) the whole purpose of revealing this was to make sure that blackhat SEOs
could start abusing the system, and MS would thus be forced to stop doing it.

0b) the whole purpose of revealing this was to shame MS into stopping the
practice, and for that purpose it didn't matter whether the system was
specific or generic.

0c) describing the full experimental setup and the gazillion things that were
tested would just have distracted from the core story

1\. There was a control group and it suggested that the mechanism wasn't
generic, it just wasn't mentioned because:

1a) it was held back as a gotcha in case Microsoft started lying about what
the system actually did.

1b) positive evidence from the control group couldn't actually prove anything,
it'd be suggestive at best.

2\. There was no need for an explicit control group since:

2a) they actually observed the network traffic of the toolbar, and it only
sent the relevant information for Google and not other sites.

2b) they disassembled the toolbar and found out it had Google-specific code
related to this.

3\. There was a control group and it suggested that the mechanism was generic,
but:

3a) they thought that the mechanism being generic didn't matter, and what
MS were doing was still equally dodgy.

3b) they thought that MS would not be keen on trying out a "oh no, it's not
just Google whose algorithms we're leeching off when spying on users, it's
every other site too" PR strategy.

I have no idea of what actually happened. In all likelihood it was something
not listed here, since these were just random ideas. But at least I think many
of them are way more plausible than the silly "maverick super-senior engineers
botch the job, leak a flawed story, PR coverup follows" theory.

------
b0b0b0b
I like to imagine that the bing algorithms are so smart that they realize the
preciousness of the signal in clicktrack-logged visits to websites with a
referrer "<http://www.google.com/search?q=%s>". This single feature is given so
much weight that bing unintentionally gives the appearance of wholesale google
duplication.

------
robobenjie
The article was interesting, but using Google and Bing as plurals really
disorients me. I had to stop and mentally substitute every time he said
something like "Google are thinking".

~~~
mikeklaas
That's standard practice in the UK when referring to companies.

~~~
user24
thanks, I was beginning to wonder if I was 'right' or not. Didn't know the
standard was singular in US.

------
natmaster
Finally some sense in all this.

------
kj12345
Really interesting comment at Reddit about what Google might be thinking:

[http://www.reddit.com/r/programming/comments/fd3g9/google_bi...](http://www.reddit.com/r/programming/comments/fd3g9/google_bing_is_cheating_copying_our_search_results/c1f1bvx)

Apparently by creating fake results they've published something "creative" and
thus potentially able to be copyrighted. Just speculation but interesting.

~~~
msbarnett
The Reddit commentator appears to be operating on the idea that trap streets
and the like are copyrightable under US law, which is a fairly common urban
legend.

In reality adding a fake "fact" to a collection of real facts does not allow
you to sue someone if they copy the fake fact along with the real facts; from
the decision in _Nester's Map & Guide Corp. v. Hagstrom Map Co._, treating
"'false' facts interspersed among actual facts and represented as actual facts
as fiction would mean that no one could ever reproduce or copy actual facts
without risk of reproducing a false fact and thereby violating a copyright ...
If such were the law, information could never be reproduced or widely
disseminated"

~~~
pbhjpbhj
In Europe there are specific database IP rights.

Re the decision you cite it seems to be spurious. The notion of whether the
data is factual or not is irrelevant. Was the presentation copied?

> would mean that no one could ever reproduce or copy actual facts without risk
> of reproducing a false fact

Unless they actually did some work and checked the facts rather than making a
slavish, infringing, copy of someone else's work.

> If such were the law, information could never be reproduced or widely
> disseminated

For limited terms of never (not limited enough mind you but nonetheless
limited).

This is interesting though - if you can copy facts with impunity then can't I
copy the "fact" of the score for Katy Perry's Firework for example?

------
melling
9/10 desktops run Windows and IE still has almost 60% market share, despite
being much worse than Firefox or Chrome (yes IE9 is a huge improvement).
Microsoft could do a lot of damage to Google by leveraging their desktop
monopoly. I think Google has around 70% search market share, but they're
easily replaceable. By the time the courts sort it out, the damage will be
done.

------
mukyu
Why does he think that Google were not aware of the clickstream data when they
set up (several) experiments where they specifically install what they thought
were sources for the clicks and then purposefully clicked?

------
pluc
It's okay for MS to parse user input through its toolbar, fine. However, it's
not okay for Microsoft (or anyone) to use that toolbar to figure out what
response any given server sends in reply to any kind of requests - unless it's
part of a documented feature. You're only seeing the part where Microsoft is
grabbing user input and not the part where Microsoft is grabbing Google data.

------
InclinedPlane
I think Google is right to be upset here. Google's biggest asset is their
search results and Microsoft's response seems tepid at best. Effectively MS is
claiming that they don't scrape google search results, instead they merely
constructed an automated device which essentially does exactly that. If Bing
doesn't see what's wrong with that then they are not terribly smart.

------
user24
I've written a followup post addressing some of the common reactions to my
first: [http://www.puremango.co.uk/2011/02/what-are-google-thinking-...](http://www.puremango.co.uk/2011/02/what-are-google-thinking-part-2/)

On HN: <http://news.ycombinator.com/item?id=2169690>

------
gfodor
I have a feeling there is more to this story than meets the eye. Either Google
did more tests than they mention on the blog and proved irrefutably that Bing
is literally scraping their results, or there's some underlying political
stuff going on that we're not privy to.

~~~
ars
Never mind about proving it - google isn't even _saying_ it.

It's not scraping, it's click tracking.

------
storborg
Can't someone just analyze the Bing toolbar binary and figure out what's
actually going on here?

~~~
_flag
The Bing toolbar only sends information back to Microsoft, it doesn't decide
what to do with it. For this kind of a task it would be simpler to use a
packet sniffer anyway.

~~~
storborg
Thanks for the clarification. Does the toolbar send a notification to
Microsoft on _every_ click? Or only clicks on Google SERPs?

------
jyanez
I'm in Venezuela and it just didn't happen. I've parsed HTML results from
Google a couple of times, and Google varies results by IP address, browser and
language. So far, I think it's pretty wrong to say something like
this without being 100% sure.

------
izendejas
Whether it's wrong or not, it does make you wonder where Google's priorities
are. How much time (cumulatively) did they spend on this and could they have
spent this time improving their algorithms instead? Maybe it took a few hours,
but still.

------
muyyatin
Is the next push in SEO to pay users to navigate from search results or other
sites to your site?

------
known
<http://www.google.com/xml> is wrong

------
xaei
it's tangentially interesting that the number 2 result on bing for 'google' is
a washington post semi-hit-piece regarding spam. google's results for 'bing'
contain marginally less transparent attacks.

------
s_jambo
I'm surprised google didn't just use this to poison the results.

------
jyanez
I just tried what the blog said and it's simply not happening.

------
trezor
I honestly don't get the controversy of this thing.

Let's say Microsoft via its bing-toolbar is checking what pages people are
visiting, what search term they are using on various sites (including google)
and what they deem are the most useful results of these queries, and
incorporating this into their search engine. Is this really so bad?

As it stands, Google is collecting this data about you almost _everywhere_ on
the internet, whether you use google or not. Besides straight Google search,
think websites which use Google's CDN for stuff like jQuery, Google
analytics, etc etc. Google is everywhere and they are collecting data to
incorporate in their search and ad-platform from everyone. End of story.

Heck, with google instant and their new JS-based search-gui you can't even get
the referer information on your own website to see what search terms led your
users to your site. You now _have_ to use Google analytics to get that
information, and in getting that information you are helping Google even
further in tracking everything everyone is doing on the internet. WTH?

Relatively speaking, is Microsoft really the bad guy here?

