
Data is at the heart of search, but who has access to it? - dpw
http://andreasgal.com/2015/03/30/data-is-at-the-heart-of-search-but-who-has-access-to-it/
======
ChuckMcM
Sigh, this is incorrect.

edit: incorrect is perhaps too strong, it is incomplete.

While it is true that click tracking can be used as a relevance signal, the
people who were _really_ pissed off when the data stream got dumped were
advertisers who wanted to buy AdWords. That was a very simple system, pay
someone for clickstream data, extract trending queries, front those with
AdWord buys to get your page on the top of Google's results, and profit.

Having built a search engine and run it for 5 years, we got to see what people
felt was relevant and what wasn't in a very loose way with click stream data.
Basically you have a query and 10 blue links; you can split the results into
quartiles and figure out if the thing they clicked on was top half, bottom
half, top quarter/second quarter etc. And do A/B testing to see how that
played out. But what we found was that the best indication of what a page was
about, was the text that linked to it. If you have an in-link to a page which
was "<href='page'>great radio site"[1] then "great radio site" would be a
query that should return that page which might be titled something like "bob's
electromagnetic spectrum imaginarium" or something equally unlikely to come up
in a query string.
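
A rough sketch of the two signals described above, in Python. This is only an
illustration, not the engine's actual code: the function names, the integer
quartile bucketing, the count-based scoring, and the example URLs are all
assumptions made for the example.

```python
from collections import defaultdict


def click_quartile(position, page_size=10):
    """Map a 1-based result position to a quartile (1 = top quarter)."""
    if not 1 <= position <= page_size:
        raise ValueError("position is outside the result page")
    # Integer bucketing: with 10 links, positions 1-5 fall in quartiles 1-2
    # (top half) and positions 6-10 fall in quartiles 3-4 (bottom half).
    return (position - 1) * 4 // page_size + 1


def build_anchor_index(links):
    """Index pages by the words of the link text that points at them.

    links is an iterable of (anchor_text, target_url) pairs.
    """
    index = defaultdict(lambda: defaultdict(int))
    for anchor_text, target_url in links:
        for term in anchor_text.lower().split():
            index[term][target_url] += 1
    return index


def search(index, query):
    """Rank pages by how often the query terms appear in their in-link text."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical crawl output: the page title never mentions "radio", but the
# links pointing at it do.
links = [
    ("great radio site", "bobs-imaginarium.example"),
    ("great radio site", "bobs-imaginarium.example"),
    ("radio schedules", "listings.example"),
]
index = build_anchor_index(links)
print(search(index, "great radio site"))
print(click_quartile(3), click_quartile(8))
```

The point of the anchor index is that the query matches the page through what
other people call it, regardless of what the page calls itself.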

So the bottom line is that there are lots of ways to try to determine
relevance, click stream data is a part of that but by no means the biggest
factor.

[1] neutered html for obvious reasons.

~~~
Animats
The value of looking at queries is that it allows learning what questions
users ask. The front end of the search process is to infer from the query what
the user really should be given. That's a machine learning problem. The head
of Google search remarked recently that "as the search engine gets smarter,
the queries get dumber".

This is reflected in Google's search results. A Google query which can
possibly be interpreted as related to a popular culture item usually will be.
Google has become more aggressive about this over the years. Their "Did you
mean" result tag once offered an alternative for a second search. Now, they
return results for the more popular interpretation first.

The back side of search, page quality and ranking, is weaker than many think.
Links are less useful than they used to be. Most links to business sites are
now from "social" sites or forums, which are easily spammed. Using social
signals was a disaster back in 2012, when, for a few months, Google went all-
in on social signals. Google tried to recognize sites that "look like spam",
but everybody knows that now and spam sites look better than ever. (The same
thing happened with spam emails a decade ago.) Google doesn't recognize
provenance, so they can be fooled by scraper sites. Google doesn't recognize
the business behind the web page, so they can be fooled by marginal
businesses. There are even SEO companies using machine learning to reverse
engineer Google's algorithms, to find out how far they can go with keyword
stuffing before a penalty kicks in.

Google does far more manual adjustment than they did two years ago. There's an
army of people doing manual ranking, and a smaller unit handling appeals from
manual penalties. There was a time when Google boasted they did no manual
adjustments to ranking. The automation is starting to fail.

~~~
sanxiyn
1noon (Korean web search startup) tried to recognize provenance and was
somewhat successful. But that wasn't enough to win in the market. Naver
acquired 1noon.

------
jfuhrman
>In Germany, for example, where Google has over 95% market share, competing
search engines don’t have access to adequate past search data to deliver
search results that are as relevant as Google’s. And, because their search
results aren’t as relevant as Google’s, it’s difficult for them to attract new
users. You could call it a vicious circle.

This is interesting because of the browser choice enforced by the EU on
Windows. IE whose default is Bing lost share to other browsers like Chrome,
Firefox and Opera which all had Google as the default. So an attempt to fix
the browser market totally distorted the Web Search market. I wonder why MS
didn't request to the EU that the alternate browsers in the browser choice
screen had to have Bing as the default search.

I wonder if the EU will mandate that search relevancy data must be shared by
Google with rival search engines like DDG just like they mandated that SMB
shares and Office formats must be documented by MS and released to developers.

~~~
dheera
Ethics and morality aside, I'm curious what allows the EU to "enforce" laws on
a US company. Let's say Google and Microsoft don't register entities in the
EU. Can they do anything?

Can Microsoft and other US-based technology companies _theoretically_ just
keep doing their own thing, tell the EU government "to hell with it, we're
abiding by US laws, you have a choice to stop importing Windows and invent
your own OS if you don't like us"?

~~~
M2Ys4U
The EU is the world's largest economy, do you really want to shut your
_entire_ business out of that market?

What then happens if a competitor is established to take your former position
in the European market - chances are they're not just going to stay in the EU.
They're going to eat your lunch elsewhere too.

~~~
dheera
How would the EU shut you out? Would the EU actually dare to begin censorship?

~~~
xxxyy
No, of course not by censorship. Through: fines for monopolist practices
(happened to Microsoft), general smear campaign (happens to Amazon in Germany
over working conditions), poking with a stick (the "right to be forgotten"),
or just plain old taxes (the new "internet tax" is a current topic in the EU).
There is always a way if you are determined enough. Politics.

------
solve
Other than the index data, there's something even bigger.

Google's biggest PR success is convincing everyone that the quality of web
rankings depends almost purely on algorithms. It does not. What allows Google
to hold their monopoly is the $100s of millions (or more) they continuously
pay to amass more manually created training data:

[http://www.theregister.co.uk/2012/11/27/google_raters_manual](http://www.theregister.co.uk/2012/11/27/google_raters_manual)

[http://www.forbes.com/sites/timworstall/2012/11/27/is-google...](http://www.forbes.com/sites/timworstall/2012/11/27/is-googles-algorithm-really-just-1500-homeworkers/)

A new search engine could appear today with algorithms 10x better than Google,
but without access to this scale of training data, their rankings wouldn't
even be close to Google's quality.

Google maintains their position by paying cash for this monopoly on training
data made by tens of thousands of $9/hour workers, not through superior
algorithms!

------
bobajeff
I think a problem that is happening here is that there is no competition in
search just like there is no competition in social networks and operating
systems. Not like there are for things like automobiles, electronics and
clothing.

Computers introduce a means to lock people in that doesn't exist in other
markets. In software products there are often ecosystems that tie directly in
to the product/service which are not required to be shared with competitors
unlike with road systems for cars.

Regulators ought to look into ways to enforce measures that require the
companies to completely open their ecosystem to competitors. Or look into ways
to standardize these ecosystems and require every service/application/website
comply with them (similar to how media companies are forced to include closed
captioning).

~~~
pain
"Jobs did great harm to the world with his iThings: computers designed to be
jails for their users. His genius was to find the way to make these jails
desirable so that millions would clamor to be locked up." —Richard Stallman

~~~
ntakasaki
What is more open, a Chromebook or a Windows laptop or a Macbook?

I would think a Windows laptop or a Macbook because the users and developers
can install or develop any application, yet we have everyone singing the
praises of heavily DRM'ed and locked up Chromebooks and iPads. Sometimes I
feel it's more about Microsoft hate than about a free computing environment.
At least RMS is consistent and is less prone to company fanboyism than the
tech crowd.

~~~
castratikron
Chromebooks are locked up now? I thought you could install your own Linux on
them. Don't some even use coreboot?

~~~
ntakasaki
You can install Linux on Windows PCs without even needing a developer unlock;
doesn't that mean they're as open as Chromebooks? Not to mention that if the
battery goes completely dead on some Chromebooks, Linux is completely wiped
along with the data and replaced by ChromeOS. You also have to press Ctrl-D
past a scary warning on every single boot on some Chromebooks, or flash a new
BIOS.

Can Mozilla make a Firefox for ChromeOS? How many Chromebooks that are being
dumped in the education space are having Linux installed on them? Google has
root on ChromeOS and the user doesn't. The whole purpose of them is to force
the user into uploading all their data into Google's cloud. That's why even a
$1400 machine has a paltry 64GB of storage but comes free with a few years of
1TB space on Google Drive.

------
sanxiyn
In South Korea, Google's market share is below 5%, and Naver gets more than
80% of search queries. I think this is the reason why Google's search results
for Korean content are not as good as for content in other languages.

------
jjoe
So the whole push for SSL/https from Google has been opportunistic rather than
good practice. I mean why would a search engine go as far as to make SSL a
ranking signal?

~~~
dheera
Sites that use or at least offer SSL probably also tend to be higher-quality
sites. The combination of verified identity and payment means that it's a
natural filter for people who are at least semi-serious about their project.

------
ocdtrekkie
It makes you wonder how many changes were made for "privacy" and how many
changes were made for "protecting our business".

~~~
stevenbedrick
Is it necessarily an "either/or" situation here? This seems to me like an
example of a "both/and".

~~~
ocdtrekkie
That's fair. I just wonder which half was the selling point that made the
change happen.

~~~
geoelectric
It's honestly hard to say. Privacy is a selling point, especially nowadays.

My guess is that the proposal probably included the cliche "win/win
situation," had already been sitting in someone's back pocket, and the raising
of it was either sparked by some privacy-related news story -or- a market
event of some kind. At the end of the day, it doesn't really matter.

I think there are a handful of techs that lend themselves to natural monopoly
--basically anything where the expense of building sufficient infrastructure
for a minimally-competitive product requires previous success in the market.
This is true whether the infrastructure is copper lines or a body of previous
searches.

That means you're either one of the first ones there with low cost of entry
and building on your own successes (Google); or you're shifting to the market
from success in an unrelated area (Bing); or you're essentially locked out
unless you can somehow acquire access to that infrastructure.

My guess is search will eventually turn into an antitrust-regulated industry.
Really depends on whether up and comers like DuckDuckGo can really stay
relevant based on ideology and whether old players like Yahoo can really re-
enter the market successfully.

But the most likely scenario really appears to be a duopoly between Google and
Bing at best, and more likely simply a monopoly for Google.

The analogous solution to telecom would be forced access to search queries for
alternative providers (a la CLEC/ILEC) but privacy concerns will make the
situation interesting to say the least.

Possible it may eventually turn out that mainstream search engines simply have
no specific privacy protection, at least for aggregate data. Since that's in
both the corporations' (market leaders aside) and government's best interest,
seems plausible. That'd be a lot of power behind it.

------
pcl
Interesting. I wonder to what extent this reasoning was behind executive
support of the Chrome project, and whether it was a factor from the onset or
something that Google stumbled upon after developing a browser.

~~~
sanxiyn
I am 100% sure this is the reason Chrome was funded. (I don't doubt Chrome
developers' goal was to develop the best web browser in the world, but
business case for doing so is different matter.)

~~~
rockdoe
Chrome was also an insurance policy. You can't buy away Google being the
default search engine in Chrome. Imagine pre-Chrome browser marketshares and
imagine the impact the Firefox-Yahoo deal would have had.

------
ntakasaki
>In 2011, Google famously accused Microsoft’s Bing search engine of doing
exactly that: logging Google search traffic in Microsoft’s own Internet
Explorer browser in order to improve the quality of Bing results.

MS didn't do that from IE, they did for users who installed the Bing bar, a
huge difference.

------
Metapilot
I think the author's perspective is skewed in order to stay in line with the
title. Here's an example of why I say that:

The author states that "For some 90% of searches, a modern search engine
analyzes and learns from past queries, rather than searching the Web itself,
to deliver the most relevant results." This may be true in some types of
searches but overall, I think the statement is misleading.

Rather, it's better to think of it like this: One important part of the
algorithmic process involves constantly crawling the web and updating the
index with new information. (Important / frequently-updated web sites may get
crawled all day every day, while ones that are less important may get crawled
only weekly or monthly.) Meanwhile, another part of the algorithmic process
constantly analyzes new info discovered in the crawl and combines it with, as
the author mentioned, click-through data learned from past queries.
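
To make the crawl side concrete, here is a minimal sketch of importance-based
recrawl scheduling in Python. It is not how Google actually schedules crawls;
the 0-to-1 importance score, the linear interpolation between an hourly and a
monthly interval, and the class and function names are all assumptions made
for illustration.

```python
import heapq

HOUR = 3600  # seconds


def recrawl_interval(importance, min_interval=HOUR, max_interval=30 * 24 * HOUR):
    """Seconds to wait before recrawling a page.

    importance is a hypothetical score in [0, 1]: 1.0 means recrawl about
    hourly, 0.0 means recrawl about monthly.
    """
    importance = min(max(importance, 0.0), 1.0)
    return max_interval - importance * (max_interval - min_interval)


class CrawlScheduler:
    """Priority queue of pages ordered by the time their next crawl is due."""

    def __init__(self):
        self._queue = []       # heap of (next_due_time, url)
        self._importance = {}  # url -> importance score

    def add(self, url, importance, now=0.0):
        self._importance[url] = importance
        heapq.heappush(self._queue, (now, url))

    def pop_due(self, now):
        """Return the URLs due at or before `now` and reschedule each one."""
        due = []
        while self._queue and self._queue[0][0] <= now:
            _, url = heapq.heappop(self._queue)
            due.append(url)
            next_time = now + recrawl_interval(self._importance[url])
            heapq.heappush(self._queue, (next_time, url))
        return due


scheduler = CrawlScheduler()
scheduler.add("news.example", importance=1.0)
scheduler.add("archive.example", importance=0.0)
print(scheduler.pop_due(now=0))         # both pages get an initial crawl
print(scheduler.pop_due(now=HOUR))      # only the important page is due again
```

A real system would also adjust the importance score from how often a page
actually changes between crawls; that feedback loop is left out here.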

The answers to many queries don't change, while the answers to many other
queries deserve freshness. For example, I'm quite certain Einstein's date of
birth hasn't changed in quite a while, but his theory of relativity is in
constant discussion and there is always new information and new queries
pertaining to it. As a result, there is not much need for a search engine to
go digging for the latest info on an "einstein's birthday" query, but it's to
everyone's advantage that Google is able to identify which pages on the web
deserve priority crawling and that Google has retrieved and incorporated the
fresh info those pages contain into its index when it comes to a topical type
of query like "diffraction of light with quantum physics".

In the end, the results to every query depend on info gathered from the web
and user data helps refine the results. Info that is more static can be
prioritized with more input from click-through data, while new information
found on the web must rely more on Google's artificial intelligence to push it
up in front of searchers.

Another reason that the "90%" statement sticks out to me is that there is a
fairly often-used factoid tossed around by industry experts that between "6%
to 20% of queries that get asked every day have never been asked before."
Google can't rely heavily on past query data for all of these types of
searches.
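
For what it's worth, a "never asked before" rate like this is easy to define
against a query log. The sketch below is only one possible definition, with
assumed details: queries are normalized by lowercasing and collapsing
whitespace, repeats within the same day count as seen after their first
occurrence, and the sample queries are made up.

```python
def normalize(query):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(query.lower().split())


def novel_query_fraction(past_queries, todays_queries):
    """Fraction of today's queries that were never seen before."""
    if not todays_queries:
        return 0.0
    seen = {normalize(q) for q in past_queries}
    novel = 0
    for query in todays_queries:
        key = normalize(query)
        if key not in seen:
            novel += 1
            seen.add(key)  # a repeat later in the day is no longer novel
    return novel / len(todays_queries)


past = ["einstein's birthday", "weather"]
today = [
    "Einstein's  birthday",
    "diffraction of light with quantum physics",
    "weather",
    "diffraction of light with quantum physics",
    "some brand new query",
]
print(novel_query_fraction(past, today))
```

The number you get depends heavily on the normalization and on how far back
the log goes, which may be one reason the quoted figures vary so much.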

~~~
solve
You're vastly underestimating the uniqueness of search queries these days.
Various sources within Google have said that 25% to 50% of queries entered
into Google have never been seen before at all.

------
wmf
So does Mozilla's contract with Yahoo allow Mozilla to track query data and
maybe feed it to underdog search engines like DDG or Blekko (oops)?

~~~
minthd
AFAIK, the deal with Yahoo was about putting Yahoo search in the front. If it
was about tracking Google search data, Mozilla should have at least let
people know, especially with their claim of protecting privacy. And if they
lie, they risk a very strong response, especially from developers they depend
on.

Also, if such changes were to be made, there's a decent likelihood that
someone would have noticed that data leakage and told us about it.

So since Mozilla is a pretty decent company, we should currently give them the
benefit of the doubt.

~~~
rockdoe
I don't see any reason to doubt anything or for that matter give anyone "the
benefit".

Firefox is still open source, unlike IE, Safari and Chrome, so just look.

~~~
minthd
Yes, you're right. They probably couldn't play those games even if they wanted.

------
ekr
So that's why Google created the Chrome browser.

~~~
ntakasaki
Not just created, but bundled and installed by default with Java and Flash
updates, some of which also install the Google toolbar into IE. Many folks that
I had converted to Firefox from IE back in the day use Chrome now and have no
idea how it ended up on their computer. This explains the steady rise of
Chrome, not the few percent of tech geeks that installed it by choice.

------
minthd
So, since Google tracks the full browsing experience of Chrome users, and
hence gets more relevant data than for other browsers' users, it has the
theoretical ability to offer better search results to Chrome users.

Has anybody noticed this happening?

~~~
asuffield
(Tedious disclaimer: my opinion, not my employer's. Not representing anybody
else. I work at Google, not on Chrome)

Google does not "track the full browsing experience of chrome users". Please
read the privacy policy which is very clear on this subject:
[https://www.google.com/chrome/browser/privacy/](https://www.google.com/chrome/browser/privacy/)

I particularly draw your attention to this paragraph: "If you use Chrome to
access other Google services, such as using the search engine on the Google
homepage or checking Gmail, the fact that you are using Chrome does not cause
Google to receive any special or additional personally identifying information
about you."

~~~
minthd
Maybe I'm reading this wrong, but this sounds like Google gets your browsing
history:

"If you sign in to Chrome browser, Chrome OS or an Android device that
includes Chrome as a preinstalled application with your Google Account, this
will enable the synchronization feature. Google will store certain
information, such as HISTORY, bookmarked URLs as well as an image and a sample
of text from the bookmarked page, passwords and other settings, on Google's
servers "

And this isn't that far from full browsing behavior. And that's from a few
minutes reading this page - we don't know if they track deeper details - like
how long the page was open.

Also - Google doesn't have to collect this data. The claimed purpose of this
is that you could share history on multiple devices. But this can also be
achieved by sending encrypted history to Google and decrypting the history on
each device you use (I think browser extensions with similar functions
implement it that way). So it's clear the purpose here is collecting data.

~~~
asuffield
Notice that this is a feature you have to turn on (try it!). Obviously in
order to perform cross-device synchronisation, it's necessary to send this
information.

I'm not free to discuss the details of how these systems work, but consider
this: if both the statements "Google will store this information" and "the
fact that you are using Chrome does not cause Google to receive any special or
additional personally identifying information about you" are hard
requirements, how would you implement this feature?

~~~
minthd
Of course "the fact that you are using Chrome does not cause Google to receive
any special or additional personally identifying information about you." could
be true.

But agreeing to sign in to the history sync feature is something different,
not covered by "using Google Chrome". So now Google is free to use your
history.

And let's be realistic here. Most people don't think about the implications of
logging into Google (even if it says sync of bookmarks, history etc.) and
probably don't read the instructions. Many don't even understand what it
means. And realistically most people see a Google login box which they've
filled in a million times, and fill it in once more, as a sort of Pavlovian
response.

------
tokai
Training data is nice, but I think it's important not to underestimate capacity
for crawling. IMO one of Google's strengths is that they crawl large quantities
of new content. Smaller operations like DDG can't crawl at that scale. If I
want discussion of new bugs, want to search the articles on my favorite news
site (where the in-house search is unusable), or just want the newest blog post
on some subject - Google is hard to beat.

------
PaulHoule
At this point Google is not winning because its search results are good (have
you used Google recently?), it is winning because it makes almost 10x as much
revenue as other search engines do per view -- at that rate any other search
engine is running a charity.

~~~
minthd
It's really pretty weird. Google certainly has the capabilities to offer a
great search experience, but it's very inconsistent.

For example, after learning I like the results from certain journals, their
personalization engine offered me those in related searches, and usually I
chose content from them.

But somehow, after some time, Google's personalization engine forgot that I
like them and stopped offering me content from them, so I'm back to drowning
in shitty results. Why? No idea.

------
countrybama24
Seems like there is a business opportunity to build a plugin of sorts that
allows users to opt in and share their search data with competing platforms.
I'd be interested in donating my data to help a rival engine compete with
Google.

------
thallukrish
Only when the user can own his data, which means apps are just logic and the
user can selectively allow access to whomever and whatever, will we suddenly
find more genuine things reaching the user, be it commerce or content.

------
thrownaway2424
It is unsettling to read this kind of chip-on-my-shoulder opinion piece full
of innuendo under the Firefox logo and the Mozilla name but on the author's
personal domain.

------
Semiapies
TL;DR - Yahoo! still exists and resents Google. But not for being better in
their niche, no. Just for delivering a better _service_ , which is not at all
the same thing. Somehow.

------
asuffield
(Tedious disclaimer: my opinion, not my employer's. Not representing anybody
else. I work at Google, not on search quality)

This article makes a number of bold claims about the contents of data and code
which its author hasn't seen, and is written by a company that is receiving a
large amount of money from Yahoo. I would encourage people not to forget these
details.

~~~
rockdoe
So just point out where it's wrong, instead of making a fairly disingenuous
appeal to non-authority or however you want to call it?

~~~
asuffield
I don't speak for the company, but I don't think we're going to respond to an
attack piece by the Mozilla CTO by disclosing how our search algorithm works.
;)

In any event I'm not a person who can decide to release that information. All
I can do here is to ask people to think about what evidence has been offered
and the motives behind this article.

