
Microsoft Finds Cancer Clues in Search Queries - hvo
http://www.nytimes.com/2016/06/08/technology/online-searches-can-identify-cancer-victims-study-finds.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=second-column-region&region=top-news&WT.nav=top-news&_r=0
======
vmarsy
For those curious about what kind of queries the researchers were interested
in : " _it typically produces a series of subtle symptoms, like itchy skin,
weight loss, light-colored stools, patterns of back pain and a slight
yellowing of the eyes and skin that often don’t prompt a patient to seek
medical attention._ " [1]

The article name is: J. Paparrizos, R.W. White, E. Horvitz. Screening for
Pancreatic Adenocarcinoma using Signals from Web Search Logs: Feasibility
Study and Results, Journal of Oncology Practice, June 2016.[2]

[1][https://blogs.microsoft.com/next/2016/06/07/how-web-
search-d...](https://blogs.microsoft.com/next/2016/06/07/how-web-search-data-
might-help-diagnose-serious-illness-earlier/#sm.000120cjl17p8d0oqj91dpiyyzt7o)

[2]
[http://jop.ascopubs.org/content/early/2016/06/02/JOP.2015.01...](http://jop.ascopubs.org/content/early/2016/06/02/JOP.2015.010504)

~~~
jbandela1
First, this is a very scary first step in the assault on privacy. It is an
opening for people to argue that a person who fits a certain search profile
should be de-anonymized. After all, wouldn't you want to know if you had
cancer? In addition, as it is not about law enforcement but public health,
there are a lot fewer limitations on what information the government can
access.

Second, after seeing the type of queries, I do not think that this is all that
helpful. If a person has unexplained weight loss or yellow skin or eyes, they
should always go see their doctor right away. My guess is that most of the
specificity of this study comes from those two terms (weight loss combined
with yellow skin). Just getting out that message will do a lot more to save
people's lives than violating their privacy in this manner.

~~~
mooneater
> they should always go see their doctor

GPs often do not catch rare conditions or determine the underlying cause of
milder symptoms.

~~~
davidw
I had a collapsed lung and went away from my GP with an inhaler for asthma
(I've never had asthma). Luckily I went in after a few days and got x-rays and
then got treated immediately, and I'm fine (I hope:-) 20 years later. Still
though, it reduced my trust in doctors dramatically.

~~~
jacquesm
You're a cyclist, so you probably have very good lungs. I was in a similar
situation (saxophone player, avid cyclist) so too ended up with a misdiagnosed
collapsed lung. If you're not showing the regular symptoms (such as very low
oxygen levels in your blood) then they are bound to assume something else is
the cause. Sharp pain in your chest is a good reason to suspect a collapsed
lung or a partial detachment of the lung from the chest wall if you're
otherwise in very good condition and should be taken very seriously.

------
andy_ppp
How long before Google and Microsoft put up an automated warning:

 _Your recent search queries suggest you may have cancer, please seek advice
from your doctor._

Or some time in the future...

 _The pattern of your searches seems to indicate you are a terrorist, we
aren't telling you and we have called the thought police._

~~~
robmcm
What if they put this warning up to their (actual) customers, like the US
insurance companies that could hike premiums or ditch their customers before
they realized they had costly conditions?

~~~
beefield
I would be quite likely willing to pay Google for their services more than
they are making out of me currently, if:

1\. They agree, in a legally binding way, not to give any of my data to anyone.

2\. They do not show me a single ad anywhere.

Obviously, that would mess up the current business model of everyone counting
on AdWords revenue, so I am not holding my breath here.

~~~
robmcm
They would quickly have a potential revenue ceiling that would likely decline
over time.

At the moment it's the opposite.

~~~
xviia
Google is trying this with YouTube Red, which removes all advertising
from YouTube.

I believe it is actually more profitable than advertising. With ads, an
average person is worth between $0.01 and $1 a month (depending on what type
of ads), much less than the $9 for YouTube Red.

------
biot

      "The data used by the researchers was anonymized, meaning
       it did not carry identifying markers like a user name,
       so the individuals conducting the searches could not be
       contacted."
    

As when AOL or Yahoo released their anonymized data set, it is often easy to
take someone's search history and work backwards to find out who they are. How
can they ensure that personally identifiable information has been scrubbed
100% from all queries? Maybe a user searched a courier tracking number, and
that info can now be looked up on the courier's site and tracked back to their
home or office address. Each additional piece of info gets you one step closer
to identifying who they are.

Yet one more reason to use DuckDuckGo for your general search needs.

~~~
robmcm
Not sure why you are getting down voted?

Predicting users' health data based on simple searches is terrifying. I'm not
sure why anyone would be happy with Google or Microsoft having this
information, especially when their customers could use this against you.

~~~
andylei
Google and Microsoft already have this information

------
seizethecheese
Methods: We identified searchers in logs of online search activity who issued
special queries that are suggestive of a recent diagnosis of pancreatic
adenocarcinoma. We then went back many months before these landmark queries
were made, to examine patterns of symptoms, which were expressed as searches
about concerning symptoms. We built statistical classifiers that predicted the
future appearance of the landmark queries based on patterns of signals seen in
search logs.

Results: We found that signals about patterns of queries in search logs can
predict the future appearance of queries that are highly suggestive of a
diagnosis of pancreatic adenocarcinoma. We showed specifically that we can
identify 5% to 15% of cases, while preserving extremely low false-positive
rates (0.00001 to 0.0001).

~~~
SapphireSun
It would be really interesting if after a series of queries, the search engine
displayed a little text block that said something like:

"We don't often do this, but did you make the following searches regarding the
health of yourself or a loved one? ... SEARCHES FOLLOW ...

Studies show that a large proportion of people making these searches for
medical purposes should talk to a doctor about these symptoms. Here's a number
to call if you do not have a personal physician: (555) 555-5555"

------
junto
I was just reading this on mobile, and I got redirected to another
site that told me my device had "problems"!

The NYT can't even keep their site clean of malware-laden advertisers.
Ridiculous.

[https://www.dropbox.com/s/9bv5ju0f73bdmcr/Screenshot_2016-06...](https://www.dropbox.com/s/9bv5ju0f73bdmcr/Screenshot_2016-06-08-09-24-51.png?dl=0)

[https://www.dropbox.com/s/ba1g2u2cuox95zz/Screenshot_2016-06...](https://www.dropbox.com/s/ba1g2u2cuox95zz/Screenshot_2016-06-08-09-24-40.png?dl=0)

~~~
yolesaber
The NYT on desktop has a generally good ad experience. The mobile website is
awful awful awful with ads for some reason. Use their app instead if you can.

------
avbor
Target actually did something similar, though their intent was to eventually
find better ads for families who were expecting. In their case, fully
deanonymizing and being straightforward (outright advertising baby
products) turned out to be a nightmare, and they eventually turned to more
subtle advertising by inserting baby products into the weeklies.

Could we see a case where, when someone searches for one thing, instead of
seeing results that pertain to that immediate query we see results that match
common future searches?

[http://www.nytimes.com/2012/02/19/magazine/shopping-
habits.h...](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html)

------
huuu
Isn't the real story here that Bing is keeping track of users' search queries
for months?

Google Flu Trends works differently. They try to predict a flu epidemic by
counting related search queries.

But Microsoft is predicting the health of a single person based on his search
history.

Edit: Thinking about it: of course Google, Facebook and others could do the
same because they also gather user data.

~~~
Artemis2
Everything you do is logged by all these sites, and tied to your identity
there, no matter what they say. They have so much storage space to spare that
it's more compelling to save the data for potentially analyzing it later than
to throw it away.

A few years back, Google decided to recalculate YouTube view and subscriber
counts to counter bot usage. They store so much information about every
request that they have been able to detect views made by bots in the past,
from patterns in this data.

~~~
kuschku
The thing is, this is illegal in the EU.

If you collect data for one purpose, you can't use it for any other purpose,
unless you explicitly and in easily readable language told the user about it
before.

You can't retroactively get permission to use data for other purposes either.

And currently medical or research purposes are not listed in Microsoft's or
Google's ToS.

------
jlg23
I'm not sure that this is a good way to demonstrate data mining skills. The
survival rate for pancreatic cancer is abyssal:

> While five-year survival rates for pancreatic cancer are extremely low,
> early detection of the disease can prolong life in a very small percentage
> of cases. The study suggests that early screening can increase the five-year
> survival rate of pancreatic patients to 5 to 7 percent, from just 3 percent.

WP claims 20%[1] though a glance at the referenced source suggests that the WP
summary is bogus.

So the only ones who benefit from this data mining would be health insurers,
who could get rid of people who'll incur very high treatment costs with a low
expectancy of success.

[1]
[https://en.wikipedia.org/wiki/Pancreatic_cancer](https://en.wikipedia.org/wiki/Pancreatic_cancer)

~~~
Confusion
You mean abysmal.

~~~
jlg23
Indeed, I do. Thank you.

------
kalleboo
If people are searching for medical symptoms on a search engine, aren't they
already ending up at WebMD or whatever and finding possible diagnoses?

This would have been a lot more interesting if the keywords were a lot more
subtle - like a change in behavior marked by a sudden craving for salty foods
or whatever.

~~~
figgis
This is searches over a period of time before any diagnosis was actually made.
Not "light stool, eyes slightly yellowed, cancer, " etc.

------
naveen99
Cancer can arise anywhere. Symptoms are dependent on where it comes from. I
have seen too many doctors succumb to cancer, with little warning.

If MRIs were faster, I think whole-body MRIs would be a decent screening
tool. The problems are that they (1) are expensive, (2) take a long time to do
and are uncomfortably loud, and (3) generate heat in the body.

There are also tumor markers for many cancers.

Screening guidelines unfortunately have to be doable to populations (lowest
common denominator). More informed people with resources can do better if they
take initiative (with some trade offs of time, risk).

In general, our bodies can use a lot of tuning. The more you look, the more
you find. Some tuning has trade offs.

If you want to be proactive, you also have to ask your doctor for trials of
particular tests or treatments. Doctors are conservative, and the first thing
they will want to try is wait and see. That leaves you with maybe 100
experiments you can do on yourself in a lifetime. We need to be able to do
hundreds / week, to get significant progress towards making our bodies have
99.99999% up time.

The future is Star Trek type doctor, but a personal one for everyone. The
major hurdles are economic and regulatory. Some physical.

------
panic
To translate the false-positive rate into concrete numbers:

According to the American Cancer Society
([http://www.cancer.org/cancer/pancreaticcancer/detailedguide/...](http://www.cancer.org/cancer/pancreaticcancer/detailedguide/pancreatic-
cancer-key-statistics)), about 53,070 people will be diagnosed with pancreatic
cancer this year. The abstract says this method detects 5% to 15% of cases:
that's about 2,700 to 8,000 correct detections. Assuming there are 100 million
people using Bing
([https://www.quantcast.com/bing.com](https://www.quantcast.com/bing.com)),
between 1,000 and 10,000 cases will be wrongly detected (0.00001 to 0.0001
false positive rate).
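
As a rough check of that arithmetic, here is a minimal Python sketch. It uses only the figures quoted in this comment; the 100 million Bing-user count is this comment's own assumption, and the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the detection counts discussed above.
NEW_CASES_PER_YEAR = 53_070        # American Cancer Society estimate quoted above
ASSUMED_BING_USERS = 100_000_000   # assumption from the comment, not from the paper


def expected_detections(cases, users, sensitivity, false_positive_rate):
    """Return (correct detections, wrong detections) for one scenario."""
    correct = cases * sensitivity
    # Like the comment, apply the rate to all users; subtracting the ~53k
    # real cases first would change the result by well under 0.1%.
    wrong = users * false_positive_rate
    return correct, wrong


# Lower and upper ends of the ranges given in the abstract.
low = expected_detections(NEW_CASES_PER_YEAR, ASSUMED_BING_USERS, 0.05, 0.00001)
high = expected_detections(NEW_CASES_PER_YEAR, ASSUMED_BING_USERS, 0.15, 0.0001)

print(f"correct detections: {low[0]:,.0f} to {high[0]:,.0f}")
print(f"wrong detections:   {low[1]:,.0f} to {high[1]:,.0f}")
```

The sketch takes the comment's premises at face value: it applies the 5% to 15% rate to all diagnosed cases and the false-positive rate to all assumed users.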

~~~
tempestn
It wasn't entirely clear from the article, but I assumed that the false
positive rate referred to the ratio of people with matching search queries,
not of all Bing users. In that case the absolute number of false positives
would be much lower.

Also, the detection of 5% to 15% of cases would seem to me to refer to only
Bing users; I doubt they're claiming to be able to detect 5-15% of all cases
of pancreatic cancer.

Would've been nice if these things were actually spelled out in the article.

~~~
panic
_It wasn't entirely clear from the article, but I assumed that the false
positive rate referred to the ratio of people with matching search queries,
not of all Bing users. In that case the absolute number of false positives
would be much lower._

If that's true, how did they have enough real positives to measure such a low
false positive rate?

 _Also, the detection of 5% to 15% of cases would seem to me to refer to only
Bing users; I doubt they're claiming to be able to detect 5-15% of all cases
of pancreatic cancer._

Yeah, I'm being dumb, there's no way that percentage is out of all cases!

------
malemi
The final confidence of the "diagnosis" is about 50% (you can use the Bayes
formula to get that, see here
[http://www.visualab.org/index.php/cyberchondria-microsoft-
ba...](http://www.visualab.org/index.php/cyberchondria-microsoft-bayes-ny-
times-pancreas-cancer)). Yes, 50% is better than nothing, but the NY Times
article does exactly what serious newspapers should _not_ do –let people think
that "the Internet" is a great place to diagnose yourself. It is not.
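
For readers who want to see the shape of that Bayes calculation, a minimal Python sketch follows. The linked post's exact inputs are not reproduced here; the prevalence below is an assumption pieced together from figures mentioned elsewhere in this thread (53,070 yearly diagnoses over an assumed 100 million users), and `positive_predictive_value` is an illustrative name.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Bayes' rule: P(disease | flagged by the classifier)."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed prevalence among searchers: hypothetical, see the note above.
ASSUMED_PREVALENCE = 53_070 / 100_000_000

# Worst and best corners of the ranges reported in the abstract.
worst = positive_predictive_value(0.05, 0.0001, ASSUMED_PREVALENCE)
best = positive_predictive_value(0.15, 0.00001, ASSUMED_PREVALENCE)

print(f"P(cancer | flagged) between {worst:.2f} and {best:.2f}")
```

Under these assumed inputs the answer lands at roughly 0.21 to 0.89, so a figure near 50% is plausible, but it is very sensitive to the prevalence you plug in.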

------
mooneater
Insurance companies would just love this data...

------
fideloper
No mention of what type of queries they believe are associated :/

------
jerryhuang100
_> " We showed specifically that we can identify 5% to 15% of cases, while
preserving extremely low false-positive rates (0.00001 to 0.0001)."_

Several years ago, Google Flu Trends also claimed 97% accuracy compared to
CDC data, but it was later found to be way off from the real data. Did the
authors compare their study to Google Flu Trends?

Also, it's not clear how they reached the conclusion of a low FP rate. Did they
randomize their sample pool and run their predictive model for several rounds?

------
damianknz
Can't one already run ad campaigns based on a user's search history? What if a
cancer foundation or similar organisation ran a targeted ad campaign?

------
toomanythings2
I was going to make a joke about Microsoft violating privacy as they scan your
search queries and are able to ID you but I see HN posters, here, have already
taken up that torch.

------
uptownfunk
Any links to the non-paywalled technical paper? I'm curious as to the learning
models they built to run the actual predictions.

------
rathish_g
Wait ... they used Bing for such a long time and lived to tell the tale? :)

