

The Parable of Google Flu: Traps in Big Data Analysis [pdf] - tokenadult
http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

======
scottfr
One issue is that the number of historical data points used in the model is
really very small. The authors point out that the model was initially fit to
1,152 CDC data points.

However, these 1,152 data points are highly temporally correlated (the value
at time _t_ is a strong predictor of the value at time _t_ +1). As such, they
aren't "worth" very much when building a model. The effective number of
independent points (which is what you generally need to build a good model)
is therefore much smaller.
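
To put a rough number on that, here is a minimal sketch using the common large-sample approximation n_eff = n(1 - rho)/(1 + rho) for a series with lag-1 autocorrelation rho. Treating the 1,152 points as a single AR(1) series is a simplification, and rho = 0.9 is an assumed value for illustration, not something taken from the paper.

```python
def effective_sample_size(n, rho):
    """Approximate number of independent observations in an AR(1) series.

    Uses the large-sample approximation n * (1 - rho) / (1 + rho), where
    rho is the lag-1 autocorrelation. This is a rule of thumb, not an
    exact result for real surveillance data.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    return n * (1.0 - rho) / (1.0 + rho)


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    return numer / denom


# 1,152 is the figure quoted above; rho = 0.9 is an assumption.
n_eff = effective_sample_size(1152, 0.9)
print(round(n_eff, 1))
```

With those assumed inputs the 1,152 points behave like roughly 60 independent ones, and the figure shrinks further as rho approaches 1.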

Looking at the graphs, it appears that for each flu season there is a pretty
regular spike of flu incidences. This spike could probably be summarized
effectively by four parameters:

- Mean
- Standard deviation
- Skewness
- Kurtosis

Looking at it this way, the number of "independent" data points of a flu
season is not 365 (one for every day of the year), 52 (one for every week), or
even 12 (one for every month). Instead it is only 4.

Thus the total amount of data they had to build their model was 4*(# years).
When you have such small data sets, you can't expect to obtain good results.
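
As a rough sketch of what "four parameters per season" could mean, the snippet below treats one season's weekly counts as weights over week indices and computes the four moments. This is one possible reading of the idea, the function name `curve_moments` is hypothetical, and the weekly counts are made up for illustration.

```python
import math


def curve_moments(weekly_counts):
    """Summarize a seasonal curve by four numbers.

    Treats the weekly counts as weights over week indices 0, 1, 2, ...
    and returns (mean, standard deviation, skewness, excess kurtosis).
    """
    total = sum(weekly_counts)
    weeks = range(len(weekly_counts))
    mean = sum(w * c for w, c in zip(weeks, weekly_counts)) / total

    def central(k):
        # k-th central moment of the week index, weighted by the counts
        return sum(
            c * (w - mean) ** k for w, c in zip(weeks, weekly_counts)
        ) / total

    variance = central(2)
    sd = math.sqrt(variance)
    skewness = central(3) / sd ** 3
    excess_kurtosis = central(4) / variance ** 2 - 3.0
    return mean, sd, skewness, excess_kurtosis


# Made-up weekly counts for a single season (illustration only).
season = [1, 3, 8, 15, 9, 4, 2, 1]
mean, sd, skew, kurt = curve_moments(season)

# Under this view, the data available for fitting is 4 numbers per season.
n_seasons = 5  # assumed number of seasons
print(4 * n_seasons)
```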

TL;DR. Time series are very hard to use in predictive models. It is often
impossible to generate good predictive models based off them.

~~~
Fomite
The most vexing thing about the research I did on influenza was that, because
my estimates were yearly, my functional N for many experiments was ~35 despite
having a staggering amount of data, and even that was because I reached _way_
into the past.

------
blauwbilgorgel
_Mean absolute error (MAE) during the out-of-sample period is 0.486 for GFT,
0.311 for lagged CDC, and 0.232 for combined GFT and CDC._

A combination of GFT search behavior and actual CDC numbers produced the best
model for the authors of this paper. The (media) spin is "GFT as a stand-alone
flu tracker fails".
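
For anyone unfamiliar with the metric, here is a minimal sketch of how an MAE comparison like the one quoted above is computed. The numbers are made up for illustration and are not data from the paper; `model_a` and `lagged` are hypothetical stand-ins for a search-based model and a lagged-CDC baseline.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute difference between predictions and observations."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Made-up weekly values (illustration only, not from the paper).
observed = [1.0, 2.0, 3.0, 2.0]
model_a = [2.0, 4.0, 5.0, 3.0]  # hypothetical model that overshoots
lagged = [1.0, 1.0, 2.0, 3.0]   # hypothetical lagged baseline

mae_model_a = mean_absolute_error(model_a, observed)
mae_lagged = mean_absolute_error(lagged, observed)
print(mae_model_a, mae_lagged)
```

A lower MAE means the predictions sit closer to the reference series on average, which is the sense in which the combined model "wins" in the quoted figures.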

Perhaps the article would be of a different tone if the researchers had full
access to Google's data. I don't think the mentions of irony, failure,
misleading and hubris are deserved nor offer a fair portrayal.

------
j2kun
> When 80-90% of people visiting the doctor for “flu” don’t really have it,
> you can hardly expect their internet searches to be a reliable source of
> information.

I just attended a talk that graphed the rate of physicians' diagnoses versus
the truth, and the conclusion was that about half of physicians drastically
overestimate, too (and it's far more than double, closer to five times more!).

So when 80-90% of people visiting the doctor for flu don't really have it and
those doctors often make type I errors (false positives), you can hardly
expect their diagnoses to be a reliable source of information.

~~~
nraynaud
Flu is like the communist: everybody is afraid of it, everybody believes they
have seen one, but upon serious examination we can't find one.

I guess the flu epidemic is just a conspiracy created to distract us from
the fact that nobody walked on the moon or something.

On a more serious note, what exactly is the deal with misdiagnosis? If the
real thing is another virus, could it also be a virus that we should treat in
a similar manner? Or are the doctors losing precious time on a wrong first
diagnosis in a significant number of cases?

~~~
Fomite
Diagnosing viral illness is a pain. There are a _huge_ number of winter
circulating viruses with roughly the same pattern as the flu, and diagnostic
tests for influenza are often very specific but not very sensitive. Meaning if
they come up positive, you have it, but if they come up negative... _shrug_
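
To make the "specific but not sensitive" point concrete, here is a small sketch that turns sensitivity, specificity, and prevalence into predictive values via Bayes' rule. The inputs (60% sensitivity, 98% specificity, 50% prevalence among tested patients) are assumed values for illustration, not measured properties of any particular test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a diagnostic test.

    PPV: probability a patient has the disease given a positive result.
    NPV: probability a patient is disease-free given a negative result.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed inputs for illustration only.
ppv, npv = predictive_values(sensitivity=0.6, specificity=0.98, prevalence=0.5)
print(round(ppv, 3), round(npv, 3))
```

Under those assumptions a positive result is right about 97% of the time, while a negative result is right only about 71% of the time, which is the _shrug_.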

Beyond that, no, it's not likely to be a virus treated in a similar manner. If
by manner you mean "drugs". Oseltamivir (aka Tamiflu) only works against
influenza, so if you have RSV or some other virus, it won't do you much good.
The other way to handle viral diseases is to let them run their course and
manage the complications, at which point a strictly accurate diagnosis isn't
particularly necessary.

TL;DR: It's very hard, fairly expensive, and the payoff per-patient isn't
strong.

------
nowarninglabel
The actual study:
[http://www.sciencemag.org/content/343/6176/1203](http://www.sciencemag.org/content/343/6176/1203)

Gist of it is, Google Trends overestimated flu cases by about double for 100
out of 108 weeks, according to the paper.

Edit: There is actually an interesting podcast available on it without a
paywall:
[http://podcasts.aaas.org/science_podcast/SciencePodcast_1403...](http://podcasts.aaas.org/science_podcast/SciencePodcast_140314.mp3)

~~~
pyre
> Google Trends overestimated flu cases by about double for 100 out of 108
> weeks

Well, there's a simple solution to that... Just halve all of the results from
the current algorithm!

------
simonlebo
This article argues that the system overestimated the cases (which it did) and
therefore did not work. To reproduce the results of traditional systems is not
the purpose of systems like GFT. If you look at the graphs you can see that
GFT provides an indication of an increase in flu cases before the traditional
systems. This is where it has value; it provides an early indication of an
emerging problem.

~~~
Fomite
To reproduce the results of traditional systems is _exactly_ the purpose of
Google Flu trends. This is evident when you go back and read the original
paper.

The question has always been "Can we match CDC estimates faster and with
different information?"

The emerging answer is: "No".

------
nraynaud
I would not stone them, they tried, it doesn't work well, they can fix it or
scrap it. At least they tried to use their technology for something positive
this time. It's one thousand times better to fail when trying to predict the
flu epidemics than successfully using technology to bomb weddings.

~~~
rm999
> It's one thousand times better to fail when trying to predict the flu
> epidemics than successfully using technology to bomb weddings.

That's a pointless and irrelevant comparison.

There's a cost to false positives, e.g. creating more vaccinations than
necessary; fortunately I don't think the CDC or any other government
organizations made any policy decisions based on Google Flu. As a person who
works with a lot of data, I think applications like Google Flu have a lot of
potential to do a lot of good. But it sounds like Google wasn't properly
rigorous in their experiments if Google Flu is such a failure.

------
mdisraeli
This is an interesting and much-needed discussion; however, it too completely
ignores a key difference:

CDC tracks hospitalisation based on lab reports, GFT tracks search terms

The assumption, which admittedly Google is making, is that these two are
correlated by the same amount each year. However, in periods of economic
downturn without free health care, people are going to search more and seek
actual medical assistance less. Given the global economy has been shot for the
last six years....

~~~
ChuckMcM

> CDC tracks hospitalisation based on lab reports, GFT tracks search terms

This is the key difference, and after 5 years it is clear that this is not a
valid way to predict the flu, or even monitor it. I think it was a great
experiment, and thought it was pretty creative when announced, and now it
should either be retired or replaced with a new hypothesis. Because as the
Mythbusters would say, this myth is busted.

~~~
mdisraeli
I disagree, but it depends what you mean by "the flu". If you mean
hospitalisations, then yes, you're entirely right. If you mean lost
productivity and population health outside of hospital, however, perhaps we
need a better means to track this other than indirectly via either current
stat.

~~~
Fomite
The CDC also uses an outpatient physician network to report cases of
influenza-like illness in the community.

~~~
mdisraeli
I didn't know that, thank you! I went digging to try and find exactly what
figure Google were using, and exactly how CDC came up with it, but I couldn't
manage to pin them down...

------
colin_mccabe
This seems like a case of circular reasoning. "Google's results are not as
good as the CDC's because Google's results are different than the CDC's." But
why do we believe that the CDC estimates are better?

I have never gone to the doctor when I had a flu. If it had lasted more than
one or two days, or if I had a temperature higher than 102, then I would have
gone. But as it was, nobody ever knew that I had the flu except myself and my
family.

Even if I had wanted to go to the doctor for every time I got a flu or a cold,
it takes at least a day or two to schedule a doctor's appointment. By the time
the appointment had rolled around I would probably have cancelled, since the
flu doesn't usually last very long.

This being the US, there are also a lot of uninsured people who can't go to
the doctor even if they wanted to. For those people, Google is the only
option. The CDC can collect as much data as it wants from the laboratories,
but you can't collect what isn't there.

When you add in the uninsured people and the people who get better before
visiting the doctor, it's not at all surprising that the lab estimates are 2x
lower than reality. In fact, I would kind of expect them to be even lower than
that.

~~~
scottfr
The Google model is designed to predict the CDC levels, not predict actual flu
prevalence (which is unknown).

Error in the Google model is by definition the difference from the CDC levels.

~~~
Fomite
This. A failure to match the CDC's confirmed cases _is_ a failure of the
Google model, because that's what it's trying to match.

It also fails fairly spectacularly on a local level when compared to health
department data from some top flight state agencies, like NY.

~~~
colin_mccabe
The CDC's model "fails pretty spectacularly" when it comes to capturing any of
the times I've had the flu. I have had the flu several times and the CDC has
never been aware of it. As I explained earlier, I never told my doctor that I
had the flu or went to a lab. My local health department was never aware of
these cases either. It might, however, have appeared on Google flu trends,
since I might have searched for what the impact of a certain level of
fever was. It's been a few years so I don't remember if I did or not. And of
course, Google has never published which search keywords they use.

There are three datasets here: the CDC's, Google's, and reality. Some folks
here seem unable to differentiate between the CDC data and reality. But in
fact, I really did have the flu, even though CDC didn't think I did. Reality
is what matters.

Maybe there is reason to believe that CDC's data is closer to reality than
Google's. If so, let's hear it.

~~~
Fomite
The key is, like all population studies, the CDC doesn't have to know about
_you_. It has to know about some people like you, and honestly, if you were
well enough to not need to go to the doctor, it's also not really the
influenza public health authorities are worried about.

The CDC is unapologetic about their data being estimates - but they're very
solid estimates, and match more intensive but smaller scale studies pretty
well. But we don't need to capture every single flu case - no surveillance
system will ever do that, nor need to.

~~~
colin_mccabe
I just feel like either I am missing something, or there is a lack of rigor
here. Clearly, sampling introduces errors and under-reporting introduces
errors. What I am looking for is a case (based on data, not hand-waving) that
Google's errors are worse than the CDC's, and so far I'm just not seeing it.
Maybe I am missing something which is obvious to people in the field.

Edit: I see how the CDC's numbers could be a lower bound, but not upper.

------
ffk
A few thoughts...

Is it possible the CDC estimates are off by half? Two main points come to
mind:

* How many Type II errors do doctors commit?

* How many people with the flu do not visit a doctor? People with good health coverage are probably more likely to visit the doctor preemptively when they are feeling unwell, potentially skewing the results.

Also, while Google may be off in the exact measurement when compared to the
CDC, it looks like the shape of the graph correlates with the CDC data.
Overall, this approach looks promising.

~~~
shakethemonkey
Google isn't trying to predict actual flu cases. They're trying to predict the
CDC estimates.

------
gwern
Ironically, the original paper (fulltext) fell short of the front page in both
submissions:
[https://news.ycombinator.com/item?id=7405286](https://news.ycombinator.com/item?id=7405286)
and
[https://news.ycombinator.com/item?id=7422496](https://news.ycombinator.com/item?id=7422496)

/newest can be so random.

