

The Big Data Backlash - martingoodson
https://plus.google.com/104306562254349871206/posts/BAaHRiEroj5

======
pyduan
I feel Dr. Goodson is fighting a straw man here. Nowhere does the original
article make the statement that Google used an "unbelievably complex model
that no one could ever understand", or that we had _no_ understanding of why
GFT worked (merely that the assumptions were not made explicit).

What it _did_ say though was that "Big Data practitioners" (whatever that
means) too often tend to make the mistake of ignoring threats to external
validity [1] because dealing with big datasets gives researchers a false sense
of confidence in the assumption that "N = all". This is a very valid point and
IMHO the most important takeaway of the original article, but it is not
addressed here.

On a side note, I feel the pothole example (where using a mobile app to
crowdsource pothole detection in Boston unknowingly led to a bias towards
younger and more affluent areas) would have been a more relevant example for
discussing that thesis than the details of GFT.

What Tim Harford (the author of said article) denounces is the recent trend
that sees people claiming Big Data is "the end of theory" [2], where
traditional theories and hypothesis testing are being replaced by theory-free
procedures such as validating metrics against a hold-out set. The problem is
that while previously the relative scarcity of data led social science
researchers to carefully consider their data sources, the abundance of
passively generated data we have now tends to cause them to forget to do their
due diligence when assessing the threats to the external validity of their
findings. Such concerns normally arise when you're deciding on your data
collection methodology or your experiment design, but Big Data analyses do
tend to be different in that they focus much more on finding latent
relationships in existing raw data. This also increases the risk of
unknowingly falling prey to the multiple comparisons problem [3], which is the
other important point Harford touches on but isn't really addressed in this
post.
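
To make the multiple comparisons point concrete, here is a minimal sketch in
Python (entirely synthetic data, just to illustrate the mechanism): screen
enough unrelated features against an outcome and a handful will look
"significant" purely by chance.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    n_obs, n_features = 1_000, 200
    X = rng.normal(size=(n_obs, n_features))   # 200 features of pure noise
    y = rng.normal(size=n_obs)                 # an outcome unrelated to any of them

    # Correlate each feature with the outcome and collect the p-values.
    p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])

    print((p_values < 0.05).sum())  # typically around 10 "significant" features, all spurious

With 200 noise features and a 5% threshold you expect roughly ten spurious
"discoveries", and the more features you vacuum up, the more of them you get.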

In that respect, yes, Big Data analyses can be more prone to this kind of
problem. Even if you dismiss it as a case of bad scientists rather than bad
science, the fact remains that the general public is much more inclined to
over-trust findings that come out of such analyses, and articles such as
Harford's are important reminders for both practitioners and laypeople to be
careful.

[1]
[http://en.wikipedia.org/wiki/External_validity](http://en.wikipedia.org/wiki/External_validity)

[2] [http://www.theguardian.com/news/datablog/2012/mar/09/big-data-theory](http://www.theguardian.com/news/datablog/2012/mar/09/big-data-theory)

[3]
[http://en.wikipedia.org/wiki/Multiple_comparisons_problem](http://en.wikipedia.org/wiki/Multiple_comparisons_problem)

~~~
martingoodson
_Nowhere does the original article make the statement […] that we had no
understanding of why GFT worked (merely that the assumptions were not made
explicit)._

It’s very difficult to read the following line from the FT article and reach
that conclusion:

“The problem was that Google did not know – could not begin to know – what
linked the search terms with the spread of flu.”

But the authors of the Science paper that Tim Harford refers to did ‘begin to
know’ how Google Flu Trends worked [1]. That’s how they developed several
reasonable suggestions for what caused the over-prediction problem. In
particular, the suggestion that changes in the Google search algorithm caused
a bias in the Flu Trends results could easily be tested [2]. Perhaps we would
use a large dataset to do that. And that’s ok.

The suggestion that statisticians suddenly forget all of their training when
_n_ reaches a certain threshold is a misrepresentation of the facts. There are
bad analyses based on large data sets just as there are bad analyses based on
small data sets. We have tools to deal with large datasets and multiple
comparisons [3]. We don't need to throw our hands in the air and panic.

[1]
[http://www.sciencemag.org/content/343/6176/1203](http://www.sciencemag.org/content/343/6176/1203)

[2] For instance, we could check to see whether the over-predictions of flu
cases started on the same day as the search algorithm change.
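
A minimal sketch of what that check could look like, with made-up residuals
(GFT prediction minus observed cases) standing in for the real series, split
at a hypothetical algorithm-change date:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Made-up daily residuals (prediction minus observed flu cases): roughly
    # unbiased before the hypothetical change date, shifted upwards after it.
    residuals_before = rng.normal(loc=0.0, scale=1.0, size=180)
    residuals_after = rng.normal(loc=0.8, scale=1.0, size=180)

    # A simple two-sample test of whether the mean residual jumped at that date.
    t_stat, p_value = stats.ttest_ind(residuals_after, residuals_before, equal_var=False)
    print(t_stat, p_value)  # a large shift with a tiny p-value would support the hypothesis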

[3] To pick a random example:
[http://en.wikipedia.org/wiki/False_discovery_rate](http://en.wikipedia.org/wiki/False_discovery_rate)
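
For instance, here is a bare-bones sketch of the Benjamini–Hochberg procedure
(the best-known way of controlling the false discovery rate behind that link),
assuming you have already collected the p-values from your many comparisons:

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Return a boolean mask marking hypotheses rejected at FDR level q."""
        p = np.asarray(p_values, dtype=float)
        m = p.size
        order = np.argsort(p)                     # rank p-values from smallest to largest
        thresholds = q * np.arange(1, m + 1) / m  # BH threshold for each rank
        passed = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if passed.any():
            k = np.max(np.nonzero(passed)[0])     # largest rank that clears its threshold
            reject[order[: k + 1]] = True         # reject everything up to that rank
        return reject

    # e.g. benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.22, 0.51], q=0.05)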

~~~
pyduan
Thanks for weighing in, but you are completely misrepresenting my argument by
reducing "Big Data" to "data that reaches a certain n" and responding to
claims I did not make. As you are probably aware, doing these kinds of
analyses requires more than pure statistics; it also requires a solid
understanding of good experiment design, and this is precisely what I was
arguing is at higher risk of breaking down in these types of analyses.

To clarify my previous post: what I was referring to (and what I believe is
commonly meant) when talking about Big Data is a specific, albeit vaguely
defined, trend of analysis that tends to focus on:

a) mining data out of large, _unstructured_ existing datasets

b) leveraging data that has been passively generated, i.e. data that is a
byproduct of normal activity rather than the result of a conscious experiment
design decision

c) maximizing predictive power as opposed to validating a theory

And yes, there are many ways to do bad analyses on small data sets, but that
is not the point I (or, I believe, Harford) was making. The point is that
these types of analyses, by their nature, tend to require additional care
regarding external validity, because:

a) many make the mistake of thinking that a big n means you don't need to
worry about sampling biases, i.e. what Harford was referring to as the "N =
all" fallacy and what I believe was his main point; you'll agree it tends to
be more common in "big n" analyses (the small simulation after this list
illustrates why a big n doesn't help here)

b) since data collection is not an issue, the real challenge is in data
_cleaning_, which requires special care because you need to think about
potential biases in the way the data was generated (a process you had no
control over); this can be trickier than it sounds when exploring large
datasets that were passively generated, because _every_ feature is potentially
subject to these biases and some are less than obvious. The Boston case was a
good example, but now consider that in many real-world datasets almost all
features are subject to similar considerations (and may all be subject to
different biases)

c) the focus on predictive power when using theory-free metrics leads to a
risk of overfitting when the possible sources of heterogeneity are not
understood (i.e. the assumptions are not made explicit)

d) since they've been optimized for predictive power, they give a false sense
of security ("it worked on the validation set!"); this is compounded by the
fact these models will often work for a while (as is the case in GFT) before
breaking down [1]

e) since there is a stronger focus on exploratory analysis, addressing the
multiple comparisons problem is not as trivial as you make it sound; the
issue is not the statistical tools we have at our disposal [2] but making all
your assumptions explicit, which is trickier in the exploratory phase (because
by doing this initial phase, you are already implicitly dismissing or
selecting relationships to study)

f) since these analyses tend to be very application-oriented and to function
at a large scale, mistakes have the potential to be much more destructive
(for instance, false positives in the Target case). This is compounded by the
fact that due to the technical challenges in handling complicated data, and
because applications are often found in tech companies, many people who do
these analyses come from a computer science background and are not necessarily
well trained in statistics or econometrics
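
Here is the small simulation promised under a), with entirely made-up numbers:
a passively collected dataset that over-samples one group stays biased no
matter how many records you accumulate; the big n just makes the wrong answer
look very precise.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical population of 1,000,000 people; the true rate of some outcome is 50%.
    population = rng.binomial(1, 0.5, size=1_000_000)

    # Passive data collection that over-represents one group: people with the outcome
    # are twice as likely to end up in the dataset as people without it.
    inclusion_prob = np.where(population == 1, 0.6, 0.3)
    observed = population[rng.random(population.size) < inclusion_prob]

    print(observed.size)    # ~450,000 records: a comfortably "big" n
    print(observed.mean())  # ~0.67, far from the true 0.5, no matter how large n grows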

Again, none of these are insurmountable; no one is actually dismissing Big
Data analyses as a whole, but they present some unique opportunities for
screwing up.

[1] Incidentally, this is precisely why I said earlier that discussing the
details of GFT seemed only tangential to the point: yes, the Google
researchers were well aware of the limitations of the method, and so is
Harford ("Google Flu Trends will bounce back, recalibrated with fresh data –
and rightly so"). The relevant point is _not_ whether the Google researchers
were right, but the false sense of certainty it instills in the consumers of
the research, something I also made explicit at the end of my last post.

[2] Although some make a pretty good case that it is, that's beyond the scope
of this comment: [http://www.nature.com/news/scientific-method-statistical-errors-1.14700](http://www.nature.com/news/scientific-method-statistical-errors-1.14700)

Edit: I forgot to address the first part of your reply. While of course the
researchers knew that people searched for these terms because they were
concerned about their health, what is missing is why these _specific_ queries are
important: you may well find that some of these queries are more related to
general concern, while some are specifically about treatment options, and
others about vaccination. These may not evolve at the same time and in the
same direction, and while they may have been indistinguishable in the past,
it's entirely possible the first type of queries will be disproportionately
affected by changes in the Google algorithm vs. others, or that some of the
assumptions are only valid for one type of query and not the others. In
economics (which is Harford's background), this is often considered
insufficient when deciding whether to add a variable to a model.

~~~
mturmon
Thanks for this respectful, detailed, and analytical contribution.

I agree that "Big Data" is a tendency that is worth talking about as if it is
a new thing. There are edge cases that reside on the border between
conventional moderate-n statistical analysis, and large-n "vacuum up lots of
data and try to extract correlative information" approaches. Examples like the
pothole-location collection and GFT are emblematic of something new that's
well beyond this fuzzy boundary.

So it's not surprising that there are new issues. Some of the fixes may be
old-fashioned, but some may not.

We should also face the fact that the people doing this work often don't have
any formal statistical training, so it's on the community to highlight the
important pitfalls.

------
onion2k
" _I suggest that ‘Big Data’ analyses are no more prone to this kind of
problem than any other kind of analysis._ "

The notion that 'big data' is just as susceptible to bad statistical analysis
if you ignore dynamics in the incoming data is entirely true. In that sense
'big data' is the same as every other kind of analysis. But there is another
notable difference between 'big data' analysis and any other kind: the
marketing. 'Big data' solutions are often sold as something you can just plug
in to your infrastructure, set up some data feeds, and out pops an insightful
trend analysis telling you things about your customers that you could never
have understood with Excel alone. That is what's wrong with 'big data'.

Used properly, by an intelligent statistician ("data scientist") who knows
their field well, big data tools are very useful for doing the things that
statisticians do, only faster. _But that is all._ Big data tools don't
magically do clever statistical analysis on their own, and they are not a
replacement for statisticians, despite what some people seem to think.

~~~
walshemj
It just makes practical analyses that would previously have been too costly.
I suspect most of the backlash is from MBA students who could make a
spreadsheet draw pretty pictures but crash and burn when required to do any
real work.

By that argument, Lyons' tea shops should never have built LEO - after all,
they had a manual system that could track the cost of a bun down to fractions
of a farthing (1/4 of a penny).

------
crayola
"I suggest that ‘Big Data’ analyses are no more prone to this kind of problem
than any other kind of analysis."

To an extent, large data volumes make it more difficult for the statistician
to be as nimble. Trying different algorithms, different specifications,
different ways to approach the data is part of the statistical workflow; not
everything can be easily parallelized and run on a Hadoop cluster.

There are insights a statistician can obtain quickly (in a few hours) from a
carefully selected random sample of a few million observations, in memory, in
a single R or Python process. The same analysis on the complete,
multi-terabyte data would be rather more painful or costly to obtain.

Of course data scientists such as Martin Goodson know that (though their
bosses do not always) and are used to doing exploratory analysis or
prototyping on samples that fit in RAM.
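
A rough sketch of that workflow in Python (the filename and the 1% sampling
rate are placeholders): stream the large file once, keep a random slice, and
do the exploratory work on the slice in memory.

    import pandas as pd

    # Hypothetical multi-gigabyte CSV: stream it in chunks and keep ~1% of the
    # rows so the exploratory sample fits comfortably in a single process.
    sampled_chunks = []
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        sampled_chunks.append(chunk.sample(frac=0.01))
    sample = pd.concat(sampled_chunks, ignore_index=True)

    print(sample.describe())  # quick, in-memory look at the sampled data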

~~~
einhverfr
It's not just volume but variety too. Most big data solutions are intended to
handle large varieties of data as well as large volumes.

Once you get there, all bets are off.....

------
intslack
The "Big Data Backlash" isn't anything new: Nassim Taleb's conversation about
it in his last book, in which he presents a mathematical proof that noise-to-
signal increases exponentially with data, is scathing.

"Big data" means anyone can find fake statistical relationships, since the
spurious rises to the surface.

[http://www.wired.com/2013/02/big-data-means-big-errors-people/](http://www.wired.com/2013/02/big-data-means-big-errors-people/)

------
tormeh
But not all big data algorithms are simple.

