
The Easiest Data Analysis Mistake to Make - lejohnq
http://blog.statwing.com/easiest-data-analysis-mistake-to-make/
======
lpage
This is a very simple and very pathological example that's easily ferreted out
with a few more summary statistics (median, min, max) but it's a good
illustration of the blind application of statistics. Short of visualization,
non-parametric statistics really help with such things. Correlation is a
fragile, linear measure, and things that are obviously correlated by
inspection can easily appear mathematically uncorrelated -- points on a unit
circle, for example. Likewise, the mean of any skewed distribution tells you
very little, but that's the statistic that's always cited. Quantiles, medians,
and non-parametric measures of correlation such as rank correlation are simple
and often overlooked. They do a good job screening for pathological data sets
like Anscombe's quartet as well as real-world ones.
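
A rough numpy sketch of the points above, not anything taken from the article. The sample sizes, the lognormal parameters, and the `rank_correlation` helper are made up for illustration. Note that rank correlation rescues the monotone-but-nonlinear case, not the circle, which still needs a plot.

```python
import numpy as np

# Points on a unit circle: x and y are tightly related, but not linearly.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle_x = np.cos(theta)
circle_y = np.sin(theta)
circle_pearson = np.corrcoef(circle_x, circle_y)[0, 1]


def rank_correlation(a, b):
    """Spearman-style rank correlation; assumes no tied values."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]


# A monotone but strongly nonlinear relationship.
mono_x = np.linspace(0.0, 5.0, 50)
mono_y = np.exp(mono_x)
mono_pearson = np.corrcoef(mono_x, mono_y)[0, 1]
mono_rank = rank_correlation(mono_x, mono_y)

# A skewed sample: the mean is dragged upward by the long right tail.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)
skewed_mean = skewed.mean()
skewed_median = np.median(skewed)

print("circle, Pearson r:   ", circle_pearson)
print("monotone, Pearson r: ", mono_pearson)
print("monotone, rank corr: ", mono_rank)
print("skewed mean / median:", skewed_mean, skewed_median)
```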

It's also worth mentioning "dumbbell" data sets. Two clusters of data, each of
which has an independent, meaningful correlation in it, can easily leverage
a linear regression into a meaningless line passing through the two clusters.
That's a pretty common issue with high dimensional data (obviously you can see
it in a 2D scatter plot), and it's not easily caught short of looking at
regression diagnostic statistics.
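
A minimal sketch of the dumbbell effect on synthetic data. The cluster centers, slopes, noise levels, and the `make_cluster` helper are all assumptions chosen to make the effect obvious. Each cluster has a within-cluster slope near -1, yet the pooled fit mostly follows the line joining the two cluster centers.

```python
import numpy as np

rng = np.random.default_rng(42)


def make_cluster(center_x, center_y, slope, n=200):
    """Hypothetical helper: a noisy linear cluster around a given center."""
    x = center_x + rng.normal(0.0, 1.0, n)
    y = center_y + slope * (x - center_x) + rng.normal(0.0, 0.3, n)
    return x, y


# Two clusters with the same negative internal trend, far apart from each other.
x1, y1 = make_cluster(0.0, 0.0, slope=-1.0)
x2, y2 = make_cluster(10.0, 10.0, slope=-1.0)

# Least-squares slope within each cluster, then on the pooled data.
slope_1 = np.polyfit(x1, y1, 1)[0]
slope_2 = np.polyfit(x2, y2, 1)[0]
pooled_x = np.concatenate([x1, x2])
pooled_y = np.concatenate([y1, y2])
slope_pooled = np.polyfit(pooled_x, pooled_y, 1)[0]

print("cluster 1 slope:", slope_1)
print("cluster 2 slope:", slope_2)
print("pooled slope:   ", slope_pooled)
```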

~~~
couchand
I believe the "dumbbell" effect you are referring to is Simpson's paradox:
[http://en.wikipedia.org/wiki/Simpson's_paradox](http://en.wikipedia.org/wiki/Simpson's_paradox)

As you point out, _obviously you can see it in a 2D scatter plot_, but you
have to select the correct two variables.

------
brian_peiris
I think, typically, if you've gone to the trouble of calculating variance and
correlation, you would have also calculated the median and mode of these
datasets. The differences would have been obvious with those basic analyses.

~~~
velis_vel
How do you define the mode of a data set where all the values are different?

------
carlosgg
[http://en.wikipedia.org/wiki/Anscombe's_quartet](http://en.wikipedia.org/wiki/Anscombe's_quartet)

------
mrcactu5
Statwing looks great!

Right now I do my data analysis in numpy, but this looks good for my
Excel-based colleagues.

What library is doing the statistics?

~~~
lejohnq
Thanks, really appreciate it! We use numpy, scipy, and pandas.

EDIT: clarified libraries we use.

------
timruffles
So a concept you find in a beginner stats textbook is now 'news'? Definition
of blogspam surely...

~~~
wiresurfer
I would agree with you. Anyways I hope people were not making such errors and
if we were, then at least someone would benefit from reading this!

------
davidmanescu
My favourite part of this (aside from the message) is that it links back to
this exact discussion page.

------
ejain
An equally common mistake is to visualize without analyzing :-)

------
medagan
who doesn't look at the data range, min, max, mode???

~~~
sesqu
Range and mode won't save you from the quartet, and they're a red herring in
any case (you'd need to add statistics until you go blue in the face).
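
For anyone who wants to poke at it, here is a small numpy sketch that tabulates a few summary statistics for the quartet. The values are the ones from the Wikipedia page linked elsewhere in the thread, and the choice of which statistics to print is mine. The means and correlations come out nearly identical across all four sets; min and max are printed alongside so you can judge how much they add.

```python
import numpy as np

# Anscombe's quartet; sets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    summary[name] = {
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),          # sample variance
        "pearson_r": np.corrcoef(x, y)[0, 1],
        "min_y": y.min(),
        "max_y": y.max(),
    }

for name, stats in summary.items():
    print(name, {key: round(float(value), 3) for key, value in stats.items()})
```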

