
Datasaurus: Never trust summary statistics; visualize your data (2016) - tosh
http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html
======
roland35
The datasaurus is an interesting concept, but I think the biggest issue with
most of our society is basic mathematical fluency. This idea of "visualizing
your data" or "sanity checking" can be expanded to a lot of situations outside
of statistics.

In software engineering and electrical engineering this comes up all the time.
Does it make sense for this log file to be 5 GB? Is it really 10 amps going
into this shunt resistor? Being able to "sanity check" things mentally is an
important skill.

------
supernova87a
The dinosaur drawing is a little gimmicky; it invites a "so what do you do with this?".
Far better and more applicable to real life is the lesson in Anscombe's
quartet:
[https://en.wikipedia.org/wiki/Anscombe%27s_quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
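
For readers who want to check the quartet's claim themselves, here is a minimal sketch using only the Python standard library. The data values are the ones commonly reproduced from Anscombe's 1973 paper; the names `QUARTET`, `summarize`, and `SUMMARIES` are made up for this illustration.

```python
import math

# Anscombe's quartet, as commonly reproduced from Anscombe (1973).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (mean_x, mean_y, sample variance of x, Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_x, mean_y, sxx / (n - 1), sxy / math.sqrt(sxx * syy)


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (mx, my, vx, r) in SUMMARIES.items():
    print(f"{name:>3}: mean_x={mx:.2f} mean_y={my:.2f} var_x={vx:.2f} r={r:.3f}")
```

All four rows come out nearly identical (mean of x 9, mean of y about 7.5, variance of x 11, r about 0.816), even though the four scatter plots look nothing alike.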

It's all the more important to have a trained skepticism of stats when every
day you hear things like, "we just had the greatest quarter of economic growth
in history!"

~~~
1wheel
There's a dinosaur version of Anscombe's Quartet:

[https://www.autodeskresearch.com/publications/samestats](https://www.autodeskresearch.com/publications/samestats)

~~~
supernova87a
oh, very nice!

------
nabla9
Eyeballing a 2D plot is an important statistical tool.

If it looks like a blob, the correlation coefficient is usually meaningless.
You can instantly see whether the relationship is linear, quadratic, piecewise
linear or quadratic, a mixture, clustered, or even a dinosaur.
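
A small sketch of why the shape matters more than the coefficient, assuming a toy dataset invented for this example: a perfect quadratic relationship on symmetric x values gives a Pearson correlation of zero, while a straight line gives one. The helper `pearson_r` is written out by hand here rather than taken from any library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


xs = list(range(-5, 6))           # symmetric around zero
ys_linear = [2 * x + 1 for x in xs]
ys_quadratic = [x * x for x in xs]  # y is fully determined by x

r_linear = pearson_r(xs, ys_linear)
r_quadratic = pearson_r(xs, ys_quadratic)
print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

The quadratic case is completely deterministic, yet r reports no linear association at all; a plot would show the parabola immediately.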

~~~
jdonaldson
Except scatter plots do not handle occlusion. If you have a lot of data to
plot, the visible shape of the blob might be just the outliers.

~~~
Petefine
This might be a problem, but if you're careful to make use of transparency,
axis scaling and random jitter (for integer or categorical values), the
occlusion issue can be overcome.
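
As a rough illustration of the jitter part, here is a standard-library sketch. The function name `jitter`, its `width` parameter, and the made-up `ratings` data are assumptions for this example; transparency would be set separately in whatever plotting library is used (for instance the `alpha` argument of matplotlib's `scatter`).

```python
import random


def jitter(values, width=0.4, seed=None):
    """Add uniform noise in [-width/2, +width/2] to each value.

    Intended for integer or categorical codes, so that repeated
    values no longer land on exactly the same plot position.
    """
    rng = random.Random(seed)
    half = width / 2
    return [v + rng.uniform(-half, half) for v in values]


# Hypothetical data: 500 integer ratings on a 1-5 scale.
ratings = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5] * 50
jittered = jitter(ratings, width=0.4, seed=0)

print(len(set(ratings)), "distinct positions before jitter")
print(len(set(jittered)), "distinct positions after jitter")
```

Keeping `width` below the spacing between adjacent values (1 here) means the jittered points still read as belonging to their original category.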

~~~
jdonaldson
* in some cases

------
hammock
"Never trust summary statistics" is a misnomer. Data is lost in the creation
of summary statistics; that's basically a tautology. His example could look
like a fuzzy line, or a dino, or a cross, or anything else. These facts don't
refute the usefulness of summary statistics, however. At issue here is the
competency and good faith of the ANALYST.

If the summary statistics are chosen and presented in a way that is useful, in
context, and not misleading, then there is nothing wrong with omitting the
full data visualization. If some of these pieces are missing though, then
yeah, check the data.

------
justin_oaks
Whenever I see single number statistics thrown around in news articles, press
releases, ads, etc., I'm generally pretty skeptical. I want to see the
distribution. Is it Gaussian? Bimodal? Flat? Apparently random?

The shape of the data tells you a lot that isn't captured in a single number
like a mean or median.
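
A quick sketch of that point, using two small made-up samples (`unimodal` and `bimodal` are invented for this example) that share the same mean and median but have very different shapes. The helper `text_histogram` is a hypothetical stand-in for a real plot.

```python
from collections import Counter
from statistics import mean, median

# Two invented samples with identical mean and median.
unimodal = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
bimodal = [1, 1, 1, 2, 5, 5, 8, 9, 9, 9]


def text_histogram(values):
    """Return one line per integer value, with a bar of '#' per count."""
    counts = Counter(values)
    return [
        f"{v:>2} | {'#' * counts[v]}"
        for v in range(min(values), max(values) + 1)
    ]


for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    print(f"{name}: mean={mean(sample)} median={median(sample)}")
    print("\n".join(text_histogram(sample)))
    print()
```

Both samples report a mean and median of 5, but the histograms show one peak in the middle versus two clusters at the extremes.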

Also, what's the sample size? What was the methodology used to acquire the
data?

Lots of statistics fall apart when you look at how the data was collected.

------
wodenokoto
Matt Parker (famous from his YouTube channel Stand-up Maths and a frequent
guest on Numberphile) did a video on it and visited the inventor:

[https://youtu.be/iwzzv1biHv8](https://youtu.be/iwzzv1biHv8)

