
Same Stats, Different Graphs: Datasets with Varied Appearance and Identical Stats - bryceroney
https://www.autodeskresearch.com/publications/samestats
======
michaelmior
I don't think the title does a great job of identifying how cool this is.
Having the animations especially is great!

~~~
jmatejka
Thanks! I'm the author of the paper, glad you like it, and especially glad
that you think the animations are cool :-)

~~~
toth
Congrats on the paper, very cool work. I bet the "Datasaurus dozen" is going
to become the go-to example for the dangers of relying on summary statistics;
it is much more memorable than Anscombe's quartet.
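The point is easy to verify for Anscombe's quartet with nothing but the standard library (the numbers below are Anscombe's published data; the `corr` helper is just a plain Pearson correlation, written out since `statistics` doesn't provide one before 3.10):

```python
from statistics import mean, variance

# Anscombe's quartet: four (x, y) datasets with near-identical summary stats
# but wildly different shapes when plotted.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

def corr(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

for x, y in zip(xs, ys):
    # mean(y) ~ 7.50, variance(y) ~ 4.12, corr ~ 0.816 for all four
    print(round(mean(y), 2), round(variance(y), 2), round(corr(x, y), 3))
```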

~~~
jmatejka
Thanks! And here's hoping :-)

------
wyldfire
> Boxplots are commonly used to show the distribution of a dataset, and are
> better than simply showing the mean or median value. However, here we can
> see as the distribution of points changes, the box-plot remains the same.

Violin plots [1][2] are a great spin on boxplots that help show the
distribution.

[1]
[https://en.wikipedia.org/wiki/Violin_plot](https://en.wikipedia.org/wiki/Violin_plot)

[2]
[http://seaborn.pydata.org/generated/seaborn.violinplot.html](http://seaborn.pydata.org/generated/seaborn.violinplot.html)
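To illustrate the idea behind a violin plot without pulling in seaborn: the violin's outline is just a kernel density estimate, and two made-up datasets can share an identical five-number summary (so identical boxplots) while their densities differ sharply. A minimal stdlib-only sketch, with `five_number` and `kde` as hypothetical helpers:

```python
import math
from statistics import quantiles

# Two datasets with the same five-number summary (min, Q1, median, Q3, max),
# hence identical boxplots -- but very different density shapes.
a = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # evenly spread
b = [1, 3, 3, 3, 5, 7, 7, 7, 9]  # clustered around 3 and 7

def five_number(data):
    """min, quartiles, max -- everything a boxplot shows."""
    return [min(data)] + quantiles(data, n=4, method="inclusive") + [max(data)]

def kde(data, x, h=0.8):
    """Gaussian kernel density estimate -- the curve a violin plot draws."""
    n = len(data)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / (
        n * h * math.sqrt(2 * math.pi)
    )

print(five_number(a) == five_number(b))  # same boxplot
print(kde(a, 3.0), kde(b, 3.0))          # different violin: b is denser at 3
```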

~~~
SamBam
But a violin plot still wouldn't distinguish between many of the plots in this
set. All of the plots that don't have multiple vertical stacks -- so, all the
horizontal lines, diagonal lines, circles, and scatter -- will have the same
or almost the same violin plot.

You could rotate the violin plot to show how the density changes on the other
axis, for those graphs where that would be useful, but that requires looking
at the data visually and making decisions, which is the whole point of the
article.

~~~
wyldfire
Agreed. But I was responding specifically to the "More Examples" section,
which shows three examples for which the boxplots are ambiguous but violin
plots would not be.

In general, scatter plots can be very valuable but difficult to compare among
trials. Violin plots help mitigate that, though you're correct that it comes
at a cost.

------
pmoriarty
I'm wondering how the general public can be educated about this when so many
people are unaware even of the difference between the median and the mean.

In the mass media especially, the mean is often bandied about as the only
statistic, and treated as if it were definitive.

~~~
rodionos
I think the public is largely aware of the difference since housing prices are
often reported using the median. Intuitively, everyone knows that median
housing price is the price of a 'regular' (most frequently sold) house in a
given area.

~~~
xmj
The most frequently encountered sale price would be the mode.

The median is the value at which 50% of houses trade higher and 50% lower.
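The distinction is easy to see with a made-up toy sample (prices in thousands; the numbers are illustrative, not real data), where one expensive sale drags the mean far from the other two statistics:

```python
from statistics import mean, median, mode

# Hypothetical sale prices in thousands; one mansion skews the mean upward.
prices = [300, 300, 320, 400, 2000]

print(mode(prices))    # most frequently sold price
print(median(prices))  # half sold for more, half for less
print(mean(prices))    # pulled far upward by the single 2000 sale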

------
eggie5
You see this a lot in the context of building a regression, where oftentimes
the assumptions are violated:

* Linear relationship between predictor and response, and no multicollinearity
* No auto-correlation (statistical independence of the errors)
* Homoscedasticity (constant variance) of the errors
* Normality of the residual (error) distribution

As the paper suggests, plotting the data will help you catch violations of
these assumptions, but making sure you don't violate them with statistical
tests would work too. For example, you can look at your residuals (loss) as an
indicator of good fit. If your residuals do not follow a normal distribution,
this is typically a warning sign that your R2 score is dubious.

There are a few statistical tests for residual normality; in particular, the
Jarque-Bera test is common and available in SciPy.

So, I would argue, you don't even need to visualize the data. I describe this
more here: [http://www.eggie5.com/104-linear-regression-assumptions](http://www.eggie5.com/104-linear-regression-assumptions)
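The Jarque-Bera statistic itself is small enough to write out by hand (`scipy.stats.jarque_bera` packages the same computation); this is a plain-Python sketch with a made-up, perfectly symmetric residual sample, not anyone's production code:

```python
import math

def jarque_bera(residuals):
    """Jarque-Bera statistic: JB = n/6 * (S**2 + (K - 3)**2 / 4), where S is
    sample skewness and K sample kurtosis. Under normality JB is approximately
    chi-squared with 2 degrees of freedom (5% critical value ~5.99), so large
    values suggest non-normal residuals."""
    n = len(residuals)
    mu = sum(residuals) / n
    m2 = sum((r - mu) ** 2 for r in residuals) / n
    m3 = sum((r - mu) ** 3 for r in residuals) / n
    m4 = sum((r - mu) ** 4 for r in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

# Symmetric toy residuals: zero skew, only a (flat-tails) kurtosis term.
resid = [-2, -1, 0, 1, 2] * 10
print(jarque_bera(resid))  # well below 5.99: no evidence against normality
```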

~~~
christopheraden
Sure, using hypothesis tests could pick out some of the structured examples in
the Datasaurus set, but in practice, things are often more subtle.
Goodness-of-fit tests to check for normality, in particular, are a bit thorny:
they lack power at small sample sizes, and reject normality for slight
departures at large sample sizes. My experience with assumption checking has
been that by the time a hypothesis test has sufficient evidence to reject an
assumption, you'd usually be able to see it visually.

Until you get into high dimensions, it probably doesn't hurt too much to
visualize the data. Additionally, it can be helpful to understand what signal
has been left in the residuals (ex: you fit a linear model, but failed to
include a quadratic term), which is something hypothesis tests aren't as good
at telling you.

~~~
eggie5
Yes, "lacking power in small sample sizes"

------
murkle
You can see the statistics for each data set here (in GeoGebra)
[https://www.geogebra.org/m/K8HCdE6X](https://www.geogebra.org/m/K8HCdE6X)

------
tjwei
Awesome project, I really like it, so this is my attempt to achieve a similar
result with a different approach.

I feel it might be a useful example for illustrating the intuition behind ZCA
and the Wasserstein metric.

[http://nbviewer.jupyter.org/github/tjwei/Animation-with-Identical-Statistics/blob/master/Animation%20with%20Identical%20Statistics.ipynb](http://nbviewer.jupyter.org/github/tjwei/Animation-with-Identical-Statistics/blob/master/Animation%20with%20Identical%20Statistics.ipynb)

------
crb002
Good use case for n-dimensional probability density functions. Query a region
by a hypercube.

------
squeakynick
Awesome article. I just read it with my morning coffee, and I have a big grin.

~~~
jmatejka
Thanks! Glad you liked it :-)

