Hacker News
Same Stats, Different Graphs: Datasets with Varied Appearance and Identical Stats (autodeskresearch.com)
126 points by bryceroney on May 2, 2017 | hide | past | favorite | 23 comments

I don't think the title does a great job of identifying how cool this is. Having the animations especially is great!

Thanks! I'm the author of the paper, glad you like it, and especially glad that you think the animations are cool :-)

Congrats on the paper, very cool work. I bet the "Datasaurus Dozen" is going to become the go-to example for the dangers of relying on summary statistics; it is much more memorable than Anscombe's quartet.
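For readers who haven't met Anscombe's quartet: its four y-series (the standard published values, reproduced below) share nearly identical means and variances despite looking completely different when plotted. This is easy to verify with nothing but Python's standard library:

```python
import statistics

# The four y-series of Anscombe's quartet (standard published values).
quartet = {
    "I":   [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II":  [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV":  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

for name, ys in quartet.items():
    # Every series has mean ~7.50 and variance ~4.12, yet set I is a
    # loose linear trend, II a curve, III a line with one outlier, and
    # IV a vertical stack with one outlier.
    print(name, round(statistics.mean(ys), 2), round(statistics.variance(ys), 2))
```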

Thanks! And here's hoping :-)

Yes, absolutely well done. The animations are the clearest and best example I've ever seen of how these stats can be screwed with in technically accurate but misleading ways.

Thank you very much.

It was hard to fit a title into 80 characters on my first HN post :)

I agree, very cool.

> Boxplots are commonly used to show the distribution of a dataset, and are better than simply showing the mean or median value. However, here we can see as the distribution of points changes, the box-plot remains the same.

Violin plots [1][2] are a great spin on boxplots that help show the distribution.

[1] https://en.wikipedia.org/wiki/Violin_plot

[2] http://seaborn.pydata.org/generated/seaborn.violinplot.html
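To see what the violin adds over the box: a violin plot's outline is just a kernel density estimate. A minimal sketch with scipy's gaussian_kde on synthetic bimodal data (my own toy example, not data from the paper) shows the density revealing two modes that the five-number summary behind a boxplot would hide:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Synthetic bimodal sample: two well-separated clusters.
data = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])

# A boxplot reduces this to five numbers; the median lands near 0,
# where there is almost no data at all.
print(np.percentile(data, [0, 25, 50, 75, 100]))

# The KDE -- the curve a violin plot draws -- exposes the two modes.
kde = gaussian_kde(data)
print(kde([-3.0, 0.0, 3.0]))  # high, near-zero, high
```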

But a violin plot still wouldn't distinguish between many of the plots in this set. All of the plots that don't have multiple vertical stacks -- so, all the horizontal lines, diagonal lines, circles, and scatter -- will have the same or almost the same violin plot.

You could rotate the violin plot to show how the density changes on the other axis, for those graphs where that would be useful, but that requires looking at the data visually and making decisions, which is the whole point of the article.

Agreed. But I was specifically responding to the "More Examples" section, which shows three examples for which the boxplots are ambiguous but violin plots would not be.

In general, scatter plots can be very valuable but difficult to compare among trials. Violin plots help mitigate that, though you're correct that it comes at a cost.

I'm wondering how the general public can be educated about this when so many people are unaware even of the difference between the median and the mean.

In the mass media especially, the mean is often bandied about as the only statistic, and treated as if it were definitive.

I think the public is largely aware of the difference since housing prices are often reported using the median. Intuitively, everyone knows that median housing price is the price of a 'regular' (most frequently sold) house in a given area.

The most regularly encountered house price at sale would be the mode.

The median is the value at which 50% of houses trade higher and 50% trade lower.
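The three averages come apart exactly in skewed data like house prices. A toy illustration (made-up prices, in thousands) using Python's standard library:

```python
import statistics

# Hypothetical sale prices (in $1000s); one mansion skews the mean.
prices = [200, 220, 220, 300, 400, 1200]

print(round(statistics.mean(prices), 2))  # 423.33 -- pulled up by the outlier
print(statistics.median(prices))          # 260.0  -- half trade higher, half lower
print(statistics.mode(prices))            # 220    -- the most frequently sold price
```

Note that all three differ here, which is the distinction being made above: the "most frequently encountered" price is the mode, not the median.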

I'm not at all sure that's true. That is, I agree that many people probably understand that mean and median are both things, but suspect that a minority of them understand how and why they are different, or when and why one makes more sense than the other.

You see this a lot in the context of building a regression, where oftentimes the assumptions are violated:

* Linear relationship between predictor and response, and no multicollinearity
* No auto-correlation (statistical independence of the errors)
* Homoscedasticity (constant variance) of the errors
* Normality of the residual (error) distribution

As the paper suggests, plotting the data visually will help you avoid violating these assumptions, but just making sure you don't violate the assumptions with statistical tests would work too. For example, you can look at your residuals (loss) as an indicator of good fit. If your residuals do not follow a normal distribution, this is typically a warning sign that your R2 score is dubious.

There are a few statistical tests for residual normality; in particular, the Jarque-Bera test is common and available in scipy.

So, I would argue, you don't even need to visualize the data. I describe this more here: http://www.eggie5.com/104-linear-regression-assumptions
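As a concrete sketch of that workflow (on synthetic residuals, not the data from the linked post): scipy.stats.jarque_bera tests whether a sample's skewness and kurtosis are consistent with a normal distribution.

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(0)

# Well-behaved residuals: drawn from an actual normal distribution.
normal_resid = rng.normal(0, 1, 1000)

# Badly-behaved residuals: heavily right-skewed.
skewed_resid = rng.exponential(1, 1000)

stat_n, p_n = jarque_bera(normal_resid)
stat_s, p_s = jarque_bera(skewed_resid)

# A tiny p-value rejects normality. For the exponential residuals the
# rejection is emphatic; the normal draw will typically pass.
print(f"normal residuals: p = {p_n:.3f}")
print(f"skewed residuals: p = {p_s:.3g}")
```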

Sure, using hypothesis tests could pick out some of the structured examples in the Datasaurus, but in practice things are often more subtle. Goodness-of-fit tests to check for normality, in particular, are a bit thorny, lacking power in small sample sizes and rejecting normality for slight departures at larger sample sizes. My experience with assumption checking has been that by the time a hypothesis test has sufficient evidence to reject an assumption, you'd usually be able to see it visually.

Until you get into high dimensions, it probably doesn't hurt too much to visualize the data. Additionally, it can be helpful to understand what signal has been left in the residuals (ex: you fit a linear model, but failed to include a quadratic term), which is something hypothesis tests aren't as good at telling you.
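A minimal sketch of that last point, on synthetic data: fit a straight line to a truly quadratic relationship and the residuals, while centered on zero, carry an obvious leftover pattern that a residual plot would reveal at a glance.

```python
import numpy as np

x = np.arange(-5.0, 6.0)  # -5 .. 5
y = x ** 2                # truly quadratic relationship

# Fit a straight line. By symmetry the best-fit slope is ~0 and the
# intercept is mean(y) = 10, so the "model" is a horizontal line.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

print(resid.round(2))  # U-shaped pattern: the missing quadratic term

# The residuals correlate almost perfectly with x**2 -- exactly the
# leftover signal a hypothesis test would struggle to describe.
print(np.corrcoef(resid, x ** 2)[0, 1])
```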

Yes, "lacking power in small sample sizes"

You can see the statistics for each data set here (in GeoGebra) https://www.geogebra.org/m/K8HCdE6X

Awesome project, I really like it. This is my attempt to achieve a similar result with a different approach.

I feel it might be a useful example for illustrating the intuition behind ZCA and the Wasserstein metric.


Good use case for n-dimensional probability density functions. Query a region by a hypercube.
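One way to sketch that idea on synthetic 2-D data: scipy's gaussian_kde fits an n-dimensional density estimate, and its integrate_box method answers exactly that kind of hypercube query.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Fit a 2-D density estimate to a synthetic point cloud.
points = rng.normal(0, 1, size=(2, 2000))  # gaussian_kde expects (dims, n)
kde = gaussian_kde(points)

# Query the probability mass inside a hypercube -- here the box
# [-1, 1] x [-1, 1] around the origin.
mass = kde.integrate_box([-1, -1], [1, 1])
print(round(mass, 3))

# Sanity check: a huge box captures essentially all the mass.
print(round(kde.integrate_box([-50, -50], [50, 50]), 3))  # ~1.0
```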

Awesome article. I just read it with my morning coffee, and I have a big grin.

Thanks! Glad you liked it :-)
