
Unlearning descriptive statistics - stdbrouw
http://debrouwere.org/2017/02/01/unlearning-descriptive-statistics/
======
andy_wrote
For readers who are OK with some math, I recommend John Myles White's eye-
opening post about means, medians, and modes:
http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/
He describes these summary descriptive stats
in terms of what penalty function they minimize: mean minimizes L2, median
minimizes L1, mode minimizes L0.

A single-number statistic is _going_ to leave things out, so if you must boil
things down into one number, or even a few numbers, you're going to lose
something that you had in the raw data. This is why I find claims along the
lines of "statistics don't tell the whole story" a little bemusing - of course
they don't; the very definition of a statistic is a summarization of data that
is easier to work with. The question is what data is kept or lost, or more
generally what importance we place on different aspects of the raw data such
that it's reflected in our descriptive statistics.

The lessons for non-technical people who want to communicate with descriptive
statistics are to recognize that summarization is inherent in the nature of
any descriptive statistic, that each statistic is thereby opinionated about
what it preserves and what it leaves out, and to consider whether those
opinions are appropriate for your purpose.

~~~
johnmyleswhite
Glad you enjoyed that post so much. It really is a shame that we do such a bad
job of teaching students about the inherent subjectivity of descriptive
statistics and let students leave their courses with dangerous ideas about the
existence of a Holy Grail statistic that will solve all of their problems.

~~~
nkkar
Your followup post (http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/)
is excellent. Thank you!

~~~
johnmyleswhite
Thanks! I really should have finished and written the post about the SVD as
well. One of these days...

~~~
agandy
I'd love to read your post on SVDs once it's written.

------
somestag
I think the message of the article is great: move beyond the "standard"
descriptions and pay more attention to what you're trying to show and who your
audience is.

That said, it's a slight pet peeve of mine when people recommend the median
over the mean to describe center. The median, on its own, does not describe
what is "typical" any more than the mean does; it just has a small advantage
in that it will always map to a real observation, so for discrete data you don't
end up with things like "1.9 legs." (That said, a mean of 1.9 legs actually
seems much more informative to me than a median of 2 legs, so even in that
case I prefer the mean.) It's easy to envision many situations where the
median's representation is wildly inaccurate, just as you can imagine ways in
which the mean can be misleading.

The median is also insensitive to skew, which is often cited as a good thing,
but really that's something you should determine on a case-by-case basis. In
many (most?) applications, there's no tangible benefit in having a measure of
center that ignores skew. The median's insensitivity also creates strange
situations where subgroups in the population end up unrepresented (e.g. if the
poorest 20% become even poorer because of changes to the tax code, the median
doesn't budge). In general, the median conveys remarkably little information.
It's great at showing the center, but it gives zero indication of anything
else. The mean certainly has its issues, but it's far from inferior. Use the
median when the situation calls for it, certainly, but realize that it's just
as limited as any other measure, if not more so.

~~~
srean
> when people recommend the median over the mean to describe center. The
> median, on its own, does not describe what is "typical" any more than the
> mean does; it just has a small advantage

I think this swings to the other extreme, selling the 'median' short. As long
as we agree that it is only strictly meaningful to talk about the 'center' for
symmetric distributions, the median does a fine job.

In fact, in many realistic scenarios it does a far better job than the mean.
The main trouble is that the normal (Gaussian) distribution is nowhere close
to being as ubiquitous as it seems, nor is the CLT as universal as it is made
out to be. Gauss sort of got away with it; he discovered neither the
distribution nor the associated CLT.

Much real data of day-to-day consequence has heavy tails, and the mean is a
pathetic measure of 'central tendency' for these; it is particularly sensitive
to outliers. The median does _significantly_ better than the mean in this
non-academic situation. One could do better than the median for symmetric
heavy-tailed data (for example, trimmed means), but it's the mean that I find
guilty of entirely disproportionate fame. If the median's merits need to be
exaggerated a bit to get people to grow beyond the pervasive Normal / Gaussian
fetish, I am all for it.

No single number is going to characterize what is 'typical'. One really needs
the CDF here, and yes avoid estimating densities as much as possible.

BTW, the median mapping to a real observation is not guaranteed; you only get
roughly a 50% chance of that (with an even number of observations it is the
average of the two middle values).
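
A quick sketch of this in R with simulated lognormal data (purely
illustrative; the numbers come from the distribution, not from any real
dataset):

    set.seed(1)
    x <- rlnorm(1e5, meanlog = 0, sdlog = 2)  # heavy right tail
    mean(x)    # around exp(2) ~ 7.4, dragged up by the tail
    median(x)  # around exp(0) = 1, where the bulk of the data sits
    sd(x)      # enormous; compare with the robust mad(x)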

~~~
somestag
The median is worse than the mean in a skewed distribution if you want to take
the skew into account. In fact one of the strengths of the mean is that it is
sensitive to changes in the entire sample, rather than only part of it. To
repeat my previous example, if the lowest 20% of household incomes drop
because of changes to the tax code, that's something I'd generally want
reflected in my "household income" statistic. We can't get too hung up on the
semantics of _central tendency_ here--there are many good ways to measure
center, and it's clear that all of these statistics actually measure very
different things even though they're all grouped into the same category of
statistic.

When people say the median is "better" than the mean, they often can't explain
exactly why (other than to cite skew, which I've addressed). It's really just
a gut feeling based on how they felt when they first learned that the median
was "more realistic" for some specific set of data that was skewed. They never
thought through exactly why or when you'd want to disregard the skew in the
first place. I would be insane to argue that the median is _never_ the right
statistic to use; my point is merely that it shouldn't be given preference
over the mean, and that people need to think about what they're actually
trying to show.

The mean is not only useful for the normal distribution, and since the article
specifically avoids jumping into inference territory talking about specific
distributions _at all_ is kind of putting the cart before the horse. The
"disproportionate fame" of the mean is inherited from its extreme importance
throughout all of statistical inference. It's true that if you ignore
inference, the mean isn't _particularly_ important as a descriptive statistic,
but my argument is that it's certainly no worse than the median.

~~~
Houshalter
>one of the strengths of the mean is that it is sensitive to changes in the
entire sample, rather than only part of it.

Yes, but this is exactly what is bad about it, at least for the majority of
practical applications that I can think of. For instance, if you want to
measure the "average" income of the population and there are a handful of
uber-rich people, they will really raise the mean, even if the vast majority
of people make much less than it. The median will do a much better job of
telling you what a "typical" person makes.

I really believe that when most people read or talk about "average", they
mentally interpret it as the median. And that the median is generally less
misleading.

Sure, there are applications where sensitivity to changes at the tails is
important. But even then, the mean is misleading and difficult to interpret. In
your earlier example, if you raised the income of the bottom 10% by 100%, then
just use that as a statistic. "The poorest 10% of people now make twice as
much money" is _much_ more informative and impressive than "the average income
increased by 0.1%."

------
minimaxir
It's worth noting that most statistical packages can produce a five-number
summary (https://en.m.wikipedia.org/wiki/Five-number_summary), which covers
many of the described skew-independent descriptive statistics automatically.
(R, for example, will generate a five-number summary for each feature of a
data frame with a simple summary(df).)

The TL;DR is to just plot everything if possible to visualize skew explicitly
(or use tricks like the Quartet), and there are a number of tools which can do
that without much LOC overhead (e.g. ggplot2).

~~~
jfaucett
you can also use the aptly named fivenum() func in R :)

~~~
apathy
summary(df) will do that for every column of a dataframe with less code...
which is why few people use fivenum()

The majority of defaults in R exist because statisticians do these things _all
the time_ and don't like typing any more than programmers. Time spent typing
could be spent thinking about what assumptions are being violated and whether
the results offer any useful insight.

------
pyromine
I don't feel particularly convinced by this article; I feel like it misses the
mark by going beyond a reasonable standard. The article is phrased in such a
manner that it acts as if typical descriptive statistics are not useful, but
the more important and more reasonable point is to understand when to use
which statistics.

In fact, many of his examples of what to use are just other descriptive
statistics, median is no less a descriptive statistic than the mean is.

I think the problem is not so much that descriptive statistics are bad, so
much as they are not particularly useful when they lack context.
Unfortunately, statistics is a much more complex field than your average
product designer doing an A/B test thinks. The majority of statistics and
models the average person utilizes have assumptions and qualifications that
they don't fully understand, and frankly that's not a slight to them but
rather a reflection of the additional depth of the field.

Just as an example, ask the average person to derive the degrees of freedom
for their t-test without just utilizing the formula they were taught.

As in most fields, this is an issue of nuance: it's not that beginner
techniques are bad, but that beginner techniques alone are not nuanced enough
to capture useful or rigorous insights.

~~~
ykler
"In fact, many of his examples of what to use are just other descriptive
statistics, median is no less a descriptive statistic than the mean is." I
think the author would fully agree with this. He is saying we should use more
easily understandable descriptive statistics.

"I think the problem is not so much that descriptive statistics are bad, so
much as they are not particularly useful when they lack context.
Unfortunately, statistics is a much more complex field than your average
product designer doing an A/B test thinks." Whatever you say publicly is
likely to be stripped of its context and to be received by an audience that
does not understand the complexities of your field. The answer to this is to
change what you say, not to change the world, which verges on impossible.

------
OliverJones
This is terrific.

In the field of web ops, I've had great success tracking the 95th percentile
of request time rather than the mean, median, mode, or any other descriptive
metric.

The systems I worked with were, like many systems, ordinarily very efficient.
That meant the mean and median metrics washed away the occasional troublesome
request and hid it from the metrics dashboard.

But knowing the 95th percentile was out of bounds allowed my team to
investigate and try to remediate the trouble spots.

This was especially useful in troubleshooting fax delivery (yeah, I know, I
know: it was healthcare stuff; fax is considered secure and email is not)
through unbelievably flaky private branch exchanges (big city hospitals).

~~~
telotortium
Gil Tene has an entire talk about this point called "How NOT to measure
latency"
([https://www.youtube.com/watch?v=lJ8ydIuPFeU](https://www.youtube.com/watch?v=lJ8ydIuPFeU)).
In many circumstances, a web service (either server or client) has large
numbers of dependencies, so a user only has to hit 95th-percentile latency for
one dependency to have their overall latency significantly hurt. In other
words, users may encounter the "rare" bad case on a majority of requests.
Thus, even higher percentiles should be tightly controlled in order to truly
keep a web service's overall latency under control.
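
A back-of-the-envelope version of that argument, as a sketch in R (assuming
independent dependencies, which is optimistic):

    # probability that a request touching n dependencies hits at least one
    # per-dependency 95th-percentile-or-worse latency
    n <- c(1, 5, 20, 50)
    1 - 0.95^n   # ~0.05, ~0.23, ~0.64, ~0.92

So with a few dozen dependencies, the "rare" slow case becomes the common
case.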

------
em500
To answer the author's postscript: "why did nobody tell me this?"

Because you didn't pay attention at school? Means, medians, modes and
percentiles were standard fare in my high school, and again in freshman
university courses. And I didn't go to particularly expensive schools or elite
universities. Then again, I also taught intro stat courses to BA students for
a few years, and it was easy to tell that the vast majority of them cared very
little, since understanding their uses wasn't exam material.

~~~
rankam
I think the author is trying to point out that mean, median, and mode are
things that people think they understand because they know how to calculate
them - calculating any one of them is trivial for the majority of the
high-school-educated population (14-18 years old). However, just because you
can calculate
something doesn't mean you know how to properly apply it.

When you taught your intro stats class, did you try to give the students
guidance (and convey understanding) as to when and why they should apply one
of the statistics? If so, good on you, but in my (obviously anecdotal)
experience, that has not been the case.

------
jordigh
I would like to advertise again my favourite measurement of skewness, the
medcouple:

[https://en.wikipedia.org/wiki/Medcouple](https://en.wikipedia.org/wiki/Medcouple)

It's robust (25% breakdown point) and is kind of the "optimal" measurement of
skewness, as it's computed as the median of all possible interquartile range-
like statistics.

I hope someone implements the faster algorithm for Python's statsmodels.
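
For the curious, a naive O(n^2) sketch of the idea in R (it ignores the
special kernel rules for ties at the median, so treat it as illustrative
rather than as a reference implementation):

    medcouple_naive <- function(x) {
      m <- median(x)
      lo <- x[x <= m]   # observations at or below the median
      hi <- x[x >= m]   # observations at or above the median
      h <- outer(hi, lo, function(xj, xi) ((xj - m) - (m - xi)) / (xj - xi))
      median(h[is.finite(h)])  # kernel lies in [-1, 1]; ~0 for symmetric data
    }

The faster algorithm mentioned above brings this down to roughly O(n log n).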

------
FabHK
To be fair, the article should not be called "Unlearning descriptive
statistics" but "Learning just a bit more (and often better) descriptive
statistics"; granted, that's not as snappy.

Certainly better than "I took a statistics course, and you won't believe what
happened next!!!1!!"

~~~
sevensor
Agreed, but I've got to say from experience that quantile statistics like
those the author proposes are often _way_ more informative. The only downside
is that
they can be considerably more expensive to compute for bigger data sets,
whereas mean and standard deviation are cheap and can be updated without
taking another pass over the data.
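
For concreteness, this is the property being referred to: a minimal sketch of
the one-pass (streaming) update for mean and variance, Welford's algorithm, in
R with made-up names:

    # state holds n, the running mean, and m2 = sum of squared deviations
    update_stats <- function(state, x) {
      n     <- state$n + 1
      delta <- x - state$mean
      mu    <- state$mean + delta / n
      m2    <- state$m2 + delta * (x - mu)
      list(n = n, mean = mu, m2 = m2)   # sample variance = m2 / (n - 1)
    }

    state <- list(n = 0, mean = 0, m2 = 0)
    for (x in rnorm(1000)) state <- update_stats(state, x)

There is no exact single-pass equivalent for arbitrary quantiles, though
approximate streaming sketches exist.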

------
kriro
On a somewhat related note, I think it's interesting that there is data on how
people perceive information but that isn't really taught in descriptive
statistics (at least it wasn't at my university). I feel like data
representation and charts/graphs should be more of a focus if the title of the
class is descriptive statistics. I hadn't even heard about "How to Lie with
Statistics" or "The Visual Display of Quantitative Information" before
randomly finding them mentioned somewhere. I simply learned about some
standard graphs without much explanation on why you use them or how they
impact the understanding of the audience.

------
clircle
For anyone who is interested in correlation measures, have a look at distance
correlation:

[https://en.wikipedia.org/wiki/Distance_correlation](https://en.wikipedia.org/wiki/Distance_correlation)
[http://projecteuclid.org/euclid.aoas/1267453933](http://projecteuclid.org/euclid.aoas/1267453933)

It's implemented in the R package "energy", and provides a more comprehensive
measure of variable dependence than Spearman's rank correlation. Distance
correlation is able to detect any kind of dependence, not just monotonic
relationships.
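
A tiny sketch of the difference, assuming the "energy" package is installed:

    library(energy)
    set.seed(1)
    x <- rnorm(500)
    y <- x^2                        # strong dependence, but not monotonic
    cor(x, y, method = "spearman")  # near zero
    dcor(x, y)                      # clearly positive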

------
yomritoyj
I think that the article is unfair to statisticians in saying that they ignore
descriptive statistics. The problem is that one cannot choose systematically
between different descriptive statistics without clearly specifying one's
objectives and beliefs about the data-generating process. Otherwise, on
what grounds can you say that the median is 'better' than the mean or the
other way round? But once you specify these things you are doing statistical
estimation, which is a major branch of mathematical statistics.

------
jfaucett
This is a great read. I've been thinking a lot about these basic concepts
myself lately, especially the idea of central tendency.

Essentially, we can make up any method we want to summarize data and give us a
single value that represents the central location of the data, e.g. the mean
vs. least absolute distance vs. distance squared, etc.

I haven't thought much about the difference between "typical case" and
"expected value", so that's a very useful distinction to make, especially
when deciding which method you want to pick.

~~~
FabHK
One good way to think about it is to ask what norm ("distance") between all
the datapoints and your "central tendency" statistic you want to minimize, as
the article alluded to.

For L2 (squared distance), you get the mean: For fixed x_i,

sum of i=1..N of (x_i - M)^2

is smallest for M = mean.

For L1 (absolute distance), you get the median.

For L0 ("identical or not"), you get the mode.

You can come up with other and alternative concepts, but it's kind of neat
that the 3 most common descriptions of central tendency pop out of this one
unified approach.
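
You can check this numerically with a brute-force grid search; a quick sketch
in R with toy data:

    x  <- c(1, 2, 2, 2, 3, 7)                    # toy data
    m  <- seq(0, 8, by = 0.01)                   # candidate centers
    l2 <- sapply(m, function(c) sum((x - c)^2))  # squared distance
    l1 <- sapply(m, function(c) sum(abs(x - c))) # absolute distance
    m[which.min(l2)]            # ~2.83, i.e. mean(x)
    m[which.min(l1)]            # 2, i.e. median(x)
    names(which.max(table(x)))  # "2", the mode: minimizes the L0 mismatch count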

------
wnoise
Variance and standard deviation get a lot of hate here that I don't think they
deserve. They really do have nice properties, generally nicer than the MAD's.
Squaring feels to me much less of a hack than the absolute value. It's
certainly true that most people don't quite understand them, but that's as
much a problem with the people as with the method.
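
One concrete example of those nicer properties, sketched in R: variances add
for independent variables, while the MAD does not.

    set.seed(1)
    x <- rnorm(1e5); y <- rnorm(1e5)    # independent
    var(x) + var(y); var(x + y)         # both ~2
    mad(x) + mad(y); mad(x + y)         # ~2 vs ~sqrt(2)

(R's mad() is scaled to match the standard deviation for normal data, which is
why the individual values come out near 1.)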

------
heinrichhartman
I wrote an article for the ACM Queue (also published in CACM) about this very
issue called "Statistics for Engineers"

[http://queue.acm.org/detail.cfm?id=2903468](http://queue.acm.org/detail.cfm?id=2903468)

~~~
bigger_cheese
I took a "Statistics for Engineers" course while at university. It was mostly
focused around ANOVA and hypothesis testing. By the end of the course I was
sick of hearing the phrase "Null hypothesis". I wish the course had introduced
some more practical applications instead it was very high level and
theoretical.

Working as an engineer now (Materials Engineering), here is a short list off
the top of my head of what I feel are useful statistics applications that were
not covered in uni:

- Alternative probability distributions (Rosin-Rammler)
- Harmonic and geometric means
- Multivariate regression
- Statistical sampling techniques (how to obtain a representative sample and
  avoid sampling bias)
- Statistical process control
- Time series analysis

~~~
heinrichhartman
Thanks for your list.

In the IT operations / monitoring domain, time series models play a large
role. At the same time, the level of sophistication is rather low. You
definitely have to start with the basics when teaching such a course.

------
shmageggy
> _But why do you want a number at all?_

Because lots of data analysis questions hinge upon the association between two
data sets, and it's nice (crucial) to be able to quantify this value.
Especially because

> _While statisticians are generally quite good at estimating a correlation
> from a picture and vice versa, most people are not._

The author says

> _Still not happy and absolutely want a number? You would do well to shun
> correlations even so._

OK, so what else do you suggest?

~~~
apathy
1) it is pretty amazing that normit transformations (map the quantiles of a
non-normal distribution onto a Gaussian and use that) don't seem to be on this
guy's radar. We use distributions with linearly additive and affine invariant
properties (normal plus normal is normal, bernoulli plus bernoulli is bitwise
bernoulli) because we find linear algebra very useful. Nonparametric tests and
procedures erode your power; normit transformations usually increase it. I did
part of my dissertation on this; it's partly to do with asymptotics, but also
partly due to the robustness of Gaussian error assumptions thanks to the CLT.
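
For readers unfamiliar with the term, a one-line sketch of a rank-based
inverse normal ("normit") transform in R, for some numeric vector x (the exact
offsets used in the ranks vary by convention):

    z <- qnorm(rank(x) / (length(x) + 1))  # map empirical quantiles onto a Gaussian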

I realized recently that a lot of the trouble people have with training neural
networks stems from their lack of training in basic model evaluation. If you
stack a bunch of shitty penalized regressions (which is what linear/logistic +
ReLU hinge loss represents), you now have one gigantic shitty regression which
is harder to debug. If your early steps are thrown out of whack by outliers,
your later steps will be too. Dropout is an attempt to remedy this, but you
tend to lose power when you shrink your dataset or model, so (per usual) there
really is no such thing as a free lunch. But most of the tradeoffs make more
sense when you are able to evaluate each layer as a predictor/filter. Scaling
this up to deep models is hard, therefore debugging deep models is hard. Not
exactly a big leap.

There is a reason people say "an expert is a master of the fundamentals".
Building a castle on a swamp gives poor results. If you can't design an
experiment to test your model and its assumptions, your model will suck. This
is not rocket surgery. A GPU allows you to make more mistakes, faster, if
that's what you want. If you have the fundamentals nailed down, and enough
data to avoid overfitting, then nonlinear approaches can be incredibly
powerful.

Most of the time a simple logistic regression will offer 80-90% of the power
of a DNN, a kernel regression will offer 80-90% of the power of a CNN, and an
HMM or Kalman filter will offer 80-90% of the power of an RNN. It's when you
need that 10-20% "extra" to compete, and have the data to do it, that deeper
or trickier architectures help.

If you can transform a bunch of correlated data so that it is 1) decorrelated
and 2) close enough to Gaussian for government work, you suddenly get a
tremendous amount of power from linear algebra and differential geometry "for
free". This is one reason why Bayesian and graphical hierarchical mixed models
work well -- you borrow information when you don't have enough to feed the
model, and if you have some domain expertise, this allows you to keep the
model from making stupid or impossible predictions.

Anyways. I have had fun lately playing with various deep, recurrent, and
adversarial architectures. I don't mean to imply they aren't tremendously
powerful in the right hands. But so is a Hole Hawg. Don't use a Hole Hawg when
a paper punch is all you really need.

2) What (good) statisticians excel at is catching faulty assumptions. (I'll
leave it to the reader to decide whether this data-scientist-for-hire has done
a good job of that in his piece) So we plot our data, marginally or via
projections, _all the damned time_. If you don't, sooner or later it will bite
you in the ass, and then you too can join the ranks of the always-plotting.
However, choosing which margins or conditional distributions to plot in a
high-dimensional or sparse dataset is important to avoid wasting a lot of
time. So whether via transformation or penalization (e.g. graphical lasso) or
both, we usually try to prune things down and home in on "the good stuff".
Prioritizing what to do first is most easily done if you have a number and can
rank the putative significance by that number. Use Spearman, use energy
statistics (distance correlation), use marginal score tests -- IDGAF, just use
these as guidelines and plot the damned data.

Corollary: if someone shows you fancy plots and never simple ones containing
clouds of individual data points, they're probably full of shit. Boxplots
should be beeswarms, loess plots should have scatterplots (smoothed or
otherwise) behind them. And for god's sake plot your residuals, either
implicitly or explicitly.

3) see above. The author is good at fussing, and brings up some classical
points. But they're not really his points. Median and MAD are more robust to
outliers than mean and standard deviation, but that makes them less sensitive,
too. Check your assumptions, plot everything, use the numbers as advisory
quantities rather than final results.

~~~
stdbrouw
> it is pretty amazing that normit transformations (map the quantiles of a
> non-normal distribution onto a Gaussian and use that) don't seem to be on
> this guy's radar

I don't mean to offend, but this is the PhD ur-response, "you didn't mention
my pet theory!" :-)

You've given me some interesting stuff to chew on but I very specifically
wanted to write about descriptive statistics as a way to _describe_ data, not
as a way to summarize it for computers so it can be used in inference. Mapping
non-normal distributions onto a Gaussian ain't gonna cut it for that purpose,
and to the extent that I care about robustness in this context it's not
robustness of inference but whether a descriptive continues to provide a
reasonable description of the data for human consumption in the face of
outliers etc.

~~~
apathy
re: normit/inverse transformation: Pet theory? This isn't a theory, it's a
simple inverse transformation. It's used all the time. George Box showed how
to do a 2D version of it in the 1950s. Normit may be a pet name for it,
though. It's just a play on words (probit, logit, expit... normit).

As far as describing data, what's wrong with median + IQR for marginal
distributions, or some flavor of energy statistic for joint distributions? You
will always need to trade off robustness for sensitivity and bias for
variance. That's simply a mathematical feature of the universe. There are
plenty of ways to take advantage of this to highlight outliers, for example,
which often gets you thinking in terms of "hey, this really looks more like a
mixture of two completely different distributions" and seeing if that
intuition holds up.

The whole point of describing data with summary statistics is that _if the
assumptions are met_ this decouples them from the underlying data. If you use
the median as an estimator for the center of your distribution and the MAD as
an estimator of its scale, you may choose dimensions along which it's not very
good at partitioning your observations. If you want a resistant way to
describe the expected center of your data, the median and MAD are very useful.
Sometimes it's even more useful to plot everything and point out "our results
fall into K obvious clusters", each of which will have its own center &
spread.

What I'm saying is that there's no silver bullet. Most of the time we take
descriptions of the data, see if there are interesting inferences to be drawn,
lather, rinse, repeat. "Get a lot more samples" is often thrown into the mix. Are
there strong clusters in the data? (Usually a plot will show this, whether via
projection or in the raw data) Are there continuums that are interesting in
relationship to things we care about? (Usually we'll turn around and model
their relationship to said thing-we-care-about, conditioned upon a bunch of
other items... multivariate regression, which, if you're doing it right, will
get you plotting the residuals, themselves descriptive of the model fit)

You simply can't do responsible statistical inference without exploring your
data to see what's going on. In order to explore complex datasets, there are
plenty of techniques, and most all of them demand tradeoffs (see MAD vs. SD or
other metrics of "interestingness" for clustering). A number of descriptive
statistics ("extremality" for example) rely upon limit behavior of specific
distributions and are case-by-case.

I don't think you'll find many silver bullets for either descriptive or
inferential statistics. You have to choose your tradeoffs based on what you
want to accomplish.

------
krick
Can somebody recommend a good book (or any other resource) on statistics? I
mean the kind of stuff this post talks about: descriptives, tests, all the
basic stuff. Despite being somewhat familiar with it, I sometimes feel I
really lack a solid understanding of the subject and am longing for something
explanatory, with real-life examples and exercises.

~~~
stdbrouw
Think Stats (http://greenteapress.com/wp/think-stats-2e/) and the follow-up
Think Bayes are awesome if you're a coder.
https://bblais.github.io/statistical-inference-for-everyone-sie.html is a
complete rethink of how basic statistics should be taught and is excellent as
well.

~~~
krick
Thanks, looks awesome.

------
wodenokoto
Great article overall, but I wish it went a little deeper in explaining how we
should interpret the statistics it recommends.

It basically says "don't use the normal numbers, use these instead, because
they are closer to what you would expect them to mean"

However, I don't find all of these statistics super obvious.

------
pizza
Surely there has to be an analysis somewhere that someone has done which links
Kolmogorov complexity to one-statistic-describes-all approaches to
interpreting numbers?

~~~
stdbrouw
Something like what you describe is more or less how statisticians do in fact
reason about statistics:
[https://en.wikipedia.org/wiki/Sufficient_statistic#Minimal_s...](https://en.wikipedia.org/wiki/Sufficient_statistic#Minimal_sufficiency),
though only for i.i.d. data from a parametric distribution.

------
TorKlingberg
I think "skew" is on of those things that are taught not because it is useful,
but because it is easy to teach, easy to memorize and easy to test for.

------
TheGorramBatman
Who the hell would consider Tukey a fringe statistician?

~~~
stdbrouw
Lauded, sure, but just about all of his books are out of print and his work is
rarely a part of the curriculum.

------
OscarCunningham
What's the "ordinal center"?

~~~
stdbrouw
If you've got a bunch of ordered categories, like "bad", "okay" and "good",
the category in the middle. (So if humans can have 0, 1 or 2 legs, then 1
would be at the center.)

------
eanzenberg
This is actually a great post. I think it's a wonderful idea to move away from
worshipping summary statistics (ave, median) and start studying the underlying
distributions.

