
Beware the Mean - sjwhitworth
https://stephen.sh/posts/beware-the-mean
======
kqr
Other, related points:

\- With heavy tails, the sample mean (i.e. the number you can see) is very
likely to underestimate the population mean.

\- With heavy enough tails, higher moments like variance (and therefore
standard deviation) do not exist at all -- they're infinite.

\- Critically: With heavy tails, the central limit theorem breaks down. Sums
of heavy-tailed samples converge to a normal distribution so slowly it might
not realistically ever happen with your finite data. Any computation you do
that explicitly or implicitly relies on the CLT will give you junk results!
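
The first point is easy to check by simulation. This is only a rough sketch: the Pareto shape of 1.1, the sample size, and the function name are illustrative assumptions, not anything taken from the article. For a Pareto distribution with minimum 1 and shape alpha > 1 the population mean is alpha / (alpha - 1), so the code just counts how often a sample mean lands below it.

```python
import random


def fraction_of_sample_means_below_true_mean(alpha=1.1, sample_size=1000,
                                             trials=500, seed=0):
    """Fraction of simulated samples whose mean is below the population mean.

    Assumes a Pareto distribution with minimum 1 and shape `alpha` > 1,
    whose population mean is alpha / (alpha - 1).
    """
    rng = random.Random(seed)
    true_mean = alpha / (alpha - 1)
    below = 0
    for _ in range(trials):
        total = sum(rng.paretovariate(alpha) for _ in range(sample_size))
        if total / sample_size < true_mean:
            below += 1
    return below / trials


if __name__ == "__main__":
    frac = fraction_of_sample_means_below_true_mean()
    print(f"sample mean below population mean in {frac:.0%} of trials")
```

With a tail this heavy, most of the population mean comes from rare, huge observations that a typical sample simply does not contain, so most sample means come out too low.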

~~~
wodenokoto
Can you elaborate on the part about sample vs population mean?

The way I see it, in these scenarios you aren’t looking at the sample mean.
There is no reason to sample your customer base to get an estimate of your
average revenue. You can calculate it from the entire population.

~~~
paulsutter
Your current customer base is just a sample of your total market, for example.
Next year’s (larger) customer base will be slightly closer to the whole set
(so different) but still just a sample.

~~~
wodenokoto
So the point is that the current customer base is a skewed sample of the
theoretical customer population.

Therefore we shouldn’t look at the current customer mean profit to predict
what would happen to profits if we doubled the customer base.

~~~
kqr
Exactly. These summaries are often used for prediction, which means historic
data is used as a sample of the same distribution as future data.

Even when comparing two different historical data sets you have to be careful:
if you're doing anything that resembles hypothesis testing (i.e. trying to
figure out if something you changed made a difference) you're not _really_
comparing two historical data sets -- you're trying to compare the underlying
distributions from which the historical data sets were drawn, but hoping that
the historical data are representative samples from those.

------
ImaCake
If the author is seeing this thread: I couldn't find an RSS feed for your
site. I don't know if they are difficult to set up, but if it's very little
effort, I would appreciate seeing what you post next :)

As for the wariness about the mean: a lot of people are much further behind
than thinking about different distributions. Even something you assume is
normally distributed needs a mean _and_ a variance! As for visualising,
histograms are incredibly underrated tools. You can infer a lot of information
by just looking at a distribution.

~~~
sjwhitworth
Sorry, I do need to set this up. It's just a bunch of Markdown at the moment.
Thanks for reading!

------
EliRivers
The mean is misleading. The median is misleading. The mode is misleading. Any
reduction of a range of data to a single representative datum is misleading.

However, the pushback against providing something a bit more meaningful than
a single value can sometimes be quite strong.

I try hard to provide software estimates as probability distributions, but
when someone sees a line with a probability peak somewhere around two days
(could be really simple), and then a wide hump somewhere around two weeks (if
it's not simple, it will mean a significant rewrite), with a very low line
between them and then a long, long tail off to several months, it is not well-
received.

I can see their point; they're trying to plan things, and the whole system is
set up to work with single numbers. If everyone provided probability graphs
for their estimates, and we had a tool that could then combine them and
deliver the net probability graph of the combined pieces, I expect they'd be a
lot more amenable.
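
A tool like that does not have to be complicated. Below is a minimal Monte Carlo sketch of the idea. It assumes each task is either a quick fix (around 2 days) or a rewrite (around 10 working days) with lognormal noise for the long tail. The mixture weights, the spreads, and all function names are hypothetical. The combined estimate is simply the distribution of the summed samples.

```python
import math
import random


def sample_task_days(rng, p_simple=0.6):
    """Draw one task duration from a hypothetical two-humped estimate."""
    if rng.random() < p_simple:
        # Simple case: centred near 2 days, fairly tight.
        return rng.lognormvariate(math.log(2), 0.3)
    # Rewrite case: centred near 10 working days, with a long right tail.
    return rng.lognormvariate(math.log(10), 0.6)


def simulate_project(n_tasks=5, trials=20_000, seed=0):
    """Return sorted simulated totals (in days) for a project of n_tasks tasks."""
    rng = random.Random(seed)
    return sorted(
        sum(sample_task_days(rng) for _ in range(n_tasks))
        for _ in range(trials)
    )


def percentile(sorted_values, q):
    """Nearest-rank style percentile of an already sorted list, 0 <= q <= 1."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    totals = simulate_project()
    for q in (0.5, 0.9, 0.99):
        print(f"p{int(q * 100)}: {percentile(totals, q):.1f} days")
```

Reporting a few percentiles of the total (say p50 and p90) is one way to hand planners a small set of numbers without throwing away the shape entirely.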

~~~
Izkata
> The mean is misleading. The median is misleading. The mode is misleading.
> Any reduction of a range of data to a single representative datum is
> misleading.

For anyone who hasn't seen it before, Anscombe's Quartet is a nice visual of
this (and actually goes a bit further, showing reduction in general can be
misleading, not just to a single point).

[https://en.wikipedia.org/wiki/Anscombe%27s_quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)
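
For anyone who would rather see the numbers than the plots, here is a small sketch that computes the usual summaries for the four data sets. The values are the standard published quartet, and the helper names are made up for this example. The summaries come out nearly identical even though the scatter plots look nothing alike.

```python
import math
import statistics

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand to avoid version-specific APIs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summary(xs, ys):
    """Mean of y, sample variance of y, and correlation of x and y."""
    return statistics.fmean(ys), statistics.variance(ys), pearson(xs, ys)


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        mean_y, var_y, corr = summary(xs, ys)
        print(f"{name:>3}: mean(y)={mean_y:.2f} var(y)={var_y:.2f} corr={corr:.2f}")
```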

~~~
petercooper
I was just coming here to post that. A very neat (and surprising) thing! :)

------
teodorlu
Nassim Taleb greatly expands on this point in _Antifragile_. For a freely
available, technical argument, check out _Doing Statistics Under Fat Tails_
[1].

[1]:
[https://www.fooledbyrandomness.com/FatTails.html](https://www.fooledbyrandomness.com/FatTails.html)

------
theophrastus
This is a worthy posting, particularly as so much becomes iterative statistics
in "A.I." clothing. The two old (slightly hackneyed) counter-examples which
are popular in lectures about measures of the _central tendency_ are:

\- One is trying to get a sense of the common sort of income in a room and
then Bill Gates wanders in. Suddenly the average income becomes an amount
which _no one_ experiences.

\- What is the average number of testicles in the human population? That
computed central tendency is quite rare.
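
The first example takes only a few lines to reproduce. The incomes below, including the figure for the newcomer, are made-up numbers for illustration and not real data.

```python
import statistics

# Hypothetical annual incomes of ten people in a room.
ROOM = [40_000, 45_000, 50_000, 55_000, 60_000,
        65_000, 70_000, 75_000, 80_000, 90_000]

# Hypothetical income for the very rich person who wanders in.
NEWCOMER = 1_000_000_000


def mean_and_median(incomes):
    """Return (mean, median) of a list of incomes."""
    return statistics.fmean(incomes), statistics.median(incomes)


if __name__ == "__main__":
    print("before:", mean_and_median(ROOM))
    print("after: ", mean_and_median(ROOM + [NEWCOMER]))
```

The median barely moves, while the mean jumps to a value far above what any of the original ten people earn.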

~~~
kqr
The second one doesn't seem that bad to me. The number can still be used to
answer common questions like, "Assuming it takes 10 seconds to tickle a
testicle, how long do I have to tickle testicles if I want to tickle every
testicle in my apartment complex."

Sure, you have to know what the mean means, but it's still a useful number.
Your first example is more indicative of the problem, IMO.

------
mlyle
I would quibble some here. When we look at revenue, I agree: ignore the mean.
If there's a whole bunch of people not paying you anything, that's OK... Look
at the 50th and 90th percentile.

But _profit_, and similarly _costs_? Your mean customer better be profitable,
or you won't be. How much the people on the left of the graph _cost_ you is
_important_.

Part of this is definitional, too. Do you include that far left part of the
graph where people are not really paying you as a "customer"?

~~~
andreilys
A “mean” profitable customer would be misleading. You could see that 99% of
your customers lose money for you, while 1% earn all the profit (aka fat tail)

The point is you should be careful about reducing huge swaths of data to one
datum. It often hides the more interesting insights.

~~~
mlyle
> A “mean” profitable customer would be misleading.

In no place did I suggest "mean profitable customer" as a metric. I sort of
said the opposite.

At the same time, in a traditional industry we wouldn't consider people with
an evaluation license or sample or who came in to wander and see our wares a
"customer". They're a _lead_ and have a cost associated with them.

> The point is you should be careful about reducing huge swaths of data to one
> datum. It often hides the more interesting insights.

Sure. At the same time, if you dance with 30 numbers looking for insight,
pretty soon you're practicing qualitative, wishy-washy innumeracy. The
discussion is about KPIs, which are all about "reducing huge swaths of data to
one datum" but also absolutely essential to run a real business day to day.

Monitor more than one, and keep your eyes open for where they go wrong, and be
ready to change what you do.

------
PaulHoule
The mean is not so bad for many purposes because it is an expectation value.

If you add up your revenue, subtract your expenses, and divide by the number
of customers, that gives you a real profit number. (Conditional on how you
define revenue & expenses.) Whether that number is negative or positive is
meaningful.

The median on the other hand has a different set of problems. If you are
running a game like Fate Grand Order you'd better cultivate the guy who spends
$70k because he has to "catch them all". The median player probably pays
little or nothing. The guy who sells ero comics at Comiket complains about
what it costs to get (say) Saber Bride, but it is worth more to him than it is
to the median player.

Mean and median are terrible numbers to use for latency; what drives you nuts
with your computer being unresponsive is not the median latency, but the 99%
latency.
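
A quick sketch of why the tail is the number to watch. The latency model here (98% of requests around 20 ms, 2% stalling for 1 to 3 seconds) is an invented assumption for illustration, as are the function names.

```python
import random
import statistics


def simulate_latencies_ms(n=10_000, slow_fraction=0.02, seed=0):
    """Generate hypothetical request latencies in milliseconds."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n):
        if rng.random() < slow_fraction:
            # Rare stall: somewhere between 1 and 3 seconds.
            latencies.append(rng.uniform(1000, 3000))
        else:
            # Typical request: about 20 ms, never below 1 ms.
            latencies.append(max(1.0, rng.gauss(20, 3)))
    return latencies


def summarize(latencies):
    """Return mean, median and 99th percentile of the latencies."""
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {
        "mean": statistics.fmean(latencies),
        "median": statistics.median(latencies),
        "p99": cuts[98],  # last cut point, i.e. the 99th percentile
    }


if __name__ == "__main__":
    for name, value in summarize(simulate_latencies_ms()).items():
        print(f"{name}: {value:.1f} ms")
```

In this model the median says everything is fine, the mean describes a latency that almost no request actually has, and only the 99th percentile shows the stalls that users notice.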

~~~
kqr
> The mean is not so bad for many purposes because it is an expectation value.

Implicit assumptions:

1\. The sample mean is an accurate estimator of the expectation.

2\. The expectation is a useful number.

Both of these are false surprisingly often; one example is the one you
mention: latencies.

------
pototo666
I have come across too many people who value the mean soooooo much in their
analysis. Well, some of them made mistakes and the project died. Hypothesis:
heavy reliance on the mean increases the probability of failure in the
internet industry. This reminds me of PG's essay _Mean People Fail_:
[http://www.paulgraham.com/mean.html](http://www.paulgraham.com/mean.html)

Pun intended :)

------
fmajid
The Iranian civilization can draw continuity to Susa, circa 3000BC, further
than China. The Mesopotamian and Indian civilizations are older still but
broke continuity.

~~~
RubenvanE
I think you meant to reply to this post
[https://news.ycombinator.com/item?id=22166846](https://news.ycombinator.com/item?id=22166846).
:)

