
Statistical Mistakes and How to Avoid Them - ingve
http://www.cs.cornell.edu/~asampson/blog/statsmistakes.html
======
lisper
This is the insight that made statistics "click" for me many years ago: a
statistical test answers one central question: what are the odds that the
results you observed could have arisen by chance? If those odds are low, then
you are justified in concluding that the results probably did not arise by
chance, and so there must be some other explanation (usually, but not always,
the causal hypothesis you are advancing).

One consequence of this is that it is _crucial_ that you advance your
hypothesis _before_ you collect (or at least look at) the data because the
odds of something arising by chance change depending on whether you predict or
postdict the results. Also, the more data you have, the more likely you are to
find something in there that looks like a signal but is in fact just a
coincidence. Many a day-trading fortune has been lost to this one mistake.
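
For what it's worth, that last point is easy to demo with a quick (hypothetical)
Python sketch: correlate enough noise-only "signals" against noise-only
"returns" and some will look significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 250 days of pure-noise "returns" and 1000 pure-noise candidate signals.
returns = rng.normal(0, 1, size=250)

p_values = []
for _ in range(1000):
    signal = rng.normal(0, 1, size=250)
    r, p = stats.pearsonr(signal, returns)
    p_values.append(p)

# With enough candidates, some correlations look "significant" purely by
# chance: at alpha = 0.05 we expect roughly 50 false discoveries out of 1000.
print(min(p_values), sum(pv < 0.05 for pv in p_values))
```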

~~~
stdbrouw
> a statistical test answers one central question: what are the odds that the
> results you observed could have arisen by chance

Well, no, that would be very interesting, but unfortunately what a statistical
test really tells you is the probability of the results you observed (or more
extreme) _given_ chance: P(data|model), not P(model|data).

~~~
lisper
It's actually P(data|null-hypothesis).

~~~
qwrusz
How about you guys are both right, sort of.

 _There are both Bayesian and Frequentist approaches in statistics!_

They represent very different approaches to statistics, but they are also
quite similar. My apologies, I couldn't find one link that gave a good
description of Bayesian vs. Frequentist. Here are a couple of links to get
started:

[https://xkcd.com/1132/](https://xkcd.com/1132/)

[http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/](http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/)

If anyone comes across a link that describes Bayesian and Frequentist clearly,
please share it if you don't mind.

~~~
lisper
That cartoon is actually a pretty good illustration of why frequentists are
wrong.

~~~
qwrusz
Do you mean _probably_ wrong?

Either way I disagree. Both approaches have pros and cons depending on the
type of analysis.

At the very least, just the presence of competing approaches in the field has
pushed statisticians to have more rigour and do way more double-checking than
they might have otherwise, out of fear that the other side is actively looking
to poke holes. It's easy to lie with statistics, and the only people who can
call statisticians out on their BS are other statisticians... Need more of
this.

~~~
lisper
> Do you mean probably wrong?

Nope.

> Both approaches have pros and cons

What are the pros of the frequentist approach?

~~~
throw_away_777
The main advantage of the frequentist approach is that you can do the
calculations much more easily. Bayesian statistics is great, but often the
calculations are much more difficult because of the prior distribution.
You can make up simplified priors to ease the calculations, but then you run
into some of the same problems as frequentist statistics.

Here is a simple example: let's say you flip a coin 10 times and get 8 heads;
what is the probability that the coin is not fair? In frequentist statistics
you only have to calculate the likelihood of the results under a null
hypothesis, and then you use a p-value. While this approach is flawed, at
least you can quickly do the calculation and get an approximate answer. In
Bayesian statistics you have to specify the prior distribution and calculate
the likelihood of your results under every possible hypothesis. Correctly
specifying this prior distribution and calculating the results is quite
challenging - especially if you want to use a realistic prior (not just
uniform). This is a pretty simple example; you can imagine how much more
challenging this becomes in real-world problems. On the other hand, it is true
that the frequentist approach doesn't really answer the question asked, so it
is misleading (especially if you choose a p-value that isn't specific to the
problem). If you choose p-values based on prior knowledge, then the
differences between frequentist and Bayesian are less extreme.
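
A rough sketch of both calculations for this coin example (Python/SciPy; the
uniform Beta(1,1) prior is just an illustrative assumption, not the "right"
prior):

```python
from scipy import stats

heads, flips = 8, 10

# Frequentist: p-value under the null hypothesis of a fair coin.
p_one_sided = stats.binom.sf(heads - 1, flips, 0.5)   # P(X >= 8 | p = 0.5)
p_two_sided = 2 * p_one_sided                         # symmetric null
print(p_one_sided, p_two_sided)                       # ~0.055, ~0.109

# Bayesian: posterior over the coin's bias with a uniform Beta(1,1) prior.
posterior = stats.beta(1 + heads, 1 + flips - heads)  # Beta(9, 3)
print(posterior.sf(0.5))                              # P(bias > 0.5 | data) ~ 0.97
```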

~~~
lisper
A man was walking down a city street when he saw another man wandering around
a lamp post looking at the sidewalk. "What are you doing?" the first man
asked. "Looking for my keys," said the second man. "Oh, did you lose them
around here?" asked the first man. "No," the second man replied, "but the
light is better here."

~~~
throw_away_777
Oftentimes an answer that is approximately correct is better than trying to
find the perfect solution. Most real-world problems, especially in analyzing
data, don't have perfect answers. For example, approximations are
made all the time in physics, because without these approximations the
calculations can't be done. Knowing when to make approximations and what
approximations to make is an essential skill for analyzing data.

------
skybrian
I'm not a statistician but even so I think this article makes assumptions that
may not hold up for computer science. The first thing to do is plot your data.
If it doesn't look like a bell curve, it's unlikely that common statistical
calculations (which assume something close to a Gaussian) apply.

If you're doing benchmarking, another common model is a peak at a minimum
value (when everything goes right) and a long tail, due to various events like
cache misses that always slow things down, but don't happen in every test run.

On a system with multiple programs running (a typical desktop), taking the
mean is meaningless - this just adds noise due to activity unrelated to your
program. You'd be better off taking the minimum, which with enough test runs
should capture all the events that happen every time and none of the events
that don't.

The median or 95th percentile might also be useful if you're investigating
events that don't happen every time. But if you want to know about cold start
performance (for example), maybe the best thing to do would be to flush your
caches before every test run, so the events you're interested in are events
that happen every time.
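
To make that concrete, here's a minimal sketch of summarizing repeated runs
with min/median/p95 instead of the mean (the timings are made-up numbers,
assuming Python):

```python
import statistics

# Hypothetical wall-clock timings (seconds) from repeated runs of one benchmark.
runs = [0.103, 0.101, 0.150, 0.102, 0.240, 0.101, 0.104, 0.102, 0.118, 0.101]

print("min   :", min(runs))                              # everything went right
print("median:", statistics.median(runs))                # typical run
print("p95   :", statistics.quantiles(runs, n=20)[-1])   # tail behaviour
print("mean  :", statistics.fmean(runs))                 # inflated by unrelated noise
```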

~~~
srean
> If it doesn't look like a bell curve, it's unlikely that common statistical
> calculations (which assume something close to gaussian) apply here.

The key word in there is _common_. There is an entire industry of statistical
techniques that do not require the Gaussian assumption or, for that matter,
any parametric assumption.

I strongly feel it is time to retire the Gaussian distribution from the space
it occupies. Discovering and studying the Gaussian distribution and the
bog-standard central limit theorem should be considered one of mankind's
crowning achievements. They deserve to be put on a pedestal to appreciate
their elegance, but when the rubber meets the road one has to open one's mind
to look beyond. The appearance of the Gaussian distribution is rarely as
_normal_ as many expect or claim it to be (I blame the stats education
machinery for this), nor was it invented by Gauss. In fact, Gauss used it as a
post-hoc justification for backing the least-squares method. His original
motivation for least squares was simplicity and convenience, not the normal
distribution or the CLT or, for that matter, the Gauss-Markov theorem.

~~~
thanatropism
The Gaussian distribution is central in continuous-time models for different
reasons:

[https://almostsure.wordpress.com/2010/04/13/levys-characterization-of-brownian-motion/](https://almostsure.wordpress.com/2010/04/13/levys-characterization-of-brownian-motion/)

Basically any reasonable* continuous stochastic process is driven by a
Brownian motion. Also: discontinuous processes are more or less* the sum of a
Brownian motion and a Poisson-type process.

[https://en.wikipedia.org/wiki/L%C3%A9vy_process#L.C3.A9vy.E2...](https://en.wikipedia.org/wiki/L%C3%A9vy_process#L.C3.A9vy.E2.80.93Khintchine_representation)

(* Many details about filtrations, Banach spaces, yadda yadda omitted)

~~~
srean
Yes indeed, but the Brownian motion story is weaker than the CLT story. A lot
more conditions are required, and you have to look at it at the right scale,
in the right way ... then stochastic processes look very much like Brownian
motion. Well, technically the bog-standard CLT is a special case of this,
hence has a simpler story.

------
fela
"it’s telling you that there’s at most an alpha chance that the difference
arose from random chance. In 95 out of 100 parallel universes, your paper
found a difference that actually exists. I’d take that bet."

This is wrong. It’s telling you that there’s at most an alpha chance that a
difference like that (or more extreme) would have arisen from random chance
_if the quantities are actually equal_. And _if the quantities are equal_, 95
out of 100 parallel universes would not be able to reject the null hypothesis.
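
For what it's worth, this is easy to check by simulation (a hypothetical
Python/SciPy sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_universes = 10_000

# Both "quantities" are actually equal: samples drawn from the same N(0, 1).
rejections = sum(
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue < alpha
    for _ in range(n_universes)
)

# Roughly 5% of "parallel universes" reject the null even though the
# underlying means are identical.
print(rejections / n_universes)  # ~0.05
```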

Is he saying that he would take the xkcd bet[0] on the frequentist side?

[0] [https://xkcd.com/1132/](https://xkcd.com/1132/)

------
frozenport
The t-test assumes a normal distribution, which is rarely true, especially
when the number of runs is under 100. A better test is the Mann-Whitney U
test, which is applicable to a wider class of distributions.
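
A tiny sketch of the difference on long-tailed data (made-up lognormal
"timings", assuming Python/SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical benchmark timings: long-tailed (lognormal), far from Gaussian.
runs_a = rng.lognormal(mean=0.0, sigma=0.5, size=30)
runs_b = rng.lognormal(mean=0.2, sigma=0.5, size=30)

print(stats.ttest_ind(runs_a, runs_b))     # assumes (approximate) normality
print(stats.mannwhitneyu(runs_a, runs_b))  # rank-based, no normality assumption
```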

~~~
platz
The central limit theorem means there are lots of cases where normal
distributions are directly applicable.

~~~
ekianjo
The central limit theorem applies to independent variables only. If you are
not sure your variables are independent you cannot rely on that assumption.

~~~
ronald_raygun
That's not 100% true. There are lots of different theorems that are "central
limit theorems", and that work across different cases.

You can have CLTs with non-iid variables (either they aren't identically
distributed or they aren't independent). The math just becomes _much_ harder,
and you have to assume specific dependence structures.

For example
[https://en.wikipedia.org/wiki/Martingale_central_limit_theor...](https://en.wikipedia.org/wiki/Martingale_central_limit_theorem)

~~~
ekianjo
Thanks for the reference - can you provide an example where one would use the
Martingale CLT, if you are aware of any?

~~~
srean
Look for estimation or inference problems in the context of stochastic
processes. For a simpler example, take a look at sequential hypothesis
testing. It is quite ubiquitous, but not always called out by name.

------
amelius
I don't like how the article tries to push statistics on the reader. If a CS
paper compares a pair of averages, then that gives certain information. If
statistics can add to that, and make the results a little more precise, then
that is nice. But by no means is it absolutely necessary. And statistics will
not give a conclusive result either.

I think that authors should use statistics when they see fit, and when it does
not distract too much from the original subject of the paper.

~~~
samps
Needless to say, I disagree. It can be straight-up misleading to report means
without including a more nuanced view of the distribution. You don't need to
use a bunch of fancy statistics, but you do need to consider whether your
results could have arisen by random chance. That's not a distraction; it's
accurately reporting what you found.

Here's one frightening example of spurious performance results in CS:
[https://www.cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf](https://www.cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf)

~~~
amelius
> It can be straight-up misleading

It is only misleading if the reader doesn't understand statistics. There is,
imho, nothing wrong with putting all your focus on the subject matter, and
skipping the statistics while being frank about it.

Also, if you need statistics to show that your method is better than other
methods, then perhaps your method is not really that much better.

~~~
throw_away_777
If your analysis involves data, you can't skip the statistics. If there is no
analysis of data then go ahead and skip the stats all you want. You need to
try and determine the uncertainty in your results, both systematic and
statistical.
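
One cheap way to put a number on the statistical part is a bootstrap; a
minimal sketch with made-up measurements (assuming Python/NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements, e.g. speedups observed across 20 runs.
data = rng.normal(1.15, 0.10, size=20)

# Bootstrap the sampling distribution of the mean to get a rough 95% CI.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(10_000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```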

------
glangdale
The idea that you should 'plot the error bars' ahead of, well, _looking at the
data_ seems a bit premature. As many other comments have stated, looking at
the data first is critical.

It drives me up the wall: we have 1200dpi printers, retina displays, and so
on, and yet somehow people feel the need to collapse everything they've done
to these giant finger-painting quality bar charts. Statistical tests are well
and good, but I'm amazed at the extent to which smart people will happily plug
data which they have never actually _seen_ into statistical metrics. So a mean
might be derived from 9 reasonable results and a howlingly off factor-of-2
outlier, and you can dutifully plug this series into a bunch of standard tests
and speak confidently about p-values.
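
For example (made-up numbers, a hypothetical Python sketch):

```python
import numpy as np

# Nine reasonable runs plus one howlingly-off factor-of-2 outlier.
runs = np.array([1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00, 1.02, 2.05])

print(runs.mean())      # ~1.11 -- pulled well above every typical run
print(np.median(runs))  # ~1.005 -- barely notices the outlier

# Sorting (or a one-line plot) makes the outlier obvious before any
# test statistic or p-value gets computed.
print(np.sort(runs))
```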

------
ekianjo
That's a good article, but pretty short. There is a lot more ground that could
be covered.

------
wodenokoto
Is there a good resource to learn the underpinnings of P values and T-tests?

I feel like everybody says these are important, shows a formula, and then
arguments ensue about what p=0.95 means, and nobody seems to really know.

~~~
imh
I think any intro stats book should do the trick. As far as I know, the
material in a first stats course is pretty homogeneous. I'm not a
biostatistician, but I happen to like this book [0] for introductory stuff.
Amazon says you can get it used for $26.

[0] [https://www.amazon.com/Principles-Biostatistics-CD-ROM-Marcello-Pagano/dp/0534229026](https://www.amazon.com/Principles-Biostatistics-CD-ROM-Marcello-Pagano/dp/0534229026)

~~~
wodenokoto
I took intro to stats at a business school and switched to computational
linguistics, and honestly, I have only been met with the "it is something that
you do" attitude in regards to p-values.

I'll try to look at an introductory book again and see if it satisfies my
curiosity.

~~~
imh
The general motivation for a p-value is that you can model what your data
should look like under the assumption that your model is correct, but you
can't really say what your data should look like under the assumption that
your model is incorrect. There are just too many ways that it could be
incorrect.

As a concrete example, I might ask you for the distribution of the mean of N
samples given that they come from the standard normal distribution (mean zero,
variance 1). That's easy. The sample mean, which is itself a random variable,
also is normally distributed with a mean of zero and a variance of 1/N. On the
other hand, if I ask you about the mean, but the only info you have is that
your data _isn't_ from a standard normal, then it could be anything! There's
no objective way to say how the sample mean is distributed, given that one
crappy piece of info.
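
You can see the first half of that concretely with a quick simulation (a
hypothetical Python/NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25

# Many datasets of N standard-normal samples; record each dataset's sample mean.
means = rng.normal(0, 1, size=(100_000, N)).mean(axis=1)

print(means.mean())  # ~0
print(means.var())   # ~1/N = 0.04
```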

The most basic thing you can do then, is to assume that your model is true and
see if your data is plausible. If I have a hypothesis that I'm flipping a fair
coin and I get all heads on 10 flips, I'm going to start doubting my
hypothesis. The probability of all heads or all tails with a fair coin is only
1/512 ≈ 0.002. P-values formalize that notion. We call the hypothesis we can
model our "null hypothesis", and see if we get data that makes sense with it.
If your observations are some of the most unlikely ones according to your null
model, let's start doubting the model. That's it.

The benefit and trouble are both that we dodged the entire question of what an
alternative to our model could be, and how the data looks under those
alternatives. Ignoring that incredibly important question can give rise to a
weird way of thinking, and opens the door to some conceptually mind-bending
mistakes, but it all comes from a simple interpretation of a p-value: how
unlikely is your data given your null hypothesis (given the model you're
trying to test)? Formally, this tends to be "what is the probability that some
statistic is this unlikely or worse?"

Does that make any sense?

------
petters
One common mistake is to take the average of run times (for identical runs). I
think taking the minimum should be better under reasonable assumptions.

Edit: I now see that this was mentioned elsewhere here. Good!

