
Paradoxes of probability and other statistical strangeness - seycombi
https://theconversation.com/paradoxes-of-probability-and-other-statistical-strangeness-74440
======
naftaliharris
A less popular but perhaps more influential phenomenon is Stein's Paradox [1].
Here's a provocative example often given to illustrate it: Say you have a
baseball player, soccer player, and football player, and you wish to estimate
the true mean number of home runs, goals, and touchdowns each scores per year.
If you have their last ten seasons' worth of data, then the obvious thing to
do is to estimate each player's true yearly mean by their average yearly score
from the last ten years. (E.g., the baseball player hits an average of 20 home
runs each year, so let's estimate their true mean yearly home runs as 20).
Stein's Paradox says that you can actually do a lot better than this.

Even crazier, the James-Stein estimator, which does this, actually uses data
about the football player and soccer player to make predictions about the
baseball player (and vice versa). This is deeply unintuitive to most people,
since the players aren't related to each other at all. The phenomenon only
holds with at least three players; it doesn't work for two.

(More generally, Stein's Paradox is the fact that if you have p >= 3
independent Gaussians with a known variance, you can do better in estimating
their p-dimensional mean than just using their sample means).

I've spent a bunch of time trying to understand why this actually works [2];
to be honest, I still don't deeply understand it. But nonetheless, the
consensus is that the same shrinkage phenomenon is what causes the improved
performance of a variety of high-dimensional estimators (e.g., lasso or ridge
regression), making the paradox very influential.

[1]
[https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator](https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator)
[2]
[https://www.naftaliharris.com/blog/steinviz/](https://www.naftaliharris.com/blog/steinviz/)
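The shrinkage claim is easy to check numerically. Here's a rough Python sketch
(the true means are made up; one noisy observation per coordinate with unit
variance, which is the setting of the paradox), comparing the total squared
error of the raw observations against the James-Stein shrinkage estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 5, 100_000
theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up true means

# One observation per coordinate, unit variance.
x = rng.normal(theta, 1.0, size=(trials, p))

# James-Stein: shrink every coordinate toward the origin by a common factor.
shrink = 1.0 - (p - 2) / (x ** 2).sum(axis=1, keepdims=True)
js = shrink * x

mse_mle = ((x - theta) ** 2).sum(axis=1).mean()  # risk of the raw estimates
mse_js = ((js - theta) ** 2).sum(axis=1).mean()  # risk of James-Stein
print(mse_mle, mse_js)  # the James-Stein total error comes out lower
```

The gap grows the closer the true means are to the shrinkage target, but for
p >= 3 it is never negative in expectation, no matter what the true means are,
which is exactly the paradox.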

~~~
ralfd
I don't understand. If the average of the last 10 seasons is 20 home runs,
what would be a better predicted value? You are a bit short on explanation
here?

Your site, and the Wiki link, are very math-formula heavy. Is there an
explanation for someone who forgot all his statistics courses and Greek-letter
thingys?

~~~
stinkytaco
This is _maybe_ a better explanation:

[https://jmanton.wordpress.com/2010/06/05/comments-on-james-stein-estimation-theory/](https://jmanton.wordpress.com/2010/06/05/comments-on-james-stein-estimation-theory/)

It's still math heavy, but there is some explanation. It's hard to explain
without the math, since the math is fairly integral to it; that's why it's
such an amazing discovery. My understanding is that it's saying that the
variables are independent, but the measurement is not. So in the case of the
athletes, it's not that home runs predict touchdowns or goals, but that by
using a Stein estimator we would get a more accurate measure of all three in
aggregate. The example used in the article is less interesting, but probably
better for understanding:

For example, if i=1,...3 represents the financial cost of claims a multi-
national insurance company will incur in the next year in three different
countries, the company may be less concerned with estimating the values of the
individual means accurately and more concerned with getting an accurate
overall estimate.

------
pmoriarty
My favorite probability paradox has always been the Monty Hall problem[1]:

 _Suppose you 're on a game show, and you're given the choice of three doors:_

 _Behind one door is a car; behind the others, goats._

 _You pick a door, say No. 1, and the host, who knows what 's behind the
doors, opens another door, say No. 3, which has a goat._

 _He then says to you, "Do you want to pick door No. 2?"_

 _Is it to your advantage to switch your choice?_

[1] -
[https://en.wikipedia.org/wiki/Monty_Hall_problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)

~~~
casta
If you see this problem as a special case of another problem the answer
becomes pretty intuitive.

Suppose there are N (let's pick 1000) doors: Behind one door is a car; behind
the others N-1 (999), goats. You pick a door, say No. 1, and the host, who
knows what's behind the doors, opens N-2 (998) doors, excluding No. 1, all of
which have goats.

Now there are only two doors closed, the one you picked No. 1, and one that
was part of N-1 (999) doors. He then says to you, "Do you want to pick door
No. 1 or the other one?" Is it to your advantage to switch your choice?

~~~
p1esk
It did not get any more intuitive for me.

~~~
yellowstuff
Let me try. Say you pick the prize door initially. Obviously, if you switch
you will now have a non-prize door.

Say you pick one of the two non-prize doors initially. The host now opens the
other non-prize door, leaving only the prize door. If you switch you get the
prize door.

So switching always changes the prize door to a non-prize door and a non-prize
door to the prize door. You start on the prize door 1/3 of the time and a non-
prize door 2/3, so you should always switch.
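For anyone who still wants a sanity check, this argument is easy to simulate.
A quick Python sketch (which goat door the host leaves closed when you picked
the car doesn't affect the win rate, so an arbitrary choice is fine):

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car, pick = random.randrange(3), random.randrange(3)
        # The host opens a goat door; the remaining closed door is the car
        # if your pick was wrong, otherwise an arbitrary goat door.
        other = car if pick != car else (pick + 1) % 3
        if switch:
            pick = other
        wins += (pick == car)
    return wins / trials

print(play(switch=False))  # close to 1/3
print(play(switch=True))   # close to 2/3
```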

~~~
p1esk
Actually, this makes sense! Thanks.

~~~
deong
In teaching a bunch of discrete math classes over the years, I've very rarely
found someone for whom the "imagine it's 1000 doors" version helps at all. The
only way that gives any insight is if you've already internalized the correct
answer. It makes the numbers more extreme; it does nothing to make them make
intuitive sense.

The version that helped you is the version that helps nearly everyone in my
experience.

~~~
diogofranco
I think the 1000 doors version does help, if you clearly explain the point
that "now, if you switch, the only way you lose is if you had guessed the
right door out of the 1000!"

------
danbruc
No need to look at fancy paradoxes, just think about the following.

 _What does it mean that tossing a fair coin has a 50 % probability of showing
heads?_

If you think you know the answer, you are probably wrong.

EDIT: Instead of just voting this down, try to give an answer. If you think it
is easy, you have not thought about it carefully enough.

~~~
yellowstuff
The frequentist answer is that you could get the proportion of tosses that
come up heads arbitrarily close to 50% in a large enough sample.

The Bayesian answer is that given all the evidence available to me I would be
willing to bet $1 to win $1 if the next toss comes up heads. (Since I work in
finance I'll add that this assumes I'm risk neutral, so losing $1 is exactly
as bad for me as winning $1 is good.)

Except for the risk-neutrality detail this is all Probability 101, right? Or
are you thinking of something else?

~~~
danbruc
That the relative frequency converges to 50 % is obviously not true. You would
have to be exceptionally lucky, but you could get heads, heads, tails repeated
forever, and the relative frequency of heads would then fluctuate by
increasingly tiny amounts around 66.(6) %. This of course has probability
zero, but it is not impossible. And there are of course many other sequences
of outcomes for which the relative frequency does not converge to 50 %. So at
best you could say that the relative frequency converges to 50 % with high
probability, but now you have a circular definition because you make use of
probabilities while defining probabilities.

The Bayesian view is problematic for several reasons. Why do I need someone
with beliefs about the coin? We are talking about intrinsic properties of
tossing a coin, after all. And if that is not enough, we also throw some
betting in. Tossing a coin certainly does not depend on the invention of money
and gambling, at least ignoring that coins are usually money. Last but not
least, you have to explain where your belief in a 50 % probability for heads
comes from and how it is justified. I could certainly believe that the
probability for heads is 25 %; that would not be a good belief.

~~~
hellogoodbyeeee
>That the relative frequency converges to 50 % is obviously not true. You
would have to be exceptionally lucky, but you could get heads, heads, tails
repeated forever

This would violate the law of large numbers. You may end up with the sequence
HHT 10 million times, but the chances that you continue to get that sequence
for another 10 million times are all but zero, and they get even smaller as
you add another 10 million trials. As the number of trials approaches
infinity, you will arrive at 50%.
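The convergence is easy to see empirically. A quick Python sketch (of course
this only illustrates the typical, "almost sure" behaviour the law describes,
not a guarantee for every conceivable sequence):

```python
import random

random.seed(42)  # any seed; the trend is the same
heads = 0
for n in range(1, 1_000_001):
    heads += random.random() < 0.5  # one fair coin toss
    if n in (10, 1_000, 100_000, 1_000_000):
        print(n, heads / n)  # the running frequency drifts toward 0.5
```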

~~~
danbruc
No, the law of large numbers does not assert that all sequences of outcomes
converge, only that this happens almost surely. Or look at it the other way
round, what mechanism would prevent heads, heads, tails repeated forever? I
can certainly get heads, heads, tails on the first three tosses. After that I
start over, three more tosses all independent of what just happened, again a
12.5 % chance for heads, heads, tails. Why could this not continue forever?

~~~
hellogoodbyeeee
I'm pretty sure your confusion about probability stems from you not
understanding the mathematical concept of a "limit". Your sequence HHT has a
12.5% chance of happening, but so does the sequence TTH and as the number of
trials approaches infinity, you will get similar counts of the two because
they have equivalent probabilities.

~~~
danbruc
No, I will get similar counts with high probability but not surely. There are
sequences of probability zero that do not converge. Think about it this way.
Every coin toss in a sequence, finite or infinite, on its own can surely turn
out to be heads, can't it? And all tosses are independent, aren't they? So why
can't all tosses turn out heads? Unless you have a convincing argument why
some tosses have to yield tails eventually, you have to deal with the fact
that not all sequences of tosses converge to the expected probability.

~~~
jgehrcke
> Every coin toss in a sequence, finite or infinite, on its own can surely turn
> out to be heads, can't it?

I would agree if the sentence contained just the word "finite". The "or
infinite" is where you are thinking too intuitively, and not mathematically
correct anymore. The difference between "finite" and "infinite" is precisely
the solution to this paradox in your mind.

I hope to be able to point out that it is easy to correct for that with just a
bit of structured (but maybe non-intuitive) thinking.

One of the simplest (and, I would argue, most complete) _definitions_ of "a
probability of 0.5 for both test outcomes A and B" is that, given an
_infinitely_ large sample size, half of the samples show result A, while the
other half shows result B. Think about this for a second, and use this
opportunity to appreciate again that half of infinity is still infinity.

This relates the mathematical concept of _infinity_ to the _definition_ of
_probability_.

With this definition, it probably feels like I so far just reworded your
question. That may not be satisfying. So, I would like to encourage you to
imagine that you have superpowers and can actually perform an _infinite_
number of tests. You do that on a sunny day and observe that _all_ test
outcomes were the same: A. You call it a day and you can conclude (using the
mathematical definition from above) in your diary of days-with-superpowers:
"Today I have empirically determined that the test shows outcome A with a
probability of 1". You might smile and add "Peter said that outcome A has a
probability of 0.5, but I have proven him wrong".

In other words: if you do an infinite number of tests, the normalized
distribution of test results precisely _is_ the probability distribution of
test results.

I think we have learned by now that the concepts of infinity and probability
are deeply related and can, by definition, be used to explain each other. That
might still not be satisfying. So, I would like to focus on the "finite" case
for a bit.

Imagine you don't have super powers anymore, but you're pretty resilient and
motivated and you want to do the experiment to (in)validate Peter's claim:
"The probability for both, outcome A and B, is 0.5 each!".

After 1,333,337 tests you have seen 1,333,337 test results showing A. You're
tired from all the testing and you complain (correctly!): "it is now really
pretty unlikely that Peter is right! I am pretty damn sure that he is wrong!
How long do I still need to do this to be _absolutely sure_?" -- and then a
voice from the darknet reminds you: "for being absolutely sure, that is, for
finding an answer that is correct with a probability of 1, you need to have
super powers and make an infinite number of tests -- sorry dude, you can't do
that, ever, because it's inconvenient, takes infinitely long and such -- so I
need to disappoint you, you will never be sure, but maybe just enjoy your life
as much as you can".

 _Infinity_ usually does not allow for actual intuitive thinking. But there
are a few really simple mathematical rules around infinity and convergence
that make it actually pretty simple and again intuitive to deal with the
concept.

~~~
danbruc
I appreciate your attempt, but you did not convince me in the slightest. Let's
take the 1,333,337 coin tosses all heads. This result has no bearing on the
probability of the coin at all. It may make you strongly doubt that the coin
is indeed a fair coin but - and that is the point I am trying to get at -
there is nothing that prevents a fair coin from coming up heads 1,333,337
times in a row. Whatever your experiment shows, it could always be a
statistical fluke.

And the infinite case does not change much, at least not in a way obvious to
me. Back with those super powers, I toss the coin infinitely often and get
heads 50 % of the time. That was fun, let me do that again tomorrow. Well,
again heads 50 % of the time. This is the way it goes for a long time, but
then something strange happens: one day all tosses come up heads. The very
next day everything is back to normal. What is now the probability of heads?
We got two different answers from your way of defining the probability. And
all it took was an extreme statistical outlier on a single day.

~~~
yorwba
> What is now the probability of heads? We got two different answers from your
> way of defining the probability. And all it took was an extreme statistical
> outlier on a single day.

The point of probability is that performing an experiment an infinite number
of times guarantees that every outcome happens with a proportion that exactly
equals its probability (for a formalization of what that even means, look at
measure theory). If you get different proportions on different days, you have
different probabilities. That means, you weren't performing the same
experiment.

------
Houshalter
By far the most unintuitive paradox for me personally is the one presented
here:
[https://youtu.be/go3xtDdsNQM?t=3m27s](https://youtu.be/go3xtDdsNQM?t=3m27s)

"Mr. Jones has 2 children. What is the probability he has a girl if he has a
boy born on Tuesday?" Somehow knowing the day of the week the boy was born on
changes the result. It's completely bizarre.

~~~
prolways
Is there a more rigorous explanation of why they count the probability space
how they do? Watching that video I feel like the ordering of the kids and the
striking of one of the "b2b2" entries seems wrong to me. If we care which kid
was first... which doesn't seem to matter... then the first b2b2 and the
second b2b2 seem like they're different and shouldn't "cancel."

Then again... it took me multiple explanations to understand the Monty Hall
problem.

~~~
kevinwang
Yeah, it definitely seems wrong to me. B2B2 has twice the probability of any
other event listed in that table.

edit: I guess it's more nuanced than that. The explanation and interpretation
sections on this blog post [1] and on wikipedia talk about the controversy.

[1]: [https://jakubmarian.com/the-day-of-the-week-boy-or-girl-paradox-explained/](https://jakubmarian.com/the-day-of-the-week-boy-or-girl-paradox-explained/)

[2]:
[https://en.wikipedia.org/wiki/Boy_or_Girl_paradox](https://en.wikipedia.org/wiki/Boy_or_Girl_paradox)

~~~
Houshalter
Even if that's the case it still doesn't make sense. That changes the
probability to 50% girl instead of 52%. But it should be 2/3rds.

~~~
ralfd
This B2/B2 elimination also immediately jumped out to me as being wrong.

But a YouTube comment wrote:

> "But the thing is, I ran a few tests through a big randomized sample set,
> and... he's right. It's super weird ... the second boy-girl problem had a
> ~51.9% chance of containing a girl. Keep in mind this was about 100 million
> randomized samples too."

I can't wrap my head around it.

~~~
chillee
So I think intuitively, something that might make sense to you is that the
reason BB is more common under the restriction that a boy must be born on a
Tuesday is that having 2 boys increases the probability that you will have a
boy who was born on a Tuesday.

Here's another way to think about it. It's twice as likely to have one girl
and one boy (either BG or GB) compared to having 2 boys. Thus, the version of
the paradox where you're given that the parent has a boy results in a 2/3
chance of the other child being a girl.

However, the more unlikely it is that any given boy satisfies the condition
(in this case the condition would be being born on a Tuesday), the more likely
the BB case becomes compared to BG or GB.

More concretely, if only 1/n of boys satisfy some condition, you would be left
with only 1/n of the BG or GB cases, while you would be left with 2/n - 1/n^2
of the BB cases. In this case, let the population be all parents with at least
one boy (this consists of 1/3 BB, 1/3 BG, and 1/3 GB). Letting n = 7, (BG
union GB) represents (1/7)*(2/3) = 2/21 of the population, while BB represents
(2/7 - 1/49)*(1/3) = 13/147 of the population.

Now, (2/21)/(2/21 + 13/147) = 14/27, our desired result.

I hope that's a more intuitive way of thinking about it. The important part is
realizing the differences between knowing that the parent has a boy vs knowing
that the parent has a boy born on Tuesday.
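The arithmetic above also matches simulation. A rough Python sketch (days are
coded 0-6, with 2 arbitrarily standing in for Tuesday):

```python
import random

def other_child_is_girl(trials=1_000_000):
    hits = girls = 0
    for _ in range(trials):
        # Each child: a random sex and a random birth day of the week.
        kids = [(random.choice("BG"), random.randrange(7)) for _ in range(2)]
        # Condition on: at least one boy born on "Tuesday" (day 2).
        if any(sex == "B" and day == 2 for sex, day in kids):
            hits += 1
            girls += any(sex == "G" for sex, _ in kids)
    return girls / hits

print(other_child_is_girl())  # close to 14/27 ~ 0.519
```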

~~~
Houshalter
>having 2 boys increases the probability that you will have a boy that was
born on Tuesday.

I guess. But the fact that it happened to be Tuesday isn't really important.
The person could just as easily have had a kid on Wednesday. And all the logic
would be the same. And the kid has to be born on some day of the week. How
does finding out what day it was give us any additional information about the
other child? It's completely independent!

~~~
chillee
So that's what confused me to begin with as well.

I think the main thing here depends on your interpretation of how the parent
is chosen.

Let's say the question is the same, but we relax the requirement that the boy
is born on Tuesday. Do you think the probability that the other child is a
girl is 2/3 or 1/2?

------
haddr
I think there is a whole class of statistical "strangeness" around using p
values for hypothesis testing. For instance, a finding at p = 0.05 can still
carry roughly a 30% chance of being a false positive [1], which is far from
what intuition tells us.

[1] [http://www.nature.com/news/scientific-method-statistical-errors-1.14700](http://www.nature.com/news/scientific-method-statistical-errors-1.14700)

~~~
TekMol
"our hypothesis is a false positive"

I wonder what you mean by this. In what sense can a hypothesis be a false
positive?

~~~
haddr
I mean a false positive in the sense that the null hypothesis is wrongly
rejected.

------
z3t4
If you throw a six-sided die two times, there's a 1/6*1/6 = ~3% chance of
hitting six both times. But once you've thrown the first six, there's a ~17%
chance of hitting six again ...

------
prmph
Here is another seemingly basic question that leads down the rabbit hole: What
does it mean to say two things are the same?

------
mrcactu5
my favorite data science paradox is the "curse of dimensionality"

[https://en.wikipedia.org/wiki/Curse_of_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

