
Estimating the chances of something that hasn’t happened yet - devy
https://www.johndcook.com/blog/2010/03/30/statistical-rule-of-three/
======
cperciva
This reminds me of a different "rule of 3": If you want to compare two things
(e.g., "is my new code faster than my old code"), a very simple approach is to
measure each three times. If all three measurements of X are smaller than
all three measurements of Y, you have X < Y with 95% confidence.

This works because the probability of the ordering XXXYYY happening by random
chance is 1/(6 choose 3) = 1/20 = 5%. It's quite a weak approach -- you can
get more sensitivity if you know something about the measurements (e.g., that
errors are normally distributed) -- but for a quick-and-dirty verification of
"this should be a big win" I find that it's very convenient.

~~~
ced
Man, I wish I understood frequentist statistics to know if your reasoning
makes sense. Bayesianly, if O = "the XXXYYY ordering" and F = "algo B is
faster than algo A", then

    
    
        P(F|O) = P(O|F) * P(F) / P(O)
    

then... what? There isn't even a clear P(O|F) likelihood without making
assumptions about the process' noise. If the measurement is very noisy
compared to the gain, then XXXYYY is just dumb luck, and doesn't tell you
anything. If there is no noise at all, then just getting XY is enough to make
a decision.

~~~
mikekchar
The OP is taking some shortcuts. 1/(6 choose 3) means that they start with
the assumption that every possible ordering is equally likely. If that is the
case, then the odds that XXXYYY pops out is 5%. What happens if we change our
assumption? Let's assume that XXXYYY is _more_ likely. This would imply that
algo B is faster than algo A. If we assume that any of the other (or all of
the other) combinations are more likely, then this _reduces_ the odds that
XXXYYY would pop out.

This means that either we had it right (algo B is faster than algo A), or the
result we saw was at least as unlikely as we predicted. That's what it means
to have a "confidence interval".

I'll leave the Bayesian version to someone else because I don't really trust
myself to do it.

~~~
repsilat
> _they start with the assumption that every possible ordering is equally
> likely_

If the times in each run are independent, this assumption is (for _some_
distributions) the weakest form of "X is not faster than Y."

For a distribution where this is not the case: assume

- X always runs in 999 seconds, and

- Y runs in 1000 seconds in 99% of runs and in 0 seconds the other 1% of the
time.

Then XXXYYY is a very likely ordering (~97% chance), though Y runs faster than
X "on average". (Not in median or mode though.)

For a more concrete example: say you have two sorting algorithms, one that is
a little slower most of the time, but worst case O(n log n), and another that
is usually a bit faster but can be O(n^2) on pathological input.
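A sketch of the arithmetic for those hypothetical distributions: XXXYYY occurs exactly when all three Y runs hit the slow case.

```python
# X always takes 999 seconds. Y takes 1000 seconds with probability 0.99
# and 0 seconds with probability 0.01, so XXXYYY (all X's below all Y's)
# requires all three Y runs to land on 1000.
p_xxxyyy = 0.99 ** 3
print(round(p_xxxyyy, 3))   # 0.97

# Yet Y is faster on average: E[Y] = 990 < 999 = E[X].
mean_x = 999
mean_y = 0.99 * 1000 + 0.01 * 0
assert p_xxxyyy > 0.95 and mean_y < mean_x
```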

------
rspeer
If the author is reading this: When someone provides you a pro-bono
translation, by all means credit them, but _do not_ let them host it on their
own site.

Frequently, they are siphoning your PageRank. They will eventually replace the
translation with monetized content of their choice. A good translation takes
work! If you got a translation for free, why should you believe it's good, or
that it has no ulterior motive?

A firm called "WebHostingGeeks" used to do this all the time. They would offer
free translations of blog posts, into languages the author probably didn't
speak (and they didn't either, they were just using machine translation).
They'd ask authors to link to the translation on their site, and over time
they would add their SEO links to the translation.

I first noticed this when WebHostingGeeks offered me a Romanian translation of
ConceptNet documentation. My roommate spoke Romanian, and he said "maybe I'm
not used to reading technical documentation in Romanian but I think this is
nonsense".

~~~
gboudrias
Yeah this is weird. Also, if OP doesn't speak Italian themselves, how can they
attest to the quality of the translation?

~~~
thedirt0115
I have a couple of L2s that I mostly speak and read, with virtually no
writing - my grammar is poor and my vocabulary is limited. Because of this I
would feel uncomfortable/embarrassed translating any tech stuff I wrote into
them. However,
I could read someone else’s translation and know if it’s way off the mark.
Also, I have plenty of friends that speak the language natively, so I could
ask them to review. If they say it’s good, I’d vouch for the translation
because I trust them (even if I couldn’t read it at all).

~~~
gboudrias
Oh ok cool :)

------
madrox
What the author glosses over somewhat is the method of sampling. If you read
the first 20 pages, find no typos, and use this rule to arrive at 15%, that
could be way off. He's assuming the risk of typos is evenly distributed when
there are a lot of reasons it may not be. For example, the first half of the
book could've been more heavily proof-read than the latter half. It's not out
of the question that editors get lazier the farther they get into the book.

If you were to _randomly read 20 pages in a book_ and find no typos, 15%
probability makes more sense.

It's understandable to not mention this in a short blog post about the rule of
three, but never forget that when you're interpreting statistics...how you
build your sample matters.

~~~
nitwit005
Doesn't the book example need to factor in the book length? If you read 20
pages, and it's 20 pages long, you should be 100% confident, but if you read
20 pages and it's 5000 pages long, your confidence should be near 0.

~~~
madrox
Not in this case, because the probability being estimated is "the chance a single
page could contain a typo" and not "the chance the rest of the book could
contain a typo."

To do the latter, you'd need to know the length of the book, yes.

------
citilife
I've seen a very similar problem to this referred to as "black swan
events"[1]. The whole point is you can't actually compute it. You see the
period, it's there. What he's doing here isn't science, it's a guess.

As others have pointed out, it's rather rare that events happen with perfectly
distributed probability (probably the opposite). For instance, the chance of
being in an accident is much higher if an accident just occurred right next to
you. It's way more likely to get sick if others are already sick. In fact,
although I don't have any statistics to back me up, I'd guess that most events
happen in clusters (including spelling errors, or perfect pitch among the
children in a music class).

This is essentially a guess, and it's better to say you don't know than guess
wildly.

[1]
[https://en.wikipedia.org/wiki/Black_swan_theory](https://en.wikipedia.org/wiki/Black_swan_theory)

~~~
carlmr
>The whole point is you can't actually compute it.

You can't compute the probability. But you can compute an _upper bound_ on
the probability with a reasonable confidence, which is what the author is
doing here.

This is standard statistics and might be useful in some cases. It does assume
that the events are independent; that's a pretty standard assumption, though
you have to check whether it applies (at least approximately) in your case.

>What he's doing here isn't science, it's a guess.

Which is a pretty big thing in this subfield of mathematics called statistics.

------
ajkjk
So basically the '3' comes entirely from the choice of a 95% confidence
interval. If you want a 99% confidence interval it's instead the 'rule of
4.6', which doesn't roll off the tongue as well.
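The constant for any confidence level falls out of the same exponential approximation; a quick sketch:

```python
from math import exp, log

# After N trials with zero occurrences, p < k/N holds with confidence
# 1 - exp(-k), so the "rule constant" is k = -log(1 - confidence).
print(round(-log(1 - 0.95), 2))   # 3.0  -> the rule of three
print(round(-log(1 - 0.99), 2))   # 4.61 -> the "rule of 4.6"
print(round(1 - exp(-5), 3))      # 0.993 -> a "rule of 5" gives 99.3%
```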

~~~
jedberg
You could do the 'rule of 5' though and have a confidence of 99.3%, which is
pretty close to 99.

~~~
neolefty
If you're truly estimating, though, wouldn't you use a 50% confidence interval
— which gives you the rule of 1 (chances of the thing being true are 1/n — if
you've seen 20 pages without a typo, chances are 1/20 that there's a typo on a
page, with 50% confidence)?

~~~
hawkice
So, for a 50% confidence interval, you could look at the first word -- if it
is a typo, boom, done. Otherwise, flip a coin.

This stinks. Estimating is about approximating an answer with low information
-- it's about efficiency of using data, not _only_ doing better than guessing.

~~~
neolefty
Sorry, I meant once you've examined n trials, what is your 50% confidence
interval about the odds for a single trial? I think it would be 1/n.

------
motohagiography
Intuitively, this seems like an obverse of the "optimal stopping problem."
[https://en.wikipedia.org/wiki/Optimal_stopping](https://en.wikipedia.org/wiki/Optimal_stopping)
or the subset of the Odds Algorithm
([https://en.wikipedia.org/wiki/Odds_algorithm](https://en.wikipedia.org/wiki/Odds_algorithm))
where the optimal stopping point to find the highest quality item in a sample
is essentially N * 1/e, with 1/e ≈ 0.368.

Handwavily, this resembles the rule of three, where we could probably say
instead, "the rule of 1/e," because both of these appear to be artifacts of
the same relationship and same type of problem.

------
YeGoblynQueenne
So, according to this rule, if I wait for 10 minutes for a bus to come and
none does, and then I wait for another 10 minutes for an alien invasion and
none happens, the two have the same upper bound on their probability?

Or are we going to start talking about priors, on buses and alien invasions,
in which case the rule of three is not really useful? If I want to know how
likely a specific book is to have typos, can't I just go look for statistics
on typos in books, and won't that give me a better estimate than a "rule"
that will give the same results no matter what it is that it's trying to
model?

~~~
PeterisP
This is the scenario where you know _nothing else_ other than you waited 10
minutes for this event and it didn't happen.

Sure, if you have more data, then you can get much tighter bounds for your
estimate.

~~~
YeGoblynQueenne
You always have more data (background knowledge) unless you've only existed in
those last 10 minutes.

And if something has really never happened before, like my alien invasion
example, what have we learned by applying the rule of three?

Honestly- perform the experiment yourself. Wait for X time, then calculate
3/X. Do you now have an upper bound on the probability that an alien invasion
will happen?

~~~
PeterisP
Yes, I do have a reasonable (95% confidence, as per article) upper bound on
the probability - as I haven't noticed an alien invasion during my lifetime,
it seems reasonable to conclude that noticeable alien invasions are very rare
events and happen less frequently than once every decade. Possibly _much_ less
frequently, possibly many orders of magnitude less frequently, possibly never,
but that's how upper bounds work.

~~~
YeGoblynQueenne
An upper bound of "maybe, who knows" is not useful or informative enough to
make a whole "rule" about it.

------
oldgradstudent
The rule of three requires quite a lot of assumptions about the nature of the
phenomena.

Or as Taleb puts it:

> Consider a turkey that is fed every day. Every single feeding will firm up
> the bird's belief that it is the general rule of life to be fed every day by
> friendly members of the human race "looking out for its best interests," as
> a politician would say.

> On the afternoon of the Wednesday before Thanksgiving, something unexpected
> will happen to the turkey. It will incur a revision of belief.

~~~
PeterisP
It certainly does not require unwarranted assumptions; the proposed approach
is consistent with the well-known turkey scenario.

From the experience of being fed, a turkey can infer that being slaughtered is
a rare event, and that the likelihood of it happening exactly tomorrow (without
having access to a calendar) is not necessarily 0, but is below a certain rate
- and it's definitely not likely to happen three times in the next week.
Surviving for 180 days is reasonable justification to assume that, on average,
turkeys get slaughtered less frequently than every 60 days, i.e. that the
likelihood of Thanksgiving suddenly arriving tomorrow isn't 50% but rather
something below 2%.
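A sketch of that arithmetic under the rule-of-three bound:

```python
# Zero slaughter events observed in 180 days bounds the per-day
# probability at 3/180 (with 95% confidence, per the article's rule).
n_days = 180
upper_bound = 3 / n_days
print(round(upper_bound, 4))   # 0.0167 -> below 2% per day,
                               # i.e. rarer than once every 60 days
assert upper_bound < 0.02
```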

~~~
AstralStorm
Completely wrong. Continued survival, by itself, gives no information at all
about survival rates and their distribution. This is because the variability of
the input data is extremely low, so the mutual information between the days is
vanishingly small.

This is why a good experimental design will observe a measurement with some
expected variability.

An even better trick is the Sleeping Beauty problem. To solve it you need
external information.

------
DoctorOetker
I wouldn't use a probability density of typos per page, but instead the
probability that a word is spelled wrong.

Then it's like drawing marbles from a vase containing an unknown proportion of
blue and red marbles.

I would use the formula (M+1)/(N+2), where N is the number of words and M is
the number of mistakes. Note that for a large corpus (M+1)/(N+2) approaches
M/N, so we recover the frequentist probability.

Also note the author (John D Cook) correctly expresses intuitive doubt: 0
typos in 20 pages cannot be construed as certainty of no mistakes. Similarly,
seeing a mistake in every word of a subset does not guarantee that all words
in the book will have a typo. Let's look at the modified formula (M+1)/(N+2)
in these cases: if we observe no typos (M=0) in 1000 words, it estimates the
typo probability as (0+1)/(1000+2) = 1/1002 != 0, so we can't rule out mistakes.
Similarly, if all of the first 1000 words were typos (M=N) we get
1001/1002 != 1, so we can't be sure every future word is a typo.
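A minimal sketch of the estimator in those edge cases:

```python
# Laplace's rule of succession: estimate the per-word typo probability
# as (M+1)/(N+2). It is never exactly 0 or 1 on a finite sample.
def laplace(m, n):
    return (m + 1) / (n + 2)

print(round(laplace(0, 1000), 6))      # 0.000998: no typos seen, none ruled out
print(round(laplace(1000, 1000), 6))   # 0.999002: all typos, still not certain
print(round(laplace(50, 100000), 5))   # 0.00051: close to the raw M/N = 0.0005
```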

Check out Norman D. Megill's paper "Estimating Bernoulli trial probability
from a small sample", where the formula (M+1)/(N+2) is derived (it appears on
page 3):

[https://arxiv.org/abs/1105.1486](https://arxiv.org/abs/1105.1486)

------
saagarjha
> If the sight of math makes you squeamish, you might want to stop reading
> now.

Sigh…another article normalizing the concept that math is something it’s OK to
be uncomfortable about…

------
jeffdavis
I'd be interested to know how to estimate things that are very rare or don't
have a normal distribution.

For instance, let's say I have a bold plan to protect us from meteor strikes
that will cost $100B. How would a person make a decision about whether that's
a good trade-off or not? And how would a mathematician help them make that
decision?

How would it change for more complex cases, like a shield to prevent nuclear
ICBM warfare which has never happened, but we all are worried about?

~~~
carlmr
Simple answer, this is a rough upper bound. The more information about your
problem you have, the better you will be able to model the probability
distribution and the better priors you have on it happening.

------
whack
In the example given, the author says that the odds of a given page having a
typo are _less than 3/20_. Sure, but what if we don't want a range, but an
_exact number_? That sounds like a more interesting challenge to me.

Formal statement:

- You have observed N events, with 0 occurrences of X

- Someone wants to make a bet with you about the likelihood of X happening

- Once you've quoted a number, your counter-party then has the option of
making an even bet about whether the actual likelihood is greater than or less
than your prediction

Eg: If you predict 3/N using a 95% confidence interval, then 95% of the time,
the actual likelihood will be less than 3/N. Your counterparty will then win
the bet 95% of the time, simply by predicting it to be lower.

Your ideal strategy would be to quote a likelihood which is over/under 50% of
the time, not 95%.

Ie, you want to pick E such that 50% of the time, it matches the observation
you've made (no occurrences), and 50% of the time it does not.

E^N = 0.5

N log E = log 0.5

log E = log 0.5 / N

E = 0.5^(1/N)

For the example given, that comes out to 0.966. Ie, there's a 96.6% chance of
no typos on a given page. Across 20 pages, this comes out to 0.966^20 => 50%
chance of no typos. If your goal is to quote the single best estimate which
can hold up well in a betting market, I believe this would be the ideal
strategy.
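A sketch checking those numbers:

```python
# Pick E so that "zero typos in N pages" is exactly a coin flip: E**N == 0.5.
N = 20
E = 0.5 ** (1 / N)        # per-page probability of no typo
print(round(E, 3))        # 0.966
print(round(1 - E, 4))    # 0.0341: the implied per-page typo probability
assert abs(E ** N - 0.5) < 1e-12
```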

~~~
olooney
[https://en.m.wikipedia.org/wiki/Rule_of_succession](https://en.m.wikipedia.org/wiki/Rule_of_succession)

[https://en.m.wikipedia.org/wiki/Sunrise_problem](https://en.m.wikipedia.org/wiki/Sunrise_problem)

~~~
AstralStorm
This presumes conditional independence, violations of which are common.

Instead, you have to estimate the dependency between observations, as in
advanced variants of the Bayesian chain rule. Ultimately, some part of the
estimator will contain an assumption, giving only bounded optimality.

------
AstralStorm
The unmentioned assumption of normally distributed errors is pretty evil.

This is exactly why this approach is worthless for rare events, which by
definition have very skewed distributions. In that case, the probability is
overestimated a lot.

On the other hand, if there is a rare but systematic error, the error
probability will likely be grossly underestimated.

Thought experiment: suppose you're writing a long string of digits that
consists of 1 followed by a large number of zeroes (say 99 for simplicity)
followed by (say 10000) uniformly distributed digits.

Your writing system has an issue that changes half of the 5 digits into 6s.
What error probability will this dumb method estimate after the 100th digit?

A correct Bayesian approach updates the prior based on input variability,
keeping the error estimates high when the input has low variability, etc.
(This can be done with variance or another method.) An even better method
tries to estimate the shape of the input distribution.

In other words, your result would be the likelihood ratio of the input and
output prior distributions (estimated to date - since there are no errors, the
ratio would be 1) minus the likelihood ratio of the posterior distributions.

------
simulate
This reminds me of a 1999 article in the New Yorker by Timothy Ferris called
"How to Predict Everything": [https://www.newyorker.com/magazine/1999/07/12/how-to-
predict...](https://www.newyorker.com/magazine/1999/07/12/how-to-predict-
everything)

> Princeton physicist J. Richard Gott III has an all-purpose method for
> estimating how long things will last. In particular, he has estimated that,
> with 95% confidence, humans are going to be around at least fifty-one
> hundred years, but less than 7.8 million years. Gott calls his procedure the
> Copernican method, a reference to Copernicus' observation that there is
> nothing special about the place of the earth in the universe. Not being
> special plays a key role in Gott's method.

------
BenoitEssiambre
This is cool.

The given example of typos per page kind of highlights the fact that more
sophisticated math involving a prior might give better results in some cases.

The Beta(1, N+1) posterior comes from a uniform prior, i.e. the a priori
assumption that a typo rate of 1% and a typo rate of 99% are equally likely.

Most people would assume that books don't have typos on most pages and a 99%
typo rate is unlikely.

However as your sample gets bigger the prior matters less and less so this
rule is still useful.

Just know that it is reasonable to bias the results a bit according to your
prior when N is small.

------
mihaifm
If you’re wondering where the formula (1-p)^n comes from, it’s a number
often used in gambling (if I throw a die 7 times, what are the chances of
getting at least one 3). The probability of an event with probability p
happening at least once in n trials is 1-(1-p)^n, and he’s using the
complement of that.

[https://en.m.wikipedia.org/wiki/Binomial_distribution](https://en.m.wikipedia.org/wiki/Binomial_distribution)
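A sketch of the die example, assuming a fair die and "at least one 3" in 7 throws:

```python
# Probability of at least one 3 in n throws, via the complement (1-p)**n.
p, n = 1 / 6, 7
p_at_least_one = 1 - (1 - p) ** n
print(round(p_at_least_one, 3))   # 0.721
```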

------
codeulike
This is a useful rule of thumb.

Let's try stretching it: Humans haven't destroyed the world in their 300,000
years of existence, so the probability of them destroying the planet in any
given year is less than 0.00001 (1 in 100,000). I feel like that might be an
underestimate.

Reminds me of the Doomsday Argument

[https://en.wikipedia.org/wiki/Doomsday_argument](https://en.wikipedia.org/wiki/Doomsday_argument)

------
blowski
I’m not very good at maths, so I didn’t understand the whole post. However,
does the size of the whole population affect the “3/n” thing? For example, if
I’ve read 200 pages of a 201 page book and not discovered a typo the chances
are 3/200, if I’ve understood the post correctly. If the book has 20000 pages,
is the probability still 3/200?

~~~
geetfun
N is the number of pages you’ve already sampled in your observation.

If you’ve sampled all the pages, then we are talking about certainty, to which
this wouldn’t apply.

~~~
blowski
In my example, there’s either 1 page or 19800 pages left. But the post
suggests the probability of a typo is the same in both cases.

~~~
babygoat
It's the probability of finding a typo on a page, not the remainder of the
book.

~~~
blowski
Ah now I see the difference, thank you.

------
LeonB
Although interesting, the article doesn’t relate to predicting things that
haven’t happened yet, just things that aren’t known yet.

When predicting things that haven’t happened yet, a publicized “certain”
prediction will inevitably influence the actual probability in an
unpredictable way.

------
johntiger1
This post is of course talking about the difference between MLE and MAP
estimation. Consider the converse case: you flip a coin once and observe it is
heads. Do you then conclude that the probability of heads is 100%? No, because
even though you have data supporting that claim, you also have a strong
_prior_ belief in what your probability of heads should roughly be. This is
encoded in the beta distribution, as mentioned in the article.

------
Mauricio_
This is very similar to the sunrise problem. Laplace said the probability of
the sun rising tomorrow is (k+1)/(k+2), where k is the number of consecutive
days we know the sun has risen, assuming we always saw it rise and have no
other information.

[https://en.m.wikipedia.org/wiki/Sunrise_problem](https://en.m.wikipedia.org/wiki/Sunrise_problem)

------
rlue
I don't know the first thing about schools of thought in statistics
(frequentist? Bayesian?) but something feels fishy about extrapolating a
probability based on sample size (or number of trials) alone. 20 pages, no
typos, <15% chance of flawless spelling? What if the manuscript is 2×10⁴¹
pages long?

Would someone care to explain why the math says it shouldn't matter (for
reasonably small values of p)?

~~~
carlmr
><15% chance of flawless spelling?

per page

------
fernly
Clearly this is a bad estimating rule for some processes. The one that popped
into my mind is the probability of earthquakes (guess where I live...). This
would have the probability of an earthquake declining with each quake-free
year that passes, whereas in fact the USGS would say the opposite.

~~~
gugagore
There's something that feels different about that. You're talking about the
rate of some event occurring (like a Poisson process), and measuring how many
occurrences are in an interval (a year). What are the 6 samples you are
collecting? 6 years of numbers of earthquakes in a year?

Edit: sorry I thought you were replying to another comment.

------
eximius
I feel like I'm having a math stroke.

The posterior probability of p being less than 3/N for Beta(1, N+1) should be
integral(Beta(1, N+1), 0, 3/N), right?

That trends toward zero, so I must be wrong, but I can't for the life of me
remember why.

EDIT: Ah! I was accidentally using Beta instead of the PDF for Beta.
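For this particular posterior the integral has a closed form, which makes the check easy (a sketch):

```python
from math import exp

# The Beta(1, N+1) CDF is 1 - (1 - x)**(N + 1), so the posterior mass
# below 3/N needs no numerical integration.
def beta1_cdf(x, n):
    return 1 - (1 - x) ** (n + 1)

print(round(beta1_cdf(3 / 20, 20), 3))       # 0.967 for N = 20
print(round(beta1_cdf(3 / 1000, 1000), 3))   # 0.951, approaching 1 - exp(-3)
```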

------
sangd
This rule of 3 may be a good example for reading books and finding typos. It
is nowhere near as good for estimating the chances of "something that hasn't
happened". Using this rule and claiming the result as an estimate is rather
arbitrary.

------
mirimir
There's a trivial corollary, which I remember from my first physics class.
Never base anything on less than three measurements. Or maybe it wasn't even
physics, but rather carpentry. The old "measure twice, cut once" rule is
iffy.

~~~
babygoat
Three data points is not the same thing as three measurements of the same
object.

~~~
mirimir
There's no question that three measurements of some property of an object are
three data points. I vaguely recall that my first physics lab experiment was
measuring something with a ruler.

------
TheNewAndy
This feels related, and is interesting:

[https://en.wikipedia.org/wiki/Sunrise_problem](https://en.wikipedia.org/wiki/Sunrise_problem)

(estimating the probability that the sun will rise tomorrow)

------
anotheryou
related question: probability that the sun will rise again tomorrow

[https://en.m.wikipedia.org/wiki/Sunrise_problem](https://en.m.wikipedia.org/wiki/Sunrise_problem)

A practical application of a solution to this is included in reddit's "best"
sorting of comments, leaving room for doubt with low sample sizes of votes.

[https://redditblog.com/2009/10/15/reddits-new-comment-
sortin...](https://redditblog.com/2009/10/15/reddits-new-comment-sorting-
system/)

------
binarysolo
Mildly interesting trivia: there's a Chinese idiom stating "compare 3 shops
for a good deal" (貨比三家不吃虧). I guess that makes sense from a statistical
standpoint. :)

------
Jedi72
I was just on a flight, hanging out with a guy whose PhD was on this topic.
Josh, is that you?? What are the odds.

------
Koshkin
Interesting: the frequentist derivation uses the logarithm, while the
Bayesian one uses the exponential.

~~~
jey
But note that log and exp are inverses of each other, applied to opposite
sides of the equation. In particular, this:

    
    
        1 - exp(-3) ≈ 0.95
    

Can be rewritten as:

    
    
        -3 ≈ log(1 - 0.95)

------
davmar
(1-p)^n is one of my favorite things. happy to see it get some attention.

------
brian_herman__
Murphy’s law “Anything that can go wrong will go wrong”

------
Cyphase
This is from 2010.

------
throwaway487548
Oh, numeric astrology and probability tantras.

The future is not predictable by definition. It is just an abstract concept, a
projection of the mind. Any modeling, however close to reality it might seem
to be, is disconnected from it, like a movie or a cartoon. Following complex
probabilistic inferences based on sophisticated models is like acting in life
guided by movies or tantric literature (unless you are Goldman Sachs, of
course).

For fully observable, discrete, fully deterministic models, such as dice or a
deck of cards, probability can only say how likely a certain outcome might be,
but not (and never) what exactly the next outcome will be.

Estimating anything about non-fully-observable, partially deterministic
environments is fucking numeric astrology with a cosplay of being a math
genius.

No matter what kind of math you pile up (equations from thermodynamics,
Gaussian distributions, or whatnot), it is still disconnected-from-reality
stories, like the ones in tantras.

------
the_cat_kittles
i think you get more bang for your buck if you try to understand the mechanics
that generate successes and failures. assuming a flat prior is crazy in almost
every real world case. in other words, i think effort is probably better spent
understanding the problem rather than understanding how to make the most of
ignorance.

