This works because the probability of the ordering XXXYYY happening by random chance is 1/(6 choose 3) = 1/20 = 5%. It's quite a weak approach -- you can get more sensitivity if you know something about the measurements (e.g., that errors are normally distributed) -- but for a quick-and-dirty verification of "this should be a big win" I find that it's very convenient.
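For anyone who wants to sanity-check the 1/20 figure, here's a minimal Python simulation of the null case, assuming both sets of timings come from the same (here uniform) distribution:

```python
import random

# Under the null hypothesis (X and Y timings come from the same distribution),
# how often do all three X runs beat all three Y runs?
trials = 100_000
hits = sum(
    max(random.random() for _ in range(3)) < min(random.random() for _ in range(3))
    for _ in range(trials)
)
print(hits / trials)  # ~0.05, i.e. 1/20
```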
edit: Ah, got it -- in my case the 3rd result for X can be higher than the 1st result for Y. You said all the times must be smaller.
In the case of checking if x is faster than y 3 times, you're doing x1 < y1 AND x2 < y2 AND x3 < y3.
In the case of checking if all three measurements of x are smaller than all three measurements of y, you're checking x1 < y1 AND x1 < y2 AND x1 < y3 AND x2 < y1 AND x2 < y2 AND x2 < y3 AND x3 < y1 AND x3 < y2 AND x3 < y3.
In other words, the latter case is checking whether the slowest x of three trials is faster than the fastest y of three trials.
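A small sketch of the two checks, with made-up timings, in case the difference is easier to see as code:

```python
# Made-up timings for illustration only.
xs = [1.2, 1.1, 1.3]
ys = [1.4, 1.5, 1.6]

pairwise = all(x < y for x, y in zip(xs, ys))  # x1 < y1 and x2 < y2 and x3 < y3
all_pairs = max(xs) < min(ys)                  # slowest x beats fastest y
print(pairwise, all_pairs)
```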
P(F|O) = P(O|F) * P(F) / P(O)
This means that either we had it right (algo B is faster than algo A), or the result we saw was at least as unlikely as we predicted. That's what it means to have a "confidence interval".
I'll leave the Bayesian version to someone else because I don't really trust myself to do it.
If the times in each run are independent, this assumption is (for some distributions) the weakest form of "X is not faster than Y."
For a distribution where this is not the case: assume
- X always runs in 999 seconds, and
- Y runs in 1000 seconds in 99% of runs and in 0 seconds the other 1% of the time.
Then XXXYYY is a very likely ordering (~97% chance), though Y runs faster than X "on average". (Not in median or mode though.)
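To spell out the arithmetic behind those numbers (this is just the toy example above, nothing from the article):

```python
# All three Y runs have to hit the slow 1000 s case for the ordering XXXYYY.
p_xxxyyy = 0.99 ** 3
mean_x = 999
mean_y = 0.99 * 1000 + 0.01 * 0
print(p_xxxyyy)        # ~0.970
print(mean_x, mean_y)  # 999 vs 990, so Y really is faster on average
```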
For a more concrete example: say you have two sorting algorithms, one that is a little slower most of the time, but worst case O(n log n), and another that is usually a bit faster but can be O(n^2) on pathological input.
That's the idea. The null hypothesis is that this is dumb luck. Here's a piece of evidence that, if this is dumb luck, is not very likely. Ergo, the odds that this is just dumb luck are low and there may be a real effect here.
As the OP said, it's not meant to be used in a scientific paper, it's to let you see if you might be on the right track.
Well in this case, we're only going to consider two hypotheses. Call them H0 and H1. H0 is the "null hypothesis", and says that X and Y are exactly as fast as each other. H1 is the hypothesis that H0 is wrong and Y is totally faster than X. You'll notice that H0 is oddly specific, and H1 is ill-defined. Don't worry about it.
To start off, pick your prior distribution over H0 and H1. Pick whatever you want, because we're going to ignore it shortly.
Now some evidence comes in. Time to update! We got the ordering XXXYYY. First, let's update H0. P(XXXYYY|H0) = 1 / (6 choose 3) = 5%. Wow, that's not a very good update for H0. It's probably just false. For expediency, let's just toss it out.
H1 is the remaining hypothesis. H1 wins! Y is faster than X.
It only makes sense to toss H0 out if P(XXXYYY|H1) >> 5% (such that the evidence for H1 relative to H0 increases after the observation).
You are implicitly assuming that’s the case because “it makes sense”. But as the parent post mentioned, the likelihood is not defined and in particular if the noise in the observation process is large enough, P(XXXYYY|H1) may be very close to 0.05 as well.
Correct. I believe the standard procedure would be to assume that the times for algorithms A and B are normally distributed, put non-informative priors on the parameters, then integrate over the region where F is true. I think the non-informative prior for a normal distribution is P(mu, sigma^2) \propto 1 / (sigma^2). It's harder than this in practice, because you know that the runtime of a program is > 0, and maybe you think that sigma is likely to be quite a bit less than mu. If you choose your prior wisely, I suspect you will actually get something close to 1/20; the added uncertainty will come from cases where |mu_A - mu_B| is roughly less than max(sigma_A, sigma_B), but even there, most of it will cancel out.
A normality assumption may be reasonable -- if the run time of the code is dominated by the sum of a bunch of independent operations on roughly the same timescale, that's what you will get -- but it will fall down if a small number of steps dominates (e.g. HTTP requests or garbage collection or something). If you don't want to assume some kind of parametric prior like that, you need to go into non-parametric Bayesian stats, and you'll end up with a lot more uncertainty.
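If anyone wants to play with this, here's a minimal sketch of that kind of calculation, assuming independent normal run times and the Jeffreys prior P(mu, sigma^2) \propto 1/sigma^2 (so it ignores the positivity constraint and the "sigma << mu" hunch mentioned above); the timing numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_mean_draws(samples, n_draws=100_000):
    """Draw from the posterior of the mean run time under the Jeffreys prior."""
    x = np.asarray(samples, dtype=float)
    n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
    # sigma^2 | data ~ scaled inverse chi-square(n - 1, s2)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    # mu | sigma^2, data ~ Normal(xbar, sigma^2 / n)
    return rng.normal(xbar, np.sqrt(sigma2 / n))

a_times = [1.02, 0.98, 1.01]  # hypothetical timings for algo A
b_times = [0.91, 0.93, 0.90]  # hypothetical timings for algo B
mu_a = posterior_mean_draws(a_times)
mu_b = posterior_mean_draws(b_times)
print("P(B is faster than A on average) ~", (mu_b < mu_a).mean())
```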
I mentioned the "three old and three new" test because it's simple, not because it's powerful.
A bit of maths would show that the probability that all 5 samples lie entirely above or entirely below the median (i.e. the median is NOT between the greatest and the smallest value) is 1/16. That gives you a 15/16 (= 93.75%) chance that the median is contained within the bounds.
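For the record, the arithmetic behind the 1/16:

```python
# Probability that 5 iid samples all land on the same side of the true median.
p_outside = 2 * 0.5 ** 5
print(p_outside, 1 - p_outside)  # 0.0625 and 0.9375
```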
But if you do this test repeatedly, even if the code had identical performance, you'll get a false positive 5% of the time. And depending on the spread of the timings you might not get a clear XXXYYY win even when it is indeed a minor improvement.
One could probably make an argument that not all trials are independent, but not sure what else down-voters saw.
On a multitasking system, you should use the fastest benchmark run (assuming you're running the same code on the same data in all cases, and if you're not then you're not really benchmarking). This will be the one that is least influenced by any other processing going on.
On one hand, I believe a piece of code has a true benchmark time, but we only measure noisy versions that are skewed positive due to multitasking noise. On the other hand, it seems important to make benchmark measurements in the context of the whole system. If A beats B in the best case (unloaded system), but doesn't win in the average case (taking the context into account), then that's important information.
By taking the minimum, we focus attention on the deterministic operations. This isn't likely to be realistic for production, but production won't look much like your benchmark hardware anyway.
When making code changes, deterministic operations are the part you have most influence over and have the most impact. Removing an unnecessary, deterministic operation will speed up every run, not just some of them.
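In Python, for instance, this is roughly the repeat-and-take-the-minimum idiom timeit supports (old_sort, new_sort, make_data and my_module are placeholders, not real functions):

```python
import timeit

# Placeholder setup: substitute your own module and functions here.
setup = "from my_module import old_sort, new_sort, make_data; data = make_data()"

# Best-of-5 timings: the minimum is least affected by other load on the machine.
old = min(timeit.repeat("old_sort(data)", setup=setup, repeat=5, number=100))
new = min(timeit.repeat("new_sort(data)", setup=setup, repeat=5, number=100))
print(f"best of 5 x 100 runs: old={old:.4f}s new={new:.4f}s")
```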
It is very common for the first run to be slower. It means that by "random chance" slow-slow-slow-fast-fast-fast is more likely than fast-fast-fast-slow-slow-slow. I personally tend to discard first runs as outliers when profiling.
"is my new code faster than my old code"
"If X and Y come from the same distribution then all orderings are equally likely"
If we think X and Y distributions are both something like normal with similar variance, then we should also be able to say the chance of XXXYYY given Y is better than X is at most 0.05.
But if the distributions for X and Y can be really different, then I think you're right -- this test could be misleading! For example, say Y always takes 2 seconds, and X takes 1 second 90% of the time, but 1% of the time it takes an hour. If we run three tests of each, we'll probably only see good runs from X and conclude it's better, when it's not.
Frequently, they are siphoning your PageRank. They will eventually replace the translation with monetized content of their choice. A good translation takes work! If you got a translation for free, why should you believe it's good, or that it has no ulterior motive?
A firm called "WebHostingGeeks" used to do this all the time. They would offer free translations of blog posts, into languages the author probably didn't speak (and they didn't either, they were just using machine translation). They'd ask authors to link to the translation on their site, and over time they would add their SEO links to the translation.
I first noticed this when WebHostingGeeks offered me a Romanian translation of ConceptNet documentation, my roommate spoke Romanian, and he said "maybe I'm not used to reading technical documentation in Romanian but I think this is nonsense".
I still think that if a blog post author says "this is a translation of my post", the author should host it as a sibling page on their own server. That allows them to more reliably vouch for its content, and removes an avenue that can be used for trickery.
Maybe the sooner people stop paying attention to it, the sooner search engines will learn to find better metrics.
If you were to randomly read 20 pages in a book and find no typos, 15% probability makes more sense.
It's understandable to not mention this in a short blog post about the rule of three, but never forget that when you're interpreting statistics...how you build your sample matters.
I usually sample books from the middle. I wonder what the heatmap of bookstore reading really looks like?
To do the latter, you'd need to know the length of the book, yes.
I feel like the very first word, "Estimating" made what you're talking about pretty clear. It's extremely common for estimates to make simplifying assumptions that don't necessarily hold if you care about accuracy or specific situations. He also called the rule of three "quick and dirty".
> If you were to randomly read 20 pages in a book and find no typos, 15% probability makes more sense.
OTOH, if typos are actually uniformly distributed, then reading the first 20 is better than a random 20 unless you take care to pick random pages without repeats.
It's not impossible, just location dependent. In some places, there actually is a clear cycle, where the weather alternates between good and bad on a weekly cycle. The Bay Area of California, where HN is headquartered, seems to be one of those places:
The Fog Cycle
Week by week from spring into August, the forces that produce the fog increase in intensity. The Pacific High moves farther north, closer to the latitude of San Francisco, sending out stronger winds; offshore the up-welling of cold bottom waters increases, condensing the winds’ moisture into thicker masses of fog; in the Central Valley, the northward-moving sun sends temperatures to the 100 degree mark and beyond. The hot air rises, sucking cool masses in great drafts through the only break in the Valley’s surrounding mountains, San Francisco Bay. With the ocean air comes the fog, evaporating gradually in the hot, dry air of the Valley, but sometimes penetrating at night as far as Sacramento and Stockton.
The fog seems to come and go in cycles. Until recent years, the conventional explanation for the fog’s behavior was a simple one: as the cool, fog-bearing ocean air is pulled over the coastal hills and across the Bay toward the hot Central Valley (that is, from a high-pressure to a low-pressure area), the nearest parts of the Valley begin to cool off after a few days, much as a draft from an open door lowers the temperature in a warm room.
The incoming cool, heavy sea air replaces the warm, rising land air, and temperatures in Sacramento and Stockton may drop from well above 100 degrees to the “cool” 90s. When the Valley cools sufficiently, the fog-producing machinery breaks down. Without the intense Valley heat to draw the sea air in through the Bay Region, the wind diminishes and no longer carries the fog inland. San Francisco, the Golden Gate, and the coastline are fog free.
Then the process starts all over again. Without the incoming wind and fog, the sun gradually reheats the Valley. The rising warm air again begins to attract the foggy marine air inland. The result is a fog cycle of about a week in length, producing roughly three or four days of fog over the Bay and three or four days of sun.
As others have pointed out, it's rather rare that events happen with uniformly distributed probability (probably the opposite). For instance, the chance of being in an accident is much higher if an accident just occurred right next to you. It's also way more likely you'll get sick if others around you are already sick. In fact, although I don't have any statistics to back me up, I'd guess that most events happen in clusters (including spelling errors, or perfect pitch when you test the children in a music class).
This is essentially a guess, and it's better to say you don't know than guess wildly.
You can't compute the probability. But you can compute an upper bound on the probability with a reasonable confidence, which is what the author is doing here.
This is standard statistics and might be useful in some cases. It does assume that the events are independent, but this is a pretty standard assumption that you have to check whether it applies in your case, or at least approximately applies in your case.
>What he's doing here isn't science, it's a guess.
Which is a pretty big thing in this subfield of mathematics called statistics.
> although I don't have any statistics to back me up, I'd guess that most events happen in clusters
More seriously, yes, before you apply this rule you should think about whether "each event has the same probability of error" is appropriate for your situation.
Another approach would be this: the future is impossible to research, as you can only research entities that exist or phenomena that are happening. Future entities and future phenomena do not exist yet (by definition), so you cannot research them. -> no science of the future.
What we are talking here is scientific prediction. "Scientific" is just another word for good and only means something when compared to another method of prediction that can be shown to be worse.
So we very much agree. I just wanted to write this out because the Future studies crowd has bothered me for some time now.
This stinks. Estimating is about approximating an answer with low information -- it's about efficiency of using data, not _only_ doing better than guessing.
Handwavily, this resembles the rule of three, where we could probably say instead, "the rule of 1/e," because both of these appear to be artifacts of the same relationship and same type of problem.
Or are we going to start talking about priors, on buses and alien invasions, in which case the rule of three is not really useful? If I want to know how likely a specific book is to have typos, can't I just go look for statistics on typos in books, and won't that give me a better estimate than a "rule" that will give the same results no matter what it is that it's trying to estimate?
Sure, if you have more data, then you can get much tighter bounds for your estimate.
To make the math work, I think you need to make several other assumptions. Don't you also have to know that the ten minutes you sampled are representative? That the events (if they did occur) are independent? It seems odd that you'd be in a situation where you can rely on specific assumptions like these while also believing that buses and aliens are equally likely to appear.
And if something has really never happened before, like my alien invasion example, what have we learned by applying the rule of three?
Honestly: perform the experiment yourself. Wait for X time, then calculate 3/X. Do you now have an upper bound on the probability that an alien invasion will happen?
Assuming we would observe an alien invasion and write about it, we have about 5000 years since humans started writing. So 3/5000 per year, i.e. an upper bound on the probability of an alien invasion in any given year (assuming p hasn't changed since we started writing) of 0.06%.
It's an upper bound in any case, meaning the probability is this or less. That includes p = 0. You're just 95% confident that it won't be higher than that.
But maybe the aliens brought us our writing system.
Or as Taleb says it:
> Consider a turkey that is fed every day. Every single feeding will firm up the bird's belief that it is the general rule of life to be fed every day by friendly members of the human race "looking out for its best interests," as a politician would say.
> On the afternoon of the Wednesday before Thanksgiving, something unexpected will happen to the turkey. It will incur a revision of belief.
From the experience of being fed, a turkey can infer that being slaughtered is a rare event, and that the likelihood of it happening exactly tomorrow (without having access to a calendar) is not necessarily 0, but is below a certain rate -- and it's definitely not likely to happen three times in the next week. Surviving for 180 days is reasonable justification to assume that, on average, turkeys get slaughtered less frequently than every 60 days, i.e. that the likelihood of Thanksgiving suddenly arriving tomorrow isn't 50% but rather something below 2% (3/180 = 1/60, or about 1.7%).
This is why a good experimental design will observe a measurement with some expected variability.
An even better trick is the sleeping beauty problem. To solve it you need external information.
Then it's like drawing marbles from a vase containing an unknown proportion of blue and red marbles.
I would use the formula (M+1)/(N+2), where N is the number of words and M is the number of mistakes. Note that for a large corpus (M+1)/(N+2) approaches M/N, so we recover the frequentist probability.
Also note the author (John D Cook) correctly expresses the intuitive doubt that 0 typos in 20 pages can be construed as certainty of no mistakes. Similarly, seeing a mistake in every word of a subset does not guarantee that all words in the book will have a typo. Let's look at the modified formula (M+1)/(N+2) in these cases: if we observe no typos (M=0) in 1000 words, it estimates the typo probability as (0+1)/(1000+2) = 1/1002 != 0, so we can't rule out mistakes. Similarly, if all 1000 words were typos (M=N), we get 1001/1002 != 1, so we can't be sure every future word is a typo.
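As a tiny sketch of that formula (nothing here beyond the two cases above):

```python
def laplace_estimate(mistakes, words):
    """Rule-of-succession estimate (M+1)/(N+2) of the per-word typo probability."""
    return (mistakes + 1) / (words + 2)

print(laplace_estimate(0, 1000))     # ~0.000998: never exactly 0
print(laplace_estimate(1000, 1000))  # ~0.999:    never exactly 1
```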
Check out Norman D Megill's paper "Estimating Bernoulli trial probability from a small sample", where the formula (M+1)/(N+2) is derived (it appears on page 3).
Sigh…another article normalizing the concept that math is something it’s OK to be uncomfortable about…
For instance, let's say I have a bold plan to protect us from meteor strikes that will cost $100B. How would a person make a decision about whether that's a good trade-off or not? And how would a mathematician help them make that decision?
How would it change for more complex cases, like a shield to prevent nuclear ICBM warfare which has never happened, but we all are worried about?
- You have observed N events, with 0 occurrences of X
- Someone wants to make a bet with you about the likelihood of X happening
- Once you've quoted a number, your counter-party then has the option of making an even bet about whether the actual likelihood is greater than or less than your prediction
Eg: If you predict 3/N using a 95% confidence interval, then 95% of the time, the actual likelihood will be less than 3/N. Your counterparty will then win the bet 95% of the time, simply by predicting it to be lower.
Your ideal strategy would be to quote a likelihood which is over/under 50% of the time, not 95%.
Ie, you want to pick E such that 50% of the time, it matches the observation you've made (no occurrences), and 50% of the time it does not.
E^N = 0.5
N log E = log 0.5
log E = log 0.5 / N
E = 0.5^(1/N)
For the example given, that comes out to 0.966. Ie, there's a 96.6% chance of no typos on a given page. Across 20 pages, this comes out to 0.966^20 => 50% chance of no typos. If your goal is to quote the single best estimate which can hold up well in a betting market, I believe this would be the ideal strategy.
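A quick sketch comparing the two numbers, if it helps (N = 20 pages, as in the article's example):

```python
N = 20                           # pages read with no typos
rule_of_three = 3 / N            # 95% upper bound on the per-page typo probability
median_clean = 0.5 ** (1 / N)    # per-page P(no typo) that makes 20 clean pages a coin flip
print(rule_of_three)             # 0.15
print(1 - median_clean)          # ~0.034 per-page typo probability
print(median_clean ** N)         # 0.5, by construction
```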
Instead, you get to estimate the dependency between observations, as in more advanced variants of the Bayesian chain rule. Ultimately, somewhere in the estimator there will be an assumption, giving only bounded optimality.
In particular, if you're considering the probability of a catastrophic event, it's probably better to find some other rationale for estimating the probability than just saying 'it has never happened before'.
This is exactly why this approach is worthless for rare events, which by definition have very skewed distributions. In that case, the probability will likely be overestimated a lot.
On the other hand, if there is a rare but systematic error, the probability will likely be grossly underestimated.
Thought experiment: suppose you're writing a long string of digits that consists of a 1 followed by a large number of zeroes (say 99, for simplicity) followed by (say) 10000 uniformly distributed digits.
Your writing system has an issue that changes half of the 5 digits into 6s. What probability of error will be estimated by this dumb method after the 100th digit?
A correct Bayesian approach updates the prior based on input variability, keeping the error estimates high when the input has low variability, etc. (This can be done with variance or another method.)
An even better method tries to estimate the shape of the input distribution.
In other words, your result would be the difference between the likelihood ratio of the input and output prior distributions (estimated to date -- since there are no errors, the ratio would be 1) and the likelihood ratio of the posterior distributions.
> Princeton physicist J. Richard Gott III has an all-purpose method for estimating how long things will last. In particular, he has estimated that, with 95% confidence, humans are going to be around at least fifty-one hundred years, but less than 7.8 million years. Gott calls his procedure the Copernican method, a reference to Copernicus' observation that there is nothing special about the place of the earth in the universe. Not being special plays a key role in Gott's method.
The given example of typos on page kind of highlights the fact that more sophisticated math involving a prior might give better results in some cases.
The Beta(1, N+1) posterior comes from a uniform prior, i.e. the a priori assumption that a typo rate of 1% and a typo rate of 99% are equally likely.
Most people would assume that books don't have typos on most pages and a 99% typo rate is unlikely.
However as your sample gets bigger the prior matters less and less so this rule is still useful.
Just know that it is reasonable to bias the results a bit according to your prior when N is small.
Let's try stretching it: humans haven't destroyed the world in their 300,000 years of existence, so the probability of them destroying the planet in any given year is less than 0.00001 (1 in 100,000). I feel like that might be an underestimate.
Reminds me of the Doomsday Argument
1 - (1-3/200)^19800
The best way to state your conclusions is to say "after sampling 200 pages at random, we are 95% confident that the true typo-per-page rate is in the interval 0% - 1.5% or equivalently the total number of typos in the 20,000 page book is between 0 and 300."
"... or equivalently the total number of [pages with] typos in the 20,000 page book is between 0 and 300."
We're not measuring the number of typos on a page, but only whether the page contains a typo or not. So we can't speak about the number of typos.
Just as the article itself states regarding the approximation "Since log(1-p) is approximately –p for small values of p" which relies on discarding p-squared terms as insignificantly small.
And I won't even bother with the argument that presence of one typo on a page might actually increase the probability of another typo on the same page or nearby pages. The idea that this principle is based on uniform distribution is discussed enough in other threads.
So after reading 200 pages with no errors and having no other information, the odds of finding an error on any given remaining page are no more than 3/200 (1.5%). With 19,800 more pages left, you'd expect to find no more than about 300 pages with errors.
Of course, the comments here suggest that it just makes the whole concept confusing to an audience who should either already know about confidence intervals or be well prepared to learn about them.
If you’ve sampled all the pages, then we are talking about certainty, to which this wouldn’t apply.
So in your example, according to the rule, the probability of errors is less than 3/200. Which it is, either 0/201 or 1/201.
When predicting things that haven’t happened yet, a publicized “certain” prediction will inevitably influence the actual probability in an unpredictable way.
Would someone care to explain why the math says it shouldn't matter (for reasonably small values of p)?
Edit: sorry I thought you were replying to another comment.
The posterior probability of p being less than 3/N for Beta(1, N+1) should be integral(Beta(1, N+1), 0, 3/N), right?
That trends toward zero, so I must be wrong, but I can't for the life of me remember why.
EDIT: Ah! I was accidentally using Beta instead of the PDF for Beta.
That's the quote of the day.
My father was a carpenter, and every day I think software has more in common with carpentry than computer science.
(estimating the probability that the sun will rise tomorrow)
A practical use of a solution to this is included in reddit's "best" sorting of comments, which leaves some room for doubt with low sample sizes of votes.
1 - exp(-3) ≈ 0.95
-3 ≈ log(1 - 0.95)
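Put differently, a quick check of where the 3 comes from and how close the approximation is, e.g. for N = 20:

```python
import math

N = 20
exact = 1 - 0.05 ** (1 / N)   # largest p with (1 - p)**N >= 0.05
approx = 3 / N                # rule of three
print(exact, approx)          # ~0.139 vs 0.15
print(-math.log(0.05))        # ~3.0, the "3" in the rule of three
```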
The future is not predictable, by definition. It is just an abstract concept, a projection of the mind. Any modeling, however close to reality it might seem to be, is disconnected from it, like a movie or a cartoon. Following complex probabilistic inferences based on sophisticated models is like acting in life guided by movies or tantric literature (unless you are Goldman Sachs, of course).
For fully observable, discrete, fully deterministic models, such as dice or a deck of cards, probability can only say how likely a certain outcome might be, but not (and never) what exactly the next outcome will be.
Estimation of anything about non-fully-observable, partially-deterministic environments is fucking numerical astrology with a cosplay of being a math genius.
No matter what kind of math you pile up -- equations from thermodynamics, Gaussian distributions or what not -- it is still a story disconnected from reality, like the ones in tantras.