
Statistical Formulas For Programmers - bajames
http://www.evanmiller.org/statistical-formulas-for-programmers.html
======
EvanMiller
Uhhhh, Evan Miller here. Not sure why my name is in the submitted title, but
whatever.

The current selection on that page is somewhat limited, but I hope to grow it
over time. The stuff at the beginning is pretty basic (e.g. standard
deviation), but things get pretty gnarly by the time you get to the Kiefer
equation. At some point I'll add some more references on how to implement
things, e.g. finding successive zeros of Bessel functions. For now it should be a
good jumping-off point. Enjoy!
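
For a taste of the basic end of the page, the sample standard deviation can be sketched in a few lines of Python (a minimal version using the n − 1 correction; the function name is my own):

```python
import math

def sample_std(xs):
    """Sample standard deviation with the n - 1 (Bessel) correction."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var)

print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))  # about 2.138
```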

~~~
salimmadjd
HN displays the URL's domain. Which happens to be your name

~~~
pseut
The title had the name in it too originally but it seems that a moderator
changed it.

------
kevinalexbrown
If you're looking for things to add big-picture-wise, it might be helpful to
specify what assumptions go into various tests/methods. In my experience, this
is the biggest hangup and mistake, because a) it's more difficult to
understand and b) ignoring it gives the appearance of rigor even if the test
used is inappropriate for the data.

I should emphasize that this is not a nitpick or even a criticism, just a
feature I would love to see. It's also what I spend a large portion of my time
trying to track down, so having it in a convenient location would be nice.

~~~
scottedwards
Great suggestion. I've been amazed to find out that many coders and amateur
"data scientists" don't realize that testing the assumptions is an important
part of conducting statistical analyses. Part of this may be due to the recent
emphasis on machine learning techniques, which tend to be assumption-free
(often just assuming independence of cases in the sample).

~~~
gtani

        machine learning techniques, which tend to be assumption-free

ML should be a rigorous exercise in Bayesian and classical/frequentist
statistics, computational methods, dataset integrity, visualization, etc., if
you've been through the texts by Murphy or Bishop. It often happens that
people a couple of years out of their last stats class only retain that a high
R-squared and good p-, t-, and F-values are what they're looking for, and that
heteroskedasticity and sphericity are just big words.

My evidence that ML is a rigorous exercise: the free texts listed here
(Barber's, MacKay's, and Smola's are excellent; ESL is not as accessible):

<http://metaoptimize.com/qa/questions/186/good-freely-available-textbooks-on-machine-learning>

~~~
scottedwards
Thanks, @gtani, great resource. I didn't mean to imply that ML techniques are
free of ANY assumptions, just that several of the popular ones, like logistic
regression, don't make distributional assumptions. (Actually, I really want to
understand the VC inequality at some point, as it seems to allow conclusions
about out-of-sample error rates without depending on distributional
assumptions.)

------
pornel
I hoped for it to be more "for programmers", like this one:

<http://gdr.geekhood.net/gdrwpl/metnum.php>

For me formulas written in pseudocode are much easier to understand than
classic mathematical notation.

For example, I've learned Bayesian classification, chi-square, etc. from
Practical Common Lisp
(<http://www.gigamonkeys.com/book/practical-a-spam-filter.html>) after failing
to understand how to apply the formulas from Wikipedia. It was easier for me
to learn Lisp than to decipher abstract declarative mathematical notation
(admittedly I get a brain freeze whenever I see ∑, even though I know what it
means. I prefer `for(…) acc += …`).
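
For what it's worth, the ∑-to-loop translation described above is mechanical; a hypothetical Python sketch:

```python
# The weighted sum written in math notation as Σ_i x_i * w_i,
# expressed as the plain accumulator loop the parent prefers.
xs = [1.0, 2.0, 3.0]
ws = [0.5, 0.25, 0.25]

acc = 0.0
for x, w in zip(xs, ws):
    acc += x * w

print(acc)  # 1.75
```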

~~~
Ixiaus
This is off-topic; but I couldn't help notice your user name. My RL given name
is "Parnell" and people often mis-pronounce it as "Pornell"...

------
pseut
The first example, "unbiased standard deviation" is mislabeled. The estimator
of the variance is unbiased but the square root of an unbiased estimator is
not itself unbiased. So it's not as nit-picky as it looks; it's either a brain
fart or a hole in understanding (especially since the linked Wikipedia page
discusses this issue) [1, 2].

Not to be a dick, but getting the first example wrong like that doesn't
inspire confidence in the rest of the post.

[1] <https://en.wikipedia.org/wiki/Standard_deviation>

[2] <https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation>
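
The bias being pointed out here is easy to see by simulation; a rough Python sketch (assuming normal data and a deliberately small sample size of 5):

```python
import math
import random

random.seed(42)

n, trials, true_sd = 5, 100_000, 1.0
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, true_sd) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # unbiased for sigma^2
    total += math.sqrt(var)                           # but its sqrt is biased low

avg_sd = total / trials
print(avg_sd)  # noticeably below 1.0 (theory says about 0.94 for n = 5)
```

The variance estimator averages out to 1.0, but its square root does not, which is exactly the mislabeling being discussed.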

~~~
EvanMiller
Thanks! I have relabeled it. (In a previous draft the entry was for unbiased
variance and the inaccuracy slipped in during the transition.)

~~~
pseut
Your post said, "draft," so I think you're covered for that sort of error. :)

------
christopheraden
Hey Evan, from one statistics guy to another, thanks for fighting the good
fight :). The formulas might benefit from examples, especially with some of
the more complicated cases (KS test and onwards). The important part of
statistics comes from knowing _when_ to apply something, rather than _how_ to
(that part is just math/numerical analysis). A mention of the assumptions of
each of these intervals would be good, too. Too often I see conclusions
invalidated by using a probability model that doesn't make sense. This is a
common failure I see with using a Wald interval for the slope of the
regression line.

~~~
yankoff
While you guys are here, can you recommend a good intro book for statistics?

~~~
christopheraden
Yankoff, you might want to be more specific. Intro statistics in general, or
for computer scientists, or scientists, or looking to learn R at the same
time? I liked Freedman, Pisani, and Purves [1], and have TA'ed using McClave,
Sincich, and Mendenhall [2]. You may want something a little more advanced
than these, but they are pretty good for intro level.

[1]: <http://www.amazon.com/Statistics-4th-David-Freedman/dp/0393929728>

[2]: <http://www.amazon.com/Statistics-11th-Edition-Book-CD/dp/0132069512/ref=pd_sim_b_2>

[3]: <http://stats.stackexchange.com/questions/421/what-book-would-you-recommend-for-non-statistician-scientists>

~~~
yankoff
Yeah, I meant something for computer scientists. I'm currently going through
the Coursera ML course and wanted to learn at least the basics of statistics
in parallel.

Thanks, I'll check out your links.

Btw, what do you think of OpenIntro Statistics?
<http://www.openintro.org/stat/down/OpenIntroStatSecond.pdf>

~~~
christopheraden
You will find the intro books don't talk much about parallel computing. Most
of the general data sets in intro books will be no more than 30 observations.
They are trying to teach classical methods more so than useful computational
techniques. As for parallel statistics, I don't have a good book
recommendation. Most of my knowledge on the topic comes from papers and
vignettes from the R community and not books. Maybe check out one of those
O'Reilly books about big data techniques?

I haven't seen this OpenIntro statistics before. I'll check it out!

------
durbatuluk
As a programmer and statistician, I should warn newcomers about the
assumptions of tests (parametric tests like the t-test). The t-test, for
example, requires equal variances and normality, or you'll make wrong
decisions. These formulas and p-values shouldn't be used as substitutes for
graphics (you should look at the plots first and run the tests later).

Just take care; there are a lot of problems that come from using statistics
the wrong way. Take special care with small sample sizes, and even with large
ones
[<http://scienceblogs.com/mixingmemory/2006/10/31/jeffreylindley-paradox/>]

------
SagelyGuru
Nice, but a little knowledge is a dangerous thing. It is probably safer and
more effective for non-statistician "data scientists" to use Robust
Statistics:
<https://en.wikipedia.org/wiki/Robust_statistics>

~~~
platz
The Wikipedia article talks a lot about dealing with outliers. How can
outliers be removed, replaced, or otherwise handled differently, as the
article suggests? Aren't the outliers part of the data, after all? It seems
like the goal of 'improving performance' here involves tweaking the data to
get the results you want. What have I misunderstood here?

~~~
vwinsyee
The main difference between classical regression using ordinary least squares
(OLS) and robust regression using iterative re-weighted least squares (IRLS)
is that with OLS, all observations are given equal weight and with IRLS,
observations may or may not be given equal weight. Essentially, IRLS gives
outliers and/or influential [1] data points less weight, which may improve the
performance of the overall model since these outliers/influential data would
otherwise cause assumption violations using classical regression. If there are
no outliers, then results from robust and classical regression converge.

I would disagree with SagelyGuru in recommending robust regression for non-
statisticians, though I can see where he or she is coming from. With robust
regression, you don't have to worry as much about assumptions as with
classical regression. But with robust regression, you need to be aware that
the underlying analytical method is different and what that means. For
example, the standard robust regression implementation in R (i.e., the rlm
function in the MASS package) doesn't produce t-statistics or p-values. There
are also warnings that, especially at lower sample sizes, the standard errors
produced by rlm may be unreliable. One recommended way to obtain those
p-values would be to use bootstrapped standard error estimates, so that the
normal-theory approximation applies.

[1] There are different types of robust estimators (e.g., M, S, MM, etc.) that
have different robustness properties.
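
For the curious, here is a bare-bones IRLS sketch in pure Python (hypothetical helper names; k = 1.345 is the usual Huber tuning constant, and real implementations such as R's rlm add proper scale estimation and convergence checks):

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = b + m*x; returns (b, m)."""
    sw  = sum(ws)
    sx  = sum(w * x for w, x in zip(ws, xs))
    sy  = sum(w * y for w, y in zip(ws, ys))
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    m = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    b = (sy - m * sx) / sw
    return b, m

def huber_irls(xs, ys, k=1.345, iters=20):
    """Robust line fit: iteratively re-weighted least squares, Huber weights."""
    b, m = weighted_line_fit(xs, ys, [1.0] * len(xs))   # start from OLS
    for _ in range(iters):
        r = [y - (b + m * x) for x, y in zip(xs, ys)]
        mad = sorted(abs(ri) for ri in r)[len(r) // 2]  # crude median of |r|
        s = mad / 0.6745 or 1.0                         # robust scale estimate
        w = [min(1.0, k * s / max(abs(ri), 1e-12)) for ri in r]
        b, m = weighted_line_fit(xs, ys, w)             # re-fit with new weights
    return b, m

# y = 1 + 2x with one wild outlier: OLS is dragged away, IRLS is not
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] += 50.0
print(weighted_line_fit(xs, ys, [1.0] * 10))  # OLS slope badly inflated
print(huber_irls(xs, ys))                     # close to (1.0, 2.0)
```

On this toy data the plain OLS slope comes out near 4.7, while the Huber-weighted fit stays near the true slope of 2, which is the downweighting behavior described above.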

~~~
platz
Thanks for the explanation - it was helpful

------
deanclatworthy
A nice resource; however, I'd expected to see some actual programming on this
page. Perhaps some sample code in a few languages would be nice. You could put
tabs on a code box to switch between, say, a Python and a PHP implementation.

~~~
esalman
Same here. Theory, formulas, assumptions, etc. are available in textbooks, on
Wikipedia, and elsewhere.

This could be useful- PHP stats functions:
<http://www.php.net/manual/en/ref.stats.php>

------
dreen
I think if you're trying to explain things like these to programmers, you
should consider using actual code (or even pseudocode). It would have the
added benefit of not requiring math-English (for non-native English speakers
like me) or any advanced mathematical notation at all.

------
scottedwards
Great effort, and I certainly hope more coders will get into statistics (most
I know are only interested in machine learning). However, I think your
definition of 1.3 "Confidence Interval around the Mean" could be improved. You
state:

"A confidence interval reflects the set of statistical hypotheses that won't
be rejected at a given significance level. So the confidence interval around
the mean reflects all possible values of the mean that can't be rejected by
the data."

That seems a bit vague and perhaps confusing. Might I suggest something more
like this:

"The confidence interval specifies a range (+/- a multiple of the above
standard error [SE]) around our estimate of the mean (x-bar) such that: if we
repeated our sampling process an infinite number of times (i.e. with the same
sample size and forming a new x-bar and SE each time [and therefore, a new
confidence interval]), Confidence_Level% of those intervals would contain the
population (true) mean."

In addition, I think in this case, at least, there are no assumptions about
the data to worry about, given a sufficiently moderate sample size due to the
Central Limit Theorem (I'm confident about that in the case of the mean
(x-bar), but I'll leave it up to others to correct me if I'm wrong about this
applying to the standard error (SE)).
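
The repeated-sampling definition above can be checked directly by simulation; a rough Python sketch (assuming normal data and the z-based 95% interval):

```python
import random

random.seed(1)
true_mean, true_sd = 10.0, 2.0
n, trials, z = 30, 20_000, 1.96

covered = 0
for _ in range(trials):
    xs = [random.gauss(true_mean, true_sd) for _ in range(n)]
    xbar = sum(xs) / n
    var = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    se = (var / n) ** 0.5                      # standard error of the mean
    if xbar - z * se <= true_mean <= xbar + z * se:
        covered += 1

coverage = covered / trials
print(coverage)  # a bit under 0.95 (about 0.94, since z is used instead of t)
```

Roughly 95% of the intervals contain the true mean, which is exactly the Confidence_Level% statement in the definition above (the small shortfall is from using 1.96 rather than the t critical value at n = 30).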

~~~
bearmf
Confidence intervals are inherently confusing. I have yet to hear a definition
that is both correct and easily understood and remembered.

------
bearmf
I feel this would be quite confusing for an average programmer. It is more
like a cheat sheet for people who have some statistical training but always
have to look the formulas up because they don't use them frequently enough.
For an average programmer, really understanding how linear regression works
and some basic linear algebra would be a good start. A lot of programmers have
trouble even with these "simple" topics.

Most of these formulas are very rarely used even by quantitative analysts. The
most used are for standard deviation and regression. The more complicated ones
are generally used as a part of statistical routines, say, in R. It is very
rare that someone has to code them.

> From a statistical point of view, 5 events is indistinguishable from 7
> events.

What is this supposed to mean? There is a concept of statistical significance,
but if an effect is not statistically significant it does not follow that it
does not exist. Btw, where is the Bayes formula? :)

------
onan_barbarian
This is a good motivation, but it's plunging right into formulas, which is
wrong-headed. Even the mean can be completely meaningless without eyeballing a
plot of the data. Given that this is aimed at non-statisticians, it's critical
that these points are made in Big Flashing Letters before we start handing out
formulas that make people act like they have "superpowers".

The first superpower is to look at the data and see if it makes sense using
your eyes and your brain, not to start spewing confidence intervals.

As other posters have pointed out, it is even more irresponsible to start
waving around things like the t-test without discussing the parametric
assumptions that these things depend on for their validity.

------
chuckcode
Great start on an important topic. Quick extra info: for drawing a trend line
it is often useful to have the intercept as well. Using y = mx + b line
notation, the best-fit intercept is \hat{b} = \bar{y} - m * \bar{x} [1]

Be great to see some pictures to illustrate the formulas and some mention of
robust statistics as I find outliers to be a huge issue in application of
statistical techniques.

[1] <http://en.wikipedia.org/wiki/Simple_linear_regression>
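
To make the slope-plus-intercept formula concrete, a least-squares line fit in plain Python (the function name is my own):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - m * xbar          # the intercept formula from the comment
    return m, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```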

------
cantrevealname
I've wondered for a long time if there's a way to condense certain statistical
(and probability) information into a single How-Much-Should-I-Care number. Can
someone shed some light on this?

To pick a couple examples from health news in the popular press:

NB: I'm making up all the numbers here for the sake of example.

(1) A study shows that people who consume more than 10g of added salt a day
live shorter lives.

But _how much_ shorter? If it's 30 minutes shorter, I don't care about the
study and I'm not going to change my behavior. If it's 6 months shorter, then
I'm interested and might very well do something.

(2) A study shows that people who drink 2 or more cups of coffee a day have
lower risk of Alzheimer's Disease.

But _how much_ lower risk? If the average lifetime risk is 1 in 50, and
drinking coffee lowers it to 1 in 50.003, then I don't want to waste time even
reading the article. If it lowers it to 1 in a 1000, then yes, I might change
my behavior.

So, in the above examples, is there any way to reduce the information into a
single How-Much-Should-I-Care number?

Like this:

(1) A study shows that people who consume more than 10g of added salt a day
have an ____x____ factor shorter life.

(2) A study shows that people who drink 2 or more cups of coffee a day have a
____y____ factor lower risk of Alzheimer's Disease.

Then, by looking at x and y, I can tell at a glance whether some result is
irrelevant, trivial, useful, or groundbreaking. I understand that it'll still
be subjective in the end -- like whether $1, $10, $1000, or $10,000,000 seems
like a lot of money to an individual -- but at least it'll be one number.
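
One textbook candidate for a single number like this is the absolute risk reduction, and its reciprocal, the "number needed to treat"; a sketch using the made-up coffee numbers above:

```python
def absolute_risk_reduction(baseline, treated):
    """Difference in absolute risk between the two groups."""
    return baseline - treated

def number_needed_to_treat(baseline, treated):
    """Roughly: how many people must change behavior to avoid one case."""
    return 1.0 / absolute_risk_reduction(baseline, treated)

# Made-up numbers from the comment: lifetime risk 1 in 50 vs 1 in 1000
arr = absolute_risk_reduction(1 / 50, 1 / 1000)
nnt = number_needed_to_treat(1 / 50, 1 / 1000)
print(arr)  # 0.019
print(nnt)  # about 52.6
```

An NNT of ~53 means about 53 people would have to drink the coffee to prevent one case, which is the kind of at-a-glance "should I care" figure being asked for.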

~~~
kmm
Just a factor is often not enough. If a study showed that cell phone usage
increased the risk of a certain cancer by 40%, that might still not be
interesting if it's an extremely rare cancer and they found 7 cases instead of
5.

Or it doesn't show whether various confounding factors matter, or whether this
is a statistical paradox. Did you know that babies of smokers are healthier
than babies of non-smokers of the same weight? This is because a baby born to
a smoker will have decreased weight because of the smoking, whereas if a baby
of a non-smoker is underweight, it will be for other reasons that are often
worse.

There are heaps of these kinds of paradoxes and pitfalls that need to be taken
into account.

What we need in newspapers and other media is a simplified abstract of the
paper, plus an explanation or endorsement from a real-life statistician with
no relation to the study.

------
Sealy
Great resource. For anybody struggling with ANY formula, or even with maths
class, I'd encourage you to learn how Wolfram Alpha works:

<http://www.wolframalpha.com/>

I used it extensively for the development of different trading bots and
algorithms. They say you need to be a real hotshot at maths to make it in that
field. Little do they know that this world will soon belong to script-kiddies
and hackers! Here's one of my favorites: this widget will rearrange any
equation and make anything you like the subject:

<http://www.wolframalpha.com/widgets/view.jsp?id=4be4308d0f9d17d1da68eea39de9b2ce>

The one thing you will have to learn is how to represent an equation in
text-based form, i.e. to use ^ to signify a power, etc.

~~~
bcbrown
Man, that would have been useful in college.

~~~
Sealy
Check this out for a demo of what it was designed to do.

This will solve the following pair of simultaneous equations (finding both x
and y):

x+y=10, x-y=4

<http://www.wolframalpha.com/input/?i=x%2By%3D10%2C+x-y%3D4>

... and that's just scratching the surface. Kids studying maths these days
don't know how good they have it. I struggled so much when I was younger.

------
Myrmornis
Ugh, recipe-book statistics! This might be useful, but not to programmers. No
one should be applying any of the more complex formulas on this page without a
good basic grounding in the theory, which is something nearly all programmers
will lack (and fair enough). The best place for programmers to start is
understanding the basics of likelihood functions, maximum likelihood
estimation, Bayes rule and likelihood ratio tests. Then look at e.g. the
derivation of the chi-squared test statistic as an approximation to a
likelihood ratio test of multinomial distributions to get a sense of where all
the mysterious formulas of classical statistics come from.
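
The chi-squared-as-LRT-approximation point can be illustrated numerically; a rough Python sketch with made-up multinomial counts:

```python
import math

# Observed multinomial counts (made up) vs a uniform null hypothesis
obs = [30, 14, 34, 45, 27]
n = sum(obs)
exp = [n / len(obs)] * len(obs)      # 30 expected in each of the 5 cells

# Pearson's chi-squared statistic
pearson = sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Likelihood-ratio (G) statistic, from the multinomial log-likelihood ratio
g = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

print(pearson, g)  # about 16.87 vs 17.97: same ballpark, as the Taylor
                   # expansion relating the two statistics predicts
```

Both statistics are compared against the same chi-squared reference distribution, which is why the classical recipe "just compute Σ(O − E)²/E" works at all.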

------
quackerhacker
Thanks for writing this Evan.

Maybe this solidifies the fact that I'm a programmer and not a statistician,
but I got lost after the 2nd hyperbole in the beginning... but I like a
challenge; I may have to read it multiple times until your presentation sinks
in, though :)

------
th0114nd
The link to "how-to-read-an-unlabeled-sales-chart.html" has an extra s after
unlabeled.

------
Ixiaus
Tangent: Why is your product not a SaaS offering!? (I would pay for it)

------
salimmadjd
Great blog! I wish there were cookbook-style code examples with each formula,
in C or even JavaScript, for those of us who have forgotten a lot of math but
can think in code.

------
tieTYT
I like the spirit of this article, but I wish it put much more emphasis on
explaining why/when you'd need this. Examples would be very helpful.

------
sidcool
Isn't the plural form of formula called formulae?

~~~
pseut
Both formulae and formulas are pretty common (and my Firefox spellcheck is
flagging the first and not the second, FWIW).

~~~
sidcool
Same here, firefox flags formulae.

------
novaleaf
Christ, if you're going to post statistical formulas for programmers, post
them as algorithms, not as math formulas.

------
silvertonia
Evan Miller, I love you.

