
List of probability distributions - whosbacon
http://en.wikipedia.org/wiki/List_of_probability_distributions
======
christopheraden
I've surfed around on this article several times while trying to find a
distribution I needed in order to make my rejection samplers work better for
distributions that have no name or are obscure (posteriors in Bayesian
inference often generate some totally whacky distributions--especially once
you get out of the realm of using conjugate priors. Here, there be dragons!).
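
To make the use case concrete, here's a bare-bones rejection sampler for an unnormalized, un-named posterior density. Everything in it (the target, the proposal, the truncation to [-5, 5]) is my own toy choice, just to illustrate the idea:

    import numpy as np
    
    rng = np.random.default_rng(0)
    
    def unnormalized_posterior(x):
        """Some awkward density known only up to a normalizing constant."""
        return np.exp(-0.5 * x**2) * (1 + np.sin(3 * x)**2)
    
    def rejection_sample(size, lo=-5.0, hi=5.0):
        """Propose uniformly on [lo, hi]; accept x with probability f(x)/M,
        where M bounds f on the interval (tails outside are negligible here)."""
        M = unnormalized_posterior(np.linspace(lo, hi, 10_000)).max()
        out = []
        while len(out) < size:
            x = rng.uniform(lo, hi)
            if rng.uniform(0.0, M) < unnormalized_posterior(x):
                out.append(x)
        return np.array(out)
    
    draws = rejection_sample(10_000)
    print("mean %.3f, sd %.3f" % (draws.mean(), draws.std()))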

If you aren't reading probability from a textbook, this article isn't going to
tell you which distributions are the most useful. The Beta distribution
receives half as much space as the Dirac delta function, but for those of you
who want to take a Bayesian approach to A/B testing, the Beta (the conjugate
prior to the binomial) is going to be far more useful than the Dirac -- though
one might argue that the Dirac is interesting from a mathematical perspective,
it being a degenerate distribution and all...
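
For anyone who hasn't seen it, the conjugacy is what makes the Beta so
convenient here: with a Beta(a, b) prior on a binomial success probability,
observing k successes in n trials just bumps the parameters. A minimal sketch
(scipy, with made-up numbers):

    from scipy import stats
    
    a, b = 1, 1        # Beta(1, 1) = uniform prior on the conversion rate
    k, n = 30, 1000    # observed: 30 conversions out of 1000 trials
    
    # Conjugacy: the posterior is again a Beta distribution.
    posterior = stats.beta(a + k, b + n - k)
    
    print("posterior mean %.4f" % posterior.mean())
    print("95%% credible interval (%.4f, %.4f)" % posterior.interval(0.95))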

It turns out, though, that many of the "useful" distributions (by which I mean
the ones that many classical problems are built around) share very similar
properties. If your aim is to get familiar with the most common distributions
encountered in the wild, the article on the exponential family will serve you
better than this overwhelming list.

<http://en.wikipedia.org/wiki/Exponential_family>

~~~
yummyfajitas
The beta distribution is fantastically useful for modelling conversion rates.
I'll take this opportunity to shamelessly plug a blog post I wrote about it
last week.

[http://www.chrisstucchio.com/blog/2013/bayesian_analysis_con...](http://www.chrisstucchio.com/blog/2013/bayesian_analysis_conversion_rates.html)
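
Not Chris's code, just a sketch of the kind of comparison such a post
typically walks through: draw from each arm's Beta posterior and estimate the
probability that B's conversion rate beats A's (the data below is made up):

    import numpy as np
    
    rng = np.random.default_rng(1)
    
    # Made-up A/B data: (conversions, trials) per arm, uniform Beta(1, 1) priors.
    conv_a, n_a = 120, 4000
    conv_b, n_b = 150, 4100
    
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
    
    print("P(rate_B > rate_A) = %.3f" % (post_b > post_a).mean())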

------
prezjordan
I've taken two stats courses, and I still don't understand the derivation of
the Bell Curve (why it happens, why it's shaped like it is, etc). I did well
in the classes, too. It's weird how "unintuitive" probability is to me.

~~~
btilly
I don't blame you. They don't teach it because actually teaching it would be a
major side track, everyone would ask why you need to know it, and it would
generally cause problems.

But it isn't that hard to prove for yourself for the simple case of the
binomial curve if you're willing to supply some elbow grease. Here is an
outline of how I did it when I was an undergrad.

1. Prove that log(n!) = n log(n) - n + log(n)/2 + k + o(1), where k is some
constant. (Trick: compare the sum of log(i) to the integral of log(x) using
the trapezoidal rule; the integral gives you n log(n) - n, and the endpoint
terms leave log(n)/2 + log(1)/2 over. Prove that the sum of the trapezoid
errors converges to a constant, which gives you, for finite n, k + o(1), where
the o(1) term is the tail of the errors that you haven't reached yet.)

2. Suppose that X_1, X_2, etc. are an independent, identically distributed set
of random variables, each with probability 0.5 of being +1 or -1. Prove that
each X_i has mean 0 and variance 1.

3. Consider the random variable Y_n = (X_1 + X_2 + ... + X_n)/sqrt(n). Prove
that it has mean 0 and variance 1.

4. What is the "density" (in quotes because it is discrete) of Y_n around x?
You can calculate an exact formula in terms of binomial coefficients, stick in
your approximation from step 1, and show that the point weight of the
achievable point nearest to x is e^(-x^2/2)(c + o(1))/sqrt(n), where c is a
constant that can be derived from k in step 1. Remember that adjacent
achievable points are a distance 2/sqrt(n) apart, so the density is
proportional to e^(-x^2/2).

5. Look up the standard argument that the integral from -infinity to infinity
of e^(-x^2/2) is sqrt(2 pi). The density of Y_n has to integrate to a total
weight of 1, and substituting that fact in gives you both the standard normal
distribution AND, as an unexpected bonus, Stirling's formula!

(If you wish, you can do the same for a biased coin; the derivation is
slightly more complex but follows the same lines. However, to get the result
for arbitrary probability distributions you need different approaches. People
usually don't encounter that proof until graduate school.)
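
A quick way to see the limit without the algebra is to simulate it. Here's a
minimal sketch (numpy, with parameters I picked arbitrarily, not anything from
the outline above) that draws Y_n for a large n and checks it against the
standard normal:

    import numpy as np
    
    rng = np.random.default_rng(0)
    
    n = 10_000         # number of +/-1 steps per sample
    samples = 200_000  # number of draws of Y_n
    
    # Each X_i is +1 or -1 with probability 1/2, so X_1 + ... + X_n
    # equals 2*Binomial(n, 1/2) - n, and Y_n is that sum / sqrt(n).
    heads = rng.binomial(n, 0.5, size=samples)
    y = (2 * heads - n) / np.sqrt(n)
    
    print("mean      %.3f" % y.mean())            # ~ 0
    print("variance  %.3f" % y.var())             # ~ 1
    print("P(Y <= 1) %.3f" % (y <= 1.0).mean())   # ~ 0.841 for N(0, 1)
    print("P(Y <= 2) %.3f" % (y <= 2.0).mean())   # ~ 0.977 for N(0, 1)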

~~~
jules
Another way to derive the bell curve is as the maximum entropy distribution
with known mean and variance.

The derivation does require one trick that is usually not taught in high
school. You know how you can maximize a function by setting its derivative to
zero? Well, there is a generalization of that to functions of multiple
variables: you set the gradient to zero. But there is yet another
generalization for functions of a continuously infinite number of variables!
That's called the calculus of variations. If we want to maximize the entropy
over all possible probability distributions over the real numbers, then we
have a continuously infinite number of variables, since for each real number x
we need to assign a probability density p(x). Here is the derivation
for those interested:

Let m be the mean we want and v be the variance we want, and let p(x) be the
probability density function.

We want to maximize the entropy S = -int(p(x)*log(p(x))), i.e. minimize
int(p(x)*log(p(x))), given the constraints that:

    
    
        int(p(x)) = 1  // total probability should sum to 1
        E(X) = int(x*p(x)) = m  // mean should be m
        E((X-m)^2) = int((x-m)^2*p(x)) = v  // variance should be v
    

This problem can be solved with calculus of variations with Lagrange
multipliers. The Lagrangian is all the terms inside the integrals summed up:

    
    
        L = p*log(p) + a*p + b*x*p + c*(x-m)^2*p
    

Calculus of variations tells us that for the optimal p, we have:

    
    
        dL/dp = 0
    

This is the analogue of setting the derivative to zero for maximizing a
function.

Calculating that we get:

    
    
        a + bx + c(x-m)^2 + log p + 1 = 0
    

rewriting:

    
    
        p = e^-(a + bx + c(x-m)^2 + 1)
    

We have the general form of the Bell curve. Now we just need to find the right
values for a,b,c to satisfy the 3 constraints.
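
Carrying that last step through (the algebra is standard): the mean constraint
forces b = 0, the variance constraint forces c = 1/(2v), and normalization
fixes a, which leaves

        p(x) = 1/sqrt(2*pi*v) * e^(-(x-m)^2/(2v))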

As special cases of this we have:

1) The probability distribution with maximum entropy and no constraints (on a
bounded interval) is the uniform distribution.

2) The probability distribution with maximum entropy (on the positive reals)
with only the constraint that the mean is m is the exponential distribution.

In general, if you want to find the maximum entropy distribution with known
expectation values E(f(X)) and E(g(X)), it has this form:

    
    
        p(x) = C*e^(a*f(x) + b*g(x))
    

This gives an indication of why so many distributions have this form. For
example, the entire exponential family can also be derived as maximum entropy
distributions with certain known expectation values.

Of course, this derivation is only a good motivation for the bell curve if you
already care about entropy, mean, and variance.
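
To sanity-check the claim numerically, here's a small sketch (scipy; the
comparison distributions are my own choice) comparing the differential
entropy of a few distributions that share mean 0 and variance 1 -- the normal
comes out on top:

    from scipy import stats
    import numpy as np
    
    # Three distributions, all with mean 0 and variance 1.
    candidates = {
        "normal":  stats.norm(loc=0, scale=1),
        "laplace": stats.laplace(loc=0, scale=1 / np.sqrt(2)),            # var = 2*scale^2 = 1
        "uniform": stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # var = width^2/12 = 1
    }
    
    for name, dist in candidates.items():
        print("%-7s variance %.3f  entropy %.4f" % (name, dist.var(), dist.entropy()))
    
    # The normal's differential entropy, 0.5*log(2*pi*e) ~ 1.4189, is the largest.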

------
elchief
Probability distributions are abstractions. No actual data distribution is
Normal, though some approach it, I suppose.

They are a mathematical shortcut. They can make analysis and computation
simpler, and thus cheaper, though they can also destroy economies if you
forget that they're abstractions.

You don't need to use the analytic distributions. You can get by with
simulation and bootstrapping if you have the computational horsepower.
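
For instance, a bootstrap confidence interval for a mean needs no
distributional assumption at all. A minimal sketch (numpy; the data here is a
made-up toy sample):

    import numpy as np
    
    rng = np.random.default_rng(42)
    data = rng.exponential(scale=2.0, size=200)   # toy sample, decidedly non-normal
    
    # Resample the data with replacement many times and look at the
    # spread of the resampled means.
    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(10_000)
    ])
    
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print("sample mean %.3f, 95%% bootstrap CI (%.3f, %.3f)" % (data.mean(), lo, hi))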

~~~
nova
I recommend Robert Kass' article "Statistical inference: the big picture" for
a related point of view on that issue, which should be obvious but is easily
forgotten in practice.

------
hdivider
Here's a possibly interesting idea: in a game, randomly select from some of
these distributions to generate objects (e.g. powerups, enemies, game
balancing constants) whenever the player appears to be doing well. You'd
initially give them the _impression_ that the behaviour is non-random, but
then suddenly everything changes.

Just thinking out loud here. =) In practice you'd probably need a solid game
to begin with to even get the player to notice changes in the distribution
used. I'd say lots of times you don't need more than a plain old uniform
distribution though.

~~~
btilly
In actual games with random behavior, the challenge usually lies in convincing
people that they are just having an unlucky streak but the game is random and
unbiased. Therefore I don't see the point.

(I've encountered this at <http://www.wargear.net/> - and I've sadly needed
to convince myself that it is truly random.)

~~~
dietrichepp
Well, classic problems in statistical analysis deal with
"overdispersion"—e.g., the ratio of boys to girls in individual families has a
variance too high to be modeled as a simple binomial distribution, so you can
model it as a beta-binomial.

In games, the challenge is the opposite: to generate distributions with less
variance in order to convince people that the random number generator is
"fair". You don't have to do this, but some games will tweak the numbers so
they're not independent, in order to keep the sample mean close to its
asymptotic value.

In other words, if you have a losing streak, the software steps in and breaks
it for you, so that you think the RNG is "fair".
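
One standard game-dev trick along these lines (my example, not something
described above) is a "shuffle bag": draw outcomes without replacement from a
pre-mixed bag, so the hit rate per bag is exact even though the individual
draws are no longer independent:

    import random
    
    class ShuffleBag:
        """Yields 'hit'/'miss' outcomes with an exact hit rate per bag.
    
        Draws are made without replacement from a shuffled bag, so the
        long-run frequency is pinned to hits/size, at the cost of the
        draws no longer being independent."""
    
        def __init__(self, hits, size):
            self.contents = ["hit"] * hits + ["miss"] * (size - hits)
            self.bag = []
    
        def draw(self):
            if not self.bag:                 # refill and reshuffle when empty
                self.bag = self.contents[:]
                random.shuffle(self.bag)
            return self.bag.pop()
    
    bag = ShuffleBag(hits=1, size=4)         # nominal 25% drop rate
    drops = [bag.draw() for _ in range(40)]
    print(drops.count("hit") / len(drops))   # exactly 0.25 over every full bag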

------
graycat
Sorry, guys. After advanced study of probability, writing a Ph.D. dissertation
in applied probability, and doing applications of probability in US national
security and business, here's my take on the OP:

(1) Sure, given a random variable taking values in the real numbers, it has a
(cumulative) distribution and may have a probability density.

(2) In some special, fortunate cases, we actually can get some useful
information about the distribution, e.g., that it has an expectation, that the
expectation is finite, that it has a finite variance, and we might even be
able to say that the distribution is in one of the famous families, Gaussian,
exponential, Poisson, etc., and then estimate the parameters of the
distribution, e.g., for a Gaussian, estimate the mean and variance.

(3) While we know that our random variable has a distribution, mostly in
practice we can't know much about that distribution. Commonly, if we know that
the distribution has an expectation, that the expectation is finite, and that
the variance is finite (facts we commonly get just from what we know about the
real problem, with no effort in probability or statistics), then we have to be
satisfied with those 'qualitative' facts and move on. E.g., finite variance is
enough about the distribution to apply the strong law of large numbers!

(4) Mostly when we do know that the distribution is Gaussian, exponential,
etc., it is because we are using one of the classic limit theorems of
probability, e.g., the central limit theorem or the renewal theorem.

(5) If the random variable takes values in, for some positive integer n, some
n-dimensional space over, say, the real or complex numbers, then in practice
knowing much detail about the distribution is next to hopeless. Why? Because
of the 'curse of dimensionality' where the number of points needed for a fine
grid is commonly so high it makes 'big data' look like something on a 3 x 5
index card!
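
(For scale: even a coarse grid of just 100 points per coordinate in n = 10
dimensions already needs 100^10 = 10^20 grid points.)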

Net, mostly we do our work in applied probability knowing, of course, that a
random variable has a cumulative distribution but otherwise knowing at most
only some qualitative aspects, e.g., finite variance, and little else.

So, our more important methods of working with random variables should need
only meager assumptions about the distributions of those random variables.
E.g., order statistics are always sufficient. Commonly, to get the most
desired estimates from least squares techniques, we need only finite mean and
variance; the common Gaussian assumptions about the 'model error terms' are
needed only for doing classic confidence intervals and similar estimates.

Generally there are a lot of 'non-parametric' (distribution-free, where we
make no assumptions about the distribution) techniques in hypothesis testing.
For estimation, there are non-parametric techniques based on, say,
'resampling' for getting confidence intervals. And there are other
distribution-free techniques.

Once I saw a paper in hypothesis testing that was both multidimensional and
distribution-free.

Or, in shortest terms: Yes, a random variable always has a cumulative
distribution, but that doesn't mean that in practice we should always try to
find that distribution; instead, commonly in practice we have little hope of
knowing much about the distribution!

Or, one of the best books in probability I know of is

Jacques Neveu, 'Mathematical Foundations of the Calculus of Probability',
Holden-Day.

and there is at most only meager mention of specific distributions in the
whole book!

~~~
christopheraden
A couple of comments: Anyone in statistics/probability has seen this quote
multiple times: "All models are wrong, but some are useful." -George Box.

We often don't know the underlying distribution of a particular process or of
our data, true, but that doesn't mean we can't get useful information out of
it by making some assumptions, provided that our assumptions are not wildly
wrong. For example, if we assume height is normally distributed, that means
it's possible, under our model, to see someone who is -9000 feet tall. We're
willing to accept this falsity in exchange for some nice properties we
wouldn't have had otherwise, provided we exercise some common sense.

In regard to (3), if all you can wield is the SLLN, you can't do much
inference or prediction. Sure, you can guarantee that the sample mean will
converge to the expectation almost surely, but how close the approximation is
for a finite sample is intrinsically tied to the sample size and the
particulars of the underlying distribution of the random variable.

In regard to (5), the curse of dimensionality does become a large problem, but
throwing out assumptions makes it a far bigger problem. The nonparametrics
community has a lot of trouble with multivariate distributions for precisely
this reason! This is one area where parametric models do better than non-
parametric ones: since they have so much more structure to play with, the
problems become much more tractable.

You mention order statistics... order statistics in higher dimensions are a
tricky concept, not as well defined as in the one-dimensional case. Which is
larger: (3, 2, 1) or (1, 2, 3)? Wouldn't that necessarily depend upon the
underlying distribution or some other measure of distance? If you have a paper
suggesting that general high-dimensional order statistics make sense, throw it
this way. You are correct that least squares doesn't require a distribution to
get mean and variance estimates, but what good is that if you don't know how
much of your population fits within a multiple of your standard deviation
around your mean? You could use something that works for all distributions,
like Chebyshev's inequality, but that level of sweeping generality seriously
hurts your power (if your data IS actually normal, 95% of observations fall
within 1.96 SDs of the mean; Chebyshev only guarantees about 74%), and even
Chebyshev had to impose a restriction to get that bound--it assumes finite
mean and variance.
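
A quick numerical illustration of that gap, computed with scipy rather than
taken from any dataset:

    from scipy import stats
    
    k = 1.96
    normal_coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)  # exact if the data really is normal
    chebyshev_bound = 1 - 1 / k**2                             # guaranteed for ANY finite-variance distribution
    
    print("within %.2f SDs: normal %.3f, Chebyshev guarantee %.3f"
          % (k, normal_coverage, chebyshev_bound))
    # within 1.96 SDs: normal 0.950, Chebyshev guarantee 0.740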

While there are distribution-free hypothesis testing methods, most of
classical nonparametric statistics makes some assumptions about the underlying
distributions, albeit far gentler ones than the parametric assumptions.

Always choosing nonparametric methods is just as short-sighted as always using
parametric methods. It's very possible that you are throwing away vast amounts
of information by using a nonparametric procedure when a parametric one would
have been completely adequate. With ANOVA and two-sample tests, you don't lose
much by going with Kruskal-Wallis or Mann-Whitney tests (under normality,
Mann-Whitney has an asymptotic relative efficiency of 3/pi versus the t-test),
but in other circumstances, using a nonparametric method can be far worse when
the parametric assumptions are true.

What I'm getting at is that while nonparametrics is nice in that it doesn't
require many assumptions, it may be throwing away too much. Fitting a
distribution may be very useful in making inferences and predictions. All
models are wrong, but some are useful.

~~~
graycat
Spoken like a true statistician! Yes, the statisticians keep assuming
Gaussian, fitting distributions to data, etc., and you have a way out: Of
course it's wrong, but it's still useful! Wow!

> If you have a paper to suggest that general high-dimensional order
> statistics makes sense, throw it this way.

Do an hypothesis test. For positive integers m and n, consider m > 1 points in
real Euclidean n-space. We want to know if point m is distributed like the
rest. Our null hypothesis is that all the points have the same distribution
and are i.i.d.

Now, for positive integer k, for each of the m points we can find the distance
to the farthest of its k nearest neighbors. So, we get m distances. These,
however, are not i.i.d. But we can do a little work to show that we have a
finite group of measure-preserving transformations such that, if we sum over
the group, we get something similar to a permutation test. So, if the distance
for point m is in the upper 2% of the distances, then we have a probability of
type I error of 2% and, thus, an hypothesis test. So, we have a
multidimensional, distribution-free hypothesis test. That is, intuitively, if
point m is 'too far away' from the other m - 1 points, then we reject the null
hypothesis that point m is distributed like the other m - 1 points (we
continue to believe in the independence assumption and do not reject it). But
k-nearest neighbors with the Euclidean norm need not be the only statistic
used -- so we get a family of such tests.

Intuitively, for n = 2, we have a density that looks like, say, a mountain
range. Suppose the rocks of the range are porous to water and we pour in some
water. Then the water all seeks the same level. So, we have multiple lakes,
with islands, with lakes, etc. Suppose the water covers 2% of the probability
mass. Then the test is to see if point m falls in the water.

Yes, we expect the lake boundaries to be fractals. So, right, with enough
points m, we are approximating the fractal boundaries of the lakes. Increasing
k makes the approximation to the boundaries smoother. If we pour in more
water, then we get another set of contours. So, we get a way to do contour
maps of the density. That is, we get a technique for multidimensional density
estimation.

E.g., go to a beach, pick up m = 200 rocks, and for each measure n = 10
properties. Now given one more rock, ask if it came from that beach!
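
A rough sketch of the test as I read it (the parameters, data, and
tie-breaking details here are my own, purely for illustration):

    import numpy as np
    
    def knn_outlier_test(points, k=5, alpha=0.02):
        """Flag the last point as anomalous if its distance to its k-th nearest
        neighbor falls in the upper alpha fraction of all m such distances."""
        m = points.shape[0]
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
        np.fill_diagonal(dists, np.inf)             # ignore self-distances
        kth = np.sort(dists, axis=1)[:, k - 1]      # distance to k-th nearest neighbor
        threshold = np.sort(kth)[int(np.ceil((1 - alpha) * m)) - 1]
        return kth[-1] > threshold
    
    rng = np.random.default_rng(0)
    beach = rng.normal(size=(200, 10))        # m = 200 rocks, n = 10 measured properties
    new_rock = rng.normal(loc=5.0, size=(1, 10))
    print(knn_outlier_test(np.vstack([beach, new_rock])))   # likely True: not from this beach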

As I recall, there is such a paper in 'Information Sciences' in 1999 about an
'anomaly detector'. Why 'Information Sciences'? Because the paper suggested
monitoring large server farms and networks for 'health and wellness' by this
technique. So, we would get multidimensional, distribution-free 'behavioral
monitoring' with a false alarm rate we could set in advance and get exactly.
The work does not promise to have the best power in the sense of the Neyman-
Pearson result, but the paper uses Ulam's 'tightness' to argue that the test
is not 'trivial'. I have yet to see a distribution-free test that tries to
argue it is as powerful as Neyman-Pearson!

While we don't get Neyman-Pearson, we get an approximation to the smallest
region where we will make a type II error and, thus, in a sense, for any
alternative distribution for point m, the most power in the goofy sense of
shifting that alternative distribution around! The argument is just a simple
use of Fubini's theorem! So, in a goofy but possibly "useful" sense, we
minimize type II error. There might be a duality situation here!

