Hacker News
List of probability distributions (wikipedia.org)
47 points by whosbacon on May 25, 2013 | 36 comments



I've surfed around on this article several times while trying to find a distribution I needed in order to make my rejection samplers work better for distributions that have no name or are obscure (posteriors in Bayesian inference often generate some totally whacky distributions--especially once you get out of the realm of using conjugate priors. Here, there be dragons!).
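
For anyone who hasn't played with this: a basic rejection sampler is only a few lines. Here's a minimal sketch, assuming numpy is available; the unnormalized "no-name" target below is made up purely for illustration, and the uniform proposal plus the bound 1.1 are assumptions that go with it.

    import numpy as np

    def unnormalized_target(x):
        # Hypothetical lumpy, unnormalized density on [0, 1] standing in for a posterior.
        return np.exp(-8 * (x - 0.3) ** 2) + 0.5 * np.exp(-50 * (x - 0.8) ** 2)

    def rejection_sample(n, envelope_max=1.1, seed=None):
        # Proposal is Uniform(0, 1); envelope_max must bound the target from above.
        rng = np.random.default_rng(seed)
        samples = []
        while len(samples) < n:
            x = rng.uniform(0.0, 1.0)
            u = rng.uniform(0.0, envelope_max)
            if u <= unnormalized_target(x):   # accept with probability f(x) / (M * g(x))
                samples.append(x)
        return np.array(samples)

    draws = rejection_sample(10_000)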

If you aren't reading probability from a textbook, this article isn't going to tell you which distributions are the most useful. The Beta distribution receives half as much space as the Dirac delta function, but for you folks who want to take a Bayesian approach to A/B testing, the beta is going to be far more useful than the Dirac (the beta is the conjugate prior to the binomial). One might argue that the Dirac is interesting from a mathematical perspective, it being a degenerate distribution and all...

It turns out though that many of the "useful" distributions (by which I mean the ones that many classical problems were devoted to) share very similar properties. If your aim is to get familiar with the most common distributions encountered in the wild, the article on the exponential family will serve you better than this overwhelming article.

http://en.wikipedia.org/wiki/Exponential_family


The beta distribution is fantastically useful for modelling conversion rates. I'll take this opportunity to shamelessly plug a blog post I wrote about it last week.

http://www.chrisstucchio.com/blog/2013/bayesian_analysis_con...
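
For the curious, the conjugate update behind that kind of analysis is tiny. A sketch with made-up conversion counts (the 40/500 and 60/510 numbers and the uniform Beta(1, 1) prior are just illustrative; assumes scipy is installed):

    from scipy import stats

    # Hypothetical data: variant A converted 40 of 500 visitors, variant B 60 of 510.
    post_a = stats.beta(1 + 40, 1 + 500 - 40)   # Beta(1, 1) prior + binomial data
    post_b = stats.beta(1 + 60, 1 + 510 - 60)

    # Monte Carlo estimate of P(conversion rate of B > conversion rate of A).
    n = 100_000
    prob_b_beats_a = (post_b.rvs(n) > post_a.rvs(n)).mean()
    print(post_a.mean(), post_b.mean(), prob_b_beats_a)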


If you do that a lot, Adaptive Rejection Sampling (or some variant if the density isn't log-concave) may be useful; it builds up the rejection envelope while trying to sample and often succeeds in 5-10 evaluations.


My thoughts on the matter are affected by nonparametric and Bayesian methods. Binomial and beta rule, most others drool.

What I can say is that more than half the time that people are messing around with parametric statistics, they don't quite understand what they're doing, and they end up voiding the warranty on at least one of the distributions or methods that they use.


The Dirac distribution is everywhere in mathematics. I've only heard about the beta distribution and never needed to think about it...


I've taken two stats courses, and I still don't understand the derivation of the Bell Curve (why it happens, why it's shaped like it is, etc). I did well in the classes, too. It's weird how "unintuitive" probability is to me.


I don't blame you. They don't teach it because actually teaching it would be a major side track, everyone would ask why you need to know it, and it would generally cause problems.

But it isn't that hard to prove for yourself for the simple case of the binomial curve if you're willing to supply some elbow grease. Here is an outline of how I did it when I was an undergrad.

1. Prove that log(n!) = n log(n) - n + log(n)/2 + k + o(1) where k is some constant. (Trick: approximate the integral of log(x) with the trapezoidal rule; the integral gives you n log(n) - n + 1, and the trapezoid endpoints leave log(n)/2 + log(1)/2 over. Prove that the sum of the errors to infinity converges to a constant, which gives you, for finite n, k + o(1) where the o(1) term is the sum of the rest of the errors that you haven't reached yet.)

2. Suppose that X_1, X_2, etc are an identically distributed independent set of random variables that have probability 0.5 of being +1 or -1. Prove that each X_i has mean 0 and variance 1.

3. Consider the random variable Y_n = (X_1 + X_2 + ... + X_n)/sqrt(n). Prove that it has mean 0 and variance 1.

4. What is the "density" (in quotes because it is discrete) of Y_n around x? Well, you can calculate an exact formula in terms of the binomial formula, stick in your approximation from step 1, and then show that the point weight of the nearest point to x is going to be e^(-x^2/2)(c + o(1))/sqrt(n) where c is a constant that can be derived from k in step 1. Remember that that's the only point within a distance of 2/sqrt(n) (the possible values of Y_n are spaced 2/sqrt(n) apart), so the density is proportional to e^(-x^2/2). (There is a quick numerical check of this step after the outline.)

5. Look up the standard argument that the integral from -infinity to infinity of e^(-x^2) is sqrt(pi); a change of variables then gives you that the integral of e^(-x^2/2) is sqrt(2 pi). Note that the density of Y_n has to integrate to a total weight of 1 (because it is a probability density), substitute that fact in, and you get both the standard normal distribution out AND, as an unexpected bonus, Stirling's formula!

(If you wish you can do the same for a biased coin, the derivation is slightly more complex but follows the same lines. However to get arbitrary probability distributions you need to use different approaches. People usually don't encounter that proof until graduate school.)
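
Here's the quick numerical check of step 4 promised above. It's not part of the proof, just a sketch assuming scipy is installed: compare the exact binomial point weight of Y_n, divided by the 2/sqrt(n) spacing, against the standard normal density.

    import numpy as np
    from scipy import stats

    n = 10_000
    k = np.arange(n + 1)
    x = (2 * k - n) / np.sqrt(n)           # possible values of Y_n
    weight = stats.binom.pmf(k, n, 0.5)    # exact point weights of Y_n
    density = weight / (2 / np.sqrt(n))    # points are spaced 2/sqrt(n) apart

    # Compare with the standard normal density near x = 1.
    i = np.argmin(np.abs(x - 1.0))
    print(density[i], stats.norm.pdf(x[i]))   # the two numbers agree closely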


[deleted]


Thanks, fixed.


Another way to derive the bell curve is as the maximum entropy distribution with known mean and variance.

The derivation does require one trick that is usually not taught in high school. You know how you can maximize a function by setting its derivative to zero? Well, there is a generalization of that to functions of multiple variables: you set the gradient to zero. But there is yet another generalization of that for functions of a continuously infinite number of variables! That's called the calculus of variations. If we want to maximize the entropy over all possible probability distributions over the real numbers, then we have a continuously infinite number of variables, since for each real number x, we need to assign a probability density p(x). Here is the derivation for those interested:

Let m be the mean we want and v be the variance we want, and let p(x) be the probability density function.

We want to maximize the entropy S = -int(p(x)log(p(x))) given the constraints that:

    int(p(x)) = 1  // total probability should sum to 1
    E(X) = int(x*p(x)) = m  // mean should be m
    E((X-m)^2) = int((x-m)^2*p(x)) = v  // variance should be v
This problem can be solved with calculus of variations with Lagrange multipliers. The Lagrangian is all the terms inside the integrals summed up (dropping the overall minus sign on the entropy term, which only flips the signs of the multipliers):

    L = p*log(p) + a*p + b*x*p + c*(x-m)^2*p
Calculus of variations tells us that for the optimal p, we have:

    dL/dp = 0
This is the analogue of setting the derivative to zero for maximizing a function.

Calculating that we get:

    a + bx + c(x-m)^2 + log p + 1 = 0
rewriting:

    p = e^-(a + bx + c(x-m)^2 + 1)
We have the general form of the Bell curve. Now we just need to find the right values for a,b,c to satisfy the 3 constraints.
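
Grinding through the constraints gives b = 0, c = 1/(2v), with a absorbed into the usual Gaussian normalizer, i.e. the N(m, v) density. A quick symbolic sanity check of that claim (just a sketch, assuming sympy is available):

    import sympy as sp

    x, m = sp.symbols('x m', real=True)
    v = sp.symbols('v', positive=True)

    # Candidate solution of the constraints: b = 0, c = 1/(2v),
    # with the remaining constant absorbed into the normalizer.
    p = sp.exp(-(x - m) ** 2 / (2 * v)) / sp.sqrt(2 * sp.pi * v)

    print(sp.simplify(sp.integrate(p, (x, -sp.oo, sp.oo))))                 # 1
    print(sp.simplify(sp.integrate(x * p, (x, -sp.oo, sp.oo))))             # m
    print(sp.simplify(sp.integrate((x - m) ** 2 * p, (x, -sp.oo, sp.oo))))  # v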

As special cases of this we have:

1) The probability distribution with maximum entropy and no further constraints (over a bounded interval) is the uniform distribution.

2) The probability distribution with maximum entropy with only the constraint that the mean is m (over the nonnegative reals) is the exponential distribution.

In general if you want to find maximum entropy distribution with known expectation values E(f(X)) and E(g(X)), then that is of this form:

    p(x) = C*e^(a*f(x) + b*g(x))
This gives an indication why so many distributions have this form. For example the entire exponential family can also be derived as maximum entropy distributions with certain known expectation values.

Of course this derivation is only a good motivation for the bell curve if you already cared about entropy, mean and variance.


Try describing in equations how the "bean machine" (http://en.wikipedia.org/wiki/File:Planche_de_Galton.jpg) works (you just need knowledge of basic probability, and then some calculus when you go to actually derive the functions), until you get to the mathematical derivation of the bell curve. It will give you a very "mechanistically intuitive" understanding of it if you can imagine it starting from a physical mechanical process -> idealized physical process -> mathematical model -> generalize the model as much as possible -> come back to the real world by applying the model to real-world problems. It's weird how intuitive stats becomes when you see it as "just physics and math" and then distill the purest "abstract" model out of it, in a way that it can be used generally and without knowledge of how it was derived. (I know, in school you're taught to "apply probability and statistics in physics and engineering", not to "apply physics and math and engineering to understand statistics and probability" -- but turning your mind upside down in this "kind of wrong" way helps your intuition a lot.)


Btilly outlined the steps of how you'd go about showing mathematically that a binomial converges to the normal distribution, but the reason you don't see this approach, even after two courses in statistics, is that it takes a while for people to wrap their heads around the steps. At most universities, calculus (more than just a simple derivative and integral here and there, I mean) and real analysis aren't introduced into statistics until students take math stats, which is something you'd take after you were well versed in the applications and shapes of several common distributions. The bean machine is usually a more intuitive way that students are introduced to "why" binomials become normal as the sample size gets larger.

The reason that _so many_ things have normal distributions is the Central Limit Theorem. If there's any reason to study math stats (other than understanding the underpinnings of statistics at a deeper level than most of the population), it's that you get to become really familiar with how amazing this theorem is.


Central Limit Theorem? http://en.wikipedia.org/wiki/Central_Limit_Theorem

The sum of die rolls shows a good example of why you get the bell shape: there are more ways (combinations) to sum up to the center values than to the extremal values.
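
A sketch of that, assuming numpy is available (the 10 dice and 100,000 rolls are arbitrary choices): sum a handful of dice many times and look at the shape of the counts.

    import numpy as np

    rng = np.random.default_rng(0)
    totals = rng.integers(1, 7, size=(100_000, 10)).sum(axis=1)   # 10 dice per roll

    # Middle totals (around 35) have far more combinations than extreme ones (10 or 60).
    values, counts = np.unique(totals, return_counts=True)
    for v, c in zip(values, counts):
        print(f"{v:3d} {'#' * (c // 200)}")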


It is easy to write a computer simulation of specific cases, and watch the bell shape emerge out of it. All that tells you is that the mathematicians are telling the truth, but not WHY it happens.


Explaining why is difficult in layman's terms except for a few cases (the binomial is the token example for the classical CLT).

For the majority of the people I had in the intro statistics courses I TA'ed for, this was enough explanation. In undergrad, we didn't prove Lindeberg-Levy CLT until my third quarter of math stats, and the versions with weaker assumptions we didn't prove until graduate math stats.

When you're struggling to understand the fundamentals of hypothesis testing and simple combinatorics, going through the rigorous process of proving the CLT is daunting, mind-boggling, and largely unnecessary :).


Doesn't the combinatorial approach which allows you to directly calculate the probabilities of each bin show why it takes that particular shape?


It can, but it takes a bunch of elbow grease.

It is slightly easier if you start off assuming Stirling's formula. However there is no need to assume that - you can derive it at the same time. Elsewhere in this thread I gave an outline of how to derive this simple case.


If you are willing to spend a little time with it, this might help give you a more intuitive understanding:

http://intromath.ca/aakkozzll/

The trick is to keep in mind that in this model, each ball has a 50/50 chance of either going left or right when it crosses each peg.

This is analogous to a game where you flip a coin 9 times in a row and win a dollar for every heads, lose a dollar for every tails. On average, you come out even, but the outcome of any particular game is like one of the balls falling down that Pachinko board. You might get really lucky and get 9 heads in a row, but usually there's a mix of heads and tails, and if you look at all the possible results, it would stack up like those balls in the pachinko board.
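
The exact stack heights for that nine-flip game are just binomial coefficients. A tiny sketch, standard library only (the bar-chart printing is just for illustration):

    from math import comb

    flips = 9
    for heads in range(flips + 1):
        dollars = heads - (flips - heads)    # net winnings: +$1 per head, -$1 per tail
        ways = comb(flips, heads)            # number of paths down the board to this bin
        print(f"{dollars:+3d}: {'#' * ways}  ({ways}/{2 ** flips})")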


A bell curve (generally a Gaussian curve, but not always, though what you are inquiring about would be) is just where the data is laid out so that the sum of the data adds up to 1 and is symmetrical around the average (with the average being the center point). Since the data adds up to one, you would have to convert each data set to percentages. It's useful when you don't know the exact distribution of the data at hand.

To give an easy-to-visualize example, a teacher that grades on the curve does so by normalizing the data (taking the top 10% of the grades for a test and making those A's regardless of previous score, then making the top 10%-20% B's regardless of score, and so on). There are different methods to calculate the percentiles (and what those percentiles are could vary by grading system), but that's just the most common one.


I'm worried your explanation may be confusing to people in some places.

The bell curve and Gaussian curve (and normal curve) are almost always synonymous. "Bell curve" describes its shape, while "Gaussian curve" is named for the guy who derived that nasty-looking pdf. While some other distributions may look bell-shaped (especially ones that converge to a normal in some asymptotic sense, like the t-distribution), people will look at you funny if you refer to anything other than the normal (or in rare cases, the t) as "The Bell Curve".

The point about "the data is laid out so the sum of the data adds up to 1" doesn't make much sense since the Normal takes values on the entire real line. Could you clarify?

The procedure you describe is not what most people would call "normalizing", so much as breaking the data up into quantiles/percentiles. Normalizing the data would mean the instructor subtracts the average from everyone's raw score, then divides by the standard deviation. If the raw scores themselves were distributed normally, then these new "normalized" scores are now standard normal, so letter grades are given based on how far away from 0 a student's scaled score is. This procedure is pretty flawed since scores are usually not super close to normality (many exams are not symmetric in scores, the distribution may have several modes, etc.). The procedure you describe makes no assumptions about the underlying data, and does not require normality, so I like it more.
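
To make the distinction concrete, here is a little sketch with made-up scores (assumes numpy; the data and where you'd place the grade cutoffs are arbitrary):

    import numpy as np

    scores = np.array([55, 62, 68, 71, 74, 78, 81, 85, 90, 97], dtype=float)

    # "Normalizing": subtract the mean and divide by the standard deviation,
    # then grade on how far each scaled score sits from 0.
    z = (scores - scores.mean()) / scores.std()

    # Percentile approach: rank the raw scores and cut at fixed fractions.
    ranks = scores.argsort().argsort()
    percentile = ranks / (len(scores) - 1)

    print(np.round(z, 2))
    print(np.round(percentile, 2))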


Sorry, by data, I meant the percentages add up to one (so that the integral equals 1), not the data it was derived from.

Thanks for the correction. My method would be more fair, but is incorrect when applied to standard normalization. It should be percentiles. In a strict bell curve, if the top 20% get A's, then the bottom 20% would also have to fail (even if they didn't really get an F), so it's a pretty unfair system, as you mentioned, since grades are already predetermined :(


Your first point is not just a property of the Gaussian, but of all distributions. It's the Second Axiom of Probability!

Your method is incorrect as a method of normalization, but it seems we are in agreement that perhaps normalization itself is a silly method for this. Most instructors I've known that claimed they were "curving" were in fact using the percentile approach. Grading based on percentiles (top 10% get A's, next 20% get B's, 40% C's, 10% D, bottom 20% F, for example) allows you far more flexibility, as the percentiles don't even need to be symmetric. My grad courses usually only had three grades: A+, A, and "Please see me to discuss your future in the department".


You're showing no sign of understanding why the bell curve is shaped like it is either. You're just describing what you've seen.


Instead of criticizing, why don't you correct me then, please, by giving a better example. We're all here to learn, not to put each other down with 10-second, thoughtless replies.

Your background (being just shy of a PhD in Math), should have lots of useful information you can share.


I was in the process of typing that up. It takes time.

See https://news.ycombinator.com/item?id=5766387.


Thank you! I didn't consider what he would want would be an actual logical proof instead of metaphors/analogies.


btilly gives a starting point for a derivation, which is great. Here are two other ideas, working from the other direction. Notice that the Gaussian is closed under summation and rescaling; that is, if the variables X1...Xn are all iid standard Gaussian, then the properly normalized sum is again N(0,1). In this sense, the Gaussian is an attractor of the normalized-sum transform. Another worthwhile experiment is to generate 6 or 10 iid uniform variates and notice how close the normalized sum is to Gaussian. It converges rather fast. You can do this by Monte Carlo, or symbolically using Mathematica.
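
A Monte Carlo version of that experiment, assuming numpy and scipy are available (6 uniforms and 200,000 replicates are arbitrary choices): compare sample quantiles of the normalized sum to the normal quantiles.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    k = 6
    u = rng.uniform(size=(200_000, k))
    # Each uniform has mean 1/2 and variance 1/12; normalize the sum to mean 0, variance 1.
    y = (u.sum(axis=1) - k / 2) / np.sqrt(k / 12)

    qs = [0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]
    print(np.round(np.quantile(y, qs), 3))
    print(np.round(stats.norm.ppf(qs), 3))   # already close, even with just 6 terms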


Probability distributions are abstractions. No actual data distribution is Normal, though some approach it, I suppose.

It is a mathematical shortcut. It can make analysis and computations simpler, thus cheaper. Though it can also destroy economies if you forget that it's an abstraction.

You don't need to use the analytic distributions. You can get by with simulations and bootstrapping, if you have the computational horsepower.
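
For instance, a bootstrap percentile interval for a mean needs no distributional assumptions at all. A minimal sketch, assuming numpy (the exponential sample is just stand-in data):

    import numpy as np

    rng = np.random.default_rng(42)
    data = rng.exponential(scale=3.0, size=200)   # stand-in data; any sample works

    boot_means = np.array([
        rng.choice(data, size=data.size, replace=True).mean()
        for _ in range(10_000)
    ])
    lo, hi = np.quantile(boot_means, [0.025, 0.975])
    print(f"mean = {data.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")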


I recommend Robert Kass' article "Statistical Inference: The Big Picture" for a related point of view on that issue, which should be obvious but is easily forgotten in practice.


Here's a possibly interesting idea: in a game, randomly select from some of these distributions to generate objects (e.g. powerups, enemies, game balancing constants) whenever the player appears to be doing well. You'd initially give them the impression that the behaviour is non-random, but then suddenly everything changes.

Just thinking out loud here. =) In practice you'd probably need a solid game to begin with to even get the player to notice changes in the distribution used. I'd say lots of times you don't need more than a plain old uniform distribution though.


Uniform's pretty much the go-to distribution for stuff like that--or having a non-random piece and a small random component. That's how Pokemon determined Catch Rates[1].

The problem with a lot of these distributions is that while you may have a closed-form expression for the PDF, it doesn't always give you an easy way to generate samples from that distribution. There's a huge amount of research that goes into efficient ways to generate samples from these distributions, and they are often quite complex once you are unable to use the inverse probability transform [2]. There's a reason that the Metropolis-Hastings algorithm was such a big breakthrough, and that's because sampling is often quite difficult. In the multivariate setting, sometimes you may have to throw away 10,000 random numbers before you finally get one that could come from your target distribution.

For most of the univariate distributions, it's not tremendously difficult to sample, though, so you may be okay.
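
For example, the inverse probability transform gives exponential draws in essentially one line. A sketch, assuming numpy (the rate lam = 2 is arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    lam = 2.0
    u = rng.uniform(size=100_000)
    x = -np.log(1 - u) / lam     # inverse of the exponential CDF F(x) = 1 - exp(-lam*x)

    print(x.mean(), 1 / lam)     # sample mean should be close to 1/lam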

[1]: http://bulbapedia.bulbagarden.net/wiki/Catch_rate (this article is total nerdvana, by the way) [2]: http://en.wikipedia.org/wiki/Inverse_transform_sampling


In actual games with random behavior, the challenge usually lies in convincing people that they are just having an unlucky streak and that the game is random and unbiased. Therefore I don't see the point.

(I've encountered this at http://www.wargear.net/ - and I've sadly needed to convince myself that it is truly random.)


Well, classic problems in statistical analysis deal with "overdispersion"—e.g., the ratio of boys to girls in individual families has a variance too high to be modeled as a simple binomial distribution, so you can model it as a beta-binomial.

In games, the challenge is the opposite: to generate distributions with less variance in order to convince people that the random number generator is "fair". You don't have to do this, but some games will tweak the numbers so they're not independent, to keep the sample mean close to its asymptotic value.

In other words, if you have a losing streak, the software steps in and breaks it for you, so that you think the RNG is "fair".
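
A sketch of the overdispersion point, assuming numpy (the family size, the number of families, and the Beta(20.4, 19.6) parameters are made up; each family shares n but draws its own p from a beta):

    import numpy as np

    rng = np.random.default_rng(3)
    n, size = 8, 100_000                    # hypothetical family size and number of families

    plain = rng.binomial(n, 0.51, size=size)

    p = rng.beta(20.4, 19.6, size=size)     # mean 0.51, but p varies family to family
    overdispersed = rng.binomial(n, p)

    print(plain.mean(), overdispersed.mean())   # means match (~4.08)
    print(plain.var(), overdispersed.var())     # beta-binomial variance is noticeably larger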


Yeah, I suppose we humans are just too good at seeing patterns where none exist.

Still, it's fun to experiment. =)


Sorry, guys. After advanced study of probability, writing a Ph.D. dissertation in applied probability, and doing applications of probability in US national security and business, here's my take on the OP:

(1) Sure, given a random variable taking values in the real numbers, it has a (cumulative) distribution and may have a probability density function.

(2) In some special, fortunate cases, we actually can get some useful information about the distribution, e.g., it has an expectation, the expectation is finite, it has a finite variance, and we might even be able to say that the distribution is in one of the famous families, Gaussian, exponential, Poisson, etc., and, then, estimate the parameters of the distribution, e.g., for a Gaussian, estimate the mean and variance.

(3) While we know that our random variable has a distribution, mostly in practice we can't know much about that distribution. Commonly if we know that the distribution has an expectation, that the expectation is finite, and that the variance is finite (that commonly we get just by what we know about the real problem with no effort in probability or statistics), then we have to be satisfied with those 'qualitative' facts and move on. E.g., finite variance is enough about the distribution to apply the strong law of large numbers!

(4) Mostly when we do know that the distribution is Gaussian, exponential, etc., it is because we are using one of the classic limit theorems of probability, e.g., the central limit theorem or the renewal theorem.

(5) If the random variable takes values in, for some positive integer n, some n-dimensional space over, say, the real or complex numbers, then in practice knowing much detail about the distribution is next to hopeless. Why? Because of the 'curse of dimensionality' where the number of points needed for a fine grid is commonly so high it makes 'big data' look like something on a 3 x 5 index card!

Net, mostly we do our work in applied probability knowing, of course, that a random variable has a cumulative distribution but otherwise knowing at most only some qualitative aspects, e.g., finite variance, and little else.

So, our more important methods of working with random variables should need only meager assumptions about the distributions of those random variables. E.g., order statistics are always sufficient. Commonly, to get the most desired estimates from least squares techniques, we need only finite mean and variance; the common Gaussian assumptions about the 'model error terms' are needed only for doing classic confidence interval, etc., estimates.

Generally there are a lot of 'non-parametric' (distribution-free, where we make no assumptions about the distribution) techniques in hypothesis testing. For estimation, there are non-parametric techniques based on, say, 'resampling' for getting confidence intervals. And there are other distribution-free techniques.

Once I saw a paper in hypothesis testing that was both multidimensional and distribution-free.

Or, in shortest terms: Yes, a random variable always has a cumulative distribution, but that doesn't mean that always in practice we should try to find that distribution; instead, commonly in practice we have little hope of knowing much about the distribution!

Or, one of the best books in probability I know of is

Jacques Neveu, 'Mathematical Foundations of the Calculus of Probability', Holden-Day.

and there is at most only meager mention of specific distributions in the whole book!


Couple comments: Anyone in statistics/probability has seen this quote multiple times: "All models are wrong, but some are useful." -George Box.

We often don't know the underlying distribution of a particular process or our data, true, but that doesn't mean that we can't get useful information out of it by making some assumptions, provided that our assumptions are not wildly wrong. For example, if we assume height is normally distributed, that should mean it's possible, under our model, to see someone who is -9000 feet tall. We're willing to accept this falsity in exchange for having some nice properties we wouldn't have had otherwise, provided we exercise some common sense.

In regards to (3), if all you can wield is the SLLN, you can't do much inference or prediction. So you're able to guarantee that the sample mean will converge to the expectation almost surely. The closeness of approximation in a finite sample case using asymptotics is intrinsically tied to sample size and the particulars of the underlying distribution of the random variable.

In regards to (5), the curse of dimensionality does become a large problem, but throwing out assumptions makes it a far bigger problem. The nonparametrics community has a lot of trouble with multivariate distributions for precisely this reason! This is one area where parametric models do better than non-parametric ones, since they have so much more structure to play with, it makes the problems so much more tractable.

You mention order statistics... order statistics in higher dimensions are a tricky concept, not as well defined as in the one-dimensional case. Which one is larger: (3, 2, 1) or (1, 2, 3)? Wouldn't that necessarily depend upon the underlying distribution or some other measure of distance? If you have a paper to suggest that general high-dimensional order statistics makes sense, throw it this way. You are correct that least squares doesn't require a distribution to get mean and variance estimates, but what good is that if you don't know how much of your population fits within a multiple of your standard deviation around your mean? You could use something that works for all distributions like Chebyshev's Inequality, but to get that level of sweeping generality seriously hurts your power (if your data IS actually normal, 95% of obs fall within 1.96 SD's of the mean; Chebyshev will tell you 75%), and even Chebyshev had to impose a restriction to get that bound--it assumes finite mean and variance.
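
For the record, the two coverage numbers are quick to compute (a small check, assuming scipy):

    from scipy import stats

    k = 1.96
    normal_coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)   # ~0.95, if the data really is normal
    chebyshev_bound = 1 - 1 / k ** 2                            # ~0.74 (the "75%" above), for any finite-variance distribution
    print(normal_coverage, chebyshev_bound)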

While there are distribution-free hypothesis testing methods, most of classical nonparametric statistics makes some assumptions about the underlying distributions, albeit they are far gentler than the parametric assumptions.

Always choosing nonparametric methods is just as short-sighted as always using parametric methods. It's very possible that you are throwing away vast amounts of information by using a nonparametric procedure when a parametric one would have been completely adequate. With ANOVA and two-sample tests, you don't lose much by going with Kruskal-Wallis or Mann-Whitney tests (under normality, MW has a relative efficiency of 3/pi versus the t-test), but in other circumstances, using a nonparametric method could be way worse, provided the parametric assumptions are true.

What I'm getting at is that while non-parametrics are nice in that they don't require many assumptions, they may be throwing away too much. Fitting to a distribution may be very useful in making inferences and predictions. All models are wrong, but some are useful.


Spoken like a true statistician! Yes, the statisticians keep assuming Gaussian, fitting distributions to data, etc., and you have a way out: Of course it's wrong, but it's still useful! Wow!

> If you have a paper to suggest that general high-dimensional order statistics makes sense, throw it this way.

Do an hypothesis test. For positive integers m and n, consider m > 1 points in real Euclidean n-space. We want to know if point m is distributed like the rest. Our null hypothesis is that all the points have the same distribution and are i.i.d.

Now, for positive integer k, for each of the m points, we can find the distance to the farthest of the k nearest neighbors. So, we get m distances. These, however, are not i.i.d. But we can do a little work to show that we have a finite group of measure-preserving transformations such that, if we sum over the group, then we get something similar to a permutation test. So, if the distance for point m is in the upper 2% of the distances, then we have probability of type I error of 2% and, thus, an hypothesis test. So, we have a multidimensional, distribution-free hypothesis test. That is, intuitively, if point m is 'too far away' from the other m - 1 points, then we reject the null hypothesis that point m is distributed like the other m - 1 points (we continue to believe in the independence assumption and do not reject it). But the k-nearest neighbors with the Euclidean norm need not be the only choice -- so we get a family of such tests.

Intuitively, for n = 2, we have a density that looks like, say, a mountain range. Suppose the rocks of the range are porous to water and we pour in some water. Then the water all seeks the same level. So, we have multiple lakes, with islands, with lakes, etc. Suppose the water covers 2% of the probability mass. Then the test is to see if point m falls in the water.

Yes, we expect the lake boundaries to be fractals. So, right, with enough points m, we are approximating the fractal boundaries of the lakes. Increasing k makes the approximation to the boundaries smoother. If we pour in more water, then we get another set of contours. So, we get a way to do contour maps of the density. So, we get a technique for multidimensional density estimation.

E.g., go to a beach, pick up m = 200 rocks, and for each measure n = 10 properties. Now given one more rock, ask if it came from that beach!

As I recall, there is such a paper in 'Information Sciences' in 1999 about an 'anomaly detector'. Why 'Information Sciences'? Because the paper suggested monitoring large server farms and networks for 'health and wellness' by this technique. So, we would get multidimensional, distribution-free 'behavioral monitoring' with a false alarm rate we could set in advance and get exactly. The work does not promise to have the best power in the sense of the Neyman-Pearson result, but the paper uses Ulam's 'tightness' to argue that the test is not 'trivial'. I have yet to see a distribution-free test that tries to argue it is as powerful as Neyman-Pearson!

While we don't get Neyman-Pearson, we get an approximation to the smallest region where we will make a type II error and, thus, in a sense for any alternative distribution for point m the most power in the goofy sense of shifting that alternative distribution around! The argument is just a simple use of Fubini's theorem! So, in a goofy but possibly "useful" sense we minimize type II error. There might be a duality situation here!
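
This isn't the construction from the paper, just a rough numpy sketch of the k-th-nearest-neighbor distance statistic with made-up data (the 200 points, 10 dimensions, k = 5, and the 2% cutoff are all assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n, k = 200, 10, 5
    points = rng.normal(size=(m, n))    # the m - 1 "beach rocks" plus the candidate point
    points[-1] += 3.0                   # shift the last point so it looks anomalous

    # Distance from each point to its k-th nearest neighbor (excluding itself).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]      # column 0 is the zero self-distance

    # Flag point m if its k-NN distance lands in the top 2% of all m distances.
    cutoff = np.quantile(kth, 0.98)
    print(kth[-1], cutoff, kth[-1] > cutoff)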



