
Common Probability Distributions: The Data Scientist’s Crib Sheet - ingve
http://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/
======
ggrothendieck
The image beside each node is nice but the meaning of the edges needs to be
well defined for this chart to be useful. An interactive graphic of 76
probability distributions which does define the edges and also links each node
to a description is available here:
[http://www.math.wm.edu/~leemis/chart/UDR/UDR.html](http://www.math.wm.edu/~leemis/chart/UDR/UDR.html)
and is further discussed here:
[http://www.amstat.org/publications/jse/v20n3/leemis.pdf](http://www.amstat.org/publications/jse/v20n3/leemis.pdf)

Wikipedia also has a chart:
[https://en.wikipedia.org/wiki/Relationships_among_probabilit...](https://en.wikipedia.org/wiki/Relationships_among_probability_distributions)

~~~
jamessb
The blog post does link to
[http://www.math.wm.edu/~leemis/chart/UDR/UDR.html](http://www.math.wm.edu/~leemis/chart/UDR/UDR.html)

John Cook also has a nice diagram adapted from it (Clicking on the arrows
takes you to the explanation):
[http://www.johndcook.com/blog/distribution_chart/](http://www.johndcook.com/blog/distribution_chart/)

------
mavam
On a related note, the first few pages of my statistics cookbook [1] contain
visualizations of several distributions with varying parameterization, both
P[MD]Fs and CDFs. I found this quite helpful in getting an intuitive
understanding of a function's "behavior."

[1] [http://statistics.zone](http://statistics.zone)

------
stared
It looks interesting (especially the diagram), but it's a pain to read due to
low contrast (see
[http://contrastrebellion.com/](http://contrastrebellion.com/)). I read it
only after manually changing #666666 to #000000.

~~~
jrapdx3
The "reader" function in Firefox works pretty well for that. I also found the
contrast too low, especially on an older laptop. Switching to FF reader mode
made a big difference.

------
graycat
Nicely done.

There is an important point the article makes, although only implicitly: if
you have some data and want to know its probability distribution, then
hopefully you know enough good things about where the data came from to
_know_, even without looking at the data, what the probability distribution
_must_ be. The article gives several such ways to know.

A big point: in practice this way of _knowing_ is not only powerful but,
really, nearly the only solid platform you have to stand on to know how your
data is distributed.

Here is an example one step beyond the article: You have a Web site, and users
arrive. Okay, each user has their own complicated life: maybe they use the
Internet only in the morning, or only in the evening, have a nearly fixed list
of sites they go to, only reach your site from links at other sites they visit
regularly, etc. That is, each user can have wildly complicated, unique
personal behaviors on the Internet.

Still, the arrivals at your site will be as in a Poisson process; that is, the
number of arrivals in the next minute will have the Poisson distribution, and
the time until the next arrival will have the exponential distribution. Why? A
classic result called the _renewal theorem_. There is a careful proof in the
second volume (the difficult one to read) of W. Feller's book on probability.

So, the arrivals at your Web site from user #1, Joe, form some complicated,
unknowable _stochastic arrival process_. Fine. Joe has a complicated life.
User #2, Mary, also has a complicated life but has essentially nothing to do
with Joe (Joe is a nerd, and Mary is pretty!). So, Mary acts _independently_
of Joe. Similarly for users #3, #4, ..., 2 billion. Then the arrivals at your
Web site are the sum of those 2 billion complicated, unique, independent
arrival processes with unknowable details. Then, with a few more meager
assumptions, presto, bingo, the renewal theorem says that the arrivals at your
site form a Poisson arrival process.
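A quick simulation sketch of that superposition claim (all parameters here are
made up for illustration, and 2000 simulated users stand in for 2 billion):
each user's gaps between visits are uniform, i.e. decidedly non-exponential,
yet the merged stream's inter-arrival times come out looking exponential.

```python
# Sketch of the superposition claim above, with made-up parameters:
# many sparse, independent streams with *uniform* (non-exponential)
# gaps; the merged stream's inter-arrival times should look exponential.
import numpy as np

rng = np.random.default_rng(0)
n_users = 2000

streams = []
for _ in range(n_users):
    offset = rng.uniform(0.0, 600.0)            # random phase per user
    gaps = rng.uniform(200.0, 600.0, size=10)   # sparse, non-exponential gaps
    times = offset + np.concatenate(([0.0], np.cumsum(gaps)))
    # Keep only a warmed-up window so start-up transients have washed out.
    streams.append(times[(times >= 1000.0) & (times < 2000.0)])

merged = np.sort(np.concatenate(streams))
inter = np.diff(merged)

# For an exponential distribution, std/mean (coefficient of variation) is 1.
cv = inter.std() / inter.mean()
print(f"{merged.size} arrivals, inter-arrival CV = {cv:.2f} (exponential: 1)")
```

No single stream here is anywhere near Poisson; only the sum is, which is the
whole point of the limit argument.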

There's a terrific chapter on the Poisson process in E. Cinlar's introduction
to stochastic processes. Terrific. Some of what you can say, knowing that you
have a Poisson process, is amazing. All with little or no attention to the
data and, instead, from knowing you have a Poisson process, e.g., from the
renewal theorem applied to a sum of many independent arrival processes.

Bigger lesson: The renewal theorem holds in the limit of a sum of many
independent arrival processes. So, it is a _limit theorem_. More generally,
many of the crown jewels of probability are limit theorems: they say what
happens in the _big picture_ when it is a limit of some kind of smaller
things we have nearly no ability to understand. So, astoundingly, such limit
theorems show that the effects of some universe of detail, maybe even _big
data_, just wash out. Often very powerful stuff. A big part of a good course
in probability is the full collection of classic limit theorems --
astounding, powerful stuff in there. Wait until you discover martingales --
totally mind-blowing that any such powerful things could be true, but they
are!

Final lesson: It's possible to take from the article, and from much of
introductory statistics, an implicit lesson that is wrong and even dangerous:
that, given some data, right away, ASAP, do not pass GO, do not collect $200,
you should rush to find the probability distribution. Well, if you can find
the distribution via something like the Poisson-process argument outlined
above, _terrific_. But usually you can't do that. Instead you just have the
data, just the darned data. Maybe even _big data_. Then, sure, you can get a
histogram and look at it. Okay, no harm done so far. But then, maybe, from
the implicit but dangerous lesson, you feel an urge, a need, a compulsion, a
strong drive to find _the distribution_ of that data, and go through some
huge list of increasingly bizarre well-known probability distributions
looking for a _fit_, etc. Mostly, don't do that.

Yes, there is a probability distribution, but, usually in practice,
especially when you are given data without any additional information that
would let you conclude something like Poisson above, beyond that histogram
you don't have much chance of finding or approximating the probability
distribution in any way that stands to be useful. Mostly, just get the
histogram and stop there.
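For what it's worth, "get the histogram and stop there" is nearly a one-liner.
A minimal sketch with NumPy, on made-up data (an arbitrary two-bump mixture,
deliberately not any single textbook distribution):

```python
# "Just get the histogram and stop there" -- made-up data for illustration.
import numpy as np

rng = np.random.default_rng(1)
# Pretend this is your raw data; here it's an arbitrary two-bump mixture
# that no single textbook distribution fits.
data = np.concatenate([rng.normal(0.0, 1.0, 5000), rng.normal(6.0, 2.0, 2000)])

counts, edges = np.histogram(data, bins=30)
# Crude text rendering: one row per bin, bar length scaled to the max bin.
for c, lo in zip(counts, edges[:-1]):
    print(f"{lo:7.2f} {'#' * int(60 * c // counts.max())}")
```

The two bumps are plainly visible in the output, which is exactly the kind of
thing a hunt for a single named distribution would paper over.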

Next, all the above holds for one-dimensional data, that is, single numbers.
But if your data comes in pairs of numbers, say, points in a plane, or, for
some positive integer _n_, _n_-tuples, then your desire to find _the
distribution_ is much, much less promising. Indeed, just getting a histogram
is much less promising. For _n_ > 2, histograms are already tough to see or
work with.
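A tiny sketch of why binning gets unwieldy as the dimension grows (made-up
2-D data; `np.histogramdd` generalizes the binning to _n_ axes):

```python
# Binning paired data is still feasible in 2-D, but the number of cells
# explodes with dimension. Made-up data for illustration.
import numpy as np

rng = np.random.default_rng(2)
xy = rng.normal(size=(10000, 2))              # 10,000 points in the plane
counts, edges = np.histogramdd(xy, bins=(20, 20))
print(counts.shape)                           # 20 x 20 = 400 cells
# At n = 6 dimensions with 20 bins per axis you'd face 20**6 = 64 million
# cells, nearly all of them empty for any realistic sample size.
```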

But, fear not: The field of applied probability and statistics is just awash
in techniques where you don't need anything like precise data on
distributions!

Succinct version of this lesson: Yes, the probability distribution exists, but
commonly you can't really find it and commonly you don't need to find it.

------
bm1362
Awesome primer on probability distributions.

