

Visualization of randomness - matthiasb
http://codebazaar.blogspot.com/2011/05/visualization-of-randomness.html

======
foob
It's a fun little visualization, but it's also worth noting that a chi-square
test would achieve the same result in a much more quantitative manner.
Using his example, we'll generate N integers ranging from 0-9. Let n(0) denote
the number of 0's we get, n(1) the number of 1's, and so on. Then our expected
value for n(0) is N/10 and likewise for each of the other n(i). The
distribution of n(0) will be Gaussian for sufficiently large N and the
standard deviation will be sqrt(N/10). Then if we take the sum over i from 0
to 9 of [(n(i)-N/10)^2]10/N and call this quantity Q, it can be shown that Q
follows a chi-square distribution with k=9 degrees of freedom (one less than
ten because the tenth count is determined by N and the other nine). If we
then take (Q-k)/sqrt(2k), this tells us how many standard deviations away
from our expected chi-square value we fall. If there's a mistake in the
distribution of numbers, the number of standard deviations will tend to
drift farther and farther from 0 as N gets bigger. A rule of thumb: if
you're more than two or three sigma out you might want to run a few more
times to double-check, and if you fall five or more sigmas out then there's
most probably a mistake in your algorithm.
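In case it's useful, here's a quick Python sketch of the procedure above. The
function name, the seed, and the 100,000-sample size are just my choices; the
"bad" stream deliberately makes 0 twice as likely to show what a failure
looks like.

```python
import math
import random

def chi_square_sigmas(samples, bins=10):
    """Return how many standard deviations the chi-square statistic Q
    falls from its expected value k = bins - 1."""
    N = len(samples)
    expected = N / bins
    counts = [0] * bins
    for s in samples:
        counts[s] += 1
    # Q = sum over bins of (n(i) - N/10)^2 * 10/N, as described above
    Q = sum((c - expected) ** 2 / expected for c in counts)
    k = bins - 1                       # degrees of freedom
    # a chi-square(k) variable has mean k and variance 2k
    return (Q - k) / math.sqrt(2 * k)

random.seed(1)
good = [random.randrange(10) for _ in range(100_000)]
# a deliberately biased stream: 0 shows up twice as often as it should
bad = [random.choice([0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) for _ in range(100_000)]
print(chi_square_sigmas(good))  # small: within a few sigma of 0
print(chi_square_sigmas(bad))   # enormous: the bias is unmistakable
```

The biased stream lands hundreds of sigma out, while the uniform one stays
within the two-or-three-sigma band.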

EDIT: I tried to figure out how to make an asterisk and not just _italics_ but
couldn't do it.

~~~
mturmon
I compute the standard deviation of n(1) as sqrt(N p (1-p)) where p = 1/10.
It's N iid realizations of a biased coin toss.

You left out the 1-p part, I think.
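A small Monte Carlo check of this (the trial counts and seed are arbitrary
choices of mine): simulate n(1) as N biased coin tosses many times and
compare the empirical standard deviation against sqrt(N p (1-p)) and against
the sqrt(N p) figure upthread.

```python
import random
import statistics

random.seed(0)
N, p, trials = 1000, 0.1, 5000

# n(1) is a Binomial(N, p) count: N iid tosses of a biased coin
counts = [sum(random.random() < p for _ in range(N)) for _ in range(trials)]

empirical = statistics.stdev(counts)
binom = (N * p * (1 - p)) ** 0.5  # sqrt(N p (1-p)) ~ 9.49
poisson = (N * p) ** 0.5          # sqrt(N p) = 10, the figure upthread
print(empirical, binom, poisson)
```

With these settings the empirical value sits close to 9.49, noticeably below
10.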

I totally agree with your sentiment that just eyeballing the plot is not the
best route here. It's all too easy to convince yourself that the deviation
isn't that bad, if you're just looking at the plot without error bars.

~~~
foob
Thanks, good catch. I'm used to thinking about the limit where you have so
many bins that the Binomial distribution can be approximated as a Poisson
distribution without thinking twice. 10 is sort of borderline but I probably
should have been more explicit. If there are only 10 bins then this will
introduce a roughly 5% overestimation of the standard deviation in the
distribution of n(i). This in turn would be an 11% overestimation of Q and an
11% overestimation of the number of standard deviations you're off by. The
statistical error on your estimate of this sigma is generally going to be
significantly larger than the error introduced by making this approximation,
so it really doesn't matter all that much. As you make the number of bins
larger, these corrections matter less and less (mturmon probably knows this;
I'm just making it clear to anyone reading our comments).
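The 5% and 11% figures fall straight out of the ratio between the two
variance formulas, which is independent of N; a two-line check:

```python
p = 1 / 10  # ten equally likely bins

# The Poisson approximation takes the variance of n(i) as N*p; the exact
# binomial variance is N*p*(1-p). The ratio doesn't depend on N.
std_ratio = 1 / (1 - p) ** 0.5  # ~1.054: roughly a 5% effect on the std
var_ratio = 1 / (1 - p)         # ~1.111: about 11% on the squared terms
print(std_ratio, var_ratio)
```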

