

The Central Limit Theorem Makes Random Testing Hard - sharkbot
http://blog.regehr.org/archives/660

======
imurray
Warning: slight pedantry ahead.

Specifically what this article is noticing is the _law of large numbers_ [1],
that sums or averages are "close" to their mean. The central limit theorem
says that sums _converge in distribution_ towards being normally distributed.
But convergence in distribution isn't a strong statement about tail behavior,
and so shouldn't be used in the sort of argument in the post.

[1] <http://en.wikipedia.org/wiki/Law_of_large_numbers>

------
streptomycin
If the goal is to randomly probe the parameter space of a model, the author is
correct that independent Gaussian random numbers are a poor choice (although
it's not really related to the central limit theorem). However, this isn't a
novel revelation. There are other/better ways to do it, for instance
<http://en.wikipedia.org/wiki/Latin_hypercube_sampling>

~~~
mturmon
Smart suggestion. There is a grain of truth in what the article says, but it's
a little confused.

The article is also conflating two things: the (unknown) internal state of the
system, which could live in a very high-dimensional space, and the system
inputs, which move the internal state around in a not-understood way, but
which you at least get to choose directly.

The cases the article examines use an internal state of a number, inputs of +1
or -1, and the system dynamics:

    
    
       state(t) = state(t-1) + input(t)
    

(E.g., "state" is "number of open files".) The CLT does apply in this case,
but it's clear that this is a very narrow case.

I'm not aware of convergence results that hold in any generality here. There
are concepts from control theory of observability and controllability (where
can the internal state go, and how can you read it out). And there are the
concepts from nonlinear dynamics of attractors.

------
Darmani
While it is well known that uninformed random test generation often does a
poor job at probing the parameter space, this has nothing to do with the CLT.
Even his first example with the bounded queue, one where the CLT does make
predictions, he is incorrect to apply it. That's because what we care about is
not the difference between the number of enqueue and dequeue operations (which
has expected value 0), but the magnitude of the difference (which has expected
value sqrt(N)). And what we care about is not the magnitude of the difference
after completing the tests, but the greatest magnitude at any point along the
random walk. This is governed by a more advanced theorem called the Law of the
Iterated Logarithm, which tells us that the probability of finding the bug is
quite a bit higher than the author realizes.

------
lliiffee
The central limit theorem only applies when random variables are combined
_linearly_. I see no reason to suspect that this is the case with random
inputs for testing. (Most functions are nonlinear!)

------
DaniFong
Why the central limit theorem is usually off:

[http://daniellefong.com/2008/01/28/outliers-why-the-
central-...](http://daniellefong.com/2008/01/28/outliers-why-the-central-
limit-theorem-is-typically-off/)

~~~
scott_s
That's a great essay, but the author is talking about random-walk situations,
which _are_ independent random variables. We know this, because that's how the
automatic software testing tools are (currently) written.

~~~
DaniFong
Correct, I probably should have pointed out more context here.

The reason that random testing in this fashion would be a problem is only if
the actual distribution of requirements does not well match the central limit
theorem -- if for example user demand is very correlated. Because the central
limit theorem is usually off, it is not acceptable to assume that testing
independent random variables will reproduce adequately the risky situations
testing should test.

------
bcbrown
I think one approach would be to use RNGs in a slightly different way. Instead
of a 'random walk' of various API calls, define a possible outcome category,
and choose a random number to be the quantity of that outcome.

So for the bounded queue problem, set the outcome category to be "elements in
queue". Say you get 3, 12, and 7 as your first three random numbers. Then the
first three test cases would be "get the load to 3 elements", "get the
capacity to 12 elements", "get the capacity to 7 elements".

~~~
regehr
This does indeed work. The problem is that there is a very large number of
possible goals that a random tester might aim for, and we do not know ahead of
time which ones are most important. Also, for complex testers like Csmith, it
is not clear how to reach some goals. This kind of thing complicates the code
a lot too.

~~~
bcbrown
I'll state up front that I have no familiarity with Csmith or compilers.

You can characterize a random test as a random walk through a many-dimensional
space, right? And the problem is that a random walk will naturally spend more
time close to the starting point? What if, when choosing the next step to
take, you bias the options towards moving away from the initial state? Is it
even possible to talk about 'away' and 'towards' in the state space?

Going back to the queue example, this would mean that if the starting load of
the queue is 5, whenever the load is above 5, bias the distribution of
enqueues/dequeues towards enqueueing.

~~~
regehr
Csmith is much less like a random walk; most program generation steps result
in forward progress. The problem is that at the end of many forward steps, we
may have a test case that is, in some sense, very similar to all of the other
programs generated by Csmith. So when we compile this test case, nothing very
interesting happens.

If we decided up front that, for example, the next program to generate would
have 95% stores through pointers and only 5% scalar operations, then this
would probably be pretty useful. But nothing in Csmith currently supports this
kind of high-level goal-seeking behavior.

So overall there are engineering issues, but also there are some hard open
questions about what it is that a random tester is trying to accomplish. By
necessity these programs start with only very weak hypothesis about what the
bugs that they are trying to find look like.

------
zerostar07
I can't understand the relation to CLT. When using an RNG you basically sample
from the same distribution (thus have only 1 random variable) plus the CLT
comes into play when you add (sum, average) different random variables. In the
case of the queue, it has nothing to do with the number of cases missed.

~~~
Daniel_Newby
Consider a coin labeled +1 on one side and -1 on the other side. A single coin
toss is binary distributed with mean of 0.

Toss the coin N times and let S be the sum of the results. The individual
tosses are binary distributed, but for larger N, S is approximately Gaussian
distributed thanks to the central limit theorem.

~~~
zerostar07
Yes, but you re talking about the sum of the results, so the CLT is only
relevant only to those programs that sum their input and nothing else.

