Basically, it generated a sample of numbers from the RNG (not a PRNG: a real random source). Then it ran statistical tests of randomness to validate: "yes, we have good random numbers for security and whatnot".
Of course, these tests can fail; there is a nonzero probability that some pattern emerges which looks non-random to the tests. That's a consequence of the source being truly random.
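For the curious, here is a minimal sketch of one such test (the "monobit" frequency test from NIST SP 800-22) in Python; the function name and sample size are my own, and os.urandom stands in for the hardware source:

    import math
    import os

    def monobit_p_value(data: bytes) -> float:
        # A truly random source keeps the count of 1-bits close to half of
        # all bits; the normalized deviation gives a p-value.
        n = len(data) * 8
        ones = sum(bin(b).count("1") for b in data)
        s = abs(2 * ones - n)
        return math.erfc(s / math.sqrt(n) / math.sqrt(2))

    sample = os.urandom(1_000_000)  # stand-in for the real random source
    p = monobit_p_value(sample)
    print(p)  # "fails" a 0.01 threshold about 1% of the time even for true randomness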
There are problems with testing a random source, but calibration can solve that one. (The bigger issue is that the "true" random going into your (PRNG) postprocessor should be whatever comes out of your best entropy-generating device, which typically isn't uniformly random; and that the data coming out of your postprocessor will look random to a black-box test even if the input contains basically no entropy.)
I mean, half the point of statistics is to verify that a result isn't due to chance, at least to an acceptable degree.
It wouldn't guarantee the test won't misdiagnose, but enough trials should push that chance into oblivion.
In fact, this whole comment is a random string of characters. I just got really lucky, and there's essentially no way to disprove that. Throwing a die a trillion times and getting all 6's is not an invalid outcome, just very improbable.
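For scale, a quick back-of-the-envelope in Python, using the trillion-roll figure from above (the numbers are purely illustrative):

    import math
    rolls = 10**12
    log10_p = rolls * math.log10(1 / 6)  # log10 of (1/6)^rolls
    print(f"P(all sixes) ~ 10^{log10_p:.2e}")  # roughly 10^(-7.8e11)

So "very improbable" is putting it mildly.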
Randomness is about unpredictability. How can you assert that something is or is not predictable? It's always possible you just haven't observed it for long enough. An initial, randomly-appearing sequence may turn out to start repeating itself after some point in time; and an initial, self-repeating sequence can always be statistically cancelled out by later data.
Randomness is not something we can have; it's something we don't have (the ability to predict).
Whether that is reasonable or not depends on what happens when the test fails.
To test the randomness of the RNG you could make the sample size so large that the chance of a false positive would be smaller than the chance of some memory bit flipping as a result of the Heisenberg uncertainty principle.
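Back-of-the-envelope, for the simple bit-counting style of test: fix an astronomically small false-positive rate, then choose the sample size so that even a tiny bias still sticks out. The alpha and bias values below are my own illustration, not anything from the article:

    from statistics import NormalDist

    alpha = 1e-30                            # acceptable false-positive probability
    bias = 1e-4                              # want to catch P(one-bit) = 0.5 + bias
    z = -NormalDist().inv_cdf(alpha / 2)     # two-sided threshold, roughly 11.5 sigma
    n_bits = (z / (2 * bias)) ** 2           # sample size so the bias clears the threshold
    print(f"reject beyond {z:.1f} sigma; need about {n_bits:.1e} bits")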
I think it might be a case of perfect being the enemy of good; the best approach is to minimize false negatives.
You can check whether certain results are what one would expect from a random source, but that check is neither guaranteed to be passed by uniform randomness, nor does passing it guarantee true randomness.
On a more philosophical level, as you get better measurements, the entropy of a random source will always decrease. Time-stamp seeding, for example, is weakened if you can measure the timing precisely.
What can be done at the theoretical level, for PRNGs, is to see how easy it is to predict the next iteration from the current one. An easy bound here is simply the period of the PRNG.
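A toy illustration: a linear congruential generator whose entire state is its last output, so observing a single value lets you predict every subsequent one. The constants are the classic Numerical Recipes LCG, used here purely as an example:

    M, A, C = 2**32, 1664525, 1013904223  # Numerical Recipes LCG parameters

    def lcg(state):
        while True:
            state = (A * state + C) % M
            yield state

    gen = lcg(42)
    observed = next(gen)                 # an observer sees one output...
    predictor = lcg(observed)            # ...and can replay the generator from it
    print(next(gen) == next(predictor))  # True: every future value is predictable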
Does one really need to do data analysis to see that it is simply natural for large tests to be more flaky? All else being equal, large tests perform more operations. If, on average, the probability of any single operation failing is constant (a fair assumption given the large number of tests), then the test that performs more operations has a higher probability of failing (and no, it is not linear as shown in the article; it asymptotically approaches a probability of one).
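The arithmetic, in case anyone wants to see the curve (p is an arbitrary per-operation failure rate, made up for illustration):

    p = 1e-4  # per-operation failure probability (illustrative)
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(f"{n:>7} ops -> P(test flakes) = {1 - (1 - p)**n:.4f}")

With independent operations the failure probability is 1 - (1 - p)^n, which climbs toward 1 rather than growing linearly.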
The "analysis" of tool types are strange too. Webdriver tests are more flaky because of inherent nature of UI tests operating on less stable interface (compared to e.g. typical unit test which usually deals with a pretty stable contract).
I used to spend quite some time on test stability issues in my previous team and wrote a couple of posts on the topic and on how to successfully go from an unstable to a stable test suite:
[Edit: formatting, spelling]
Doing studies and collecting data should be how any assumption, seemingly reasonable or not, is proven right or wrong.
I used to believe that devs should write the tests, but now I'm not so sure; all too often they just produce more technical debt and fail to improve quality. The industry considers knowing the API of a test framework enough to make someone an experienced automated tester.
You also tend to never achieve full reliability because the environments aren't entirely stable. Almost everyone sees more "random" failures on old IE versions. Poor over-stressed IE9 running in a VM will have a bad day and your test will fail.
More interesting things would be:
Number of network calls.
Number of syscalls.
E.g. B must not be done until A finishes. A takes too long; we time out on it, and do B anyway. (Alternatively: A provides no reliable notification of being done, so we delay for N seconds and assume A is done.)
If having done B is incorrect when A still completes after the timeout, then we have a problem. It's a race between A and B, where A has a big "head start", that's all.
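A stripped-down sketch of that race in Python; threading and sleep stand in for whatever A really is, and the timings are invented:

    import random
    import threading
    import time

    shared = []

    def a():
        time.sleep(random.uniform(0.5, 1.5))  # A usually takes about a second
        shared.append("A")

    t = threading.Thread(target=a)
    t.start()
    t.join(timeout=1.0)   # give up waiting for A after one second
    shared.append("B")    # ...and do B anyway
    t.join()
    print(shared)         # usually ['A', 'B'], but sometimes ['B', 'A']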
Famous example (maybe forgotten now): Microsoft's ten-second COM DLL unload delay. The last reference count on a COM DLL is dropped by a function that lives in the DLL itself. So the refcount is zero, but the thread which dropped it to zero must still execute instructions in that DLL until it returns out of that code. Oops! Anyone who calls the DLL's DllCanUnloadNow function will see a true result: we can unload it now! But actually we cannot. Solution: oh, ten seconds should be enough for that last thread to vacate. Long page faults under thrashing? Who cares; the system is probably dying in that situation anyway.
(That should also be considered a data point in the GC versus refcounting debates. A garbage collector can tell that a thread's instruction pointer is still referencing a chunk of dynamic code.)
Or to put it another way, adding a timeout makes any function's output non-deterministic, no matter how "pure" it is. It's only deterministic if you wait as long as necessary for it to finish.
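Put concretely, here is a sketch; the helper names and the 50 ms budget are mine, purely for illustration:

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def pure(n: int) -> int:
        return sum(i * i for i in range(n))  # same input, same output... in theory

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(pure, 5_000_000)
        try:
            print(future.result(timeout=0.05))  # may print the sum...
        except TimeoutError:
            print("timed out")                  # ...or this, depending on machine load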
a) They were log/log, especially the first one.
b) There were units on the horizontal axis.