Hacker News new | past | comments | ask | show | jobs | submit login
Where do our flaky tests come from? (googleblog.com)
85 points by monkeyshelli on April 19, 2017 | hide | past | favorite | 38 comments

I worked on a system that performed a flaky test at run time and made it a condition that it had to succeed in order to run. Oh boy.

Basically, it generated a sample of numbers from the RNG (not PRNG: real random source). Then it ran statistical tests of randomness to validate, "yes, we have good random-numbers for security and whatnot".

Of course, these tests can fail; there is a nonzero probability that some pattern emerges which looks non-random to the tests; that's a consequence of them being really random.

How lame.

For a true random number generator, that's simply a matter of tuning the probability of random failure to be acceptable; e.g. the old FIPS 140-2 randomness tests were calibrated to spuriously fail with probability of (IIRC) about one in a million.

There are problems with testing a random source, but calibration can solve that one. (The bigger issue is that the "true" random going into your (PRNG) postprocessor should be whatever comes out of your best entropy-generating device, which typically isn't uniformly random; and that the data coming out of your postprocessor will look random to a black-box test even if the input contains basically no entropy.)

And of course, there's the classic "dilbert" random number generator: http://dilbert.com/strip/2001-10-25

Wouldn't a statistical test presumably have some minimum number of trials to hit some confidence value

I mean like the half the point of stats is to verify a result isn't due to chance, at least to an acceptable degree

Wouldn't gaurantee the test won't misdiagnose, but enough trials should push that chance into oblivion

This is a good example of something that is virtually guaranteed, but I will still never, ever make it a condition for exiting unceremoniously, or worse, the condition for terminating a loop.

Can that be fixed with a larger sample size at startup? Of course the validation should be done long before startup.

If the RNG is truly random, then it must have a non-zero probability of failing any test for randomness that uses a finite sample of the RNG's output. For instance, the RNG could theoretically return a sequence of all the same number. You can't "fix" this.

Sure, but there is a big difference between flakey and non-zero. A 1 in a million chance would probably be more acceptable, a 1 in a hundred chance, not so much.

You can often make the failure probability extremely low, like 1e-30. Then it is more likely the server performing the test would break than the test fails due to a chance.

I don't think testing for randomness makes sense. Randomness is something we assume. There's always a non-zero probability that randomness will appear non-random.

In fact, this whole comment is a random string of characters. I just got really lucky, and there's essentially no way to disprove that. Throwing a dice a trillion times and getting all 6's is not an invalid outcome, just very improbable.

Randomness is about unpredictability. How can you assert that something is or is not predictable? It's always possible you just haven't observed it for long enough. An initial, randomly-appearing sequence may turn out to start repeating itself after some point in time; and an initial, self-repeating sequence can always be statistically cancelled out by later data.

Randomness is not something we can have; it's something we don't have (the ability to predict).

If you are using a hardware RNG there is the potential that there is some hardware fault causing a static readout or other such catastrophic failure.

It depends what happens when the test fails as to whether it is reasonable or not.

On the contrary, randomness is everywhere. Newtonian physics are nothing more than the statistical properties of quantum randomness with a large enough sample size.

To test the randomness of the RNG you could make the sample size so large that the chance of a false positive would be smaller than the chance of some memory bit flipping as a result of the Heisenberg uncertainty principle.

Is there an alternative? Can you mathematically prove randomness? (real question I'm not a mathemagician) If you can how do you verify that there is not an implementation bug in the code?

I think it might be a case of perfect being the enemy of good, the best approach is to minimize the false negatives.

You can disprove randomness (by prediction) but you can't prove randomness.

You can prove that certain results are as one would expect for a random source, but that is neither guaranteed to be passed by uniform randomness, nor does passing guarantee true randomness.

On a more philosophical level, as you get better measurements, entropy of a random source will always decrease. Time-stamp seeding, for example, can be weakened by timing.

What can be done, on the theoretical level for PRNGs is to see how easy it is to predict the next iteration from the current iteration. An easy bound here is simply the period of the PRNG.

Strange article.

Does one really need to do data analysis to see that it is simply natural for large tests to be more flaky? All else being equal large tests perform more operations. If on average the probability of any single operation to fail is constant (fair assumption given the large number of tests) than the test that performs more operations has higher probability to fail (and no it is not linear as shown in the article, it is asymptotic approaching probability of one). The "analysis" of tool types are strange too. Webdriver tests are more flaky because of inherent nature of UI tests operating on less stable interface (compared to e.g. typical unit test which usually deals with a pretty stable contract).

I used to spend quite some time on test stability issues in my previous team and wrote a couple posts on the topic and how to successfully go from unstable to stable test suite:



[Edit: formatting, spelling]

It's not strange​ at all. I appreciate the article's analysis. As the founder of the Selenium project, I frequently hear people blame Selenium for their problems. Sure, Selenium isn't perfect, but it seems like it is so much easier for people to blame their tools instead of questioning other things. But I understand why people do it. For a typical software developer, without an obvious cause for flakiness, the apparent randomness of a flaky test tends to make it easier (and plausibly justifiable) to "shoot the messenger".

OT, but thank you for Selenium!

Thanks, although credit these days rightfully goes to the huge multi-company and org team effort. (In other words, at this point, there are tons of people to blame for those flaky tests!)

The reasoning behind why large and Webdriver tests ought to be more flakey is sound. However, I don't see why having empirical data and doing the analysis is "strange".

Doing studies and collecting data should be how any assumption, seemingly reasonable or not, is proven right or wrong.

It is heartening to know that the great and mighty Google spends a real amount of time and effort dealing with flaky Selenium tests, just like the rest of us!

And it prompts the question: Is this really a challenge so difficult that even the Gods struggle?

I'm surprised that google can't manage it, they have proper test engineers. But automated testing is a discipline that very few learn to do well. Even most unit test you see out in the wild are absolutely woeful. I've got a long list of horror stories about testing if you'd like?

I used to believe that devs should write the tests, but now I'm not so sure, all to often they just produce more technical debt and fail at improving quality. The industry considers knowing the api of a test framework as an experience automated tester.

Yes, it really is. Automated UI tests are notoriously fragile.

I'm curious: Why? It's bits and bytes. What stops it from being reliably deterministic like any other kind of computation?

You only need to make one mistaken assumption to introduce non-determinism. Say you wait for the page to load, then click a button, is that valid? Sometimes it is, but some UI frameworks might render the button after the page reports being loaded. Your test will generally pass regardless, but mysteriously fail periodically.

You also tend to never achieve full reliability because the environments aren't entirely stable. Almost everyone sees more "random" failures on old IE versions. Poor over-stressed IE9 running in a VM will have a bad day and your test will fail.

If multiple threads are running, the order that the threads are scheduled is nondeterministic. And if there are any timeouts anywhere in the code, the result is nondeterministic due to the nondeterministic method that the kernel will schedule the process and environmental effects of whatever other programs are running at the same time or recently ran.

Some examples I've seen: Changes in browser code or DOM handling can break the test. Memory issues from having too many instances of Chrome can break the test. There are a lot of factors at play that are slightly difficult to control for

Not sure if this paper helps answer your question: http://alandix.com/academic/papers/nond90/

It would be interesting to know how this correlates with how long the test normally takes to run, versus its timeout. (Any timeout is a race condition in disguise - and yet we need timeouts.)

I agree, picking binary size as a variable seems like not a great choice to me.

More interesting things would be:

  Number of network calls.
  Network IO
  Disk IO
  Number of syscalls
I say this because all of those things seem much more likely to be the source of failures to me. We all know about network unreliability, CAP theorem and the Byzantine generals problem, so why doesn't it apply to testing too?

Umm, no; only timeouts which are related to ensuring some execution order are race conditions.

E.g. B must not be done until A finishes. A takes too long; we time out on it, and do B anyway. (Alternatively: A provides no reliable notification of being done, so we delay for N seconds and assume A is done.)

If having done B is incorrect if A still happens after the timeout, then we have a problem. It's a race between A and B, where A has a big "head start", that's all.

Famous example (maybe forgotten now): Microsoft's ten second COM DLL unload delay. The last reference count on a COM DLL is dropped by a function that is in the DLL itself. So the refcount is zero, but the thread which dropped the refcount to zero must still execute instructions in that DLL until it returns out of that code. Oops! Anyone who calls the DLL's DLLCanUnloadNow function will see a true result: we can unload it now! But actually we cannot. Solution: oh, ten seconds should be enough for that last thread to vacate. Long page faults under thrashing? Who cares; the system is probably dying in that situation anyway.

(That should also be considered a data point in the GC versus refcounting debates. A garbage collector can tell that a thread's instruction pointer is still referencing a chunk of dynamic code.)

I suppose it depends on how you define "race condition", but what I mean is that there's literally a race between the executing code and the timeout handler. Either one can win. The output is non-deterministic; it depends on scheduling.

Or to put it another way, adding a timeout makes any function's output non-deterministic, no matter how "pure" it is. It's only deterministic if you wait as long as necessary for it to finish.

These plots would be much more clear if:

a) They were log/log, especially the first one.

b) There were units on the horizontal axis.

Clearly we should be doing more testing of our tests.

You can regression test your regression tests by running them against previous buggy versions of your program. If the tests do not detect the bug they were supposed to prevent, you have to fix them.

Isn't this a common bug fixing method? Write a test that fails due to the bug's behaviour, then update the code to make the test pass (ie. fix the bug)

It is, but I think the OC was suggesting that the old code can be kept around and used for automated regression meta-testing, which is an interesting idea.

Perhaps there should be a testing test suite. Call it react-test-tester. Maybe implement ML in Rust to do the test testing.

Applications are open for YC Winter 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact