
Where do our flaky tests come from? - monkeyshelli
https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-from.html
======
kazinator
I worked on a system that performed a flaky test _at run time_ and made
passing it a precondition for running. Oh boy.

Basically, it generated a sample of numbers from the RNG (not PRNG: a real
random source). Then it ran statistical tests of randomness to validate, "yes,
we have good random numbers for security and whatnot".

Of course, these tests can fail; there is a nonzero probability that some
pattern emerges which looks non-random to the tests; that's a consequence of
them being really random.

How lame.
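
(For illustration, a minimal sketch of that kind of runtime self-check in
Python; the sample size and acceptance bound roughly follow the old FIPS 140-1
monobit test, everything else is made up:)

    import secrets

    def monobit_ok(n_bits=20000, max_dev=346):
        # Draw n_bits from a "true" random source (here: the OS entropy pool).
        sample = secrets.randbits(n_bits)
        ones = bin(sample).count("1")
        # A genuinely random source will sometimes fail this check: the
        # deviation of `ones` from n_bits/2 has no hard upper bound.
        return abs(ones - n_bits // 2) < max_dev

    if not monobit_ok():
        raise RuntimeError("RNG self-test failed -- refusing to start")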

~~~
JoachimSchipper
For a true random number generator, that's simply a matter of tuning the
probability of random failure to an acceptable level; e.g. the old FIPS 140-2
randomness tests were calibrated to fail spuriously with a probability of
(IIRC) about one in a million.

There are problems with testing a random source, but calibration can solve
_that_ one. (The bigger issue is that the "true" random going into your (PRNG)
postprocessor should be whatever comes out of your best entropy-generating
device, which typically _isn 't_ uniformly random; and that the data coming
out of your postprocessor will look random to a black-box test even if the
input contains basically no entropy.)
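
(To make the calibration concrete, a sketch: back out the acceptance bound of
a monobit-style test from a target spurious-failure rate. The one-in-a-million
figure is taken from above; everything else is illustrative:)

    from math import sqrt
    from statistics import NormalDist

    def monobit_bound(n_bits, alpha):
        # Under the null hypothesis (fair bits), the count of ones is
        # approximately Normal(n/2, n/4); pick the deviation bound so that
        # P(|ones - n/2| > bound) == alpha.
        z = NormalDist().inv_cdf(1 - alpha / 2)
        return z * sqrt(n_bits / 4)

    print(monobit_bound(20000, 1e-6))  # ~346, the old FIPS monobit bound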

~~~
beobab
And of course, there's the classic "Dilbert" random number generator:
[http://dilbert.com/strip/2001-10-25](http://dilbert.com/strip/2001-10-25)

------
devman
Strange article.

Does one really need to do data analysis to see that it is simply natural for
large tests to be flakier? All else being equal, large tests perform more
operations. If, on average, the probability of any single operation failing is
constant (a fair assumption given the large number of tests), then a test that
performs more operations has a higher probability of failing (and no, it is
not linear as shown in the article; it is asymptotic, approaching a
probability of one). The "analysis" of tool types is strange too. WebDriver
tests are flakier because of the inherent nature of UI tests, which operate on
a less stable interface (compared to, e.g., a typical unit test, which usually
deals with a pretty stable contract).
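
(A sketch of the arithmetic, with illustrative numbers: if each operation
independently fails with probability p, a test performing n operations fails
with probability 1 - (1 - p)^n, which saturates toward one:)

    def flake_rate(p_op, n_ops):
        # Probability that at least one of n independent operations fails.
        return 1 - (1 - p_op) ** n_ops

    for n in (10, 100, 1000, 10000):
        print(n, round(flake_rate(1e-4, n), 4))
    # -> 0.001, 0.01, 0.0952, 0.6321: not linear, approaching one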

I used to spend quite some time on test stability issues in my previous team
and wrote a couple of posts on the topic and on how to successfully go from an
unstable to a stable test suite:

[https://blogs.vmware.com/management/2015/09/automated-tests-...](https://blogs.vmware.com/management/2015/09/automated-tests-and-stability-part-1-why-is-it-important.html)

[https://blogs.vmware.com/management/2015/09/automated-tests-...](https://blogs.vmware.com/management/2015/09/automated-tests-part-2-regaining-stability.html)

[Edit: formatting, spelling]

~~~
hugs
It's not strange at all. I appreciate the article's analysis. As the founder
of the Selenium project, I frequently hear people blame Selenium for their
problems. Sure, Selenium isn't perfect, but it seems like it is so much easier
for people to blame their tools instead of questioning other things. But I
understand why people do it. For a typical software developer, without an
obvious cause for flakiness, the apparent randomness of a flaky test tends to
make it easier (and plausibly justifiable) to "shoot the messenger".

~~~
vosper
OT, but thank you for Selenium!

~~~
hugs
Thanks, although credit these days rightfully goes to the huge multi-company
and org team effort. (In other words, at this point, there are tons of people
to blame for those flaky tests!)

------
kyoob
It is heartening to know that the great and mighty Google spends a real amount
of time and effort dealing with flaky Selenium tests, just like the rest of
us!

~~~
paulddraper
And it prompts the question: Is this really a challenge so difficult that even
the Gods struggle?

~~~
billsmithaustin
Yes, it really is. Automated UI tests are notoriously fragile.

~~~
paulddraper
I'm curious: Why? It's bits and bytes. What stops it from being reliably
deterministic like any other kind of computation?

~~~
nitwit005
You only need to make one mistaken assumption to introduce non-determinism.
Say you wait for the page to load, then click a button. Is that valid?
Sometimes it is, but some UI frameworks might render the button after the page
reports being loaded. Your test will generally pass regardless, but will
mysteriously fail periodically.
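
(The usual fix is to wait on the element itself rather than on page load; a
sketch with Selenium's Python bindings, where the URL and element id are
hypothetical:)

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    driver.get("https://example.com/form")  # hypothetical page

    # Don't assume "page loaded" implies "button rendered": explicitly
    # wait for the button to become clickable, with a bounded timeout.
    button = WebDriverWait(driver, timeout=10).until(
        EC.element_to_be_clickable((By.ID, "submit"))
    )
    button.click()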

You also tend never to achieve full reliability, because the environments
aren't entirely stable. Almost everyone sees more "random" failures on old IE
versions. Poor over-stressed IE9 running in a VM will have a bad day, and your
test will fail.

------
skybrian
It would be interesting to know how this correlates with how long the test
normally takes to run, versus its timeout. (Any timeout is a race condition in
disguise - and yet we need timeouts.)

~~~
kazinator
Umm, no; only timeouts which are related to ensuring some execution order are
race conditions.

E.g. B must not be done until A finishes. A takes too long; we time out on it,
and do B anyway. (Alternatively: A provides no reliable notification of being
done, so we delay for N seconds and assume A is done.)

If having done B is incorrect if A still happens after the timeout, then we
have a problem. It's a race between A and B, where A has a big "head start",
that's all.
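
(A sketch of that pattern, purely illustrative: B is gated on "A finished, or
the timeout expired", so whether B overlaps A depends entirely on scheduling:)

    import threading
    import time

    done_a = threading.Event()

    def a():
        time.sleep(0.5)  # A's duration varies in real life
        done_a.set()

    threading.Thread(target=a).start()

    # "Wait for A, but give up after N seconds": if the wait times out,
    # B proceeds while A may still be running -- a race with a head start.
    if not done_a.wait(timeout=0.1):
        print("timed out; doing B anyway, though A is still in flight")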

Famous example (maybe forgotten now): Microsoft's ten-second COM DLL unload
delay. The last reference count on a COM DLL is dropped by a function that is
in the DLL itself. So the refcount is zero, but the thread which dropped the
refcount to zero must still execute instructions in that DLL until it returns
out of that code. Oops! Anyone who calls the DLL's _DllCanUnloadNow_ function
will see a true result: we can unload it now! But actually we cannot.
Solution: oh, ten seconds should be enough for that last thread to vacate.
Long page faults under thrashing? Who cares; the system is probably dying in
that situation anyway.

(That should also be considered a data point in the GC versus refcounting
debates. A garbage collector can tell that a thread's instruction pointer is
still referencing a chunk of dynamic code.)

~~~
skybrian
I suppose it depends on how you define "race condition", but what I mean is
that there's literally a race between the executing code and the timeout
handler. Either one can win. The output is non-deterministic; it depends on
scheduling.

Or to put it another way, adding a timeout makes any function's output non-
deterministic, no matter how "pure" it is. It's only deterministic if you wait
as long as necessary for it to finish.
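
(A sketch: even a pure function becomes a coin flip when wrapped in a timeout
near its typical running time; the workload and timeout values here are
arbitrary:)

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def pure(n):
        # Deterministic result; nondeterministic duration under load.
        return sum(i * i for i in range(n))

    with ThreadPoolExecutor() as pool:
        future = pool.submit(pure, 10_000_000)
        try:
            print(future.result(timeout=0.5))  # may or may not win the race
        except TimeoutError:
            print("flaked: same input, different outcome")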

------
ISL
These plots would be much clearer if:

a) They were log/log, especially the first one.

b) There were units on the horizontal axis.

------
ams6110
Clearly we should be doing more testing of our tests.

~~~
yorwba
You can regression test your regression tests by running them against previous
buggy versions of your program. If the tests do not detect the bug they were
supposed to prevent, you have to fix them.
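
(A sketch of such a meta-test: check out the known-bad revision, run the
regression test against it, and require that it fails. The commit hash, test
path, and use of git/pytest are all hypothetical:)

    import subprocess

    # Map each regression test to a commit that still contains its bug.
    KNOWN_BAD = {"tests/test_issue_123.py": "abc1234"}  # hypothetical

    def assert_fails_on(test_path, bad_commit):
        subprocess.run(["git", "checkout", bad_commit], check=True)
        result = subprocess.run(["pytest", test_path])
        subprocess.run(["git", "checkout", "-"], check=True)
        # The test must FAIL on the buggy version, or it is not actually
        # guarding against the bug it was written for.
        assert result.returncode != 0, f"{test_path} passed on {bad_commit}"

    for path, commit in KNOWN_BAD.items():
        assert_fails_on(path, commit)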

~~~
joshschreuder
Isn't this a common bug-fixing method? Write a test that fails due to the
bug's behaviour, then update the code to make the test pass (i.e. fix the bug).

~~~
extension
It is, but I think the OC was suggesting that the old code can be kept around
and used for automated regression meta-testing, which is an interesting idea.

