
Where do Google's flaky tests come from? - bobjordan
https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_aqka-P1jVOsFfufxqUUM
======
slavik81
Not at Google, but making reliable tests was one thing I really struggled with
professionally. There are many, many possible causes of non-determinism. Even
something as simple as using multiple threads means that you suddenly realize
that you must decide how long you're willing to wait for the other thread to
complete. (You always kind of did, but when all the logic was on one thread,
the waiting was not as obvious and errors were less likely to cause hangups.)

Unfortunately, under high load, I found that threads can be starved of CPU for
seconds at a time. The CI machine was flooring it trying to build the software
and run the tests. Setting the timeout high enough that there were no random
failures caused by CPU starvation meant that build failures would take a very
long time, delaying feedback for errors. That was nearly as bad as the
flakiness.

I mostly fixed the flakiness by making the number of threads configurable and
only running one thread in the testing environment, but I think a better way
would be to reduce the number of parallel jobs to keep the test machines' CPU
utilization rate at something more reasonable. Then, you can set timeouts that
reflect real performance expectations for your software.
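
Roughly, the shape of the fix (a minimal Python sketch, not the actual code; the TEST_WORKERS name and the numbers are invented for illustration):

    import os
    from concurrent.futures import ThreadPoolExecutor

    # Full parallelism by default, but the test environment can pin this
    # to 1 so results don't depend on scheduler timing under heavy load.
    WORKERS = int(os.environ.get("TEST_WORKERS", os.cpu_count() or 1))

    def process_all(items, handle, timeout_per_item=5.0):
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            futures = [pool.submit(handle, item) for item in items]
            # The timeout reflects expected per-item latency plus headroom,
            # rather than an arbitrary "surely long enough" constant.
            return [f.result(timeout=timeout_per_item) for f in futures]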

~~~
sebcat
When doing anything concurrently and/or in parallel, it is always good to have
an input queue and an upper bound on the number of tasks
(/threads/processes/...). An asynchronous flow also helps.

You need to handle other problems (back-pressure, priorities, timeouts, ...),
but if you need it, you need it.
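
Something like this sketch, in Python (the names and sizes are made up for the example):

    import queue
    import threading

    tasks = queue.Queue(maxsize=100)   # upper bound -> back-pressure on producers
    NUM_WORKERS = 4                    # upper bound on concurrent tasks

    def handle(item):
        pass                           # hypothetical task handler

    def worker():
        while True:
            item = tasks.get()
            if item is None:           # sentinel to shut a worker down
                break
            try:
                handle(item)
            finally:
                tasks.task_done()

    for _ in range(NUM_WORKERS):
        threading.Thread(target=worker, daemon=True).start()

    # Producers block here once the queue is full instead of overwhelming the system:
    # tasks.put(some_item, timeout=30)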

~~~
zenexer
But how do you go about creating tests for all of that?

~~~
cyphar
You don't need to write your own and test it -- use something like GNU
parallel (if your tests can be run individually as commands or shell scripts)
or a test framework that supports parallelism.

------
lawrenceyan
This document is getting absolutely decimated. Why did OP post this without
putting in some moderation controls first?

Edit: I found the actual engineering blog post that this Google Doc ended up
turning into. It's probably going to be an easier read from there, I imagine -
[https://testing.googleblog.com/2017/04/where-do-our-flaky-
te...](https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-
from.html#comments)

~~~
cronix
wow lol! Watching 63 people make realtime changes to the same doc is crazy
lol.

~~~
djohnston
lol, it's pretty funny

~~~
JohnHaugeland
not really. it's just jerks being vandals to show that they can.

------
slavik81
Original blog post (non-wiki version):
[https://testing.googleblog.com/2017/04/where-do-our-flaky-
te...](https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-
from.html)

Previous discussion:
[https://news.ycombinator.com/item?id=14146841](https://news.ycombinator.com/item?id=14146841)

------
johnfn
But of course larger tests are flaky. They have more lines of code. That’s
practically the null hypothesis, where you assume every line of code has a
certain chance of flaking, and then you just multiply out all the
probabilities.
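
(To spell out that multiplication with invented numbers - a toy Python sketch assuming each line flakes independently with the same tiny probability:)

    # If each of n lines misbehaves independently with probability p,
    # the whole test runs cleanly with probability (1 - p) ** n.
    p = 1e-5
    for n in (10, 100, 1000, 10000):
        print(n, 1 - (1 - p) ** n)   # chance of at least one flake per run
    # -> roughly 0.0001, 0.001, 0.01, 0.095: bigger tests flake more, all else equal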

I was expecting some in-depth analysis, like “tests that use time libraries
are flaky” or “tests that have callbacks are flaky” or something. This doesn’t
really make any headway towards understanding the problem.

~~~
joshuamorton
>where you assume every line of code has a certain chance of flaking, and then
you just multiply out all the probabilities.

This assumption doesn't make much sense, though. `y = x + 1` shouldn't flake.
Nor is your comment here a correct reflection of "size". Size in the article
means the time the test runs or the amount of memory used (i.e. constructing a
larger SUT), not LoC. Edit: I'm partially mistaken; one measure of size is
binary size, which should be generally correlated with LoC, though not
absolutely. Webdriver tests which use image/video goldens are an example where
test size may not correlate with LoC at all.

>This doesn’t really make any headway towards understanding the problem.

The second part of the article does address exactly this (webdriver tests are
correlated with flakiness, although they're also larger by virtue of needing
to spin up a headless browser, which has a memory cost).

More edit:

In my experience writing and heavily deflaking some (very large) tests using,
I believe, the "internal testing framework" mentioned in the article,
flakiness always appeared as a tradeoff with speed. Parallelizing tasks
introduces a level of randomness which can lead to flakes, as do short(er)
timeouts on things that can have long 95th or 99th percentile latencies. It
can sometimes be worth it to have a test flake some small percentage of the
time instead of waiting longer (granted, this is also often a flaw in test
design somewhere).

But when you're testing a big system with many moving parts, it's not uncommon
to have something that takes 30 seconds 95 or 99% of the time, and every once
in a while 5 minutes. The question is whether it's worth it to keep extending
your timeouts for n-sigma events, especially since that means that when your
tests are _actually_ broken, they'll potentially take much, much longer to
complete. You can lower the false negative rate, but the cost is that true
negatives become much more expensive to compute. Is that worth it? It depends.
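
A back-of-the-envelope version of that tradeoff (Python; the numbers are invented to match the 30-second/5-minute example, not measured):

    # Suppose a step finishes in ~30s 99% of the time and ~5min otherwise.
    # Each candidate timeout trades flake rate against how long a *real*
    # breakage takes to surface (it only reports failure after the full timeout).
    p_slow = 0.01                        # fraction of runs hitting the slow path
    for timeout in (60, 120, 300, 600):  # seconds
        flake_rate = p_slow if timeout < 300 else 0.0
        print(f"timeout={timeout:>3}s  flake_rate~{flake_rate:.1%}  "
              f"feedback on real breakage~{timeout}s")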

~~~
andrewprock
It's probably worth noting that

    y = x + 1

will flake out if x is an int and its value is INT_MAX.

~~~
joshuamorton
I'm pretty sure it will still act consistently. Under the same compiled binary
(which a hermetic environment enforces), the undefined behavior will be
consistent from run to run, since the same optimizations will be used.

That's not true in a theoretical sense (undefined behavior allows me to
rewrite signed overflow to "return 7 with some probability, and `rm ~/`
otherwise"), but for any modern compiler, I'd assume that the optimizations
chosen where signed overflow is undefined don't involve randomness.

~~~
andrewprock
It will act consistently if the value of x is consistent. If it's not, then it
could flake out on that line.

~~~
joshuamorton
If the value of x is inconsistent, you had flakiness somewhere else.

That line will always act consistently, modulo solar radiation.

------
boxed
That was underwhelming. I was hoping for some analysis of the statistics of
causes, not just "more code can contain more bugs". I've fought the flaky test
battle a lot at work and written a few articles that all contain more
information and mitigation strategies that I've used:

[https://medium.com/@boxed/use-the-biggest-
hammer-8425e4c7188...](https://medium.com/@boxed/use-the-biggest-
hammer-8425e4c71882) [https://medium.com/@boxed/intermittent-tests-aligned-
primary...](https://medium.com/@boxed/intermittent-tests-aligned-primary-keys-
dcf14953d9af) [https://medium.com/@boxed/flaky-tests-part-3-freeze-the-
worl...](https://medium.com/@boxed/flaky-tests-part-3-freeze-the-
world-e4929a0da00e)

------
cpeterso
Here is Mozilla's list of common causes for flaky Firefox tests (aka
"intermittent oranges"):

[https://developer.mozilla.org/en-
US/docs/Mozilla/QA/Avoiding...](https://developer.mozilla.org/en-
US/docs/Mozilla/QA/Avoiding_intermittent_oranges)

~~~
raverbashing
Those are interesting but very JS-centric (though some of the principles apply
regardless).

Of the tests I've seen, one common source of flakiness is ignored side-effects
between tests (the DB is not emptied, some source of persistence is not reset -
it could be a 'global' variable or _a mock library not mocking things
correctly_), and sometimes import conflicts (some module clashes with a builtin
library, but if you import things in a certain order it works).

Other sources include people using the equivalent of "now()" and then hitting
those small time windows where their assumptions break (for example, now() is
the last day of the year and you're assuming tomorrow is still in the same
year, things like that).
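
A small sketch of that now() trap and one way around it (Python; the function names are made up):

    import datetime

    def report_year(now=None):
        # Passing the clock in (or freezing it in tests) avoids the
        # "ran at 23:59 on Dec 31" class of flake.
        now = now or datetime.datetime.now()
        return now.year

    def test_report_year():
        fixed = datetime.datetime(2017, 12, 31, 23, 59, 59)
        assert report_year(now=fixed) == 2017
        # By contrast, asserting report_year() == datetime.datetime.now().year
        # can fail if the two now() calls straddle midnight on New Year's Eve.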

------
dsl
If you can manage to get an editing slot (Google Docs hard-caps the number of
simultaneous editors), the amount of suggested vandalism this already has is
amazing. It also gives an interesting view into what people feel about
Google's level of testing.

Here is a preview:
[https://cdn.pbrd.co/images/HZfekkKc.png](https://cdn.pbrd.co/images/HZfekkKc.png)

~~~
joshuamorton
And if you want to see the unmangled article:
[https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_a...](https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_aqka-P1jVOsFfufxqUUM/preview)

------
dws
Watch a document get vandalized in realtime. Fun.

~~~
heyjudy
IKR, realtime Wikipedia brought to you by multiplayer EtherPad. With
multiplayer word highlighting!

------
JohnHaugeland
Unfortunately, Redditors and HN members are destroying the document, because
they think it's funny.

The pre-destroyed document can be found here:

[https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_a...](https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_aqka-P1jVOsFfufxqUUM/edit#heading=h.ec0r4fypsleh)

~~~
sqldba
I hate Google Docs like this because I just want to see it, but I can’t
without “requesting permission”, which means someone knows exactly who I am.
~~~
JohnHaugeland
We used to be able to, until redditors and HN people decided to ruin it.

------
lifeisstillgood
"""Google has around 4.2 million tests that run on our continuous integration
system"""

On how many lines of code? Just wondering how big Google is, LOC-wise.

------
brianberns
Shouldn't the developers be falling over themselves to fix all those flaky
unit tests? I can't believe Google just lets them keep running unfixed. In my
code, I don't check anything in unless all unit tests pass (without flaking).

~~~
Buge
How many times do you run the tests? I've had tests that fail only about 1 in
200 runs, or less often.

~~~
brianberns
If someone showed me that my tests were not deterministic, I'd be in a hurry
to figure out why. Hidden race conditions? Unreliable initial state of the
system? Unexpected interactions between tests? I'm just surprised that this
article doesn't mention any effort to find the root cause and fix it.
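
For instance, the "interactions between tests" case is often as banal as leftover state (a made-up Python illustration):

    # Hypothetical example: module-level state shared between tests.
    _cache = {}

    def get_user(user_id):
        return _cache.setdefault(user_id, {"id": user_id, "visits": 0})

    def test_first_visit():
        get_user(1)["visits"] += 1
        assert get_user(1)["visits"] == 1   # passes when run alone...

    def test_fresh_user():
        assert get_user(1)["visits"] == 0   # ...fails if test_first_visit ran first,
                                            # and only when the runner orders it that way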

~~~
SpicyLemonZest
The challenge isn’t when _your_ tests are flaky. The challenge is when a test
written by some guy who left the company begins flaking, and your preliminary
investigation shows it’s a timing issue two layers below you in the stack, but
the people who own that project insist you’re using it wrong and your test
shouldn’t pass in the first place. It’s hard to sell why your team should drop
everything to untangle that knot.

~~~
tacostakohashi
These are all just excuses, and common ones at that. Think of it this way:

* The test cases written by the guy who left the company _are now your test cases_.

* Before he left the company, he was able to commit the test cases, passing whatever review / quality processes were put in place by the company (or that the company failed to put in place).

* If your team wants to have the benefit of reliable test cases, the knot needs to be untangled. That doesn't entail "dropping everything" (although perhaps that's the only way your organization knows how to operate); there is no reason that improving test cases can't be prioritized alongside, and in the context of, other work.

Unfortunately, in my experience it's usually the last bullet point where
things fall down. Product owners consistently pressure development teams to
deliver features _now_, and do test cases _later_, which means _never_.

~~~
SpicyLemonZest
The last bullet point often should fall down. There are a lot of tests that
aren’t worth the effort of making them run reliably, but most organizations
have (understandably) strong norms against fixing a failing test by removing
it.

------
aboutruby
(2017) (I was pretty sure I already read this)

------
gumby
Warning: requires logging into Google to read the post!

You can read the original post without the "paywall" here:
[https://testing.googleblog.com/2017/04/where-do-our-flaky-
te...](https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-
from.html)

------
bcherny
tl;dr-

1. Larger tests are more flaky

2. Choice of tech makes an additional, minor contribution to flakiness (e.g.
unit tests vs. webdriver vs. Android emulator)

