Where do Google's flaky tests come from? (docs.google.com)
61 points by bobjordan 14 days ago | 49 comments

Not at Google, but making reliable tests was one thing I really struggled with professionally. There are many, many possible causes of non-determinism. Even something as simple as using multiple threads means that you suddenly realize that you must decide how long you're willing to wait for the other thread to complete. (You always kind of did, but when all the logic was on one thread, the waiting was not as obvious and errors were less likely to cause hangups.)

Unfortunately, under high load, I found that threads can be starved of CPU for seconds at a time. The CI machine was flooring it trying to build the software and run the tests. Setting the timeout high enough that there were no random failures caused by CPU starvation meant that build failures would take a very long time, delaying feedback for errors. That was nearly as bad as the flakiness.

I mostly fixed the flakiness by making the number of threads configurable and only running one thread in the testing environment, but I think a better way would be to reduce the number of parallel jobs to keep the test machines' CPU utilization rate at something more reasonable. Then, you can set timeouts that reflect real performance expectations for your software.
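A minimal sketch of the two knobs described above, assuming a hypothetical `TEST_WORKERS` environment variable for the thread count and an explicit timeout on every wait. The function names and the 5-second limit are illustrative, not taken from the original project.

```python
import concurrent.futures
import os
import time

# Hypothetical setting: the worker count comes from the environment so a
# CI job can force single-threaded runs without code changes.
WORKERS = int(os.environ.get("TEST_WORKERS", "1"))


def slow_square(n: int) -> int:
    """Stand-in for real work that runs on another thread."""
    time.sleep(0.01)
    return n * n


def run_with_deadline(values, timeout_s: float):
    """Run slow_square over values, waiting at most timeout_s per result."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [pool.submit(slow_square, v) for v in values]
        # result() raises concurrent.futures.TimeoutError if the worker
        # has not finished within timeout_s seconds.
        return [f.result(timeout=timeout_s) for f in futures]


results = run_with_deadline([1, 2, 3], timeout_s=5.0)
```

The timeout is where the tradeoff lives: a generous value tolerates a starved CI machine, but every genuinely hung test then takes that long to report.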

When doing anything concurrently and/or in parallel, it is always good to have an input queue, an upper bound on the number of tasks (threads, processes, ...), and an asynchronous flow.

You need to handle other problems (back-pressure, priorities, timeouts, ...) but if you need it you need it.
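One way to picture the input queue with an upper bound on workers is the sketch below. The limits (`MAX_WORKERS`, `MAX_PENDING`) and the doubling "task" are made up for illustration; priorities and per-task timeouts are left out.

```python
import queue
import threading

MAX_WORKERS = 2      # upper bound on concurrent tasks (illustrative value)
MAX_PENDING = 4      # queue capacity; put() blocks when full (back-pressure)

tasks = queue.Queue(maxsize=MAX_PENDING)
results = []
results_lock = threading.Lock()
STOP = object()      # sentinel telling a worker to exit


def worker() -> None:
    """Pull items off the queue until the sentinel arrives."""
    while True:
        item = tasks.get()
        if item is STOP:
            break
        with results_lock:
            results.append(item * 2)


threads = [threading.Thread(target=worker) for _ in range(MAX_WORKERS)]
for t in threads:
    t.start()

for n in range(10):
    # Blocks while the queue is full, so producers cannot outrun workers;
    # raises queue.Full if no slot frees up within 5 seconds.
    tasks.put(n, timeout=5.0)

for _ in threads:
    tasks.put(STOP)
for t in threads:
    t.join(timeout=5.0)
```

Because the number of workers is fixed, load on the machine stays bounded no matter how many tasks are submitted.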

But how do you go about creating tests for all of that?

You don't need to write your own and test it -- use something like GNU parallel (if your tests can be run individually as commands or shell scripts) or a test framework that supports parallelism.

In such cases it is better to wait on some outcome of the asynchronous calculation. In the Java ecosystem we have Awaitility (https://github.com/awaitility/awaitility), which provides a bunch of utility methods in this spirit.
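The idea of waiting on an outcome instead of sleeping for a fixed time can be sketched in a few lines. This is not Awaitility's API; `wait_until` is a hypothetical helper written for illustration, and it still needs a timeout so a missing result cannot hang the test forever.

```python
import threading
import time


def wait_until(condition, timeout_s: float, interval_s: float = 0.01) -> bool:
    """Poll condition() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval_s)
    return condition()  # one last check at the deadline


done = threading.Event()
threading.Timer(0.05, done.set).start()   # simulated asynchronous work

finished = wait_until(done.is_set, timeout_s=5.0)
```

The test returns as soon as the condition holds, so the timeout only costs time when something is actually wrong.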

That's how our tests worked, but you still must choose a timeout or you will wait forever when you do not receive the response you're listening for.

Setting up the CI system to report failure on the first error would have helped a lot too. Waiting on a single timeout was not too bad. The problem was waiting on hundreds of them.

This document is getting absolutely decimated. Why did OP post this without putting in some moderation controls first?

Edit: I found the actual engineering blog post that this Google Doc ended up turning into. It's probably going to be an easier read from here, I imagine: https://testing.googleblog.com/2017/04/where-do-our-flaky-te...

Change "Suggesting" to "Viewing" in the top right to get rid of the vandalism for now.

The document they're destroying is rather different, actually.

You can see it here:


wow lol! Watching 63 people making realtime changes to the same doc is crazy lol.

lol, it's pretty funny

not really. it's just jerks being vandals to show that they can.

But of course larger tests are flaky. They have more lines of code. That’s practically the null hypothesis, where you assume every line of code has a certain chance of flaking, and then you just multiply out all the probabilities.

I was expecting some in depth analysis, like “tests that use time libraries are flaky” or “tests that have callbacks are flaky” or something. This doesn’t really make any headway towards understanding the problem.

>where you assume every line of code has a certain chance of flaking, and then you just multiply out all the probabilities.

This assumption doesn't make much sense though. `y = x + 1` shouldn't flake. Nor is your comment here a correct reflection of "size". Size in the article means the time the test runs or the amount of memory used (i.e., constructing a larger SUT), not LoC. Edit: I'm partially mistaken; one measure of size is binary size, which should generally be correlated with LoC, though not absolutely. Webdriver tests which use image/video goldens are an example where test size may not correlate with LoC at all.

>This doesn’t really make any headway towards understanding the problem.

The second part of the article does address exactly this (webdriver tests are correlated with flakiness, although they're also larger by virtue of needing to spin up a headless browser, which has a memory cost).

More edit:

As someone who has written and heavily deflaked some (very large) tests using, I believe, the "internal testing framework" mentioned in the article, I found that flakiness always appeared as a tradeoff with speed. Parallelizing tasks introduces a level of randomness which can lead to flakes. So do short(er) timeouts on things that can have long 95th or 99th percentile latencies. It can sometimes be worth it to have a test flake some small percentage of the time instead of waiting longer (granted, this is also often a flaw in test design somewhere).

But when you're testing a big system with many moving parts, it's not uncommon to have something that takes 30 seconds 95 or 99% of the time, and every once in a while 5 minutes. The question is whether it's worth it to keep extending your timeouts for n-sigma events, especially since that means that when your tests are actually broken, they'll potentially take much, much longer to complete. You can lower the false negative rate, but the cost is that true negatives become much more expensive to compute. Is that worth it? It depends.
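To make the tradeoff concrete, here is a rough sketch that estimates how often a healthy test would trip a given timeout. The latency sample is invented to match the "30 seconds most of the time, 5 minutes once in a while" shape described above.

```python
def spurious_failure_rate(latencies_s, timeout_s: float) -> float:
    """Fraction of healthy runs that would exceed the timeout."""
    late = sum(1 for t in latencies_s if t > timeout_s)
    return late / len(latencies_s)


# Made-up sample: 97 runs at 30 s and 3 outliers at 300 s.
samples = [30.0] * 97 + [300.0] * 3

rate_short = spurious_failure_rate(samples, timeout_s=60.0)    # 3 of 100 runs flake
rate_long = spurious_failure_rate(samples, timeout_s=600.0)    # no flakes
```

With the 600-second timeout the flakes disappear, but a test that is really broken now takes ten times as long to report as it would with the 60-second one.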

It's probably worth noting that

    y = x + 1
will flake out if x is an int, and its value is INT_MAX.

I'm pretty sure it will still act consistently. Under the same compiled binary (which a hermetic environment enforces), the undefined behavior will be consistent from run to run, since the same optimizations will be used.

That's not true in a theoretical sense (undefined behavior allows me to rewrite signed overflow to "return 7 with some probability, and `rm ~/` otherwise"), but for any modern compiler, I'd assume that the optimizations chosen where signed overflow is undefined don't involve randomness.

It may be fun to point out that the compiler could technically launch a game of Tetris or rm -rf /, but is that what any mainstream compiler does? I guess the behavior could be inconsistent, but who'd design a compiler like that? Seems like it'll roll over or it won't (i.e., stay stuck at INT_MAX). Good unit tests would catch this if you changed compilers, but really you should handle it explicitly (look before you leap).

Not at all. In a code path where the compiler reasons that x is definitely INT_MAX, it could simply omit this code, leaving y uninitialized and nondeterministic.

It will act consistently if the value of x is consistent. If it's not, then it could flake out on that line.

If the value of x is inconsistent, you had flakiness somewhere else.

That line will always act consistently, modulo solar radiation.

That was underwhelming. I was hoping for some analysis of the statistics of causes, not just "more code can contain more bugs". I've fought the flaky test battle a lot at work and written a few articles that all contain more information and mitigation strategies that I've used:

https://medium.com/@boxed/use-the-biggest-hammer-8425e4c7188... https://medium.com/@boxed/intermittent-tests-aligned-primary... https://medium.com/@boxed/flaky-tests-part-3-freeze-the-worl...

Here is Mozilla's list of common causes for flaky Firefox tests (aka "intermittent oranges"):


Those are interesting but very JS centric (though some principles apply regardless)

Of the tests I've seen, one usual source of flakiness is ignored side-effects between tests (the DB is not emptied, or some source of persistence is not reset, which could be a 'global' variable or a mock library not mocking things correctly). Sometimes it's import conflicts (some module clashes with a builtin library, but if you import things in a certain order it works).
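A small sketch of the leaked-state problem, using a hypothetical module-level `CACHE` as a stand-in for a database or global variable. Resetting it in `setUp` makes each test independent of execution order.

```python
import unittest

# Hypothetical module-level cache standing in for a database or global.
CACHE = {}


def remember(key, value):
    """Store a value and return how many entries the cache now holds."""
    CACHE[key] = value
    return len(CACHE)


class CacheTest(unittest.TestCase):
    def setUp(self):
        # Reset shared state so each test starts from a known baseline.
        CACHE.clear()

    def test_first_insert(self):
        self.assertEqual(remember("a", 1), 1)

    def test_second_insert(self):
        # Without setUp, remember() would return 2 here whenever this
        # test ran after test_first_insert.
        self.assertEqual(remember("b", 2), 1)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CacheTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```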

Other sources include people using the equivalent of "now()" and then landing in one of those small time windows where your assumptions break (for example, now() is the last day of the year and you're assuming tomorrow is still in the same year).
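One common mitigation, sketched here with a made-up function, is to pass the current date in as a parameter instead of reading the clock inside the code under test. The edge case then becomes an ordinary, repeatable test input.

```python
from datetime import date, timedelta


def is_tomorrow_same_year(today: date) -> bool:
    """Takes the date as a parameter instead of calling date.today()."""
    return (today + timedelta(days=1)).year == today.year


ordinary_day = is_tomorrow_same_year(date(2019, 6, 15))     # True
new_years_eve = is_tomorrow_same_year(date(2019, 12, 31))   # False
```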

If you can manage to get an editing slot (Google Docs hard-caps the number of simultaneous editors), the amount of suggested vandalism this already has is amazing. It also gives an interesting view into what people are feeling about Google's level of testing.

Here is a preview: https://cdn.pbrd.co/images/HZfekkKc.png

And if you want to see the unmangled article: https://docs.google.com/document/d/1mZ0-Kc97DI_F3tf_GBW_NB_a...

> nonexistent

> 4.2 million

You have a really low bar for what constitutes “interesting”.

Watch a document get vandalized in realtime. Fun.

IKR, realtime Wikipedia brought to you by multiplayer EtherPad. With multiplayer word highlighting!

Unfortunately, Redditors and HN members are destroying the document, because they think it's funny.

The pre-destroyed document can be found here:


I hate Google docs like this because I just want to see it but I can’t without “requesting permission” which means someone knows exactly who I am.

we used to be able to until redditors and hn people decided to ruin it

"""Google has around 4.2 million tests that run on our continuous integration system"""

On how many lines of code? Just wondering how big Google is LOC-wise?

Shouldn't the developers be falling over themselves to fix all those flaky unit tests? I can't believe Google just lets them keep running unfixed. In my code, I don't check anything in unless all unit tests pass (without flaking).

I’ve never worked on a large project (ten or more devs, including people who have left) that didn’t have flakes.

They’re hard to fix. If you can’t reliably reproduce a flake, you don’t even know whether it’s fixed or you’re just lucky.

They’re hard to find. The change in which they first appear tends not to be related to the cause, which leads to a lot of “not my problem.”

Really, that’s just lack of ownership. Ultimately, who owns the entire codebase? You say “my code,” but I see a lot of people thinking that means the files they edit, not everything.

Someone has to explicitly decide to spend day(s) fixing tests, and then might only fix a couple. The ROI is super frustrating, and it’s hard to spend whole days on a single test.

There’s a ton of broken window theory. Once you get used to hitting rebuild, you just always hit rebuild and don’t even know it’s a new flake.

The most effective thing I’ve seen when you hit this point is to get all the test data (JUnit XML, say) somewhere queryable, make graphs, and make fixing flakes an explicit part of your OKRs with time allocated to it.
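As a rough illustration of making test data queryable, the sketch below tallies per-test failure rates from reports in the common JUnit XML layout. The two inline reports are invented; a real setup would read the files the CI system produces and probably load them into a database.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two made-up reports in the common JUnit XML layout; real reports would be
# read from files produced by the CI system.
REPORTS = [
    """<testsuite><testcase classname="db" name="test_save"/>
       <testcase classname="ui" name="test_login"><failure message="timeout"/></testcase>
       </testsuite>""",
    """<testsuite><testcase classname="db" name="test_save"/>
       <testcase classname="ui" name="test_login"/></testsuite>""",
]

runs, failures = Counter(), Counter()
for report in REPORTS:
    for case in ET.fromstring(report).iter("testcase"):
        test_id = f'{case.get("classname")}.{case.get("name")}'
        runs[test_id] += 1
        # A <failure> or <error> child marks a failed run.
        if case.find("failure") is not None or case.find("error") is not None:
            failures[test_id] += 1

failure_rate = {t: failures[t] / runs[t] for t in runs}
```

A test whose rate sits strictly between 0 and 1 over many builds of the same code is a flake candidate worth graphing.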

How many times do you run the tests? I've had tests that only fail about 1/200 times or less.

It gets pretty brutal when you have 10,000 tests that each have a 1/1000 chance of failing.
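The arithmetic behind both numbers can be sketched as follows, under the simplifying assumption that flakes are independent and each test has a fixed failure probability. The function names are illustrative.

```python
import math


def p_any_failure(n_tests: int, p_flake: float) -> float:
    """Chance that at least one of n independent tests flakes."""
    return 1.0 - (1.0 - p_flake) ** n_tests


def runs_to_observe(p_flake: float, confidence: float = 0.95) -> int:
    """Runs needed to see at least one failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_flake))


# 10,000 tests at 1/1000 each: almost every build has at least one flake.
build_failure = p_any_failure(10_000, 1 / 1000)

# A 1/200 flake needs about 600 runs to show up with 95% confidence.
runs_needed = runs_to_observe(1 / 200)
```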

The worst was that some tests were actually wrong and only passed due to nondeterminism. Once I made the tests deterministic, I discovered they always failed. Until that point, I found those test failures very confusing, because it seemed like my efforts to improve the reliability of the tests were having no effect on them.

I once had a test that only failed on Saturday evenings. We didn't notice it for the longest time because people generally don't check in code then; they're not working.

The test dealt with a "week", and on Saturday evening, the current week in UTC differed from that of the local timezone.
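A sketch of how this kind of mismatch can arise, assuming weeks that start on Sunday and a local zone at a fixed UTC-8 offset. Both assumptions are made up for illustration; the comment does not say which convention or timezone was involved.

```python
from datetime import datetime, timedelta, timezone


def week_start(moment: datetime):
    """Date of the Sunday on or before moment (assumes Sunday-start weeks)."""
    days_since_sunday = (moment.weekday() + 1) % 7   # Monday=0 ... Sunday=6
    return (moment - timedelta(days=days_since_sunday)).date()


# Assumed local zone: a fixed UTC-8 offset, chosen only for illustration.
LOCAL = timezone(timedelta(hours=-8))

saturday_evening = datetime(2019, 1, 5, 20, 0, tzinfo=LOCAL)
same_instant_utc = saturday_evening.astimezone(timezone.utc)

local_week = week_start(saturday_evening)    # week beginning 2018-12-30
utc_week = week_start(same_instant_utc)      # week beginning 2019-01-06
```

The same instant lands in two different weeks, so any test that compares a locally computed week against a UTC one fails only during those few hours.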

If someone showed me that my tests were not deterministic, I'd be in a hurry to figure out why. Hidden race conditions? Unreliable initial state of the system? Unexpected interactions between tests? I'm just surprised that this article doesn't mention any effort to find the root cause and fix it.

The challenge isn’t when your tests are flaky. The challenge is when a test written by some guy who left the company begins flaking, and your preliminary investigation shows it’s a timing issue two layers below you in the stack, but the people who own that project insist you’re using it wrong and your test shouldn’t pass in the first place. It’s hard to sell why your team should drop everything to untangle that knot.

These are all just excuses, and common ones at that. Think of it this way:

* The test cases written by some guy who left the company are now your test cases.

* Before he left the company, he was able to commit the test cases, passing whatever review / quality processes were put in place by the company (or that the company failed to put in place).

* If your team wants to have the benefit of reliable test cases, the knot needs to be untangled. That doesn't entail "dropping everything" (although perhaps that's the only way your organization knows how to operate); there is no reason that improving test cases can't be prioritized alongside and in the context of other work.

Unfortunately, in my experience it's usually the last bullet point where things fall down. Product owners consistently pressure development teams to deliver features now, and do test cases later, which means never.

The last bullet point often should fall down. There are a lot of tests that aren’t worth the effort of making them run reliably, but most organizations have (understandably) strong norms against fixing a failing test by removing it.

It's all about competition. You move fast and deliver rapidly, or get left behind. Time is actually money, for an engineer at least. So unless those tests are needed now, it always seems better to de-prioritize them.

Any test with a normally non-hit timeout anywhere is nondeterministic, because the operating system could schedule your test very slowly causing it to hit the timeout.

A previous workplace has thousands or tens of thousands of tests, and about 2-3% fail per build.

I don’t know why the developers don’t care but I suspect they’ve learned to ignore specific tests which repeatedly fail.

Like in most development places management doesn’t give a flying shit about all of the tests passing if it means a feature will ship.

Probably they’re mostly older tests and so fixing them detracts from adding a new fancy feature

(2017) (I was pretty sure I already read this)

Warning: requires logging into google to read post!

You can read the original post without the "paywall" here: https://testing.googleblog.com/2017/04/where-do-our-flaky-te...


1. Larger tests are more flaky

2. Choice of tech has additional, minor contributions to flakiness (e.g., unit tests vs. webdriver vs. Android emulator)
