Unfortunately, under high load, I found that threads can be starved of CPU for seconds at a time. The CI machine was flooring it trying to build the software and run the tests. Setting the timeout high enough that there were no random failures caused by CPU starvation meant that build failures would take a very long time, delaying feedback for errors. That was nearly as bad as the flakiness.
I mostly fixed the flakiness by making the number of threads configurable and only running one thread in the testing environment, but I think a better way would be to reduce the number of parallel jobs to keep the test machines' CPU utilization rate at something more reasonable. Then, you can set timeouts that reflect real performance expectations for your software.
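A minimal sketch of that kind of knob, assuming Python (the MYAPP_WORKERS name is made up): default to one worker per core, but let the test environment pin it to a single worker.

```python
import os

def worker_count() -> int:
    """Parallelism knob: the test/CI environment can set MYAPP_WORKERS=1 to avoid CPU starvation."""
    configured = os.environ.get("MYAPP_WORKERS")
    if configured:
        return max(1, int(configured))
    return os.cpu_count() or 1  # fall back to one worker per core
```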
You need to handle other problems (back-pressure, priorities, timeouts, ...), but if you need it, you need it.
Setting up the CI system to report failure on the first error would have helped a lot too. Waiting on a single timeout was not too bad. The problem was waiting on hundreds of them.
Edit: I found the actual engineering blog post that this Google Doc ended up turning into. It's probably going to be an easier read from here, I imagine - https://testing.googleblog.com/2017/04/where-do-our-flaky-te...
You can see it here:
Previous discussion: https://news.ycombinator.com/item?id=14146841
I was expecting some in-depth analysis, like “tests that use time libraries are flaky” or “tests that have callbacks are flaky” or something. This doesn’t really make any headway towards understanding the problem.
This assumption doesn't make much sense though. `y = x + 1` shouldn't flake. Nor is your comment here a correct reflection of "size". Size in the article means the time the test runs or the amount of memory used (i.e. constructing a larger SUT), not LoC. Edit: I'm partially mistaken; one measure of size is binary size, which should be generally correlated with LoC, though not absolutely. Webdriver tests which use image/video goldens are an example where test size may not correlate with LoC at all.
>This doesn’t really make any headway towards understanding the problem.
The second part of the article does address exactly this (webdriver tests are correlated with flakiness, although they're also larger by virtue of needing to spin up a headless browser, which has a memory cost).
As someone who has written and heavily deflaked some (very large) tests using, I believe, the "internal testing framework" mentioned in the article, flakiness always appeared as a tradeoff with speed. Parallelizing tasks introduces a level of randomness which can lead to flakes. Short(er) timeouts on things that can have long 95th or 99th percentile latencies, etc. It can sometimes be worth it to have a test flake some small percentage of the time instead of waiting longer (granted, this is also often a flaw in test design somewhere).
But when you're testing a big system with many moving parts, it's not uncommon to have something that takes 30 seconds 95% or 99% of the time, and every once in a while 5 minutes. The question is whether it's worth it to keep extending your timeouts for n-sigma events, especially since that means that when your tests are actually broken, they'll potentially take much, much longer to complete. You can lower the false negative rate, but the cost is that true negatives become much more expensive to compute. Is that worth it? It depends.
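A back-of-the-envelope sketch of that tradeoff in Python (all numbers invented): covering the long tail with a bigger timeout removes the flakes, but every genuinely broken run now has to wait out the full timeout.

```python
# Hypothetical numbers for a single slow step in a large integration test.
p95_runtime_s = 30        # the step usually finishes in ~30 seconds...
tail_runtime_s = 300      # ...but occasionally takes ~5 minutes
tail_probability = 0.01   # say 1% of runs hit the tail

# Option A: timeout just above the common case.
timeout_a = 45
flake_rate_a = tail_probability   # tail runs get killed and show up as flakes
broken_wait_a = timeout_a         # a genuinely broken step fails after 45s

# Option B: timeout above the tail.
timeout_b = 360
flake_rate_b = 0.0                # the tail no longer flakes...
broken_wait_b = timeout_b         # ...but a genuinely broken step now burns 6 minutes

print(f"A: ~{flake_rate_a:.0%} flaky, broken step reported after {broken_wait_a}s")
print(f"B: ~{flake_rate_b:.0%} flaky, broken step reported after {broken_wait_b}s")
```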
> y = x + 1
That's not true in a theoretical sense (undefined behavior allows me to rewrite signed overflow to "return 7 with some probability, and `rm ~/` otherwise"), but for any modern compiler, I'd assume that the optimizations chosen where signed overflow is undefined don't involve randomness.
That line will always act consistently, modulo solar radiation.
Of the tests I've seen, one usual source of flakiness is ignored side effects between tests (the DB is not emptied, some source of persistence is not reset - could be a 'global' variable or a mock library not mocking things correctly), and sometimes import conflicts (some module clashes with a builtin library, but if you import things in a certain order it works).
Other sources include people using the equivalent of "now()", which works until you hit one of those small time windows where your assumptions break (for example, now() falls on the last day of the year and you assumed tomorrow is still in the same year, things like that).
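A minimal sketch of that failure mode, assuming Python (`is_same_year_tomorrow` is a made-up helper): the first test silently depends on the day CI happens to run, the second injects the date and stays deterministic.

```python
from datetime import date, timedelta

def is_same_year_tomorrow(today=None):
    """Hypothetical production code: is tomorrow still in the same year?"""
    today = today or date.today()  # hidden dependency on "now"
    return (today + timedelta(days=1)).year == today.year

# Flaky: passes 364 days a year, fails when CI runs on December 31st.
def test_same_year_flaky():
    assert is_same_year_tomorrow()

# Deterministic: the date is injected, so the result never depends on when CI runs.
def test_same_year_deterministic():
    assert is_same_year_tomorrow(date(2024, 6, 1))
    assert not is_same_year_tomorrow(date(2024, 12, 31))
```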
Here is a preview: https://cdn.pbrd.co/images/HZfekkKc.png
> 4.2 million
The pre-destroyed document can be found here:
On how many lines of code? Just wondering how big Google is, LOC-wise.
They’re hard to fix. If you can’t reliably reproduce them, then you don’t even know whether they’re fixed or you just got lucky.
They’re hard to find. The change where they first appear tends not to be related to the cause, which leads to a lot of “not my problem”.
Really, that’s just lack of ownership. Ultimately, who owns the entire codebase? You say “my code”, but I see a lot of people thinking that means “the files I edit”, not everything.
Someone has to explicitly decide to spend day(s) fixing tests, and then might only fix a couple. The ROI is super frustrating, and it’s hard to spend whole days on a single test.
There’s a ton of broken window theory. Once you get used to hitting rebuild, you just always hit rebuild and don’t even know it’s a new flake.
The most effective thing I’ve seen when you hit this point is to get all the test data (JUnit XML, say) somewhere queryable, make graphs, and make them an explicit part of OKRs with time allocated against them.
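A minimal sketch of the "queryable" part, assuming Python and a pile of JUnit XML reports collected from CI (the reports/ layout and one-file-per-build assumption are mine): count runs and failures per test so the flakiest ones can be ranked and graphed.

```python
import glob
import xml.etree.ElementTree as ET
from collections import Counter

runs = Counter()      # test name -> number of executions
failures = Counter()  # test name -> number of failures/errors

# Assumes CI drops one JUnit XML report per build under ./reports/.
for path in glob.glob("reports/**/*.xml", recursive=True):
    for case in ET.parse(path).getroot().iter("testcase"):
        name = f"{case.get('classname')}.{case.get('name')}"
        runs[name] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            failures[name] += 1

# Rank by observed failure rate; the noisiest tests float to the top.
for name in sorted(failures, key=lambda n: failures[n] / runs[n], reverse=True)[:20]:
    print(f"{failures[name] / runs[name]:6.1%}  {name}")
```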
The worst was that some tests were actually wrong and only passed due to nondeterminism. Once I made the tests deterministic, I discovered they always failed. Until that point, I found those test failures very confusing, because it seemed like my efforts to improve the reliability of the tests were having no effect on them.
The test dealt with a "week", and on Saturday evening, the current week in UTC differed from that of the local timezone.
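A sketch of how that can happen, assuming Python, a Sunday-starting week convention, and a US timezone (both are guesses about the original test): on Saturday evening local time it is already Sunday in UTC, so the two clocks disagree about the week number.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Saturday, 18:00 in Los Angeles == Sunday, 02:00 UTC.
local = datetime(2024, 1, 6, 18, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
utc = local.astimezone(timezone.utc)

# %U = week of the year, with Sunday as the first day of the week.
print(local.strftime("%A, week %U"))  # Saturday, week 00
print(utc.strftime("%A, week %U"))    # Sunday, week 01
```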
* The test cases written by some guy who left the company are now your test cases.
* Before he left the company, he was able to commit the test cases, passing whatever review / quality processes were put in place by the company (or that the company failed to put in place).
* If your team wants to have the benefit of reliable test cases, the knot needs to be untangled. That doesn't entail "dropping everything" (although perhaps that's the only way your organization knows how to operate); there is no reason that improving test cases can't be prioritized alongside and in the context of other work.
Unfortunately, in my experience it's usually the last bullet point where things fall down. Product owners consistently pressure development teams to deliver features now, and do test cases later, which means never.
I don’t know why the developers don’t care, but I suspect they’ve learned to ignore specific tests which repeatedly fail.
Like in most development places, management doesn’t give a flying shit about all of the tests passing if it means a feature will ship.
Probably they’re mostly older tests, so fixing them detracts from adding a new fancy feature.
You can read the original post without the "paywall" here: https://testing.googleblog.com/2017/04/where-do-our-flaky-te...
1. Larger tests are more flaky
2. Choice of tech has additional, minor contributions to flakiness (e.g. unit tests vs. webdriver vs. Android emulator)