
Tests that sometimes fail - sams99
https://samsaffron.com/archive/2019/05/15/tests-that-sometimes-fail
======
matharmin
We've had a couple of cases of flaky tests failing builds over the last two
years at my company. Most often it's browser / end-to-end type tests (e.g.
selenium-style tests) that are the most flaky. Many of them only fail in 1-3%
of cases, but if you have enough of them, the chance of a failing build is
significant.

If you have entire builds that are flaky, you end up training developers to
just click "rebuild" the first one or two times a build fails, which can
drastically increase the time before realizing the build is actually broken.

An important realization is that unit testing is not a good tool for testing
the flakiness of your main code - it is simply not a reliable indicator of
failing code. Most of the time it's the test itself that is flaky, and it's
not worth your time making every single test 100% reliable.

Some things we've implemented that help a lot:

1\. Have a system to reproduce the random failures. It took about a day to
build tooling that can run, say, 100 instances of any test suite in parallel
in CircleCI and record the failure rate of individual tests (see the sketch
after this list).

2\. If a test has a failure rate of > 10%, it indicates an issue in that test
that should be fixed. By fixing these tests, we've found a couple of
techniques to increase overall robustness of our tests.

3\. If a test has a failure rate of < 3%, it is likely not worth your time
fixing it. For these, we retry each failing test up to three times. Not all
test frameworks support retrying out of the box, but you can usually find a
workaround. The retries can be restricted to specific tests or classes of
tests if needed (e.g. only retry browser-based tests).
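
As a rough sketch of the tooling in point 1 (not the actual CircleCI setup,
which fanned runs out across containers; the file name and env var here are
made up), something this small is enough to put a number on a suspected flaky
test:

    # flake_rate.rb - run a test command many times and report its failure rate.
    # Usage: RUNS=100 ruby flake_rate.rb bundle exec rspec spec/flaky_spec.rb
    runs = Integer(ENV.fetch("RUNS", "100"))
    cmd  = ARGV.join(" ")

    failures = runs.times.count { !system(cmd, out: File::NULL, err: File::NULL) }
    puts format("%s: failed %d/%d runs (%.1f%%)",
                cmd, failures, runs, 100.0 * failures / runs)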

~~~
justinpombrio
> If a test has a failure rate of < 3%, it is likely not worth your time
> fixing it.

How do you know? What you say is plausible, but it's also plausible that
these rarely-failing tests reflect code that also fails rarely in production,
and occasionally breaks things badly, causes outages, or makes customers think
of your software as flaky.

Since you say this, I presume you've spent the time to actually track down the
root causes of several tests that fail < 3% of the time? If so, what did you
find? Some sort of issues with the test framework, or issues with your own
code that you're confident would only ever be exposed by testing, or something
else? I'm very curious.

~~~
paulddraper
It's possible, but after fixing lots of these, my experience says we're
usually talking about stuff like clicking a button before a modal has animated
out of the way.

It's sort of a "bug" in that yes, clicking here and then here 1ms later
doesn't do the best thing, but it's basically irrelevant.

Testing is inherently a probabilistic endeavor.

"What can I do that is most likely to prevent the largest amount of
bugginess?"

Fixing tests that rarely fail is -- in my experience -- a poor answer to such
a question.

~~~
henrikschroder
> Testing is inherently a probabilistic endeavor.

That's a pretty powerful insight!

I think that a lot of developers who are firmly in the test-driven camp don't
realize this, but instead think that if you have 100% test coverage, your code
will work 100% of the time. Fixing bugs, to them, is "just" an inevitable
result of increasing your test coverage, so that's what they focus on.

------
mceachen
Every company I've founded or worked for has struggled with flaky tests.

Twitter had a comprehensive browser and system test suite that took about an
hour to run (and they had a large CI worker cluster). Flaky tests could and
did scuttle deploys. It was a never-ending struggle to keep CI green, but most
engineers saw de-flaking (not just deleting the test) as a critical task.

PhotoStructure has an 8-job GitLab CI pipeline that runs on macOS, Windows,
and Linux. Keeping the ~3,000 (and growing) tests passing reliably has proven
to be a non-trivial task, and researching why a given test is flaky on one OS
versus another has almost invariably led to the discovery and hardening of
edge and corner conditions.

It seems that TFA only touched on set ordering, incomplete db resets and time
issues. There are _many_ other spectres to fight as soon as you deal with
multi-process systems on multiple OSes, including file system case
sensitivity, incomplete file system resets, fork behavior and child process
management, and network and stream management.

There are several pieces I added to stabilize CI, including robust shutdown
and child process management systems. I can't say I would have prioritized
those things if I didn't have tests, but now that I have them, I'm glad
they're there.

~~~
hexfran
Sorry for the OT, what is "TFA"?

~~~
mceachen
Sorry. The Fine Article. I didn't mean it in the disparaging connotation.

It's a reference to RTFM, Read The Fine Manual.

TIL: RTFM was a phrase from the 40s: "Read the field manual."

~~~
panopticon
I've never seen the F in RTFM mean “fine” before. I've always seen it used as
the more vulgar “read the f___ing manual”.

~~~
j88439h84
I believe that's the joke.

------
joosters
In an old job, we had a frustrating test that passed well over 99 times in
100. It was shrugged off for a very long time until a developer eventually
tracked it down to code that was generating a random SSL key pair. If the
first byte of the key was 0, faulty code elsewhere would mishandle the key and
the test failed.

Keeping the randomness in the test was the key factor in tracking down this
obscure bug. If the test had been made completely deterministic, the test
harness would never have discovered the problem. So although repeatable tests
are in most cases a good thing, non-determinism can unearth problems. The
trick is how to do this without sucking up huge amounts of bug-tracking
time...

(Much effort was spent in making the test repeatable during debugging, but of
course the crypto code elsewhere was deliberately trying to get as much
randomness as it could source...)

~~~
kenha
It doesn't seem to be a strong argument to have non-deterministic tests.

There was the logic that generates the SSL key pair, and there is the faulty
logic that consumes it. Based on the description, it seems like an indication
of missing test coverage around the faulty code. If, when the faulty code was
written, more time had been spent on understanding the assumptions the code
makes, then maybe the bug wouldn't have appeared in the first place.

This anecdote, however, does bring up a good point: don't shrug off
intermittently failing tests - dig in and understand their root cause.

~~~
AstralStorm
The only other solution is exhaustive property testing. And even that is not
workable when concurrency is in play.

~~~
dllthomas
Good luck exhaustively testing something with a cryptographic key as input.
Non-exhaustive property testing is also pretty cool, though.

------
pytester
What I found to be the major reasons for flaky tests:

* Non-determinism in the code - e.g. select without an order by, random number generators, hashmaps turned into lists, etc. - Fixed by turning non-deterministic code into deterministic code, testing for properties rather than exact outcomes, or isolating and mocking the non-deterministic code (see the sketch after this list).

* Lack of control over the environment - e.g. calling a third party service that goes down occasionally, use of a locally run database that gets periodically upgraded by the package manager - fixed by gradually bringing everything required to run your software under control (e.g. installing specific versions without package manager, mocking 3rd party services, intercepting syscalls that get time and replacing them with consistent values).

* Race conditions - in this case the test should really repeat the same actions so that it consistently catches the flakiness.
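
To make the first two bullets concrete, here is a minimal sketch of isolating
one source of non-determinism (the clock) behind an injectable dependency; all
names are illustrative, not from the comment above:

    # Production code takes a clock; tests pin it to a fixed instant.
    class SessionExpiry
      def initialize(clock: -> { Time.now })
        @clock = clock
      end

      def expired?(session)
        session.expires_at <= @clock.call
      end
    end

    Session = Struct.new(:expires_at)

    now     = Time.utc(2019, 5, 15, 12, 0, 0)
    checker = SessionExpiry.new(clock: -> { now })

    raise "should be expired"     unless checker.expired?(Session.new(now - 60))
    raise "should not be expired" if checker.expired?(Session.new(now + 60))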

~~~
taneq
> e.g. calling a third party service that goes down occasionally

I thought tests weren't meant to have external dependencies (or at least, ones
outside the control of the test harness)?

~~~
DougBTX
In this context, yes, tests shouldn't require external dependencies. By
"tests" we're really talking about tests like, "is this particular build
consistent with its spec?"

There could be other types of test where a remote call would make sense, for
example, "was the deployment successful?" tests might try to verify that the
deployed version of the software can communicate with external dependencies
correctly.

~~~
yebyen
There are also less justified cases that you might have, especially once you
start going down the road of "my dev environment should be a clone of
production".

If you have an Employee model and it returns certain attributes of an employee
like Salary, you might have tests that depend on the structure of an employee.
You might have, say, Job and Position models which define an employee-job and
the base definition of the particular job. Say Position has a salary range
associated, and Job has validation rules which check that the salary is in
range.

You could define factories for all those things, or you could use real
examples that are served by a live Employee API.

The canonical way to address this is with factories and mocks, if you have
time to do that! (It will probably save you in the long run, when that
complexity has grown a bit.)

If you just grab the example person whose salary is out of the range for their
position and quickly test that the behavior in nearby modules matches your
expectations, well, those are still tests, and you could be forgiven for
writing them this way.

I think they call these the "London" and "Detroit" styles of mocking, but the
short version IMHO is that the mistake was making dev a clone of production,
and any errors in judgement that came after that were merely coping
mechanisms.

If you want your tests to tell you when something has changed that requires
your attention, you need a test that hits this Employee API and will fail if
the structure of the employees returned is no longer conforming to your
expectations, even though it's external. The design of such a thing is
something I won't profess to know how to do well.

(It's better to version your API and write a changelog that tells what you
need to know if the old version has been replaced by a new version, but if
you're writing these microservices all for yourself it can seem pedantic to
explicitly version your API, too. There are also coping mechanisms you'll need
to embrace once you get to "we're not incrementing the API version" and
surprise, many of them are the same ones...)

------
roland35
There was one weird bug reported to me in a microcontroller-based project I
was recently working on which shut off half the LCD screen. I wrote a test
which blasted the LCD screen with random characters and commands and did not
see the same error for a while... but it finally happened during a test! I was
then able to see that when checking the LCD state between commands I would
only toggle the chip select for the first half of the LCD (there were 2 driver
chips built into the screen and you had to read each chip individually). There
would have been no way I could have recreated the bug without automated tests.

I have had to deal with non-deterministic tests with my embedded systems and
robotic test suites and have found a few solutions to deal with them:

\- Do a full power reset between tests if possible, or do it between test
suites when you can combine tests together in suites that don't require a
complete clean slate

\- Reset all settings and parameters between tests. A lot of embedded systems
have settings saved in Flash or EEPROM which can affect all sorts of
behaviors, so make sure it always starts at the default setting.

\- Have test commands for all system inputs and initialize all inputs to known
values.

\- Have test modes for all system outputs such as motors. If there is a motor
which has a speed encoder, you can make the test mode for the speed encoder
input match the commanded motor value, or also be able to trigger error inputs
such as a stalled motor.

\- Use a user input/dialog option to have user feedback as part of the test
(for things like the LCD bug).

Robot Framework is a great tool which can do all these things with a custom
Python library! I think testing embedded systems is generally much harder so
people rarely do it, but I think it is a great tool which can oftentimes
uncover these flaky errors.

------
darekkay
Related stories: "unit tests fail when run in Australia" [1] and "the case of
the 500-mile email" [2]. There is a whole GitHub repository dedicated to some
very interesting debugging stories [3].

[1]
[https://github.com/angular/angular.js/issues/5017](https://github.com/angular/angular.js/issues/5017)

[2]
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)

[3] [https://github.com/danluu/debugging-
stories](https://github.com/danluu/debugging-stories)

------
zubspace
We call them Flip Floppers.

We do a lot of integration testing, more so than unit testing, and those
tests, which randomly fail, are a real headache.

One thing I learned is that setting up tests correctly, independent of each
other, is hard. It is even harder if databases or local and remote services
are involved, or if your software communicates with other software. You need
to start those dependencies and take care of resetting their state, but
there's always something: services sometimes take longer to start, file
handles don't close on time, code or applications keep running when another
test fails... etc, etc...

There are obvious solutions: Mocking everything, removing global state,
writing more robust test setup code... But who has time for this? Fixing
things correctly can take even more time and usually does not guarantee that
some new change in the future won't disregard your correct code...

~~~
pytester
>There are obvious solutions: Mocking everything, removing global state,
writing more robust test setup code... But who has time for this?

I find that doing all of this tends to actually save time overall; it's just
that the up-front investment is high and the payoff is realized over a long
time.

Most software teams seem to prefer higher ongoing costs that come with quick
wins over up-front investment.

~~~
c0vfefe
Those are the age-old arguments against TDD. Every team will have to analyze
the value proposition in their context to see if the return is worth the
investment.

------
lukego
I have learned to love non-deterministic tests.

The world is non-deterministic. A test suite that can represent non-
determinism is much more powerful than one that cannot. To paraphrase
Dijkstra, "Determinism is just a special case of non-determinism, and not a
very interesting one at that."

If a test is non-deterministic then a test framework needs to characterize the
distribution of results for that test. For example "Branch A fails 11% (+/-
2%) of the time and Branch B fails 64% (+/- 2%) of the time." Once you are
able to measure non-determinism then you can also effectively optimize it
away, and you start looking for ways to introduce more of it into your test
suites e.g. to run each test on a random CPU/distro/kernel.
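
A sketch of what that characterization can look like in practice (the numbers
are illustrative): run the test many times, then report the observed failure
rate with a rough normal-approximation interval.

    def failure_rate(runs:, failures:)
      p = failures.to_f / runs
      margin = 1.96 * Math.sqrt(p * (1 - p) / runs)  # ~95% interval
      [p, margin]
    end

    rate, margin = failure_rate(runs: 1000, failures: 110)
    puts format("fails %.1f%% (+/- %.1f%%) of the time",
                rate * 100, margin * 100)
    # => fails 11.0% (+/- 1.9%) of the time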

~~~
muro
But you pay the cost of retrying the failing tests and lack of clear signal.
And if the application code is flaky, users get to experience the breakage
too.

~~~
mrkeen
> And if the application code is flaky

This is the only relevant factor. Forget the rest. Users don't experience your
flaky tests just like they don't experience your messy Jira boards or your bad
office coffee.

~~~
AstralStorm
How do you know which is failing without exhaustive analysis?

See, once you know why the test fails and it's not the tested application,
which is exceedingly rare in practice, you can just disable it or fix it. But
only if you're actually sure, not before.

~~~
muro
In my experience it is usually the test.

~~~
lukego
In my experience that's because tests are usually written in a timid style
that tries _not_ to provoke flaky behaviour from applications.

If your test suite can handle non-determinism then you can approach test-
writing in a completely different - braver - way.

------
throwaway5752
Call it a pet peeve, but if we call it "chaos engineering" it costs a ton and
gets people conference talks when a sporadic system integration issue is
found. But if the same thing happens in plain old CI, half the time it will be
ignored or flagged as flaky.

~~~
dllthomas
IIUC, Chaos Engineering is about moving things out of "eh, it won't happen in
production, I'll ignore it" into "it will happen in production, I'd better
handle it", and making sure mitigation and recovery code is actually exercised
in a realistic setting. "Periodic errors in CI that go unmitigated and produce
test failures" seems very meaningfully distinct from chaos engineering.

~~~
throwaway5752
If I use Spinnaker and Chaos Monkey at prod scale (or even in a prod
experiment) to create a circumstance where I can't perform a write of a
resource because I couldn't achieve quorum in a replica set, and that led to
an inconsistency between two data stores... is that meaningfully different
from observing the same issue but caused by incorrect VPC routing, or
insufficiently resourced test instances leading to slow node startup times and
race conditions in a CI test environment?

I think there is overlap and that it does not have to be a choice between
either approach.

~~~
dllthomas
The meaningful question isn't in observing the issue, but in what happens
after. Chaos engineering is about making sure you still have enough of a
chance of _success_ in the face of failures. For CI, success means an error
report that correctly captures whether the PR in question is breaking anything
(... at least, anything we're testing). If your process means you can be
sloppy about isolation and still get that, then I'd be okay with calling that
an example of "chaos engineering". If being sloppy about isolation means you
have failing tests in many CI runs that have nothing to do with the changes
under consideration, that's not "chaos engineering" \- it's just bad CI.

------
mekane8
As soon as I saw that whole section on database-related flakiness my mind went
from "flaky unit tests" to "tests called unit tests that are actually
integration tests". I worked on a team where we labored under that
misconception for a long, long time. By the time we finally realized that many
of the tests in our suite were integration tests and not unit tests it was too
late to change (due to budget and timeline pressure).

I really like the different approaches to dealing with these flaky tests, that
is a good list.

~~~
mceachen
I think it's important that engineers can distinguish between testing code in
isolation versus "integration" or "system" testing, but I've seen a sophomoric
stigma around integration tests that leads to mocking hell, and a hatred
towards testing in general.

Unit tests are great. You want them. Craft your interfaces to enable them.

Integration and system tests are important too. Again, crafting higher level
interfaces that allow for testing will, in general, lead to a more ergonomic
API.

Analogously: unit tests ensure each of your LEGO blocks is individually well-
formed. Integration tests ensure that the build instructions actually result
in something reasonable.

------
jonthepirate
Hi - I'm Jon, creator of "Flaptastic"
([https://www.flaptastic.com/](https://www.flaptastic.com/)) and passionate
advocate for unit test health.

Having coded at both Lyft and at DoorDash, I noticed both companies had the
exact same unit test health problems and I was forced to manually come up with
ways to make the CI/CD reliable in both settings.

In my experience, most people want a turnkey solution to get them to a
healthier place with their unit testing. "Flaptastic" is a flaky unit tests
recognition engine written in a way that anybody can use it to clean up their
flaky unit tests no matter what CI/CD or test suite you're already using.

Flaptastic is a test suite plugin that works with a SAAS backend that is able
to differentiate between a unit test that failed due to broken application
code _versus_ tests that are failing with no merit and only because the tests
are not written well. Our killer feature is that you get a "kill switch" to
instantly disable any unit test that you know is unhealthy with an option to
unkill it later when you've fixed the problem. The reason this is so powerful
is that when you kill an unhealthy test, you are able to immediately unblock
the whole team.

We're now working on a way to accept the junit.xml file from your test suite.
We can run it through the flap recognition engine, allowing you to decide what
to do next when you know that all of the tests that failed did so due to known
flaky test patterns.

If Flaptastic seems interesting, contact us on our chat widget and we'll let
you use it for free indefinitely (for trial purposes) to decide if this makes
your life easier.

------
andrey_utkin
At Undo we develop a "software flight recorder technology" \- basically, think
of the `rr` reversible debugger, which is our open source competitor.

One particular use case for Undo (besides obviously recording software bugs
per se) is recording the execution of tests. Huge time saver. We do this
ourselves - when a test fails in CI, engineers can download a recording file
of the failing test and investigate it with our reversible debugger.

~~~
roca
Yeah, this is huge. rr also has "chaos mode" to randomize things to make test
failures easier to reproduce. (I understand Undo has something similar.)

I think that's one message that is completely lost in the article and in the
rest of the comments here: it _is_ possible to improve technology so that
flaky tests are more debuggable.

With enough investment (hardware and OS support for low-impact always-on
recording) we could make _every_ flaky test debuggable.

------
bhaak
At our place, we call them "peuteterli" (loosely translated: "could-be-ish",
constructed from the French "peut-être" with the local German diminutive -li
slapped on).

For the ID issue I have a monkey patch for ActiveRecord:

    
    
          if ["test", "cucumber"].include? Rails.env
            class ActiveRecord::Base
              before_create :set_id
    
              def set_id
                self.id ||= SecureRandom.random_number(999_999_999)
              end
            end
          end
    

Unique IDs are also helpful when scanning for specific objects during test
development. When the objects of all the different classes have IDs starting
at 1, it is hard to follow the connections.

------
notacoward
I deal with this issue a lot in my current job, and did in my last job too.
IMX timing issues are by far the most common culprit. Usually it's because a
test has to guess how long a background repair or garbage-collection activity
will take, when in fact that duration can be highly variable. Shorter timeouts
mean tests are unreliable. Longer timeouts mean greater reliability but tests
that sometimes take forever. Speeding up the background processes can create
CPU contention if tests are being run in parallel, making _other_ tests seem
flaky. Various kinds of race conditions in tests are also a problem, but not
one I personally encounter that often. Probably has to do with the type of
software I work on (storage) and the type of developers I consequently work
with.
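
One pattern that softens the short-vs-long timeout trade-off (my phrasing, not
the parent's) is to poll for the condition with a generous deadline, so the
test returns as soon as the background work finishes and only waits the full
deadline in the worst case. A minimal sketch, with made-up names:

    def eventually(timeout: 60, interval: 0.5)
      deadline = Time.now + timeout
      loop do
        return if yield
        raise "condition not met within #{timeout}s" if Time.now > deadline
        sleep interval
      end
    end

    # eventually(timeout: 120) { repair_job.finished? }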

No matter what, developers complain and try to avoid running the tests at all.
I'd love to force their hand by making a successful test run an absolute
requirement for committing code, but the very fact that tests have been slow
and flaky since long before I got here means that would bring development to a
standstill for weeks and I lack the authority (real or moral) for something
that drastic. Failing that, I lean toward re-running tests a few times for
those that are merely flaky (especially because of timing issues), and
quarantine for those that are fully broken. Then there's still a challenge
getting people to fix their broken tests, but life is full of tradeoffs like
that.

------
Slartie
We usually call them "blinker tests" in our integration test suite. Reasons
for blinker tests vary, but most are in line with what others here have
already stated: concurrency, especially correct synchronization of test
execution with stuff happening in asynchronous parts of the (distributed)
system under test, is by far the biggest cause of problematic tests. This one
is often exaggerated by the difference in concurrent execution on developer
machines with maybe 4-6 cores and the CI server with 50-80, which often leads
to "blinking" behavior that never happens locally, but every few builds on the
CI server.

Second biggest is database transaction management and incorrect assumptions
over when database changes become visible to other processes (which are in
some way also concurrency problems, so it basically comes down to that). Third
biggest is unintentional nondeterminism in the software, like people assuming
that a certain collection implementation has deterministic order, but actually
it doesn't, someone was just lucky to get the same order all the time while
testing on the dev machine.

------
jonatron
"Making bad assumptions about DB ordering" That's caught me out before.
Postgres is just weird, I had to run the same test in a loop for an hour
before it'd randomly change the order.

~~~
anarazel
There's several reasons for potential ordering changes:

\- the order of items on the page is different, due to the way tuples have
been inserted (different external scheduling, different postgres internal
scheduling)

\- concurrent sequential scans can coordinate relation scans, which is quite
helpful for relations that are larger than the cache

\- different query plans, e.g. sequential vs index scans

Unless you specify the ORDER BY, there really isn't any guarantee by postgres.
We could make it consistent, but that'd add overhead for everyone.

------
adamb
If anyone is looking for ideas for how to build tooling that fights flaky
tests, I consolidated a number of lessons into a tool I open sourced a while
ago.

[https://github.com/ajbouh/qa](https://github.com/ajbouh/qa)

It will do things like separate out different kinds of test failures (by error
message and stacktrace) and then measure their individual rates of incidence.

You can also ask it to reproduce a specific failure in a tight loop and once
it succeeds it will drop you into a debugger session so you can explore what's
going on.

There are demo videos in the project highlighting these techniques. Here's
one:
[https://asciinema.org/a/dhdetw07drgyz78yr66bm57va](https://asciinema.org/a/dhdetw07drgyz78yr66bm57va)

------
pjc50
The two big problems seem to be concurrency (always a problem) and state,
which immediately suggest that making things as functional as possible would
help a lot.

Ideally all state that's used in a test would be reset to a known value at or
before the start of the test, but this is quite hard for external non-mocked
databases, clocks and so on.

For integration tests, do you run in a controllable "safe" environment and
risk false-passes, or an environment as close as possible to production and
risk intermittent failure?

A variant I've seen is "compiled languages may re-order floating point
calculations between builds resulting in different answers", which is
extremely annoying to deal with especially when you can't just epsilon it
away.

~~~
AstralStorm
Why not both? Test suite too slow? Live test too dangerous or inconsistent?

------
rrnewton
Both this article and this comment thread include a number of different ideas
regarding controlling (or randomizing) environmental factors: test ordering,
system time, etc.

But why do all of this piecemeal? Our philosophy is to create a controlled
test sandbox environment that makes all these aspects (including concurrency)
reproducible:

[https://www.cloudseal.io/blog/2018-04-06-intro-to-fixing-
fla...](https://www.cloudseal.io/blog/2018-04-06-intro-to-fixing-flaky-tests)

The idea is to guarantee that any flake is easy to reproduce. If people have
objections to that approach, we'd love to hear them. Conversely, if you would
be willing to test out our early prototype, get in touch.

------
invertednz
I used to work at a company with over 10,000 tests where we weren't able to
get more than an 80% pass rate due to flaky tests. This article is great and
covers a lot of the options for handling flaky tests. I founded Appsurify to
make it easy for companies to handle flaky tests, with minimal effort.

First, don't delete them - flaky tests are still valuable and can still find
bugs. We also had the challenge where a lot of the 'flakiness' was not the
test's or the application's fault but was caused by 3rd party providers. Even
at Google, "Almost 16% of our tests have some level of flakiness associated
with them!" (John Micco), so just writing tests that aren't flaky isn't always
possible.

Appsurify automatically raises defects when tests fail, and if the failure
reason looks to be 'flakiness' (based on failure type, when the failure
occurred, the change being made, previous known flaky failures) then we raise
the defect as a "flaky" defect. Teams can then have the build fail based only
on new defects and prevent it from failing when there are flaky test results.

We also prioritize the tests so that fewer tests are run, focusing on the ones
more likely to fail due to a real defect, which also reduces the number of
flaky test results.

------
pure-awesome
> A few months back we introduced a game.

> We created a topic on our development Discourse instance. Each time the test
> suite failed due to a flaky test we would assign the topic to the developer
> who originally wrote the test. Once fixed the developer who sorted it out
> would post a quick post mortem.

What's the game here? It just seems like a process. Useful, sure, but not
particularly fun...

------
boothby
I'm the primary developer for a heuristic, nondeterministic algorithm. It's
both production software, and also a neverending research project.
Specifically, I can't guarantee that a particular random seed will always
produce identical results because that hobbles my ability to make future
improvements to the heuristic. I've got reasonable coverage of my base classes
and subroutines, but minor changes to the heuristic can have significant
impact on the "power" of the heuristic.

My solution was to add a calibrated set of benchmarks. For each problem in the
test suite, I measure the probability of failure. From that probability, I can
compute the probability of n repeated failures. Small regressions are ignored,
but large regressions (p < .001) splat on CI. It's fast enough, accurate
enough, and brings peace of mind.
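
A sketch of that arithmetic (illustrative numbers, beyond the stated p < .001
threshold): if a calibrated benchmark fails with probability p per attempt,
then n independent consecutive failures occur with probability p**n, and only
a vanishingly unlikely streak fails the build.

    def regression?(known_failure_rate, consecutive_failures, alpha: 0.001)
      (known_failure_rate ** consecutive_failures) < alpha
    end

    regression?(0.2, 2)  # => false (p = 0.04, plausibly bad luck)
    regression?(0.2, 5)  # => true  (p = 0.00032 < 0.001, flag the build)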

I understand that, and why, engineers hate this. But it's greatly superior to
nothing.

------
tom-jh
We run in-browser end to end tests for our browser extension. There were
several reasons for flakiness:

* Puppeteer (browser automation) bugs or improper use. Certain sequences of events could deadlock it, causing timeouts relatively rarely. The fix was sometimes upgrading Puppeteer, sometimes debugging and working around the issue.

* Vendor API, particularly their oauth screen. When they smell automation, they will want to block the requests on security grounds. We have routed all requests through one IP address and reuse browser cookies to minimize this.

* Vendor API again, this time hitting limits in rare situations. We could have fewer parallel tests, but then you waste more time waiting.

Eventually, we will have to mock up this (fairly complex) API to make
progress. It's gotten to the point where I don't feel like adding more tests
because they may cause further flakiness - not good.

------
mariefred
Flaky tests are indeed a big issue, the main concern being loss of confidence
in the results.

The otherwise good advice for randomization has its drawbacks:

\- it complicates issue reproduction, especially if the test flow itself is
randomized and not just the data

\- the same way it catches more issues, it might as well skip some

Something else that was mentioned but not stressed enough is the importance of
a clean environment as the basis for the test infrastructure.

A cleanup function is nice, but using a virtual environment, Docker or a clean
VM will save you a lot of debugging time finding environmental issues. The
same goes for mocked or simplified elements if they contribute to the
reproducibility of the system - a simpler in-memory database can help recreate
a clean database for each test instead of reverting, for example.

~~~
AstralStorm
Sometimes it's the code that is flaky and not the test.

In the case of concurrent execution there are only a few reasonably working
tricks, like Relacy and other exhaustive ordering checkers, as well as formal
proofs. None of them is cheap to use, so you will always get flaky tests there
- or rather tests that do not always fail.

~~~
mariefred
If the code is flaky then I have earned my pay honestly - this is a problem
that should be solved.

Subtle concurrency issues are indeed very difficult to find, debug and
reproduce, and randomization could help with that simply by covering more of
the space.

------
notacoward
Here's a Google testing blog post about the same thing in 2016.

[https://testing.googleblog.com/2016/05/flaky-tests-at-
google...](https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-
how-we.html)

~~~
zellyn
If any Googlers are reading this and have the knowledge, I’m curious whether
things have improved since that article. The numbers are sobering.

~~~
bhuga
Not a googler, but they posted an update in 2017 with some more information:
[https://testing.googleblog.com/2017/04/where-do-our-flaky-
te...](https://testing.googleblog.com/2017/04/where-do-our-flaky-tests-come-
from.html)

------
rellui
Personally I've always called them flaky tests. I agree with the article that
flaky tests shouldn't be ignored completely. But the issue is they take much
more effort than usual test failures to debug. So it comes down to a balancing
act of how much effort you're willing to spend debugging these vs the chance
that it's an actual issue.

In my few years of automation experience, I've only seen 2 instances where
flaky tests were an actual issue, and one of them should've been found by
performance testing. Almost all of the rest were environment-related issues.
It's tough testing across all of the different platforms without running into
some environment instability.

------
mannykannot
Tests are part of the system too, and if you accept lower standards for your
test suite than you think you hold the product to, you have actually lowered
your standards for the product to those you accept for the tests.

------
ArturT
For annoying flaky feature tests, I use the rspec-retry gem to repeat the test
a few times before marking it as failed. It helped for integration tests with
an external sandbox API.
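
For reference, a typical rspec-retry setup looks roughly like this (check the
gem's README for the options available in your version):

    # spec/spec_helper.rb
    require 'rspec/retry'

    RSpec.configure do |config|
      config.verbose_retry = true                 # log each retry attempt
      config.display_try_failure_messages = true  # show what triggered the retry

      # Limit retries to browser-driven examples instead of everything:
      config.around :each, :js do |example|
        example.run_with_retry retry: 3
      end
    end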

I noticed Discourse had a lot of flaky tests while using their repo to test my
knapsack_pro ruby gem, which runs the test suite with CI parallelisation. A
few articles with CI examples of parallelisation can be found here:
[https://docs.knapsackpro.com](https://docs.knapsackpro.com)

I need to try the latest version of the Discourse code; maybe now it will be
more stable when running tests in parallel.

------
chippy
One recent test that was sometimes failing was one that ordered a list. It was
due to how I made a sequence of my fixtures, using numbers as a suffix to a
string, so it ordered correctly until it hit e.g. "string 8, string 9, string
10".
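
The failure mode in miniature: with a numeric suffix the sort is
lexicographic, so "10" sorts before "8".

    ["string 8", "string 9", "string 10"].sort
    # => ["string 10", "string 8", "string 9"]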

I fixed it for me by creating a random selection from /usr/share/dict/words to
make a large array of sorted words to choose from. This made the fixtures have
better and amusing names such as "string trapezoidal, string understudy"

------
boyter
These sort of tests are perfect examples for me to add to
[https://boyter.org/posts/expert-excuses-for-not-writing-
unit...](https://boyter.org/posts/expert-excuses-for-not-writing-unit-tests/)
Tongue in cheek it is but I’m always on the lookout for additional examples to
flesh it out.

------
pavel_lishin
Flaky tests are one of the factors that led me to leave a previous job. Test
coverage was already so bad (and honestly, so was the code) that it was
difficult to do anything with confidence - add to this that tests only
_sometimes_ worked, and writing code was basically a dice-roll. I got tired of
the stress.

------
piokoch
"Non-deterministic tests have two problems, firstly they are useless, secondly
they are a virulent infection that can completely ruin your entire test
suite."

"To this I would like to add that flaky tests are an incredible cost to
businesses."

I think that the misconception here is that "tests should not fail", because
they are a "cost", "have to be analyzed and fixed", etc.

An integration or functional test that is guaranteed to never fail is kind of
useless for me. A good test with a lot of assertions will fail occasionally
since things are happening - unexpected data is provided, someone manually
played with the database, the ntp service was accidentally stopped so the date
is not accurate and filtering by date might fail, someone plugged in some
additional system that alters/locks data.

In the case of unit tests, well, if everything is mocked and isolated then
yes, such a test probably should never fail, but unit tests are mostly useful
only if there is some complicated logic involved.

~~~
YjSe2GMQ
You clearly have not worked on a codebase with thousands of tests. At my
previous job the build system had an option to run a test N times concurrently
in the cloud. I used this whenever I wanted to commit to some other project
but some of their tests were garbage (to prove that a test was flaky, and
therefore could be ignored). You could even binary search (running 1000 times
each pivot point) to see who introduced the flakiness. Expensive but gets the
job done.

In my projects I either fix the nondeterminism or delete such tests.

~~~
AstralStorm
Pseudorandom deterministic tests have their value, presuming you store the
faulty input and/or seed.

These are not exactly nondeterministic, but sometimes people end up with truly
nondeterministic tests instead of pseudorandom ones.
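
A minimal sketch of "store the seed" (the env var name is made up): keep the
pseudorandomness, but print the seed so a rare failure can be replayed
exactly.

    seed = (ENV["TEST_SEED"] || Random.new_seed).to_i
    warn "random seed: #{seed}"  # ends up in the CI log of the failing run
    srand(seed)                  # seeds Kernel#rand for this process

    # Replay the failing run later:
    #   TEST_SEED=1234567890 bundle exec rake test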

------
rgoulter
_" You won't have code like this obviously contrived example, but you might
have code which is equivalent."_

Ha, yes! The problem sounds super dumb and obvious once you explain it, but
can be a PITA to track down or recognise in the code.

------
revskill
To me, unit tests only make sense for pure code.

For impure code, it makes no sense to write a unit test.

The ability to separate pure from impure code determines your test suites:
what should be covered by a unit test, and what by an integration test.
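
A tiny sketch of that split, with made-up names: the pure calculation gets a
plain unit test with no mocks, while the impure wrapper is left to an
integration test.

    def discounted_total(total, coupon)   # pure: trivial to unit test
      coupon == "SAVE10" ? total * 0.9 : total
    end

    def charge!(order, gateway)           # impure: network call, side effects
      gateway.charge(discounted_total(order.total, order.coupon))
    end

    raise unless discounted_total(100, "SAVE10") == 90.0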

~~~
stagas
It looks like you are coupling your unit tests with your integration tests. At
the integration level, we test whether the integration paths work under
various conditions, that is, only the code that deals with whether our unit
has been called correctly, with the right parameters, etc. At the unit level,
we mock all of the dependencies and test the branches of the effective code
under various conditions. And at the acceptance level we should be testing our
business logic requirements, to make sure all of our features are working the
way they should, especially during refactoring (where integration and unit
tests are subject to change).

------
jdlshore
This is a great article. Grounded in experience, detailed, actionable. Nicely
done.

