This is our 5th watercooler discussion that came from reddit.com/r/softwaretesting and our Software Automation Discord:
https://discord.com/invite/9m4HkejXgs
We have 300+ members across all ranges of expertise (0-30+ YOE) and many of us are happy to give free advice!
It’s really important to dig into every single flaky test failure, at least enough to figure out where the problem lies. This sometimes becomes a team effort once you’ve ruled out the common flaky issues. Depending on your team’s test culture/discipline, your success against flaky tests will vary.
It’s rare, but it has happened enough times to matter: a flaky test turns out to be the result of a rare edge-case bug that later shows up in production.
It’s also really important to dig into the problem as soon as it occurs, because reproducing it later can prove difficult if the root cause ends up being something deep and complex (like an infrastructure-related issue that rears its symptoms once or twice a day in the form of a test failure).
In reality, the most common flakiness is due to poorly written tests. Common examples I see are forgetting to use polling mechanisms, not writing a test to be parallel-friendly, or having tests rely on one another. These problems emerge as you scale your test-suite and introduce more workers/resources.
For example, if your test has a 100% success rate when running by itself but a 20% success rate when running 2 at a time in parallel, then you’re gonna have a bad time when you decide you want to run 15 at a time, or 40.
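To make the polling point concrete, here’s a minimal generic polling helper in Python. It’s a sketch, not a replacement for the explicit waits your framework likely already ships (e.g. Selenium’s WebDriverWait, which you should prefer when available), and the `order_status` call in the usage comment is a hypothetical function:

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    A generic stand-in for the explicit/polling waits most test
    frameworks provide out of the box.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s")

# Instead of asserting immediately (flaky under parallel load):
#     assert order_status(order_id) == "SHIPPED"   # hypothetical helper
# poll for the expected state:
#     wait_until(lambda: order_status(order_id) == "SHIPPED", timeout=30)
```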
The second category, which is less common but easier to describe, is application bugs: poorly written app code. There are times when I’ll have to ask a domain expert to investigate a test failure because I’m not sure what the business logic should be.
Again, these are less common, but they happen often enough that I make them part of my checklist when addressing flaky tests.
Nothing stings more than a flaky test you ignored coming back to bite you in prod lol.
The remaining types of flakiness I put into the “Environment/Infra” bucket. These failures might only happen in certain environments (Dev or CI).
For example, a test may fail because the process ran out of memory, but it only occurs in CI because the CI machine may have different configs and the extra overhead of running automation frameworks.
Maybe you’re not isolating your testing databases and one test is leaking data into another, causing a failure.
Maybe a dev introduced a bad data migration which is now breaking tests for everybody, or just some of them.
Maybe a new dependency that was installed on the machines is now causing an image rendering problem 1 out of 100 times.
These last types of failures are really hard to pin down, so I try to simplify and isolate my testing environment as much as possible and rule out the easiest things first.
Kinda like a hospital, I try to keep things sanitized and not share needles, to prevent weird edge cases I never saw coming, which sometimes lead to me chasing ghosts (problems that appear/disappear on their own).
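As a sketch of that isolation idea, here’s a per-test database schema via a pytest fixture. It assumes a Postgres-style database; `db_connection` is a placeholder fixture you’d define for your own stack:

```python
import uuid

import pytest

@pytest.fixture
def isolated_db(db_connection):  # db_connection: assumed fixture from your stack
    """Give each test its own throwaway schema so tests running in
    parallel can't leak data into one another."""
    schema = f"test_{uuid.uuid4().hex[:8]}"  # unique name per test
    db_connection.execute(f"CREATE SCHEMA {schema}")
    db_connection.execute(f"SET search_path TO {schema}")  # Postgres-specific
    yield db_connection
    # Tear down afterwards so throwaway schemas don't pile up.
    db_connection.execute(f"DROP SCHEMA {schema} CASCADE")
```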
In terms of actually addressing/preventing them, here’s what we’ve done in the past. Others have mentioned these in more detail, so I’ll attempt to keep it short:
1. Treat test code like any other type of code. Have it go through your normal code review process and treat it like first-class code.
2. Stress test any newly introduced test in parallel to sniff out common issues (like needing to add a polling mechanism or make it parallel-safe). See the stress-run sketch after this list.
3. Collect metrics for all your flaky tests and your test-suite in general. This is critical for debugging/addressing flaky tests (see the metrics hook sketch below).
4. Think of the environment. Isolate/sanitize your tests as much as possible to help prevent weird issues in the first place.
5. Good TEAM TEST culture to actually ADDRESS the flaky tests as they come up and not just sweep them under the rug.
6. Error reporting (like Sentry or Datadog) for your tests, to see if any flaky tests share common failures.
7. Have a mechanism to quarantine tests temporarily while you address the problem (please only do this if you are actually going to address it). See the quarantine sketch below.
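For item 2, here’s a minimal stress-run sketch that hammers a single test in parallel, assuming pytest is your runner (plugins like pytest-repeat and pytest-xdist do this more idiomatically); the test path in the usage comment is hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def stress(test_path: str, runs: int = 50, workers: int = 8) -> None:
    """Run one test many times concurrently; any failures hint at flakiness."""
    def run_once(_: int) -> int:
        return subprocess.run(
            ["pytest", test_path, "-q"], capture_output=True
        ).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(run_once, range(runs)))
    failures = sum(1 for code in codes if code != 0)
    print(f"{failures}/{runs} runs failed")

# Example (hypothetical path):
# stress("tests/test_checkout.py::test_apply_discount")
```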
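For item 3, a minimal sketch that records per-test metrics using pytest’s `pytest_runtest_logreport` hook, appending one JSON line per test outcome. The output file name is an assumption; in practice you’d feed this into whatever metrics/dashboard system you already use:

```python
# conftest.py
import json
import time

def pytest_runtest_logreport(report):
    """Append one JSON line per test call so flaky patterns show up over time."""
    if report.when == "call":  # skip setup/teardown phases
        with open("test_metrics.jsonl", "a") as f:  # assumed output location
            f.write(json.dumps({
                "test": report.nodeid,
                "outcome": report.outcome,  # "passed" / "failed"
                "duration": report.duration,
                "timestamp": time.time(),
            }) + "\n")
```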
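And for item 7, one way to quarantine is a custom pytest marker; the marker name and ticket reference below are assumptions, not a standard (register the marker in pytest.ini so pytest doesn’t warn about it):

```python
import pytest

# Tag known-flaky tests with a tracking ticket so quarantine stays temporary.
@pytest.mark.quarantine(reason="TICKET-123: flaky under parallel load")
def test_export_report():
    ...
```

Your main CI job can then deselect quarantined tests with `pytest -m "not quarantine"`, while a separate job keeps running them so you still collect data while the fix is in progress.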
Once you reduce the flakiness, you’ll increase confidence in the test-suite, which will in turn increase team/developer buy-in. Getting that momentum and critical mass can be difficult if you have a lot of flaky tests or a poor test culture. CI/CD also becomes a well-oiled machine and easier to maintain, since you reduce a lot of the “ghosts” in the system. The more work you put in, the easier test-suite maintenance eventually gets.
So this is a compilation of responses that spawned from a Discord discussion and led to further discussion in other communities.
I figured, why not post those results and see if anyone else wants to add to the discussion? I’ll most likely update the doc with top-voted and/or worthwhile answers. Thanks!