Hi HN, we are Spriha and Ankit, and we're building Flakybot, a tool that automatically identifies and suppresses test flakiness so that developers can trust their test results.
Most CI systems leave it up to teams to manually identify and debug test flakiness. Since most CI systems today don’t handle test reruns, teams end up rerunning flaky tests manually. Over time, tribal knowledge builds up about which tests are known to be flaky, but the flakiness itself never gets addressed. Our solution, Flakybot, removes one of the hardest parts of the problem: identifying flaky tests in the first place.
We ingest test artifacts from CI systems and note when builds are healthy, so we can mark them as “known-good builds” to test against for flakiness. This lets us automatically identify flakiness and proactively offer mitigation strategies, both short term and long term. You can read more about this here:
https://ritzy-angelfish-3de.notion.site/FlakyBot-How-it-work...
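In rough terms, the core signal is simple: if the same test both passes and fails on a build we've already marked known-good, the code can't be the culprit. Here's a minimal sketch of that idea (illustrative only, not our actual implementation; it assumes JUnit-style XML artifacts from several runs of the same known-good commit):

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    def outcomes_from_junit(report_path):
        """Yield (test_id, passed) pairs from one JUnit-style XML report."""
        for case in ET.parse(report_path).getroot().iter("testcase"):
            test_id = f'{case.get("classname")}::{case.get("name")}'
            failed = case.find("failure") is not None or case.find("error") is not None
            yield test_id, not failed

    def flaky_candidates(report_paths):
        """Tests that both passed and failed across runs of the same known-good commit."""
        seen = defaultdict(set)
        for path in report_paths:
            for test_id, passed in outcomes_from_junit(path):
                seen[test_id].add(passed)
        return [test_id for test_id, results in seen.items() if len(results) > 1]

In practice there is more to it (environment drift, retries the runner already did, etc.), which is what the doc above goes into.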
We’re in the early stages of development and are opening up Flakybot for private beta to companies that have serious test-flakiness issues. The CI systems we currently support are Jenkins, CircleCI and BuildKite, but if your team uses a different CI and has very serious test-flakiness problems, sign up anyway and we’ll reach out. During the private beta, we’ll work closely with our users to ensure their test flakiness issues are resolved before we open it up more broadly.
How about you fix the flaky tests? Am I insane for thinking that? The whole concept of “just reboot it” or “rerun it again” as a way of “fixing” the problem is at least one reason the modern world sits on a mountain of complete garbage software.
This is how we think about testing for the most part - if a test is 'flaky', it gets looked at very quickly, and if it's not urgent (e.g. the behavior is fine and it's actually a flake), it's skipped in code.
Once the test is skipped, a domain expert can come back and take a look and figure out why it was flaky, and fix it.
If it's urgently broken (e.g. there is real impact), we treat it like an incident and gather people with the right context to fix it quickly.
As long as everyone agrees to these norms, it's not a huge burden to keep this up with thousands of tests. People generally write their tests to be more resilient when they know they're on the hook for them not being flaky, and nobody stays blocked for long when they are permitted to skip a flaky test.
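Concretely, "skipped in code" can be as simple as this (a pytest-flavored sketch; the test name and ticket are made up):

    import pytest

    # Skipped rather than deleted: the reason and ticket keep the debt visible in review.
    @pytest.mark.skip(reason="Flaky: fails ~5% of CI runs, see FLAKE-123")
    def test_checkout_applies_discount():
        ...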
Curious, how often do you see a flaky test in your system? In my past experience at a mid-size startup, we used to get a new flaky test almost weekly in a monorepo. We started flagging them as ignored (we created a separate tag for flaky tests), but later realized the backlog of flaky tests to fix never came down.
In another case I observed, devs just got used to rerunning the entire suite (flakiness there was around 10-20%).
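For anyone curious, that kind of tagging usually looks something like this (a sketch, assuming pytest; the marker name is illustrative and would need registering in pytest.ini):

    import pytest

    # pytest.ini would register the marker, e.g.:
    #   [pytest]
    #   markers = quarantined: known-flaky test, excluded from the gating CI job
    @pytest.mark.quarantined
    def test_inventory_sync():
        ...

    # Gating CI job runs:  pytest -m "not quarantined"
    # A separate nightly job runs the quarantined set so the backlog stays visible.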
Haha, great point. What we've learned from our users is that “fixing” tests typically ends with “delete most of them”. Fixing tests can be a time-consuming effort.
Another way to think about it: are flaky tests worth keeping at all? If a test fails often, does it really add value? We think it does. If you can separate flakiness from real failures and reduce the noise, those tests can still catch real failures.
Wow. That sounds like really poor technical leadership. Fixing flaky tests (as opposed to deleting them) is indeed time consuming, but it is a far cheaper choice than getting to the point where your test suite is untrustworthy.
There may be a point where the cost of ownership for a specific test exceeds its utility, but the way to resolve that is usually to reevaluate your code and supporting tests. Suppressing flaky tests seems a very unwise choice.
Perhaps under extreme circumstances and with unhealthy code bases there may be a case for this, but I struggle to imagine it.
That is a fair argument. Not all organizations have the bandwidth to measure and manage the stability of their builds; some build internal tooling or a dev-productivity team for this purpose. There are usually good intentions behind commenting out a flaky test with the mindset of coming back to it, but in most cases it stays a very low-priority item when there are new features to ship.
Fixing flaky tests can very commonly take longer than writing new tests.
Let me give you the example of a test that hadn't ever failed on a dev machine or on staging or prod, just on the flaky CI infrastructure.
Yes, I'm mostly agreeing with you that the tests should be fixed, but I have seen ones that were perfectly fine (given the constraints) and what should have been fixed was the CI.
Yes, but even in that scenario, it should be consistent, not flaky. If it always works in some environments and always fails in others, that is not ideal, but at least can be accepted. But if it sometimes works and sometimes fails, it should be investigated.
What I want is a tool to make flaky tests fail reliably.
They won't be fixed until they start actually preventing commits. If somebody deletes a test, that is on that person. I don't want a tool automatically suppressing testing.
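Something as blunt as stress-running the suspect test in CI gets part of the way there (a rough sketch, assuming pytest; the test path is made up):

    import subprocess
    import sys

    # Run one suspect test repeatedly; exit non-zero on the first failure so the commit gate blocks.
    TEST = "tests/test_orders.py::test_retry_logic"  # illustrative path
    for i in range(100):
        if subprocess.run(["pytest", "-q", TEST]).returncode != 0:
            print(f"flaked on iteration {i + 1}")
            sys.exit(1)
    print("survived 100 runs")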
What if it was easy to see why a test is flaky, compare failed/successful test runs like a code diff? Would that be useful?
This is what we're building at Thundra (the Foresight product): we instrument the tests as well as the backend services so devs can quickly diagnose failing/flaky tests. Would appreciate any feedback you may have, here or privately.
It would be helpful to be able to present diffs of log output between successful and failing runs of a test.
This is tricky to implement, for several reasons.
Log output is normally timestamped, making every line unique. Those parts of log lines would need to be ignored when comparing between runs.
Log output ordering is often indeterminate, particularly when a test has multiple threads, or interacts with an external service. Often the order of events logged is an essential feature of the difference between a successful and failed run. But some or most order differences are just incidental. The number of logged events may vary incidentally, or significantly. Explaining all these differences in detail to the test system would be too hard. So, the system needs to discover as much as possible of this for itself, and represent these discoveries symbolically. Then, allow a test to be annotated to override default judgments about the diagnostic significance of these features.
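The timestamp part at least is mechanical; something like the following could mask them before diffing (a rough sketch; the pattern would need to match the actual log format):

    import difflib
    import re

    # Mask ISO-8601-ish timestamps so identical events compare equal across runs.
    TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")

    def normalize(text):
        return [TIMESTAMP.sub("<ts>", line) for line in text.splitlines()]

    def log_diff(passed_log, failed_log):
        """Unified diff of a passing and a failing run with timestamps masked."""
        return "\n".join(difflib.unified_diff(
            normalize(passed_log), normalize(failed_log),
            fromfile="passed", tofile="failed", lineterm=""))

The ordering and cardinality differences are the genuinely hard part.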
That's very interesting feedback. We certainly don't have a way to forcibly simulate a failure.
A related capability we are working on is rerunning the identified flaky tests up to X times so they pass. This depends on the capabilities of the test runner, so it will work with specific runners first (Cypress, pytest, etc.). That way flaky tests still get a chance to pass instead of just being suppressed.
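For pytest, for example, this would likely sit on top of something like the pytest-rerunfailures plugin (a sketch of the general shape, not a final integration; the test name is made up):

    import pytest

    # Requires the pytest-rerunfailures plugin (pip install pytest-rerunfailures).
    # The marked test is retried up to 3 times, pausing 2s between attempts,
    # before being reported as a real failure.
    @pytest.mark.flaky(reruns=3, reruns_delay=2)
    def test_third_party_webhook():
        ...

Cypress has a similar built-in retries option; the point is to scope retries to tests already identified as flaky rather than blanket-retrying everything.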
We've been relying on manual testing so far and are just starting to think about unit and integration tests. We don't know where to start. It would be cool if you could provide guidance on setting up good testing practices in the first place, so we avoid flaky tests altogether.
Yeah, flaky tests generally creep in as a service gets big, for several reasons. There are some best practices for avoiding them, though they require discipline and good oversight! We wrote some stuff about this:
https://www.flakybot.com/blog/five-causes-for-flaky-tests
This is by no means an exhaustive list, but our goal with FlakyBot is to get better at identifying root causes as we identify flakiness across the systems.
Nice. I'm in favor of building a rerun bot and have been trying to advocate for it at work, but not everyone agrees. Nice to pull this out as a service. Well done.