Measuring the Cost of Regression Testing in Practice (acm.org)
44 points by luu 5 months ago | 9 comments

I think there is a wide spread of overall software development processes that can lead to good product quality. Some teams automate testing to a greater degree. Some rely on consistent manual testing. Others seem to hire and raise fastidious developers who produce fewer defects to begin with. Each strategy takes a while to build into the culture, and during that build-up the cost seems heavy. So I think it's hard to judge the benefits of test automation separately from the software development organization and its product.

The nasty part is assigning "monetary costs" to the bugs CI discovered. Without that you can't compare it to the cost of writing, maintaining, and running the tests.

Why do people ever write flaky tests? What kind of information can one hope to get from them?

I doubt many people intentionally write flaky tests.

Among developers I've worked with, I've observed a lack of understanding of the language/libraries used (i.e. the order of something is coincidental, not guaranteed, leading to flakiness), a lack of understanding of the test system (e.g. a shared DB for multiple test runners, where querying the last insert is not deterministic), and accidental mistakes (tests time out after 1000ms, but there is enough I/O in the test that it can vary from 700ms to, rarely, 1200ms).
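The "coincidental, not guaranteed, ordering" case can be sketched in a few lines. In Python, iteration order over a set of strings depends on per-process hash randomization (PYTHONHASHSEED), so a test that bakes in one observed order will pass on some runs and fail on others (the function names here are made up for illustration):

```python
tags = {"alpha", "beta", "gamma"}

def render_tags(tags):
    # BUG: relies on set iteration order, which is coincidental
    # and varies between Python processes.
    return ",".join(tags)

# Flaky assertion -- passes or fails depending on the hash seed:
# assert render_tags(tags) == "alpha,beta,gamma"

def render_tags_fixed(tags):
    # Deterministic fix: make the ordering explicit.
    return ",".join(sorted(tags))

assert render_tags_fixed(tags) == "alpha,beta,gamma"
```

The fix is not "run it until it passes" but making the implicit ordering assumption explicit.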

Recently I wrote a test that checked the deterministic properties of a build. I forgot that the two test builds could run in parallel and thus end up with the same timestamp (to the ms), and thus the same artifact. The failure never occurred on my dev laptop, where the build was single-threaded, but it occurred maybe one in a hundred times on my desktop. Easy to write, simple to fix: the fix was of course to mock the timestamp properly.
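A minimal sketch of that failure mode and its fix, with hypothetical names standing in for the real build: if the artifact's identity includes the current time, two parallel builds can collide to the millisecond, and injecting the clock makes the test deterministic.

```python
import time

def build_artifact(source: str, now=time.time):
    # Artifact name includes a millisecond timestamp -- two builds
    # running in the same millisecond collide.
    stamp = int(now() * 1000)
    return f"{source}-{stamp}.tar"

# Flaky: in parallel, these can rarely be equal.
# a = build_artifact("app"); b = build_artifact("app")

# Fix: mock the clock so each build sees a distinct, known time.
a = build_artifact("app", now=lambda: 1.000)
b = build_artifact("app", now=lambda: 2.000)
assert a != b
assert a == "app-1000.tar"
```

Injecting `now` as a parameter is one way to do it; patching the clock with a mocking library achieves the same thing.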

As with bugs in general, nobody sets out to write flaky tests.

Sometimes organizational imperatives like code coverage goals can result in less-than-stellar tests. Sometimes a developer doesn't understand the system well enough, and sometimes (often, in my experience) something outside the logical scope of a test changes, e.g., an underlying implicit dependency.

Even a test that is 100% reliable today may become unreliable tomorrow.

Stable and meaningful tests are very hard to write. Probably harder than writing the software that gets tested. It's even harder to maintain them and keep them meaningful. Tests that made sense a while ago may become flaky or obsolete when the system changes.

Statistics is simply working against you: system tests are a long chain of (let's say) independent events, each with (let's hope) a small chance of failure. But your tests multiply those probabilities, creating a much bigger probability of failure.

Now you run multiple tests, each with a more complex environment than your real system (you need to control your tests, collect test logs, etc.), usually on a lesser environment than your production one. Add all this up and you get an uncomfortable probability of failure.

> But your tests multiply those probabilities, creating a much bigger probability of failure.

Minor nitpick: when probabilities multiply, they get smaller. Such is the nature of numbers in the range [0, 1].

Multiplication would happen when counting the probability of two (independent) events coinciding. What you're thinking about here is the probability of any one of several events occurring. That will be a (rather convoluted) sum, not a product.

We engineers can skip the convoluted sum: assuming the events are independent, you multiply the probabilities of success (each being 1 − probability of failure) and subtract the product from one.
