Hacker News
Coverage Is Not Strongly Correlated with Test Suite Effectiveness (2014) (linozemtseva.com)
74 points by sawwit on Nov 30, 2015 | 52 comments

Note that the result is a bit more nuanced if you read the paper, i.e.

> Our results suggest that, for large Java programs, the correlation between coverage and effectiveness drops when suite size is controlled for.


> While coverage measures are useful for identifying under-tested parts of a program, and low coverage may indicate that a test suite is inadequate, high coverage does not indicate that a test suite is effective.

They also propose an alternative as a "quality goal" for test suites:

> Of course, developers still want to measure the quality of their test suites, meaning they need a metric that does correlate with fault detection ability. While this is still an open problem, we currently feel that mutation score may be a good substitute for coverage in this context.

I wasn't familiar with mutation testing. This is also from the article:

>We used the open source tool PIT [35] to generate faulty versions of our programs. To describe PIT’s operation, we must first give a brief description of mutation testing.

>A mutant is a new version of a program that is created by making a small syntactic change to the original program. For example, a mutant could be created by modifying a constant, negating a branch condition, or removing a method call. The resulting mutant may produce the same output as the original program, in which case it is called an equivalent mutant. For example, if the equality test in the code snippet in Figure 1 were changed to if (index >= 10), the new program would be an equivalent mutant.
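The mechanics are simple enough to sketch in a few lines. Here is a toy illustration in Python (the paper's tool is PIT, which generates mutants automatically for Java; the function and the two suites below are invented for the example):

```python
# Toy illustration of mutation testing. A mutant is the original function
# with one small syntactic change; a suite "kills" a mutant when at least
# one of its tests fails against it.

def original(index):
    return index < 10

def mutant(index):
    return index <= 10   # "<" mutated to "<=": a non-equivalent mutant

def weak_suite(fn):
    # Only checks values far from the boundary.
    return fn(0) is True and fn(50) is False

def strong_suite(fn):
    # Probes the boundary itself.
    return fn(9) is True and fn(10) is False

assert weak_suite(original) and weak_suite(mutant)    # mutant survives
assert strong_suite(original)
assert not strong_suite(mutant)                       # mutant killed
```

The mutation score is the fraction of non-equivalent mutants a suite kills. The surviving boundary mutant above is exactly the kind of weakness line coverage cannot see: both suites execute every line of the function.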

Sounds like a resource-intensive way to test -- # tests * loc * mutation options. However, if you need the fault tolerance, it could be worth it. You could probably spot-check to get similar quality with fewer resources. Very interesting.

Computing the mutation score using a randomly selected subset of the mutants seems to work well, though it hasn't been studied a lot. One paper is "Reducing the cost of mutation testing: An empirical study" by Wong and Mathur.
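The sampling idea is straightforward to sketch (toy model: "mutants" are just integers and the hypothetical suite kills the even ones):

```python
import random

def mutation_score(kills, mutants):
    # Fraction of mutants the suite kills.
    return sum(kills(m) for m in mutants) / len(mutants)

def sampled_score(kills, mutants, k, seed=0):
    # Estimate the score from a random sample of k mutants.
    sample = random.Random(seed).sample(mutants, k)
    return mutation_score(kills, sample)

mutants = list(range(1000))
kills = lambda m: m % 2 == 0     # stand-in for "some test fails on m"

full = mutation_score(kills, mutants)         # 0.5 exactly
estimate = sampled_score(kills, mutants, 100)
assert abs(full - estimate) < 0.2             # sample tracks the full score
```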

pitest has a few tricks up its sleeve to keep overhead reasonable. For example (from http://pitest.org/faq/):

> Per test case line coverage information is first gathered and all tests that do not exercise the mutated line of code are discarded.

I've used pitest in real life and, while it is quite a bit heavier than straightforward testing, it's definitely usable. I'm now using it on all my Java projects; I wish there were something comparable for JavaScript!

I don't think it is quite as complete, but this looks usable: https://github.com/jimivdw/grunt-mutation-testing. That's just from a Google search, though; I have no experience with it.

Paper author here. We did a follow up study and found that the mutation score is, in fact, correlated with the suite's ability to detect real faults, at least for the programs we studied. That paper is also on my website: "Are Mutants a Valid Substitute for Real Faults in Software Testing?"

>> for large Java programs

I wonder if the results would be different for a less strongly typed language, like Python for example.

I am not surprised by the findings of the study - not at all.

In discussions about software metrics, I'm always trying to make the point that you can only use most metrics from within a team to gain insights, not as an external measure about how good/bad the code (or the team) is. In other words, if a team thinks they have a problem, they can use metrics to gain insights and explore possible solutions. But as soon as someone says "[foo metric] needs to be at least [value]", you have already lost - the cheating and gaming begins. Even if the agreement on [value] comes from within the team.

Back to the topic :) I am not surprised by the findings - higher test coverage does not mean that everything is fine, but very low test coverage indicates that there might be hidden problems here or there. This is how I like to use test coverage and how I try to teach it.

But it is great that we now have empirical data here: from now on, I can point others to this study when we are discussing whether the build server should reject commits with less than [value] coverage.

A brief skim of the study shows that the headline is a bit misleading. The correlation between coverage and bug yield is weak when suite size is controlled for.

Another way of looking at this is that large suites improve bug yield. They tend to have high coverage as a by-product, because a large test suite will hammer each unit from various directions.

My suspicion is that scoring coverage for every pass over an LOC or a pathway would change the result substantially.

Nothing in this paper is an argument that test coverage is useless; rather, it should be seen as a secondary metric that rapidly loses indicative value as it approaches 100%.

Number of test cases seems to be the independent variable here.

Paper author here. Scoring coverage for every pass is an interesting idea and something I'd like to look into. I'm not sure it would change the result as much as you think, though. The basic finding of the paper is that coverage is a complicated way of measuring the size of the suite. Counting the number of times each line is hit will have the same problem, I think: writing more tests increases that score but also increases the number of bugs found, causing a spurious correlation.

My hunch is that the quality of the oracle matters more than the coverage score. You can write tests that cover all the code without actually checking anything; the tests will only catch bugs if you're carefully comparing expected and actual results. Maybe a simple metric like "number of asserts" would be useful -- except, of course, that will also be correlated with the size of the suite... It's a tough problem.
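A trivial sketch of the oracle problem (hypothetical code): both tests below give mean() 100% line coverage, but only one of them can ever fail.

```python
def mean(xs):
    return sum(xs) / len(xs)

def test_mean_without_oracle():
    mean([1, 2, 3])               # covers the code, checks nothing

def test_mean_with_oracle():
    assert mean([1, 2, 3]) == 2   # compares expected vs. actual results

test_mean_without_oracle()
test_mean_with_oracle()
```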

The point about the title is fair. I erred on the side of clickbait when I wrote the paper and regret it a bit. On the other hand, it worked. :)

I see your point about coverage volume being a better-fitted proxy for suite size. It'd also be a proxy for path coverage. Still, it'd be fun to count the horse's teeth anyhow.

Would you say that coverage's worth as a negative metric still seems meaningful, at least as a heuristic? I imagine that's covered in other literature.

Where I work we don't really fuss too much about coverage. We TDD, so in practice our coverage hovers around the high 90s as a matter of course. When I am writing a test I often manually mutate the code and test once it goes green as a quick validation that the test does what I think it does.

One last question -- did you classify tests? Feature, integration and unit tests should show quite different curves. Especially heavily mockist style unit tests.

I'd definitely say that the absence of coverage is a problem. My view is that coverage is necessary but not sufficient for good testing.

We talked a bit about classifying tests but didn't do it in the end because it's surprisingly hard to do. I do know of one paper that looked at different kinds of tests, called "The Effect of Code Coverage on Fault Detection under Different Testing Profiles": http://goo.gl/nnxgwE. The authors found differences between tests for error cases vs. tests for normal operation and between functional tests vs. random tests. IIRC, they had undergrads do a term project that had to pass 1200 tests before the final submission, and the professors themselves wrote the tests, so categorization was a bit easier.

As a rough way to automatically classify tests, you can look for tools like capybara, selenium or htmlunit for feature and mocking libraries for unit.

Mind you, there are as many taxonomies for tests as there are tests. To be honest, I expect one way to classify them is by working backwards from coverage -- feature tests should have low volume but wide distribution across a codebase (perhaps that's another metric -- density?). Unit tests would be narrow but deep on a particular module.

Those are good ideas, thanks!

Throw me on as the tenth or eleventh coauthor and I'll buy beer to sweeten the deal.

If you need a corpus of code from a highly doctrinaire TDD shop, or if you think we can help, let me know: jchester@pivotal.io.

In other words, writing another test for the hot section of your code seems to improve quality on average about as much as writing a test for some rarely run code that you didn't bother to test until now.

If anything, I was surprised that coverage is that important. But as the article (and you) say, it's only important at lower values, so not that surprising.

> In other words, writing another test for the hot section of your code seems to improve quality on average about as much as writing a test for some rarely run code that you didn't bother to test until now.

I'm not sure.

I've given it some more thought. I think the problem with distinguishing between the two is that "coverage" is defined, so to speak, as an area.

Once an LOC is executed by any test, its value ticks over once. It never gets counted again.

If instead each LOC execution was summed up independently, each LOC gets a "height" -- the number of times it was exercised in a suite. Then the overall suite gets test coverage volume, rather than test coverage area.

Once you track per-LOC volumes, you could start trying to tease out the differences between introducing a test at all and adding another test to critical code.

And, I suspect, the correlation between coverage volume and bug yield would be much more robust than area when controlled for number of cases. Because area when controlling for number of cases is ... a rough proxy for test coverage volume.

Edit: though with an interesting difference. Coverage area converges to 100%, but coverage volume is effectively unbounded. This would hopefully upset the abuse of coverage as a management metric instead of as a code smell.
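With made-up per-line hit counts, the two metrics look like this:

```python
# line number -> times the suite executed it (hypothetical data)
hits = {1: 12, 2: 12, 3: 0, 4: 3, 5: 0}

# "Area": fraction of lines touched at least once; converges to 1.0.
area = sum(1 for h in hits.values() if h > 0) / len(hits)

# "Volume": total executions; effectively unbounded, since it grows
# with every additional test.
volume = sum(hits.values())

assert area == 0.6
assert volume == 27
```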

My thoughts go that way too, but I expected some volumes to be more important than others. And because of it, I expected the article to measure a negative correlation. Hence my surprise.

But reading the paper, the study is about random samples of the complete test suite, and the results are far too weak to draw any conclusion about it, finding negative correlations sometimes and weak positive ones at other times.

Anyway, they got a strong correlation, with a good p-value (not worth much in this context, but it's all we get), for your hypothesis about coverage volume. It's just my hypothesis about some code being more important than other code that is inconclusive.

> It's just my hypothesis about some code being more important than other code that is inconclusive.

There's a lot of literature going back a long way showing that defects tend to cluster. A leading indicator might be defects found in a module divided by coverage volume.

On the other hand, defects tend to cluster in the hard stuff, not at random.

I would imagine coverage is a more important metric for dynamic languages than static ones. Just running through code paths to ensure the code doesn't crash due to misspellings etc. is not really something a compiled language needs, as the compiler pretty much handles that.

A linter catches most of those in dynamic languages; many already do cross-file type inference.

Another case of conflating a phenomenon and its measure.

Rather than write good tests, people have gotten side-tracked by chasing a magic number that may or may not reflect the phenomenon of interest.

Heuristics are sometimes wrong, but always fast and not fundamentally bad. Writing unit tests for high-risk code is a good heuristic, god dammit. Chasing after 100% test-coverage is pedantic, and -- I honestly think -- evidence of a development team that favors form over function.

I'll continue writing unit tests that actually test critical boundary conditions, and I will continue not to care if only 10% of my code is covered. If that 10% represents 80% of my bugs, I've won.

The usual term for this is Goodhart's Law. From https://en.wikipedia.org/wiki/Goodhart%27s_law

> Goodhart's law is named after the economist who originated it, Charles Goodhart. Its most popular formulation is: "When a measure becomes a target, it ceases to be a good measure."

What about the other 20% of your bugs? They don't matter?


Preemptively testing a subset of my code doesn't prevent me from fixing other bugs, or from writing regression tests.

From the tone of your question, you'd think that 100% test coverage would fix the remaining 20%, when it clearly doesn't. I mean no disrespect, but this is what I meant when I said that unit-test coverage is a pedantic subject (which, I must again insist, is not meant to imply that you are a pedant).

The bugs get fixed, they just don't get discovered/resolved with up-front unit tests alone. It ends up being more effective and more efficient to hunt them down by other means.

No kidding. This has been known for a long time but it's good to see empirical evidence. The empirical evidence of 70's-90's showed the best ways to reduce defects were the following:

1. Unambiguous specifications of what the system will do.

2. Implementation in safe languages or subsets with plenty of interface checks.

3. Human review of design, implementation, and configuration looking for errors. Best results were here.

4. Usage-based testing to eliminate most errors user will experience in practice.

5. Thorough functional and unit testing.

(Later on...)

6. Fuzz testing.

7. Static analysis.

Practices 1-4 were used in Cleanroom, some major commercial projects, and high-assurance security work. Every time, they prevented serious defects from entering the system that might otherwise have slipped by testing. So, that is where the majority of our verification work should go. Supporting evidence comes from teams that do a subset of that, from the OpenBSD team to Microsoft's SDL, along with the substantial defect reduction that followed.

Note: 6 and 7 showed their value later on. No. 7 should definitely be done where possible, with no. 6 done to a degree. I haven't researched no. 6 enough to give advice on the most efficient approach, though.

So, internal testing and coverage are the weakest forms of verification. They have many benefits for sure, but people put way too much effort and expectation into them when it pays off more elsewhere. Do enough testing for its key benefits, then cut it off from there. And don't even do most of that until you've done code reviews, etc., which pay off more.

I'm not going to dispute that #7 earns a place lower down the chain than the practices you mentioned in terms of identifying bugs. That said, it's only really feasible when adopted early in a project.

If a project currently produces clean output, and I make a commit that causes clang's analyser to complain, it's easy enough to fix, whether it's a "legitimate" error or not.

I've tried throwing static analyzers at long-established projects like OpenSSL and... there's just no point. No one is going to review a few thousand errors and make hundreds of commits that don't actually fix anything just to get perfectly clean static analysis output. It's more likely I would break something than help.

It's really not that hard to implement a "no commit unless it passes static analysis" rule from the start, and even if the bugs removed are low, I usually argue they are non-zero.

I did it in chronological order of discovery and significant application. I agree it should be used as early as possible. Plenty of benefit to that. Further, the Orange Book era systems discovered a heuristic that supports your theory about old codebases: things must be designed for such verification or security from Day 1. Usually impossible to retrofit such properties into an existing codebase whose methods were quite different.

I find it strange that so much time is spent trying to convince people that some aspect of testing (coverage / TDD) is not a panacea, and so little on improving how to test.

As a student / recent grad I remember thinking that testing was maybe something you had to do for super hardcore projects. Now I see it as one of the first things to think about on a serious project, and something that is going to take ~30-50% of the cost / effort.

IMO testing's biggest strength is in allowing you to delegate remembering all the use cases of an API to some code, instead of having to look them up in a checklist, or worse, having to remember them all.

Even if the tests are covering mundane, easy functions and execution paths, the benefit of not worrying if you broke something far outweighs the time to write tests.

Isn't "remembering use cases" something that should be done with comments or documentation?

Comments and documentation become stale because they require deliberate human intervention to maintain their truthfulness. They also rely on the author of the documentation or comments correctly mentally executing the code being marked up.

Tests are compared with the API every time they're run, as seen by the computer's model of execution, giving immediate feedback on detected discrepancies.

(I believe that Pythonistas have clever docs that can encode snippets of testing into documentation, which splits the difference a bit, but is no substitute for a robust test suite).
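The Python feature alluded to is presumably the doctest module, which executes the examples embedded in a docstring and checks them against the shown output (the clamp function is invented for illustration):

```python
def clamp(x, lo, hi):
    """Clamp x into the closed range [lo, hi].

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(99, 0, 10)
    10
    """
    return max(lo, min(hi, x))

import doctest
assert doctest.testmod().failed == 0   # the examples above are run as tests
```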

From experience, it's amazingly easy to break something unintentionally when you're making changes, and even with a relatively small codebase it quickly becomes difficult to keep everything in your head.

I'm not an advocate of TDD though, and I agree that more effort should be put into how to test.

The weird thing is that the more you test, the larger the fraction of your time testing takes, because testing avoids the blowups of investigations, bugfixes, and workarounds.

Many companies have introduced counterproductive test coverage policies which are harmful to good testing practice. And any improvement has to start with measures - it's hard to make the case for some improved testing practice to someone who thinks that coverage alone is a good guarantee of test quality.

This somewhat makes my point: it is the type of comment that shows up every time somebody mentions testing, but nobody has ever thought that coverage alone is a good guarantee of test quality. It is, however, very obviously a useful metric.

> nobody has ever thought that coverage alone is a good guarantee of test quality

Wrong. I could name three managers who have, from personal experience.

> it is however very obviously a useful metric.

Disagree. I think it's an actively counterproductive metric, and one is better off not measuring it at all.

Coverage is pretty awesome feedback for reminding you that "oh yeah, I should probably test that case too." And if you start out with this kind of feedback early it can help influence your design to increase its testability.

If X% coverage is a goal measured by non-technical team members, it likely loses much of its value.

I've been getting into shooting for 100% coverage in critical [1] code lately, and while that may not be to everyone's taste, what I've noticed is that I always learn things from at least looking at the coverage. "Oh, I thought I had that covered... that's interesting, I thought I had that error simulated but it didn't happen after all... oh, look, there's no possible way to ever enter those 125 lines of nasty-looking code plink."

Might I also add that I discover errors in the error-handling surprisingly often. Coverage testing goes a long way towards reminding you to check those too. What your code does in error cases is part of the spec.

Shooting for 100% may not always be the best use of time, but if you are writing unit tests you should at least look at the line-by-line coverage results. You've already done the hard work of writing the test suite, be sure to reap all the benefits you can.

[1]: "critical" is, ahem, a critical word here. I don't shoot for 100% on everything, but every even halfway sanely-designed system has a "core". Total coverage on the core has the dual benefits of assuring the core really does work as intended, and making it possible to refactor the core. It is a very common antipattern for the core to metastasize because without a test suite, all ability to modify it, even to fix bugs, is lost.

Coverage does not guarantee effectiveness. Lack of coverage guarantees the behavior of uncovered code is not checked.

Use the tool for what it is good at. It's not a substitute for code review or writing convincing tests. Granted even convincing tests are going to miss important stuff, but they will miss less and after the initial bout of bugs will become quite good (because you write regression tests for EVERY bug right?).

I have a few objections to the evidence presented.

Keeping in mind that I don't disagree with the statement. Test coverage is an objective metric, and test effectiveness is a... what is it again? How many bugs you'll find with it? Obviously the two are separate concepts.

This is my first objection: the paper seems to say that mutation testing is test effectiveness, but mutation testing is merely another metric. They cite other papers, but those papers only attempt to demonstrate that this metric is correlated with test "effectiveness".

Metric against metric, is that meaningful?

She presents graphs of results that seem to demonstrate linear correlation between test suite size, test suite coverage, and mutation testing (called "effectiveness"). This is addressed later: "isn't this what we would expect?" Yeah! And the explanation for why it's unexpected sailed fully over my head. (I admit it! I'm dumb.)

Finally, many test suites are written with the goal of achieving code coverage. Is this study valid without also having test suites that were made without that goal in mind? Could such a suite even exist?

So, a summary of my objections:

* is mutation testing a meaningful measure of effectiveness?

* can you measure one metric against another, get a linear relationship, and conclude any meaningful differences?

* does the presence of code coverage as a target spoil the conclusion?

I'd love to hear input on this.

Paper author here. We measured "effectiveness" as the mutation score. We showed in a separate paper that the mutation score is a good way to measure a test suite's ability to detect real faults. Of course, using the mutation score instead of the real fault detection rate does add a layer of indirection, but it's a lot more practical for a large empirical study (i.e., can be automated), so there's a bit of a tradeoff there. (The other paper is also on my website if you want to read it: http://www.linozemtseva.com/research/2014/fse/mutant_validit...)

What the graphs show is that effectiveness rises with suite size, which is expected: more tests catch more bugs. We also see that effectiveness rises with coverage, which seems intuitive: you can't catch bugs in code you never run. But when you graph effectiveness against coverage for suites that are all the same size, the correlation drops significantly, in some cases to 0. Here's an analogy: the number of PhD students that graduate in the US is highly correlated with the amount of profit generated by arcades in the US, but there's no causal relationship between them. They probably both depend on the size of the population. Similarly, coverage and effectiveness are correlated because they both depend on the size of the suite, but we can't say that there's a causal relationship. In other words, saying that a suite will catch a lot of bugs because it has high coverage is like saying that a lot of PhD students will graduate because arcades turned a good profit this year.
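The confounding argument can be reproduced with a toy simulation (entirely synthetic data; the linear size-to-coverage and size-to-effectiveness relationships are assumptions made purely for illustration):

```python
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(42)
sizes = [rng.randint(1, 100) for _ in range(10000)]
coverage = [s + rng.gauss(0, 10) for s in sizes]       # driven by suite size
effectiveness = [s + rng.gauss(0, 10) for s in sizes]  # also driven by size

overall = pearson(coverage, effectiveness)             # strong, close to 0.9

# Hold suite size constant (keep only suites of size 50): what is left of
# the coverage/effectiveness relationship is pure noise.
fixed = [(c, e) for s, c, e in zip(sizes, coverage, effectiveness) if s == 50]
controlled = pearson([c for c, _ in fixed], [e for _, e in fixed])

assert overall > 0.8
assert abs(controlled) < 0.4
```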

Your third point is a really good question. It's true that developers don't write tests randomly, so our method of making new suites by picking random test cases isn't quite realistic. What impact that would have on the results, I'm not sure. That's something I'd like to look into in the future.

Thank you very much for the response! This paper is pretty interesting for sure.

It's just that it doesn't have what I would consider good evidence that coverage isn't correlated with effectiveness. (Or... I don't understand the section that explains it <:( )

What I would consider good evidence would be to demonstrate that different test suites of the same size and different coverage achieve the same effectiveness.

* test suite A has a size of 300 sloc, and a coverage of 5%, effectiveness of ~8%

* test suite B has a size of 300 sloc, and a coverage of 12%, effectiveness of ~8%

That to me would be evidence of the stated conclusion, but I don't see where this is demonstrated (or where this is demonstrated, I don't understand). I do see where it is stated! Though... now that I have read through the paper a bit more thoroughly.

On the subject of normalized effectiveness.

> we are comparing suite A, with 50% coverage, to suite B, with 60% coverage. Suite B will almost certainly have a higher raw effectiveness measurement, since it covers more code and will therefore almost certainly kill more mutants. However, if suite A kills 80% of the mutants that it covers, while suite B kills only 70% of the mutants that it covers, suite A is in some sense a better suite.
I don't believe a majority of people would agree with this. To me, this says coverage is positively correlated with effectiveness, and that suite B is doing more with less. Maybe that's a philosophical standpoint? By normalizing, have you eliminated the premise?

Anyway, thank you for your time and work!

What you're looking for is in Figure 3 (admittedly a bit hard to read because I had to squish it into the paper). Each panel in that figure shows the results for test suites of a fixed size. For example, the top left panel shows the results for suites with three test cases for Apache POI. If you draw a horizontal line through the graph, all of the test suites that fall on that line have the same effectiveness score even though they have different coverage levels. I gave a talk about this paper at GTAC this year and Google posts all the videos on YouTube, so if you have time and you're still curious the explanation in the talk might help (and the graphs are much easier to read). https://www.youtube.com/watch?v=sAfROROGujU

The normalization was a point of contention with the peer reviewers as well, so in the end I tried both the normalized and unnormalized metrics and found similar results with both. The other tables and figures are available on my site if you want to look at them.

I'm not sure I understand what you mean when you say suite B is doing more with less, though. In the example, I was trying to say that suite B covers more code, so it will kill more mutants. Maybe suite A kills 20 mutants and suite B kills 25 mutants, just to have some numbers to talk about. But if B covers 50 mutants, and is only killing 25 of them, while A covers 25 mutants and kills 20 of the 25, it seems like suite A is doing a better job of testing the code it covers. Or to put it another way, suite B is broad but shallow while suite A is focused but deep. B isn't necessarily a bad suite, but I wouldn't say it's doing more with less, just that it has a different focus. Maybe I'm misunderstanding your point, though.

Another way of thinking about it is that the raw mutation score measures breadth: B is better than A because 25 > 20. The normalized score measures depth: A is better than B because 80% (20/25) > 50% (25/50).
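With the numbers above, the two scores come apart like this (the total of 100 mutants is hypothetical):

```python
TOTAL = 100   # mutants in the whole program (hypothetical)

def raw_score(killed):
    return killed / TOTAL          # breadth: kills over ALL mutants

def normalized_score(killed, covered):
    return killed / covered        # depth: kills over covered mutants only

# Suite A: focused but deep. Suite B: broad but shallow.
a_raw, a_norm = raw_score(20), normalized_score(20, 25)   # 0.20, 0.80
b_raw, b_norm = raw_score(25), normalized_score(25, 50)   # 0.25, 0.50

assert b_raw > a_raw     # B wins on breadth (25 > 20 kills)
assert a_norm > b_norm   # A wins on depth (80% > 50%)
```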

I'm biased (it's my research field, in part) but I'd suggest that studies on coverage are all over the place, with this one showing lack of correlation, and other studies showing good correlation between coverage and... some kind of effectiveness (the ICSE paper mentioned above, also from 2014, and a TOSEM paper coming out this year, as well as a variety of publications over the years).

http://www.cs.cmu.edu/~agroce/onwardessays14.pdf covers the Inozemtseva et al. paper as well as some other recent work, and nothing in the time since we wrote that has modified my view that the jury is still out on coverage, depending on the situation in which you want to use it. Saying "coverage is not useful" is pretty clearly wrong, and saying "coverage is highly effective for measuring all suites in all situations" is also clearly wrong. Beyond that, it's hard to make solidly supported claims that don't depend greatly on details of what you are measuring and how.

I suspect Laura generally agrees, though probably our guesses on what eventual answers might be differ.

The SPLASH Onward! 2014 essay concludes with this advice to practitioners:

In some cases where coverage is currently used, there is little real substitute for it; test suite size alone is not a very helpful measure of testing effort, since it is even more easily abused or misunderstood than coverage. Other testing efforts already have ways of determining when to stop that don’t rely on coverage (ranging from “we’re out of time or money” to “we see clearly diminishing returns in terms of bugs found per dollar spent testing, and predict few residual defects based on past projects”). When coverage levels are required by company or government policy, conscientious testers should strive to produce good suites that, additionally, achieve the required level of coverage rather than aiming very directly at coverage itself [56]. “Testing to the test” by writing a suite that gets “enough” coverage and expecting this to guarantee good fault detection is very likely a bad idea — even in the best-case scenario where coverage is well correlated with fault detection. Stay tuned to the research community for news on whether coverage can be used more aggressively, with confidence, in the future.

While I think it's well agreed upon in this community that code coverage is not all that big a deal, I think we should also consider the methods used in this study before we all pat ourselves on the back for having 'proof' of what we know.

> we generated 31,000 test suites for five systems consisting of up to 724,000 lines of source code

You auto-generated unit test suites, and you're surprised they weren't very good at finding bugs? Well, no kidding -- they were auto-generated! Would you trust your unit tests to be generated by a computer? Of course not.

Do a study of real-world software, and compare the unit test coverage to the Test Suite effectiveness. Then I'll be interested.

The unit tests are written by the programmers on each project, not generated by a computer. The test suites are generated by selecting a subset of the project's real-world test suite.

Paper author here. As someone already mentioned, the suites were made by picking a random subset of the test cases written by the developers, not by using a suite generation tool.

Now, our method is arguably not 100% realistic either, because in practice different parts of the program will have different coverage. Some kind of weighted random sampling might make suites that are a bit closer to "real" suites. I've been thinking about looking into that. I think our suites are realistic enough to make the results worthwhile, though.

Maybe we need to write tests to cover the test suites? Why not? They are code too. Then they'll improve.

Yes, these are mutation tests. PIT or some such for Java. It modifies a part of your code, and then if the tests that cover that code still pass, that's a clue that the tests are not effective.
