> Our results suggest that, for large Java
programs, the correlation between coverage and effectiveness
drops when suite size is controlled for.
> While coverage
measures are useful for identifying under-tested parts of a
program, and low coverage may indicate that a test suite is
inadequate, high coverage does not indicate that a test suite
They also propose an alternative as a "quality goal" for test suites:
> Of course, developers still want to measure the quality
of their test suites, meaning they need a metric that does
correlate with fault detection ability. While this is still an
open problem, we currently feel that mutation score may be
a good substitute for coverage in this context.
>We used the open source tool PIT to generate faulty
versions of our programs. To describe PIT’s operation, we
must first give a brief description of mutation testing.
>A mutant is a new version of a program that is created
by making a small syntactic change to the original program.
For example, a mutant could be created by modifying a
constant, negating a branch condition, or removing a method
call. The resulting mutant may produce the same output as
the original program, in which case it is called an equivalent
mutant. For example, if the equality test in the code snippet
in Figure 1 were changed to if (index >= 10), the new
program would be an equivalent mutant.
Sounds like a resource intensive way to test -- # tests * loc * mutation options. However, if you need the fault tolerance it could be worth it. You can probably spot check to get similar quality with lower resources. Very interesting.
> Per test case line coverage information is first gathered and all tests that do not exercise the mutated line of code are discarded.
I wonder if the results would be different at all for a less strongly typed language, like python for example.
In discussions about software metrics, I'm always trying to make the point that you can only use most metrics from within a team to gain insights, not as an external measure about how good/bad the code (or the team) is. In other words, if a team thinks they have a problem, they can use metrics to gain insights and explore possible solutions. But as soon as someone says "[foo metric] needs to be at least [value]", you have already lost - the cheating and gaming begins. Even if the agreement on [value] comes from within the team.
Back to the topic :) I am not surprised by the findings - Higher test coverage does not mean that everything is fine. But very low test coverage indicates that there might be hidden problems here or there. This is how I like to use test coverage and how I try to teach it.
But it is great that we now have empirical data here: from now on, I can point others to this study when we are discussing whether the build server should reject commits with less than [value] coverage.
Another way of looking at this is that large suites improve bug yield. They tend to have high coverage as a by-product, because a large test suite will hammer each unit from various directions.
My suspicion is that scoring coverage for every pass over an LOC or a pathway would change the result substantially.
Nothing in this paper is an argument that test coverage is useless; rather, it should be seen as a secondary metric that rapidly loses indicative value as it approaches 100%.
Number of test cases seems to be the independent variable here.
My hunch is that the quality of the oracle matters more than the coverage score. You can write tests that cover all the code without actually checking anything; the tests will only catch bugs if you're carefully comparing expected and actual results. Maybe a simple metric like "number of asserts" would be useful -- except, of course, that will also be correlated with the size of the suite... It's a tough problem.
The point about the title is fair. I erred on the side of clickbait when I wrote the paper and regret it a bit. On the other hand, it worked. :)
Would you say that coverage's worth as a negative metric still seems meaningful, at least as a heuristic? I imagine that's covered in other literature.
Where I work we don't really fuss too much about coverage. We TDD, so in practice our coverage hovers around the high 90s as a matter of course. When I am writing a test I often manually mutate the code and test once it goes green as a quick validation that the test does what I think it does.
One last question -- did you classify tests? Feature, integration and unit tests should show quite different curves. Especially heavily mockist style unit tests.
We talked a bit about classifying tests but didn't do it in the end because it's surprisingly hard to do. I do know of one paper that looked at different kinds of tests, called "The Effect of Code Coverage on Fault Detection under Different Testing Profiles": http://goo.gl/nnxgwE. The authors found differences between tests for error cases vs. tests for normal operation and between functional tests vs. random tests. IIRC, they had undergrads do a term project that had to pass 1200 tests before the final submission, and the professors themselves wrote the tests, so categorization was a bit easier.
Mind you, there's as many taxonomies for tests as there are tests. To be honest I expect one way to classify them is by working backwards from coverage -- feature tests should have low volume but wide distribution across a codebase (perhaps that's another metric -- density?). Unit tests would be narrow but deep on a particular module.
If you need a corpus of code from a highly doctrinaire TDD shop, or if you think we can help, let me know: email@example.com.
If anything, I was surprised that coverage is that important. But as the article (and you) says, it's only important at lower values, so it's not that surprising.
I'm not sure.
I've given it some more thought. I think the problem with distinguishing between the two is that "coverage" is defined, so to speak, as an area.
Once an LOC is executed by any test, its value ticks over once. It never gets counted again.
If instead each LOC execution was summed up independently, each LOC gets a "height" -- the number of times it was exercised in a suite. Then the overall suite gets test coverage volume, rather than test coverage area.
Once you track per-LOC volumes, you could start trying to tease out the differences between introducing a test at all and adding another test to critical code.
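To pin down the area/volume distinction, here is a minimal sketch. It assumes you can get, for each test, the set of line numbers it executes; the suite data is invented, and "height" here counts how many tests in the suite exercise a line.

```python
from collections import Counter


def coverage_area(per_test_lines, total_lines):
    """Classic coverage: fraction of lines hit by at least one test."""
    covered = set()
    for lines in per_test_lines.values():
        covered |= lines
    return len(covered) / total_lines


def coverage_heights(per_test_lines):
    """Per-line 'height': number of tests in the suite that execute it."""
    heights = Counter()
    for lines in per_test_lines.values():
        heights.update(lines)
    return heights


def coverage_volume(per_test_lines):
    """Sum of all heights; unbounded, unlike area."""
    return sum(coverage_heights(per_test_lines).values())


TOTAL_LINES = 5  # hypothetical program size

# Two made-up suites that hit exactly the same lines.
suite_a = {"t1": {1, 2, 3}}
suite_b = {"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {2, 3}}

if __name__ == "__main__":
    for name, suite in (("A", suite_a), ("B", suite_b)):
        print(name, coverage_area(suite, TOTAL_LINES), coverage_volume(suite))
```

Both suites have the same area (3 of 5 lines), so an area metric cannot tell them apart, while the volume differs (3 versus 7).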
And, I suspect, the correlation between coverage volume and bug yield would be much more robust than area when controlled for number of cases. Because area when controlling for number of cases is ... a rough proxy for test coverage volume.
Edit: though with an interesting difference. Coverage area converges to 100%, but coverage volume is effectively unbounded. This would hopefully upset the abuse of coverage as a management metric instead of as a code smell.
But reading the paper, the study is about random samples of the complete test suite, and the results are far too weak to draw any conclusion about it, finding negative correlations sometimes and weak positive ones other times.
Anyway, they got a strong correlation, with a good p-value (not worth much in this context, but it's all we get), for your hypothesis about coverage volume. It's just my hypothesis about some code being more important than other code that is inconclusive.
There's a lot of literature going back a long way showing that defects tend to cluster. A leading indicator might be defects found in a module divided by coverage volume.
On the other hand, defects tend to cluster in the hard stuff, not at random.
Rather than write good tests, people have gotten side-tracked by chasing a magical number that may or may not reflect the phenomenon of interest.
Heuristics are sometimes wrong, but always fast and not fundamentally bad. Writing unit tests for high-risk code is a good heuristic, god dammit. Chasing after 100% test-coverage is pedantic, and -- I honestly think -- evidence of a development team that favors form over function.
I'll continue writing unit tests that actually test critical boundary conditions, and I will continue not to care if only 10% of my code is covered. If that 10% represents 80% of my bugs, I've won.
> Goodhart's law is named after the economist who originated it, Charles Goodhart. Its most popular formulation is: "When a measure becomes a target, it ceases to be a good measure."
Preemptively testing a subset of my code doesn't prevent me from fixing other bugs, or from writing regression tests.
From the tone of your question, you'd think that 100% test coverage would fix the remaining 20%, when it clearly doesn't. I mean no disrespect, but this is what I meant when I said that unit-test coverage is a pedantic subject (which, I must again insist, is not meant to imply that you are a pedant).
The bugs get fixed, they just don't get discovered/resolved with up-front unit tests alone. It ends up being more effective and more efficient to hunt them down by other means.
1. Unambiguous specifications of what the system will do.
2. Implementation in safe languages or subsets with plenty of interface checks.
3. Human review of design, implementation, and configuration looking for errors. Best results were here.
4. Usage-based testing to eliminate most errors users will experience in practice.
5. Thorough functional and unit testing.
6. Fuzz testing.
7. Static analysis.
Practices 1-4 were used in Cleanroom, some major commercial projects, and high assurance security work. Every time, they prevented serious defects from entering the system, defects that often might have slipped past testing. So that's where the majority of our verification work should go. Supporting evidence comes from those who do a subset of these practices, from the OpenBSD team to Microsoft's SDL, along with the substantial defect reduction that followed.
Note: 6 and 7 showed their value later on. No. 7 should definitely be done where possible, with 6 done to a degree. I haven't researched 6 enough to give advice on the most efficient approach, though.
So, internal testing and coverage are the weakest forms of verification. They have many benefits for sure but people put way too much effort and expectations into them where it pays off more elsewhere. Do enough testing for its key benefits then just cut it off from there. And don't even do most of that until you've done code reviews, etc which pay off more.
If a project currently produces clean output, and I make a commit that causes clang's analyser to complain, it's easy enough to fix, whether it's a "legitimate" error or not.
I've tried throwing static analyzers at long established projects like OpenSSL and... there's just no point. No one is going to review a few thousand errors and make hundreds of commits that don't actually fix anything just to get perfectly clean static analysis output. It's more likely I would break something than help.
It's really not that hard to implement a "no commit unless it passes static analysis" rule from the start, and even if the bugs removed are low, I usually argue they are non-zero.
As a student / recent grad I remember thinking that testing was something you maybe had to do for super hardcore projects. Now I see it as one of the first things to think about on a serious project and something that is going to take ~30 - 50% of the cost / effort.
Even if the tests are covering mundane, easy functions and execution paths, the benefit of not worrying if you broke something far outweighs the time to write tests.
Tests are compared with the API every time they're run, as seen by the computer's model of execution, giving immediate feedback on detected discrepancies.
(I believe that Pythonistas have clever docs that can encode snippets of testing into documentation, which splits the difference a bit, but is no substitute for a robust test suite).
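For anyone who hasn't seen it, the feature being described is most likely the standard library's `doctest` module. A minimal sketch with a made-up `median` function: the examples in the docstring are documentation, and `doctest` re-runs them and compares the printed results.

```python
import doctest


def median(values):
    """Return the median of a non-empty sequence of numbers.

    >>> median([3, 1, 2])
    2
    >>> median([4, 1, 3, 2])
    2.5
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


if __name__ == "__main__":
    # Runs every docstring example in this module and reports mismatches.
    print(doctest.testmod())
```

If an example in the docstring drifts out of date, the run reports a failure, which is what keeps the docs honest. It still only checks the handful of cases someone bothered to write down.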
I'm not an advocate of TDD though, and I agree that more effort should be put into how to test.
Wrong. I could name three managers who have from personal experience.
> it is however very obviously a useful metric.
Disagree. I think it's an actively counterproductive metric, and one is better off not measuring it at all.
If X% coverage is a goal measured by non-technical team members, it likely loses much of its value.
Might I also add that I discover errors in the error-handling surprisingly often. Coverage testing goes a long way towards reminding you to check those too. What your code does in error cases is part of the spec.
Shooting for 100% may not always be the best use of time, but if you are writing unit tests you should at least look at the line-by-line coverage results. You've already done the hard work of writing the test suite, be sure to reap all the benefits you can.
: "critical" is, ahem, a critical word here. I don't shoot for 100% on everything, but every even halfway sanely-designed system has a "core". Total coverage on the core has the dual benefits of assuring the core really does work as intended, and making it possible to refactor the core. It is a very common antipattern for the core to metastasize because without a test suite, all ability to modify it, even to fix bugs, is lost.
Use the tool for what it is good at. It's not a substitute for code review or writing convincing tests. Granted even convincing tests are going to miss important stuff, but they will miss less and after the initial bout of bugs will become quite good (because you write regression tests for EVERY bug right?).
Keeping in mind that I don't disagree with the statement. Test coverage is an objective metric, and test Effectiveness is a... what is it again? How many bugs you'll find with it? Obviously the two are separate concepts.
This is my first objection: The paper seems to say that Mutation Testing is test effectiveness, but Mutation testing is merely another metric. They cite other papers, but papers that attempt to demonstrate that this metric is correlated with test "effectiveness".
Metric against metric, is that meaningful?
She presents graphs of results that seem to demonstrate linear correlation between test suite size/test suite coverage/and mutation testing (called "effectiveness"). This is addressed later, "isn't this what we would expect?" yah! And the explanation for why it's unexpected sailed fully over my head. (I admit it! I'm dumb.)
Finally, many test suites are generated on the presumption of achieving code coverage. Is this test valid without also having test suites that were made without that goal in mind? Could such a suite exist?
so summary of my objections
* is mutation testing a meaningful measure of effectiveness?
* can you measure one metric against another, get a linear relationship, and conclude any meaningful differences?
* does the presence of code coverage as a target spoil the conclusion?
I'd love to hear input on this.
What the graphs show is that effectiveness rises with suite size, which is expected: more tests catch more bugs. We also see that effectiveness rises with coverage, which seems intuitive: you can't catch bugs in code you never run. But when you graph effectiveness against coverage for suites that are all the same size, the correlation drops significantly, in some cases to 0. Here's an analogy: the number of PhD students that graduate in the US is highly correlated with the amount of profit generated by arcades in the US, but there's no causal relationship between them. They probably both depend on the size of the population. Similarly, coverage and effectiveness are correlated because they both depend on the size of the suite, but we can't say that there's a causal relationship. In other words, saying that a suite will catch a lot of bugs because it has high coverage is like saying that a lot of PhD students will graduate because arcades turned a good profit this year.
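Here is a toy simulation of the confounding argument above. It is not the paper's data or method: it is a deliberately artificial model in which each test covers a random set of lines and, independently, kills a random set of mutants, so there is no causal link between coverage and kills by construction. All sizes and counts are arbitrary assumptions.

```python
import random

N_LINES = 1000    # assumed program size
N_MUTANTS = 1000  # assumed number of mutants


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_master_suite(rng, n_tests=200, per_test=20):
    """Each test gets unrelated random lines covered and mutants killed."""
    return [
        (
            frozenset(rng.sample(range(N_LINES), per_test)),
            frozenset(rng.sample(range(N_MUTANTS), per_test)),
        )
        for _ in range(n_tests)
    ]


def measure(suite):
    """Return (coverage, effectiveness) of a suite as fractions."""
    covered = set().union(*(lines for lines, _ in suite))
    killed = set().union(*(kills for _, kills in suite))
    return len(covered) / N_LINES, len(killed) / N_MUTANTS


def correlation_for_sizes(rng, master, sizes, suites_per_size):
    """Sample random suites of the given sizes and correlate the metrics."""
    coverage, effectiveness = [], []
    for size in sizes:
        for _ in range(suites_per_size):
            cov, eff = measure(rng.sample(master, size))
            coverage.append(cov)
            effectiveness.append(eff)
    return pearson(coverage, effectiveness)


def run_experiment(seed=1):
    rng = random.Random(seed)
    master = make_master_suite(rng)
    r_mixed = correlation_for_sizes(rng, master, [5, 10, 20, 40, 80], 50)
    r_fixed = correlation_for_sizes(rng, master, [20], 250)
    return r_mixed, r_fixed


if __name__ == "__main__":
    r_mixed, r_fixed = run_experiment()
    print(f"suites of mixed sizes: r = {r_mixed:.2f}")
    print(f"suites of one size:    r = {r_fixed:.2f}")
```

With mixed sizes the two metrics look strongly correlated, because both grow with the number of tests; with size held fixed the correlation should hover near zero. Real suites are not this extreme, since a test cannot kill a mutant on a line it never runs, but the sketch shows how suite size alone can manufacture the correlation.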
Your third point is a really good question. It's true that developers don't write tests randomly, so our method of making new suites by picking random test cases isn't quite realistic. What impact that would have on the results, I'm not sure. That's something I'd like to look into in the future.
It's just it doesn't have what I would consider good evidence that coverage isn't correlated with effectiveness. (or... I don't understand the section that explains it <:( )
What I would consider good evidence would be to demonstrate that different test suites of the same size and different coverage achieve the same effectiveness.
* test suite A has a size of 300 sloc, and a coverage of 5%, effectiveness of ~8%
* test suite B has a size of 300 sloc, and a coverage of 12%, effectiveness of ~8%
That to me would be evidence of the stated conclusion, but I don't see where this is demonstrated (or where this is demonstrated, I don't understand). I do see where it is stated! Though... now that I have read through the paper a bit more thoroughly.
On the subject of normalized effectiveness.
we are comparing suite A, with 50% coverage, to suite B, with
60% coverage. Suite B will almost certainly have a higher
raw effectiveness measurement, since it covers more code and
will therefore almost certainly kill more mutants. However,
if suite A kills 80% of the mutants that it covers, while suite
B kills only 70% of the mutants that it covers, suite A is
in some sense a better suite."
Anyway, thank you for your time and work!
The normalization was a point of contention with the peer reviewers as well, so in the end I tried both the normalized and unnormalized metrics and found similar results with both. The other tables and figures are available on my site if you want to look at them.
I'm not sure I understand what you mean when you say suite B is doing more with less, though. In the example, I was trying to say that suite B covers more code, so it will kill more mutants. Maybe suite A kills 20 mutants and suite B kills 25 mutants, just to have some numbers to talk about. But if B covers 50 mutants, and is only killing 25 of them, while A covers 25 mutants and kills 20 of the 25, it seems like suite A is doing a better job of testing the code it covers. Or to put it another way, suite B is broad but shallow while suite A is focused but deep. B isn't necessarily a bad suite, but I wouldn't say it's doing more with less, just that it has a different focus. Maybe I'm misunderstanding your point, though.
Another way of thinking about it is that the raw mutation score measures breadth: B is better than A because 25 > 20. The normalized score measures depth: A is better than B because 80% (20/25) > 50% (25/50).
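The same arithmetic as a short sketch. The covered and killed counts come from the example above; the total of 100 mutants is an assumption added here so the raw score has a denominator, and the class and function names are made up.

```python
from dataclasses import dataclass


@dataclass
class Suite:
    name: str
    covered_mutants: int  # mutants on lines the suite executes
    killed_mutants: int   # mutants the suite detects


def raw_score(suite, total_mutants):
    """Breadth: killed mutants over all mutants in the program."""
    return suite.killed_mutants / total_mutants


def normalized_score(suite):
    """Depth: killed mutants over the mutants the suite covers."""
    return suite.killed_mutants / suite.covered_mutants


TOTAL_MUTANTS = 100  # assumed; not given in the discussion

suite_a = Suite("A", covered_mutants=25, killed_mutants=20)
suite_b = Suite("B", covered_mutants=50, killed_mutants=25)

if __name__ == "__main__":
    for suite in (suite_a, suite_b):
        print(
            suite.name,
            f"raw={raw_score(suite, TOTAL_MUTANTS):.2f}",
            f"normalized={normalized_score(suite):.2f}",
        )
```

B wins on the raw score (0.25 vs 0.20) and A wins on the normalized score (0.80 vs 0.50), which is the breadth-versus-depth split described above.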
http://www.cs.cmu.edu/~agroce/onwardessays14.pdf covers the Inozemtseva et al. paper as well as some other recent work, and nothing in the time since we wrote that has modified my view that the jury is still out on coverage, depending on the situation in which you want to use it. Saying "coverage is not useful" is pretty clearly wrong, and saying "coverage is highly effective for measuring all suites in all situations" is also clearly wrong. Beyond that, it's hard to make solidly supported claims that don't depend greatly on details of what you are measuring and how.
I suspect Laura generally agrees, though probably our guesses on what eventual answers might be differ.
In some cases where coverage is currently used, there is little real substitute for it; test suite size alone is not a very helpful measure of testing effort, since it is even more easily abused or misunderstood than coverage. Other testing efforts already have ways of determining when to stop that don’t rely on coverage (ranging from “we’re out of time or money” to “we see clearly diminishing returns in terms of bugs found per dollar spent testing, and predict few residual defects based on past projects”). When coverage levels are required by company or government policy, conscientious testers should strive to produce good suites that, additionally, achieve the required level of coverage rather than aiming very directly at coverage itself. “Testing to the test” by writing a suite that gets “enough” coverage and expecting this to guarantee good fault detection is very likely a bad idea — even in the best-case scenario where coverage is well correlated with fault detection. Stay tuned to the research community for news on whether coverage can be used more aggressively, with confidence, in the future.
> we generated 31,000 test suites for five systems consisting of up to 724,000 lines of source code
You auto-generated unit test suites, and you're surprised they weren't very good at finding bugs? Well, no kidding, they were auto-generated! Would you trust your unit tests to be generated by a computer? Of course not.
Do a study of real-world software, and compare the unit test coverage to the Test Suite effectiveness. Then I'll be interested.
Now, our method is arguably not 100% realistic either, because in practice different parts of the program will have different coverage. Some kind of weighted random sampling might make suites that are a bit closer to "real" suites. I've been thinking about looking into that. I think our suites are realistic enough to make the results worthwhile, though.
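A sketch of the two sampling schemes being contrasted, under loud assumptions: the test names and weights are invented, and the weighted version uses the Efraimidis-Spirakis key trick, which is just one way to sample without replacement in proportion to weights. It is not taken from the paper.

```python
import random


def uniform_suite(rng, master, size):
    """Pick `size` distinct tests, each equally likely."""
    return rng.sample(master, size)


def weighted_suite(rng, master, weights, size):
    """Pick `size` distinct tests, favoring those with larger weights.

    Efraimidis-Spirakis: give each test the key u ** (1 / w) for a
    fresh uniform u, then keep the tests with the largest keys.
    Weights must be positive.
    """
    keyed = [
        (rng.random() ** (1.0 / weight), test)
        for test, weight in zip(master, weights)
    ]
    keyed.sort(reverse=True)
    return [test for _, test in keyed[:size]]


# Hypothetical master suite; weights might reflect how heavily each
# area of the program is tested in practice.
master = ["test_core_a", "test_core_b", "test_io", "test_cli", "test_docs"]
weights = [5.0, 5.0, 2.0, 1.0, 0.5]

if __name__ == "__main__":
    rng = random.Random(0)
    print(uniform_suite(rng, master, 3))
    print(weighted_suite(rng, master, weights, 3))
```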