
Coverage Is Not Strongly Correlated with Test Suite Effectiveness (2014)
http://www.linozemtseva.com/research/2014/icse/coverage/
======
riffraff
Note that the result is a bit more nuanced if you read the paper, e.g.:

> Our results suggest that, for large Java programs, the correlation between
> coverage and effectiveness drops when suite size is controlled for.

and

> While coverage measures are useful for identifying under-tested parts of a
> program, and low coverage may indicate that a test suite is inadequate, high
> coverage does not indicate that a test suite is effective.

They also propose an alternative as a "quality goal" for test suites:

> Of course, developers still want to measure the quality of their test
> suites, meaning they need a metric that does correlate with fault detection
> ability. While this is still an open problem, we currently feel that
> mutation score may be a good substitute for coverage in this context.

~~~
daveguy
I wasn't familiar with mutation testing. This is also from the article:

> We used the open source tool PIT [35] to generate faulty versions of our
> programs. To describe PIT’s operation, we must first give a brief
> description of mutation testing.

> A mutant is a new version of a program that is created by making a small
> syntactic change to the original program. For example, a mutant could be
> created by modifying a constant, negating a branch condition, or removing a
> method call. The resulting mutant may produce the same output as the
> original program, in which case it is called an equivalent mutant. For
> example, if the equality test in the code snippet in Figure 1 were changed
> to if (index >= 10), the new program would be an equivalent mutant.
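
To make that concrete, here is a hypothetical snippet of my own (not the
paper's actual Figure 1) showing one equivalent and one killable mutant:

    class MutantDemo {
        // Original: process items 0..9, stopping when index reaches 10.
        // (Assumes items has at least 10 elements.)
        static int processedSum(int[] items) {
            int index = 0;
            int sum = 0;
            while (true) {
                sum += items[index];  // stand-in for real work
                index++;
                // Mutant A: `if (index >= 10)` behaves identically here,
                // because index rises by exactly 1 and so first reaches 10
                // exactly -- an *equivalent* mutant that no test can kill.
                if (index == 10) {
                    break;
                }
            }
            return sum;
        }
        // By contrast, a mutant that changes `index++` to `index += 2`
        // skips half the items and changes the result; a good suite
        // should kill it with a failing assertion.
    }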

Sounds like a resource-intensive way to test -- # tests * LOC * mutation
options. However, if you need the fault tolerance it could be worth it. You
can probably spot-check to get similar quality with lower resources. Very
interesting.

~~~
julian37
pitest has a few tricks up its sleeve to keep overhead reasonable. For example
(from [http://pitest.org/faq/](http://pitest.org/faq/)):

> Per test case line coverage information is first gathered and all tests that
> do not exercise the mutated line of code are discarded.

I've used pitest in real life, and while it is quite a bit heavier than
straightforward testing, it's definitely usable. I'm now using it on all Java
projects; I wish there were something comparable for JavaScript!
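
For the curious, here is a rough sketch of what that trick amounts to --
coverage-targeted mutation testing, with hypothetical Mutant and TestCase
types of my own (pitest's real implementation is of course far more
involved):

    import java.util.List;
    import java.util.Map;

    // Sketch: instead of running the whole suite against every mutant,
    // run only the tests whose recorded line coverage hits the mutated line.
    class MutationRunner {
        record Mutant(String file, int line) {}
        interface TestCase { boolean passesAgainst(Mutant m); }

        // Per-test line coverage, gathered up front: "file:line" -> tests
        private final Map<String, List<TestCase>> coverage;

        MutationRunner(Map<String, List<TestCase>> coverage) {
            this.coverage = coverage;
        }

        boolean isKilled(Mutant m) {
            // Tests that never execute the mutated line are discarded
            // outright; they cannot possibly kill the mutant.
            List<TestCase> candidates =
                coverage.getOrDefault(m.file() + ":" + m.line(), List.of());
            for (TestCase t : candidates) {
                if (!t.passesAgainst(m)) {
                    return true;  // a covering test fails -> mutant killed
                }
            }
            return false;  // mutant survives (or was never covered at all)
        }
    }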

~~~
mordocai
I don't think it is quite as complete, but this looks usable:
[https://github.com/jimivdw/grunt-mutation-testing](https://github.com/jimivdw/grunt-mutation-testing).
That's just from Google; I have no experience with it.

------
struppi
I am not surprised by the findings of the study - not at all.

In discussions about software metrics, I'm always trying to make the point
that you can _only_ use most metrics from within a team to gain insights, not
as an external measure about how good/bad the code (or the team) is. In other
words, if a team thinks they have a problem, they can use metrics to gain
insights and explore possible solutions. But as soon as someone says "[foo
metric] needs to be at least [value]", you have already lost - the cheating
and gaming begins. Even if the agreement on [value] comes from within the
team.

Back to the topic :) I am not surprised by the findings: higher test coverage
does not mean that everything is fine, but very low test coverage indicates
that there might be hidden problems here or there. This is how I like to use
test coverage and how I try to teach it.

But it is great that we now have empirical data: from now on, I can point
others to this study when we are discussing whether the build server should
reject commits with less than [value] coverage.

~~~
jacques_chester
A brief skim of the study shows that the headline is a bit misleading. The
correlation between coverage and bug yield is weak _when suite size is
controlled for_.

Another way of looking at this is that large suites improve bug yield. They
tend to have high coverage as a _by-product_, because a large test suite will
hammer each unit from various directions.

My suspicion is that scoring coverage for _every_ pass over an LOC or a
pathway would change the result substantially.

Nothing in this paper is an argument that test coverage is useless; rather, it
should be seen as a secondary metric that rapidly loses indicative value as it
approaches 100%.

_Number of test cases_ seems to be the independent variable here.

~~~
lmmi
Paper author here. Scoring coverage for every pass is an interesting idea and
something I'd like to look into. I'm not sure it would change the result as
much as you think, though. The basic finding of the paper is that coverage is
a complicated way of measuring the size of the suite. Counting the number of
times each line is hit will have the same problem, I think: writing more tests
increases that score but also increases the number of bugs found, causing a
spurious correlation.

My hunch is that the quality of the oracle matters more than the coverage
score. You can write tests that cover all the code without actually checking
anything; the tests will only catch bugs if you're carefully comparing
expected and actual results. Maybe a simple metric like "number of asserts"
would be useful -- except, of course, that will also be correlated with the
size of the suite... It's a tough problem.
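
To illustrate the oracle point, here is a hypothetical JUnit 5 example (mine,
not the paper's): both tests below produce identical coverage, but only the
second can catch a wrong result.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class OracleDemo {
        // Hypothetical code under test.
        static int add(int a, int b) { return a + b; }

        @Test
        void coversButChecksNothing() {
            add(2, 2);  // raises coverage but asserts nothing; only a
                        // crash could ever make this test fail
        }

        @Test
        void checksTheResult() {
            assertEquals(4, add(2, 2));  // a real oracle: expected vs. actual
        }
    }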

The point about the title is fair. I erred on the side of clickbait when I
wrote the paper and regret it a bit. On the other hand, it worked. :)

~~~
jacques_chester
I see your point about coverage volume being a better-fitted proxy for suite
size. It'd also be a proxy for path coverage. Still, it'd be fun to count the
horse's teeth anyhow.

Would you say that coverage's worth as a _negative_ metric still seems
meaningful, at least as a heuristic? I imagine that's covered in other
literature.

Where I work we don't really fuss too much about coverage. We TDD, so in
practice our coverage hovers around the high 90s as a matter of course. When
I'm writing a test, once it goes green I often manually mutate the code under
test as a quick validation that the test does what I think it does.
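
Concretely, that check looks something like this (hypothetical code and test
of my own):

    import static org.junit.jupiter.api.Assertions.assertTrue;
    import org.junit.jupiter.api.Test;

    class BoundaryDemo {
        // Hypothetical code under test.
        static boolean isAdult(int age) {
            return age >= 18;
        }

        // Once this goes green, temporarily change `>=` to `>` in isAdult
        // and re-run: the test should now fail. If it stays green, it never
        // really pinned down the boundary and needs strengthening.
        @Test
        void eighteenCountsAsAdult() {
            assertTrue(isAdult(18));
        }
    }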

One last question -- did you classify tests? Feature, integration, and unit
tests should show quite different curves, especially heavily mockist-style
unit tests.

~~~
lmmi
I'd definitely say that the absence of coverage is a problem. My view is that
coverage is necessary but not sufficient for good testing.

We talked a bit about classifying tests but didn't do it in the end because
it's surprisingly hard to do. I do know of one paper that looked at different
kinds of tests, called "The Effect of Code Coverage on Fault Detection under
Different Testing Profiles": [http://goo.gl/nnxgwE](http://goo.gl/nnxgwE). The
authors found differences between tests for error cases vs. tests for normal
operation and between functional tests vs. random tests. IIRC, they had
undergrads do a term project that had to pass 1200 tests before the final
submission, and the professors themselves wrote the tests, so categorization
was a bit easier.

~~~
jacques_chester
As a rough way to automatically classify tests, you can look for tools like
Capybara, Selenium, or HtmlUnit for feature tests, and mocking libraries for
unit tests.

Mind you, there are as many taxonomies for tests as there are tests. To be
honest, I expect one way to classify them is by working backwards from
coverage -- feature tests should have low volume but wide distribution across
a codebase (perhaps that's another metric -- density?), while unit tests would
be narrow but deep on a particular module.

~~~
lmmi
Those are good ideas, thanks!

~~~
jacques_chester
Throw me on as the tenth or eleventh coauthor and I'll buy beer to sweeten the
deal.

If you need a corpus of code from a highly doctrinaire TDD shop, or if you
think we can help, let me know: jchester@pivotal.io.

------
omginternets
Another case of conflating a phenomenon and its measure.

Rather than write good tests, people have gotten side-tracked by chasing a
magical number that may or may not reflect the phenomenon of interest.

Heuristics are sometimes wrong, but they are always fast and not
fundamentally bad. Writing unit tests for high-risk code is a good heuristic,
god dammit. Chasing after 100% test coverage is pedantic and -- I honestly
think -- evidence of a development team that favors form over function.

I'll continue writing unit tests that actually test critical boundary
conditions, and I will continue not to care if only 10% of my code is covered.
If that 10% represents 80% of my bugs, I've won.

~~~
knughit
What about the other 20% of your bugs? They don't matter?

~~~
omginternets
Huh?

Preemptively testing a subset of my code doesn't prevent me from fixing other
bugs, or from writing regression tests.

From the tone of your question, you'd think that 100% test coverage would fix
the remaining 20%, when it clearly doesn't. I mean no disrespect, but this is
what I meant when I said that unit-test coverage is a pedantic subject (which,
I must again insist, is _not_ meant to imply that you are a pedant).

The bugs get fixed, they just don't get discovered/resolved with up-front unit
tests alone. It ends up being more effective and more efficient to hunt them
down by other means.

------
nickpsecurity
No kidding. This has been known for a long time, but it's good to see
empirical evidence. The empirical evidence from the '70s through the '90s
showed that the best ways to reduce defects were the following:

1. Unambiguous specifications of what the system will do.

2. Implementation in safe languages or subsets with plenty of interface
checks.

3. _Human review_ of design, implementation, and configuration, looking for
errors. The best results were here.

4. Usage-based testing to eliminate most errors users will experience in
practice.

5. Thorough functional and unit testing.

(Later on...)

6. Fuzz testing.

7. Static analysis.

Practices 1-4 were used in Cleanroom, some major commercial projects, and
high-assurance security work. Every time, they prevented serious defects from
entering the system that might otherwise have slipped past testing. So those
are where the majority of our verification work should go. Supporting
evidence comes from teams that practice a subset of this, from the OpenBSD
team to Microsoft's SDL, along with the substantial defect reduction that
followed.

Note: 6 and 7 showed their value later on. No. 7 should definitely be done
where possible, with No. 6 done to a degree. I haven't researched No. 6
enough to give advice on the most efficient approach, though.

So internal testing and coverage are the weakest forms of verification. They
have many benefits, for sure, but people put way too much effort and
expectation into them when it pays off more elsewhere. Do enough testing for
its key benefits, then cut it off from there. And don't even do most of that
until you've done code reviews, etc., which pay off more.

~~~
technion
I'm not going to dispute that #7 earns a place lower down the chain than the
practices you mentioned, in practical terms of identifying bugs. That said,
it's only really feasible when adopted early in a project's design.

If a project currently produces clean output and I make a commit that causes
clang's analyzer to complain, it's easy enough to fix, whether it's a
"legitimate" error or not.

I've tried throwing static analyzers at long-established projects like
OpenSSL and... there's just no point. No one is going to review a few
thousand errors and make hundreds of commits that don't actually fix anything
just to get perfectly clean static analysis output. It's more likely I would
break something than help.

It's really not that hard to implement a "no commit unless it passes static
analysis" rule from the start, and even if the number of bugs removed is low,
I usually argue it is non-zero.

~~~
nickpsecurity
I listed them in chronological order of discovery and significant
application. I agree it should be used as early as possible; there's plenty
of benefit to that. Further, the Orange Book era systems discovered a
heuristic that supports your theory about old codebases: things must be
designed for such verification or security from day one. It is usually
impossible to retrofit such properties into an existing codebase whose
methods were quite different.

------
daleharvey
I find it strange that so much time is spent trying to convince people that
some aspect of testing (coverage/TDD) is not a panacea, and so little on
improving how we test.

As a student/recent grad, I remember thinking that testing was maybe
something you had to do for super hardcore projects. Now I see it as one of
the first things to think about on a serious project, and something that is
going to take ~30-50% of the cost/effort.

~~~
lubonay
IMO, testing's biggest strength is that it lets you delegate remembering all
the use cases of an API to some code, instead of having to look them up in a
checklist or, worse, having to remember them all.

Even if the tests cover mundane, easy functions and execution paths, the
benefit of not worrying about whether you broke something far outweighs the
time it takes to write them.
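
Something like this, say -- a hypothetical example of pinning easy-to-forget
API use cases in tests rather than in a checklist:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;
    import java.util.List;

    class TagParserTest {
        // Hypothetical API under test.
        static List<String> parseTags(String input) {
            if (input == null || input.isBlank()) return List.of();
            return List.of(input.trim().split("\\s*,\\s*"));
        }

        // Each easy-to-forget use case lives in a test instead of a
        // checklist; the build nags us the moment one of them breaks.
        @Test
        void blankInputYieldsNoTags() {
            assertEquals(List.of(), parseTags("   "));
        }

        @Test
        void whitespaceAroundCommasIsTrimmed() {
            assertEquals(List.of("a", "b"), parseTags("a , b"));
        }
    }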

~~~
Fradow
Isn't "remembering use cases" something that should be done with comments or
documentation ?

~~~
jacques_chester
Comments and documentation go stale because they require deliberate human
intervention to stay truthful. They also rely on the author of the
documentation or comments correctly executing the marked-up code in their
head.

Tests are compared with the API every time they're run, as seen by the
computer's model of execution, giving immediate feedback on detected
discrepancies.

(I believe that Pythonistas have clever docs that can encode snippets of
testing into documentation, which splits the difference a bit, but is no
substitute for a robust test suite).

------
wyldfire
Coverage is pretty awesome feedback for reminding you that "oh yeah, I should
probably test that case too." And if you start out with this kind of feedback
early it can help influence your design to increase its testability.

If X% coverage is a goal measured by non-technical team members, it likely
loses much of its value.

~~~
jerf
I've been getting into shooting for 100% coverage in critical [1] code lately,
and while that may not be to everyone's taste, what I've noticed is that I
_always_ learn things from at least _looking_ at the coverage. "Oh, I thought
I had that covered... that's interesting, I thought I had that error simulated
but it didn't happen after all... oh, look, there's no possible way to ever
enter those 125 lines of nasty-looking code _plink_."

I might also add that I discover errors in the error handling surprisingly
often. Coverage testing goes a long way towards reminding you to check those
paths too. What your code does in error cases is part of the spec.
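
A hypothetical illustration (mine, not the parent's) of the kind of bug that
hides in an unexercised error path:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class ConfigLoader {
        // The happy path is tested and covered; the catch block never runs
        // until a test deliberately forces the IOException...
        static String readConfig(Path p) {
            try {
                return Files.readString(p);
            } catch (IOException e) {
                // ...at which point the handler's own bug surfaces:
                // getCause() is usually null for a plain IOException, so
                // this line throws NullPointerException instead of the
                // intended error.
                throw new IllegalStateException(
                        "config unreadable: " + e.getCause().getMessage());
            }
        }
    }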

Shooting for 100% may not always be the best use of time, but if you are
writing unit tests you should at least look at the line-by-line coverage
results. You've already done the hard work of writing the test suite, be sure
to reap _all_ the benefits you can.

[1]: "critical" is, ahem, a _critical_ word here. I don't shoot for 100% on
everything, but every even halfway sanely-designed system has a "core". Total
coverage on the core has the dual benefits of assuring the core really does
work as intended, and making it possible to refactor the core. It is a very
common antipattern for the core to metastasize because without a test suite,
all ability to modify it, even to fix bugs, is lost.

------
arielweisberg
Coverage does not guarantee effectiveness. Lack of coverage guarantees the
behavior of uncovered code is not checked.

Use the tool for what it is good at. It's not a substitute for code review or
writing convincing tests. Granted, even convincing tests are going to miss
important stuff, but they will miss less, and after the initial bout of bugs
they will become quite good (because you write regression tests for EVERY
bug, right?).

------
GhotiFish
I have a few objections to the evidence presented.

Keeping in mind that I don't disagree with the statement: test coverage is an
objective metric, while test effectiveness is a... what is it again? How many
bugs you'll find with it? Obviously the two are separate concepts.

This is my first objection: the paper seems to treat mutation testing as test
effectiveness, but mutation testing is merely another metric. They cite other
papers, but those papers only attempt to demonstrate that this metric is
correlated with test "effectiveness".

Metric against metric, is that meaningful?

She presents graphs of results that seem to demonstrate a linear correlation
between test suite size, test suite coverage, and mutation testing (called
"effectiveness"). This is addressed later: "isn't this what we would expect?"
Yah! But the explanation for why it's unexpected sailed fully over my head.
(I admit it! I'm dumb.)

Finally, many test suites are written with the goal of achieving code
coverage. Is this test valid without also having test suites that were made
without that goal in mind? Could such a suite exist?

So, a summary of my objections:

* is mutation testing a meaningful measure of effectiveness?

* can you measure one metric against another, get a linear relationship, and conclude any meaningful differences?

* does the presence of code coverage as a target spoil the conclusion?

I'd love to hear input on this.

~~~
lmmi
Paper author here. We measured "effectiveness" as the mutation score. We
showed in a separate paper that the mutation score is a good way to measure a
test suite's ability to detect real faults. Of course, using the mutation
score instead of the real fault detection rate does add a layer of
indirection, but it's a lot more practical for a large empirical study (i.e.,
can be automated), so there's a bit of a tradeoff there. (The other paper is
also on my website if you want to read it:
[http://www.linozemtseva.com/research/2014/fse/mutant_validity/](http://www.linozemtseva.com/research/2014/fse/mutant_validity/))

What the graphs show is that effectiveness rises with suite size, which is
expected: more tests catch more bugs. We also see that effectiveness rises
with coverage, which seems intuitive: you can't catch bugs in code you never
run. But when you graph effectiveness against coverage for suites that are all
the _same_ size, the correlation drops significantly, in some cases to 0.
Here's an analogy: the number of PhD students that graduate in the US is
highly correlated with the amount of profit generated by arcades in the US,
but there's no causal relationship between them. They probably both depend on
the size of the population. Similarly, coverage and effectiveness are
correlated because they both depend on the size of the suite, but we can't say
that there's a causal relationship. In other words, saying that a suite will
catch a lot of bugs because it has high coverage is like saying that a lot of
PhD students will graduate because arcades turned a good profit this year.

Your third point is a really good question. It's true that developers don't
write tests randomly, so our method of making new suites by picking random
test cases isn't quite realistic. What impact that would have on the results,
I'm not sure. That's something I'd like to look into in the future.

~~~
GhotiFish
Thank you very much for the response! This paper is pretty interesting for
sure.

It's just that it doesn't have what I would consider good evidence that
coverage isn't correlated with effectiveness (or I don't understand the
section that explains it <:( ).

What I would consider good evidence would be to demonstrate that _different_
test suites of the _same size_ and _different coverage_ achieve the same
effectiveness.

* test suite A has a size of 300 sloc, and a coverage of 5%, effectiveness of ~8%

* test suite B has a size of 300 sloc, and a coverage of 12%, effectiveness of ~8%

That, to me, would be evidence of the stated conclusion, but I don't see
where it is demonstrated (or, where it is demonstrated, I don't understand
it). I do see where it is stated, though, now that I have read through the
paper a bit more thoroughly.

On the subject of normalized effectiveness:

    
    
    Suppose we are comparing suite A, with 50% coverage, to suite B, with
    60% coverage. Suite B will almost certainly have a higher raw
    effectiveness measurement, since it covers more code and will therefore
    almost certainly kill more mutants. However, if suite A kills 80% of the
    mutants that it covers, while suite B kills only 70% of the mutants that
    it covers, suite A is in some sense a better suite.
    

I don't believe a majority of people would agree with this. To me, this says
coverage is positively correlated with effectiveness, and that suite B is
doing more with less. Maybe that's a philosophical standpoint? By
normalizing, have you eliminated the premise?

Anyway, thank you for your time and work!

~~~
lmmi
What you're looking for is in Figure 3 (admittedly a bit hard to read because
I had to squish it into the paper). Each panel in that figure shows the
results for test suites of a fixed size. For example, the top left panel shows
the results for suites with three test cases for Apache POI. If you draw a
horizontal line through the graph, all of the test suites that fall on that
line have the same effectiveness score even though they have different
coverage levels. I gave a talk about this paper at GTAC this year and Google
posts all the videos on YouTube, so if you have time and you're still curious
the explanation in the talk might help (and the graphs are much easier to
read).
[https://www.youtube.com/watch?v=sAfROROGujU](https://www.youtube.com/watch?v=sAfROROGujU)

The normalization was a point of contention with the peer reviewers as well,
so in the end I tried both the normalized and unnormalized metrics and found
similar results with both. The other tables and figures are available on my
site if you want to look at them.

I'm not sure I understand what you mean when you say suite B is doing more
with less, though. In the example, I was trying to say that suite B covers
more code, so it will kill more mutants. Maybe suite A kills 20 mutants and
suite B kills 25 mutants, just to have some numbers to talk about. But if B
covers 50 mutants, and is only killing 25 of them, while A covers 25 mutants
and kills 20 of the 25, it seems like suite A is doing a better job of testing
the code it covers. Or to put it another way, suite B is broad but shallow
while suite A is focused but deep. B isn't necessarily a bad suite, but I
wouldn't say it's doing more with less, just that it has a different focus.
Maybe I'm misunderstanding your point, though.

Another way of thinking about it is that the raw mutation score measures
breadth: B is better than A because 25 > 20. The normalized score measures
depth: A is better than B because 80% (20/25) > 50% (25/50).

------
agroce
I'm biased (it's my research field, in part), but I'd suggest that studies on
coverage are all over the place, with this one showing a lack of correlation
and other studies showing good correlation between coverage and... some kind
of effectiveness (the ICSE paper mentioned above, also from 2014, a TOSEM
paper coming out this year, and a variety of publications over the years).

[http://www.cs.cmu.edu/~agroce/onwardessays14.pdf](http://www.cs.cmu.edu/~agroce/onwardessays14.pdf)
covers the Inozemtseva et al. paper as well as some other recent work, and
nothing in the time since we wrote that has modified my view that the jury is
still out on coverage, depending on the situation in which you want to use it.
Saying "coverage is not useful" is pretty clearly wrong, and saying "coverage
is highly effective for measuring all suites in all situations" is also
clearly wrong. Beyond that, it's hard to make solidly supported claims that
don't depend greatly on details of what you are measuring and how.

I suspect Laura generally agrees, though probably our guesses on what eventual
answers might be differ.

------
agroce
The SPLASH Onward! 2014 essay concludes with this advice to practitioners:

> In some cases where coverage is currently used, there is little real
> substitute for it; test suite size alone is not a very helpful measure of
> testing effort, since it is even more easily abused or misunderstood than
> coverage. Other testing efforts already have ways of determining when to
> stop that don’t rely on coverage (ranging from “we’re out of time or money”
> to “we see clearly diminishing returns in terms of bugs found per dollar
> spent testing, and predict few residual defects based on past projects”).
> When coverage levels are required by company or government policy,
> conscientious testers should strive to produce good suites that,
> additionally, achieve the required level of coverage rather than aiming
> very directly at coverage itself [56]. “Testing to the test” by writing a
> suite that gets “enough” coverage and expecting this to guarantee good
> fault detection is very likely a bad idea — even in the best-case scenario
> where coverage is well correlated with fault detection. Stay tuned to the
> research community for news on whether coverage can be used more
> aggressively, with confidence, in the future.

------
mabbo
While I think it's well agreed in this community that code coverage is not
all that big a deal, I think we should also consider the methods used in this
study before we all pat ourselves on the back for having 'proof' of what we
already know.

> we generated 31,000 test suites for five systems consisting of up to 724,000
> lines of source code

You auto-generated unit test suites, and you're surprised they weren't very
good at finding bugs? Well, no kidding, they were auto-generated! Would you
trust your unit tests to be generated by a computer? Of course not.

Do a study of real-world software and compare the unit test coverage to the
test suite effectiveness. Then I'll be interested.

~~~
jhpriestley
The unit tests are written by the programmers on each project, not generated
by a computer. The test suites are generated by selecting a subset of the
project's real-world test suite.

------
JoeAltmaier
Maybe we need to write tests to cover the test suites? Why not? They are code
too. Then they'll improve.

~~~
knughit
Yes, these are mutation tests -- Riddler or some such for Java. It modifies a
part of your code, and then if the tests that cover that code still pass,
that's a clue that the test is not effective.
