Just because code is touched during the test process doesn’t necessarily mean it has been tested. Coverage is more useful for finding chunks of code that aren’t exercised at all by tests; branches that never get hit probably deserve extra scrutiny. Coverage is an interesting heuristic, but 100% code coverage is not 100% bug-free code.
Likewise, there’s stuff that’s just not worth wrapping in a test. The effort involved in mocking/wrapping/abstracting sometimes isn’t worth it for some code, like simple calls to common libraries, reading a file, CLI help strings, etc. This stuff is generally better served by end-to-end or integration tests. Or it just gets exercised naturally as part of development. Or it can crash without much consequence.
My point is coverage is a tool that tells you where to look, not the one and only metric that tells you how reliable your code is.
Totally agree. In one of our projects (around 200K Python LoC) we had 90% coverage according to Codecov, but we still found bugs frequently.
Recently we started to heavily use mutation testing and fuzzing to find edge cases in parts of the code that were already "covered" according to the Codecov report. It was definitely worth it.
I highly recommend investing in mutation and fuzz testing.
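To make the idea concrete, here's a hand-rolled toy example (the function and tests are made up; real tools like mutmut or PIT generate and run mutants automatically):

```python
# A hand-rolled illustration of what mutation testing catches.

def discount(price: float, is_member: bool) -> float:
    """Members get 10% off."""
    if is_member:
        return price * 0.9
    return price

# This test gives 100% line and branch coverage...
def test_discount_covers_everything():
    assert discount(100.0, True) < 100.0   # weak assertion
    assert discount(100.0, False) == 100.0

# ...yet the mutant below (0.9 -> 0.5) would still pass it, because the
# test never pins down the *value* of the discount. A surviving mutant
# is the tool telling you the assertion is too loose.
def discount_mutant(price: float, is_member: bool) -> float:
    if is_member:
        return price * 0.5  # mutated constant
    return price
```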
I was just about to bring up mutation testing. I've had some pretty great success with PIT [1] when writing Java. Code coverage and mutation test coverage pair wonderfully together.
Mutmut author here. Great to hear you're using mutmut!
I think it's pretty important that we who know about MT and have used it keep talking about it. It's a pretty great tool that more people should have in their tool belt.
test coverage is calculated as the % of locations seen by the profiler during the test runs, right?
This metric completely ignores the fact that every independent conditional doubles the space of the behaviors of the program.
Assuming we have three conditions and three sequential groups of basic blocks corresponding to the true (X) and false (~X) values of those conditions, testing the A+B+C and ~A+~B+~C code paths gives 100% code coverage, but it covers at best only 2 of the 2^3 = 8 possible behaviors, i.e. 25%.
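A minimal sketch of that gap (made-up function):

```python
# Two tests give 100% line/branch coverage, but exercise only
# 2 of the 2**3 = 8 possible paths.

def f(a: bool, b: bool, c: bool) -> int:
    x = 0
    if a:
        x += 1
    if b:
        x += 2
    if c:
        x += 4
    return x

def test_all_true():
    assert f(True, True, True) == 7      # path A, B, C

def test_all_false():
    assert f(False, False, False) == 0   # path ~A, ~B, ~C

# Every line and both outcomes of every branch are hit, yet the other
# six combinations (e.g. A, ~B, C) never run at all.
```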
There are many types of test coverage measurement. I believe path coverage ("Has every possible route through a given part of the code been executed?") is your latter example. Quoting https://en.wikipedia.org/wiki/Code_coverage which also says:
> Full path coverage, of the type described above, is usually impractical or impossible. ... a general-purpose algorithm for identifying infeasible paths has been proven to be impossible (such an algorithm could be used to solve the halting problem).[14] Basis path testing is for instance a method of achieving complete branch coverage without achieving complete path coverage."
Mutation testing can prove that the tests check all the behavior of the code, but it can't prove that the code shouldn't have more behavior, or that this behavior is correct.
It's a really helpful tool for, as you say, telling you where to look. One trick I've learned: if you want to forecast where the bugs will pop up next month, take a quick look at the code coverage report.
#1 source will be the code that doesn't appear at all (because the test build system didn't even include it)
#2 will be all those red lines.
And that’s why you shouldn’t chase after code coverage percentages.
While there are uncovered lines, you have a TODO list that has been generated by the machine, telling you what needs more attention. If you go to that section intending to test it well, you’ll get a lot of lines of test code for a modest increase in coverage. If you’re just trying to boost coverage, you’ll check off the todo without accomplishing much.
The only place these two metrics line up at all is when you can keep the number of conditions in each method close to one, and the functions pure. Otherwise you have a bunch of little functions that accumulate state and you’ve just moved the combinatorics where it’s even harder to reason about.
I think the most you can say with any certainty is that coverage percentage should not go down when adding code. But staying flat doesn’t prevent things from getting worse.
Zero assertions means that you know there aren't any lurking exceptions triggered by just running the code. That's a good thing to know, but I'd hope like hell that it's not the only thing you want to know.
When “hold people accountable to metrics” is not just your company’s management philosophy but the gospel of its entire surrounding culture, this is not an argument that gets anywhere. It is always better to just cover the damn lines than make your Director explain to the VP why he has worse coverage numbers than a sister org.
The same way you should make an expected “no such object” result a 200 status code, so that she does not have to stand up and justify her error rate.
Note that you're now discussing corporate politics, not just creating software, even though creating software often has to deal with corporate politics.
My point is rzimmerman never contextualized this as a "how to navigate the workplace" argument; they are arguing it from a pure technical perspective.
> Coverage is an interesting heuristic but 100% code coverage is not 100% bug free code.
Absolutely correct. But at the same time, <100% coverage means you’ve got code where you can’t be sure there are no bugs because it doesn’t even get tested.
Strive for full coverage that hits every line AND tests the appropriate scenarios and edge cases.
Really? You want to strive to test that logging logs and observability observes? You want to test constructors, getters and setters?
Testing trivial code brings negative value, why would you do that?
> You want to strive to test that logging logs and observability observes?
You're asking this rhetorically, but I often find interesting problems once I start really exercising the latter, like:
- Non-global collectors we forgot to register, or fail to get registered on some code paths.
- Collectors that get registered twice or overlap with other collectors only in some conditions (e.g. connection addresses that are sometimes the same and sometimes not depending on how your service gets distributed).
- Collectors that are not really safe scraping concurrently with the rest of the program; our test suites include race detectors.
- Metrics that don't follow idiomatic naming conventions (can be found via e.g. `promtool check metrics`).
- Metrics with legitimate difficult-to-notice bugs. We had an underflow counting the size of a connection pool because `Close()`ing one twice was explicitly allowed.
> Testing trivial code brings negative value
Agreed in the abstract, but if we were good judges of what code was trivial or not, we'd write a lot fewer bugs in the first place!
That being said, this is also a discussion about coverage, not assertions. Even if I'm not actually checking any of those things, those code paths (and logging) are still getting exercised via any decent E2E/functional/behavioral test suite.
I noticed the bug when I went to look at a dashboard with that metric on it, and noticed it had the wrong name. If go's Prometheus library had an easy way to run "metric.CurrentValue()", I would have tested it... but it didn't, so I didn't. And then had to patch it, release it, and update all my servers. Writing the test (with that API) would have taken less than a second. Finding the bug in production and fixing it took an hour.
(That codebase has 100% line coverage, but of course, I know that 100% test coverage is a meaningless metric. That the code ran is a prerequisite to the program behaving correctly, but it also has to do the right thing.)
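The kind of one-line check I wish I'd had, sketched with Python's prometheus_client (metric names are made up, and this is just an illustration of the idea, not the library I was actually using):

```python
import pytest
from prometheus_client import Counter, CollectorRegistry

def test_double_registration_is_caught():
    registry = CollectorRegistry()
    Counter("requests", "Handled requests", registry=registry)
    # Registering the same metric twice is exactly the kind of bug that
    # only shows up on some code paths in production.
    with pytest.raises(ValueError):
        Counter("requests", "Handled requests", registry=registry)

def test_counter_actually_counts():
    registry = CollectorRegistry()
    c = Counter("requests", "Handled requests", registry=registry)
    c.inc()
    # Read back the current value instead of trusting the dashboard.
    assert registry.get_sample_value("requests_total") == 1.0
```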
Arguably you should be testing behaviour rather than getters and setters. If a codepath requires a name formatted correctly, then that forms the basis of a test and an assertion. Otherwise you're introducing fragility by coupling your 'under the hood' stuff with your 'API' (so to speak).
> You want to strive to test that logging logs and observability observes?
Sure, why not? All that requires is running tests with the log level turned up. And it’s not unheard of that e.g. somewhere in a log statement someone forgot to check that the object they’re calling toString() on is not null.
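A Python version of that failure mode, with made-up names; the guarded block only runs when DEBUG is enabled, so only a test run with the log level turned up will hit the bug:

```python
import logging

logger = logging.getLogger("checkout")

def finish_order(user, order):
    if logger.isEnabledFor(logging.DEBUG):
        # Bug: user can legitimately be None for guest checkouts, so
        # user.name raises AttributeError -- but only when DEBUG is on.
        logger.debug("finishing order %s for %s", order.id, user.name)
    return order
```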
You probably don't need to test any of those things but I would still expect them to be exercised. You can't very well test a class without constructing it.
> You want to strive to test that logging logs and observability observes?
A test that calls code (which counts as code coverage) does not have to verify that something specific happened. Often, this is a weakness of a test. The test does not verify what you think it would for various reasons...visibility, access to resources, etc.
Another advantage that does not get the love it deserves in my opinion, is how useful coverage becomes when it comes to upgrading a dependency, or even the language version. But your project will have to stick around for a while to appreciate that.
There is a big difference here between strongly typed, compiled languages and weakly typed interpreted languages.
High code coverage is much more important on e.g. Javascript or Python because running the code is the only way you know that it "compiles" and you didn't do something dumb like mistake the type of arguments, access fields that don't exist, etc.
In a compiled codebase, I think functional / end-to-end tests are more important. I don't care as much if every single function is exercised by unit tests if I know that the software works as intended for all supported use cases.
That's interesting because I think the exact opposite.
For dynamically typed languages I think it's more valuable to have end-to-end tests to make sure that the whole pipeline works correctly. I want to validate that all the call sites for function F are passing an int, as intended, as opposed to a string.
For statically typed languages, I already know that all call sites for F are passing an int. So I want to unit test all small parts of the pipeline, in isolation, to make sure I easily spot which part doesn't follow its contract. If a function F uses a function G for something, I'm gonna unit test F and make sure it does behave correctly for all possible returns from G, and that's it.
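Sketched in Python for brevity (names are invented), even though the point is about statically typed languages where G's return type is already pinned down:

```python
import pytest

def g(order_id: int) -> str:
    ...  # real implementation: returns "paid", "pending", or "cancelled"

def f(order_id: int, lookup=g) -> bool:
    """Ship only orders that are fully paid."""
    return lookup(order_id) == "paid"

# Unit-test F against every value G can return, by injecting G
# instead of calling the real one.
@pytest.mark.parametrize("status, should_ship", [
    ("paid", True),
    ("pending", False),
    ("cancelled", False),
])
def test_f_for_every_possible_g_result(status, should_ship):
    assert f(42, lookup=lambda _: status) is should_ship
```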
Dynamically typed languages are just plain unsuitable for the type of large project where you need to break your end-to-end tests down into unit tests. If you have a trivial program (under 100,000 lines of code or so), your end-to-end tests won't take that long to run if you take care; you will never be able to maintain much more than that in a dynamic language, just because change becomes so hard without static types.
That's not really true; change is incredibly _easy_ without static types. I've worked on a few large projects now, all in Python or JS, where the test time has never really been an issue, nor has the ability to make changes.
I'd say it's not really the language, but rather the abuse of it. The same goes for any project in any programming language.
Typed language also make it possible to exhaustively test inputs. For functions that only have a few dozen or hundred possible inputs we can exhaustively test every single case (even millions/billions as part of a slow suite).
For looser languages, maybe the function takes a fuzzy notion of "a number" - OK, do I need to try the string "12"? What about "012"? What about "12.0" or " 12" or "\n12"?
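For the typed case, exhausting a small input space can literally be a loop; a quick sketch with a made-up function:

```python
import itertools

def saturating_add_u8(a: int, b: int) -> int:
    """Add two bytes, clamping at 255."""
    return min(a + b, 255)

def test_every_pair_of_bytes():
    # The whole domain is 256 * 256 = 65,536 cases -- cheap to enumerate.
    for a, b in itertools.product(range(256), repeat=2):
        result = saturating_add_u8(a, b)
        assert 0 <= result <= 255
        assert result == min(a + b, 255)
```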
This is where I've become a staunch proponent of type hinting Python code. I've caught more subtle bugs by enabling mypy on new sections of code than by running and debugging in production. Of course, this requires adding type hints everywhere, which can be a struggle for older codebases, but MonkeyType is a great way to solve that, along with diligence on adding type hints to all new code you write, giving progressively more benefits as more of the codebase is covered by type annotations.
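A sketch of the kind of bug I mean (made-up code; the exact message wording depends on your mypy version):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    name: str

def find_user(user_id: int) -> Optional[User]:
    return None  # stand-in: the real lookup returns None for unknown ids

def greeting(user_id: int) -> str:
    user = find_user(user_id)
    # mypy flags this line along the lines of:
    #   Item "None" of "Optional[User]" has no attribute "name"
    # At runtime this only blows up for the unknown-id case.
    return f"Hello, {user.name}!"
```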
mypy is great but I've become enamoured with pyright[1] of late. It's super fast and catches way more problems than mypy does. It also catches syntax and semantic errors as a nice aside.
Maybe another way to put it is: if you are forced to use a weakly typed language you better have high code coverage (but really you've got bigger problems that could be solved with better languages).
Of course there are times where you need these weakly typed languages with lots of coverage. In a sense that is what the Web is, and the browser vendors use millions of websites for their coverage.
All these discussions in the comments about coverage percentages would be so much more insightful if everyone mentioned the kind of code they work on. Client-side JS? Real-time video processing firmware? A language support library? Automotive middleware? Java for an ATM? An ML library wrapper? Personal website backend in Rust? Android game engine? Cosmic particle simulation?
I don't even care about the implied "what could go wrong" but about the size of the input spaces and the complexity of the algorithms behind them. I certainly don't subscribe to the idea that there is a one-size-fits-all testing strategy for all the above, and I see little gain in arguing for or against a certain approach with someone who has entirely different constraints and code architectures to work with.
Require 100% code coverage on an async task library? Sure, go for it. 100% coverage on a (deep-learning-free) algorithm identifying constellations in pictures of the night sky? Do you really want to have to violate every single assumption one by one in your unit tests? How do you even get that input data (and who will create more nonsense data when you improve the algorithm)?
Or in other words: For any reply in this thread you can probably come up with a scenario where it makes sense to operate as described in it. There is some nice perspective but overall I find it a bit unactionable...
I'm a strident adherent to having 100% code coverage, but this mostly works because of my history in infrastructure and a desire to sleep well. The key idea I have found is that if you can't synthetically get your process into a specific state, then life is going to be hard.
The problem with a lot of people is that they view testing as a burden rather than a criticism of the code they are testing. If you can't test your code easily, then your code is awful. End of story.
I wrote a UDF for MySQL. It had just shy of 100% coverage. It used a malloc. I could not figure out how to trigger a malloc failure inside of an imported library running on a MySQL instance.
Do I not check for malloc failure (in order to get 100%)?
Or do I develop some sort of instrumented version of MySQL which lets me test malloc failure? That seemed far more overkill than warranted for the small project I worked on, though I certainly know some projects which do that!
I decided to accept 99.5% or so coverage and manual inspection of the failure case.
A core cultural challenge that we have to deal with is that people are very soft about testing, and they don't make their artifacts or platforms testable.
There are diminishing returns to what I believe in, but it also depends on one's dependencies and what one considers 100%.
As an example, I designed a multi-TB file-format and synchronization engine. I made that engine a dependency free library such that all the math, contracts, threading, and scheduling could be 100% tested before integration. This forced the messy integration into the core application which then was E2E tested with hundreds of randomized tests.
The key concept at play was beating all the math into the ground such that it was damn near perfect because a single flaw would lose data.
Yes, I figure it would have been easy to get to 100% if MySQL had been designed so that plugins could be coverage tested.
For example, there might be an "environment" API which abstracts memory management and file I/O. The plugin could use this API instead of direct calls. Then if MySQL had an option to use a specific environment implementation for a plugin, then testers could drop in their own implementation under test.
Getting that sort of support by MySQL is an example of what I think you mean by "cultural challenge."
I considered using a shim, but couldn't figure out which malloc comes from my UDF and which comes from MySQL, and I was highly doubtful that MySQL would have identical numbers of malloc by the time my code runs.
I believe you can override libc's malloc by passing in LD_PRELOAD, at which point you've got the basic building blocks to test malloc failure if you can find a way to tell your malloc call apart from the others.
Having said that, I don't disagree with your choice, because messing with LD_PRELOAD is difficult, scary, and probably overkill anyway.
Totally agree with you. This really is a concept that you need to keep in mind when you write your code: "testability". And I find that when you get this, you also get a bunch of other properties for "free", orthogonality and reusability for example. It's a force multiplier and kicks productivity into high gear when you know that your lower-level components work as expected and you don't have to go on a wild goose chase when something is constantly broken. To summarize, I find that writing tests early on reduces product velocity, but the compound effect of good quality and regression prevention gives you higher velocity later on.
I don't think so. I once took on the challenge of writing as many unit tests as possible for a project at work (the project did not have unit tests - but was well covered with other types of tests).
The two key takeaways I got from the effort:
First, I had no idea how coupled my code was until I tried writing many unit tests for it. If it's hard for you to instantiate your class without involving N other libraries/objects, your code is very coupled. No one in my team would have looked at the code and said "It's highly coupled". The real proof was "Can you test this in isolation?" If not, you're strongly coupled.
I had to redesign a lot of bits to succeed, and as another commenter pointed out, in my attempts to do so my code really was a lot better. I discovered good principles of design in doing it.
The second thing I learned, which may appear to disagree with the first: There are always easy ways to write code to be unit testable. Most of those easy ways are bad and reduce the "quality" of your code. Forcing yourself to not redesign just "for the sake of writing tests", while still ensuring you have 100% code coverage, will really force you to think heavily about your code, architecture, failure points, etc.
So a 100% code coverage really doesn't tell you if you have good code. But less than 100% does indicate potential problems.
It's both project and language specific. A trivial example is C++ classes with private methods. Some people do a hack that when compiling for testing converts all private methods/attributes to public. This way it's easy to test private methods individually. Please don't do this.
Unit testing C++ is not as easy as in some other languages. The language is fairly rigid. Some people make heavy use of friend classes to assist with testing, but this can be overdone (indeed, many C++ developers are against any use of friend).
I'm not. One of my side projects for example is a programming language where I have 100% code coverage.
From a career perspective, I have seeded new infrastructure services with 100% code coverage. At core, I believe achieving great reliability requires a solid foundation.
I don't claim these services are perfect, but when a bug is discovered then I can usually use the logs to figure out what state the program is in and then test the bug with a unit test. Then, I can protect future engineers from that issue.
Now, don't get me wrong, there are LOADS of silly tests for languages like Java to make code coverage tools happy. Like "new ClassThatOnlyHasStaticMethodsInIt()", but the key is that you can alarm on code coverage less than 100% but once you let the paper cuts build up it is hard to manage.
I'm a big believer in "slow is smooth, and smooth is fast" when it comes to building services for others.
What I often see is people who want 100% coverage AND want tests to run fast. And often that leads to using mocks and other techniques. Once you go down that road you end up with brittle tests, and your developers spend a lot of time writing and repairing tests.
I'm not against testing, but I feel people waste a lot of time writing tests that are more a liability than an asset. I'm hoping we get better tools soon and people look back and wonder WTF were people thinking.
Conversations like this are meaningless without first determining what kind of software we are talking about. If it's a first version made quickly to test an idea, the ideal would be 0%. Rapidly developing product, 40%; mature product, 60%; database (like Postgres or SQLite), 100%; and any software that deals with money, health, or airspace should have each line of code covered at least three times by completely different tests, both unit and integration.
There's no one size that fits all in software development. And if you don't specify what software you are talking about, everybody will just talk about their own domain and experience.
> 1/3 of the code of every software project is irrelevant, buggy, overly complicated, or simply sucks. It has a reason to be where it is, but chances are, one year down the road, it will become a liability. Being dogmatic about tests and covering every line will only make it more difficult to get rid of it.
If you have actually tested every line of code with your test suite (not the same as "covering" every line), then your code better not be irrelevant, buggy, or simply suck. That is the sort of thing you are supposed to uncover and fix as part of the whitebox test development process. If your covered code is that bad, you either haven't actually tested the code or you have completely missed the point of software verification.
Code coverage can be a very misleading metric. I have to clarify the difference between "tested" and "covered" code to coworkers and management all the time. In the same way that "standard lines of code" is an insufficient measurement of complexity or development effort, "coverage" is an insufficient measurement of test quality. However, it can still help with ballpark estimates.
Uncovered code is actually a much more useful metric, IMO. You can cover code without testing it, but there is no way you could have tested uncovered code.
I think the point the author should be making here is that not every app needs to be 100% tested.
> Uncovered code is actually a much more useful metric, IMO. You can cover code without testing it, but there is no way you could have tested uncovered code.
Hurrah, finally I see someone with the same viewpoint I've pushed for some time, though for a different reason.
For me, 80% covered isn't the same thing as 20% uncovered. The difference is the code that has been examined, where someone decided that for one reason or another tests aren't worth writing. Could be untestable (elsewhere in these comments is a malloc example), too difficult to reasonably test (API calls abstracted behind a facade; mock the facade in other tests but the facade itself is kept tiny and only tested manually), utilities like python's "__repr__", etc, etc.
Having a way to mark such code as "does not need tests" is the key to making this viewpoint useable though, and I'd also really like it if coverage tools focused on counting uncovered down to 0% rather than covered up to 100%.
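Python's coverage.py already supports this via `# pragma: no cover` (and you can define your own markers with `exclude_lines` in its config); a sketch with made-up code:

```python
import json

def load_config(path: str) -> dict:
    try:
        with open(path) as fh:
            return json.load(fh)
    except OSError:  # pragma: no cover  (reviewed: exercised manually, not worth mocking the FS)
        print(f"could not read {path}, falling back to defaults")
        return {}
```

The marked clause is excluded from the coverage totals, so what's left in the report is genuinely "uncovered and unexplained" rather than a mix of both.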
I was also going to say around 60-70 pct, but for a different reason: what's left in my code is mostly checking of assertions, debug logging and handling rare errors, i.e. code that's not supposed to run. This is code that will dump some state to the logs, then crash without causing corruption.
It tends to be 'tested' accidentally when I am making big changes, and I add more of it if I get a crash that's undebuggable from dumps and logs.
> I was also going to say around 60-70 pct, but for a different reason: what's left in my code is mostly checking of assertions, debug logging and handling rare errors, i.e. code that's not supposed to run.
I grant you assertion-checking, but code for handling rare errors is not the code you should skip writing unit tests for. If it runs infrequently then you're far less likely to stumble on a regression during other testing, so it's even more important to unit test.
Depends on the rare errors. I mean things like: my database crashed halfway through a transaction, my file system drops out from under my application, my back-end service gave up.
You can't do much here. Dump some info, abort the half-done work, maybe try again somewhere in the future. And yes, that last part might deserve a test.
If you can't do much, it shouldn't be much code, so coverage should stay high.
I used to be of the "70% is good enough" school, but I found that as I wrote better programs, I also achieved better test coverage. Not because I was writing more tests, but because I was choosing core designs with fewer edge cases, simplifying my error handling with fewer reachable paths, and admitting I might as well just crash in more cases. So now my line/branch coverage is more like 95-100%, but my code is easier to read and I write _fewer_, mostly functional/behavioral, tests. Most lines I don't reach are fatal errors, and most fatal errors are the only statement in their branch.
Code you can't reach from test cases isn't a sign you should write more tests, it's a sign you should remove that code from the program.
Today I view 90-95% really as a baseline, and focus more on path coverage (most standard tools are quite bad for this still) and edge cases in data (e.g. denormed floats or different kinds of invalid data I want to make sure stay invalid) that don't affect coverage one way or another.
It adds up surprisingly fast. I live mostly in the Java world, and libraries throwing 3 or so exceptions you're required to catch are common.
I'd agree that 90 pct is doable for your own algorithmic code. Glue code for libraries, commonly not very well thought out, requires a very different style: defensive to the point of paranoia, catching every gremlin at the source. It's not uncommon to have more than half of the code dedicated to exception handling. Dumping said libraries is rarely an option, unfortunately.
I'm not necessarily disagreeing with you, but maybe we live in different code worlds.
We definitely do, and I see how that might apply - I do write some Java but avoid as much as possible the disaster of checked exceptions. Mostly I am talking about Python, Go, TypeScript, and other more esoteric stuff.
I don't agree that this defensive style is necessary though - I do (like everyone these days) write a ton of glue code, and that's as I described. Java is nearly as bad as JavaScript for language-cart-pulling-program-horse designs, it's just a different set of mistakes. It could be so much better if developers just did less.
I have seen more than a few problems be caused by this not being done right. IIRC it was an element in the events that destroyed Knight Capital in just a few minutes.
If there is something in your code that can put your system in an inconsistent state, and your testing has not covered that scenario, then it is irrelevant what percentage of the code was covered by that testing. This combinatorial complexity and history-dependence is the reason why code coverage is a misleading indicator of quality: 100% code coverage is very far from 100% scenario coverage.
Even better are claims of data coverage. USER: And every possible combination of data has been tested and validated to be correct like your management assured us? PROGRAMMER: No, that's simply not possible. For example, if there are 10,000 possible values for variable one, 1,000,000 values for variable two, and 5,000 values for variable three, that'd be 50 trillion possible combinations. And you have thousands of variables. USER: OK, but if that's true, why did your management assure us you'd verify every possible combination? PROGRAMMER: Because that's what you wanted to hear?
Code coverage is an excellent tool but a terrible metric.
It's basically the lines-of-code of test driven development. It's useful as a general guide for developers, but its use as a management metric is extremely destructive.
I generally agree with the crux of this and would like to take it a step further. It's far more important to test edge cases and the lesser used endpoints.
We have major parts of our product where the coverage isn't great, but get tested _every time_ you use the product.
The coverage on our login process is mediocre, but every time a Developer or QA person uses the product, they log in. We'd hear about it. Our landing page post login is largely the same story.
Our rarely used tools several clicks deep into the site _all_ have much better coverage.
We don't have continuous deployment and it's not a goal.
> Being dogmatic about tests and covering every line will only make it more difficult to get rid of it.
I find the opposite to be the case. I'm very comfortable tossing away a bunch of code with great test coverage. You can always cherry-pick it back later and know that it works.
(I'm not 100% purist, but certainly I'd say 80-90+% and not 66%).
The important thing is you should be so fast at writing testable code that getting to 80% is like going for a light jog. If it feels like a slog, you've identified weaknesses in your craftsmanship that you should practice.
> You can always cherry-pick it back later and know that it works.
Exactly, "this is what git is for". I think being uncomfortable tossing tested code is more of a reflection of the temperament of the person saying it than any truism about coders in general. If anything a little bit of a roadblock with test cases can be a good thing. "By deleting this test I am really certain I want to remove this feature".
There's one place in my code base where I have extra tests in there, that I have to replace every time (it's snapshot-testing code-generating code), but that one little roadblock make me feel good because I know that the code that is being emitted is human-readable and easy for an end-user to debug.
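For anyone curious, a hand-rolled golden-file version of that kind of test looks roughly like this (the paths and the generator are invented, not my actual setup):

```python
from pathlib import Path

GOLDEN = Path(__file__).parent / "golden" / "accessor.py.txt"

def generate_accessor(field: str) -> str:
    # stand-in for the real code generator
    return f"def get_{field}(self):\n    return self._{field}\n"

def test_generated_code_matches_snapshot():
    generated = generate_accessor("name")
    # To update after an intentional change: overwrite GOLDEN with the
    # new output and review the diff like any other code change.
    assert generated == GOLDEN.read_text()
```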
> I have to replace every time (it's snapshot-testing code-generating code),
Do you store your snapshots separately?
One note for devs reading this: I think snapshot tests are very cool, but I recommend trying them as a very separate thing than unit tests (if you are not already doing that). I removed all snapshot tests from our unit testing pipeline and it's much better for it. Was very surprised that Jest doesn't have warnings that snapshot testing is a very different thing, and should not be intermixed with unit tests.
Snapshot testing is sort of like code coverage, a great tool for telling you what to look for, and you should make it so that doing snapshot testing is easy, but the snapshots themselves should be stored outside of the main repo/main testing repo, and in their own thing. Not sure if you are doing that or not, but for anyone else looking at snapshot testing, this is a common mistake I see.
Normally I would agree for webapps but in this case, it's for generating low-level code.
In general, it's only a mistake if it's burdensome. I only have about 30 of these tests, and they are for critical code (I want the end user to be able to review the test and be confident that the generated code doesn't contain memory leaks, etc).
But can't you identify the signal of the input and output and write small test cases to test just the signal?
Snaps test the signal + noise. They help alert you when either changed, which is helpful if you think you may be missing tests for the signal, but ideally you would just make sure all the signal is tested, and the snaps would be just a later, independent sanity check tool.
Obviously you know the details and often once I learn the details I'm like "oh yeah in this case I see how the cost benefit makes sense". Just more speaking in generalities.
I find that the way a test is written substantially affects people’s willingness to delete them, so while there is some truth to what you say, it’s not the entire picture.
Near as I can guess, it’s related to the anchoring. I’m replacing a few tests with a new set of tests. How complicated should they be? The old test was simple, I should replace it with a simple test. I need six of them, six simple tests ain’t so bad. Nearly took me longer to think about it than to do it.
If I write three-line tests, the test quality tends to erode slowly. If I leave sketchy tests, they turn to garbage disturbingly fast.
As with so many other code quality metrics and tools, automated testing (and by extension code coverage) is not necessarily about the number but about discovery.
If you are finding out that certain parts of your software are hard to test or will never be reached by any of your tests, that should tell you something about your architecture.
Hey Jerrod here, CEO of Codecov. We live in coverage world every day and I agree that there is work to be done.
Fundamentally, no one ever said that coverage had to just be a binary of 0 or 1 as a "hit" line, though it is definitely the branding that code coverage has today.
As some others on this thread mentioned, how something is being tested should matter:
- Unit vs. integration vs. end-to-end test vs. other testing
- Flakiness of test (non-determinism)
- Accuracy of test (e.g., mutation testing / fuzzing)
Also, what is being tested matters:
- How important is this line? How often is it actually being called in production?
- Are there errors / exceptions on this line that we can see?
I do not believe that 100% coverage is the goal, nor should it be for all teams.
Do you have feedback here? I'd love to talk more about this.
Because code coverage should measure coverage of REQUIREMENTS, not lines of code.
The right coverage metric would take as input a list of all requirements broken into their smallest conditions and then check that they are all verified by a unit test.
In particular, this would cause almost all code to be covered because if a line of code was not covered by test it would mean it is not necessary to meet any of the requirements.
Interestingly enough, unit tests would also be then complete documentation of every single requirement and that is how I understand "test driven development". Sadly, I have yet to see it implemented anywhere.
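A toy of what that could look like in Python (the requirement clauses and helper are invented; nothing here is standard tooling):

```python
REQUIREMENTS = {
    "REQ-1.1 reject orders with a negative total",
    "REQ-1.2 apply the member discount before tax",
}

VERIFIED = set()

def verifies(req):
    """Decorator: record that a test exists for the given requirement clause."""
    def mark(test_fn):
        VERIFIED.add(req)
        return test_fn
    return mark

@verifies("REQ-1.1 reject orders with a negative total")
def test_negative_total_is_rejected():
    ...

@verifies("REQ-1.2 apply the member discount before tax")
def test_member_discount_applied_before_tax():
    ...

def test_every_requirement_clause_is_verified():
    # Fails, naming the clause, as soon as a requirement has no test.
    assert REQUIREMENTS <= VERIFIED, REQUIREMENTS - VERIFIED
```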
Hmm... in my experience, stable code is more likely to be correct than code that changes often. In fact, I would consider this obvious, given that there's a non-zero chance to introduce a bug with every change.
> The only sure-fire way to improve code coverage (and by that keep software relevant) is to identify and remove the unnecessary code.
I remember what I think is a demoscene group taking that to the extreme. They ran coverage analysis and removed all uncovered code, thus achieving 100% coverage.
The point was to fit all their code in a very small executable (typically 4k or 64k). It was absolutely terrible for robustness: all edge cases were ignored. Since it was a demoscene production, it didn't matter, but if it had been a server, that would have made it the most exploitable code ever.
100% is the way to be. Once you get there, then you can start having useful conversations about how to make the coverage more meaningful. But first, you have to make sure that every line is run at least once without crashing (except for the lines that are supposed to crash things - those you need to verify DO crash things).
Interestingly, once something is designed for testability it is more likely to not have bugs. But saying that therefore you don't need the tests is silly, because without the tests you wouldn't have designed for testability.
Disagree, empirically. Once you bury the needle, nobody wants to talk about whether the tests are garbage, because doing so means that you have to admit to misrepresenting things to management. So you’re either calling people out or complicit in the lie. People don’t want to look at these actions honestly, so they deflect to protect their egos.
None of these dynamics are an issue when you’re at 80% code coverage. It’s easier to fix how we do things while still slowly raising the stats.
You’re trying to run when you haven’t learned to walk (or with a few teams I’ve seen, crawl).
Back in my undergrad software testing course, we were taught test coverage in the context of a control flow graph.
I think most coverage measurements (at least in the GitHub projects where I've seen a coverage badge) only refer to statement coverage where it checks if every node of the CFG is reached. It's one of the weakest forms of testing and will rarely reveal the more obscure bugs that edge/(simple, prime, complete) path coverage testing can reveal.
I personally think the industry should stop using (node) coverage as some kind of golden metric.
In our teams, it's less about bugs and more about the team culture. A test mindset that ranges from the business ideas, to the new features developed, to the code that we write.
We enforce 100% test coverage. Not 90, not 95, not 99, but 100. You need to test your code if you want to reach production. And production is the only environment we have. Having just prod generates less friction and allows us to make more changes, more features. But in order to live by this premise, the team needs to be really good at testing.
Interesting article, but the answer I'd be looking for is #itdepends.
What's the impact of a bug in this code base?
How likely is there to be a bug?
How quickly can you rectify a bug if it happens?
Most of my Makefiles and setup.py files have 0% test coverage. Libraries that are automatically deployed to a few projects, where I can't easily ship a fix and I don't really fully understand the impact - I'd aim for 100%, but not arbitrarily; I want to make sure we've actually tested the edge cases.
I always say between 60-80%, 60% for greenfield or new testing as that is generally right for covering the golden path(s) and the major error cases. It should slowly grow over time as bugs get tests written for them, but if it gets over 80% you probably want to refactor some.
>80% test coverage digs way too far into the code, testing the implementation instead of the interfaces/APIs. It is a definite code smell that probably means the tests were written poorly and need some major reworking.
> that is generally right for covering the golden path(s) and the major error cases
The thing is, your metric of 50-80% has nothing to do with if you're actually covering the golden path(s) and major error cases. The error cases you didn't think of, wouldn't count in this metric of course, but they might still be missed.
Code coverage as a metric says as much as number of lines as a metric. Meaning nothing, it's just a number. Aiming at any sort of code coverage metric misses to think about what to actually cover. 0 code with 0 tests has 100% code coverage, doesn't mean it's actually a good program.
Loosely coupled code could well have unit testing and mocking covering >90% with great results. We should not conflate unit and integrations tests here. I have also found regression testing to be a great way to declare and interpret intent when trying to understand someone else's code.
I think the classic book "Working Effectively with Legacy Code" declared "legacy code" to simply be code that has no tests.
> Being dogmatic about tests and covering every line will only make it more difficult to get rid of it.
This is only true if one is lackadaisical about how they architect their code and (unit) tests. You should be able to completely smash a function and its tests without breaking any other tests... If not, then you're testing a different unit within those broken unit tests.
Someone on my team likes to write their own mocks instead of using the mocking framework. A lot of people don’t really understand the point of testing, but this behavior is pretty far down on the spectrum.
The mock is hard to write by hand, so they get sunk cost fallacy and share it between tests. Now the tests are coupled to each other, which makes it hard to change features. But wait, there’s more.
They made space for some delegation/composition that never arrived or disappeared, and so they’re sharing the same mock across two different test suites, in two different directories. And made very coarse grained commits, so even if I wanted to know how tf they got here, which I really don’t, I’d have trouble doing it.
Writing complex scaffolding for your tests is supposed to hurt. Pain is your body telling you something is wrong. If you can’t set the preconditions in a few lines then you don’t understand the problem, or you don’t understand your own code. Both are bad for your coworkers.
Big decisions are the culmination of a bunch of little decisions. They are not the little decisions themselves. Don’t write your code like they are, and all of this shit gets ten times easier.
> A lot of people don’t really understand the point of testing, but this behavior is pretty far down on the spectrum.
I did "unit testing" and dependency injection of "mocks" for years before I actually gave them the thought and learning they required. Frequently we treat testing as a barrier to the the already "good enough" code we want to put in production. We miss that the larger production is (feature wise), the worse the tradeoff of marginally adding one feature to potentially taking many/all down...
Reading the sinon.js documentation https://sinonjs.org/releases/v9.2.1/ really helped clarify the roles of the various spies, stubs, mocks et al., which IMO is required to use them properly.
The mock should display the same behaviour as the real thing.
So I don't think sharing one is bad.
The ideal fake is one that shares the same behaviour as the real thing. And you have tests that run across both to validate this.
Quite frankly mocks become the bane of large code bases.
I don't know how many times I've seen tests that are basically useless because the mock behaves nothing like how it behaves in the real world. Like throwing an exception for not found rather than null.
Keeping those contracts in sync are much easier with fakes.
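A sketch of running the same contract tests across the fake and the real thing (the store interface and names here are made up):

```python
import pytest

class RealStore:
    """Stand-in for the production client (e.g. backed by a database)."""
    def __init__(self):
        self._rows = {}
    def put(self, key, value):
        self._rows[key] = value
    def get(self, key):
        return self._rows.get(key)   # contract: missing key -> None, not an exception

class FakeStore:
    """Hand-written fake used by fast unit tests."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)   # must match the real contract

@pytest.fixture(params=[RealStore, FakeStore])
def store(request):
    return request.param()

def test_missing_key_returns_none(store):
    # The "not found" behaviour mentioned above, checked against both
    # implementations so the fake can't drift from the real thing.
    assert store.get("absent") is None

def test_roundtrip(store):
    store.put("k", "v")
    assert store.get("k") == "v"
```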
You can’t change or reorganize the behavior of the real thing without changing everything that uses it, all at once. This is challenging enough without elaborate handwritten mocks, but much worse with them. Most people don’t have the sort of stubbornness it takes to overcome this sort of adversity. So you’ve locked in your tech debt even harder.
If you have one function simultaneously using large parts of another chunk of code, such that you feel like you should write logic instead of stubs to simulate it, you already have a huge coupling problem that you should fix, instead of shoveling more code after bad. That’s what I mean by “it should hurt.” The friction is not a bug, it’s a feature. Slow your roll and look at your busted architecture, instead of cementing it in place with a layer of tests.
The other difficulty with mutating your mocks or otherwise changing testing tools is with negative tests. The changes can and sadly do end up creating tests that can’t fail (evergreen) but still increase coverage and confidence. Like a safety railing that has corroded, or a broken smoke detector. Simple mocks that exist entirely within the test, or the suite at farthest, are more amenable to change. Our job is change, when you get right down to it.
I would love to see a different metric be introduced: Code elusion. All it is is the inverse of code coverage.
It would do a better job with the mental game. You may not be able to conclude that code is adequately tested just by knowing that the tests exercise it. But you can be pretty confident that code is not being tested if it eludes all of the tests.
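You can approximate it today from an existing report; a sketch assuming coverage.py's JSON output (`coverage json` writing coverage.json with a `totals.percent_covered` field):

```python
import json

with open("coverage.json") as fh:
    report = json.load(fh)

# Elusion is just the inverse of coverage: count down toward zero
# instead of up toward 100.
covered = report["totals"]["percent_covered"]
print(f"code elusion: {100 - covered:.1f}% of lines never ran under any test")
```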
What types of solutions exist for automating the creation of front-end tests? Either full automation, or partial with some user input/direction. I found https://kwola.io/ but wondering if there are other alternatives