
I've had to work on mission critical projects with 100% code coverage (or people striving for it). The real tragedy isn't mentioned though - even if you do all the work, and cover every line in a test, unless you cover 100% of your underlying dependencies, and cover all your inputs, you're still not covering all the cases.

Just because you ran a function or ran a line doesn't mean it will work for the range of inputs you are allowing. If your function that you are running coverage on calls into the OS or a dependency, you also have to be ready for whatever that might return.

Therefore you can't tell if your code is right just by having run it. Worse, you might be lulled into a false sense of security by saying it works because that line is "covered by testing".

The real answer is to be smart, pick the right kind of testing at the right level to get the most bang for your buck. Unit test your complex logic. Stress test your locking, threading, perf, and io. Integration test your services.




This was one of the fun things about working on Midori: the whole system was available to us, components all communicated through well-defined interfaces, and you really could (if necessary) mock the interfaces above and below and be sure that your logic was being exercised. The browser team, in particular, pushed hard to be close to 100% coverage. The overall number for the project was (IIRC) in the 85% range.

When I'm writing tests, I'm not so concerned about what the tool tells me about which lines are covered, I like to work through the mental exercise of knowing which basis paths are covered. If a function has too many for me to reason about, that is a problem in itself.

As an aside, while I'm rambling, all the examples in the article appeared to represent unnecessary abstraction, which is the opposite problem. If you have many methods in a class with only one basis path, what purpose does the class serve? These testing concerns may be the code smell that points to deeper problems.


> As an aside, while I'm rambling, all the examples in the article appeared to represent unnecessary abstraction, which is the opposite problem.

One other thing about code coverage, unit testing, and other testing fads is that I think they actively affect the architecture, and usually in the direction of overthinking it.

Instead of having one tight bit of procedural code (which may have some state, or some dependency calls), people split it up into multiple classes, and then test each bit. This allows them to use the class architecture to mock things, but really has just multiplied out the amount of code. And in the end, you're running tests over the golden path probably even less. It's even possible to have 100% of the code covered, and not run the golden path, because you're always mocking out at least one bit.


I think there's an argument to be made that if the desire for testing is a main driver for your architecture then your tests are too granular, and you aren't testing any internal integration points. In my experience that means that your tests are so tied to your current architecture that you can't even refactor without having to rewrite tests.


Mocking / interaction / expect-style testing breaks encapsulation in order to perform "testing".

Thus it is often a test of the implementation's assumptions when first written. Even worse, when the code is maintained/edited, the test is merely changed to get it to pass, because unit tests with mocks are usually:

1) fragile to implementation
2) opaque as to intent

Whereas input/output integration points are more reliable, transparent, and less fragile to implementation changes if the interface is maintained.

However, if you must do mock-level interaction testing, Spock has made it almost palatable in Javaland.

This is one area where functional fans get to make the imperative folks eat their lunch.


Exactly. I've seen tests where some code calls:

  printf("hello world");
And the test is:

  mock_printf(string) {
    if string != "hello world" then fail;
  }
Which is basically just duplicating your code as tests.


I think that depends on the language/environment... for example, it's usually MUCH easier to get high coverage, and to do just enough mocking, when testing against Node-style modules in JS than when testing logic in a typical C# or Java project.

Usually that's because in C# or Java, interfaces need to be clearly/well defined and overridden, whereas in JS there are tools (rewire, proxyquire, etc.) that can be used with tests in order to easily inject/replace dependencies without having to write nearly as much code to support DI. In fact, I'd say that there's usually no reason not to be very close to 100% coverage in JS projects.


It's strange to me that we continue to make languages that force us to make certain architectural choices to help facilitate testing.


We have had it since Eiffel and Common Lisp, but not all mainstream languages have been keen on adopting design by contract.

On .NET it is a plain library, which requires the VS Ultimate editions to be useful, and on Java it has mostly been third-party libraries.

The C# design team is considering adding proper support as a language feature, but it seems to be a long way off still.

C++20 might get contracts, but C++17 just got ratified, so who knows.

D, Ada 2012 and SPARK do already support contracts.


Indeed. It's almost as if testing should be built into the language itself. (I'm always a big fan of including self-test code in projects)


And it is built in in D:

http://dlang.org/spec/unittest.html

It's been very effective at improving the overall quality of the code. And because of CTFE (Compile-Time Function Evaluation) and static assert, many code correctness tests can even be run at compile time.


What are the languages that have testing not as an afterthought?


This depends on how you think of things. Some languages (Eiffel?) have pre- and post-conditions, which do some of this job. Much of the boilerplate testing done for dynamic languages is done by the compiler for static languages. The borrow checker in Rust is doing the work that a test suite and/or static analysis tool would be doing for C/C++.


Definitely Ruby in general; testing is a core part of the language culture.


That doesn't mean testing isn't an afterthought in Ruby, the language. It means the culture built up a defense against writing buggy code.


I think that's the nature of a dynamic/scripted language... it's just easier to handle in Ruby and JS than many other languages.



> testing should be built into the language itself

I wonder what that would look like...


Rust has simple unit testing built into the language. And in Ada/Spark, tests can be a first class verification tool, alongside formal verification.

We should go a lot further though. IMO, a unit that does not pass a spec/test should cause a compile time error. Testing systems should facilitate and converge with formal verification. Where possible, property based testing should be used and encouraged. And debugger tools should be able to home in on areas where the result diverges from expectations.


> IMO, a unit that does not pass a spec/test should cause a compile time error.

We can achieve this in dependently-typed languages like Idris. First we define a datatype 'Equal x y', which will represent the proposition that expression 'x' is equal to expression 'y':

    data Equal x y where
      Refl : (x : _) -> Equal x x
There are two things to note:

- There is only one way to construct a value of type `Equal x y`, which is to use the constructor we've called `Refl`.

- `Refl` only takes one argument, `x` (of arbitrary type `_`), and it only constructs values of type `Equal x x`. This is called "reflexivity", thus the name `Refl`.

Hence if we use the type `Equal x y` anywhere in our program, there is only one possible value we can provide that might typecheck (`Refl x`), and that will only typecheck if `Equal x x` unifies with `Equal x y`, which will only happen if `x` and `y` are the same thing; i.e. if they are equal. Thus the name `Equal`.

This `Equal` type comes in the standard library of all dependently typed languages, and is widely used. To use it for testing we just need to write a test, e.g. `myTest`, which returns some value indicating pass/fail, e.g. a `Boolean`. Then we can add the following to our program:

    myTestPasses : Equal myTest True
    myTestPasses = Refl myTest
This will only type-check if `Equal myTest myTest` (the type of `Refl myTest`) unifies with `Equal myTest True`, which will only be the case if `myTest` evaluates to `True`.


Force.com doesn't allow you to deploy code to production unless at least 75% of the statements are executed[0]. I wish my employer had a similar requirement.

[0] https://developer.salesforce.com/docs/atlas.en-us.apexcode.m...


^ This.

And another example of something I'd want checked by the language is exception throwing / handling. It's another one of those places where code coverage won't help you unless you already know what you're looking for. Languages are getting better about it, but in general, handling errors is hard.


Pure languages like Haskell might be the closest thing at the moment.

In Haskell you embed domain-specific languages when you require side effects:

    class MonadState s m where
        get :: m s
        put :: s -> m ()
This is like an interface specifying the basics of all stateful computations.

You can use different implementations for production and testing without changing any code so mocking is built into everything.


Didn't Smalltalk do something like that? I thought it had TDD built into its standard IDE.


Dynamic languages can kinda cheat here. Python has mocking built in to the core lib.
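
For instance, a minimal sketch using the standard library's unittest.mock (the `fetch_first_line` function here is made up purely for illustration):

    import urllib.request
    from unittest import mock

    def fetch_first_line(url):
        # hypothetical code under test: fetch a URL, return its first line
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode().splitlines()[0]

    def test_fetch_first_line():
        fake_resp = mock.MagicMock()
        fake_resp.read.return_value = b"hello\nworld"
        fake_resp.__enter__.return_value = fake_resp
        # patch the dependency without any DI scaffolding in the code itself
        with mock.patch("urllib.request.urlopen", return_value=fake_resp):
            assert fetch_first_line("http://example.com") == "hello"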


> If you have many methods in a class with only one basis path, what purpose does the class serve?

It ties together related algorithms. You want "addDays(n)", "addHours(n)", "addMinutes(n)" and so on to be in the same class, even if they're one-liners.


Clearly there are good examples either way. I'm more interested in the idea that this strong mismatch between the complexity of the code and the complexity of its tests is, in itself, a code smell that may point to an issue that is nothing to do with TDD.

I note that the examples you give have trivial test cases - presuming you don't care about overflow, and if you do, then those methods now have multiple basis paths; whether they're explicit or implicit depends on the language.


This is what I love about the design and operational philosophy of the Erlang/OTP environment: the trickier class of harder-to-test/reproduce transient/timing pre-defensive-coding issues can sometimes be resolved automagically by supervision trees restarting said Erlang process. Log the stacktrace as a failure to maybe be fixed and move on. That's in addition to a robust soft-realtime, SMP VM which allows remote, live debugging.


As a newbie to Erlang, how does Erlang handle this case better for a service than a Python process which logs and restarts on failure? Also, this runtime testing strategy doesn't handle cases where you have other systems depending on that Erlang code path being successful, does it?


If you absolutely want to cover all cases, you need to do mutation testing. A mutation testing system analyses your code, changes an operator (> becomes < or >=, for example), and then runs your tests. If at least one test fails, the mutant is "killed" and your tests genuinely exercise that statement.
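
A toy sketch of that mechanism in Python (real tools such as pitest, mentioned below, or Python's mutmut mutate the AST/bytecode rather than the source text; `is_adult` is just an invented example):

    ORIGINAL = "def is_adult(age):\n    return age > 18\n"

    def run_tests(ns):
        assert ns["is_adult"](19) is True
        assert ns["is_adult"](10) is False
        # without this edge-case assertion, the '>' -> '>=' mutant survives
        assert ns["is_adult"](18) is False

    for mutated_op in (">=", "<", "=="):
        ns = {}
        exec(ORIGINAL.replace(">", mutated_op, 1), ns)  # build the mutant
        try:
            run_tests(ns)
        except AssertionError:
            print(f"mutant '{mutated_op}' killed")      # tests catch the change
        else:
            print(f"mutant '{mutated_op}' SURVIVED")    # a coverage gap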

It seems to me you need to have serious OCD to go for 100% mutation coverage, but that is what you really need to do if perfect coverage is your aim.

The other option is to accept that perfect coverage is not feasible, or possibly even a trap, and that you just need to cover the interesting bits.


Strictly, mutation testing can't assure all cases either -- combinatorial state explosions are fun. But it does improve assurance a lot.


How would automated mutation testing handle the case of accidentally causing infinite loops, or invoking undefined behavior?


I use pitest to run mutation coverage on most of my Java code bases. pitest implements a timeout to check for infinite loops introduced by changing the code.

http://pitest.org/faq/


Pitest has solved the halting problem?


In the real world, you can't wait five minutes for the server to return a response. If the server returns the correct response after ten minutes, it's still wrong, unless the programmer has explicitly acknowledged that the procedure in question is a long-running procedure and has lengthened (but not eliminated) the time-out accordingly.

The halting problem is an irrelevant, Ivory-tower distraction in production code.


Timeouts are not a perfect solution to the halting problem, but usually good enough.


There can be circumstances in which the substituted operator is just as valid as the operator being substituted for, resulting in false positives.


How would that be a false positive? If the substituted operator does not cause a test to fail, then your tests don't cover it. If the function of the program is not changed by changing the operator, then the operator does nothing and should be removed.

That's the idea at least. All focus on high coverage, whether line coverage or mutation testing, creates an incentive to remove redundant robustness checks, and maybe that's not such a good idea after all. But that's a problem with all unit testing, and not just with mutation testing.


I was thinking that, for example, that a >= test is as valid as == in some cases. If == is actually used and is correct, substituting >= would not cause any valid test to fail. As I wrote this, however, I realized that if you substitute the inverse operator (!= for ==, < for >= etc.) and it is not caught, you can infer it is not covered. (edit: or that, in the context of the program as a whole, the result of the operation is irrelevant. I imagine this might legitimately happen, e.g. in the expansion of a macro or the instantiation of a template.)


Substituting == for >= should cause a test to fail. Either == is correct, or >=. They can't both be correct. Should the two things always be equal? Then they shouldn't be unequal. Is one allowed to be bigger than the other? Then that should be allowed, and not rejected.

If this change doesn't cause a test to fail, you're not testing edge cases for this comparison.

(Of course that's assuming you want perfect coverage.)


Counter-example:

  while (j < JMAX) {
     i = callback(j);
     if (i >= FINAL) break;
     ...
If, in the specific circumstances of this use, callback(j) will always equal FINAL before it exceeds it, a test for equality here will not cause the program to behave differently.

The fallacy of your argument is that == is not an assertion that its arguments should always be equal; it is a test of whether they are at some specific point in the algorithm.


Which is why I really love things like MSR's PEX [1].

"Pex uses a constraint solver to produce new test inputs which exercise different program behavior. "

So this is covering not lines but, I guess, branches and numerical limits. Either way it's cool and I wish it were integrated somewhere. They've got an online demo thing: http://www.pexforfun.com/

[1]https://www.microsoft.com/en-us/research/publication/pex-whi...


IntelliTest[0] is the productized version of PEX. The first release was in Visual Studio 2015 Enterprise edition.

[0] https://blogs.msdn.microsoft.com/visualstudio/2015/09/30/int...


Still a pity it's only included in Enterprise edition. We need a strong foundation of accessible tools for automated test input generation improving on the current state of the art fuzzers. See e.g. danluu's vision of "Combining AFL and QuickCheck for directed fuzzing" [https://danluu.com/testing/].


I only have one piece of code for which I can say true 100% coverage exists: a library that works with HTML/CSS color values, and which ships a test that generates all 16,777,216 hexadecimal and integer rgb() color values, and runs some functions with each value.

However, I don't run that as part of the normal test suite. It only gets run when I'm prepping a new release, as a final verification step; the normal runs-every-commit test suite just exercises with a selection of values likely to expose obvious problems.


Just out of curiosity... What is the point in testing all 2^24 possible color values?


The point is being certain I haven't missed an edge case somewhere.

The hard part is the percentage rgb() values, of which there are technically an uncountably infinite number (since any real number in the range 0-100 is a legal percentage value). For those I generate all 16,777,216 integer values, and verify that converting to percentage and back yields the original value.


Hmm...true 100% input coverage? Including the time domain? (Order dependency, Idempotence, etc...) :)


how do you know the test that generates those values is correct?


Python's test suite.


How long does that take?


Depends on the machine. See the comments in the file:

https://github.com/ubernostrum/webcolors/blob/master/tests/f...

The fun part is people who criticize the generation of the integer triplets; yes, it's three nested loops and that's bad, but the total number of iterations will always be 16,777,216 no matter what algorithm you decide to use to generate them. So it uses nested loops since that's the most readable way to do it.
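
For what it's worth, the shape of that test is roughly this (a sketch; the round-trip helper names follow the webcolors API, so treat them as an assumption):

    from webcolors import rgb_percent_to_rgb, rgb_to_rgb_percent

    def test_all_integer_triplets_round_trip():
        # three nested loops: 256 * 256 * 256 = 16,777,216 iterations total
        for r in range(256):
            for g in range(256):
                for b in range(256):
                    triplet = (r, g, b)
                    # converting to percentage form and back must be lossless
                    assert tuple(rgb_percent_to_rgb(rgb_to_rgb_percent(triplet))) == triplet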


I don't see how that's a "tragedy", as in "an event causing great suffering, destruction, and distress, such as a serious accident, crime, or natural catastrophe".

You're also making it sound like somebody promised you that tests can prove the absence of bugs, when that was never the bargain and smart people have already told you so, probably before you were born even.

> Testing shows the presence, not the absence of bugs

Dijkstra (1969)


If we're being pedantic, a tragedy is _a drama or literary work in which the main character is brought to ruin or suffers extreme sorrow, especially as a consequence of a tragic flaw, moral weakness, or inability to cope with unfavorable circumstances_.

Definitely not a tragedy.


If we're being pedantic, there's a second common definition of the word tragedy that does not refer to a drama or literary work, but "an event causing great suffering, destruction, and distress, such as a serious accident, crime, or natural catastrophe."

I think when you take all of the time wasted on useless tests written merely for the sake of having tests, that waste is tragic. You could be doing anything else with that time.


I am curious if there is any good way to measure that amount of waste. I agree that waste from extra and useless tests exists, but I am not convinced that waste is entirely bad.

It seems to me that waste is unavoidable with teams newer to automated testing and may simply be part of the cost of using automated tests. If that is the case then it seems better to compare the cost of bugs with no automated tests to the cost of superfluous tests. In that comparison extra tests definitely seem like the lesser of two evils, even without hard numbers. I would prefer hard numbers because my intuition could be wrong.


Well, if the organization tracks the time developers spend doing things, then it should be easy to estimate the waste.

1. Take the number of hours spent writing tests.

2. Multiply by whatever percentage of the tests are unnecessary.

3. Multiply by the labor cost per hour. Or revenue that was not made (e.g. if you could have billed those hours to a customer, but didn't).

The resulting number could either be a big deal or not, depending on how big your organization is and how much time you spent on superfluous tests.
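
With made-up numbers, just to show the shape of the estimate:

    hours_writing_tests = 400      # e.g. per quarter, from time tracking
    superfluous_fraction = 0.25    # rough guess: one in four tests adds no value
    cost_per_hour = 90             # loaded labor cost, or the billable rate

    wasted = hours_writing_tests * superfluous_fraction * cost_per_hour
    print(wasted)                  # 9000.0 (dollars per quarter, in this example)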


I agree, I was just kicking GP's pedantry up a notch.


There's use in looking at the testing for industrial-strength code, e.g. compilers, DirectX, OpenGL conformance (god help you). With that background, I was confused by TDD, to put it politely.


Code coverage should perhaps not be counted in lines, but in the number of type permutations covered. Then your challenge is determining all the types that _can_ be passed as inputs.


By types, I assume you mean categorically different types of data, rather than the types of the type system that your language recognizes?


I agree that test coverage does not ensure that the code is really tested. But uncovered code is untested code. So I use coverage in this way, to spot untested code.


You can test stuff manually.


You really shouldn't, or if you do it should be an extreme minority of tests and automated tests should do the heavy lifting.

Automated tests can do so much more than manual tests that shops still living by manual tests either have a damn good reason or are just as wrong as people who argued against Revision Control Systems (or people who argued for gotos instead of functions).

Automated tests will be executed identically each time, so no missing test cases because someone slacked off or made a typo. Automated tests can serve as examples for how to use the code. Automated tests can aid in porting to new platforms; once it builds you can find all the bugs you care about swiftly. Automated tests can be integrated with documentation; tools like doxygen and mdbook make this easier.

Automated tests enable Continuous Integration. Are you familiar with Travis CI or Jenkins? If not, imagine a computer that a team commits their code to, instead of directly to the mainline Revision Control (Git master, svn head, etc...). That computer builds the software, runs all the tests, perhaps on every supported platform or in an environment very close to production, then only merges commits that appear to fully work. This doesn't completely eliminate bugs and broken builds, but the change is so large that teams without it are at a clear competitive disadvantage.

When integrated into the process, tests can be used to protect code from changes. If a test exercises an API and the team knows that is its purpose, then when they change things in a way that breaks the test, they shouldn't... This sounds vague or obvious, but consider this: at Facebook and Google they have a rule that if something is not tested, new code doesn't have to care if it breaks it. Both companies have teams that make broad, sweeping changes. Facebook wrote a new std::string, and Google uses clang tools to make automated changes in thousands of places at once. Even if code breaks or APIs change, as long as the tests pass these people can be sure that they haven't negatively impacted the product and are following their team's rules.

Automated Tests can... This list could go on for a very long time.


Maybe you're thinking about some specific applications/pieces of code? Probably not a video game?

I think it should be a mix. I disagree with the "extreme minority" portion for most projects. There are times where a manual test is the right answer and there's exploratory manual testing.

Obviously though running through a long sequence of testing manually for every code change is crazy. And some sort of CI setup like you describe is a must for every project this day and age.

Then there's also the question of unit tests vs. system/integration tests...


I think that mutation testing should replace code coverage. It really ensures that the tests verify the code's behavior and not only that they invoked it.


The trouble with mutation testing is that most mutations will completely break an application.

Why not prove the code is correct instead? Should be much cheaper than 100% coverage and more certain.


Why do you say most mutations will completely break an application? This is certainly not my experience.

Mutation testing systems normally use fairly stable operators (e.g. changing a > to a >=). In most locations in the code, changes such as these will have only a subtle effect.


> even if you do all the work, and cover every line in a test, unless you cover 100% of your underlying dependencies, and cover all your inputs, you're still not covering all the cases.

On the other hand, if you do cover all your inputs, you've covered all the cases regardless of what % code coverage or path coverage you have.


Tests are not meant for ensuring it works with all inputs. You can't simply throw values at it hoping it's all okay.

To prove it works with all possible inputs, there are other tools at your disposal.


I'm reminded of a guy tasked with testing a 32-bit floating point library. After a number of false starts, he settled on the most effective approach possible.

Brute force.

Oh, another story. An OG (original geek) I know once proved the floating point unit on a mainframe was broken via brute force. The bad part meant the decimal floats used by, and only by, the accounting department sometimes produced erroneous results.

Me, I have a pitiful amount of unit tests for the main codebase I work on. One module which does have unit tests also had a bug that resulted in us having to replace a couple of thousand units in the field.

Otherwise most of the bugs that chap my keister aren't about procedural correctness, they're about state.


How do you test it by brute force? How is the algorithm supposed to know what the right thing to return is, unless you rewrite the whole thing, correctly this time? And, if you've done that, just replace the actual function with the test and be done with it.


One answer to that is to apply property based testing. If you're testing an addition algorithm, for example, you might test that

   A + B == B + A
   A + 0 == A
   (A + B) + C == A + (B + C)
   A + B != A + C (given B != C)
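
A sketch of those properties as property-based tests with Python's hypothesis library, run under pytest (here over integers, so all four hold; `add` stands in for the addition algorithm under test):

    from hypothesis import assume, given
    from hypothesis import strategies as st

    def add(a, b):
        return a + b  # stand-in for the implementation being tested

    @given(st.integers(), st.integers())
    def test_commutative(a, b):
        assert add(a, b) == add(b, a)

    @given(st.integers())
    def test_identity(a):
        assert add(a, 0) == a

    @given(st.integers(), st.integers(), st.integers())
    def test_associative(a, b, c):
        assert add(add(a, b), c) == add(a, add(b, c))

    @given(st.integers(), st.integers(), st.integers())
    def test_distinct_addends_give_distinct_sums(a, b, c):
        assume(b != c)
        assert add(a, b) != add(a, c)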


It's a good idea in principle, though your third and fourth properties are not true of floating point addition.

When you have one available, testing the results against a reference implementation is a little more direct. Though, given how easy those property tests are, you might as well do both.


Your comparison source could be running on a different platform, or be slower, or ... and thus not be a drop-in replacement. (E.g. in the example of a broken floating-point unit, compare the results with a software emulation)


Hmm, true, it may have been a different platform/slower, thanks. It can't have been hardware/software, as the GP said they were testing a library (which implies software).


> How is the algorithm supposed to know what the right thing to return is

Sometimes I have, for example, MATLAB code that I 'know' is right and want to verify that my C/Python/Julia... code does the same thing. So then I just call the MATLAB code, call the new code, and check if they're the same.

Or I have a slow exact method and want to see if my faster approximation always gives an answer 'close enough' to the correct answer.


With floating-point math libraries especially, it's frequently the case that it's relatively easy to compute the desired result accurately by using higher precision for internal computations; e.g. you might compare against the same function evaluated in double and then rounded to float.

This sounds pretty sketchy at first, but it turns out that almost all of the difficulty (and hence almost all of the bugs) are in getting the last few bits right, and the last few bits of a much-higher precision result don't matter when you're testing a lower-precision implementation, so combined with some basic sanity checks this is actually a very reasonable testing strategy (and widely used in industry).
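
A rough Python sketch of that strategy: use double precision as the reference and compare after a single rounding to float32 (`my_sqrtf` is a hypothetical implementation under test; names and structure here are assumptions for illustration):

    import math
    import struct

    def round_to_float32(x):
        # round a Python double to the nearest representable float32
        return struct.unpack("<f", struct.pack("<f", x))[0]

    def my_sqrtf(x):
        return round_to_float32(math.sqrt(x))   # placeholder for the real code

    def reference_sqrtf(x):
        return round_to_float32(math.sqrt(x))   # sqrt in double, rounded once

    def bits_to_float32(bits):
        return struct.unpack("<f", struct.pack("<I", bits))[0]

    # iterate bit patterns of non-negative finite float32 values
    # (a stride keeps the pure-Python loop quick; drop it for an exhaustive run)
    for bits in range(0, 0x7F800000, 997):
        x = bits_to_float32(bits)
        assert my_sqrtf(x) == reference_sqrtf(x)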


If a function takes a single 32 bit argument you can run it on every possible value.


… this reply does nothing to answer the parent's question. It's obvious that you can run it on all possible values, but that doesn't answer the question of the return values. (unless you only want to verify "doesn't crash" or something similarly basic)


I believe the question was edited after I wrote my answer. This, or I just misread it - unfortunately HN doesn't let us know. Anyway, I'm providing another answer to the question as it is now.


Validating the answer is different than providing it:

   FLOAT_EQ(sqrt(x) * sqrt(x), x) && sqrt(x)>=0.0
is a pretty good test for the square root even though it doesn't compute it.


What tools would that be?

I'd like to hear about real world test scenarios where all possible inputs are tested.


In practice, testing _all_ possible inputs is usually not a feasible goal, and often downright impossible (since it would mean solving the halting problem). While there are some languages and problem domains where formal verification methods are a possibility, for most of software development the best you can do is reduce the solution space.

A sane type system would be a start, for example. And it's no coincidence that functional programming and practices with an emphasis on immutability are on the rise; Rust's ownership system is a direct consequence as well.

TDD _is_ important, if only for enabling well-factored code and somewhat guarding against regression bugs. But - decades after Dijkstra's statement (which someone has already posted in this thread) - the code coverage honeymoon finally seems to be over.


Polyspace [0] allows full state space checking. Usually this is feasible because the kinds of modules you test take heavily constrained 12-bit (or fewer) fixed-point values (sensor output / actuator input) and contain no algebraic loops. With floats and/or large integers, this quickly becomes unwieldy (verification then takes weeks).

[0] https://de.mathworks.com/products/polyspace.html


One of them would be fuzz testing. You start with a predefined 'corpus' of inputs which are then modified by the fuzzer to reach not-yet-covered code paths. In the process it discovers many bugs and crashes.

The input fuzzing process is rarely purely random. There are advanced techniques that allow the fuzzer to link input data to conditions of not covered branches.

It is quite a useful mechanism for checking inputs, formats, and behaviour patterns (e.g. if you have one model but two solutions: one simple that works 100% but is slowish, and one more complex but very fast).
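
In miniature, the loop looks something like this (a toy, hand-rolled Python sketch just to show the corpus/mutation/coverage-feedback idea; real fuzzers like AFL or go-fuzz instrument the code and mutate far more cleverly; `target` is whatever parser or handler you want to exercise):

    import random
    import sys

    def fuzz(target, corpus, iterations=100_000):
        seen = set()                    # (filename, lineno) pairs hit so far

        for _ in range(iterations):
            data = bytearray(random.choice(corpus))
            if data:                    # mutate: flip one random byte
                data[random.randrange(len(data))] = random.randrange(256)

            hit = set()
            def tracer(frame, event, arg):
                if event == "line":
                    hit.add((frame.f_code.co_filename, frame.f_lineno))
                return tracer

            sys.settrace(tracer)
            try:
                target(bytes(data))
            except Exception as exc:    # a crash/exception is a finding
                print("crash:", repr(exc), "on input", bytes(data))
            finally:
                sys.settrace(None)

            if not hit <= seen:         # reached new code: keep this input
                seen |= hit
                corpus.append(bytes(data))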

See: https://github.com/dvyukov/go-fuzz#trophies and http://lcamtuf.coredump.cx/afl/


Fuzzing is essentially the same as just throwing values at it hoping it's all okay.

What I meant was what @AstralStorm said about mathematical and logic proofs that it works for all defined values.


Fuzz testing is useful for testing whole systems in vivo.

Proofs are powerful, but you can still make errors in the proof or the transcription into code. And of course you don't know what other emergent behaviours will surprise you when the proved code interacts with unproved code.

Depending on your means and needs, you want to try both.


Parent was talking about full coverage. While fuzzers are great tools, able to generate test inputs for easily coverable parts of the code, formally correct software has to be designed for provability.


Many embedded safety and medical applications prove correctness for all inputs and their combinations. Some also verify error behaviour (out of range). Granted, this is a relatively small input space, typically a few sensors.



