
Musings on the Impossibility of Testing - george3d6
https://blog.cerebralab.com/Musings_on_the_impossibility_of_testing
======
lmm
A lot of the time it's easier to confirm that an answer is correct than
compute that answer. Imagine a test like:

    
    
        val result = factorize(367)
        assertEquals(367, result.reduce(_ * _))
    

This isn't really a "human validation" - as a human I don't actually know
what the prime factorisation of 367 is - and the sense in which it's a
"compiler rule" is a deep and subtle one. But I'd argue that it's a valid test
and quite a lot of software tests are of this type.

Another type of test not considered here is a "redundant" check for a simple
case. If we write a test like:

    
    
        assertEquals(multiply(83, 83), pow(83, 2))
    

then to a certain extent that's redundant, but I'd argue it's a useful sanity
check of our pow implementation. Again it doesn't seem right to call this a
"human validation" since I don't actually know 83^2 as a human.

It also bears considering that even an entirely redundant test may be useful
in providing (and explaining) a use case for an API. If we've just implemented
the map() function for lists, then a test like:

    
    
        def f(x: Int) = x * 2
        val input = List(1, 2, 3)
        var result = List.empty[Int]
        for { element <- input }
          result :+= f(element)
        assertEquals(result, input map f)
    

is "redundant" in this taxonomy, but a perfectly legitimate test of map; often
"be equivalent to this more complicated sequence of functions" is exactly the
specification of a particular function.

~~~
xtajv
> A lot of the time it's easier to confirm that an answer is correct than
> compute that answer.

Is this a "human validation" that P!=NP ? ;)

~~~
sfink
More like "most software is in NP", since "easy to confirm an answer" is the
set's defining characteristic.

------
hyko
Testing is obviously not impossible, so let’s set that aside for a moment.

The question is, why are you unit testing ML functions? ML is not software
engineering, and unit tests are not a good way to broadly evaluate them,
outside some “third rail” type tests for insanely important cases. You’d
typically use statistical testing (e.g. f1 scores) to determine whether your
models are performing as expected, but they are ultimately probabilistic in
nature and highly unlikely to satisfy all of the intended cases simultaneously.

There is of course the philosophical question of how you determine the quality
of your tests, which are themselves software. The practical answer is to keep
your testing software as trivially simple, obvious, and boring as possible;
tests can then be validated with the mark I eyeball.

None of this has any bearing on the usefulness or otherwise of automated
testing.

~~~
george3d6
> The question is, why are you unit testing ML functions? ML is not software
> engineering, and unit tests are not a good way to broadly evaluate them,
> outside some “third rail” type tests for insanely important cases.

In practice, I agree with this, and that's how I usually write my tests when
dealing with ML code. Though honestly I think this perspective could well
extend beyond ML code to any code with very broad input and output spaces.

> You’d typically use statistical testing (e.g. f1 scores) to determine
> whether your models are performing as expected, but they are ultimately
> probabilistic in nature and highly unlikely to satisfy all of the intended
> cases simultaneously.

Yes, but the solution to that (in practice) is basically to benchmark against
your own (previous) code and against alternative implementations, based on the
idea that "if 10x tried it, and mine is better, then I must be doing something
correctly" and/or "if the previous version did worse, well, then something
must be improving".

Again, I think this is equally true for testing a ray tracer or compilation
times as it is for testing an ML algorithm.

> There is of course the philosophical question of how you determine the
> quality of your tests, which are themselves software. The practical answer
> is to keep your testing software as trivially simple, obvious, and boring as
> possible; tests can then be validated with the mark I eyeball.

But this results in a situation where "good" tests are extremely subjective,
and replacing a team of engineers with another (or even just a member of that
team) would result in the need to replace the tests.

> Testing is obviously not impossible, so let’s set that aside for a moment.

> None of this has any bearing on the usefulness or otherwise of automated
> testing.

To me it seems to have bearing in that, unless I have at least some heuristic
by which I can define "good" or "sufficient" testing, the question of what to
test and how much time to allocate to it becomes impossible to answer. And
that is a very useful question I'd want an answer to, one better than just
"based on my intuition and what I think is sufficient so as not to have my
bosses/customers scream at me when stuff breaks, this amount of testing is
good".

------
seedless-sensat
> Since the function is very simple, time would probably be better spent
> reviewing the actual code to make sure it's error-free.

I disagree with that statement. Unit tests are very helpful for complex
models, as they can guard against regressions in future changes. If the
original author tested all their expectations, I am less worried that I may
break something I didn't anticipate.

~~~
reidjs
They're also great for remembering what the heck functions do. At one of my
jobs we had a file called filters.js which had hundreds of functions in it. I
made it a rule that anytime I used or added one of these functions in the
future I would have to write at least one unit test for it.

------
csb6
I’ve used tools similar to quickcheck a little bit, and I think property-based
testing can be useful. Generating, say, 100,000 randomized inputs over a range
makes me more confident than if I hand-write a handful of tests. Still,
quickcheck-like tools are limited to producing input over limited ranges for
different types, and a lot of tests need some kind of more complex and unique
interaction of various components that can only be set up by hand (i.e.
integration testing). Even so, I think property-based testing can serve as a
good way to more thoroughly test inputs.
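
As a rough sketch of what that buys you (in Scala; reverse() below is a
hypothetical function under test, defined only so the example runs), a
hand-rolled property check might look like:

    
    
        import scala.util.Random
        // hypothetical function under test, defined here only to make the sketch runnable
        def reverse[A](xs: List[A]): List[A] =
          xs.foldLeft(List.empty[A])((acc, x) => x :: acc)
        // 100,000 randomized inputs instead of a handful of hand-written cases
        val rng = new Random(42)
        for (_ <- 1 to 100000) {
          val xs = List.fill(rng.nextInt(20))(rng.nextInt())
          assert(reverse(reverse(xs)) == xs, s"round-trip failed for $xs")
        }
    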

------
rchaves
Well he gives the answer himself: “I suspect the answer lies somewhere in a
combination of smoke tests, testing common user flows, testing edge cases in
which we suspect the software will fail, and testing critical components the
failure of which is unacceptable”

------
exdsq
I disagree with the point about human validation tests being unsuitable for
multiplication. Sure, it’s a contrived example, but even with more complex
financial rules, you’re building the system for a reason and generally know
enough example inputs and outputs to have these tests. If I saw some similar
code that didn’t have these types of tests because the ‘machine is more
capable at solving them’, I’d fail the review. How do they know it works then?

~~~
bluGill
More than that, if I write assert(10 == 2*3) in a test, eventually I will find
that the test fails and fix it. Thus getting the test wrong isn't fatal. Of
course, if this is a more complex case (which it probably is), the bug might
take a while to discover, since it led me to write wrong production code, but
at least once it is discovered and fixed (both the test and the production
code) it will not be broken again.

------
scollet
Goodness.

If all of software engineering is an infinite space, software testing is
always infinite+1.

The problem always starts at human-defined logic and ends at human-defined
interpretation.

There is no amount of testing that can replicate eyes on those expectations
defined on either end, and my own personal politics reflect that:

Tests are simply reinforcing these vectors from human to human where we might
have a black, grey box that is incompletely understood by its contributors.

The point is to reduce complexity and optimize those vectors so those
expectations directly align.

If you're writing tests with crêpe-like consistency, perhaps those goals
don't align. I've been there. Something surely has to change. Tests should be
very simple validations and should be very developer-interpretable. In fact,
they should be enlightening for that vertical of code.

They should necessarily be an aside to the larger architecture where we
generally understand what is supposed to happen. They should define a ruleset
to its operation at each discrete step.

Never a validation of "what function functions?".

I've seen too many PMs, execs, and can-do engineers get lost beyond the design
because they were building on sand and justifying their castle.

Just stop.

Bring testing to the 'else' case on 'if, else if, elsif, ...' where we can
actually understand that path.

Goodness.

Addendum: perhaps I've not worked in complex enough systems. I would
appreciate any constructive argument towards my personal views.

------
amw-zero
Testing / verification is the hardest problem in computing. I don’t think the
author actually understands why that is, and has a few superficial gripes with
how codebases are tested. That’s not to say that a lot of test suites are
actually good (they’re not). That criticism may be fair.

Static types do not help at all with software quality, beyond catching some
typos. You can never, ever write a static type that enforces anything of
substance, because for that to happen you’d need to execute arbitrary code
within the compiler (as happens with dependent types, and of course there is
the constant fear of undecidable or infinite type checking in that system).

The classic example is that there’s an infinite number of functions with type
int -> int. The logic of these functions is extremely varied, and the type
doesn’t help you at all. Enum types are very useful for knowing what values
you are allowed to write, but don’t help with correctness.
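
For example (a quick Scala sketch with made-up functions):

    
    
        // two values of the exact same type, Int => Int, with different behaviour;
        // the type alone cannot tell you which logic you actually got
        val double: Int => Int = x => x * 2
        val almostDouble: Int => Int = x => if (x == 1337) 0 else x * 2
        assert(double(2) == 4 && almostDouble(2) == 4)  // indistinguishable here
        assert(double(1337) != almostDouble(1337))      // only a test sees the difference
    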

The author brings up large input spaces, but conveniently leaves out input
space partitioning, which any non-novice tester has to employ in their test
case management. The category partition method is a fantastic way of covering
large swaths of a seemingly infinite input space with relatively few tests. Of
course there’s combinatorial explosion, but that can be mitigated with
constraints on the input combinations.
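
A rough sketch of the idea (in Scala; parsePrice and the categories below are
made up for illustration, not the method's canonical form):

    
    
        // categories of input, each with a few representative choices
        val signs      = List("", "-", "+")
        val magnitudes = List("0", "19.99", "999999999")
        val padding    = List("", " ", "\t")
        // hypothetical function under test
        def parsePrice(s: String): Option[BigDecimal] =
          scala.util.Try(BigDecimal(s.trim)).toOption
        // cross product of the categories, with a constraint pruning one combination
        val cases = for {
          sign <- signs
          mag  <- magnitudes
          pad  <- padding
          if !(sign == "-" && mag == "0")
        } yield pad + sign + mag + pad
        // deliberately weak property: every generated input should parse
        for (c <- cases) assert(parsePrice(c).isDefined, s"expected '$c' to parse")
    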

I’m not saying testing is perfect - it’s not. But, the burden of proof is on
the critic of testing, not testing itself. How else can you reliably change a
large software product dozens of times a week, for 20 years in a row? And
what’s the alternative when I see products that large teams produce break
weekly?

This is not something that “ship it and iterate” has had any meaningful impact
on. Eventually you get people who don’t intimately know the whole codebase,
and make a breaking change unintentionally. What is the alternative to some
kind of testing?

~~~
sbergot
The part about static types is weird. "Tests are not always perfect but can be
useful. However static types are not perfect and so they are useless."

~~~
fgonzag
I felt the same about it.

Static typing is a tool, just like unit testing, integration testing, and code
reviews.

We combine these tools in order to maximize software quality, ease of
development and maintenance.

Static typing eliminates a whole class of errors that good code bases in
dynamic languages usually have to test for manually (instance and type tests).

~~~
amw-zero
There is no need to have “type” tests in dynamically typed languages.
Dynamically typed objects are implementation details; you test without
referring to the underlying types.

Also, static types _are_ manual testing; you need to write the types.

------
onion2k
I wonder if the author has ever heard the aphorism, "Perfect is the enemy of
good."

~~~
george3d6
I'm not trying to achieve perfection, rather just to be able to answer a
question like: what makes a test obviously bad or obviously good? I needn't
have a system that can help me generate perfect tests, or tell apart almost
equally good implementations, but rather one where I can at least point to a
test that is obviously bad and say "this is why" rather than just using my
intuition.

------
t-writescode
I see the author posted this.

What are your thoughts on automated/unit tests used to guarantee expectations
do not change in future revisions (a regression suite) and also used to
present example usages of the API that was designed?

~~~
george3d6
> What are your thoughts on automated/unit tests used to guarantee
> expectations do not change in future revisions (a regression suite) and also
> used to present example usages of the API that was designed?

To be honest, this is one idea I was toying around with. As in, have a test
suite that generates a "state" file for your program and then, in any given
patch, list the items in the state that you'd expect to change.

That way, one can basically catch a lot of "bugs" which often boil down to
"this change I made here to affect X is also unexpectedly affecting Y".

My main problems with this are:

1. I see nobody doing this, and I'm not sure why.

2. The tooling for this doesn't really seem to exist, and I'm not sure how
easily one could bring it into existence and/or whether a generic version of
it could be written.

3. This breaks down with a lot of software where a tiny change can affect,
well, everything (e.g. modifying a random seed or changing an error-checking
constant).

~~~
t-writescode
> As in, have a test suite that generates a "state" file for your program

What do you mean here?

Do you mean you would run a specific function with specific inputs and record
specific outputs and compare them?

~~~
george3d6
Yes, or rather, run your whole program with specific inputs, then (e.g. by
using a debugger) capture as much of its state as you can and compare that
with the state from the next run on those same inputs.

E.g. given a program with variables

a1 = array
a2 = int
a3 = string

Running it might produce the state flow:

a1 = []

a2 = 10

a3 = 'abc'

a3 = 'dd1'

a2 = 200

a1 = [200,'dd1']

Then, if a change is made, the programmer can say "I expect this change to
only affect the state of a2" and if you get the state flow:

a1 = []

a2 = 11

a3 = 'abc'

a3 = 'dd1'

a2 = 500

a1 = [200,'dd1']

Then the test passes.

If you get the state flow:

a1 = []

a2 = 11

a3 = 'abc'

a3 = 'dd1'

a2 = 500

a1 = [500,'dd1']

The test fails with error "The change you expected to only affect a2 also had
an effect upon a1"
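
In code, the comparison step might look something like this rough sketch
(unexpectedChanges is a made-up helper, and the state flow is assumed to
already be captured as an ordered list of (variable, value) snapshots):

    
    
        type StateFlow = Vector[(String, String)]
        // flag variables that changed between runs but were not declared as expected to change
        def unexpectedChanges(base: StateFlow, next: StateFlow, allowed: Set[String]): Set[String] =
          base.zip(next).collect {
            case ((name, before), (_, after)) if before != after && !allowed.contains(name) => name
          }.toSet
        val baseline = Vector("a1" -> "[]", "a2" -> "10", "a3" -> "abc", "a2" -> "200", "a1" -> "[200,dd1]")
        val patched  = Vector("a1" -> "[]", "a2" -> "11", "a3" -> "abc", "a2" -> "500", "a1" -> "[200,dd1]")
        val broken   = Vector("a1" -> "[]", "a2" -> "11", "a3" -> "abc", "a2" -> "500", "a1" -> "[500,dd1]")
        assert(unexpectedChanges(baseline, patched, allowed = Set("a2")).isEmpty)       // passes
        assert(unexpectedChanges(baseline, broken, allowed = Set("a2")) == Set("a1"))   // a1 also changed
    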

But again, the actual state you capture here and the way you check for
equality are open questions (e.g. do you check for equality among all the
states during the whole runtime, or just in the final state of the program? If
the former, how is state-change ordering handled? If the latter, what about
relevant state that is out of scope by the time the program finishes
running?).

~~~
t-writescode
How is this different from a set of unit tests where you give the function
under test each of those inputs and assert that the output for each entry
matches the corresponding expected output?

e.g.

    
    
      [TestCase(null, null)]
      [TestCase(10, 11)]
      [TestCase("abc", "abc")]
      [TestCase("dd1", "dd1")]
      [TestCase(500, 500)]
      [TestCase(new object[] { 200, "dd1" }, new object[] { 200, "dd1" })]
      public void StateChanges(object input, object expected) {
          var actual = DoThing(input);
          Assert.AreEqual(expected, actual);
      }

~~~
exdsq
Because this is only checking the individual function, not the entire system's
state. If I have a function that adds two ints, I could test it with
[TestCase([2, 2], 4)] and this would pass, _but_ if it also changed the value
of an object, there could be an issue that a unit test (and indeed an
integration/e2e test) wouldn't catch.

------
pfdietz
The objection to comparing 'multiply' and '*' is silly. That's differential
testing. The objection there shouldn't be that those two functions are being
compared; it's that the arguments are static rather than being generated in
high volume, randomly.
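
Something like this, roughly (in Scala; multiply here is just a stand-in for
the implementation under test):

    
    
        import scala.util.Random
        // stand-in for the implementation under test
        def multiply(a: Int, b: Int): Int = a * b
        val rng = new Random()
        for (_ <- 1 to 100000) {
          val a = rng.nextInt(20001) - 10000  // arguments drawn from [-10000, 10000]
          val b = rng.nextInt(20001) - 10000
          assert(multiply(a, b) == a * b, s"multiply($a, $b) != ${a * b}")
        }
    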

------
ximm
The article is bikeshedding; still upvoting for the poem!

------
mdoms
Yep it's very easy to tear down a straw man of your own creation. Meanwhile
I'll keep writing useful tests that validate my code and prevent regressions.

~~~
2rsf
How do you know that your tests are useful? Most teams (I'm a senior tester,
so I know a few) write tests that are useful when they are new or when a new
feature is being developed; those tests quickly stop finding problems and in
the worst case give false negatives (fail for no reason).

A related problem is that of rotten green tests [1].

[1] https://dl.acm.org/doi/10.1109/ICSE.2019.00062

~~~
codeulike
If a test never fails unexpectedly then you know that particular test has been
a waste of time.

e.g. if no regressions ever get picked up by the test subsequently

~~~
rimliu
That's an interesting statement. Imagine I have a test which never fails
unexpectedly, but it expectedly fails when I decide to refactor some code. By
your definition that test was a waste of time; by my definition it did exactly
what it was supposed to do.

~~~
codeulike
If you're expecting it to fail then your brain is already doing the work that
the test is meant to do. It serves as an expensive to-do reminder in that
scenario.

If you are part way or all the way through your refactor and think you're on
the right track and then you unexpectedly have a failing test, then I would
say that that was a useful bit of test code.

