
Unit Testing Doesn’t Affect Codebases the Way You Would Think - ingve
https://blog.ndepend.com/unit-testing-affect-codebases/
======
cbanek
Love this article. Very interesting, and I love how the results are able to be
quantified.

Coming from some projects that had a very heavy unit testing culture, I can
agree that unit testing doesn't always necessarily have the intended results
on "quality" and can really shape a codebase.

I feel the root of the reason is that testing is not the same thing as
development. Truly testing things involves a different mentality: it involves
understanding how multiple pieces fit together and operate in reality, not in
some unit testing framework. A lot of developers look at unit tests the same
way they look at development: once they work, it's done. But testing is
proving things work under all the right conditions, not just getting them to
work once.

I've personally seen very simple code convoluted so that it could be unit
tested, and those convolutions can easily add bugs and additional layers of
indirection that hurt performance (as well as add unnecessary code, which
needs maintenance). Many times these tests could have been written against the
existing code as long as it used some other component... but then it wouldn't
be "unit testing", and you'd only see these components fail to work together
at the first runtime. (For some reason, places that rely heavily on unit
testing seem to have fewer overall or integration tests, because they spend so
much time and effort on unit testing.)

I'd love to see a study on how many lines of code change because of unit tests
vs in non unit tested codebases. Both in terms of how many unit tests need to
change and the size of the change itself.

~~~
ajmurmann
It's crucial to keep in mind that unit tests are only one part of a healthy
test diet (and even TDD diet), especially if you write "true" unit tests that
use doubles for test collaborators.

The plus side of using doubles in your unit tests, though, is that the outcome
implied by your last question should likely be different. I've found that
changes typically only impact the unit test of the class or sometimes direct
collaborators and a very small number of integration or acceptance tests. If
you follow the open/closed principle those changes become even fewer.

~~~
cbanek
Totally and completely agree.

But everywhere I've worked in the past 10 years has always asked me to unit
test everything, and unit testing is specifically called out all the time.

Other types of tests are rarely even discussed, because many developers don't
do a lot of testing beyond unit testing. Sometimes there are other
organizations that pick up this slack, like a dedicated test organization, but
many times there are not.

Perhaps it's just my experience, but I've found unit tests often give a false
sense of security when most of the issues I have to fix (which are frankly the
hard issues, not simple logic bugs) heavily involve how the code interacts
with its environment. I'm talking OS level, user level, privilege escalation
(such as UAC in Windows), networking, load, perf and stress.

It's not that I think unit tests aren't good (they are), but I think people
focus on them too much, to the detriment of other testing. Integration testing
should catch all the things the unit tests would catch as well; the only
difference is you have two copies of the test, and the unit test will give you
more granularity into what is failing when you look at the test suite results.
But as soon as you have a call stack or a logging facility (especially around
error conditions), which is best practice, most of this is obvious as soon as
you look at the result of the test.

~~~
ajmurmann
I totally agree that unit tests for bug prevention in new code are overrated.
I mainly like them because they 1. Encourage decoupled code and 2. Make future
refactoring easier especially in dynamically typed languages. Not having other
tests is a huge mistake. I think I'm spoiled because I've spent my entire
career at places that cared deeply about testing and proper TDD.

------
josteink
Disclaimer: unit-test proponent hypothesizing.

> But taking the data at face value, the trend lines speak clearly. More test
> methods in codebases predict more cyclomatic complexity per method and more
> lines of code per method

My initial reaction to this is that having more tests means your code
accommodates more and increasingly complex cases and edge-cases in general,
because you’re not going to write 10 test-cases testing the same logic (or
combination of logic).

That’s going to mean more code and more complex code, whether you like it or
not.

The projects with fewer test-cases may not handle as many edge-cases, and
maybe they don’t have this complexity (and the associated tests) because they
don’t need it yet?

If so, tests themselves are not bad, but may just be indicative of the real
world cost of real world complexity?

~~~
Izkata
I'm a heavy proponent of writing tests at work - even if it's not full-on TDD
- sometimes to the annoyance of my co-workers, and my reaction before even
reading the results was that for those two, his prediction was wrong.

It seems obvious to me: If you have (good) automated tests for correctness, a
method can handle more complexity without the programmer needing to handle all
paths in their head at once. The tests should let them know if they made a
mistake somewhere. On the other hand, if no such tests exist, those methods
would be more likely to be written as smaller methods that call each other
(maybe one "driver" and a couple "helpers"), so the programmer doesn't need to
hold the whole thing in their head at once.

The odd part is that the second way can often be better for
clarity/maintainability, since those helpers can be reused and the driver can
be written at a higher level. But it seems like automated tests give people an
out where they don't have to think about that anymore.

~~~
bonesss
Just to expound on your hypothesis (from another OOP TDD-ish proponent):

Google defines Cyclomatic Complexity as "_a software metric... [that] is a
quantitative measure of the number of linearly independent paths through a
program's source code._"

It seems to me that swapping out a dependency, abstracting it, and then
feeding multiple variants of the same dependency to some code would greatly
increase the number of linearly independent paths through the code. These
kinds of refactorings/improvements also tend to greatly increase
maintainability, clarity, flexibility, and your program's feature set...

By way of example: taking some DB insert code and updating it to write to a
simplified interface, then injecting components to handle writing to the DB, a
local file for testing, AWS S3, and such... A well-factored, clean, reusable,
persistence-ignorant solution will have a higher cyclomatic complexity than a
hard-coded, raw INSERT. I know which one I want to maintain, though ;)
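
A minimal Python sketch of that refactoring (the study's codebases are C#, and
all names here, such as `RecordWriter` and `save_order`, are hypothetical):
the injected writer is exactly what makes the code testable and
persistence-ignorant, and also exactly the extra indirection a complexity
metric counts.

```python
from typing import Protocol


class RecordWriter(Protocol):
    """Anywhere a record can be persisted: a DB, a local file, S3, ..."""

    def write(self, record: dict) -> None: ...


class InMemoryWriter:
    """A test double standing in for the real database."""

    def __init__(self) -> None:
        self.records = []

    def write(self, record: dict) -> None:
        self.records.append(record)


def save_order(order: dict, writer: RecordWriter) -> None:
    # The caller injects the writer, so this function never touches the DB
    # directly; swapping in InMemoryWriter makes it unit-testable, but the
    # indirection is itself one more "path" for a complexity metric to see.
    if not order.get("id"):
        raise ValueError("order must have an id")
    writer.write(order)
```

In a unit test you pass `InMemoryWriter()`; in production you pass a real DB
or S3 writer with the same `write` method.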

------
blahedo
I love that this post is quantifying results over actual data—a huge
improvement over the speculative/anecdotal pattern that a lot of these
analyses have.

But the conclusion that complexity and LOC both are higher in codebases with
much unit testing... seems very weak. The complexity graph in particular looks
very level, with the trendline driven almost entirely by the very low datum
for 0-10%, and the LOC graph looks almost as suspicious. That low datum
definitely demands further explanation before any conclusions are drawn on
those two.

To me, it also inspires two further questions about the LOC methodology:

1) how sensitive is the measure to different coding styles? Does it include
the method header? Does it include the closing bracket for the whole function?
Does it include open bracket for the function when on a line by itself? Does
it include _any_ bracket on a line by itself? With the average method length
ranging from 2 to 5, coding conventions on brackets could make a substantial
difference, and if coding conventions correlate with testing philosophy at all
(for whatever reason), that's a possible threat to validity.

2) Could these averages be dominated by a few larger projects? That is, is the
average computed as

    average (length of all methods, all projects)

or

    average (for each project, (average (length of methods)))

? If the former, larger projects would dominate the average. (True of all the
other questions that involve counts and averages too, actually.)
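
The difference between the two is easy to demonstrate with made-up numbers
(both project names and method lengths below are invented):

```python
# Two projects: one tiny, one large, with different average method lengths.
projects = {
    "small_project": [2, 2],        # method lengths in LOC
    "big_project":   [8] * 100,
}

# Pooled: every method weighted equally, so the big project dominates.
all_methods = [n for lengths in projects.values() for n in lengths]
pooled_avg = sum(all_methods) / len(all_methods)       # 804/102, about 7.88

# Per-project: each project contributes one number, weighted equally.
per_project = [sum(l) / len(l) for l in projects.values()]
per_project_avg = sum(per_project) / len(per_project)  # (2.0 + 8.0)/2 = 5.0
```

The same data gives roughly 7.88 one way and exactly 5.0 the other, so which
formula the study used genuinely matters.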

~~~
couchand
In addition to the methodology concerns you raised, I have to question the
use of a "linear" trend line, particularly when one axis is arbitrary buckets
of unequal size.

------
theptip
The consistent bump in the 0% bucket suggests that there is a significant
measurement error going on here.

The author acknowledges this:

> So if any of these codebases are using a different framework, they’ll be
> counted as non-test codebases (which I recognize as a threat to validity).

I'm inclined to say the 0% bucket should be thrown out, since it's impossible
to disambiguate between "no tests" and "tests I can't detect". All of the
other buckets are quite likely to be correctly classified.

------
kkapelon
Interesting article but I think the number of projects (100) is very low for
valid conclusions.

If you have already automated the process maybe you should try adding more
projects.

> Unit Testing Appears to Make Complexity and Method Length Worse

I think it is the other way around. People are forced to write unit tests for
complex and big methods. So it's not as negative as you present it.

~~~
theptip
> I think it is the other way around. People are forced to write unit tests
> for complex and big methods.

Intuitively this seems correct to me, at least for the code-first paradigm. I
do wonder if this holds in the TDD paradigm though, since in that case there
are (should be?) tests for everything regardless of complexity.

It would be interesting to try to answer this question with further
experimentation along the lines of the OP, though I can't think of any way of
detecting/classifying projects as TDD vs. not-TDD without contacting the
repository authors (which would be time-consuming, potentially prohibitively
so).

------
kgr
The results are compatible with "risk homeostasis" or "risk compensation"
theory. People who use unit tests have an increased sense of safety, so are
more comfortable living with less-safe designs and code-bases.

"Risk compensation is a theory which suggests that people typically adjust
their behavior in response to the perceived level of risk, becoming more
careful where they sense greater risk and less careful if they feel more
protected. Although usually small in comparison to the fundamental benefits of
safety interventions, it may result in a lower net benefit than expected" --
[https://en.wikipedia.org/wiki/Risk_compensation](https://en.wikipedia.org/wiki/Risk_compensation)

[http://injuryprevention.bmj.com/content/4/2/89](http://injuryprevention.bmj.com/content/4/2/89)

------
skywhopper
I like the idea of this study. I wish there were more data behind it. I’d love
to see comparisons based on language, or testing tool, for example. Or code
base size, age, author count, etc. There also needs to be some weighting of
confidence based on how many code bases are in each group. eg, Since there are
only four code bases in the >50% group, it’s dangerous to draw conclusions
based on a trend line when that group is equally weighted with the far more
populous groups.

Still, I’m hoping the author expands his datasets and analytical methods as
time goes on and I’d be interested to see this topic revisited with a larger
sample.

------
fulafel
What about causality? It seems plausible that people would be more motivated
to write unit tests for complex functions that are hard to gain confidence in
otherwise.

~~~
dozzie
It seems plausible that people would write shorter, clearer functions that
have no possibility of being incorrect if they had no tests to gain the
confidence of them being correct.

Also, designing for testability only gives you testability. If you ditch this
requirement, suddenly you're free to design directly for ease of use, not for
some proxy like testability.

~~~
dpark
> _It seems plausible that people would write shorter, clearer functions that
> have no possibility of being incorrect_

This seems exceedingly implausible. If people were capable of writing
perfectly correct functions consistently, they’d do so.

Also, in my experience the people who write tests are generally the ones who
write good small functions. The people writing massive dense functions are
rarely writing good (or often any) tests. (This obviously diverges from the
study’s results, though, so grain of salt.)

~~~
sidlls
In my experience the people who are the strongest advocates of unit testing
write small functions. I left off "good" because I don't agree that "small
functions" are "good" necessarily.

Often the functions are in fact _too_ small. They properly belong in a larger
function as part of that function's implementation, but are "factored out"
solely for the purpose of unit testing that piece of the algorithm inside the
function.

Usually these small functions are also only called once. In that case
especially it leads to messy, bloated code that is in fact harder to reason
about, harder to maintain and less efficient besides.

~~~
arthur_pryor
> Usually these small functions are also only called once. In that case
> especially it leads to messy, bloated code that is in fact harder to reason
> about, harder to maintain and less efficient besides.

the flipside is that if this approach is done well, you end up with things
broken down along the natural abstraction lines you'd use, and if you trust
that your small functions do the right thing (are tested themselves) and are
called correctly (their callers are tested well), it tends to make it much
easier to reason about the code, IMHO.

but yeah, also, sometimes larger functions and more integration-y tests are
the way to go. or the above fine breakdown + unit tests + integration tests.
depends on the situation, i think.

~~~
btilly
The tendency is to lock in the first abstraction line that you thought of...
which rarely ages well with future experience.

My rule of thumb is that something needs to be a function if it is called 3x,
doesn't if it is only called from one place, and if it happens 2x it is a
judgement call whether to leave cross-referencing comments between the spots
or to factor it out into a function. (If it is unlikely that I will need a
third and the common code is difficult to extract, I will leave comments.)

------
zebraflask
Unit testing tells you what you already know. It's kind of pointless, when you
think about it (unpopular opinion). If you have a handle on your code there's
rarely any need for it.

~~~
sinhpham
No. Software evolves, and there's value in knowing your assumptions still hold
after a change/refactor.

~~~
zebraflask
I don't understand that but thank you for your opinion.

~~~
kovacs
The simple way to say it is that once your code gets too complicated to keep
all the execution paths in your head (sooner than you think), tests are what
enable you to make sure you didn't forget an assumption you made about how it
should work.

Do you like going back to manually verify things still worked after you make a
foundational change? Or do you just trust yourself to know you've not impacted
anything negatively? Do you honestly believe that's good use of your brain
power even if true? Do you enjoy being married to that code? Because that's
exactly what happens as nobody but you can work on it. That strategy works out
well for employees to become key men and get big retention bonuses so there is
some merit to your approach :-)

Not to mention the team impact of cowboy coding. It's a selfish approach to
software development and is often undertaken by bully developers that say
things like "use the source" as a way to make you feel dumb for not wanting to
spend all day trying to untangle their likely insane code. All because they
were too lazy to use a little empathy and be a good teammate.

To recap: Tests document and improve your design, verify functionality,
provide leverage to accelerate development, and reduce "bus factor" risk. I
don't care what a small-sample data set says to the contrary; there's really
little debate that, when wielded properly, these tools and techniques improve
software on multiple dimensions that go beyond the software itself and heavily
impact the business.

20+ years of product development experience in lots of teams/products inform
this opinion. Let's see the data on how your business grinds to a halt when
these tools aren't used and the code base gets increasingly larger and complex
and people leave. Tell me about how you don't need those automated tests as
you approach technical debt bankruptcy and all your key people have left. "Use
the source" is a terrible response to that problem.

Rewrites can be fun though so maybe that's the answer. It sure worked out well
for Netscape. ;-)

~~~
valuearb
Unit tests are a poor way to validate code behavior. 30+ years of development
has taught me that validation tests within the code itself are far better.

1) They run whenever the product runs. Good logging code tells you when it
fails, even on customers devices.

2) By being part of the production code, they are far easier to keep up to
date as the code changes even through refactoring.

3) They don’t force you to change anything about how you code.

~~~
jacob019
Agreed. I like to test the validity of a function's input and throw an
exception if it fails. It catches most regressions with a nice error message.

------
fnl
Interesting article, but too many uncertainties: only 100 examples, the 30
code bases w/o unit tests might have tests the author didn't check or see, and
there is likely a cluster effect to be found in this sample. But interesting
enough to warrant a large-scale investigation to see if those results would
really hold.

------
crdoconnor
If you define unit tests as tests which test non-divisible units on a low
level, then by definition they are a form of tight coupling.

The most loosely coupled form of test would be an end to end test (most
loosely coupled, but obviously not without its disadvantages).

Since the pain of dealing with one form of tight coupling is magnified by
having another form present, the presence of unit tests acts as a sort of
canary in the coal mine: if your unit tests are extra painful today, that's
probably because the part they are connecting with has a high degree of tight
coupling. That might (or might not) lead you to refactor and decouple the rest
of the code base in an attempt to mute that pain. Either way you'll feel the
pain.

I believe this effect is real. I also don't do this myself, because I don't
believe in self-flagellation as a means to enlightenment. The best thing to
couple your tests to depends upon a range of factors, IMHO, and getting
religious about where to couple anything is the worst thing you could do,
_especially_ if that religion dictates tight coupling (unit testing all the
things).

Method length and complexity probably go up because if you focus on loose
coupling (to mute the "unit testing pain") to the exclusion of other good code
qualities - less code/less complex code - then those qualities will be
sacrificed.

------
siliconc0w
Very cool, I'd love to see a more scientific approach to understanding
software development methodologies.

What might be interesting is a metric to see how 'refactorable' a codebase is
- i.e. once a class is created, what is the probability there would be
significant changes in the future for highly tested vs untested code.

Also the same metrics along a dimension of typed vs untyped languages (my
hypothesis would be that unit testing would show more benefits for untyped
languages like Python than .NET).

------
zo7
The author mentions that they only analyzed the collection of C# repositories
that they used in their study on Singletons, but I would expect the results to
be different depending on the language used.

In my experience with Python, unit testing is essential to catch careless
errors (typos, bad ducks, etc. that can't be caught by a linter) that would
cause your program to crash at runtime, so there's a slightly different
motivation for writing unit tests. It's a less restrictive language so you
don't have the issues with private methods like the author mentioned, and is
(at least to me) a bit less of a strain to write tests. Unit testing might
correlate more with good hygiene for a Python codebase because testing is
vital to check correctness and there are fewer situations where you have to
mangle your code for the tests, whereas it may not in C# because you get
guarantees from the type system that makes testing less vital and writing
tests is more of a nuisance.

------
ams6110
I was curious about _You can’t (without a lot of chicanery) unit test private
methods_.

Why is that? Maybe you can't test them directly but private methods should be
executed and thus indirectly testable by public methods. If there is code in
private methods that can't be executed by any public method invocation, it's
unnecessary code isn't it?
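
That indirect route can be sketched in Python (a hypothetical `Invoice` class,
with the `_`-prefix convention standing in for `private`): the test only ever
calls the public method, and the private helper gets exercised for free.

```python
class Invoice:
    def __init__(self, lines: list) -> None:
        # lines is a list of (description, amount) pairs
        self._lines = lines

    def _subtotal(self) -> float:
        # "Private" by convention; no test calls this directly.
        return sum(amount for _, amount in self._lines)

    def total(self, tax_rate: float) -> float:
        # Public entry point; any test of total() also covers _subtotal().
        return self._subtotal() * (1 + tax_rate)
```

If `_subtotal` contained code no public method could ever reach, no test of
the public surface would cover it, which is the "unnecessary code" point
above.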

~~~
taeric
I've had a similar debate. Someone kept trying to exclude some of our data
objects from test coverage reports. I agreed we shouldn't bother testing
getters/setters. But those objects had better be used in other tests we are
writing.

~~~
JustSomeNobody
The typical argument for getters/setters is that there is some logic that
should be performed and that's why the fields should not be public. If there's
logic, they should be tested. If there's no logic, they should be public
fields.

~~~
twblalock
No, the point of getters and setters is that the interface to get and set
fields should be the same whether or not logic is being performed behind the
scenes, and the caller should not need to know.

Making fields public makes it impossible to add logic in the future without
changing other code. That breaks encapsulation and is generally non-idiomatic
in languages with getters and setters.
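
The same idea in Python (C# properties work similarly; the `Account` class
below is made up for illustration): the caller's syntax never changes, whether
or not logic sits behind the accessor.

```python
class Account:
    def __init__(self, balance: float) -> None:
        self._balance = balance

    @property
    def balance(self) -> float:
        return self._balance

    @balance.setter
    def balance(self, value: float) -> None:
        # Logic added later without changing any caller: callers still
        # write `account.balance = x` exactly as if it were a plain field.
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value
```

Had `balance` started life as a public field, adding this check later would
not break any call site, which is the encapsulation argument in a nutshell.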

------
daxfohl
My guess is that both variables in these results are almost entirely driven by
external factors. Team size, whether it's pro or side project, deadlines,
experience/talent of developers, whether tests are required or optional, the
language. Hard to draw direct correlations given the sample size.

------
spjt
I can speculate as to why this would happen. People expect unit-tested code to
have shorter methods because it's easier to write tests for shorter methods.

However, writing a new test for a new method can be harder than just stuffing
that code in some other method that already has a test. Depending on how
testing is enforced, you might be able to get away with not having to do
anything at all if the existing test still passes. If you're in a hurry and
the test fails, just change the assertion to whatever happens. Is it right? I
dunno, the test says it's working.

------
danielovichdk
I believe you came to a result too quickly, given the way you measure the code
bases chosen. There are lots of things here that you run over very quickly
without showing anything to back up your right/wrong results.

And sure, tests of course have an impact on design. As they should. There is
consensus, at least for the OOP paradigm, that this is the case.

I also believe you try to make a one-size-fits-all measurement for tests, and
that is simply not fair in terms of how differently these codebases are
designed and implemented.

------
andy_ppp
Awesome article, so interesting. I wonder how language design affects this, I
write much better code in Elixir than I do in Node for example.

Maybe though the solution is simple; if you value having low cyclomatic
complexity/method length/etc., don't let people commit stuff unless it passes
the required level. This is certainly working well for me with
prettier+eslint... I'll look into other tools based on the code smells in this
article, I think!

------
ajmurmann
I would love to see analysis like this based on all the data that codeclimate
must have. The fact that the test coverage buckets end at 50%+ makes me think
that the number of projects that were actually TDDed is very low. I've never
seen a project that was truly TDDed with less than 98% coverage. Typically the
2% come from a config file or something like that.

------
ysleepy
A generally interesting study subject.

But it seems not quite thought through.

Projects might only invest in unit tests once a project exceeds a certain
complexity, or has seen refactorings with age which made the need for tests
clear.

I'm not sure how to normalize properly, but maybe projects solving a similar
problem could be compared. Like different ORMs or Databases or Kernels.

------
davidbanham
This is a great direction for research, but wake me when there's been any
measurement of significance in the results. Trendlines are interesting but
until we've proven that the variation we're seeing is actually statistically
significant with regard to the sample size, these aren't actionable results.

------
mrlyc
When I was writing device drivers for the middleware people, I found that unit
tests were useful for testing the code and making sure the hardware was doing
what it was supposed to. Along with the drivers, I released the test code to
middleware so they could use it in their own code.

------
iamleppert
This is just my experience, but the engineers and managers who are the biggest
proponents of unit tests don't write tests themselves for their own work. I've
noticed this many times in many different jobs, and I'm curious as to why that
is.

------
DmitryOlshansky
While an interesting read, I can’t help but wonder about two things:

1\. This is obviously .NET projects only.

2\. 100 is a drop in the bucket even if we take C#-only projects. How good are
the statistical results - is the sample representative?

------
vikiomega9
Why do we expect fewer lines of code with more unit tests? I don't quite
follow this.

~~~
AdeptusAquinas
Per function, not overall. A function with more lines of code is doing more,
and so is harder to test.

~~~
spc476
No, a function can be long but have a low cyclomatic complexity score [1]
(being a list of do_this(), do_that(), do_somethingelse(), etc) while a short
function can have a higher cyclomatic complexity score (lots of fiddly logic).

[1]
[https://en.wikipedia.org/wiki/Cyclomatic_complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)
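
A contrived Python illustration of the distinction (both functions are made
up): the first is longer but has a single path, the second is shorter but
branchier.

```python
def long_but_simple(xs: list) -> int:
    # A straight line of statements: many lines, exactly one path through
    # the code, so cyclomatic complexity is 1.
    total = 0
    total += sum(xs)
    total *= 2
    total -= 1
    total += 10
    return total


def short_but_complex(a: bool, b: bool, c: bool) -> int:
    # Three lines of fiddly logic: each `if`/`and`/`or` adds a decision
    # point, so this scores far higher despite being much shorter.
    if (a and b) or (not a and c):
        return 1
    return 0
```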

~~~
Dylan16807
You could deliberately do that. But most of the time length and complexity
correlate. More importantly, when you're _splitting up_ functions to make them
testable, you're very much reducing complexity and length in tandem.

~~~
sidlls
Splitting up functions just to make them testable is a code smell, in my
experience. It has correlated highly with cultish adherence to testing for the
sake of testing rather than to achieve an engineering or quality goal.

It might be reducing "cyclomatic complexity" but not necessarily overall
complexity. Often the opposite is true: reduced locality imposes its own costs
in maintenance and efficiency for example.

~~~
vikiomega9
I don't think that makes sense. If I had a longish method, or a task that
could be decomposed into smaller pieces to clearly identify what's happening,
I would prefer to break it up and test those pieces if it made sense.
Obviously this depends on how you do it. DDD talks about this in terms of
services and such, which might be what your long function really is. In which
case you _do_ have to test this large chunk of logic.

------
megamindbrian2
This is exactly what I want to talk about.

