
Exploding Software-Engineering Myths (2009) - luu
http://research.microsoft.com/en-us/news/features/nagappan-100609.aspx
======
asQuirreL
After reading the paper [1] regarding the empirical study on TDD, I'm pretty
dubious about how well backed the conclusions are. If you look at Table 2, in
the 4 case studies, 2 of the projects took longer with TDD (2.00x and 2.64x
ratios in man-months, to 3 s.f.) but the other 2 did better (0.319x and 0.476x
ratios in man-months, 3 s.f.). All of this seems to be ignored when proving the
original statement that TDD takes 15-35% longer, which appears to be backed by
what is described as "Management Estimates" [Table 3].

In fact, taking the geometric mean of the man-month ratios (which I _think_,
but am not sure, is the right metric to use when comparing ratios), we find the
average ratio to be 0.947x. Make of this what you will, given the size of the
sample.
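
For what it's worth, here's a quick sketch of that calculation, using the
rounded ratios above (so it comes out slightly under 0.947x):

```python
import math

# TDD-to-control man-month ratios from Table 2, rounded to 3 s.f.
ratios = [2.00, 2.64, 0.319, 0.476]

# Geometric mean: the n-th root of the product of the ratios.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(round(geo_mean, 3))  # ~0.946 with these rounded figures
```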

[1]: http://research.microsoft.com/en-us/groups/ese/nagappan_tdd.pdf

~~~
Silhouette
The Nagappan paper gets cited a fair bit, mostly because there is so little
hard evidence of any sort that actually backs TDD as presented by the big name
evangelists.

However, on a closer reading, even that paper isn't really assessing TDD. It
basically defines any test-first coding practice to be TDD, including those
that still spend considerable time and resources on up-front design.

What we really need is some properly structured study that compares
professional practitioners (not just students or recent grads) over extended
periods (not just a brief project that doesn't show the long-term benefits or
costs in terms of quality and maintainability) who are using no unit tests but
otherwise common development practices (control group), the same but writing
unit tests after the main code, the same but with unit tests written before
the main code, and full test- _driven_ development where other aspects of the
development process, particularly design, are effectively replaced by the
test-writing activities as advocated in well-known TDD books and other
training materials.

Of course it's all but impossible to really achieve that sort of like-for-like
comparison with usefully controlled conditions, which is why these issues are
so hard to debate objectively. But try finding any study with pro-TDD
conclusions that doesn't obviously fail the objectivity test in at least one
of the three ways above (not a sample of professional programmers, too short
term to draw general conclusions, or conflating TDD with some other unit-test-
related development process).

~~~
asQuirreL
You're right, this kind of study is really hard to set up properly. However,
in this instance, I'm more concerned by the quality of the analysis. It seems
to be ignoring actual data in favour of anecdotal evidence, even when the two
clearly aren't correlated at a cursory glance.

------
BurningFrog
"Over a development cycle of 12 months, 35 percent is another four months,
which is huge,” Nagappan says. “However, the tradeoff is that you reduce post-
release maintenance costs significantly, since code quality is so much
better."

Another way of describing this is that the non-TDD teams weren't actually done
when they shipped.

~~~
mathattack
I viewed this as a trade-off. "If you can spare 35% more time, write your
tests first. If speed is so important that you can slough off on testing at
the end, don't."

Sometimes partially perfect and released beats better but later (perhaps not
for Microsoft, but certainly for lots of their competitors). How many more bugs
can you find with software "in the wild"?

This isn't to say it's worth tossing crappy code over the transom, just that
real tradeoffs do exist.

~~~
cbd1984
> Sometimes partially perfect and released beats better but later.

Especially since "later" implies a new market, with new demands on the
product. "Better" then becomes a moving target, and with "Best" you might as
well say "Stop building the road when you reach the horizon and the sky stops
you".

------
ploxiln
Lots of discussion here about TDD and test coverage... but the most
significant factor of "code quality" found was "organizational metrics":

> Organizational metrics, which are not related to the code, can predict
> software failure-proneness with a precision and recall of 85 percent. This
> is a significantly higher precision than traditional metrics such as churn,
> complexity, or coverage that have been used until now to predict failure-
> proneness. This was probably the most surprising outcome of all the studies.

Which I take to confirm my belief that the freedom and responsibility given to
developers, and how good they are, is more significant than the specific
testing policy enforced :)

~~~
UK-AL
More freedom means more freedom to test correctly, as opposed to constantly
being pushed to release as fast as possible.

~~~
mreiland
It's a hell of a lot more complex than that.

A developer with a short deadline is going to design things differently than
one with a longer deadline, and the superior design that comes with more time
can often result in better long-term maintenance.

There's also the idea of requirements gathering. Some people take it for
granted, but if your developers aren't in a position of power in your company,
they can't insist on good requirements for software. I've experienced this
firsthand.

I once worked for a company in which the field engineers brought in a large
chunk of the money for the company. They were the "big dick swingers" in the
company, if you will, while the software development team was an attempt by
the company to automate a lot of the work done by the engineers and then sell
services with the gathered data.

This presented a conflict of interest in that the software would have actively
made the company need fewer engineers, since they could do their job faster,
but because of the two groups' relative status within the company, there was
no way for the software team to insist on accurate specifications.

The day I walked out of that company I had been asked to rewrite a module of
the software for the 3rd time and requested written documentation. I was on
the phone with both my manager and the engineer we were supposed to be
coordinating with. The engineer flat out refused and told me I didn't need it.
I asked him how many years of software dev experience he had.

That's one anecdote, but the point is, politics in a company severely affect
the performance of developers and not simply because of testing. I honestly
think that's an awfully naive view of the world.

~~~
UK-AL
I agree with everything you just said.

Freedom gives freedom to adjust anything as required, not just testing.

Having freedom means being able to ask for longer deadlines to do things
properly, regardless of politics, for example.

------
pacaro
One way of looking at the coverage issue is that high coverage isn't a very
strong positive signal, but low or absent coverage is a strong negative
signal.

It's also interesting to combine metrics like cyclomatic complexity, or
relative churn rate, with coverage. Files/modules with high complexity/churn
and low coverage are much scarier than those with low complexity and low
coverage.
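
To sketch what I mean (made-up module data, and an arbitrary way of combining
the metrics):

```python
# Hypothetical per-module metrics: (cyclomatic complexity, relative churn, line coverage).
modules = {
    "marshalling.py": (45, 0.60, 0.20),  # complex, churning, barely covered: scary
    "constants.py":   (2,  0.05, 0.00),  # trivial and stable: low coverage is fine
    "parser.py":      (30, 0.40, 0.85),  # complex but well covered
}

def risk_score(complexity, churn, coverage):
    """Arbitrary illustration: risk grows with complexity and churn,
    and shrinks as coverage approaches 100%."""
    return complexity * churn * (1.0 - coverage)

for name, (cx, churn, cov) in sorted(modules.items(),
                                     key=lambda kv: -risk_score(*kv[1])):
    print(f"{name:16} risk={risk_score(cx, churn, cov):6.2f}")
```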

~~~
pckspcks
Empirically, I have not seen this to be the case.

Test coverage has had little correlation to code quality on the projects I've
worked on.

I would agree that well-written tests of the most complex parts of the code
help contribute towards a quality system. The remaining tests help sometimes
more and sometimes less, depending on other aspects of the project.

~~~
pacaro
I was trying hard not to make a strong assertion, so I'm not sure what you are
disagreeing with entirely...

TL;DR - it's a matter of context

But here's an anecdote where two complex pieces of code interacted...

A few years ago I was working on a fairly big project (~50 developers) and was
responsible for much of the architecture and, on an implementation level, for
implementing some of the lines that connected boxes in said architecture. One
of those was a line that crossed what can loosely be thought of as a
system/application boundary. This involved both implementing the application
hosting layer and writing the marshalling code that moved data in either
direction; there was a bunch of complexity because of not-quite-convergent
type systems, some threading issues, etc.

So all this is done. Not too many tests have been written, but hey, it works;
if it didn't, nothing in the system would work... A year or so into the
project, the most senior technical person on the team has a flaky test that
once in a while just completely fails to make progress. He investigates, and
comes to the conclusion that once in a blue moon the transport layer (my
code) is dropping a message.

Needless to say I'm totally flummoxed; there is no code path that can drop
messages without also creating a huge stink in the logs. I spend two months
investigating this, and eventually I have a consistent repro: a fairly big
hole in the other guy's state machine that causes it to just stop making
forward progress.

So despite the fact that a test exposed the bug, neither piece of complex code
really had proper tests. I would assert that the "line" didn't need the same
level of tests, since it was being exercised constantly by dozens of "boxes",
but the unicorn "box" with the logic error could sure as hell have used some
unit tests to validate its state transitions.

------
douche
One thing I have never understood is how to apply TDD when the code that you
are writing is nearly 100% dependent on external services and "black-box"
library code. For instance, I rely on several more or less undocumented
Microsoft APIs in my day job. How do I even know what the correct outputs of
code interacting with those components would be? It could depend on server
policies, registry settings, firewall settings, SQL Server settings, Active
Directory settings, and a hundred other things. My application code is bog-
simple next to all of the code handling possible FUBAR states of this
environment.

~~~
TheDong
Your issue is that you're not mocking correctly. You need to write mocks for
the services you depend on, ideally by recording and replaying real API
interactions, so that your tests are correct given the behaviour of the
libraries you depend on. Then write your TDD tests, then write the code, then
write integration tests that verify your original mocks behave as assumed.
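
As a rough sketch in Python (the payment_api client and its method are made
up, standing in for whatever undocumented service you depend on; the canned
return value would ideally come from a recorded real interaction):

```python
from unittest.mock import Mock

# Hypothetical external client; in reality this would be the undocumented API.
payment_api = Mock()
payment_api.get_account_status.return_value = {"id": 42, "status": "active"}

def is_account_usable(api, account_id):
    """Code under test: depends only on the injected api object."""
    return api.get_account_status(account_id)["status"] == "active"

# Test written first (TDD), running against the recorded/mocked behaviour.
assert is_account_usable(payment_api, 42) is True
payment_api.get_account_status.assert_called_once_with(42)
```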

~~~
mikekchar
Another approach (which is not really so different) is to build adapters
around external services and then write your code to use the adapters. The
upside is that you can avoid writing integration tests which might be very
difficult to set up (depending on the situation). The downside is that you
might implement more in the adapter than you need.

In practice, I tend to do a mixture. I will mock/fake the external services,
then build adapters, and finally remove the mocks/fakes. In the final stage
you may have to stub the adapters. The unit tests on the adapters will alert
me to changes in the external services, so this is relatively safe to do. Your
main source of error will be subtle bugs where you have insufficient coverage
in the adapters. You must also use restraint when working on the adapters
independently of the code that uses them, because you have no end-to-end
tests.

Building fakes instead of mocks in this situation is also a good way to really
test your understanding of the external services. It takes more time, but it
can pay dividends. Many times I will write a fake and then think, "Does it
really work that way?" only to realize that I can't actually use the external
service in the way I envisioned. With a mock, you completely divorce yourself
from implementation so it is relatively easy to assume that the external
service can do impossible things. You can go a long way before you realize
that your code is unusable.
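
A minimal sketch of the adapter-plus-fake idea (all names here are
hypothetical, standing in for whatever external service you wrap):

```python
class DirectoryAdapter:
    """Thin adapter around an external directory service. The real
    implementation would call the external API; only the methods our
    code actually needs are exposed."""
    def lookup_email(self, username):
        raise NotImplementedError("the real adapter wraps the external service")

class FakeDirectory(DirectoryAdapter):
    """In-memory fake used in tests. Writing it forces you to state your
    assumptions about how the real service behaves."""
    def __init__(self, entries):
        self._entries = entries

    def lookup_email(self, username):
        if username not in self._entries:
            raise KeyError(username)  # assumption: unknown users are errors
        return self._entries[username]

def notify(directory, username):
    """Application code depends on the adapter, not on the external service."""
    return f"mail sent to {directory.lookup_email(username)}"

# Usage in a test:
fake = FakeDirectory({"alice": "alice@example.com"})
assert notify(fake, "alice") == "mail sent to alice@example.com"
```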

One thing I will caution -- mocks, fakes, stubs, and integration tests tend to
attract people with almost religious views on how they should be used. I tend
to believe that this stuff is hard and that current industry practice is very
naive. I think there is a lot of room for experimentation with a very big
upside. But you are likely to attract a lot of criticism if you wander outside
someone's boundaries, so it is best to spend significant amounts of time
pairing to make sure that everybody is comfortable with the approaches you
use. Nothing is worse than "test wars" where people are going in 100 different
directions in the tests.

~~~
rjbwork
This is the approach we take: Adapters+mocks for external services, but we're
okay with hard dependencies (via injection if possible) on non-network
libraries. This includes things like json serializers, other serializers,
automappers, math libraries, etc.

------
aetherson
You know, I understand and agree that code coverage isn't the be-all and end-
all of testing. But... we do all secretly agree that if it isn't _pretty_
high, then the testing isn't doing much, right?

Like, does anyone feel really good about saying there's good test coverage of
a class with 50% code coverage? 60%?

After I hit 80-90% coverage, I start wanting to know about other attributes of
the tests. Are they really really really testing that all the outputs are
exactly what we expect? Are they controlling dependencies in a useful,
realistic way? Are they exploring a wide range of inputs?

Before that, though, all of the above seem a bit premature. I don't deeply
care if you've done a really thorough job of testing 20% of the code if the
other 80% isn't even run.

~~~
sbov
Like most things, the answer is... it depends.

We have several with 0% and I'm just fine with that. They're exceedingly
simple. And/or they're 15 years old and only change in very slight ways every
few years.

There are other classes at 100% that I'm not fine with, because while the
code is short, the implications are complex.

I feel like blindly talking about these metrics is like saying "hey, I bought
an earring for $100, did I get ripped off?"

~~~
aetherson
I think that you're broadening the question. Sure, sometimes you feel okay
with code that's not at all tested. But surely you don't feel that code that
is not at all tested _is well tested_.

I'm talking about to what extent code coverage can and can not be used as a
proxy for "well tested," not to what extent "well tested" can and can not be
used as a proxy for "I feel good about my code."

------
tokenrove
A more recent paper that's related to this, "Coverage Is Not Strongly
Correlated with Test Suite Effectiveness":
http://www.linozemtseva.com/research/2014/icse/coverage/coverage_paper.pdf

I wouldn't want people to conclude that coverage was a useless metric based on
this, but it does seem to support the idea that chasing code coverage
dogmatically is not necessarily the best use of your time.

~~~
wyldfire
I agree -- IMO being dogmatic about 100% or any other number can be a big
waste of time.

But a coverage ratio and/or browsing annotated source is a good feedback
mechanism for how many tests you've written.

------
aout
I'm not sure I fully understood the moral of the article, but it seems to me
that all of these findings are basically what "people knew for a long time".
Okay, it's good to put data on "beliefs", but is it really useful when
quantifying "promoted usage" and "obligation" has failed?

The article does show that some people tried to impose (via management or
structure in large organizations) metrics on coding as a way to rationalize
development quality and miserably failed, but it clearly lacks the
counterpart: what happens when you don't impose these metrics? Experience is
important, but that's not new, is it? In the end it just feels like: small
programs won't have many problems, nor will small organizations -
big organizations should function like small companies. Not much of an answer.

Also, I'm not sure that a manager saying something like "according to the
data, we need one more month to make the program 35% more stable" is a good
argument (or even a new one). What would be interesting is to try something
like "ship the program one month earlier and dedicate the two remaining months
to fixing what's broken according to feedback", then compare with the first
approach.

------
amelius
"higher code coverage was not the best measure of post-release failures in the
field"

Perhaps they should measure the coverage by taking the number of tested paths
through the software divided by the total number of paths.
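
A toy illustration of how the two metrics diverge (a sketch, not from the
article):

```python
def scale(x, y):
    # Two independent branches -> four paths through the function.
    if x < 0:
        x = -x
    if y == 0:
        y = 1
    return x / y

# These two calls execute every line (100% line coverage)...
scale(-2, 0)
scale(3, 4)

# ...but the paths (x < 0, y != 0) and (x >= 0, y == 0) are never
# exercised, so path coverage is only 2 out of 4.
```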

------
jraedisch
The mentioned technical paper about assertions states:

"The assertions are primarily of two types of specification: specification of
function interfaces (e.g. consistency between arguments, dependency of return
value on arguments, effect of global state etc.) and specification of function
bodies (e.g. consistency of default branch in switch statement, consistency
between related data etc.)."

from
[http://research.microsoft.com/pubs/70290/tr-2006-54.pdf](http://research.microsoft.com/pubs/70290/tr-2006-54.pdf)

Is there somebody here who could elaborate on this, ideally with some small
examples?

~~~
qznc
You want to have examples for assertions?

Consistency between arguments: memcpy "The memory areas must not overlap."

Dependency of return value on arguments: max returns one of its arguments. r =
max(x,y); assert (r == x || r == y);

Effect on global state: printf might change the global errno variable.
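
As a small sketch (mine, not from the paper), the same ideas written as
executable assertions:

```python
def clamp(value, low, high):
    # Consistency between arguments (function-interface assertion).
    assert low <= high, "low must not exceed high"
    result = min(max(value, low), high)
    # Dependency of return value on arguments.
    assert low <= result <= high
    return result

r = max(3, 7)
assert r == 3 or r == 7  # max returns one of its arguments
```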

~~~
jraedisch
So this means embedding small tests into the production code itself while
accepting a minor performance loss? And you somehow get a feeling through
experience for when this will help? I wonder whether this impacts all
programming languages the same.

~~~
qznc
It is not necessarily about runtime tests. Often assertions are only used for
debug builds. Java requires an extra flag to enable assert statements.
Alternatively, assertions could be invariants or pre/post conditions for your
formal proofs of correctness.
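
For example, in Python assert statements are removed when the interpreter runs
with -O, so they behave like debug-only pre/post condition checks (a sketch):

```python
def withdraw(balance, amount):
    # Precondition: checked in normal runs, stripped under `python -O`.
    assert 0 <= amount <= balance, "invalid withdrawal"
    new_balance = balance - amount
    # Postcondition: balances never go negative.
    assert new_balance >= 0
    return new_balance
```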

I'm too lazy to check whether the actual paper clarifies what they mean. ;)

------
2muchcoffeeman
_" Code coverage measures how comprehensively a piece of code has been tested;
if a program contains 100 lines of code and the quality-assurance process
tests 95 lines, the effective code coverage is 95 percent."_

I've never heard of this before. Where does this come from and why did they
suspect this would be indicative of anything?

~~~
cgh
The idea is that shipping untested code is worse than shipping (unit) tested
code. Code coverage reporting is quite common and I'm surprised you've never
even heard of it if you've ever written software professionally.

~~~
tbrownaw
It's popular among the hobbyist / activist crowd.

People who like to talk about the "proper" way to develop software tend to
like it.

People who run open source projects tend to either like it or bow to peer
pressure.

But people working on in-house and enterprisey stuff? Yeah, not so much there.

~~~
KMag
> But people working on in-house and enterprisey stuff? Yeah, not so much
> there.

This depends a lot on your corporate development culture, even for in-house
enterprisey software.

------
cubano
_there needs to be a culture of using assertions in order to produce the
desired results._

IMO, proper "culture" is a significant part of almost all aspects of project
management, and without it the teams will never reach their optimum potential,
unless some highly motivated dev takes it upon their shoulders to implement
these base tenets for everyone else.

I'm not saying it's always easy to implement or that it guarantees success,
but the lack of it always rears its ugly head at some point in the development
process.

------
serve_yay
Data helps. But you still have to know how to apply it.

------
wampler
The Mythical Man-Month is damn correct.

