
Testing Can Prevent Most Critical Failures: An Analysis of Production Failures - cpeterso
http://neverworkintheory.org/2014/10/08/simple-testing-can-prevent-most-critical-failures.html
======
bkirwi
The study argues that most bugs found in production can be reproduced with
simple test cases. This does _not_ imply that just adding a bunch of simple
test cases would have prevented these bugs. At least some of the tested
projects have very large test suites, and all of these bugs made it through
that first line of defense.

All this implies that the projects weren't testing the right stuff. The
suggestion to spend more time thinking about error cases is probably a good
one; in almost all cases people forget about the fascinating variety of ways
in which things might fail. On the other hand, when you have a large number of
permutations to test, things get a lot messier:

> The specific order of events is important in 88% of the failures that
> require multiple input events.

In cases like this, you get a lot more mileage out of Jepsen-style torture
testing and QuickCheck-style property testing, where the code is tested with
large numbers of random inputs. This simplifies the programmer's job a lot,
since they're no longer responsible for intuiting an exact series of inputs
that might make something fall over.
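
For what it's worth, here's a minimal sketch (in Python, assuming the
`hypothesis` library) of that style of property test over random _sequences_
of operations, checking a toy system against a trivial reference model.
`BoundedBuffer` is just an illustrative stand-in, not anything from the study:

```python
from hypothesis import given, strategies as st

class BoundedBuffer:
    """Toy system under test: a FIFO buffer that drops pushes past its limit."""
    def __init__(self, limit=8):
        self.limit, self.items = limit, []
    def push(self, x):
        if len(self.items) < self.limit:
            self.items.append(x)
    def pop(self):
        return self.items.pop(0) if self.items else None

# Generate random *sequences* of operations, since the order of events matters.
operations = st.lists(st.one_of(
    st.tuples(st.just("push"), st.integers()),
    st.tuples(st.just("pop"), st.none()),
))

@given(operations)
def test_buffer_matches_model(ops):
    buf, model = BoundedBuffer(), []
    for name, arg in ops:
        if name == "push":
            buf.push(arg)
            if len(model) < 8:          # mirror the buffer's limit in the model
                model.append(arg)
        else:
            assert buf.pop() == (model.pop(0) if model else None)
    assert buf.items == model
```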

Of course, not all failures are even this difficult to flush out. It's
interesting that the authors got quite quick and substantial gains from their
code analysis tool, especially when you look at how simple it is:

> (i) the error handler is simply empty or only contains a log printing
> statement, (ii) the error handler aborts the cluster on an overly-general
> exception, and (iii) the error handler contains expressions like “FIXME” or
> “TODO” in the comments
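
Here's a rough sketch (not the authors' tool) of rules (i) and (iii) applied
to Java-style source; a real checker would use a proper parser, whereas this
regex only catches handler bodies with no nested braces:

```python
import re, sys

# Match catch blocks whose body has no nested braces (good enough for a sketch).
CATCH = re.compile(r"catch\s*\([^)]*\)\s*\{([^{}]*)\}", re.DOTALL)

def suspicious_handlers(source):
    findings = []
    for match in CATCH.finditer(source):
        body = match.group(1)
        statements = [s.strip() for s in body.split(";") if s.strip()]
        if not statements:
            findings.append("empty handler")
        elif len(statements) == 1 and re.match(r"log", statements[0], re.IGNORECASE):
            findings.append("log-only handler")
        if "TODO" in body or "FIXME" in body:
            findings.append("TODO/FIXME in handler")
    return findings

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path) as src:
            for finding in suspicious_handlers(src.read()):
                print(f"{path}: {finding}")
```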

~~~
JoeAltmaier
I don't see how trusting in test cases isn't the same as saying "hindsight is
20/20". Blaming bad test cases is easy; thinking of the right test cases up
front is hard.

I agree that lots of random thrashing is good. It may not find all the bugs,
but boy howdy it will shake out many of them. Where I work we have our 'bot
army': hundreds of programmed clients that log into the same space and
thrash around, chatting and videoing and switching their mics and headsets on
and off. It's a threshold for a release to run a week on the bot army without
issues (crashes, leaks, stuck bots).
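
In spirit, each bot is something like this toy sketch (`SpaceClient` and its
methods are hypothetical stand-ins, not our real client API):

```python
import asyncio, random

class SpaceClient:
    """Hypothetical client; stand-in for whatever the real clients use."""
    async def login(self, name): ...
    async def chat(self, msg): ...
    async def toggle_mic(self): ...
    async def toggle_video(self): ...

async def bot(name, duration_s):
    client = SpaceClient()
    await client.login(name)
    actions = [lambda: client.chat("hello"), client.toggle_mic, client.toggle_video]
    deadline = asyncio.get_running_loop().time() + duration_s
    while asyncio.get_running_loop().time() < deadline:
        await random.choice(actions)()                  # thrash: random action...
        await asyncio.sleep(random.uniform(0.1, 2.0))   # ...at a random pace

async def main(n_bots=200, duration_s=7 * 24 * 3600):   # week-long soak run
    await asyncio.gather(*(bot(f"bot-{i}", duration_s) for i in range(n_bots)))

if __name__ == "__main__":
    asyncio.run(main())
```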

------
mpweiher
Love the name of the blog: "It will never work in theory", indeed!

When I did my first greenfield TDD project[1], I was utterly amazed by the low
defect rate (3 in a year, only one really related to coding) we achieved with
a fairly relaxed attitude towards comprehensive code coverage and edge cases.

This seems to be related to a slightly different observation, which is that
most of the really awful/pernicious bugs tend to be super simple/stupid once
you have found them. You know, the "slap your forehead" type of bugs that you
just couldn't see because they were too obvious.

My suspicion was that firing even a few Monte Carlo rays into that potentially
huge state/argument space is sufficient to induce writing the correct code in
the vast majority of cases; it's great to see pretty convincing empirical
evidence for it (rather than just anecdotal).
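
(The bare-bones version of those rays, no framework needed: sample a handful
of random points from the argument space and check an invariant. `clamp` here
is just a throwaway example.)

```python
import random

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# Fire a few random "rays" into the argument space and check a simple invariant.
for _ in range(100):
    lo, hi = sorted(random.uniform(-1e6, 1e6) for _ in range(2))
    x = random.uniform(-1e6, 1e6)
    assert lo <= clamp(x, lo, hi) <= hi
```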

Of course, we all "know" that testing is insufficient; after all, Dijkstra said
so. Did I mention I love the name "It will never work in theory"?

[1] [http://www.springerprofessional.de/011---in-process-rest-at-the-bbc/4852682.html](http://www.springerprofessional.de/011---in-process-rest-at-the-bbc/4852682.html)

------
grndn
There are some fascinating stories on using QuickCheck to test Riak and other
distributed systems. For example:
[http://basho.com/tag/quickcheck/](http://basho.com/tag/quickcheck/) and
[https://skillsmatter.com/skillscasts/4505-quickchecking-riak](https://skillsmatter.com/skillscasts/4505-quickchecking-riak)
(registration required)

------
lostcolony
It consistently amazes me how many developers write code without thinking
about tests. I've entered large-scale projects where I had to write code that
interacted with a number of other components, and when I ask "What happens
when (possible failure case)?" I get a shrug and "We've never seen that happen
in development", and when I ask "Okay, how can I mock that failure case out
sufficiently that we can see what the system does, and make sure it handles it
okay?" I get a deer-in-the-headlights look back.

------
tomerbd
and how do you test a system with tens of components each runs in its own
service with its own pace its own failures? for some reasons people treat
testing just as unit test, i don't get how you test the whole big thing as a
whole. in addition you never get the same hardware nor the same software,
network management, same number of servers/services as you have in production.
This is the real challenge, unit testing is piece of cake and I never skip it,
how to test the whole system this is where I get confused (without getting
test that run for hours which are unusable and overly complex).

