Hacker News
Testing Can Prevent Most Critical Failures: An Analysis of Production Failures (neverworkintheory.org)
57 points by cpeterso on Oct 10, 2014 | 6 comments

The study argues that most bugs found in production can be reproduced with simple test cases. This does not imply that just adding a bunch of simple test cases would have prevented these bugs. At least some of the tested projects have very large test suites, and all of these bugs made it through that first line of defense.

All this implies that the projects weren't testing the right stuff. The suggestion to spend more time thinking about error cases is probably a good one; in almost all cases people forget about the fascinating variety of ways in which things might fail. On the other hand, when you have a large number of permutations to test, things get a lot messier:

> The specific order of events is important in 88% of the failures that require multiple input events.

In cases like this, you get a lot more mileage out of Jepsen-style torture testing and QuickCheck-style property testing, where the code is tested with large numbers of random inputs. This simplifies the programmer's job a lot, since they're no longer responsible for intuiting an exact series of inputs that might make something fall over.
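Here's a rough sketch of what that style of testing looks like with just the standard library (the `Store` class and its bug are invented for illustration): run random sequences of operations against the code under test and against a trivially-correct model, and compare. The random generator, not the programmer, is responsible for finding the ordering that breaks things.

```python
# Hypothetical sketch of QuickCheck/Jepsen-style random testing: apply random
# operation sequences to the code under test and to a trivially-correct model
# (a plain dict), then compare. Store and its bug are invented for illustration.
import random

class Store:
    """Toy key-value store with a deliberate ordering bug."""
    def __init__(self):
        self.data = {}
        self.deleted = set()
    def put(self, k, v):
        self.data[k] = v
        # Bug: forgetting to clear the tombstone on re-insert.
        # self.deleted.discard(k)
    def delete(self, k):
        self.data.pop(k, None)
        self.deleted.add(k)
    def get(self, k):
        if k in self.deleted:
            return None
        return self.data.get(k)

def random_trace(rng, length=20):
    """Build a random sequence of (op, key, value) events."""
    return [(rng.choice(["put", "delete", "get"]),
             rng.choice("abc"), rng.randint(0, 9)) for _ in range(length)]

def run_trace(trace):
    """Run one trace against Store and the model; return the trace on mismatch."""
    store, model = Store(), {}
    for op, k, v in trace:
        if op == "put":
            store.put(k, v); model[k] = v
        elif op == "delete":
            store.delete(k); model.pop(k, None)
        elif store.get(k) != model.get(k):
            return trace  # a failing input, found without human intuition
    return None

rng = random.Random(0)
failures = [t for t in (random_trace(rng) for _ in range(1000)) if run_trace(t)]
print("bug found by random testing:", bool(failures))
```

Any trace that happens to do delete-then-put-then-get on the same key exposes the bug; with a thousand random traces that ordering shows up on its own, which is exactly the point when "the specific order of events is important in 88% of the failures."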

Of course, not all failures are even this difficult to flush out. It's interesting that the authors got quite quick and substantial gains from their code analysis tool, especially when you look at how simple it is:

> (i) the error handler is simply empty or only contains a log printing statement, (ii) the error handler aborts the cluster on an overly-general exception, and (iii) the error handler contains expressions like “FIXME” or “TODO” in the comments
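To see how little machinery those three heuristics need, here's a hypothetical re-creation in a few lines of Python (the paper's actual tool is not this; a regex over Java-style catch blocks is just enough to show the idea):

```python
# Hypothetical sketch of the paper's three heuristics as a regex scan over
# Java-style catch blocks: (i) empty or log-only handlers, (ii) aborting on
# an overly-general exception, (iii) TODO/FIXME comments in the handler.
import re

CATCH_BLOCK = re.compile(
    r"catch\s*\(\s*(?P<exc>\w+)[^)]*\)\s*\{(?P<body>[^{}]*)\}", re.DOTALL)

def suspicious_handlers(source):
    """Return (exception, reason) pairs for handlers matching the heuristics."""
    findings = []
    for m in CATCH_BLOCK.finditer(source):
        exc, body = m.group("exc"), m.group("body")
        statements = [s for s in body.split(";")
                      if s.strip() and not s.strip().startswith("//")]
        if re.search(r"//.*\b(TODO|FIXME)\b", body):
            findings.append((exc, "TODO/FIXME in handler"))
        elif not statements:
            findings.append((exc, "empty handler"))
        elif all("log" in s.lower() for s in statements):
            findings.append((exc, "handler only logs"))
        elif exc in ("Exception", "Throwable") and any(
                "abort" in s or "System.exit" in s for s in statements):
            findings.append((exc, "aborts on over-general exception"))
    return findings

java = """
try { connect(); } catch (IOException e) { }
try { sync(); } catch (Exception e) { log.error(e); abort(); }
try { flush(); } catch (IOException e) { // TODO handle this
}
"""
for exc, reason in suspicious_handlers(java):
    print(exc, "-", reason)
```

A real checker would parse the AST rather than regex-match, but the rules themselves really are this shallow, which makes the reported gains all the more striking.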

I don't see how trusting in test cases isn't just saying "hindsight is 20/20". Blaming bad test cases is easy; thinking of the right test cases up front is hard.

I agree that lots of random thrashing is good. It may not find all the bugs, but boy howdy it will shake out many of them. Where I work we have our 'bot army': hundreds of programmed clients that log into the same space and thrash around, chatting and videoing and switching their mic and headset on and off. It's a threshold for a release to run for a week on the bot army without issues (crashes, leaks, stuck bots).

Love the name of the blog: "This will never work in theory", indeed!

When I did my first greenfield TDD project[1], I was utterly amazed by the low defect rate (3 in a year, only one really related to coding) we achieved with a fairly relaxed attitude towards comprehensive code coverage and edge cases.

This seems to be related to a slightly different observation, which is that most of the really awful/pernicious bugs tend to be super simple/stupid once you have found them. You know, the "slap your forehead" type of bugs that you just couldn't see because they were too obvious.

My suspicion was that firing even a few Monte Carlo rays into that potentially huge state/argument space is sufficient to induce writing the correct code in the vast majority of cases; it's great to see fairly convincing empirical evidence for it (rather than just anecdote).

Of course, we all "know" that testing is insufficient, after all Dijkstra said so. Did I mention I love the name "This will never work in theory"?

[1] http://www.springerprofessional.de/011---in-process-rest-at-...

There are some fascinating stories on using QuickCheck to test Riak and other distributed systems. For example: http://basho.com/tag/quickcheck/ and https://skillsmatter.com/skillscasts/4505-quickchecking-riak (registration required)

It consistently amazes me how many developers write code without thinking about tests. I've entered large-scale projects where I had to write code that interacted with a number of other components, and when asking "What happens when (possible failure case)?" I get a shrug: "We've never seen that happen in development." And when I ask "Okay, how can I mock that failure case out sufficiently that we can see what the system does, and make sure it works okay?" I get a deer-in-the-headlights look back.
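For what it's worth, mocking a failure case out usually takes only a few lines. A minimal sketch, assuming a hypothetical `UserService` that wraps some HTTP client (the names here are invented; the point is how cheap the failure injection is):

```python
# Hypothetical sketch: injecting a failure nobody has "seen in development"
# with unittest.mock. UserService and its http client are invented names.
from unittest import mock

class UserService:
    def __init__(self, http):
        self.http = http
    def fetch_user(self, uid):
        try:
            return self.http.get(f"/users/{uid}")
        except ConnectionError:
            return None  # degrade gracefully instead of crashing

# Three lines of setup to simulate the backend being down:
http = mock.Mock()
http.get.side_effect = ConnectionError("backend down")
svc = UserService(http)
print(svc.fetch_user(42))  # prints None: the service survives the failure
```

Once the dependency is injectable, every "what happens when X fails?" question becomes a one-line `side_effect` assignment instead of a shrug.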

And how do you test a system with tens of components, each running as its own service at its own pace with its own failures? For some reason people treat testing as just unit testing; I don't get how you test the whole big thing as a whole. In addition, you never get the same hardware, software, network management, or number of servers/services as you have in production. This is the real challenge. Unit testing is a piece of cake and I never skip it, but testing the whole system is where I get confused (without ending up with tests that run for hours and are unusable and overly complex).
