All this implies that the projects weren't testing the right stuff. The suggestion to spend more time thinking about error cases is probably a good one; in almost all cases people forget about the fascinating variety of ways in which things might fail. On the other hand, when you have a large number of permutations to test, things get a lot messier:
> The specific order of events is important in 88% of the failures that require multiple input events.
In cases like this, you get a lot more mileage out of Jepsen-style torture testing and QuickCheck-style property testing, where the code is tested with large numbers of random inputs. This simplifies the programmer's job a lot, since they're no longer responsible for intuiting an exact series of inputs that might make something fall over.
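To make the property-testing side of that concrete, here is a minimal QuickCheck-style test using Python's Hypothesis library. The function under test, `merge_sorted`, is a made-up example rather than anything from the paper: you state a property that must hold for all inputs and let the library generate the inputs.

```python
# A minimal property-based test sketch using Hypothesis (a QuickCheck-style
# library for Python). merge_sorted is a hypothetical function under test.
from hypothesis import given, strategies as st

def merge_sorted(xs, ys):
    """Merge two already-sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            out.append(xs[i])
            i += 1
        else:
            out.append(ys[j])
            j += 1
    return out + xs[i:] + ys[j:]

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_sorted(xs, ys):
    # Hypothesis generates many random input pairs; we only state the
    # property that must hold, not any specific sequence of inputs.
    result = merge_sorted(sorted(xs), sorted(ys))
    assert result == sorted(xs + ys)
```

The programmer's only job here is to describe what "correct" means; the search for an input that breaks it is handed off to the tool.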
Of course, not all failures are even this difficult to flush out. It's interesting that the authors got quite quick and substantial gains from their code analysis tool, especially when you look at how simple it is:
> (i) the error handler is simply empty or only contains a log printing statement, (ii) the error handler aborts the cluster on an overly-general exception, and (iii) the error handler contains expressions like “FIXME” or “TODO” in the comments
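To get a feel for how little machinery those rules need, here is a rough Python approximation of a checker along the same lines, built on the standard `ast` module. This is a sketch of the general idea, not the authors' implementation:

```python
# A rough Python approximation of those three rules, using the standard
# ast module. Sketch only; not the paper's tool.
import ast

OVERLY_GENERAL = {"Exception", "BaseException"}
LOG_METHODS = {"debug", "info", "warning", "error", "exception"}

def _is_log_call(stmt):
    """True for statements like `log.warning(...)`."""
    return (isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Call)
            and isinstance(stmt.value.func, ast.Attribute)
            and stmt.value.func.attr in LOG_METHODS)

def _is_abort_call(stmt):
    """True for statements like `sys.exit(...)` or `abort(...)`."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    func = stmt.value.func
    name = getattr(func, "attr", None) or getattr(func, "id", None) or ""
    return name in {"exit", "abort"}

def check_handlers(source):
    """Flag except blocks that look like under-designed error handlers."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Rule (i): handler is empty (just `pass`) or only prints a log line.
        if all(isinstance(s, ast.Pass) or _is_log_call(s) for s in node.body):
            findings.append((node.lineno, "empty or log-only handler"))
        # Rule (ii): handler aborts on an overly-general exception type.
        caught = node.type.id if isinstance(node.type, ast.Name) else None
        if caught in OVERLY_GENERAL and any(_is_abort_call(s) for s in node.body):
            findings.append((node.lineno, "aborts on overly-general exception"))
        # Rule (iii): the handler's source contains "FIXME" or "TODO".
        text = ast.get_source_segment(source, node) or ""
        if "FIXME" in text or "TODO" in text:
            findings.append((node.lineno, "FIXME/TODO in handler"))
    return findings
```

None of this requires understanding the program's semantics; it just walks the syntax tree of the exception handlers, which is part of why such a tool can be built and run so cheaply.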
I agree that lots of random thrashing is good. It may not find all the bugs, but boy howdy it will shake out many of them. Where I work we have a 'bot army': hundreds of programmed clients that log into the same space and thrash around, chatting, videoing, and switching their mics and headsets on and off. It's a threshold for a release to run for a week on the bot army without issues (crashes, leaks, stuck bots).

When I did my first greenfield TDD project, I was utterly amazed by the low defect rate we achieved (3 in a year, only one really related to coding) despite a fairly relaxed attitude towards comprehensive code coverage and edge cases.
This seems to be related to a slightly different observation, which is that most of the really awful/pernicious bugs tend to be super simple/stupid once you have found them. You know, the "slap your forehead" type of bugs that you just couldn't see because they were too obvious.
My suspicion was that firing even a few Monte Carlo rays into that potentially huge state/argument space is enough to induce writing the correct code in the vast majority of cases; it's great to see pretty convincing empirical evidence for it (rather than just anecdotes).
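A bare-bones version of that "few random rays" idea doesn't even need a framework: sample a handful of random points in the argument space and check an invariant. The function under test here (`clamp`) and its invariant are hypothetical stand-ins, not anything from the paper.

```python
# Sample random points in the argument space and check an invariant.
# clamp and its invariant are hypothetical stand-ins.
import random

def clamp(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def monte_carlo_check(trials=100, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        lo, hi = sorted(rng.uniform(-1e6, 1e6) for _ in range(2))
        x = rng.uniform(-1e6, 1e6)
        # Even a few random samples will catch mistakes like swapped
        # lo/hi arguments or mishandled boundaries.
        assert lo <= clamp(x, lo, hi) <= hi

monte_carlo_check()
```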
Of course, we all "know" that testing is insufficient; after all, Dijkstra said so. Did I mention I love the name "This will never work in theory"?