> Perhaps the "index subsystem" and "placement subsystem" are small enough for full-scale tests to be tractable, but certainly not cheap, and how often do you run it?

Rough guide:

CT = cost of one full-scale test, with the necessary infrastructure and labor costs added up

CF = cost of the failure the test is meant to catch: money paid out in SLA claims + a subjective estimate of business lost due to reputation damage etc.

PF = estimated probability of that failure happening in a given year

If PF * CF > CT, i.e. the expected yearly loss from the failure exceeds the cost of one test, then you run such a test at least once a year. Think of the expense as an insurance premium.
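To make the rule of thumb concrete, here is a minimal Python sketch of the comparison. The function names and the dollar figures are made up for illustration; plug in your own estimates.

```python
def expected_annual_loss(pf: float, cf: float) -> float:
    """Expected yearly cost of the failure: PF * CF."""
    if not 0.0 <= pf <= 1.0:
        raise ValueError("pf must be a probability between 0 and 1")
    return pf * cf


def should_test_yearly(ct: float, cf: float, pf: float) -> bool:
    """Rule of thumb from above: test at least yearly if PF * CF > CT."""
    return expected_annual_loss(pf, cf) > ct


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real system.
    ct = 50_000      # cost of one full-scale test
    cf = 2_000_000   # SLA payouts + estimated lost business
    for pf in (0.05, 0.01):
        verdict = "run the test" if should_test_yearly(ct, cf, pf) else "skip it"
        print(f"PF={pf}: {verdict}")
```

With these invented figures, a 5% yearly failure probability justifies the test and a 1% one does not. CF and PF are both guesses, so it is worth trying a range of values before trusting the answer.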

What Netflix does with its Simian Army is amortize the cost of doing the test, and of the extra design complications that come from having to deal with failures that often, across millions of tests per year.

