CT = cost of 1 full scale test with necessary infrastructure and labor costs added up
CF = amount of money paid out in SLA claims + subjective estimate of business lost due to reputation damage etc
PF = estimate of probability of this event happening in a given year
if PF * CF > CT, then you run such a test at least once a year. Think of such an expense as an insurance premium.
What Netflix does with their simian army is amortize the cost of doing the test across millions of tests per year and the extra design complications arising from having to deal with failures that often.