Could you imagine, say, a utility company creating a title called 'Chaos Engineer'?
If Netflix's system fails, the worst thing that could happen is you don't get to watch Orange is the New Black. If your bank's system fails, you may lose payments, fail to pay your bills on time, or worse.
Netflix's cost of failure is very high -- customers get real-time discovery of your failures, and the service costs less than $10/mo, so they aren't deeply invested in it and will drop you like a hot potato.
For many people, changing bank accounts is very difficult, which is one of the reasons that many banks are able to treat customers with something close to contempt.
The only real difference is the load on staging would be generated artificially [e.g. ACH, credit card processing] using faked staging-only accounts. You even have a fake banking website for security audits that is a clone of production with the fake accounts.
No risk to production and probably 80% of the benefits. Of course it's probably 125% of the effort.
Yeah, ok that had nothing to do with the OP, sorry, I just had to vent.
The kinds of things I work on now are entire "environments" we sell, so "production" testing is more along the lines of "go plug the box in and do stuff to it".
2) The projects I work on at $DAY_JOB that I control all have 3 hardware nodes for HA in both staging & production. Staging is under artificial load that plays back the same series of events on dummy data. [e.g. Job A runs -> Completes -> Puts a mock job in Staging's Job Queue -> runs on Staging with mocked data -> I randomly restart a node to see what happens]
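To make the replay-plus-random-restart idea concrete, here is a toy sketch in Python. It only simulates the loop in memory; the names (`Node`, `StagingCluster`, `replay_with_chaos`) and the "a restarting node is down for exactly one job" rule are my assumptions for illustration, not how any real staging setup is wired.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    up: bool = True


@dataclass
class StagingCluster:
    """Toy stand-in for a 3-node HA staging cluster (hypothetical)."""
    nodes: list
    completed: list = field(default_factory=list)
    failed: list = field(default_factory=list)

    def run_job(self, job_id):
        # A mock job succeeds if at least one node is up to take it.
        healthy = [n for n in self.nodes if n.up]
        if healthy:
            self.completed.append((job_id, healthy[0].name))
        else:
            self.failed.append(job_id)

    def restart_random_node(self, rng):
        # Assumption: a restarting node is unavailable for the next job only.
        victim = rng.choice(self.nodes)
        victim.up = False
        return victim


def replay_with_chaos(job_ids, restart_probability=0.3, seed=0):
    """Replay mock jobs in order, randomly restarting one node along the way."""
    rng = random.Random(seed)
    cluster = StagingCluster(nodes=[Node(f"staging-{i}") for i in range(1, 4)])
    for job_id in job_ids:
        # Any node that was restarting has come back by now.
        for node in cluster.nodes:
            node.up = True
        if rng.random() < restart_probability:
            cluster.restart_random_node(rng)
        cluster.run_job(job_id)
    return cluster
```

With three nodes and at most one restart in flight, every replayed job should still complete. A non-empty `cluster.failed` is the signal you'd go investigate.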
Buying 3 1Us is $750 for a staging environment. If you only need to model a single DC, you can just put them on your local LAN (Free).
Now, you suddenly have a rough approximation of a 3 Node / 1 DC environment.
But what if you have multiple DCs, 3 Nodes per DC, Hot/Cold Loadbalancers and have DNS Failover Between Them?!
Say, you run a LBx2+Nodex3/LBx2+Nodex3/LBx2+Nodex3 across 3 DCs.
Assuming you automate the build, you could just pay by hour. But if not...
Linode's Failover IPs let you do the Hot/Cold LB thing [$60 for 3 DCs worth].
3 Nodes per DC [$40 Nodes for 4096 MB is probably more than enough to test most issues]. 9 x $40 = $360
So for $420/month, you could replicate a pretty big real world setup.
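If you want to play with the numbers, the arithmetic is simple enough to sketch. The function name and the default prices are assumptions lifted from the figures above ($40 per node, failover IPs working out to $20 per DC); they are not a current price list.

```python
def monthly_staging_cost(datacenters, nodes_per_dc,
                         node_price=40, failover_ip_price_per_dc=20):
    """Rough monthly cost of a cloud staging clone (all prices are assumptions)."""
    node_cost = datacenters * nodes_per_dc * node_price
    failover_cost = datacenters * failover_ip_price_per_dc
    return node_cost + failover_cost


# 3 DCs x 3 nodes: 9 x $40 for nodes plus 3 x $20 for failover IPs
print(monthly_staging_cost(datacenters=3, nodes_per_dc=3))  # 420
```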
What if you have 25 servers like stackoverflow and a hot/cold DC setup that is all Linux?
Linode's Failover IPs let you do the Hot/Cold LB thing [$40 for 2 DCs worth].
13 Nodes per DC [$40 Nodes for 4096 MB is probably more than enough to test most issues]. 26 x $40 = $1,040
For $1,080/month you can basically duplicate the 2 coast hot/cold setup with 26 servers.
But it's expensive!
...unless this is a one-man operation, not really. A fully loaded junior engineer is a minimum of $120k. A robust staging environment is very useful and isn't going to cost more than 1 engineer month to set up.
Yes, you might not use the same hardware. However, it is close enough and honestly...the same hardware 1:1 with real servers [based on the ebay servers] and a rack at Fremont w/ Hurricane Electric is $600/month. 13 servers is $3,250 in hardware. Get someone else on the East Coast in like Buffalo, NY and you can get as close as your budget affords. I doubt you'd hit 1 Junior Engineer Year worth of $$.
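For the owned-hardware route, a back-of-envelope first-year total looks like this. The $250 per server is just $3,250 / 13, and pricing a second site the same as Fremont is purely my assumption; the function name is made up for illustration.

```python
def first_year_cost(servers_per_site, server_price, rack_per_month, sites=1):
    """One-time hardware plus 12 months of rack rental, per the figures above."""
    hardware = servers_per_site * server_price * sites
    colocation = rack_per_month * 12 * sites
    return hardware + colocation


JUNIOR_ENGINEER_YEAR = 120_000  # "fully loaded" figure quoted earlier

one_site = first_year_cost(13, 250, 600)            # 3,250 + 7,200 = 10,450
two_sites = first_year_cost(13, 250, 600, sites=2)  # assumes the same prices on both coasts
print(one_site, two_sites, two_sites < JUNIOR_ENGINEER_YEAR)
```

Even doubled for a second coast, that stays well under one junior engineer year.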
Not really Free. Power and A/C are things you'll pay for but it's not a notable cost [less than $100/month].
In the end, one system needs to be responsible for a given account... best case you could have is only some accounts/users are affected. It's a different kind of problem.
Facebook, for example, uses Cassandra with a very wide distribution of nodes, with an immediate ack on receive... you can't do that with a bank... transactional payment systems won't tolerate it. You could do a few things to alleviate this issue... read-only replicas, etc... even then it's a matter of failing soft (disabling only those systems/services that are unavailable instead of the whole thing).
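A minimal sketch of what "failing soft" could look like, assuming a simple in-process dispatcher. `SoftFailRouter`, `ServiceUnavailable`, and the service names are hypothetical; a real bank would do this at the service or gateway level rather than in one class.

```python
class ServiceUnavailable(Exception):
    """Raised for a service that has been switched off, not for a whole-site outage."""


class SoftFailRouter:
    """Dispatch calls by service name, refusing only the services marked down."""

    def __init__(self):
        self._handlers = {}
        self._disabled = set()

    def register(self, name, handler):
        self._handlers[name] = handler

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def call(self, name, *args):
        if name in self._disabled:
            raise ServiceUnavailable(f"{name} is temporarily unavailable")
        return self._handlers[name](*args)


# Hypothetical services: balances come from a read-only replica,
# transfers need the transactional primary.
balances = {"acct-1": 100}
router = SoftFailRouter()
router.register("view_balance", lambda acct: balances[acct])
router.register("transfer", lambda src, dst, amount: "queued")

router.disable("transfer")  # primary is down: turn off writes only
print(router.call("view_balance", "acct-1"))  # reads still work: 100
```

Calling `router.call("transfer", ...)` at this point raises `ServiceUnavailable`, so the user sees one feature greyed out instead of the whole site going dark.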
You just need an appreciation that any down-time in a system is a symptom of larger problems, and the will to identify (and reproduce as chaos!) those problems.