It's always good to see this "chaos culture" being promoted, as it drives best engineering practices in architecting, testing and monitoring resilient systems. However, it's also interesting to see how at times this space gets inflated, packaged and sold as this big complicated thing that requires "chaos engineers" to implement (almost like "agile" got inflated into a standalone industry :)). It's just a set of good practices that engineers can follow to improve critical systems.
If your organization lacks sufficient maturity or is not large enough, trying to implement chaos engineering will actually cost you more than you gain. It’s like teaching a toddler to use a propane torch. You just don’t do it, or everything you love in the world will burn.
Thanks for the laugh though. Your phrasing was awesome.
Your comment was probably in jest, but it provided a nice platform for mine :)
To be fair, though, when we launched it at Netflix we knew it would break things. But we also knew we had the corporate will to deal with it.
That's really the key. If you don't have that, it's not going to work out in the long term.
Also, if you’re using this, the right time to start is in development, account-wide. It’s mildly obnoxious for devs to lose stuff like Jenkins servers and bastions, but you really quickly find single points of failure and start to engineer around them. Usually making your stuff resilient isn’t that much work, but it’s hard to know what you need to do until you test it.
If you’re randomly terminating everything in dev, testing and staging, you should already be battle-hardened by the time you get into production.
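For anyone wondering what "randomly terminating everything in dev" might look like, here's a rough sketch in Python. To be clear, this is not how Netflix's tooling works; it's just the idea boiled down. The `Instance` type, the environment names, the `probability` knob and the `terminate` callback are all made up for illustration, and you'd have to wire `terminate` to whatever your cloud provider actually offers.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical inventory record; not tied to any real cloud API."""
    instance_id: str
    environment: str  # e.g. "dev", "testing", "staging", "prod"


# Assumption: these are the environments where random termination is allowed.
NON_PROD = frozenset({"dev", "testing", "staging"})


def pick_victims(instances, probability, rng, environments=NON_PROD):
    """Select each eligible instance independently with the given probability."""
    return [
        inst
        for inst in instances
        if inst.environment in environments and rng.random() < probability
    ]


def run_once(instances, terminate, probability=0.1, rng=None):
    """Pick victims and hand each one to the caller-supplied terminate callback."""
    if rng is None:
        rng = random.Random()
    victims = pick_victims(instances, probability, rng)
    for inst in victims:
        terminate(inst)  # caller decides what "terminate" really means
    return victims
```

You'd run something like `run_once(fleet, my_terminate_fn)` on a schedule during working hours, so people are around to notice what broke. Passing in a seeded `random.Random` makes a run reproducible when you want to debug it.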