We will tolerate faults in our computer systems when those faults don't cost us our homework, our jobs, or our lives. And complexity is the only way to make that happen.
I, for one, welcome our new chaos-wielding overlord!
Since this is HN, I'll instead say that if you don't have the guts, you're only lulling yourself into a false sense of security. AWS or not, your systems will fail.
By running the monkey, at least you can make it fail on a schedule that is convenient to you, instead of happening when you're drunk on a Saturday night (or whatever your vice of choice might be).
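To make the "fail on a schedule that is convenient to you" idea concrete, here is a minimal sketch. It is not Netflix's actual Chaos Monkey code; the function names and the weekday 09:00-15:00 window are assumptions made up for illustration. It only picks a victim and leaves the actual termination call to whatever tooling you use.

```python
import random
from datetime import datetime

def in_business_hours(now: datetime) -> bool:
    """True Monday-Friday between 09:00 and 15:00 (hypothetical window)."""
    return now.weekday() < 5 and 9 <= now.hour < 15

def pick_victim(instances, now, rng=random):
    """Return one instance to kill, or None if nobody is around to fix it."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(list(instances))

if __name__ == "__main__":
    # Hypothetical instance names; a real run would query your own inventory.
    victim = pick_victim(["web-1", "web-2", "api-1"], datetime.now())
    print("would terminate:", victim)
```

The point of the time check is exactly the one above: the failures still happen, but only while someone sober is at a keyboard.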
But remember: what's the difference between hacking and pentesting? Permission.
Increasing the number of failures in order to increase the percentage of nice (not 3am) failures takes guts.
By creating faults in the Netflix service which they can build around to stay optimally available, they completely or nearly completely eliminate the risk of being affected by faults in AWS services.
Now, clearly this has not been 100% true yet because some major outages have affected Netflix - though that's when the entire service fails, as opposed to single-point-of-failure types of faults, which Chaos Monkey is really designed to accommodate.
(Disclaimer: this is my outside perception, based purely on interest in how Netflix has built itself. I could be incorrect.)
Creating faults means we know we can handle those types of outages.
As we experience new types of faults, we build tools and systems to make us resilient to those types of faults and, where possible, to similar but as yet unexperienced faults.
I have a question. I asked this of Adrian, but he was of the opinion that it is something Netflix would likely never accommodate:
How can a hospital have group Netflix streaming accounts such that patients could stream Netflix to their rooms?
Can you make a commercial account / support this?
What about the following grey-area method for handling this: The hospital pays for a group of individual streaming accounts. Their patient entertainment system can then check out an account for use by a patient. The patient watches what they want, and when they are done, the system checks the account back in for the next patient.
Would this be amenable to the TOS you may have?
The reason this is important is that cable companies are raping hospitals for providing cable TV service to their patient rooms. Netflix is a far better path for the future.
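For what it's worth, the check-out/check-in scheme described above is just a resource pool. Here is a rough sketch of the bookkeeping, with the class name, account labels, and room labels all invented for illustration. It says nothing about whether the TOS would allow it, and it does not touch any real Netflix API.

```python
class AccountPool:
    """Hands out accounts from a fixed set, one per patient room at a time."""

    def __init__(self, accounts):
        self._free = list(accounts)   # accounts nobody is using right now
        self._in_use = {}             # room -> account currently checked out

    def check_out(self, room):
        """Give the room an account, or None if every account is busy."""
        if room in self._in_use:
            return self._in_use[room]
        if not self._free:
            return None
        account = self._free.pop()
        self._in_use[room] = account
        return account

    def check_in(self, room):
        """Return the room's account to the pool; None if it held nothing."""
        account = self._in_use.pop(room, None)
        if account is not None:
            self._free.append(account)
        return account

if __name__ == "__main__":
    pool = AccountPool(["acct-1", "acct-2"])
    print(pool.check_out("room-101"))
    print(pool.check_in("room-101"))
```

The hospital would need at least as many accounts as it expects concurrent viewers, since a room that asks when the pool is empty simply gets nothing.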
As for all the rest, I'm unfortunately not in a position to answer any of that. Adrian's probably right though. I suspect our content licenses would preclude that type of account.
Overall it's brilliant from an infrastructure standpoint, but it's up to the developers to make their code monkey-proof!
Their outages are becoming more and more random.
… or more loudly reported. Very few of the reported outages affect people who follow Amazon's guidelines for reliability - even the last one had most of the impact on cheapskates who didn't pay for redundancy.
I personally like to run stuff on a CPU/network-crippled test setup, as that can expose cracks normal testing fails to highlight, and in a way that is indicative of what can actually happen under some load spikes.
The point I was making is that pulling out processes/servers/cables is all fine (aka pissing on your server), but it is not an ideal way to test. It is better, IMHO, to test on what you could call weak hardware, as in really throttled CPU/disc/network etc., and see where things break, as these emulate the potential peak bottlenecks you will get. This approach is better because you get to test a single-process system, as opposed to only being able to test cloud-based approaches as this does.
Throttling CPU/disc/network is all well and fine and necessary too, but that's only one set of failure scenarios, and won't protect you against a whole range of failures related to systems just disappearing as they crash, or failing to restart, or any number of other concerns.
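To illustrate the difference between the two kinds of test, here is a small sketch that injects both: a fixed delay (the throttled case) and an outright failure (the "system just disappeared" case). The wrapper class, its parameters, and the fallback helper are all hypothetical names, and this is an in-process toy rather than how Chaos Monkey works, which kills real instances.

```python
import random
import time

class FlakyService:
    """Wraps a callable and injects either latency or an outright failure."""

    def __init__(self, func, delay_s=0.0, crash_rate=0.0, rng=None):
        self.func = func
        self.delay_s = delay_s          # simulated throttling, in seconds
        self.crash_rate = crash_rate    # probability the call just fails
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.crash_rate:
            raise ConnectionError("injected fault: service disappeared")
        if self.delay_s:
            time.sleep(self.delay_s)
        return self.func(*args, **kwargs)

def call_with_fallback(service, fallback, *args):
    """Use the service if it answers, otherwise degrade to the fallback."""
    try:
        return service(*args)
    except ConnectionError:
        return fallback

if __name__ == "__main__":
    slow = FlakyService(lambda x: x * 2, delay_s=0.05)
    gone = FlakyService(lambda x: x * 2, crash_rate=1.0)
    print(call_with_fallback(slow, "cached", 3))
    print(call_with_fallback(gone, "cached", 3))
```

A caller that only ever sees the slow variant never exercises its fallback path, which is the gap the comment above is pointing at.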