I wonder what the reasoning was for having version 2 only terminate instances (vs burning up CPU, taking disks offline, etc.)? I assume it's something to do with what Chaos Monkey is NOT trying to solve (i.e. eating up CPU is caught elsewhere by another system and out of scope for Chaos Monkey now). Just trying to think it through...
Was wondering the same thing... I know in our environment, unexpected events on a box cause more problems than entire server failures. We design around servers coming in and out; specific processes failing in random ways is harder to design around.
If you can handle a server failing and you have good reporting, you can handle many of those random issues by simply rebooting the affected server. So I can see them taking out things like "load up the memory" or "load up the CPU". Logic errors (bad RAM, corrupted packets, high packet loss) are another story, but I don't know if V1 did those.
Seems to me those kinds of issues would be a good way to test your monitoring. For instance, verifying that the appropriate people are notified, or automatic action is taken, within a reasonable amount of time after the CPU on the box starts spiking.
I would assume that terminating is easy via the AWS API, whereas some of the other things need a process on the instance. You shouldn't really be connecting to boxes directly over SSH if you do DevOps correctly, so maybe they blocked port 22 to enforce this.
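For illustration, here's roughly how little is needed to terminate an instance from outside (a minimal sketch using boto3; the region and instance ID are placeholders):

    import boto3

    # Minimal sketch: killing an instance needs nothing running on the box itself,
    # just API credentials and an instance ID (the ID below is a placeholder).
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])

Burning CPU or taking disks offline, by contrast, means getting an agent onto every instance and cleaning up after it.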
"Devops" has bazillions of meanings, but avoiding (human) ssh to production boxes is a generally sound principle these days because our infrastructures are becoming harder to understand by poking at boxes one or two at a time now even for forensic analysis.
It's just a matter of scale. If you are at the scale of Netflix, VMs are probably too complex to treat as anything but black boxes, and logs are shipped somewhere else anyway. Plus the problem may involve the interaction of several VMs, the network, or other components together.
At this scale, you basically have to have centralized logging. When you have thousands of parallel instances of a single application, searching logs box-by-box just isn't practical.
Consider also that if you're elastically scaling EC2 instances and you need logs off an instance that's since been terminated, too late! That disk is gone. So again, you need a central log service.
If deployment is automated and "clean", images get baked into machines and they just start. For instance, we run Ansible against the machine itself on boot, so we don't really need ssh access to it: everything is automated (but we keep it open to troubleshoot anything that may happen).
Everything goes through Spinnaker now, which in turn supports every provider clouddriver does, including AWS, GCP, Azure and Kubernetes.
Resource limits should be set per instance type. Resource exhaustion is more of an application-level thing than the infrastructure failure Chaos Monkey is supposed to simulate.
I actually agree with you about immutable infrastructure, as I'm working on implementing it.
But that's dangerously close to the "One True Way", which is certainly not the case - so much of this is still evolving, across a wide variety of situations and circumstances.
To test a distributed system, there's not a lot of value in simulating all the many different conditions that can happen on a machine. From the system's perspective, you don't really care what happens on a single computer.
The conditions you need to simulate are (a) the machine being abruptly gone or (b) the machine still accepting requests, but being very slow in returning them and maybe (c) machine returning incorrect results. Seen from the outside, everything that can happen is usually (a) or (b), and with ECC memory and reasonable software (?) hopefully never (c).
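From the caller's side, (a) and (b) mostly collapse into the same defensive pattern; a rough sketch, assuming an HTTP dependency and the Python requests library:

    import requests

    def call_dependency(url):
        # A dead machine (a) and a slow machine (b) look the same from here:
        # bound the wait, then fall back or retry against another replica.
        try:
            resp = requests.get(url, timeout=2.0)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            return None  # caller falls back to a cache or a different instance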
I think it could be because it doesn't provide value to the development teams. Why chase a memory leak or a CPU bug if it's just being caused by your fault testing app?
Preventing the negative effects of random machines disappearing, though... that's a challenge that involves good ops, devs, even UI/UX I would imagine, and it's closer to something that causes users real negative experiences in real life.
I posted this piece of news to my team Slack at work, and a colleague of mine wrote: "we don't need chaos monkey, we have developers for that".
While funny, it also holds a lot of truth. I guess Netflix can hire really top-notch devs who do not accidentally force downtime on their software.
Developers cause software downtime, but they usually don't cause infrastructure downtime, which is what Chaos Monkey does. Cloud VMs have failure rates somewhere around 1-2% depending on who you ask. This is low enough that you can ignore it most of the time, but it'll come back and bite you hard later. Chaos Monkey artificially forces that failure rate high enough that you'll notice problems immediately and fix them before they become too ingrained in your architecture.
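To put a hedged number on that (the time window is my assumption, since the 1-2% figure doesn't state one): at 1% failures per month across 1,000 instances, that's roughly 10 instance losses a month, or about one every three days - rare enough to shrug off on any given day, frequent enough that it will eventually hit you at the worst possible moment.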
We have different tools for different kinds of failures.
Chaos Monkey helps ensure that you are resilient to single instance failure. Kong helps ensure that we are resilient to region failure.
Most of the developer-induced pain (which is the most frequent source of pain) happens at the service level -- a bad code push that somehow made it through canary, accidentally doing something you shouldn't, misconfiguring something, etc. For tolerating service-level failures, we use different tools that minimize the fallout of the failure injection. Specifically, FIT (and the soon-to-be-revealed ChAP). These tools allow us to be more surgical in our injection of failure and tie that into our telemetry solutions.
We only inject failures we expect to be resilient to. Sadly, that is a subset of the failures that people cause ;)
The joke is funny, but it is actually shockingly difficult to make nodes kill themselves instead of doing something far more malignant, which is the typical case for bugs. That is, a zombie, just like in real life, can be worse than a dead corpse (since we are being funny and all). Chaos Monkey shoots two in the head.
I have had issues killing errant JVMs and Rackspace nodes (yes sadly we are still on Rackspace).
I can understand why 2.0 is much more focused given the plethora of monitoring solutions.
I don't think even hiring top-notch devs would make this reality completely go away. Tools like this would likely put them in check too. And yeah same response on our internal chat here basically x)
From my experience, this is naive.
It's funny, but really naive.
The bigger your application stack is (micro-services, API calls, network calls), the more failures you need to test for, and there's no way to "trust" developers to do it themselves.
Also, Netflix hires a lot of top-notch developers and their infrastructure is pretty awesome.
That rings so true. But on the flip side, it might be that because Netflix has a chaos monkey and all its services need to be resilient to failures, accidental downtime isn't that big of an issue to them.
Their developers can still make the same mistakes that we do, but their architecture is better designed to handle that. Just a thought.
Another useful tool is https://github.com/gaia-adm/pumba - like Chaos Monkey, but just for Docker containers. The coolest part for us was emulating networking problems between containers (packet loss, unavailability, etc.).
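For a rough sense of what that kind of network emulation boils down to (a toy sketch of the idea, not how pumba itself is implemented; the container name is a placeholder, and it assumes the container has tc and the NET_ADMIN capability):

    import docker

    # Toy pumba-style network chaos: add latency and packet loss inside a
    # target container by running tc/netem in it. Assumes the container has
    # the tc binary and NET_ADMIN; "victim" is a placeholder name.
    client = docker.from_env()
    victim = client.containers.get("victim")
    victim.exec_run("tc qdisc add dev eth0 root netem delay 200ms loss 5%")
    # ...drive some test traffic, then clean up:
    victim.exec_run("tc qdisc del dev eth0 root netem")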
HA containers for us means smart orchestration tools. We did not want to lock ourselves into Docker-only infrastructure (even now rkt is a very compelling alternative), and wanted an orchestrator/scheduler that is focused entirely on that job. Outside of Swarm, Mesos & co appeared too intrusive, and Nomad is quite narrow in what it does. So we picked Kubernetes and are very happy with it.
Parent may have meant that it would be cool to see a write up on how this tool can be used to simulate network conditions between nodes in a Jepsen test using only containers on a single docker host.
Interesting that all of the resource burning features have been removed, I wish they had expanded on the reasons why. I always found those to be the most differentiating features of Chaos Monkey. Did they just not get a lot of use internally at Netflix?
Resource exhaustion manifests as latency or failure. We inject latency and failure using FIT, so we can limit the "blast radius". When you are testing these failure modes, you are really testing the interaction between microservices, and this requires a bit more precision and sophistication.
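This isn't Netflix's FIT, but the shape of the idea is easy to sketch: affect only a small, configurable fraction of calls so the experiment's blast radius stays bounded.

    import random
    import time

    # Toy failure-injection wrapper (illustrative only, not FIT): only a small
    # fraction of calls see injected latency or errors, bounding the blast radius.
    def inject_failure(fn, failure_rate=0.01, latency_rate=0.05, latency_s=2.0):
        def wrapped(*args, **kwargs):
            r = random.random()
            if r < failure_rate:
                raise RuntimeError("injected failure")  # hard error path
            if r < failure_rate + latency_rate:
                time.sleep(latency_s)                   # slow-dependency path
            return fn(*args, **kwargs)
        return wrapped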
Yes, I wonder that too - complete node failure is the nicest failure that can happen. (See e.g. http://danluu.com/limplock/, because Dan Luu's site is always excellent.)
"We rewrote the service for improved maintainability" seems an important part of this blog post.
I was thinking along these lines as well. Possibly they have other checks and balances in place for less-than-fatal situations, like high CPU usage, and those situations result in another mechanism killing the server, which is the same as what this solution provides.
This is awesome. Since Chaos Monkey now leverages Spinnaker, you can run it against any clouddriver provider. Looking forward to trying this out with Kubernetes. I believe Spinnaker treats namespaces as regions. Eventually it would be cool to simulate masters or even entire federated clusters going down to test Kubernetes scheduling resilience.
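As a toy illustration of the same idea at the pod level (a chaoskube-style sketch, not how Chaos Monkey or Spinnaker actually target Kubernetes; the namespace is a placeholder):

    import random
    from kubernetes import client, config

    # Toy pod-level chaos: delete one random pod in a namespace.
    # Assumes a working kubeconfig; "default" is a placeholder namespace.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod("default").items
    if pods:
        victim = random.choice(pods)
        v1.delete_namespaced_pod(victim.metadata.name, "default")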
Every time this is in the news I get a feeling of awe for operations teams having confidence enough to deploy this. I've usually been in small teams with the feature mill factor turned up way too high.
If you don't use a tool like this, entropy will take care of taking your machines down for you. Only then, it won't be a regularly rehearsed part of "normal operations", so you might find yourself up the creek without a paddle.
Beyond external termination services like Chaos Monkey, what are good examples of software that purposely increases internal nondeterminism or injects failures in production?
Go has a race detector mode, but it is an optional debug feature with a performance cost. The Linux kernel's jiffy clock starts counting from -5 minutes so that drivers must handle clock rollover correctly: the wrap happens shortly after every boot instead of being an uncommon "once every 48 days" event. Firefox has a chaos debug mode that does things like randomizing thread priorities and simulating short socket reads, but that also has performance costs.
The Linux kernel also recently added a debugging option that, when enabled, performs a probe / remove / probe sequence instead of just probing a device, to ensure that device removal works.
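You can get a cheap application-level version of that "chaos mode" idea yourself; a sketch along those lines (all names are made up), gated behind an environment flag so the cost only applies in test runs:

    import os
    import random
    import time

    # Toy "chaos mode" in the spirit of the examples above (names are made up):
    # when CHAOS=1, perturb timing and truncate reads so ordering assumptions
    # and short-read handling get exercised during ordinary test runs.
    CHAOS = os.environ.get("CHAOS") == "1"

    def maybe_yield():
        # Call at interesting points to shake up thread interleavings.
        if CHAOS and random.random() < 0.1:
            time.sleep(random.uniform(0, 0.01))

    def chaotic_recv(sock, n):
        # Randomly ask for fewer bytes to simulate short reads.
        if CHAOS and n > 1 and random.random() < 0.2:
            n = random.randint(1, n - 1)
        return sock.recv(n)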
Chaos Monkey is sort of like Advanced Continuous Deployment. Most shops are still struggling with the basics. You can't even think of trying to sell this kind of running to the C-level until you've proven that you can at least walk (automated deployment and rollback).
I remember reading years and years ago about bandit algorithms... this kind of ops work is at a level that's found only in a few different companies.
In my opinion, it doesn't count unless its in production.
Why? Your customers use your production environment, not your test environment. Something will cause loss of an instance for you:
* Mistaken termination
* AWS retirement (and you missed the email)
* Cable trip in the data center
* <Something else we can come up with>
* <This list goes on>
So, vaccinate against the loss of an instance cratering your service. Give your prod environment a booster shot (with Chaos Monkey or something like it) every hour of every day. Then, when anything from the above list happens, your infrastructure handles it gracefully and without intervention. Continued booster shots ensure that this stability continues through config changes, software version changes, OS changes, tooling changes, etc.
I think the better question is "Why wouldn't you do this?"
The drawback of potentially causing downtime - and with it the potential to drive away customers and gain an image of unreliability - can be much more damaging than not using it in the first place. Customer perception means quite a bit.
Agreed, it shouldn't be hard to explain the benefits even to non-technical people. It's like doing a fire drill: if you do it frequently, then when the actual fire happens you will know what to do. Similarly with infrastructure: it might not be good at handling rare events, but once these events are no longer rare you will learn to handle them.
The biggest issue IMO is explaining the need to make things more resilient. Actually, the technical people (mainly developers) might be the biggest obstacle, because it adds more work for them (with no visible benefit to them, because when the application fails it's ops who get woken up).
Although designed originally to catch places where malloc failure wasn't being handled, it can also be used to randomly trigger other off-nominal portions of the code that might not otherwise be tested.
I can see that Chaos Monkey adds selective pressure to ensure that systems evolve into a state where they can handle unexpected server outages.
But isn't there a danger that it also encourages maladaptions that come to rely on being regularly restarted by the Chaos Monkey? I'm particularly thinking that you might evolve a lot of resource leaks that go unnoticed so long as Chaos Monkey is on the job.
Holy fuck that is some small font size. And the paragraphs aren't in paragraph tags... just hanging out between <div>'s with some <br>'s to keep them company.
Every time that URL comes up people try to access it from https but the site is only available from http... Fix your Firefox, it's clearly at fault here.
Looks like some sequence of events convinces Firefox that HSTS is set, and it always rewrites the request as https (which is not supported by the server) after that.