> Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.
Honestly, that is the money shot. It's not worth running an off-the-shelf cluster system that isn't self-healing after a network partition and/or a node-loss/node-restored cycle.
> This isn’t the end of the world – it does illustrate the fragility of a system with three distinct quorums, all of which must be available and connected to one another, but there will always be certain classes of network failure that can break a distributed scheduler.
As we build increasingly complex distributed systems from what are essentially modular distributed systems themselves, this is something to be aware of. The failure characteristics of all the underlying systems will bubble up in surprising ways.
Yes, but the number of times I've had to manually heal a cluster in production in my professional life can be counted on one hand.
A network partition like the ones Jepsen tests for is a once-every-other-month problem, thanks to network maintenance, hardware failures, etc. So yeah, it is the end of the world for me. I like not having to wake up in the middle of the night every few months.
One thing to note here is that upstream Mesos is working on removing ZooKeeper as a dependency. That would, at a minimum, remove one of the three pieces.
Note that I briefly evaluated Chronos, but tossed it aside as a toy when I realized it didn't support any form of constraints. Aurora is a very nice scheduler for HA / distributed cron (with constraints!) and long-running services.
It would be fantastic if Kyle would test out Aurora, if only to break it so that upstream fixes it. He is generally successful in breaking ALL THE THINGS.
MESOS-1806 mentions replacing ZK with the ReplicatedLog or etcd. The ReplicatedLog is a Mesos-native construct that has no external dependencies.
If you can replace ZK with a native Mesos construct, it seems like that would let you remove ZK entirely. I meant to say "optionally allow removing ZK as a dependency" in the original post. You're totally correct in that regard.
Oh, I love ZooKeeper and have no issues whatsoever with it. However, it is almost universally the first thing I hear people moan about when someone suggests they use Mesos. People don't like ZooKeeper.
The latest version of Chronos added support for constraints, at least for simple EQUALS constraints.
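For context, a constraint in the Chronos job JSON is an `["attribute", "OPERATOR", "value"]` triple. A minimal sketch of a job using one (the job name, command, and the `rack` attribute/value here are made-up examples, not anything Chronos ships with):

```python
import json

# A minimal Chronos-style job definition with a simple EQUALS constraint.
# The "rack" attribute and its value are hypothetical; constraints match
# against whatever attributes your Mesos agents advertise.
job = {
    "name": "nightly-report",
    "command": "/usr/local/bin/generate_report.sh",
    "schedule": "R/2015-09-01T02:00:00Z/P1D",  # ISO 8601: repeat daily
    "constraints": [["rack", "EQUALS", "rack-1"]],
}

payload = json.dumps(job)
print(payload)
```

You would POST that payload to the Chronos scheduler endpoint with a `Content-Type: application/json` header.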
Another annoying problem I encountered with Chronos was the lack of error reporting from its REST API: invalid jobs would just return "400 Bad Request" with no error message, and sometimes the error wouldn't even show up in the logs.
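The best a client can do with those opaque responses is fall back to the status code when the body carries no detail. A sketch of that defensive handling (`describe_chronos_error` is a hypothetical helper, not part of any Chronos client library):

```python
import json

def describe_chronos_error(status_code, body):
    """Build a usable error message from a Chronos REST response.

    Chronos often returns a bare 400 with an empty body, so don't
    assume an error payload exists; fall back to the status code.
    """
    if status_code < 400:
        return None  # success, nothing to report
    try:
        # Hypothetical: if Chronos did return JSON, look for a message field.
        detail = json.loads(body).get("message")
    except (ValueError, AttributeError):
        detail = None  # empty or non-JSON body, as described above
    return detail or "HTTP %d from Chronos (no error detail in response body)" % status_code

# Simulated opaque failure, as described above:
print(describe_chronos_error(400, ""))
```

This at least gets the status code into your own logs, even when Chronos's response and logs both say nothing.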
It all depends on the claims vendors make. I think that is the great thing about Kyle's series -- he often looks at how systems actually behave vs. how they are documented and promoted.
If vendors fix things after the test, there are usually two types of fixes -- docs or code. A docs fix means telling people how the system really behaves and how data might be corrupted; a code fix means actually fixing the problem, if possible.
So if Chronos just says in bold red letters on the front page -- "You'll lose data in a partition" or "Our system is neither C nor A if you use these options" -- that's OK too. Users can then at least make an informed choice.