> Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.
Honestly, that is the money shot. It's not worth running an off-the-shelf cluster system that isn't self-healing after a network partition and/or a node-loss/node-restored cycle.
> This isn’t the end of the world – it does illustrate the fragility of a system with three distinct quorums, all of which must be available and connected to one another, but there will always be certain classes of network failure that can break a distributed scheduler.
As we build increasingly complex distributed systems from what are essentially modular distributed systems themselves, this is something to be aware of. The failure characteristics of all the underlying systems will bubble up in surprising ways.
Yes, but the number of times I've had to manually heal a cluster in production in my professional life can be counted on one hand.
A network partition like the ones Jepsen tests for is a once-every-other-month problem, thanks to network maintenance, hardware failures, etc. So yeah, it is the end of the world for me. I like not having to wake up in the middle of the night every few months.
One thing to note here is that upstream Mesos is working on removing ZooKeeper as a dependency. That would, at a minimum, remove one of the three pieces.
Note that I briefly evaluated Chronos, but tossed it aside as a toy when I realized it didn't support any form of constraints. Aurora is a very nice scheduler for HA / distributed cron (with constraints!) and long-running services.
It would be fantastic if Kyle would test out Aurora, if only to break it so that upstream fixes it. He is generally successful in breaking ALL THE THINGS.
MESOS-1806 mentions replacing ZK with the ReplicatedLog or etcd. The ReplicatedLog is a Mesos-native construct that has no external dependencies.
If you can replace ZK with a native Mesos construct, it seems like that would let you remove ZK entirely. I meant to say "optionally allow removing ZK as a dependency" in the original post. You're totally correct in that regard.
Oh, I love ZooKeeper and have no issues whatsoever with it. However, it is almost universally the first thing I hear people moan about when someone suggests they use Mesos. People don't like ZooKeeper.
The latest version of Chronos added support for constraints, at least for simple EQUALS constraints.
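For context, a constraint in the Chronos job JSON is an `["attribute", "OPERATOR", "value"]` triple. A minimal sketch of a job using one (the job name, command, and the `rack` attribute/value here are made-up examples, not anything Chronos ships with):

```python
import json

# A minimal Chronos-style job definition with a simple EQUALS constraint.
# The "rack" attribute and its value are hypothetical; constraints match
# against whatever attributes your Mesos agents advertise.
job = {
    "name": "nightly-report",
    "command": "/usr/local/bin/generate_report.sh",
    "schedule": "R/2015-09-01T02:00:00Z/P1D",  # ISO 8601: repeat daily
    "constraints": [["rack", "EQUALS", "rack-1"]],
}

payload = json.dumps(job)
print(payload)
```

You would POST that payload to the Chronos scheduler endpoint with a `Content-Type: application/json` header.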
Another annoying problem I encountered with Chronos was the lack of error reporting from its REST API: invalid jobs would just return "400 Bad Request" with no error message, and sometimes the error wouldn't even show up in the logs.
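The best a client can do with those opaque responses is fall back to the status code when the body carries no detail. A sketch of that defensive handling (`describe_chronos_error` is a hypothetical helper, not part of any Chronos client library):

```python
import json

def describe_chronos_error(status_code, body):
    """Build a usable error message from a Chronos REST response.

    Chronos often returns a bare 400 with an empty body, so don't
    assume an error payload exists; fall back to the status code.
    """
    if status_code < 400:
        return None  # success, nothing to report
    try:
        # Hypothetical: if Chronos did return JSON, look for a message field.
        detail = json.loads(body).get("message")
    except (ValueError, AttributeError):
        detail = None  # empty or non-JSON body, as described above
    return detail or "HTTP %d from Chronos (no error detail in response body)" % status_code

# Simulated opaque failure, as described above:
print(describe_chronos_error(400, ""))
```

This at least gets the status code into your own logs, even when Chronos's response and logs both say nothing.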
It all depends on the claims vendors make. I think that is the great thing about Kyle's series -- he often looks at how systems actually behave vs. how they are documented and promoted.
If vendors fix things after the test, there are usually two types of fixes -- docs or code. A docs fix means telling people how the system really behaves and how data might be corrupted; a code fix means actually fixing the problem, if possible.
So if Chronos just says in bold red letters on the front page -- "You'll lose data in a partition" or "Our system is neither C nor A if you use these options" -- that's OK too. Users can then at least make an informed choice.