
Call Me Maybe: Chronos - sbilstein
https://aphyr.com/posts/326-call-me-maybe-chronos
======
rdtsc
Kyle's series "Call Me Maybe" is great.

The general take-away from most of them is -- do not believe the vendors of
distributed systems. Even systems that were meant to solve consensus (etcd,
consul) failed:

[https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul)

Elasticsearch didn't do well, and neither did MongoDB, Kafka, etc.

The only ones that did ok that I remember were: Zookeeper, Riak (with
siblings), and maybe Cassandra.

~~~
bbrazil
It's also interesting to see the different responses of teams when problems
are raised.

Some are very thankful for the analysis; others don't seem to appreciate how
important correctness is in this type of distributed system. I find it's a
good indicator of the maturity of the teams working on them, and how much I'm
willing to trust each system.

~~~
dopamean
> I find it's a good indicator of the maturity of the teams working on them,
> and how much I'm willing to trust each system.

I hate to nitpick but I'm genuinely curious. How do you know it's a good
indicator? It can definitely be _an_ indicator but how do you know it was
good? Did you have a follow up experience that validated what the indicator
indicated?

~~~
jerf
How a project performs today tells you the zeroth derivative of its location.
Looking at the commit log tells you about the first derivative. How people
react to Call Me Maybe when it's about their product gives you a lot of
information about the second derivative.

~~~
AYBABTME
I'm surprised to find this being a pretty good way to look at it.

------
orf
Awesome read. The issue he made on github[1] is also worth it, brndnmtthws
could have definitely handled that better.

1\.
[https://github.com/mesos/chronos/issues/513](https://github.com/mesos/chronos/issues/513)

~~~
pavlov
No kidding. Someone raises reasonable points, and the engineer responds with:

_"Like I suggested before, you should stop by our offices @ 88 Stevenson for
lunch some day, and you can chat with our best and brightest about this (and
other things) until the cows come home. Then you can save yourself a lot of
back and forth, and nerdraging about computers."_

Aphyr spending his time finding bugs and unspecified behavior in their
software is dismissed as "nerdraging". And instead of discussing it openly, he
should spend a day and go to "88 Stevenson", wherever that is...

I'm going to assume he means Stevenson Avenue, Polmont, Falkirk,
Stirlingshire, United Kingdom -- since that's the first Google result I got
for "Stevenson" as an address.

~~~
brndnmtthws
He works in SF, as do we. It makes sense, given the context.

I like face to face discussions. It would have been nice to talk to him about
what he's building, and what he's trying to accomplish. Google hangouts would
have worked, too.

~~~
tptacek
I think what he's trying to accomplish is running the same kinds of tests on
Chronos that he's famously run on almost every other important open source
distributed system, and that his writeup gives you probably the most important
info you need on how to make your stuff more amenable to that kind of
adversarial automated testing.

------
fweespeech
> Transient resource or network failures can completely disable Chronos. Most
> systems tested with Jepsen return to some sort of normal operation within a
> few seconds to minutes after a failure is resolved. In no Jepsen test has
> Chronos ever recovered completely from a network failure. As an operator,
> this fragility does not inspire confidence.

Honestly, that is the money shot. It's not worth running an off-the-shelf
cluster system that isn't self-healing from a network partition and/or a node
loss->node restored cycle.

~~~
sbilstein
Another big takeaway for me:

"This isn’t the end of the world–it does illustrate the fragility of a system
with three distinct quorums, all of which must be available and connected to
one another, but there will always be certain classes of network failure that
can break a distributed scheduler."

As we build increasingly complex distributed systems from what are essentially
modular distributed systems themselves, this is something to be aware of. The
failure characteristics of all the underlying systems will bubble up in
surprising ways.
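
A back-of-the-envelope sketch of why stacked quorums hurt: if a scheduler needs every one of its underlying clusters up at the same time, and we assume (unrealistically) that they fail independently, their availabilities multiply. The 99.9% figure below is hypothetical, not a measurement of Chronos, Mesos, or Zookeeper.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of a system that needs every component up at once.

    Assumes failures are independent, which real network partitions
    rarely are, so treat the result as a rough estimate only.
    """
    return prod(availabilities)


# Hypothetical figure: three quorums, each available 99.9% of the time.
three_quorums = combined_availability([0.999, 0.999, 0.999])
print(f"{three_quorums:.6f}")
```

The combined figure is always lower than that of the weakest component, and correlated failures (one partition cutting off several quorums at once) change the picture further.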

~~~
SEJeff
One thing to note here is that upstream Mesos is working on removing
Zookeeper as a dependency. That would at a minimum remove one of the three
pieces.

Note that I briefly evaluated Chronos, but tossed it aside as a toy when
realizing it didn't support any form of constraints. Aurora is a very nice
scheduler for HA / distributed cron (with constraints!) and long running
services.

It would be fantastic if Kyle would test out Aurora, if only to break it
so that upstream fixes it. He is generally successful in breaking ALL THE
THINGS.

~~~
bbrazil
Mesos is looking at allowing things other than Zookeeper to be the dependency,
I'm not aware of them removing the dependency altogether.

[https://issues.apache.org/jira/browse/MESOS-1806](https://issues.apache.org/jira/browse/MESOS-1806)

~~~
fidget
I do wonder why the GP would want to, given how ridiculously reliable
zookeeper is.

~~~
SEJeff
Oh I love zookeeper and have no issues whatsoever with it. However, it is
almost universally the first thing I hear people moan about when someone
suggests they use Mesos. People don't like Zookeeper.

------
bbrazil
> Several engineers suggest that Aurora is more robust, though more difficult
> to set up, than Chronos. I haven’t evaluated Aurora yet, but it’s likely
> worth looking in to.

I've tried out Aurora (and helped improve the setup docs a small bit) and
found it quite nice. If you're going to try it out, use Vagrant, and to deploy
it in production I suggest working off the Vagrant setup:
[https://github.com/apache/aurora/blob/master/examples/vagran...](https://github.com/apache/aurora/blob/master/examples/vagrant/provision-dev-cluster.sh)

> Instrument your jobs to identify whether they ran or not

Prometheus.io lets you alert on when the job last succeeded, which is
generally what you want.
[http://prometheus.io/docs/instrumenting/pushing/](http://prometheus.io/docs/instrumenting/pushing/)
has a full example for Java.
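
For a scheduled job written in Python, the same idea might look roughly like the sketch below. It uses only the standard library and assumes a Prometheus Pushgateway that accepts the text exposition format at `/metrics/job/<job_name>`; the metric name, job name, and gateway address are illustrative choices, not anything Chronos or the linked docs prescribe.

```python
import time
import urllib.request


def last_success_payload(metric="job_last_success_unixtime", now=None):
    """Build a text-format body recording when the job last succeeded."""
    timestamp = time.time() if now is None else now
    return (
        f"# TYPE {metric} gauge\n"
        f"{metric} {timestamp}\n"
    ).encode("utf-8")


def push_last_success(gateway_url, job_name, payload):
    """PUT the payload to a Pushgateway at /metrics/job/<job_name>."""
    request = urllib.request.Request(
        f"{gateway_url}/metrics/job/{job_name}",
        data=payload,
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Usage, after the real work has finished successfully
# (assumes a Pushgateway listening on localhost:9091):
#   push_last_success("http://localhost:9091", "nightly_report",
#                     last_success_payload())
```

An alert along the lines of `time() - job_last_success_unixtime > 86400` would then fire if the job has not succeeded within a day, whether it failed or the scheduler never ran it.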

------
baq
if i worked on a distributed system, i'd be thrilled and horrified at the same
time if i saw a bug report from aphyr. the whole series is a gem, well worth
an afternoon (or five).

------
justin66
As always, people should check out Aphyr's Twitter page for more detail and
local color.

_"really wish I had more time to dig into chronos because it has so many
fascinating failure modes but I've got 2 other systems before talks!"_

------
lordlarm
Are there many of you actually using Chronos in production? I'm not sold that
this project will still be with us in a couple of years..

I'm not saying the idea is bad, but comparing the developer community of
Marathon with Chronos you see some obvious differences in commits, developers,
issues (and responses), releases etc.

We've been using it in conjunction with Marathon and Mesos, and my impression
is that it's a now half-dead project riddled with bugs (especially in the web
UI) and I'm unsure whether we should invest more resources and infrastructure
around this project.

Aurora does look interesting. But Apache doesn't have a much better
reputation, and I'm not particularly keen on going away from Marathon. Any
DevOps engineers with container/mesos infrastructures wanting to chime in?

~~~
jacques_chester
Have a look at Lattice ([http://lattice.cf](http://lattice.cf)), which is
extracted from Cloud Foundry.

Disclaimer: I work for Pivotal (in Labs), which donates the largest chunk of
engineering effort to Cloud Foundry, including the Lattice team.

~~~
lordlarm
Looks like a very interesting project which I'll keep a keen eye on. We're
probably not ready to switch our entire stack from Mesos to Lattice though,
especially considering you're missing some important bits such as external
service discovery (by the looks of it). We're currently doing Marathon on
Mesos with gliderlabs/registrator, which syncs Docker and Consul, and then
we're using consul-template + HAProxy for exposing applications externally.
Works quite nicely.

The differences (and advantages) between Lattice and Mesos could've been made
clearer too I guess, although your docs and FAQs are quite nice.

~~~
jacques_chester
If you want the full stack, you go from Lattice to Cloud Foundry. It gives you
service injection, logging, routing and a bunch of other stuff I forget right
now.

And it doesn't require you to build your own snowflake PaaS, which you'll be
married to forever at your own expense.

I've worked on CF so I'm biased in its favour. But my actual paying job is
helping clients to deliver user value. Tinkering around with various tools is
fun, but it's not delivering user value.

Typing

    cf push app-name

Is.

And it works. It just plain old works. That makes my life 100x easier.

------
tpetr
Awesome post -- it'd be cool to see Kyle test more Mesos frameworks.
Singularity
([http://github.com/HubSpot/Singularity](http://github.com/HubSpot/Singularity))
is another good option for running scheduled jobs in Mesos (along with web
services, background workers, and one-off tasks). It fails fast like Chronos,
but we made it clear to use monit, supervisor, or systemd in the docs. ;) We
run our entire product on it (~1,600 deployable items) with no issues. Our
PROD cluster has launched 8,000,000 tasks so far in 2015!

------
0xdeadbeefbabe
Can someone explain the "Call Me Maybe" reference? Was it a movie or a song?

~~~
zorked
It's a song, and probably the tool and aphyr's blog posts will have more
staying power than the song and the singer they are named after.

~~~
sanderjd
Ha, that's a good and funny point that hadn't occurred to me. More info in
answer to the parent's question: the song is "Call Me Maybe" by Carly Rae
Jepsen.

------
wereHamster
> for instance, it omits that Chronos nodes are fragile by design and must be
> supervised by a daemon to restart them.

Wat? Why is that necessary to mention? Every process is fragile and needs to
be supervised!
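
For readers unfamiliar with the pattern, here is a minimal sketch of what "supervised by a daemon" means: a loop that restarts a child process whenever it exits with an error. This is illustrative only; the function name and parameters are made up for the example, and a real deployment would use something like monit, supervisor, or systemd rather than a hand-rolled loop.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, backoff_seconds=1.0):
    """Run command, restarting it whenever it exits non-zero.

    Returns the number of times the command was started.
    """
    starts = 0
    while True:
        starts += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            return starts  # clean exit: nothing to restart
        if starts > max_restarts:
            return starts  # restart budget spent: give up
        time.sleep(backoff_seconds)


# Example: a child that exits cleanly is started exactly once.
supervise([sys.executable, "-c", "pass"])
```

The restart budget and backoff are there so that a process which crashes immediately on startup does not spin forever.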

------
haosdent
Awesome analysis! :-)

------
sbilstein
"Ordinarily in Jepsen we’d refer to these as “primary” and “secondary” or
“leader” and “follower” to avoid connotations of, well, slavery, but the
master nodes themselves form a cluster with leaders and followers, and terms
like “executor” have other meanings in Mesos, so I’m going to use the Mesos
terms here."

hahaha, well naming things is one of the hardest problems in computer science
after all.

~~~
vezzy-fnord
It's a total bikeshed issue, in any event. No one seems to mind the issue of
Unix routinely forcing you to commit filicide and general killing of children,
for instance.

~~~
justizin
There's a difference between that and trying to work on a diverse team with
people whose ancestors were slaves and avoiding common phrases like:

    "We need more slaves!"

It's a bit of nomenclature that would never be used if people weren't
mimicking choices from decades ago.

What name to use other than slave might be a bikeshed issue, but the bikeshed
is currently painted with a confederate flag. ;)

~~~
vezzy-fnord
You're only proving my point, really. The only things you can think of are
Confederate flags and the trans-Atlantic slave trade, when slavery as an
institution is as old as human civilization itself (and continues on well to
this day).

Incidentally, filicide is also ancient and ongoing.

~~~
justizin
I can think of a lot of things other than Confederate flags, I was making a
symbolic argument against your misuse of the word bikeshed.

Slavery is a bad practice and I would definitely not work with someone who
felt otherwise, but other than that, your argument is just worthless escapism.
Have fun talking to HR in the future.

~~~
vezzy-fnord
My point was that trying to paint slavery as being particularly objectionable
in the face of just as seemingly reprehensible (but not at all in context)
metaphors of child murder, is disingenuous and pointlessly selective. Child
murder and slavery are horrible and I can't believe I actually have to specify
this. They are both ongoing, so your argument from ancestry doesn't work.
There are millions of slaves as we speak, many more descendants of slaves and many
who have had their children murdered. That doesn't somehow make those
metaphors irredeemable when put in purely technical context, e.g. slavery as a
relationship of total control and ownership by one party over another
(actually just as applicable to S/M -- a consensual sexual practice, as it is
to real-life slavery).

I will have fun talking to HR, thank you.

