

Reliable Cron across the Planet - StylifyYourBlog
https://queue.acm.org/detail.cfm?id=2745840

======
teraflop
An interesting read, but it doesn't look like there's too much exciting or
novel here. (In a fundamental sense, that is. I'm sure there's all kinds of
interesting nuts-and-bolts engineering that outsiders aren't privy to.) TLDR:
use a replicated state machine to make scheduling decisions, and make all
operations on the datacenter idempotent.
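
To make the idempotency half of that TLDR concrete, here is a minimal sketch. It assumes each launch is named by job plus scheduled time, so a scheduler replica that retries after a failover cannot start the same run twice. `FakeDatacenter` and `launch_id` are hypothetical names for illustration, not anything from the article.

```python
from datetime import datetime, timezone


def launch_id(job_name, scheduled_time):
    """Derive a deterministic ID for one scheduled run of a job."""
    return f"{job_name}@{scheduled_time.isoformat()}"


class FakeDatacenter:
    """Hypothetical stand-in for a datacenter API; not a real service."""

    def __init__(self):
        self.launched = {}

    def launch(self, run_id, command):
        # Idempotent: repeating a request with the same ID changes nothing.
        if run_id in self.launched:
            return False
        self.launched[run_id] = command
        return True


dc = FakeDatacenter()
when = datetime(2015, 3, 1, 0, 0, tzinfo=timezone.utc)
run = launch_id("nightly-backup", when)
first = dc.launch(run, "backup.sh")   # starts the run
second = dc.launch(run, "backup.sh")  # retry after failover is a no-op
```

Any replica can replay the same decision from the replicated log without needing to know whether its predecessor already acted on it.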

The hashing trick to mitigate spiky load distributions is cool, but that seems
to be more about multi-tenancy than reliability.
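
A rough sketch of how such a hashing trick might look, assuming the job's name is hashed to pick its minute within the hour so that "hourly" jobs from many tenants don't all fire at minute 0. The function name and the choice of SHA-256 are assumptions for illustration, not details from the article.

```python
import hashlib


def hashed_minute(job_name, period_minutes=60):
    """Map a job name to a stable offset in [0, period_minutes)."""
    # A cryptographic hash is used only because it is stable across
    # processes and machines (unlike Python's built-in hash()).
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period_minutes


jobs = ["tenant-a/report", "tenant-b/report", "tenant-c/cleanup"]
offsets = {name: hashed_minute(name) for name in jobs}
```

Each job keeps the same offset on every run, so the schedule stays predictable per job while the aggregate load spreads across the hour.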

I'm disappointed to see this article perpetuating the misconception that Paxos
is a leader election algorithm. It _tries_ to elect a leader for its own
purposes, but Paxos itself behaves safely even if the election process goes
temporarily amok; other systems built on top of it might not. If you want to
provide the guarantee that only one scheduler instance is running at a time,
you need to add a lease mechanism and make assumptions about clock synchrony.
I'm sure the authors know this, but not mentioning it at all seems pretty
sloppy.
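
For illustration, a minimal sketch of the lease check described above, assuming a bounded clock drift between machines. The `Lease` class and its parameters are hypothetical; a real system would obtain and renew the lease through the consensus layer.

```python
import time


class Lease:
    """Hypothetical leader lease with a safety margin for clock drift."""

    def __init__(self, duration_s, max_clock_drift_s, clock=time.monotonic):
        self.duration_s = duration_s
        self.max_clock_drift_s = max_clock_drift_s
        self.clock = clock
        self.expires_at = None

    def granted(self, requested_at):
        # Count from when the lease was requested, not when the reply
        # arrived, so the holder never believes it lasts longer than
        # the grantor does.
        self.expires_at = requested_at + self.duration_s

    def may_act(self):
        # Stop acting early by the assumed maximum drift; this is the
        # clock synchrony assumption made explicit.
        if self.expires_at is None:
            return False
        return self.clock() < self.expires_at - self.max_clock_drift_s


now = [0.0]  # fake clock so the example is deterministic
lease = Lease(duration_s=10.0, max_clock_drift_s=1.0, clock=lambda: now[0])
before_grant = lease.may_act()
lease.granted(requested_at=0.0)
now[0] = 8.9
inside_margin = lease.may_act()
now[0] = 9.0
past_margin = lease.may_act()
```

If the drift bound is wrong, two schedulers can both believe they hold the lease, which is why the guarantee rests on the synchrony assumption and not on Paxos alone.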

------
endymi0n
Wonder if those guys checked out
[https://github.com/mesos/chronos](https://github.com/mesos/chronos) - it was
the best solution I could find when I recently wanted to solve distributed,
reliable Cron for us.

~~~
teraflop
"Those guys" are Google engineers, and Chronos is built on Mesos, which has
drawn a lot of inspiration from Google's datacenter-scale computing
architecture. This project probably predates Chronos by a significant margin.

Also, I'm not familiar with Chronos in detail, but from the documentation, it
doesn't look like it supports the exactly-once execution semantics that
Google's cron replacement is aiming for.

------
KaiserPro
We had a similar issue, although at a different level of scale.

However, Jenkins works well as a cron replacement, although I'm not sure what
the limit is on the number of build slaves you can attach to Jenkins.

