
We Don’t Run Cron Jobs (2016) - bedros
https://engblog.nextdoor.com/we-don-t-run-cron-jobs-at-nextdoor-6f7f9cc62040
======
Cieplak
Cron works great when you don’t need to guarantee execution, e.g., if a server
goes down. Unfortunately, all the alternatives are pretty heavyweight, e.g.,
Jenkins, Azkaban, Airflow. I’ve been working on a job scheduler that strives to
work like a distributed cron. It works with very little code, because it leans
heavily on Postgres (for distributed locking, parsing time interval
expressions, configuration storage, log storage) and PostgREST (for the HTTP
API). The application binary (~100 lines of Haskell) polls for new jobs, then
checks out and executes tasks. The code is here if you’re interested:

[https://github.com/finix-payments/jobs](https://github.com/finix-payments/jobs)

It compiles to machine code, so deploying the binary is easy. That said, I’d
like to add some tooling to simplify deploying and configuring Postgres and
PostgREST.
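Not the actual code from that repo, but the Postgres-based distributed-locking piece described above can be sketched in a few lines of Python (the function names are mine, and `conn` is assumed to be a psycopg2-style connection to Postgres):

```python
import hashlib

def lock_key(job_name: str) -> int:
    """Map a job name to a signed 64-bit key suitable for pg_try_advisory_lock."""
    digest = hashlib.sha256(job_name.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def try_run_exclusively(conn, job_name: str, run) -> bool:
    """Run `run()` only if this node wins the session-level advisory lock.

    Postgres releases the lock automatically when the session ends, so a
    crashed worker cannot hold the job hostage.
    """
    with conn.cursor() as cur:
        cur.execute("SELECT pg_try_advisory_lock(%s)", (lock_key(job_name),))
        (won,) = cur.fetchone()
    if won:
        run()
    return won
```

The appeal of this pattern is that the database already provides the consensus you'd otherwise need ZooKeeper or similar for: whichever poller grabs the lock first runs the job, and everyone else moves on.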

~~~
freddie_mercury
Jenkins is "heavyweight" but Postgres isn't?

~~~
thatjsguy
As someone who has administered both, I’d rather manage 10 Postgres instances
than one Jenkins box. No question.

Edit: I should expound. Jenkins seems like it has a lot of clunky moving
parts. It all works, and I’d rather use it than anything else, but it’s kind
of like IKEA furniture: you use it because you have to, not necessarily
because you want to.

It’s also incredibly difficult to automate. I can configure Postgres with a
config file or two and easily use Ansible to get the exact same instance every
time. Jenkins has to be dragged into Automation Alley kicking and screaming. I
partially blame this on the fact that Jenkins has nontrivial amounts of
configuration that’s done via GUI. I approach a long-running Jenkins instance
with the same fear and dread I approach a Windows box that hasn’t been
restarted in six months. I.e. the box is now a snowflake and trying to make it
reproducible and automated is going to be a bad time.

I could go on, but as a devops critter, Postgres wins every time.

~~~
peterwwillis
Jenkins is probably 100 times easier to deploy, configure, operate, and scale
than Postgres+whatever, which is a horrifying statement, but still true. If
what you have to do is schedule and run arbitrary jobs on any kind of machine,
it's no contest. They aren't even related. Postgres is a relational database,
and Jenkins is a single Java process that stores flat files on local disk,
connects to remote nodes with SSH, and has a thousand plugins.

It's like comparing a missile with an airplane. One gets where it needs to go
faster and more efficiently, and the other one transports people.

~~~
Cieplak
Here’s a fairly objective metric for comparing the complexity of deploying
Jenkins vs Postgres:

[https://github.com/geerlingguy/ansible-role-jenkins](https://github.com/geerlingguy/ansible-role-jenkins)

[https://github.com/geerlingguy/ansible-role-postgresql](https://github.com/geerlingguy/ansible-role-postgresql)

Two Ansible roles by the same author, supporting both CentOS and Ubuntu. Not
hugely different in complexity IMO. Installing Postgres on FreeBSD, though, is
little more than:

    
    
        pkg install postgresql10-server
        sysrc postgresql_enable="YES"
        service postgresql initdb
        service postgresql start

~~~
peterwwillis
I get that they install differently, but that's not the point. The point is
that they do very different things, in different ways. Postgres is not a
Jenkins replacement. Postgres is storage and querying. Its equivalent on
Jenkins is XML files.

------
swalsh
The last company I worked at used Jenkins for all cron jobs. It has great
reporting, supports complex jobs, and has a million plugins. It worked really
well.

~~~
amoshg
Interesting, isn't Jenkins mostly used for internal builds? I have a hard time
imagining using it for things like marketing email cron jobs.

~~~
humbleMouse
The beauty of jenkins is that you can use it for pretty much anything.

~~~
sanderjd
I tend to feel the opposite about things you can use for pretty much anything.
I much prefer tools that are good for a single thing. I really disliked
maintaining a Jenkins instance a while back; maybe this "jack of all trades"
ethos is part of the reason I felt that way.

~~~
brightball
I tend to prefer simplified stack, especially for tools that don't need to be
touched very often.

~~~
sanderjd
I can't tell: are you saying that using Jenkins results in the simplified
stack you prefer? If so, I don't agree; I think Jenkins is a very un-simple
way to run cron jobs.

~~~
brightball
When I say simplified stack I mean fewer tools that must be learned and
managed by people. Jenkins simplifies in that way.

~~~
sanderjd
Interesting. I still don't see it that way. It seems less like a single tool
when used in this way than multiple tools running alongside one another.
Different strokes for different folks I suppose!

------
foobarian
> Here is an example of a typical oncall experience: 1) get paged with the
> command line of the failed job; 2) ssh into the scheduler machine; 3) copy &
> paste the command line to rerun the failed job

No doubt this works if it's an established procedure, but if I were
approaching a system I wasn't familiar with I would never do (3), because the
environment can differ wildly between crond and a login shell. It is safer to
edit the cron schedule and duplicate the entry with a time set a few minutes
in the future. (And clean that up afterwards.)

~~~
TheSoftwareGuy
You could even write a cron job to clean up the crontab

~~~
thwarted
Or you could use at(1) or batch(1) to invoke the job once. That seems
significantly easier and less error prone than trying to clean up crontab.

~~~
Pete_D
at(1) is great, but you'll still have the problem of a different environment
from cron (at copies most of your environment variables and working
directory).

------
encoderer
If you’re already using cron like millions of us, here are a couple things
that can help you:

[https://crontab.guru](https://crontab.guru)

[https://cronitor.io/docs/cron-troubleshooting-guide](https://cronitor.io/docs/cron-troubleshooting-guide)

------
solutionyogi
I worked at a company which wrote their own scheduler and it was fraught with
bugs. Dealing with time and dates is HARD. Really, really hard. Your custom
scheduler will break, and at the worst possible time.

If cron doesn't work, get an open source or commercial solution. And who cares
what tech the scheduler is written in? A scheduler's job is to run your
programs and provide an API and a nice GUI if you desire.

~~~
jokh
Yeah exactly. I don't understand why they wanted the scheduler to be written
in Python, since the scheduler should be decoupled from the jobs they are
running anyway.

~~~
alexeiz
If your jobs are written in Python, there is nothing wrong with a Python-based
scheduler. It can actually be quite convenient.

------
lhr0909
They have a task worker system built back in 2014[1], so they needed
something custom to work with it as well. Back then I think they really didn't
have many options, but if they were to do it again now, I think either AWS
Lambda or AWS Batch would serve this type of scheduled job very well.

[1]: [https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...](https://engblog.nextdoor.com/nextdoor-taskworker-simple-efficient-amp-scalable-ac4f7886957b)

------
amyjess
NMS engineer at an enterprise telecom here. At my company, we've been
switching over to Jenkins for job scheduling. Most of what used to be cronjobs
have been fully Dockerized, and now we have Jenkins run periodic "builds" via
pipelines. The pipelines themselves just run a docker image.

The single biggest advantage this has gotten us is centralized logging. I can
check on the console output of any cronjob just by going to Jenkins and
clicking on the job.

Moving from cron to Jenkins wasn't my idea, but the implementation is mine.
I've built a few base Docker images. One is just the standard Python 3.6
Docker image. Another is the CentOS image equipped with Python 3.6 and Oracle
RPMs for jobs that need database access. Another is the aforementioned image
plus a number of Perl dependencies for jobs that need to call into our legacy
Perl scripts.

For many scripts, I can use identical Dockerfiles. I just copy the directory
containing the script, requirements.txt, Dockerfile, and Jenkinsfile, then I
change out the script, edit the Jenkinsfile to reference the new script's
name, and make any needed changes to the requirements.txt.

------
mark_story
I've had good experiences using celery-beat to replace crond. It lets you use
all the good parts of celery without much work.
[http://docs.celeryproject.org/en/latest/userguide/periodic-t...](http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html)

~~~
deizel
Ditto, especially in combination with Python/Django, as used by Nextdoor.
Ironically, they had already removed Celery from their stack a few years
prior. [https://engblog.nextdoor.com/nextdoor-taskworker-simple-effi...](https://engblog.nextdoor.com/nextdoor-taskworker-simple-efficient-amp-scalable-ac4f7886957b)

------
ape4
Since nobody mentioned Anacron, I will.
[https://en.wikipedia.org/wiki/Anacron](https://en.wikipedia.org/wiki/Anacron)
Deals with the server possibly being down for a period.

------
ChuckMcM
That was a difficult read for me. The blog post starts out with the four main
problems with cron (their use wasn't scalable, editing the text file was hard,
their jobs were complex to run, and they didn't have any telemetry).

That's great, what does that have to do with cron?

As a result what I read was:

"We don't understand what cron does, nor do we understand how job scheduling
is supposed to work, and we don't understand how to write 'service' based
applications, so somebody said 'Just use cron' and we did some stuff and it
didn't work how we liked, and we still haven't figured out really what is
going on with schedulers so we wrote our own thing which works for us but we
don't have any idea why something as broken as cron has persisted as the way
to do something for longer than any of us has been alive."

I'm not sure that is the message they wanted to send. So let's look at their
problems and their solution for a minute and figure out what is really going
on here.

The first problem was 'scalability', which is described in the blog post as
"cron jobs pushed the machine to its limit". Their scalability solution was to
write a program that put a layer between the starting of the jobs and the jobs
themselves (it sends messages to SQS), and they used a new scheduler
(APScheduler) to implement the core scheduling.

So what was the real win here? (Since they have recreated cron :-)) The win
is that instead of forking and execing as cron does, allowing things like
standard in and what not to be connected to the process, their version of cron
sends a message to another system to actually start jobs. Guess what: if they
wrote a bit of Python code that did nothing but send a message to SQS and
exit, that would run pretty simply under cron. If they did it in C or C++, so
they weren't loading an entire interpreter and its runtime every time, it
would be lightning fast and add no load to the "cron server". This is
basically being unaware of how cron works and so not knowing the best way to
use it.
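That enqueue-and-exit shim is tiny; here is a sketch of what a cron entry could invoke (the message fields, script name, and the boto3 dependency are illustrative assumptions, not anything from the post):

```python
import json
import time

def build_job_message(job_name, args=None):
    """Serialize a job request; the worker fleet decides how to run it."""
    return json.dumps({
        "job": job_name,
        "args": args or [],
        "enqueued_at": int(time.time()),
    })

def enqueue(queue_url, job_name, args=None):
    """Tiny enqueue-and-exit body for a cron entry, e.g.:
        0 2 * * * /usr/bin/python3 enqueue.py <queue-url> nightly-report
    boto3 is assumed to be installed on the cron host; it's imported lazily
    so the message format above can be tested without it."""
    import boto3
    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=build_job_message(job_name, args))
```

Cron's load stays negligible because the process does nothing but serialize a message and exit; all the heavy lifting happens wherever the queue consumers run.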

Their second beef was that cron tabs are hard to edit reliably. Their solution
was to write a giant CGI script in a web server that would read in the data
structure used by their scheduler for jobs, show it as a web page, let people
make changes to it, and then re-write the updated data structure to the
scheduler. Guess what, the cron tab is just the data structure for cron in a
text form so you can edit it by hand if necessary. Or you can use crontab -e
which does syntax checking, or you could even write a giant CGI script that
would read in the cron file, display it nicely on a web page, and then re-
write as syntactically correct cron tab when it was done.

Problem three was that their jobs were complex and _failed often_. This forced
their poor opsen to log in, cut and paste a complex command line and restart
the job. The real problem there is _jobs are failing_ which is going to
require someone to figure out why they failed. If you don't give a crap about
why they failed the standard idiom is a program that forks the child to do the
thing you want done and if you catch a signal from it that it has died you
fork it again[1]. But really what is important here is that you have a
configuration file _under source code control_ that contains the default
parameters for the jobs you are running so that starting them is just typing
in the jobname if you need to restart or maybe overriding a parameter like a
disk that is too full if that is why it failed. Again, nothing to do with cron
and everything to do with writing code that runs on servers.
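The fork-the-child-again idiom from footnote [1] can be sketched like this, with subprocess standing in for raw fork/exec (function and parameter names are mine):

```python
import subprocess
import time

def run_with_restarts(cmd, max_restarts=3, backoff=0.1):
    """Fork the job; if it dies nonzero, fork it again.

    This is the lame idiom from the comment: it restarts blindly and does
    nothing about *why* the job failed. Returns the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempts
        if attempts > max_restarts:
            raise RuntimeError(f"job kept failing after {attempts} attempts")
        time.sleep(backoff)  # avoid a tight restart loop
```

The `max_restarts` cap is exactly the thing the footnote says the naive version lacks: without it, a deterministic failure loops forever.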

And finally, there is no telemetry, no way to tell what is going on. Except
that UNIX and Linux have like a zillion ways to get telemetry out, the
original one is syslog, where jobs can send messages that get collected over
the network even, of what they are up to, how they are feeling and what, if
anything, is going wrong. There are even different levels like INFO, or FATAL
which tell you which ones are important. Another tried and true technique is
to dump a death rattle into /tmp for later collection by a mortician process
(also scheduled by cron).
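To make the severity-levels point concrete, here's a minimal sketch using Python's stdlib logging (a StringIO stands in for the transport; swapping the StreamHandler for `logging.handlers.SysLogHandler` would ship the same records to syslog, with the socket path being platform-dependent):

```python
import io
import logging

def make_job_logger(name, stream):
    """Severity-tagged job telemetry in the syslog style."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

buf = io.StringIO()
log = make_job_logger("nightly-report", buf)
log.info("started")                      # routine progress
log.critical("disk full, aborting")      # the one that should page someone
```

The level names are what let downstream collectors decide which messages matter, which is the whole point the comment is making about INFO vs. FATAL.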

At the end of the day, I can't see how _cron_ had anything to do with their
problems. Understanding the problem they were trying to solve in a general
way would have allowed them to see many solutions, both with cron and with
other tools that solve similar problems, and would have saved them from
re-inventing the wheel yet again and from re-discovering the bugs that those
other systems have already fixed over their millions (if not billions) of
hours of collective run time.

[1] Yes this is lame and you get jobs that just loop forever restarting again
and again which is why the _real_ problem is the failing not the forking.

~~~
pjungwir
This rant seems out of character, but personally I appreciate it. People are
always suggesting getting rid of cron, but I have always liked and trusted it.
I prefer using tools that are old, battle-tested, and standard, but I do try
to appreciate the advantages of new things. Cron seems to be a favorite target
of NIH, for as long as I remember but especially in these days of
"serverless", so it's easy to second-guess my appreciation for it. Thanks for
clarifying where their problems really were. There seems to be a common
temptation to think a new tool will solve your problems, when really many are
in the irreducible specificity of your own code or systems.

EDIT: Oh by the way, Ruby folks struggling with cron might appreciate this:
[https://github.com/pjungwir/cron2english/](https://github.com/pjungwir/cron2english/)

~~~
ChuckMcM
Cron2English looks very helpful for folks who struggle to read lines in a
crontab.

I recognize that it is a sore spot for me when people re-invent the wheel when
it seems clear they didn't have to and even clearer that the energy spent re-
inventing the wheel would have been better (in terms of using the wheel) spent
learning why the wheel is the way it is.

It is sort of the Chesterton's Fence of computer science. Don't tell me it's
wrong, tell me how it is right for what it does and where it falls short for
what you want it to do. Cron gets a bad rap here, as did sendmail for that
matter. When taken to extremes you replace working systems like init with
systems that are borked like systemd. No doubt cron is on the list of things
to be absorbed by the systemd borg at some point. Yes, it makes me grumpy.

~~~
JoachimSchipper
> cron ... absorbed by the systemd borg

This has already happened, right?
[https://wiki.archlinux.org/index.php/Systemd/Timers](https://wiki.archlinux.org/index.php/Systemd/Timers)

Not a systemd fan, myself.

~~~
SahAssar
IMO cron is one of the parts that a good process management system should
solve. With systemd timers I can do stuff like require other processes to run
for a timer to run, activate a process either on a timer or on a socket,
specify that it runs in a private chroot/netns/cgroup. These are all common
things that I might usually want a process or service to do, and putting
timed, manual, socket and path invocation of a service in the process
management system makes sense to me.
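The timer/service pairing described above looks roughly like this (unit names, the command, and the dependency on Postgres are made up for illustration):

```ini
# /etc/systemd/system/nightly-report.service
[Unit]
Description=Nightly report job
Requires=postgresql.service
After=postgresql.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/nightly-report
PrivateTmp=yes

# /etc/systemd/system/nightly-report.timer
[Unit]
Description=Run nightly-report at 02:00

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now nightly-report.timer`; `Persistent=true` gives the anacron-style catch-up run if the machine was down when the timer should have fired.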

I understand some of the systemd criticism, but I don't at all understand why
you want something else to manage your processes than the process management
system (which is basically always the init system)?

~~~
mmt
> I understand some of the systemd criticism, but I don't at all understand
> why you want something else to manage your processes than the process
> management system (which is basically always the init system)?

Perhaps you don't understand one of the fundamental, philosophical criticisms,
which is that it violates the Unix tenet of "do one thing and one thing well".

Timing and scheduling are difficult and (arguably, of course) deserve their
own, separate utility.

Cron isn't a process manager, any more than an interactive shell is [1]. It
just starts processes (and passes output to a local email client, though
that's understandably brittle).

I wouldn't object to moving the process-starting functionality out of cron and
leaving it with only the ability to tell init to "start job named XYZ", but I
do object to just integrating its core competency of scheduling into that
init.

[1] Arguably less than, considering modern shells' job control and signal
facilities

------
thdxr
We schedule jobs on our Elixir cluster. Nice to not need anything on top of
what you're already running

------
oneeyedpigeon
> Second, editing the plain text crontab is error prone

Doesn't every crontab in existence have a comment line giving the order of the
time columns? I know I always rely on such.

Although the article didn't touch on it, this point reminded me of yesterday's
discussion about manpages and command-line options. I think it's still the
case today that `crontab -e` is _the_ way to edit the crontab whilst `crontab
-r` (the key _right nextdoor_ in case this part needed stressing!) removes it
altogether.

~~~
Pete_D
I got tired of writing the time specs manually, so now I keep this script in
my PATH:

    
    
        #!/bin/sh
        # print a crontab entry for 2 minutes into the future
    
        date -d "+2 minutes" +"%M %H %d %m * /path/to/command"

------
sergiotapia
Using Elixir we just scheduled work using a simple GenServer. The initial
naive version has no tracking of jobs done/failed/etc., but you can add those
easily since they are just language constructs.

------
chrisferry
Kubernetes CronJobs resource would be my go to for this.
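For reference, a minimal CronJob manifest against the stable `batch/v1` API looks something like this (the name and image are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid   # don't start a new run while one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: nightly-report
            image: example/nightly-report:latest
```

You get scheduling, retries, logs (`kubectl logs`), and concurrency control without any bespoke scheduler code, at the cost of running a cluster.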

~~~
meddlepal
Good answer in 2018 but in 2016 the development of ScheduledJob (now CronJob)
had just begun around November/December of that year so it was not an option
for these guys.

------
whalesalad
The CPU comparisons are kinda funny to see. At first glance the low CPU usage
looks good but to me that’s wasted resources. Good to see a more efficient
system though. Hopefully those instances get allocated to different problem
sets.

------
haney
I use Apache Airflow with BashOperator for tons of stuff like this: a simple
web UI for logs/retries, support for dependencies between jobs, and when
tasks get more complex it's Python and supports extensions.

~~~
andscoop
Apache Airflow is a great way to future-proof your cron jobs. Existing cron
jobs can be easily migrated, and with that you'll get access to built-in
logging, distributed execution, connection management, a web UI for simple
monitoring, task retries, and more.

------
sideproject
A little tangential, but I recently created a small tool called "tsk"
(pronounced same as task)

[https://www.tsk.io](https://www.tsk.io)

I'm calling it a "speed-dial for your APIs". I used to have a VPS that would
run out of memory, so the easiest way to resolve this was to restart the
server. But even that was annoying, so I created a button that would call the
API to restart the box. Now I just have to click the button whenever I want
to reboot.

Similar to cron, I'm currently building a feature in tsk to schedule tasks.

~~~
andromedavision
My scraping VPS keeps running out of memory too and I restart it every couple
of days.

> so I created a button that would call the API to restart the box

Been thinking about building a 'button' that does this as well. Will check out
TSK to see how well it addresses this. Sound idea for sure.

~~~
sideproject
Thanks! Would love to hear your feedback on it!

------
zrail
We switched from Heroku Scheduler (very limited cron) to a system called
Sidekiq Cron[1] (if we used Sidekiq Enterprise we would use the built-in
scheduler). All Sidekiq Cron does is drop a job into the queue on a given
interval. We also use HireFire to auto-scale our workers as necessary to keep
things running.

[1]: [https://github.com/ondrejbartas/sidekiq-cron](https://github.com/ondrejbartas/sidekiq-cron)

[2]: [https://hirefire.io](https://hirefire.io)

~~~
shubh2336
We use Sidekiq's scheduled jobs[1] to replace the cron dependency in our
codebase. Did you try it before opting for Sidekiq Cron?

[1]: [https://github.com/mperham/sidekiq/wiki/Scheduled-Jobs](https://github.com/mperham/sidekiq/wiki/Scheduled-Jobs)

------
lykr0n
Where I work we have a similar product where we run all scheduled tasks on our
Mesos cluster. Same idea as plain Unix cron. You have a task that is executed
every N minutes/hours/days, it runs on a box, and does its thing.

It doesn't replace every cron job, but it is distributed, fault tolerant, and
only breaks when the Hadoop cluster backs up. The product mentioned here seems
like a good solution for a team that needs to execute Linux cron jobs without
much overhead.

------
saganus
A bit of a side-topic but, has anyone tried APS (Advanced Python Scheduler -
[https://apscheduler.readthedocs.io/en/latest/](https://apscheduler.readthedocs.io/en/latest/))
in production?

I've been evaluating it as it seems to provide fault-tolerance, but IMO the
documentation could be much better, with more examples (e.g. mixing different
triggers, configs, etc.)

Can anyone comment on it?

------
jskaggz
Job scheduling... We're using Jenkins heavily at one of my client's sites for
this (it's already used for builds, so scheduling jobs didn't seem that far
afield). The thing I'm missing, and would love to know if it exists, is a
calendar-like view of all of the scheduled jobs, a la Google Calendar. Surely
a product must exist that provides this view?

~~~
ficklepickle
[https://github.com/jenkinsci/calendar-view-plugin](https://github.com/jenkinsci/calendar-view-plugin)

------
sk5t
Here it's Quartz and its JDBC persistence baked into a Dropwizard application
with a custom resource implementing job and trigger CRUD ops, and a Spring-
based JobFactory for dependency injection stuff. Quartz has a few funny
behaviors that could be better, but on the whole it's been working out nicely
for scheduling across a multi-node stateless cluster.

------
wiradikusuma
For JVM (Java, Scala, Clojure, Groovy, etc), Quartz with JDBC (backed by DB
instead of in-memory) works best for us.

------
elijahchancey
It’s worth noting that SQS promises “at least once” delivery. In practice,
this means some messages will be delivered multiple times.

After every job is successfully finished, it should be noted in your db. Every
time a worker starts processing a job, it should check the db to see if this
job has already been run.
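That run-once check can be sketched with sqlite3 standing in for "your db" (the table and function names are made up for illustration):

```python
import sqlite3

def process_once(db, message_id, handler):
    """Run `handler` only if this message id hasn't been seen before.

    Relies on the primary-key constraint so two workers racing on the
    same duplicate delivery can't both win the insert."""
    db.execute("CREATE TABLE IF NOT EXISTS processed (id TEXT PRIMARY KEY)")
    try:
        db.execute("INSERT INTO processed (id) VALUES (?)", (message_id,))
        db.commit()
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: skip
    handler()
    return True
```

Note the tradeoff: marking before running means a crash mid-handler loses the retry, while marking only after success (as suggested above) needs a lock or lease to keep two workers from running the same job concurrently.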

------
OldHand2018
The old fogeys among us would point out that this is an incomplete
implementation of the batch queues that VMS has had for 3 or 4 decades. The
new fogeys would point out that 60% of this functionality (the hard parts) is
present in PowerShell and all the rest is a boring CRUD app.

------
kevan
Based on the CPU usage graph it looks like there's a lot more opportunity to
downsize the scheduler hardware. I'd be curious to see how their TaskWorker
cluster absorbs spikes in load when big cron jobs start.

------
scarface74
Or they could have just used Hashicorp’s Nomad....

[https://www.nomadproject.io/](https://www.nomadproject.io/)

It’s dead simple to set up and use.

~~~
spooneybarger
Nomad was started about 39 months ago. The post in question is from 32 months
ago when they wrote about a thing they had already built.

"We’ve been using Nextdoor Scheduler for over 18 months and we are extremely
happy with it."

That means that what they built pre-dates Nomad by at least 11 months.

------
mberning
They basically built sidekiq enterprise. Way to go.

------
wenbin
I'm the author of this blog post and the original developer of this scheduler
system.

Glad to see a lot of interesting and insightful comments in this thread!

Some context here:

0\. Like any piece of software, this scheduler system is not perfect for every
company -- legacy code, # of engineers, skillsets of existing employees,
engineering culture, tech stack, business,...

1\. Every engineer in Nextdoor needs to do on-call rotation -- 50+ engineers
back in 2014 (100+ now?), when the scheduler was built. It's important to run
a system that all 50+ engineers have confidence to debug on a Friday night if
things go south. There are some blackbox-type alternatives of cron, which may
be great for a small team of backend experts to operate, but may not be a good
fit for 50~100 engineers with very diverse skillsets & backend experience.
But why does every engineer need to be on call? Well, that may deserve a new
thread of discussion :)

2\. In 2014, there were ~200 production jobs with very different computing
characteristics (450+ now). Jobs are owned by different teams and different
people with different expertise. Jobs are frequently (every few weeks?) added,
removed, and updated. Outages most likely happen during code deployment. It’s
important to run a system that works well during deployment (and rollback),
e.g., always run the RIGHT version of code for hundreds of jobs.

3\. The core scheduler system was pretty tiny, which could be easily
understood by any engineer in the company. I remember it took me a few days
(probably less than a week) to finish most of the code. The hard part was
productionization, e.g., carefully migrating 200 production jobs to the new
system, logging, monitoring/alerting at job level, deployment & rollback
process, oncall-related things (documentation/training)… With this simple
system, we could easily enumerate various failure situations, so oncall
engineers have confidence how to respond.

4\. I’m not with Nextdoor any more. But turns out this scheduler system is
still working well now:
[https://github.com/Nextdoor/ndscheduler/issues/33#issuecomme...](https://github.com/Nextdoor/ndscheduler/issues/33#issuecomment-387954933)
I think the ROI is pretty good -- a 3-man-month project (from design to run
all 200+ jobs on the new system) => running 4 years and counting & easy oncall
situation for 50~100+ engineers.

5\. Why not Jenkins or other open source alternatives? We investigated a bunch
of alternatives. Jenkins is great. Back in 2014, Nextdoor used Jenkins for CI
(not sure if it's still true now). We ruled out Jenkins (and Rundeck or the
like) for operational reasons, e.g., challenges to integrate with existing
code deployment process, operational complexity for 50+ engineers with
different backgrounds/expertise... Open source alternatives? Well, in 2014, we
couldn't find a good Python-based project. It's important to limit the number
of languages/external technologies in the tech stack.

------
hkchad
cloudwatch rules ??

~~~
gulperxcx
Bingo! Those work great paired with an AWS Lambda function

------
Something1234
Have they heard of crontab -e? It checks that your cron jobs have valid
syntax. There are a crazy number of edge cases related to time. Dealing with
time sucks; let the smart people who battle-tested and built cron deal with
it.

~~~
yuliyp
What does that have to do with the problems that they dealt with? (randomly
failing jobs, resource management, monitoring, etc)

~~~
dfc
I think the comment is in reference to the second problem listed, "editing the
plaintext file is difficult."

------
mamcx
So, if not cron, what is better?

------
yeukhon
Aurora? Nomad?

------
spork12
We use Jenkins when we need something more complicated than cron can provide.

------
dominotw
jenkins?

------
moonbug
They really ought to have just learnt to use cron properly.

------
dominotw
I was expecting everything to be 'reactive' (i.e. react to events in real
time) so there would be no need for cron jobs.

