Interesting they went straight from "1 box running shell scripts with flock" to "mega custom thing" w/out going thru something like kube cron jobs or an off-the-shelf scheduler in between.
I find jumps like this hint at "political stiction": sometimes it's hard to get permission to do incremental updates to things, so you have to wait until the smoke from the burning tires is unmissable, then get big political consensus and a "visible project" to allocate budget and time for what would otherwise be unsexy maintenance work.
> Interesting they went straight from "1 box running shell scripts with flock" to "mega custom thing"
I’ve ended up building a custom job scheduler at a couple companies I’ve worked at. It’s a fun little problem. I don’t think it’s fair to characterise the problem as a “mega” thing at all - you can build a custom task scheduler on top of a database or Kafka that’ll be way more reliable than cron in a few hundred lines of code, in just about any language.
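To give a sense of what "a few hundred lines" means, here's a minimal sketch of the database-backed version (assuming Postgres and psycopg; the tasks table, its columns, and the shell-out are all illustrative): workers claim due tasks with SKIP LOCKED so only one runner picks each job up.

    import subprocess
    import time

    import psycopg  # Postgres driver; any database with row locking works

    # Assumes a table like:
    #   tasks(id, command text, next_run_at timestamptz, run_interval interval)
    CLAIM_SQL = """
    UPDATE tasks
       SET next_run_at = next_run_at + run_interval
     WHERE id = (SELECT id FROM tasks
                  WHERE next_run_at <= now()
                  ORDER BY next_run_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1)
    RETURNING id, command;
    """

    def run_forever(dsn: str) -> None:
        with psycopg.connect(dsn, autocommit=True) as conn:
            while True:
                row = conn.execute(CLAIM_SQL).fetchone()
                if row is None:
                    time.sleep(1)  # nothing due; poll again shortly
                    continue
                task_id, command = row
                subprocess.run(command, shell=True)  # the job itself

Run a handful of these workers and you get concurrency, failover, and a job table you can point a dashboard at, which is the point being made above.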
Are those lines of code wasted? Maybe. But the trade off is that you’re the world expert in that little thing you made. It’ll be easy to connect it to your dashboards for observability, and add the exact set of features you need in your org.
I don’t think everyone should roll their own task runner. But it’s not the mega engineering problem you’re imagining. A week or two of one engineer’s time is peanuts at a company like slack.
This line from the k8s CronJob docs has always made me nervous about adopting them:
> The scheduling is approximate because there are certain circumstances where two Jobs might be created, or no Job might be created. Kubernetes tries to avoid those situations, but does not completely prevent them.
This Stack Overflow [0] answer makes it sound like that's just stating a triviality about distributed systems.
For example, if whatever is responsible for kicking off your cron job is down at the moment it's supposed to run, then it won't run.
The Slack blog says they did some tinkering, like preventing nodes from going down at the top of a minute because that's when they think cron jobs are most likely to run. But at scale, things are going to break when they break, and you have to weigh the pros and cons of designing the jobs to be robust to failure versus trying to organize failures to correspond to the needs of your jobs.
So I think there is space for solutions that make different tradeoffs. But it does seem vastly easier to tune an existing solution that someone else is maintaining than to build your own solution on top of Kafka.
Most distributed systems promise at-least-once or at-most-once delivery. You can often achieve exactly-once in practice with a combination of idempotent APIs and a shared database.
For a task runner, there are a lot of different behaviours you might want if the system crashes. Maybe the runner should “catch up” after coming back online. That’s easy enough to achieve if you move away from cron and track which tasks have been run in a small data store somewhere.
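A rough sketch of that catch-up behaviour, assuming a SQLite file as the "small data store" (the table, task name, and hourly interval are just placeholders): record the last completed run per task, and on startup replay any intervals missed while the box was down.

    import sqlite3
    import time

    INTERVAL = 3600  # hypothetical hourly task

    def catch_up(db: sqlite3.Connection, task: str, fn) -> None:
        db.execute("CREATE TABLE IF NOT EXISTS runs (task TEXT PRIMARY KEY, last_run REAL)")
        row = db.execute("SELECT last_run FROM runs WHERE task = ?", (task,)).fetchone()
        last_run = row[0] if row else time.time() - INTERVAL
        # Replay every interval slept through, then carry on as normal.
        while last_run + INTERVAL <= time.time():
            last_run += INTERVAL
            fn(last_run)  # pass the logical run time so the job knows which window it covers
            db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)", (task, last_run))
            db.commit()

    db = sqlite3.connect("runs.db")
    catch_up(db, "hourly-report", lambda t: print("report for window ending", time.ctime(t)))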
Fair, tho a load-bearing cronjob pet at the beating heart of a $20bn company with 30m+ users is further than I would dare take this advice if I had to carry the pager. It's very common though, and I guess the blog is more evidence that simple things (greased by some tears and toil) can take you much further than you might think.
This is because simplicity in a complex system allows easy problem solving and tooling built around it. Knowledge silos and haunted forests are much easier to avoid this way. For any complex distributed system to be comprehensible to a team, let alone a single person, it has to have individually simple components with obvious designs. There are way too many Rube Goldberg machines out there.
Personally I'd see it as a negative vs using an industry standard solution.
However, I'm sure some folks would be tempted to add something like "designed and implemented a distributed task scheduler and execution engine for generalized asynchronous jobs utilized by X number of devs across Y teams" to their resumes.
My question is along the lines of: do you think it's relevant enough to deserve a spot on a resume? Even if it's something you personally don't like.
Because I can't imagine why it would warrant that relevance. It's right there with "implemented a function to reverse a list because the stdlib had a bug".
The bet you're making is that you never need to do the refactor. The amount of absolutely garbage code that has never needed to change is easily worth a wait-and-see attitude.
Slack running all their scheduled tasks on a single box using linux cron all the way up till now is an amazing point in favor of every "you don't need overly complex system architecture at your company" argument anyone has ever made.
Reading this I realise an unconscious rule I've followed for years - Cron is for running jobs pertaining to that box, and that box alone. It _can_ be used to schedule anything, but that doesn't mean it should, especially business logic.
"I wrote a CRM system in ksh script, it works great!"
I can’t believe an org of Slack’s size relied on cron scripts for anything critical. It is the worst possible way to schedule jobs at scale. Serious problems with discoverability, single points of failure, and failure mode options.
Also surprised they didn’t just use an open source scheduler or product. There are a gazillion of them.
Why is this so hard to believe? This is proof that simple software can scale very well. Your takeaway, IMO, should be that a lot of solutions are over-engineered relative to the businesses they serve. Not every company handles Slack's scale.
It's no longer simple when you have platform code to prevent nodes from disappearing or dying on the minute, on a Kubernetes cluster the size of Slack's. A triggered pulse event stream would have done the trick to invoke a lambda or call code for every "thing" that needed a beat. Kubernetes comes with a scheduler…
This is my take as well: just because you can get away with something doesn't necessarily make it desirable to maintain or extend. I honestly cannot imagine the effort in terms of labor hours you'd have to go through to develop something like this, compared to just plugging an off-the-shelf scheduler into something slightly more sophisticated like k8s, or even just a worker-and-queue system. When you're talking about platform engineering to solve a problem that a relatively extensible Celery service could handle (with tests and such), I have no idea how the former could be "less work" or cheaper in the long haul.
Hard to believe because I have seen the cron dumpster fire at so many companies. Random cron jobs running on random boxes end up with mystery jobs whose existence often goes unnoticed as people leave the org and docs go out of date. Random accounts with different cron jobs, with no visibility from one account to another. Box failure leads to mystery jobs failing with no way to figure out what happened. No "advanced" features like retry or retry with exponential backoff. No database with job history. No parallelism.
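To make the "retry with exponential backoff" point concrete, here's a minimal sketch of what teams end up hand-rolling around their cron entry points, since plain cron gives you nothing here:

    import random
    import time

    def run_with_backoff(job, attempts=5, base=2.0, cap=300.0):
        """Retry `job` with exponential backoff and full jitter."""
        for attempt in range(attempts):
            try:
                return job()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries; let the failure surface somewhere visible
                time.sleep(min(cap, base * 2 ** attempt) * random.random())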
Every place I have worked cron turned into a dumpster fire.
Sure, but those are organisational failures, not technical failures. If you need better technology because the org sucks at planning, the tech isn’t the issue and you are just sweeping issues under the carpet.
Alternatively, if you treat cron as a starting point for a startup, something you expect to outgrow when scaling, and a simple format for a job queue, then the context of the post makes sense.
Organisations that cobble together cron scripts for critical apps become "an org of Slack's size".
Companies that deploy solutions that handle all the problems with discoverability, single points of failure, failure mode options, and interplanetary multi-species scale when they have 0 customers never become "an org of Slack's size".
> It is the worst possible way to schedule jobs at scale.
The literal evidence proves that this approach is absolutely workable at scale, so “the worst possible way” clearly doesn’t apply.
The interesting question here is “how did they make this work?” The answer to which is immensely valuable to the technical community at large. “What are they building next?”, whilst interesting, is less immediately valuable as they will be building something Slack-scale, and most orgs are not Slack or Slack-scale
> The literal evidence proves that this approach is absolutely workable at scale, so “the worst possible way” clearly doesn’t apply.
The “worst possible” to me implies that it is possible.
Think about trying to solve the same problem by hurling engineers into an erupting volcano. It is expensive, hurts morale, causes staff retention issues, and also fundamentally does not solve the task of scheduled task running. I would not describe that as "worst possible" because it lacks the second factor by not being a possible solution.
I've been looking for software in that space: a thing that runs cron jobs, with the ability to kick off ad-hoc runs from some UI, see what is running, and, if possible, parameterise those ad-hoc runs (i.e., run a report for a different client than the usual cron does).
There's a lot to explore here. First, I'd suggest that cron and shell scripts may not be what you want. Cron has a complex format for scheduling and can lack features like sub-minute scheduling, maintenance windows, and other task cadences. Shell scripts are OK for small things, but I often find I move to a different language if I keep something around long enough. Do the people writing jobs know shell?
For jobs run by data analysts, airflow and python work great. For devops jobs, begrudgingly, Jenkins or GitHub Actions. But there's so many varieties.
That's what I did. I know it was the right solution because I happily keep adding more and more scripts to run. I didn't do that when I was using cron jobs.
Plus, Jenkins has a few nice extensions to the crontab, including setting a timezone and using "H" to spread job execution load.
o rly? what happens if they fails? can you control the backoff period? can you schedule another job in the event it fails? what if it fails before starting the function, so your in-code error handling isn't triggered? is there a "dead-letter" queue? will the next scheduled job run if the previous run failed? should it? can you define if it should or not? can you view history of executed jobs? what if your logging function that saves that history fails? can anyone add jobs or just you/superadmin? should they? and more and more.
And these are just ones I personally ran into using GCP schedulers, pub/sub and functions.
See, what you're doing is re-inventing the wheel.
No matter how cool your tool is - there's always work in the edge cases beyond just running the job.
There is nothing about your questions that wouldn't apply to any system.
Literally all of your questions are answered in the rather well written documentation. I could go through them and answer them for you, but I don't think you'd really appreciate that.
What does the workflow for job registration look like? Does an engineer upload a single script/binary to some storage service (e.g., EFS/S3/whatever), or build and upload a container with the script/binary, register the rate in the conductor UI, and then the job queue just pulls the script or container and runs it? Is registration through an IaC of sorts, or just through some UI/CLI?
I would be curious why kube cron jobs didn't seem to fit the bill. My favorite part of these posts is when they have a section hinting that they explored other options and picked specific tradeoffs.
Yeah, the article raises more questions than it answers.
> When designing this new, more reliable service, we decided to leverage many existing services to decrease the amount we had to build
This might explain building from scratch. Maybe the existing solutions had dependencies they didn't want to maintain and they opted for using the existing internal systems. It feels like that influenced all the rest.
I definitely don't come anywhere near Slack's scale but I've managed systems where over 3,000 cron jobs ran per day, half of which came from a cron job running every minute which usually finished in a few seconds. Some of these jobs run for X minutes too.
It's nice because there are properties you can configure for each cron job around retries and whether it should run uniquely or not. Maybe certain cron jobs should be retried if they fail; for others, maybe it's OK to be picked up on the next interval.
Overall it's been super stable for almost 2 years which is when I started using them. Only a handful of jobs failed over this period of time and they weren't the result of Kubernetes, it was because the HTTP endpoint that was being hit from the cron job failed to respond and the cron job failure threshold was reached.
It's a good reminder that important jobs run on a schedule should be resilient to failure (saving progress, idempotent, etc.).
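A tiny illustration of what "saving progress, idempotent" can look like in practice (SQLite and the item IDs are just stand-ins): mark each unit of work as done as you go, so a retry or an accidental double trigger skips anything already processed.

    import sqlite3

    def process(item_id: str) -> None:
        print("processing", item_id)  # the real work goes here

    def run_job(item_ids) -> None:
        db = sqlite3.connect("progress.db")
        db.execute("CREATE TABLE IF NOT EXISTS done (item_id TEXT PRIMARY KEY)")
        for item_id in item_ids:
            try:
                # Primary key violation means a previous run already handled this item.
                db.execute("INSERT INTO done VALUES (?)", (item_id,))
            except sqlite3.IntegrityError:
                continue
            process(item_id)
            db.commit()  # persist progress per item, not once at the end

    run_job(["a", "b", "c"])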
Do you capture all of your job code in a single image and reference execution paths on container startup per job? Or, are you building an image per job?
The jobs all run curl commands to a specific API endpoint with a specific bearer token. Those tokens are loaded through an env through SealedSecrets.
They all use the public curl image where I override the command in the Kubernetes cron job definition. The job container itself starts almost instantly since there's no app to boot.
If I had a case you're describing I would use the main app's image and run a specific command, in this case I'm assuming if there's not an API endpoint it would be some callable script that lives in your app's code / image.
Spinning up a new Kubernetes pod for every single job run is a very expensive and wasteful operation, starting at least in the order of seconds (usually more) vs just milliseconds for a new process in an already hot environment.
Sure, but if you need that thing to run every hour for a few seconds, then seconds aren’t really the limiting factor. I don’t doubt that the resource management side of k8s would make it dicey at a certain volume of these things running, though, especially if they eat a lot of compute.
I'm still over here loving cron, not at this large of a scale for sure but I love getting all these tools to email me key information, in plain text, at the right times.
I feel like a more effective way to use cron is just to dispatch jobs into a queue that will perform the actual processing. And not to do the processing within the cron scripts themselves. That way the load on the cron is light and the heavy lifting is done by your queue/worker system.
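A minimal version of that split, assuming Redis as the queue (the queue name and payload are invented): the script cron invokes only enqueues a message, and a long-running worker pool does the actual processing.

    import json
    import time

    import redis  # assumes a Redis instance; any broker works the same way

    # enqueue.py -- what the cron entry runs; finishes in milliseconds
    def enqueue() -> None:
        r = redis.Redis()
        r.rpush("jobs", json.dumps({"job": "send_digest", "queued_at": time.time()}))

    # worker.py -- long-running process doing the heavy lifting
    def work_forever() -> None:
        r = redis.Redis()
        while True:
            _, raw = r.blpop("jobs")  # blocks until a job arrives
            job = json.loads(raw)
            print("working on", job["job"])  # real processing goes here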
A lifetime ago I scaled up cron jobs for a client with Gearman. Using cron to trigger jobs on the Gearman server and the pool of runners to do all the work. This proved to be so reliable they still use the system today, over 10 years later.
> To make this transition between pods seamless, we implemented logic to prevent the node from going down at the top of a minute when possible since — given the nature of cron — that is when it is likely that scripts will need to be scheduled to run.
Cron jobs with precisely specified times, where everyone just uses whole minutes/hours, are not great practice. It's very unlikely you actually need such precision in a cron job, and you get spiky load.
Usual approach is to set the minute to a hash of the cron config name or something, modulo 60. Hourly jobs still run hourly, but each one on random minute.
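Concretely, something like this (the job name is arbitrary): derive a stable minute from a hash of the job's name, so hourly jobs still fire hourly but the fleet isn't synchronised on :00.

    import hashlib

    def minute_for(job_name: str) -> int:
        # Stable pseudo-random minute in [0, 59], derived from the job name.
        digest = hashlib.sha256(job_name.encode()).digest()
        return int.from_bytes(digest[:4], "big") % 60

    # e.g. emit the crontab line for an hourly job
    print(f"{minute_for('nightly-report')} * * * * /usr/local/bin/nightly-report")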
(Setting aside how fragile that setup of avoiding pod downtime sounds)
> we implemented logic to prevent the node from going down at the top of a minute when possible since — given the nature of cron — that is when it is likely that scripts will need to be scheduled to run
Why not smear the start time of the jobs across seconds of that minute to avoid any thundering herd problems? How much functionality relies on a script being invoked at exactly the :00 mark? And if the functionality depends on that exact timing, doesn’t it suggest something is fragile and could be redesigned to be more resilient?
At their scale, staggering script start times over a 60 second window likely wouldn’t have much of an impact if they are experiencing a thundering herd, imo. If it did help, it would be a bandaid and ticking time bomb before someone has to actually solve the load problem that staggering start times kicked down the road
As in, if you have 500 cron scripts and you think you're reaching the capacity of that box, just distribute the 500 scripts from one crontab file to two boxes with 250 each?
If one cares more about reliability, you can keep tabs on the cron scripts starting at their scheduled times, and if they don't, bring the box down and start a new box with the same crontab?
this is a pretty simple cron system. curious if the authors investigated temporal and other similar workflow engines for the advanced cron feature set (https://docs.temporal.io/workflows#spec disclaimer: i used to work there)
There are certainly use cases for which it's more than is required. Like the most simple would be adding a cron string to a GitHub Action or Vercel function, but in most cases, and certainly Slack's case, you want more reliability, scalability, flexibility, and/or observability. And Temporal has extreme levels of those things, including pausing & editing schedules and seeing the current status of all triggered scripts/functions and all the steps a function has taken so far, and guaranteeing the triggered function completes, including ensuring that if the process or container dies, the function continues running on another one. Even if you don't care about all those things, you might care about some of them in the future, and it doesn't hurt to run a system that has capabilities you don't use.
Depends on how many cron jobs you have and what you need out of it?
Operating Temporal is not that hard -- you can start with `temporal --dev` on your own box. I have a "Nomad-Temporal" Terraform Module to stand one up on Nomad. [1] Temporal has Helm Charts for Kubernetes [2]. There is also Temporal Cloud [3].
That said, there is currently a chasm between "script in cronjob" to "scheduled task in Temporal". The focus of Temporal is more "Enterprise, get your Business Processes on Temporal", not "soloist, ditch your cron".
There's certainly space for somebody to make a DAG dataflow thing or lower-code product over Temporal. Airplane.dev [4] was built on Temporal and was approaching this; it was acquired by Airtable.
I recently evaluated Dagster, Prefect, and Flyte for a data pipeliney workflow and ended up going with Temporal.
The shared feature between Temporal and those three is the workflow orchestration piece. All 3 can manage a dependency graph of jobs, handle retries, start from checkpoints, etc.
At a high level the big reason they’re different is Temporal is entirely focused on the orchestration piece, and the others are much more focused on the data piece, which comes out in a lot of the different features. Temporal has SDKs in most languages, and has a queuing system that allows you to run different workflows or even activities (tasks within a workflow) in different workers, manage concurrency, etc. You can write a parent workflow that orchestrates sub-workflows that could live in 5 other services. It’s just really composable and fits much more nicely into the critical path of your app.
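To give a feel for the shape of it, here's a minimal sketch with the Temporal Python SDK (the workflow, activity, and timeout are invented for illustration); the activity can be executed by a worker in a completely different service, which is the composability described above.

    from datetime import timedelta

    from temporalio import activity, workflow

    @activity.defn
    async def send_digest(user_id: str) -> None:
        ...  # heavy lifting; runs on whichever worker polls this activity's task queue

    @workflow.defn
    class DigestWorkflow:
        @workflow.run
        async def run(self, user_id: str) -> None:
            # Temporal retries the activity and resumes the workflow if a worker dies.
            await workflow.execute_activity(
                send_digest,
                user_id,
                start_to_close_timeout=timedelta(minutes=5),
            )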
Prefect is probably the closest of your list to Temporal, in that it's less opinionated than the others about the workflows being "data oriented", but it's still Python-only and it doesn't have queueing. In short, this means your workflows are kinda supposed to run in one box running Python somewhere. Temporal will let you define a 10-part workflow where two parts run on a Python service with a GPU, and the remaining parts run in the same node.js process as your main server.
Dagster’s feature set is even more focused on data-workflows, as your workflows are meant to produce data “assets” which can be materialized/cached, etc.
They’re pretty much all designed for a data engineering team to manage many individual pipelines that are external from your application code, whereas temporal is designed to be a system that manages workflow complexity for code that (more often) runs in your application.
They definitely are similar and can be used for similar functions but Cadence/Temporal are focused on code orchestration side rather than data orchestration.
I find that comparison interesting, because I don't think of Airflow as particularly data-oriented in terms of its features and core functionality. I tend to think of Airflow as "Cron + Make", with any "data-oriented" features being nice to have, but not essential.
I'm substantially less familiar with Dagster and Prefect so can't comment as much on those.
Maybe the most data-oriented thing about Airflow is its concept of a data interval, where each DAG run is associated with some "logical date" and an interval of time that starts from the logical date (inclusive) and ends at the next logical date in the schedule (exclusive). The idea is that if you have a daily task that runs at 1 AM, then the task is expected to operate on data starting from "yesterday at 1 AM" until "today at 1 AM". But it's entirely up to the user/developer what you actually do with those logical date ranges, and you're free to ignore them entirely if you don't need them.
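Roughly what that looks like in a recent Airflow TaskFlow DAG (the DAG, task, and query are made up); the interval bounds arrive via the task context, and you're free to ignore them:

    import pendulum
    from airflow.decorators import dag, task
    from airflow.operators.python import get_current_context

    @dag(schedule="0 1 * * *",  # daily at 1 AM
         start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
         catchup=False)
    def daily_rollup():
        @task
        def rollup():
            ctx = get_current_context()
            start, end = ctx["data_interval_start"], ctx["data_interval_end"]
            # e.g. SELECT ... WHERE created_at >= start AND created_at < end
            print(f"rolling up rows from {start} to {end}")

        rollup()

    daily_rollup()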
I’ve only recently encountered airflow for the first time, and have been surprised at how half-baked it is for being damn near the industry standard (as far as open source, anyway). And it was a lot worse until recently!
Dynamic task dispatch being a relatively recent feature. The fundamental design imposing lots of structure (well, kind of—you can skip lots of it, but it takes time to figure that out) to practically no benefit (and god, is the terminology dumb, made all the more so because half the stuff it names is nearly useless). “Oh yeah the scheduler just crashes or locks up while still health-checking all the time, standard practice to so restart it frequently” posted on a hundred different issues dating from yesterday to years ago (many fixed! And yet…). It’s pretty bad at passing data between tasks (see again: lots of structure, little benefit)
Interesting to see different teams' approaches to the same issue. Often when reading this sort of thing I'm in PR mode and question why they did certain things the way they did. Then I realise that has probably been asked and answered a dozen times before it got released.
There are numerous mature, battle-tested open source solutions for distributed and/or centrally managed job queues, which really makes me wonder how they justified building something from scratch.
I think there's a bit of "because they could", but also something that gets very little consideration in many contexts unless you have experienced the contrary: integration is costly, and integrating properly is sometimes more work than doing something from "scratch". So you don't do it properly, and then you have a mess that hurts you in the long run.
I'm sure it's indeed something like that. I think it also comes down to, at least partly, having a culture that is more about building components than systems. I suspect it could also be the "buzz" factor. The press release about building a new system always seems more exciting than one about solving a familiar problem with boring old existing software.
We have a Slack reminder that executes on the hour. I've noticed that it takes about 30 seconds to a minute past the hour for the reminder to actually fire.
jitter on top of the hour for massively scaled notifications is typical.
OF COURSE everyone wants reminders at the even points, but what, do you scale up your system for every half-hour/hourly peak? Or just put in the TOS that the activation will jitter by a bit.
This is unhelpful without a more explicit and detailed account of what is deficient in their approach. It's the internet, so I'm not holding my breath, but FYI.