Interesting they went straight from "1 box running shell scripts with flock" to "mega custom thing" w/out going thru something like kube cron jobs or an off-the-shelf scheduler in between.
I find jumps like this hint at "political stiction": sometimes it's hard to get permission to do incremental updates to things, so you have to wait until the smoke from the burning tires is unmissable, then get big political consensus and a "visible project" to allocate budget and time for what would otherwise be unsexy maintenance work.
> Interesting they went straight from "1 box running shell scripts with flock" to "mega custom thing"
I’ve ended up building a custom job scheduler at a couple companies I’ve worked at. It’s a fun little problem. I don’t think it’s fair to characterise the problem as a “mega” thing at all - you can build a custom task scheduler on top of a database or Kafka that’ll be way more reliable than cron in a few hundred lines of code, in just about any language.
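To give a sense of what "a few hundred lines" means, here's a minimal sketch of the database-backed version (assuming Postgres and psycopg; the tasks table, its columns, and the shell-out are all illustrative): workers claim due tasks with SKIP LOCKED so only one runner picks each job up.

    import subprocess
    import time

    import psycopg  # Postgres driver; any database with row locking works

    # Assumes a table like:
    #   tasks(id, command text, next_run_at timestamptz, run_interval interval)
    CLAIM_SQL = """
    UPDATE tasks
       SET next_run_at = next_run_at + run_interval
     WHERE id = (SELECT id FROM tasks
                  WHERE next_run_at <= now()
                  ORDER BY next_run_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1)
    RETURNING id, command;
    """

    def run_forever(dsn: str) -> None:
        with psycopg.connect(dsn, autocommit=True) as conn:
            while True:
                row = conn.execute(CLAIM_SQL).fetchone()
                if row is None:
                    time.sleep(1)  # nothing due; poll again shortly
                    continue
                task_id, command = row
                subprocess.run(command, shell=True)  # the job itself

Run a handful of these workers and you get concurrency, failover, and a job table you can point a dashboard at, which is the point being made above.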
Are those lines of code wasted? Maybe. But the trade off is that you’re the world expert in that little thing you made. It’ll be easy to connect it to your dashboards for observability, and add the exact set of features you need in your org.
I don’t think everyone should roll their own task runner. But it’s not the mega engineering problem you’re imagining. A week or two of one engineer’s time is peanuts at a company like slack.
This line from the k8s CronJob docs has always made me nervous about adopting them:
> The scheduling is approximate because there are certain circumstances where two Jobs might be created, or no Job might be created. Kubernetes tries to avoid those situations, but does not completely prevent them.
This Stack Overflow [0] answer makes it sound like that's just stating a triviality about distributed systems.
For example, if whatever is responsible for kicking off your cron job is down at the moment it's supposed to run, then it won't run.
The Slack blog says they did some tinkering, like preventing nodes from going down at the top of a minute because that's when they think cron jobs are most likely to run. But at scale, things are going to break when they break, and you have to weigh the pros and cons of designing the jobs to be robust to failure versus trying to organize failures to correspond to the needs of your jobs.
So I think there is space for solutions that make different tradeoffs. But it does seem vastly easier to tune an existing solution that someone else is maintaining than to build your own solution on top of Kafka.
Most distributed systems promise at-least-once or at-most-once delivery. You can often achieve exactly-once in practice with a combination of idempotent APIs and a shared database.
For a task runner, there are a lot of different behaviours you might want if the system crashes. Maybe the runner should “catch up” after coming back online. That’s easy enough to achieve if you move away from cron and track which tasks have been run in a small data store somewhere.
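A rough sketch of that catch-up behaviour, assuming a SQLite file as the "small data store" (the table, task name, and hourly interval are just placeholders): record the last completed run per task, and on startup replay any intervals missed while the box was down.

    import sqlite3
    import time

    INTERVAL = 3600  # hypothetical hourly task

    def catch_up(db: sqlite3.Connection, task: str, fn) -> None:
        db.execute("CREATE TABLE IF NOT EXISTS runs (task TEXT PRIMARY KEY, last_run REAL)")
        row = db.execute("SELECT last_run FROM runs WHERE task = ?", (task,)).fetchone()
        last_run = row[0] if row else time.time() - INTERVAL
        # Replay every interval slept through, then carry on as normal.
        while last_run + INTERVAL <= time.time():
            last_run += INTERVAL
            fn(last_run)  # pass the logical run time so the job knows which window it covers
            db.execute("INSERT OR REPLACE INTO runs VALUES (?, ?)", (task, last_run))
            db.commit()

    db = sqlite3.connect("runs.db")
    catch_up(db, "hourly-report", lambda t: print("report for window ending", time.ctime(t)))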
Fair, tho a load-bearing cronjob pet at the beating heart of a $20bn company with 30m+ users is further than I would dare take this advice if I had to carry the pager. It's very common though, and I guess the blog is more evidence that simple things (greased by some tears and toil) can take you much further than you might think.
This is because simplicity in a complex system allows easy problem solving and tooling built around it. Knowledge silos and haunted forests are much easier to avoid this way. For any complex distributed system to be comprehensible to a team, let alone a single person, it has to have individually simple components with obvious designs. There are way too many Rube Goldberg machines out there.
Personally I'd see it as a negative vs using an industry standard solution.
However, I'm sure some folks would be tempted to add something like "designed and implemented a distributed task scheduler and execution engine for generalized asynchronous jobs utilized by X number of devs across Y teams" to their resumes.
My question is along the lines of: do you think it's relevant enough to deserve a spot on a resume? Even if it's something you personally don't like.
Because I can't imagine why it would warrant that relevance. It's right there with "implemented a function to reverse a list because the stdlib had a bug".
The bet you're making is that you never need to do the refactor. The amount of absolutely garbage code that has never needed to change is easily worth a wait-and-see attitude.
Slack running all their scheduled tasks on a single box using linux cron all the way up till now is an amazing point in favor of every "you don't need overly complex system architecture at your company" argument anyone has ever made.
Reading this I realise an unconscious rule I've followed for years - Cron is for running jobs pertaining to that box, and that box alone. It _can_ be used to schedule anything, but that doesn't mean it should, especially business logic.
"I wrote a CRM system in ksh script, it works great!"
I can’t believe an org of Slack’s size relied on cron scripts for anything critical. It is the worst possible way to schedule jobs at scale. Serious problems with discoverability, single points of failure, and failure mode options.
Also surprised they didn’t just use an open source scheduler or product. There are a gazillion of them.
Why is this so hard to believe? This is proof that simple software can scale very well. Your takeaway, IMO, should be that a lot of solutions are over-engineered relative to the businesses they serve. Not every company handles Slack's scale.
It's no longer simple when you have platform code to prevent nodes from disappearing or dying on the minute, on a Kubernetes cluster the size of Slack's. A triggered pulse event stream would have done the trick to invoke a lambda or call code for every "thing" that needed a beat. Kubernetes comes with a scheduler…
This is my take as well: just because you can get away with something doesn't necessarily make it desirable to maintain or extend. I honestly cannot imagine the effort in terms of labor hours you'd have to go through to develop something like this, compared to just plugging an off-the-shelf scheduler into something slightly more sophisticated like k8s, or even just a worker-and-queue system. When you're talking about platform engineering to solve a problem that a relatively extensible Celery service could handle (with tests and such), I have no idea how the former could be "less work" or cheaper in the long haul.
Hard to believe because I have seen the cron dumpster fire at so many companies. Random cron jobs running on random boxes end up with mystery jobs whose existence often goes unnoticed as people leave the org and docs go out of date. Random accounts with different cron jobs, with no visibility from one account to another. Box failure leads to mystery jobs failing with no way to figure out what happened. No "advanced" features like retry or retry with exponential backoff. No database with job history. No parallelism.
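To make the "retry with exponential backoff" point concrete, here's a minimal sketch of what teams end up hand-rolling around their cron entry points, since plain cron gives you nothing here:

    import random
    import time

    def run_with_backoff(job, attempts=5, base=2.0, cap=300.0):
        """Retry `job` with exponential backoff and full jitter."""
        for attempt in range(attempts):
            try:
                return job()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries; let the failure surface somewhere visible
                time.sleep(min(cap, base * 2 ** attempt) * random.random())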
Every place I have worked cron turned into a dumpster fire.
Sure, but those are organisational failures, not technical failures. If you need better technology because the org sucks at planning, the tech isn’t the issue and you are just sweeping issues under the carpet.
Alternatively, if you treat cron as a starting point for a startup, something you expect to outgrow when scaling, and a simple format for a job queue, then the context of the post makes sense.
Organisations that cobble together cron scripts for critical apps become "an org of Slack's size".
Companies that deploy solutions that handle all the problems with discoverability, single points of failure, failure mode options, and interplanetary multi-species scale when they have 0 customers never become "an org of Slack's size".
> It is the worst possible way to schedule jobs at scale.
The literal evidence proves that this approach is absolutely workable at scale, so “the worst possible way” clearly doesn’t apply.
The interesting question here is “how did they make this work?” The answer to which is immensely valuable to the technical community at large. “What are they building next?”, whilst interesting, is less immediately valuable as they will be building something Slack-scale, and most orgs are not Slack or Slack-scale
> The literal evidence proves that this approach is absolutely workable at scale, so “the worst possible way” clearly doesn’t apply.
The “worst possible” to me implies that it is possible.
Think about trying to solve the same problem by hurling engineers into an erupting volcano. It is expensive, hurts morale, causes staff retention issues, and also fundamentally does not solve the task of scheduled task running. I would not describe that as "worst possible" because it lacks the second factor by not being a possible solution.
I've been looking for software in that space: a thing that runs cron jobs, with the ability to kick off ad-hoc runs from some UI, see what is running, and, if possible, parameterise those ad-hoc runs (i.e., run a report for a different client than the usual cron does).
There's a lot to explore here. First, I'd suggest that cron and shell scripts may not be what you want. Cron has a complex format for scheduling and can lack features like sub-minute scheduling, maintenance windows, and other task cadences. Shell scripts are OK for small things, but I often find I move to a different language if I keep something around long enough. Do the people writing jobs know shell?
For jobs run by data analysts, airflow and python work great. For devops jobs, begrudgingly, Jenkins or GitHub Actions. But there's so many varieties.
That's what I did. I know it was the right solution because I happily keep adding more and more scripts to run. I didn't do that when I was using cron jobs.
Plus, Jenkins has a few nice extensions to the crontab, including setting a timezone and using "H" to spread job execution load.
o rly? what happens if they fails? can you control the backoff period? can you schedule another job in the event it fails? what if it fails before starting the function, so your in-code error handling isn't triggered? is there a "dead-letter" queue? will the next scheduled job run if the previous run failed? should it? can you define if it should or not? can you view history of executed jobs? what if your logging function that saves that history fails? can anyone add jobs or just you/superadmin? should they? and more and more.
And these are just ones I personally ran into using GCP schedulers, pub/sub and functions.
See, what you're doing is re-inventing the wheel.
No matter how cool your tool is - there's always work in the edge cases beyond just running the job.
There is nothing about your questions that wouldn't apply to any system.
Literally all of your questions are answered in the rather well written documentation. I could go through them and answer them for you, but I don't think you'd really appreciate that.
What does the workflow for job registration look like? Does an engineer upload a single script/binary to some storage service (e.g., EFS/S3/whatever), or build and upload a container with the script/binary, register the rate in the conductor UI, and then the job queue just pulls the script or container and runs it? Is registration through an IaC of sorts, or just through some UI/CLI?
I would be curious why kube cron jobs didn't seem to fit the bill. My favorite part of these posts is when they have a section hinting that they explored other options and picked specific tradeoffs.
Yeah, the article raises more questions than it answers.
> When designing this new, more reliable service, we decided to leverage many existing services to decrease the amount we had to build
This might explain building from scratch. Maybe the existing solutions had dependencies they didn't want to maintain and they opted for using the existing internal systems. It feels like that influenced all the rest.
I definitely don't come anywhere near Slack's scale but I've managed systems where over 3,000 cron jobs ran per day, half of which came from a cron job running every minute which usually finished in a few seconds. Some of these jobs run for X minutes too.
It's nice because there are properties you can configure for each cron job around retries and whether it should run uniquely or not. Maybe certain cron jobs should be retried if they fail; for others, maybe it's OK to be picked up on the next interval.
Overall it's been super stable for almost 2 years which is when I started using them. Only a handful of jobs failed over this period of time and they weren't the result of Kubernetes, it was because the HTTP endpoint that was being hit from the cron job failed to respond and the cron job failure threshold was reached.
It's a good reminder that important jobs run on a schedule should be resilient to failure (saving progress, idempotent, etc.).
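A tiny illustration of what "saving progress, idempotent" can look like in practice (SQLite and the item IDs are just stand-ins): mark each unit of work as done as you go, so a retry or an accidental double trigger skips anything already processed.

    import sqlite3

    def process(item_id: str) -> None:
        print("processing", item_id)  # the real work goes here

    def run_job(item_ids) -> None:
        db = sqlite3.connect("progress.db")
        db.execute("CREATE TABLE IF NOT EXISTS done (item_id TEXT PRIMARY KEY)")
        for item_id in item_ids:
            try:
                # Primary key violation means a previous run already handled this item.
                db.execute("INSERT INTO done VALUES (?)", (item_id,))
            except sqlite3.IntegrityError:
                continue
            process(item_id)
            db.commit()  # persist progress per item, not once at the end

    run_job(["a", "b", "c"])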
Do you capture all of your job code in a single image and reference execution paths on container startup per job? Or, are you building an image per job?
The jobs all run curl commands to a specific API endpoint with a specific bearer token. Those tokens are loaded through an env through SealedSecrets.
They all use the public curl image where I override the command in the Kubernetes cron job definition. The job container itself starts almost instantly since there's no app to boot.
If I had a case you're describing I would use the main app's image and run a specific command, in this case I'm assuming if there's not an API endpoint it would be some callable script that lives in your app's code / image.
Spinning up a new Kubernetes pod for every single job run is a very expensive and wasteful operation, starting at least in the order of seconds (usually more) vs just milliseconds for a new process in an already hot environment.
Sure, but if you need that thing to run every hour for a few seconds, then seconds aren’t really the limiting factor. I don’t doubt that the resource management side of k8s would make it dicey at a certain volume of these things running, though, especially if they eat a lot of compute.
I'm still over here loving cron, not at this large of a scale for sure but I love getting all these tools to email me key information, in plain text, at the right times.
I feel like a more effective way to use cron is just to dispatch jobs into a queue that will perform the actual processing. And not to do the processing within the cron scripts themselves. That way the load on the cron is light and the heavy lifting is done by your queue/worker system.
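A minimal version of that split, assuming Redis as the queue (the queue name and payload are invented): the script cron invokes only enqueues a message, and a long-running worker pool does the actual processing.

    import json
    import time

    import redis  # assumes a Redis instance; any broker works the same way

    # enqueue.py -- what the cron entry runs; finishes in milliseconds
    def enqueue() -> None:
        r = redis.Redis()
        r.rpush("jobs", json.dumps({"job": "send_digest", "queued_at": time.time()}))

    # worker.py -- long-running process doing the heavy lifting
    def work_forever() -> None:
        r = redis.Redis()
        while True:
            _, raw = r.blpop("jobs")  # blocks until a job arrives
            job = json.loads(raw)
            print("working on", job["job"])  # real processing goes here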
A lifetime ago I scaled up cron jobs for a client with Gearman. Using cron to trigger jobs on the Gearman server and the pool of runners to do all the work. This proved to be so reliable they still use the system today, over 10 years later.
> To make this transition between pods seamless, we implemented logic to prevent the node from going down at the top of a minute when possible since — given the nature of cron — that is when it is likely that scripts will need to be scheduled to run.
Cron jobs with precisely specified times, where everyone just uses whole minutes/hours, are not great practice. It's very unlikely you actually need such precision in a cron job, and you get spiky load.
Usual approach is to set the minute to a hash of the cron config name or something, modulo 60. Hourly jobs still run hourly, but each one on random minute.
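Concretely, something like this (the job name is arbitrary): derive a stable minute from a hash of the job's name, so hourly jobs still fire hourly but the fleet isn't synchronised on :00.

    import hashlib

    def minute_for(job_name: str) -> int:
        # Stable pseudo-random minute in [0, 59], derived from the job name.
        digest = hashlib.sha256(job_name.encode()).digest()
        return int.from_bytes(digest[:4], "big") % 60

    # e.g. emit the crontab line for an hourly job
    print(f"{minute_for('nightly-report')} * * * * /usr/local/bin/nightly-report")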
(Setting aside how fragile that setup of avoiding pod downtime sounds)
> we implemented logic to prevent the node from going down at the top of a minute when possible since — given the nature of cron — that is when it is likely that scripts will need to be scheduled to run
Why not smear the start time of the jobs across seconds of that minute to avoid any thundering herd problems? How much functionality relies on a script being invoked at exactly the :00 mark? And if the functionality depends on that exact timing, doesn’t it suggest something is fragile and could be redesigned to be more resilient?
At their scale, staggering script start times over a 60 second window likely wouldn’t have much of an impact if they are experiencing a thundering herd, imo. If it did help, it would be a bandaid and ticking time bomb before someone has to actually solve the load problem that staggering start times kicked down the road
As in, if you have 500 cron scripts and you think you're reaching the capacity of that box, just distribute the 500 scripts from one crontab file to two boxes with 250 each?
If one cares more about reliability, you can keep tabs on the cron scripts starting at their scheduled times, and if they don't, bring the box down and start a new box with the same crontab?
this is a pretty simple cron system. curious if the authors investigated temporal and other similar workflow engines for the advanced cron feature set (https://docs.temporal.io/workflows#spec disclaimer: i used to work there)
There are certainly use cases for which it's more than is required. Like the most simple would be adding a cron string to a GitHub Action or Vercel function, but in most cases, and certainly Slack's case, you want more reliability, scalability, flexibility, and/or observability. And Temporal has extreme levels of those things, including pausing & editing schedules and seeing the current status of all triggered scripts/functions and all the steps a function has taken so far, and guaranteeing the triggered function completes, including ensuring that if the process or container dies, the function continues running on another one. Even if you don't care about all those things, you might care about some of them in the future, and it doesn't hurt to run a system that has capabilities you don't use.
Depends on how many cron jobs you have and what you need out of it?
Operating Temporal is not that hard -- you can start with `temporal --dev` on your own box. I have a "Nomad-Temporal" Terraform Module to stand one up on Nomad. [1] Temporal has Helm Charts for Kubernetes [2]. There is also Temporal Cloud [3].
That said, there is currently a chasm between "script in cronjob" to "scheduled task in Temporal". The focus of Temporal is more "Enterprise, get your Business Processes on Temporal", not "soloist, ditch your cron".
There's certainly space for somebody to make a DAG dataflow thing or lower-code product over Temporal. Airplane.dev [4] was built on Temporal and was approaching this; it was acquired by Airtable.
I recently evaluated Dagster, Prefect, and Flyte for a data pipeliney workflow and ended up going with Temporal.
The shared feature between Temporal and those three is the workflow orchestration piece. All 3 can manage a dependency graph of jobs, handle retries, start from checkpoints, etc.
At a high level the big reason they’re different is Temporal is entirely focused on the orchestration piece, and the others are much more focused on the data piece, which comes out in a lot of the different features. Temporal has SDKs in most languages, and has a queuing system that allows you to run different workflows or even activities (tasks within a workflow) in different workers, manage concurrency, etc. You can write a parent workflow that orchestrates sub-workflows that could live in 5 other services. It’s just really composable and fits much more nicely into the critical path of your app.
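To give a feel for the shape of it, here's a minimal sketch with the Temporal Python SDK (the workflow, activity, and timeout are invented for illustration); the activity can be executed by a worker in a completely different service, which is the composability described above.

    from datetime import timedelta

    from temporalio import activity, workflow

    @activity.defn
    async def send_digest(user_id: str) -> None:
        ...  # heavy lifting; runs on whichever worker polls this activity's task queue

    @workflow.defn
    class DigestWorkflow:
        @workflow.run
        async def run(self, user_id: str) -> None:
            # Temporal retries the activity and resumes the workflow if a worker dies.
            await workflow.execute_activity(
                send_digest,
                user_id,
                start_to_close_timeout=timedelta(minutes=5),
            )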
Prefect is probably the closest of your list to Temporal, in that it's less opinionated than the others about the workflows being "data oriented", but it's still Python-only and it doesn't have queueing. In short, this means your workflows are kinda supposed to run in one box running Python somewhere. Temporal will let you define a 10-part workflow where two parts run on a Python service with a GPU, and the remaining parts run in the same node.js process as your main server.
Dagster’s feature set is even more focused on data-workflows, as your workflows are meant to produce data “assets” which can be materialized/cached, etc.
They’re pretty much all designed for a data engineering team to manage many individual pipelines that are external from your application code, whereas temporal is designed to be a system that manages workflow complexity for code that (more often) runs in your application.
They definitely are similar and can be used for similar functions but Cadence/Temporal are focused on code orchestration side rather than data orchestration.
I find that comparison interesting, because I don't think of Airflow as particularly data-oriented in terms of its features and core functionality. I tend to think of Airflow as "Cron + Make", with any "data-oriented" features being nice to have, but not essential.
I'm substantially less familiar with Dagster and Prefect so can't comment as much on those.
Maybe the most data-oriented thing about Airflow is its concept of a data interval, where each DAG run is associated with some "logical date" and an interval of time that starts from the logical date (inclusive) and ends at the next logical date in the schedule (exclusive). The idea is that if you have a daily task that runs at 1 AM, then the task is expected to operate on data starting from "yesterday at 1 AM" until "today at 1 AM". But it's entirely up to the user/developer what you actually do with those logical date ranges, and you're free to ignore them entirely if you don't need them.
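Roughly what that looks like in a recent Airflow TaskFlow DAG (the DAG, task, and query are made up); the interval bounds arrive via the task context, and you're free to ignore them:

    import pendulum
    from airflow.decorators import dag, task
    from airflow.operators.python import get_current_context

    @dag(schedule="0 1 * * *",  # daily at 1 AM
         start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
         catchup=False)
    def daily_rollup():
        @task
        def rollup():
            ctx = get_current_context()
            start, end = ctx["data_interval_start"], ctx["data_interval_end"]
            # e.g. SELECT ... WHERE created_at >= start AND created_at < end
            print(f"rolling up rows from {start} to {end}")

        rollup()

    daily_rollup()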
I’ve only recently encountered airflow for the first time, and have been surprised at how half-baked it is for being damn near the industry standard (as far as open source, anyway). And it was a lot worse until recently!
Dynamic task dispatch being a relatively recent feature. The fundamental design imposing lots of structure (well, kind of—you can skip lots of it, but it takes time to figure that out) to practically no benefit (and god, is the terminology dumb, made all the more so because half the stuff it names is nearly useless). “Oh yeah the scheduler just crashes or locks up while still health-checking all the time, standard practice to so restart it frequently” posted on a hundred different issues dating from yesterday to years ago (many fixed! And yet…). It’s pretty bad at passing data between tasks (see again: lots of structure, little benefit)
Interesting to see different teams' approaches to the same issue. Often when reading this sort of thing I'm in PR mode and question why they did certain things the way they did. Then I realise that has probably been asked and answered a dozen times before it got released.
There are numerous mature, battle-tested open source solutions for distributed and/or centrally managed job queues, which really makes me wonder how they justified building something from scratch.
I think there's a bit of "because they could", but also something that gets very little consideration in many contexts unless you have experienced the contrary: integration is costly, and integrating properly is sometimes more work than doing something from "scratch". So you don't do it properly, and then you have a mess that hurts you in the long run.
I'm sure it's indeed something like that. I think it also comes down to, at least partly, having a culture that is more about building components than systems. I suspect it could also be the "buzz" factor. The press release about building a new system always seems more exciting than one about solving a familiar problem with boring old existing software.
We have a Slack reminder that executes on the hour. I've noticed that it takes about 30 seconds to a minute past the hour for the reminder to actually fire.
jitter on top of the hour for massively scaled notifications is typical.
OF COURSE everyone wants reminders at the even points, but what, do you scale up your system for every half-hour/hourly peak? Or just put in the TOS that the activation will jitter by a bit.
This is unhelpful without a more explicit and detailed account of what is deficient in their approach. It's the internet, so I'm not holding my breath, but FYI.