
There are a number of concepts that are worth learning about at a high level if you want to learn how to build large-scale projects. Most modern/large companies use some or all of the following to build their backends:

- Load balancers

- Web servers

- Caches (eg. Redis, memcached)

- Databases (relational, non-relational, document)

- Search datastores (eg. Elasticsearch, Solr)

- Log/event/message processors (eg. Kafka)

- Task queues/task processing libraries

- Periodic jobs (eg. cron)

If you dig into any of these there's a ton to learn, especially around looking into the underlying technologies used to build these higher-level systems.

There are also more conceptual things that are part of building/maintaining backend systems. These are a bit fuzzier, but I would say they are just as important as the specific technologies used:

- Reliability

- Monitoring

- Observability

- Error/failure handling

- Migration strategies

- Data normalization/denormalization

- Horizontal vs. vertical scalability

This is by no means a complete list, but these terms are enough to get you in the right ballpark of ideas and start learning. I think highscalability.com is a great place to read about how other companies have built backend systems to solve specific problems. They have a massive list of quality articles written about various backend systems at scale.




I agree with most of that list - except for cron.

Cron jobs are the definition of the anti-pattern of treating servers like pets and not cattle. You have to worry about that one non-redundant server running your cron jobs. There are other ways to skin the cat, but my favorite is HashiCorp’s Nomad. I like to call it “distributed cron”. Together with Consul for configuration it’s dead simple to schedule jobs across app servers - the jobs can be executables, Docker containers, shell scripts, anything.


These days I think of "cron" as more of a type than an actual implementation. When somebody says, "I need a cron job," my answer would never be, "ssh into www1 and add it to the crontab." It would be, "create a CronJob type in Kubernetes."
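
For concreteness, here's roughly what that looks like with the official `kubernetes` Python client - just a sketch, assuming a cluster recent enough that CronJob lives in batch/v1 (the name, image, and schedule are made up):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# A CronJob that runs a hypothetical cleanup image every night at 03:00.
cron_job = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="nightly-cleanup"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",  # standard cron syntax
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="OnFailure",
                        containers=[client.V1Container(
                            name="cleanup",
                            image="example.com/cleanup:latest",  # placeholder image
                            command=["python", "cleanup.py"],
                        )],
                    )
                )
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)
```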

But I agree and concede: a user with zero back-end experience will just google "cron", which will take them to a crontab example, so they will likely be misled into the anti-pattern, as you said.


I agree that is definitely the way to go at some scale. :) I really just put cron on there as an example of how someone might think about scheduled jobs - most of the more advanced options are conceptually similar to cron, except that you don't have to worry about where your job is actually running or how the environment was set up.

I think it is worth noting for any of the systems above that there's a spectrum of possibilities around how much you automate/offload the management of them, as well as plenty of backend systems for managing those.


That is only true if your jobs aren't doing a stateless operation. I use cron all the time on cattle VMs. No reason to cargo cult extra stuff into the mix.


Until that one server goes down in the middle of the night or it has issues.

Or the server doesn’t go down, but your cron job fails. Do you then implement retry logic in every job, since cron can’t do retries automatically?


I think the parent poster was referring to the fact that Cron jobs can act purely on local state, and that's OK.

For example, if I had a traditionally deployed (i.e. not in K8S / a PaaS / similar) backend app that accepted file uploads, then passed those off to something else, I'd be streaming the uploads to a temporary holding directory on disk. I'd then have a cron job that clears stale items from the temp dir. If the server fails, it's OK that the cron job didn't run.
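
That cron job can be as dumb as something like this - the path and TTL here are just illustrative:

```python
#!/usr/bin/env python3
"""Cron job: delete upload temp files older than 24 hours. Purely local state."""
import os
import time

UPLOAD_TMP_DIR = "/var/tmp/app-uploads"  # hypothetical holding directory
MAX_AGE_SECONDS = 24 * 60 * 60

now = time.time()
for name in os.listdir(UPLOAD_TMP_DIR):
    path = os.path.join(UPLOAD_TMP_DIR, name)
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)
```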

There are still plenty of use cases for traditional cron.


>Until that one server goes down in the middle of the night or it has issues.

The extra complexity of any workflow manager would make it even more likely to "go down in the middle of the night or have issues".


In the case of Nomad, you run it as a cluster of three. But if one of my app servers went down in the middle of the night when I was using Nomad, the next day I might notice a degradation of performance, but everything still ran.

The server going down was never really the issue, though, honestly. The issue was usually a process taking more CPU/memory than expected; in that case Nomad could intelligently schedule jobs based on available resources across the fleet of app servers.

These days with AWS, I don’t use Nomad, I just use CloudWatch and for the processes that aren’t Lambda based, I use autoscaling groups with the appropriate metrics for scaling in and out.
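
The scheduled-job piece is pretty small - roughly something like this with boto3, where the rule name and Lambda ARN are placeholders (the Lambda also needs a permission allowing events.amazonaws.com to invoke it):

```python
import boto3

events = boto3.client("events")  # CloudWatch Events / EventBridge

# Run a Lambda every 5 minutes.
events.put_rule(Name="my-periodic-job", ScheduleExpression="rate(5 minutes)")
events.put_targets(
    Rule="my-periodic-job",
    Targets=[{
        "Id": "1",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-job",  # placeholder ARN
    }],
)
```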

That also means if a server goes wonky, I can just take it out of the group for troubleshooting later and another instance will automatically be launched.


Yes, jobs that run unattended via cron should have logging and watchdog processes that check for completion and possibly do reconciliation - those can alert if something goes wrong, and the logs help diagnosis.
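
Something in this direction is usually enough - a rough sketch, where the script name, log path, and alert hook are all placeholders:

```python
#!/usr/bin/env python3
"""Wrapper that cron invokes: logs the run and alerts if the real job fails."""
import logging
import subprocess
import sys

logging.basicConfig(filename="/var/log/nightly_job.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def alert(message):
    # Placeholder: page/email/Slack however your team does alerting.
    logging.error("ALERT: %s", message)

logging.info("job starting")
result = subprocess.run(["python", "do_the_real_work.py"])
if result.returncode != 0:
    alert(f"nightly job failed with exit code {result.returncode}")
    sys.exit(result.returncode)
logging.info("job completed successfully")
```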


Now you’re reinventing Nomad....


> Cron jobs are the definition of the anti pattern of treating servers like pets and not cattle

What's old is new again. Much of clean distributed systems development is now built on what are, essentially, scheduled periodic operations. They're pretty much the least complicated way to loosely couple domain logic in distributed systems that follow eventually-consistent semantics. They're also a good model for the functionality of many distributed scheduling systems like Kubernetes, AWS' ECS Scheduled Tasks, and more.

Kubernetes even goes so far as to have Jobs (batch operations) and CronJobs (a scheduler that creates Jobs).


These days I mostly use AWS’s native services like CloudWatch to schedule jobs, Lambdas, and Step Functions, but to keep the post generic I mentioned Nomad/Consul, which I have used in the past for an on-prem implementation that kept us from having to bring Docker into the mix. Since we were already using Consul, using Nomad just made sense.


Depends on the use case / workload.


>Cron jobs are the definition of the anti pattern of treating servers like pets and not cattle.

Cron jobs are the definition of "if it works don't fix it" and YAGNI.


But cron jobs don’t work when you need server redundancy or retry logic.


That’s not cron’s job. Different jobs have different requirements and tradeoffs. Ownership of retry logic and distribution design belongs with whatever cron kicked off.
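
If a particular job does need retries, a few lines inside the job itself (or a thin wrapper around it) cover it - this is just a sketch, and the backoff numbers are made up:

```python
import time

def run_with_retries(job, attempts=3, base_delay=30):
    """Run `job`, retrying with simple exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # let cron's MAILTO / your monitoring see the failure
            time.sleep(base_delay * 2 ** (attempt - 1))
```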


Yes, but it's not like the original claim said - that it's somehow totally "deprecated" today and people should skip it.


Yeah because for me, anything that doesn’t have redundancy should be skipped....

Unless the cron job is only doing some type of maintenance that only affects the local redundant server....


That's a common use case, log rotation, AV scan, etc. It's just a concept to be familiar with.


It's just as important to understand the history/genesis of cron as it is to know when to not use it.


Learning the cron format is still necessary/useful - most distributed deployment systems have some form of scheduled job runner. Even on a purely serverless model like AWS Lambda, it’s still possible to do distributed crons. I think the word ‘cron’ has already been repurposed - when using it, my team actually refers to the distributed version, not the per-server version.
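
The five-field format (minute, hour, day-of-month, month, day-of-week) is the part worth internalizing. If you want to sanity-check an expression, something like the `croniter` library works - a small sketch, assuming it's installed:

```python
from datetime import datetime
from croniter import croniter

# "At minute 30 past every 4th hour" - the same syntax whether it ends up in a
# crontab or a Kubernetes CronJob.
schedule = croniter("30 */4 * * *", datetime(2020, 1, 1))
for _ in range(3):
    print(schedule.get_next(datetime))
```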


Still good to know for legacy fixes.


Hey - why not just systemctl? It’s quite robust and reliable.


That doesn't work when you cannot access the machine where the cron is supposed to run, for security reasons or whatever.

There, crontab is not an anti-pattern at all.


What does not having access to the server have to do with using a fault-tolerant solution?


A good portion of this falls under the purview of "data engineering" which is another conceptual layer to think about / research.


Yes: reliability, monitoring, and error handling are the types of things I’m looking for more information on. Do you have any recommendations for more information on these topics? I should have clarified that my question was geared towards important concepts agnostic of languages/frameworks/etc. This is a great list of further reading, thank you.

Also, what does observability mean in this context?


"Also what does observability mean is this context?"

Something went wrong, and now your site is serving 500 server errors to everybody at the rate of 25,000 per minute. The ops team already tried "just reboot it" and it didn't help. How are you going to figure out what is going on and fix it?

It's (mostly) too late to add anything, so all you've got is the logs you already had, the metrics you already had, etc. That's the "observable" stuff in a system. There's an art to recording what it is you need to know without recording so much that you can't find what you need in the mess.

(The "mostly" is that if you have a good enough setup, you might be able to bring up a new system and route some very small fraction of traffic to it to examine it more intensely in real-time with a debugger or something, though in my experience, on those occasions I've had the opportunity to try this, it's never been a problem that would manifest on a new system receiving a vanishing fraction of a percent of the scale of a production box. But maybe you'll get lucky.)

You certainly want to do everything you can to not be in that mess in the first place, but it won't be enough. You need a system sufficiently observable that you can find the problem and find some sort of solution.
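
To make that slightly more concrete, a lot of the "art" is just attaching context to what you record. A rough sketch, where the field names are purely illustrative:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api")

def log_request(request_id, route, status, duration_ms, **context):
    """Emit one structured line per request so you can slice by route/status later."""
    log.info(json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        **context,
    }))

# When the 500s start, you can grep/aggregate by route, upstream, request_id, ...
log_request("d3adb33f", "/checkout", 500, 1240, upstream="payments", error="timeout")
```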


Oh, thank you, I didn't know that was referred to as "observability"; I thought it was just logging. This article from Etsy's engineering blog [1] was part of the inspiration for this question. Funnily enough, when I googled "Etsy engineering logging" the 5th result was for a position on Etsy's observability team.

[1] https://codeascraft.com/2011/02/15/measure-anything-measure-...


I think of observability as a triad:

- logging (eg. Splunk, Sumo Logic, LogDNA)

- metrics (eg. Prometheus, Datadog, Grafana)

- tracing (eg. Lightstep, New Relic, Zipkin)

As mentioned above, observability is the data collected about a system.
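
As a tiny example of the metrics leg, here's a sketch using the `prometheus_client` library - the metric names and route are made up:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics; Prometheus scrapes them from http://host:8000/metrics.
REQUESTS = Counter("app_requests_total", "Requests handled", ["route", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["route"])

def handle_checkout():
    with LATENCY.labels(route="/checkout").time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(route="/checkout", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # in a real service this runs for the life of the process
    for _ in range(1000):
        handle_checkout()
```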


When it comes to "measure everything" I've found services with clients that already grok popular frameworks to be a godsend. We use New Relic, and its ability to automatically instrument all REST APIs and DB transactions is delightful. I could not imagine going back to having to do it manually or guess what information might be useful later.


You might want to look into honeycomb.io and follow Charity Majors on Twitter. Heck, just follow Charity anyway - she's a genius.


jerf answered observability well in another reply to this comment.

As for reliability, monitoring, and error handling I've heard good things about the Google SRE book: https://landing.google.com/sre/books/

I haven't read it personally, but I've heard good things from others and looking over it briefly the advice there lines up with what I've experienced in practice.


For some of these concepts - take a look at what Envoy + Istio, Linkerd (and other service meshes) are trying to solve and conceptualize: load balancing, auth(n/z), monitoring, logging, etc.



