There are a number of concepts worth learning at a high level if you want to build large-scale projects. Most modern/large companies use some or all of the following to build their backends:
- Load balancers
- Web servers
- Caches (eg. Redis, memcached)
- Databases (relational, non-relational, document)
- Search datastores (eg. Elasticsearch, Solr)
- Log/event/message processors (eg. Kafka)
- Task queues/task processing libraries
- Periodic jobs (eg. cron)
If you dig into any of these there's a ton to learn, especially around looking into the underlying technologies used to build these higher-level systems.
There are also more conceptual things that are part of building/maintaining backend systems. These are a bit fuzzier, but I would say they're just as important as the specific technologies used:
- Reliability
- Monitoring
- Observability
- Error/failure handling
- Migration strategies
- Data normalization/denormalization
- Horizontal vs. vertical scalability
This is by no means a complete list, but these terms are enough to get you in the right ballpark of ideas and start learning. I think highscalability.com is a great place to read about how other companies have built backend systems to solve specific problems. They have a massive list of quality articles written about various backend systems at scale.
Cron jobs are the definition of the anti-pattern of treating servers like pets and not cattle. You have to worry about that one non-redundant server running your cron jobs. There are other ways to skin the cat, but my favorite is HashiCorp's Nomad. I like to call it "distributed cron". Together with Consul for configuration, it's dead simple to schedule jobs across app servers - the jobs can be executables, Docker containers, shell scripts, anything.
These days I think of "cron" as more of a type than an actual implementation. When somebody says, "I need a cron job," my answer would never be, "ssh into www1 and add it to the crontab." It would be, "create a CronJob in Kubernetes."
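To make the "CronJob as a type" idea concrete, here's a sketch of a Kubernetes CronJob declaration expressed as the Python dict you'd serialize to YAML or hand to a client library. The name, image, and schedule below are placeholders, not anything from this thread:

```python
# Sketch of a Kubernetes CronJob manifest (batch/v1 API) as a Python dict.
# The job name, container image, and schedule are hypothetical examples.
cronjob = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "nightly-cleanup"},
    "spec": {
        "schedule": "0 3 * * *",  # standard five-field cron syntax
        "jobTemplate": {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "cleanup",
                            "image": "example/cleanup:latest",
                        }],
                        "restartPolicy": "OnFailure",
                    }
                }
            }
        },
    },
}
```

The point is that the schedule is part of the cluster's declared state, not a line in some particular host's crontab - the scheduler decides where each run actually executes.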
But I agree and concede: a user with zero back-end experience will just google "cron", which will take them to a crontab example, so they will likely be misled into the anti-pattern, as you said.
I agree that is definitely the way to go at some scale. :) I really just put cron on there as an example of how someone might think about scheduled jobs; most of the more advanced things are conceptually similar to cron, except you don't have to worry about where your job actually runs or how the environment was set up.
I think it is worth noting for any of the systems above that there's a spectrum of possibilities around how much you automate/offload the management of them, as well as plenty of backend systems for managing those.
That is only true if your jobs aren't stateless. I use cron all the time on cattle VMs. No reason to cargo-cult extra stuff into the mix.
I think the parent poster was referring to the fact that Cron jobs can act purely on local state, and that's OK.
For example, if I had a traditionally deployed (i.e. not in K8S / a PaaS / similar) backend app that accepted file uploads, then passed those off to something else, I'd be streaming the uploads to a temporary holding directory on disk. I'd then have a cron job that clears stale items from the temp dir. If the server fails, it's fine that the cron job didn't run.
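A minimal sketch of the kind of cleanup job described above - the directory name and age threshold are made up for illustration:

```python
import os
import tempfile
import time

# Hypothetical holding directory for streamed uploads; adjust to your app.
UPLOAD_TMP_DIR = os.path.join(tempfile.gettempdir(), "upload-staging")
MAX_AGE_SECONDS = 24 * 60 * 60  # anything older than a day counts as stale


def clear_stale_uploads(tmp_dir: str = UPLOAD_TMP_DIR,
                        max_age: int = MAX_AGE_SECONDS) -> int:
    """Delete files older than max_age; return how many were removed."""
    removed = 0
    now = time.time()
    if not os.path.isdir(tmp_dir):
        return 0  # nothing to do if the server never created the dir
    for name in os.listdir(tmp_dir):
        path = os.path.join(tmp_dir, name)
        try:
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
                os.remove(path)
                removed += 1
        except OSError:
            pass  # file vanished or is busy; it's only local state, no drama
    return removed


if __name__ == "__main__":
    print(f"removed {clear_stale_uploads()} stale upload(s)")
```

The job acts purely on local state, so a missed run on a dead server costs nothing - exactly the case where plain cron is fine.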
There are still plenty of use cases for traditional cron.
In the case of Nomad, you run it as a cluster of three. When I was using Nomad, if one of my app servers went down in the middle of the night, the next day I might notice degraded performance, but everything still ran.
The server going down was never really the issue, though, honestly. The issue was usually a process taking more CPU/memory than expected; in that case, Nomad could intelligently schedule jobs based on available resources across the fleet of app servers.
These days with AWS, I don’t use Nomad, I just use CloudWatch and for the processes that aren’t Lambda based, I use autoscaling groups with the appropriate metrics for scaling in and out.
That also means if a server goes wonky, I can just take it out of the group for troubleshooting later and another instance will automatically be launched.
Yes, jobs that run unattended via cron should have logging and watchdog processes that check for completion and possibly do reconciliation. Those can alert if something goes wrong, and the logs help with diagnosis.
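A minimal sketch of that watchdog idea - wrap the scheduled command, log the outcome, and fire an alert on failure. The `alert` function here is a stub standing in for whatever paging/notification integration you actually use:

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("cron-watchdog")


def alert(message: str) -> None:
    # Placeholder: replace with your email/pager/chat integration.
    print(f"ALERT: {message}", file=sys.stderr)


def run_with_watchdog(cmd: list, timeout: int = 3600) -> bool:
    """Run a scheduled command, log completion or failure, and return
    success/failure so a caller can decide whether to reconcile."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        log.error("job %s timed out after %ss", cmd, timeout)
        alert(f"cron job {cmd!r} timed out")
        return False
    if result.returncode != 0:
        log.error("job %s failed (rc=%s): %s",
                  cmd, result.returncode, result.stderr.strip())
        alert(f"cron job {cmd!r} failed")
        return False
    log.info("job %s completed OK", cmd)
    return True
```

Even plain cron plus a wrapper like this gets you completion checks and diagnosable logs.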
> Cron jobs are the definition of the anti pattern of treating servers like pets and not cattle
What's old is new again. Much of clean distributed systems development is now built on what are, essentially, scheduled periodic operations. They're pretty much the least complicated way to loosely couple domain logic in distributed systems that follow eventually-consistent semantics. They're also a good model for the functionality of many distributed scheduling systems like Kubernetes, AWS' ECS Scheduled Tasks, and more.
Kubernetes even goes so far as to have Jobs (batch operations) and CronJobs (a scheduler that creates Jobs).
I use AWS’s native services most of the time like CloudWatch to schedule jobs, lambdas, and step functions these days, but to keep the post generic, I mentioned Nomad/Consul that I have used in the past for an on prem implementation that kept us from having to bring Docker into the mix. Since we were already using Consul, using Nomad just made sense.
That’s not cron’s job. Different jobs have different requirements and tradeoffs. Ownership of retry logic and distribution design belongs with whatever cron kicked off.
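One way to keep that ownership where it belongs is for the job itself to carry its retry policy - a sketch, assuming exponential backoff with jitter is the behavior you want (tune attempts and delays per job):

```python
import random
import time
from functools import wraps


def with_retries(attempts: int = 3, base_delay: float = 1.0):
    """Decorator sketch: retry with exponential backoff and jitter.
    The retry policy lives with the job, not with the scheduler that
    kicked it off."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of retries: fail loudly
                    # back off 1s, 2s, 4s, ... plus a little jitter
                    time.sleep(base_delay * 2 ** (attempt - 1)
                               + random.uniform(0, 0.1))
        return wrapper
    return decorator
```

Cron (or its distributed equivalent) stays a dumb trigger; whether and how to retry is the job's own decision.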
Learning the cron format is still necessary / useful - most distributed deployment systems have some form of scheduled job runner. Even on a purely serverless model like AWS Lambda, it's still possible to do distributed crons. I actually think the word 'cron' has already been repurposed: when my team uses it, we're referring to the distributed version, not the per-server version.
Yes: reliability, monitoring, and error handling were the types of things I’m looking for more information on. Do you have any recommendations for more information on these topics? I should have clarified that my question was geared towards important concepts agnostic of languages/frameworks/etc. This is a great list of further reading, thank you.
Also, what does observability mean in this context?
"Also what does observability mean is this context?"
Something went wrong, and now your site is serving 500 server errors to everybody at the rate of 25,000 per minute. The ops team already tried "just reboot it" and it didn't help. How are you going to figure out what is going on and fix it?
It's (mostly) too late to add anything, so all you've got is the logs you already had, the metrics you already had, etc. That's the "observable" stuff in a system. There's an art to recording what you need to know without recording so much that you can't find what you need in the mess.
(The "mostly" is that if you have a good enough setup, you might be able to bring up a new system and route some very small fraction of traffic to it to examine it more intensely in real-time with a debugger or something, though in my experience, on those occasions I've had the opportunity to try this, it's never been a problem that would manifest on a new system receiving a vanishing fraction of a percent of the scale of a production box. But maybe you'll get lucky.)
You certainly want to do everything you can to not be in that mess in the first place, but it won't be enough. You need a system sufficiently observable that you can find the problem and find some sort of solution.
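To make "recording what you need to know" concrete, here's a minimal structured-logging sketch: emit one JSON object per log line with queryable fields instead of free-form prose. The field names (`request_id`, `status`, `latency_ms`) are made-up examples:

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so fields can be queried and
    aggregated later, instead of grepping free-form text."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Carry along any structured context attached via `extra=`.
        for key in ("request_id", "status", "latency_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Hypothetical request log line: fields, not prose, so you can compute
# error rates and latency percentiles while debugging 500s at 3 a.m.
log.info("request handled",
         extra={"request_id": "req-123", "status": 500, "latency_ms": 87})
```

When the 500s start, "show me error rate by endpoint over the last hour" is a query over fields like these rather than an archaeology project over text logs.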
Oh thank you, I didn't know that was referred to as "observability" I thought it was just logging. This article from Etsy's engineering blog [1] was part of the inspiration for this question. Funnily enough when I googled "Etsy engineering logging" the 5th result was for a position on Etsy's observability team.
When it comes to "measure everything" I've found services that have clients that already grok popular frameworks to be a godsend. We use NewRelic and it's abilty to automatically insturment all rest apis and db transactions is delightful. I could not imagine going back to having to do it manually or guess what information might be useful later.
I haven't read it personally, but I've heard good things from others, and looking it over briefly, the advice lines up with what I've experienced in practice.
For some of these concepts, take a look at what Envoy + Istio, Linkerd, and other service meshes are trying to solve and conceptualize: load balancing, auth(n/z), monitoring, logging, etc.