
There are a number of concepts that are worth learning about at a high level if you want to learn how to build large-scale projects. Most modern/large companies use some or all of the following to build their backends:

- Load balancers

- Web servers

- Caches (eg. Redis, memcached)

- Databases (relational, non-relational, document)

- Search datastores (eg. Elasticsearch, Solr)

- Log/event/message processors (eg. Kafka)

- Task queues/task processing libraries

- Periodic jobs (eg. cron)

If you dig into any of these there's a ton to learn, especially around looking into the underlying technologies used to build these higher-level systems.

There are also more conceptual things that are part of building/maintaining backend systems. These are a bit fuzzier, but I would say they are just as important as the specific technologies used:

- Reliability

- Monitoring

- Observability

- Error/failure handling

- Migration strategies

- Data normalization/denormalization

- Horizontal vs. vertical scalability

This is by no means a complete list, but these terms are enough to get you in the right ballpark of ideas and start learning. I think highscalability.com is a great place to read about how other companies have built backend systems to solve specific problems. They have a massive list of quality articles written about various backend systems at scale.




I agree with most of that list - except for cron.

Cron jobs are the definition of the anti-pattern of treating servers like pets and not cattle. You have to worry about that one non-redundant server running your cron jobs. There are other ways to skin the cat, but my favorite is HashiCorp’s Nomad. I like to call it “distributed cron”. Together with Consul for configuration it’s dead simple to schedule jobs across app servers - the jobs can be executables, Docker containers, shell scripts, anything.


These days I think of "cron" as more of a type than an actual implementation. When somebody says, "I need a cron job," my answer would never be, "ssh into www1 and add it to the crontab." It would be, "create a CronJob type in Kubernetes."
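
For concreteness, here's roughly what that looks like with the official `kubernetes` Python client - just a sketch, assuming a cluster recent enough that CronJob lives in batch/v1 (the name, image, and schedule are made up):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# A CronJob that runs a hypothetical cleanup image every night at 03:00.
cron_job = client.V1CronJob(
    metadata=client.V1ObjectMeta(name="nightly-cleanup"),
    spec=client.V1CronJobSpec(
        schedule="0 3 * * *",  # standard cron syntax
        job_template=client.V1JobTemplateSpec(
            spec=client.V1JobSpec(
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="OnFailure",
                        containers=[client.V1Container(
                            name="cleanup",
                            image="example.com/cleanup:latest",  # placeholder image
                            command=["python", "cleanup.py"],
                        )],
                    )
                )
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)
```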

But I agree and concede: a user with zero back-end experience will just google "cron", which will take them to a crontab example, so they will likely be misled into the anti-pattern, as you said.


I agree that is definitely the way to go at some scale. :) I really just put cron on there as an example of how someone might think about scheduled jobs - most of the more advanced options are conceptually similar to cron, except that you don't have to worry about where your job is actually running or how the environment was set up.

I think it is worth noting for any of the systems above that there's a spectrum of possibilities around how much you automate/offload the management of them, as well as plenty of backend systems for managing those.


That is only true if your jobs aren't doing a stateless operation. I use cron all the time on cattle VMs. No reason to cargo cult extra stuff into the mix.


Until that one server goes down in the middle of the night or it has issues.

Or the server doesn’t go down, but your cron job fails. Do you then implement retry logic in every job, since cron can’t do retries automatically?


I think the parent poster was referring to the fact that Cron jobs can act purely on local state, and that's OK.

For example, if I had a traditionally deployed (i.e. not in K8S / a PaaS / similar) backend app that accepted file uploads, then passed those off to something else, I'd be streaming the uploads to a temporary holding directory on disk. I'd then have a cron job that clears stale items from the temp dir. If the server fails, it's OK that the cron job didn't run.
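
That cron job can be as dumb as something like this - the path and TTL here are just illustrative:

```python
#!/usr/bin/env python3
"""Cron job: delete upload temp files older than 24 hours. Purely local state."""
import os
import time

UPLOAD_TMP_DIR = "/var/tmp/app-uploads"  # hypothetical holding directory
MAX_AGE_SECONDS = 24 * 60 * 60

now = time.time()
for name in os.listdir(UPLOAD_TMP_DIR):
    path = os.path.join(UPLOAD_TMP_DIR, name)
    if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
        os.remove(path)
```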

There are still plenty of use cases for traditional cron.


>Until that one server goes down in the middle of the night or it has issues.

The extra complexity of any workflow manager would make it even more likely to "go down in the middle of the night or have issues".


In the case of Nomad, you run it as a cluster of three. But if one of my app servers went down in the middle of the night when I was using Nomad, the next day I might notice a degradation of performance, but everything still ran.

The server going down was never really the issue, though, honestly. The issue was usually a process taking more CPU/memory than expected; in that case Nomad could intelligently schedule jobs based on available resources across the fleet of app servers.

These days with AWS, I don’t use Nomad, I just use CloudWatch and for the processes that aren’t Lambda based, I use autoscaling groups with the appropriate metrics for scaling in and out.
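
The scheduled-job piece is pretty small - roughly something like this with boto3, where the rule name and Lambda ARN are placeholders (the Lambda also needs a permission allowing events.amazonaws.com to invoke it):

```python
import boto3

events = boto3.client("events")  # CloudWatch Events / EventBridge

# Run a Lambda every 5 minutes.
events.put_rule(Name="my-periodic-job", ScheduleExpression="rate(5 minutes)")
events.put_targets(
    Rule="my-periodic-job",
    Targets=[{
        "Id": "1",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:my-job",  # placeholder ARN
    }],
)
```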

That also means if a server goes wonky, I can just take it out of the group for troubleshooting later and another instance will automatically be launched.


Yes, jobs that run unattended via cron should have logging and watchdog processes that check for completion and possibly do reconciliation - those can alert if something goes wrong, and the logs help diagnosis.
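
Something in this direction is usually enough - a rough sketch, where the script name, log path, and alert hook are all placeholders:

```python
#!/usr/bin/env python3
"""Wrapper that cron invokes: logs the run and alerts if the real job fails."""
import logging
import subprocess
import sys

logging.basicConfig(filename="/var/log/nightly_job.log",
                    format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)

def alert(message):
    # Placeholder: page/email/Slack however your team does alerting.
    logging.error("ALERT: %s", message)

logging.info("job starting")
result = subprocess.run(["python", "do_the_real_work.py"])
if result.returncode != 0:
    alert(f"nightly job failed with exit code {result.returncode}")
    sys.exit(result.returncode)
logging.info("job completed successfully")
```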


Now you’re reinventing Nomad....


> Cron jobs are the definition of the anti pattern of treating servers like pets and not cattle

What's old is new again. Much of clean distributed systems development is now built on what are, essentially, scheduled periodic operations. They're pretty much the least complicated way to loosely couple domain logic in distributed systems that follow eventually-consistent semantics. They're also a good model for the functionality of many distributed scheduling systems like Kubernetes, AWS' ECS Scheduled Tasks, and more.

Kubernetes even goes so far as to have Jobs (batch operations) and CronJobs (a scheduler that creates Jobs).


These days I mostly use AWS’s native services like CloudWatch to schedule jobs, Lambdas, and Step Functions, but to keep the post generic I mentioned Nomad/Consul, which I have used in the past for an on-prem implementation that kept us from having to bring Docker into the mix. Since we were already using Consul, using Nomad just made sense.


Depends on the use case / workload.


>Cron jobs are the definition of the anti pattern of treating servers like pets and not cattle.

Cron jobs are the definition of "if it works don't fix it" and YAGNI.


But cron jobs don’t work when you need server redundancy or retry logic.


That’s not cron’s job. Different jobs have different requirements and tradeoffs. Ownership of retry logic and distribution design belongs with whatever cron kicked off.
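
If a particular job does need retries, a few lines inside the job itself (or a thin wrapper around it) cover it - this is just a sketch, and the backoff numbers are made up:

```python
import time

def run_with_retries(job, attempts=3, base_delay=30):
    """Run `job`, retrying with simple exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # let cron's MAILTO / your monitoring see the failure
            time.sleep(base_delay * 2 ** (attempt - 1))
```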


Yes, but it's not like the original claim said - that it's somehow totally "deprecated" today and people should skip it.


Yeah because for me, anything that doesn’t have redundancy should be skipped....

Unless the cron job is only doing some type of maintenance that only affects the local redundant server....


That's a common use case, log rotation, AV scan, etc. It's just a concept to be familiar with.


It's just as important to understand the history/genesis of cron as it is to know when to not use it.


Learning the cron format is still necessary/useful - most distributed deployment systems have some form of scheduled job runner. Even on a purely serverless model like AWS Lambda, it’s still possible to do distributed crons. I think the word ‘cron’ has already been repurposed - when using it, my team actually refers to the distributed version, not the per-server version.
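
The five-field format (minute, hour, day-of-month, month, day-of-week) is the part worth internalizing. If you want to sanity-check an expression, something like the `croniter` library works - a small sketch, assuming it's installed:

```python
from datetime import datetime
from croniter import croniter

# "At minute 30 past every 4th hour" - the same syntax whether it ends up in a
# crontab or a Kubernetes CronJob.
schedule = croniter("30 */4 * * *", datetime(2020, 1, 1))
for _ in range(3):
    print(schedule.get_next(datetime))
```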


Still good to know for legacy fixes.


Hey - why not just systemctl? It’s quite robust and reliable.


That doesn't work when you cannot access the machine where the cron is supposed to run, for security reasons or whatever.

There, crontab is not an anti-pattern at all.


What does not having access to the server have to do with using a fault-tolerant solution?


A good portion of this falls under the purview of "data engineering" which is another conceptual layer to think about / research.


Yes: reliability, monitoring, and error handling are the types of things I’m looking for more information on. Do you have any recommendations for more information on these topics? I should have clarified that my question was geared towards important concepts agnostic of languages/frameworks/etc. This is a great list of further reading, thank you.

Also, what does observability mean in this context?


"Also what does observability mean is this context?"

Something went wrong, and now your site is serving 500 server errors to everybody at the rate of 25,000 per minute. The ops team already tried "just reboot it" and it didn't help. How are you going to figure out what is going on and fix it?

It's (mostly) too late to add anything, so all you've got is the logs you already had, the metrics you already had, etc. That's the "observable" stuff in a system. There's an art to recording what it is you need to know without recording so much that you can't find what you need in the mess.

(The "mostly" is that if you have a good enough setup, you might be able to bring up a new system and route some very small fraction of traffic to it to examine it more intensely in real-time with a debugger or something, though in my experience, on those occasions I've had the opportunity to try this, it's never been a problem that would manifest on a new system receiving a vanishing fraction of a percent of the scale of a production box. But maybe you'll get lucky.)

You certainly want to do everything you can to not be in that mess in the first place, but it won't be enough. You need a system sufficiently observable that you can find the problem and find some sort of solution.
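
To make that slightly more concrete, a lot of the "art" is just attaching context to what you record. A rough sketch, where the field names are purely illustrative:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api")

def log_request(request_id, route, status, duration_ms, **context):
    """Emit one structured line per request so you can slice by route/status later."""
    log.info(json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
        **context,
    }))

# When the 500s start, you can grep/aggregate by route, upstream, request_id, ...
log_request("d3adb33f", "/checkout", 500, 1240, upstream="payments", error="timeout")
```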


Oh, thank you, I didn't know that was referred to as "observability"; I thought it was just logging. This article from Etsy's engineering blog [1] was part of the inspiration for this question. Funnily enough, when I googled "Etsy engineering logging" the 5th result was for a position on Etsy's observability team.

[1] https://codeascraft.com/2011/02/15/measure-anything-measure-...


I think of observability as a triad:

- logging (eg. Splunk, Sumo Logic, LogDNA)

- metrics (eg. Prometheus, Datadog, Grafana)

- tracing (eg. Lightstep, New Relic, Zipkin)

As mentioned above, observability is the data collected about a system.
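
As a tiny example of the metrics leg, here's a sketch using the `prometheus_client` library - the metric names and route are made up:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics; Prometheus scrapes them from http://host:8000/metrics.
REQUESTS = Counter("app_requests_total", "Requests handled", ["route", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency in seconds", ["route"])

def handle_checkout():
    with LATENCY.labels(route="/checkout").time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(route="/checkout", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # in a real service this runs for the life of the process
    for _ in range(1000):
        handle_checkout()
```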


When it comes to "measure everything" I've found services with clients that already grok popular frameworks to be a godsend. We use New Relic, and its ability to automatically instrument all REST APIs and DB transactions is delightful. I could not imagine going back to having to do it manually or guess what information might be useful later.


You might want to look into honeycomb.io and follow Charity Majors on Twitter. Heck, just follow Charity anyway - she's a genius.


jerf answered observability well in another reply to this comment.

As for reliability, monitoring, and error handling I've heard good things about the Google SRE book: https://landing.google.com/sre/books/

I haven't read it personally, but I've heard good things from others and looking over it briefly the advice there lines up with what I've experienced in practice.


For some of these concepts - take a look at what Envoy + Istio, Linkerd (and other service meshes) are trying to solve and conceptualize: load balancing, auth(n/z), monitoring, logging, etc.



