
Completely agree. One of the first problems I found was that the server wasn't running under a process manager, so when it crashed it never came back up. Simple enough to solve, but non-obvious if you've never run into it before. It's an obvious checkbox for anyone who has hosted an application before, but it's one of many important ones.
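For anyone who hasn't run into it: on a systemd distro this is a few lines in a unit file. A minimal sketch, with hypothetical service and binary names:

    # /etc/systemd/system/myapp.service  (hypothetical name and path)
    [Unit]
    Description=My application server

    [Service]
    # Hypothetical start command; point this at the real binary.
    ExecStart=/usr/local/bin/myapp
    # Restart after crashes and non-zero exits, but not clean stops.
    Restart=on-failure
    # Wait two seconds between restart attempts.
    RestartSec=2

    [Install]
    WantedBy=multi-user.target

Then systemctl enable --now myapp, and the service manager handles the rest.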



I can’t think of any well-known professional projects that operate like this. IMO, automatically restarting after a crash is a terrible crutch that ensures crashing issues never get fixed at the priority level they deserve. Can you imagine if Apache, postgres, postfix, haproxy, etc. operated this way? If you have so many crashing issues that you need something like this, you have serious problems.


> Can you imagine if Apache, postgres, postfix, haproxy, etc. operated this way?

They do operate that way; when an Apache or Postgres worker crashes, a new one is started.


Postgres uses some heuristics to decide whether a restart is safe, since a crashed backend may have corrupted shared state. But that concern is mostly limited to databases.


A process manager is one of many failsafes for handling a crash, and it has plenty of uses beyond crash recovery. A robust production system should have many layers of failsafes: to prevent crashes, to recover quickly when they happen anyway (increased load, outages, and so on), and to review them after the fact. In fact, most setup articles run every piece of software you mentioned under a process manager. They don't do this because they think the software will crash; they do it because when it does crash, they don't want a developer to have to restart it manually while the issue is being diagnosed. Perhaps this is a good blog post topic.


Stateless servers can and should recover automatically after a crash. Erlang built a whole philosophy of reliable software construction around letting errors crash a process and having a supervisor restart it, and it's used for systems with higher reliability requirements than most database servers, including zero-downtime upgrades.
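For the unfamiliar, a minimal OTP supervisor sketch (module and function names are made up): the child is restarted whenever it dies, but if crashes come faster than the intensity/period limit, the supervisor gives up and escalates the failure instead of looping forever.

    %% Minimal one_for_one supervisor; my_sup and my_worker are hypothetical.
    -module(my_sup).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        %% Allow at most 5 restarts within 10 seconds, then escalate.
        SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
        %% A 'permanent' child is restarted every time it terminates.
        Child = #{id => my_worker,
                  start => {my_worker, start_link, []},
                  restart => permanent},
        {ok, {SupFlags, [Child]}}.

Note the escalation limit: the point is not to paper over bugs, it's to keep a crash loop from masquerading as a healthy system while the crash reports pile up in the logs.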


I encourage you to read Joe Armstrong's dissertation on how the simple idea of restarting on crash can produce an astonishingly reliable system.


Resilience is one of the main characteristics of a stable codebase. Many failures, probably most once your developers are past the junior stage, occur because an assumption about something stopped holding true.

Just this last week there was a major outage on an internal tool because the connection to the LDAP server got flaky, and the internal tool didn't know how to handle that -- instead of recovering gracefully, it freaked out until the whole server crashed.

That could've been avoided with just a small amount of recovery logic, but instead, the original author must've thought "If we can't keep our LDAP server connected, we have serious problems". But in the real world, things go sideways sometimes. In the real world, there are bugs in Microsoft's July 2018 patchset that ruin the TCP stack (cf. https://support.microsoft.com/en-us/help/4345421/windows-10-... among many others).

It's simple naivete to assume that you won't have problems as long as you keep your nose clean. There are externalities that impact even the most meticulous of us. If you want reliability, your systems must anticipate this.


The problem is that you often can't anticipate in what ways other systems may f..k you over. So automated restarts make sense. But you should then try to analyze the problem and handle it properly the next time.


"If you have so many crashing issues that you need something like this, you have serious problems."

You don't need many crash issues for a process manager to make sense. If you just have a crash from time to time, it's nice if the whole system recovers without somebody having to restart it manually. However, every crash should still be taken very seriously and analyzed.


They do, by default. The Linux packages install them as system services, and the service manager takes care of restarts. Your mileage may vary with the distribution.

Apache goes further than that. When running applications like PHP, it is commonly configured to recycle each worker after a set number of requests, because historically PHP applications suffered from memory leaks.
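If anyone wants the knob: in the prefork MPM (which mod_php typically requires) it's MaxConnectionsPerChild, called MaxRequestsPerChild before 2.4. The value below is just illustrative:

    # Recycle each worker process after it has served 1000 connections,
    # so slow leaks in the application code can't accumulate forever.
    # 0 would mean "never recycle".
    <IfModule mpm_prefork_module>
        MaxConnectionsPerChild 1000
    </IfModule>

PHP-FPM has the same idea as pm.max_requests.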


> Can you imagine if Apache, postgres, postfix, haproxy, etc. operated this way?

They all do. In the real world, a crash or a forced restart is a pretty normal part of operating.


I can't think of any well-known professional project that doesn't work like that.


Your process manager can alert you when a restart happens. You can then fix the problem.
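With systemd, for instance, OnFailure= covers the alerting part; note that it fires when the unit actually enters a failed state (e.g. the restart limit is exhausted), not on every individual restart. Unit and script names here are hypothetical:

    # In myapp.service, [Unit] section:
    OnFailure=crash-alert@%n.service

    # /etc/systemd/system/crash-alert@.service  (hypothetical alert unit)
    [Unit]
    Description=Alert on failure of %i

    [Service]
    Type=oneshot
    # %i expands to the name of the failed unit; send-alert stands in
    # for whatever mail/pager tooling you actually use.
    ExecStart=/usr/local/bin/send-alert "unit %i failed"

For alerting on every restart, watching the journal (journalctl -u myapp) is the usual route.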



