> The goal should be to restart less often, not to restart more quickly.
No! These are not mutually exclusive. The goal should be to restart less often, but restart faster when it is necessary. Systems have to restart. Well run/maintained/designed systems may not need to restart often, but they do need to restart.
Consider: Many moons ago I was responsible for (among other things) the company's CRM systems. These were a couple of mid-range Suns that were extremely reliable. On an annual basis, the business operations department issued downtime cost estimates for business-critical systems (these were used to develop inter-departmental SLAs), and it estimated that the CRM database server cost the company about $250K per hour of downtime, or almost $4,200 per minute. Or, read another way: a 10-seat license for our software in a saturated (niche) market.
The point is, outages happen (no, you cannot account for all possible failure scenarios), and startup times matter.
While startup time is critical, how much time is being lost by imposing complexity on the humans? At least in my opinion, that bit about binary logs makes the entire thing a no-go. The absolute last thing I want in any sort of time-critical situation is to complicate the process of reading logs.
Very true, and, to be honest, I'm not advocating for systemd. Though according to the original article this isn't quite the issue that it's claimed to be (I'm not actually familiar enough with systemd to be able to have an opinion on it), see point 20.
My assessment on the systemd debate, though, is that it feels a bit like neophobia and the technical justifications (against systemd) are a bit on the weak side.
You're right, it is mostly neophobia. I sort of tried to say that in my top comment; I don't think that systemd is technically inferior, I just can't yet justify switching to it.
But, I've been meaning to write about this for a while: I think neophobia is starting to become an unnecessarily religious article of faith in the technical community. I run a small consulting shop; we support (or try to support) just about everything under the sun. What this means is that every single time there's a significant hardware change, my hardware guy has to be on top of it; every time there's a new mobile device UI change, or platform, or service or software, he has to get to know it right away; every time PHP or MySQL or Linode or any of a number of aspects of Linux changes, I have to be on top of it; every time a new operating system version is released, we have to be familiar with it.
And every single one of those little ecosystems assumes that you have the time to sit down and read and digest their documentation or play with their new way of doing things. And, if you don't, or if you miss something, you're chastised by the community (or, worse, by your customers).
It is exactly like being in college and having every one of your professors assign 3 hours' work each night and then wonder why you're making a big deal of it.
To make things worse, when something goes wrong, you know as well as I do that you rely heavily on familiarity. You don't want to be looking things up in a man page trying to remember the specific incantation for a particular thing while your budget shrinks by the second. I was lucky that the smbd/nmbd fault happened on my personal system; if it had happened to one of our clients, who has everyone using a centralized smbd share, then smbd failing to start would have stopped work for the entire company.
So, yes, it's true that I am not a fan of change for change's sake. Lennart argues that that's not what's happening with systemd. Maybe he's right. But, it is still another massive change in something that I use and support on a daily basis, that is going to irretrievably consume a little bit more time out of my limited life, that is going to force me to throw out all of the old familiar tactics ("hmm ... new client, their MySQL isn't starting, dunno where MySQL logs are on this system, let's start with `grep -R mysql /var/log/*` ... oh wait, this is a systemd system, that doesn't work").
From that standpoint, I feel like the technical justifications in favor of systemd are a bit on the weak side.
Systemd is actually way better at logging why a service did not start. With sysvinit I often had to figure out the exact command the shell script would start the daemon with to see the stderr output. Systemd just logs this by itself, and you can see this output in, e.g., `systemctl status`. This saves loads of time.
Sure, but given that developer time is a bounded resource, we can't really expect to get both fewer restarts and faster restarts. I sympathize with your example, but I would expect that you had already seriously optimized your startup process, and since systemd doesn't make services start more quickly (it only does a better job of parallelizing them), I doubt that it would have made a huge difference in your case. Assuming even a very generous 5-second startup difference, that would have only made a $350 difference per incident.
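For anyone who wants to check the arithmetic, here is a minimal Python sketch. The $250K per hour figure is the estimate quoted earlier in the thread and the 5 seconds is the assumed saving from this comment; the function name and the flat-rate cost model are purely illustrative assumptions.

```python
def downtime_cost(cost_per_hour: float, seconds: float) -> float:
    """Cost of an outage of the given length, assuming a flat hourly rate."""
    return cost_per_hour / 3600 * seconds


COST_PER_HOUR = 250_000  # downtime estimate quoted earlier in the thread

per_minute = downtime_cost(COST_PER_HOUR, 60)
five_seconds = downtime_cost(COST_PER_HOUR, 5)

print(f"per minute: ${per_minute:,.0f}")    # per minute: $4,167
print(f"5 seconds:  ${five_seconds:,.0f}")  # 5 seconds:  $347
```

So "almost $4200 per minute" and "a $350 difference" are both fair roundings of the same hourly estimate.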
> but given that developer time is a bounded resource, we can't really expect to get both fewer restarts and faster restarts.
You're going to have to clarify, as this statement doesn't make any sense.
You're right though. In this instance, system startup was already highly optimized. In fact, the order of startup was shifted around to get Oracle (the DB in this example) up and operational as early as possible.
On the other hand, with the introduction of SMF, that "need to optimise" largely went away. Sure, SMF didn't make much of a difference in this particular instance, but parallelisation does make a difference when you're talking about total system (not service) startup time. Further, it can make a difference on systems that run multiple services. If 5 services have the same dependency set, then a parallel startup means that services 2-5 become available earlier versus a linear startup.
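To make the 5-service case concrete, here is a small Python sketch that compares when each service becomes available under a linear versus a parallel start. The service names and start durations are made up for illustration, and the parallel case optimistically assumes the services do not compete for CPU or I/O while starting.

```python
# Hypothetical start durations in seconds for five services that share
# the same, already satisfied, dependency set. The numbers are made up.
durations = {"svc1": 4, "svc2": 3, "svc3": 6, "svc4": 2, "svc5": 5}


def serial_ready_times(durations):
    """Linear startup: each service starts after the previous one finishes."""
    ready, clock = {}, 0
    for name, secs in durations.items():
        clock += secs
        ready[name] = clock
    return ready


def parallel_ready_times(durations):
    """Parallel startup: all start at t=0 (assumes no resource contention)."""
    return dict(durations)


serial = serial_ready_times(durations)
parallel = parallel_ready_times(durations)
for name in durations:
    print(f"{name}: serial {serial[name]:>2}s, parallel {parallel[name]:>2}s")
```

With these numbers the first service is ready at the same time either way, while services 2-5 come up earlier in the parallel case, and the whole set is available after 6 seconds instead of 20.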
Finally, shaving 5 seconds off of service startup _does_ make a difference. While the above example means $350, there are a few things to consider.
1) A $350 loss is still a loss. From a business perspective, a loss, no matter how small, is still to be avoided. There are other considerations as well, such as the impact of the extra 5 seconds on my budget (the SLAs mean that the loss comes out of my budget).
2) In the realm of big enterprise, this is a relatively small outage cost. In fact, I've managed systems with an order of magnitude larger failure cost (SLAs are a bitch).
3) It's difficult to factor soft costs. Things such as customer confidence, indirect productivity loss, etc. 5 seconds, again, means services are available sooner, lessening the impact of these soft, and difficult to quantify, costs.
I've always architected systems such that restart time didn't matter: the systems were able to fail over. However, in-flight operations would fail, which meant that a restart would, in fact, cost >$50k, regardless of the length of time it took for that machine to come back.
Faster restart time is irrelevant to me. I'm already designed to deal with outages. Reducing the impact of a restart is extremely important (making them fewer in number, smaller in scope).
This usually depends on how business critical it is. If some service is not very business critical, then how do you justify (or get budget for) the extra resources you need to make it way more reliable?
But even when not business critical, it is nice that you can bring it up quickly.