Kill Me Softly – Kill processes in a reliable manner (github.com/alanfranz)
58 points by alanfranz on Dec 16, 2014 | 53 comments



If you are the one designing your applications, don't rely on SIGTERM and clean shutdowns. Design them so that SIGKILL is the default way to stop them: no data gets corrupted and persistent state stays consistent. Then you don't have to rely on some rarely executed "recovery" code or, even worse, take chances on what might happen if SIGTERM doesn't work.

Handle SIGTERM only if you're stuck with a blackbox-like application (a database that will break if you pull the power plug on it), and so on.


You're correct, from a purely theoretical point of view. But the real world doesn't always work this way.

The recovery process after a SIGKILL may require manual intervention, or may leave some data in an inconsistent state. Not every piece of software in the world gets the amount of testing for hard kills that an ACID-compliant RDBMS does - I'd say quite the contrary.

So, I don't want to trigger bugs and lose time on complicated recoveries if I'm not absolutely forced to. Let processes clean up after themselves, and hit them hard only if and when needed.


I hear what you're saying, but the flip side is that all that software also won't work when you lose power or the system panics. I've found it much easier to just always use SIGKILL. That way, you're testing the recovery code all the time, especially during development, when you restart, upgrade, or recover after a crash. You fix the bugs, and you have high confidence because you're exercising this path all the time.


> from a purely theoretical point of view. But the real world doesn't always work this way.

I have implemented practical, running systems that work like that, so it is practical, not just academic.

> But the real world doesn't always work this way.

Sure, sometimes you don't have a choice. You get a 3rd party library/service/db that will shit the bed if something pulls the plug on the machine. You can try to identify those parts and eliminate them, or mitigate the risk as much as possible. But sometimes you can't. Then try to characterise their behavior. Maybe have some external cleanup logic if you care to ensure data doesn't get corrupted. And live with it.

> Not every piece of software in the world gets the amount of testing for hard kills that an ACID-compliant RDBMS does - I'd say quite the contrary.

Well, it is something to strive for. If you write the software, it pays to understand what happens to the data: know a bit about file syncing, what the kernel does with dirty pages, how/if/when to use O_DIRECT, append-only modes, write-ahead logs. Think about using checksums in your persistent data, take advantage of atomic file system operations (renames), etc.
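For example, something along these lines for replacing a state file (just a sketch: generate_new_state is a stand-in for whatever produces your data, it assumes the temp file sits on the same filesystem as the target, and it glosses over the fsync of the file and its directory that full durability needs):

  # write the new state to a temporary file, then rename it over the old
  # one; rename is atomic, so a SIGKILL or power cut leaves either the old
  # file or the new one on disk, never a torn mix of the two
  tmp=$(mktemp state.XXXXXX) || exit 1
  generate_new_state > "$tmp"   # placeholder for whatever produces the data
  mv -- "$tmp" state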

Well, I certainly saw the benefit and the return from running a better, well-tested, well-designed service that operates that way.

> Let processes clean up after themselves,

Think about that statement for a bit. How do you guarantee processes "clean up after themselves"? Say a junior tech trips over the power cord of the server: how does the process "clean up after itself"?


I'm not speaking about the software I write - for that software, I strive for a way to ALWAYS recover, even from a hard crash, without going mad. But even there the automatic recovery is sometimes not completely immediate (it may take a bit of time at startup), so a direct SIGKILL is still something I don't strive for.

But when I'm on my workstation, I need to kill other software as well. I have a UPS on my workstation, and the UPS is very close to it, so the tripping is unlikely :-)

And when hard crashes happen, sometimes I find some software in an inconsistent state, and I need to clean up something manually.

Or maybe a remote endpoint would like to be notified that I'm shutting down, something that's usually done with a clean exit, not with a SIGKILL. Or temporary files could be cleaned up immediately, rather than by a cron-based janitor job.


I don't disagree with the intent (make crash-resistant, journaling programs), but TERM is an incredibly useful tool. Lame-ducking (when your program continues processing current requests but refuses new ones) is a great way to handle rollouts and other planned maintenance in a way that minimises or eliminates user-visible impact.
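Roughly, the shape of it (a sketch only; accept_request and handle_request are stand-ins for whatever your server actually does):

  draining=0
  trap 'draining=1' TERM          # lame-duck: remember the request to stop, don't exit yet
  while [ "$draining" -eq 0 ]; do
    accept_request || break       # no new work is picked up once we're draining
    handle_request                # the request already in flight still completes
  done
  exit 0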


If you think of it from a developer's view, you'll come to the conclusion that you can't create a program that doesn't need to clean up things before termination. Unless it's a pretty useless program that doesn't process any incoming data. That's why there is SIGTERM in operating systems.


That's nonsense. All ACID databases handle the requirement that a SIGKILL doesn't corrupt data. There may be a non-trivial recovery cost if you experience an unexpected shutdown, but that's about it for any application that's serious about data integrity.

EDIT: In fact they handle a much stronger requirement that a power failure doesn't corrupt data, but the point stands.


ACID databases can do this because they have the powerful hand of math giving them superpowers. Most developers don't.


Sqlite is a library that you can just link to. Use that for your state-handling, and you need not fear a thousand SIGKILLs. This is a very popular approach.

Or, if you're on a server, you can use another database. Or use atomic file writes, if you have fancy needs. If you have really exotic needs then sure, do your own thing -- but most of the time, killability is easy and beneficial.
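To make that concrete, here's a rough sketch using the sqlite3 command-line shell rather than the C library (assuming it's installed; the jobs/counters schema is made up for the example) -- both updates land or neither does, SIGKILL or not:

  # group related state changes into one transaction; a kill at any point
  # leaves the database at the previous committed state or at the new one
  sqlite3 state.db "BEGIN IMMEDIATE;
    UPDATE jobs SET status = 'done' WHERE id = 42;
    UPDATE counters SET n = n + 1 WHERE name = 'processed';
  COMMIT;"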


If you do that, then you have to model program state not as objects, but as database rows. Have fun using an ORM on all your projects, for everything.


"powerful hand of math".

What? I did a pretty simple proof of the safety of Write-ahead Logging as an undergraduate project -- given reasonable assumptions about disks (or implementation safeguards such as pervasive checksumming).

Most developers have superpowers: It's called "existing research". I just wish they'd use them more.


You very much can--it depends on what you're doing.

Supporting SIGTERM and certain other signals is just deciding how big a crater you want to make when the program goes down. That said, it's good practice to use things (write-ahead logging, idempotent operations, etc.) that at least try to mitigate the amount of debris.


> That said, it's good practice to use things (write-ahead logging, idempotent operations, etc.) that at least try to mitigate the amount of debris.

Yep, those are some of the things I used: append-only mode for files; code that rolls back to the last consistent state and runs on every startup; external monitors and watchdogs that know how to monitor and do extra cleanup when necessary.
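A minimal sketch of the append-only part, assuming one full line per completed record (the journal file name and record format are made up here, and the recovery step uses GNU sed):

  record_done() {
    printf '%s\n' "$1" >> journal    # a record counts only once its newline is in the file
  }

  recover_journal() {
    [ -s journal ] || return 0
    # a missing final newline means the last append was torn by a kill or
    # power cut: drop the partial line and resume from the previous record
    if [ -n "$(tail -c 1 journal)" ]; then
      sed -i '$ d' journal
    fi
  }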


> If you think of it from a developer's view, you'll come to the conclusion that you can't create a program that doesn't need to clean up things before termination.

Nope. I did create a lot of complicated programs that work that way. Tested, shipped, and got paid for them nicely.


A proper init system will send you a SIGTERM before a normal reboot/shutdown, and also for stopping the program. I do agree that "power failure safety" is a good way to go, but refusing to handle SIGTERM when there is a use for it is just ignorant.


> I do agree that "power failure safety" is a good way to go,

So you have to write code to handle an abrupt power shutoff? And test it, and make sure it doesn't break your data at rest or your overall service. And then, on top of that, you have to write a parallel set of code to also handle SIGTERM.

> but refusing to handle SIGTERM when there is a use for it is just ignorant.

It seems having 2 sets of shutdown/failure code (one for abrupt power failure, one "gentle") is the more ignorant path.


When you actually work with embedded devices especially, you know that there's a big difference. Power loss/SIGKILL must not corrupt or break your data or application. That is _not_ the same thing as, say, a network application being nice and sending goodbye messages, cleaning up resources like closing fds, etc., upon a SIGTERM.

So no, you design for robustness and handle SIGTERM if applicable. You can never know when you'll get killed; that's why your design has to be solid all the way through.


I thought you couldn't handle SIGKILL, so how do you have two sets of code? SIGTERM just lets you do a bit more of the operations you've guaranteed are always safe.


> I thought you couldn't handle SIGKILL

The process that gets sent the SIGKILL can't handle anything from the moment it gets the SIGKILL. But the rest of the system can handle the "killing" cleanup (other processes and management of resources). That could be another process (a supervisor), another process on another machine, or the same process when it gets restarted: the first thing it always does is deal with its remnants and the after-effects of when it was killed last.

> SIGTERM just lets you do a bit more of the operations you've guaranteed are always safe.

What does that mean? Can you expand a bit? I don't quite understand "guaranteed are always safe". If the process that gets sent the SIGTERM catches it and does something "clever", that clever part is never guaranteed, because the power can cut out, SIGKILL can come unexpectedly (the supervisor decides that the SIGTERM handler took too long and sends it), etc.


Right, but you engineered your application so that it only performs operations that can be safely interrupted by SIGKILL.

The application deals with some kind of stream of data. You engineer it so that processing can be interrupted at any point, and on next startup, you're fine--your data is consistent.

Now, once you have that system, I would think that handling SIGTERM is not so hard--you finish processing what's in flight already, while not trying to handle anything new.

From your earlier work, you know that this can't corrupt data. Even if SIGKILL follows while you're trying to handle SIGTERM, you're no worse off than if you just had received SIGKILL in the first place.

The downside seems to be that you now have to show that your system will reliably terminate if it receives SIGTERM.


> I would think that handling SIGTERM is not so hard--you finish processing what's in flight already, while not trying to handle anything new.

What is the point of SIGTERM then? Why bother messing with code that runs in an exiting/aborting program if SIGKILL will handle it? Heck, SIGTERM, if not handled, will just act like SIGKILL.

That is the main point of this -- why write 2 sets of code? Unless you can guarantee your program will always get a SIGTERM first, you have to make sure it works correctly with SIGKILL. If you do, then why bother putting in the lines of code to handle SIGTERM?


Exactly.

>The downside seems to be that you now have to show that your system will reliably terminate if it receives SIGTERM.

That's not a problem as long as it's followed by a SIGKILL (like from an init system).


SIGTERM if not handled (caught) terminates the program.


I know it's normally a good idea to avoid code duplication, but in this case I would have copied the function inside kms_functions into both kmsn and kmsp, since this removes the dependency on readlink and also makes it possible to move the executables around without care.


One of the best things I learned from a very senior engineer was "There are far worse things in life than duplicating code." Dependencies are frequently one of those things.


Oh thank you, those are golden words. I was amused recently when my colleagues wanted to avoid code duplication of one small library and ended up creating a new repository, learning awful Python packaging, and releasing a new package for every small change. Much more pain gained, IMHO.


I have been thinking about alternative modes of operation that don't require an explicit install (it's in the TODOs); I'll probably improve the whole thing in the next few days. Thanks for your suggestion.


Yeah, and you can always use a build tool to generate the two scripts from a DRY codebase.


Alternative modes of operation that would be useful:

* Send SIGUSR1 if TERM doesn't work

* Send SIGCONT if TERM doesn't work

* Attempt to reap the process

* Attempt to ptrace the process

* Give a hint as to why it can't be killed (like blocked I/O, a zombie, etc)

* Kill the process group (dangerous!)


SIGUSR1 is often caught by applications for their own IPC. SIGQUIT would probably be a safer one to send, but that can also be caught and repurposed. Sending other signal types would definitely have to be opt-in to prevent unintended side effects.


Yes, of course that would be useful; something is already in the TODOs. I don't want to overcomplicate what is a simple shell script, by the way. Reaping may be interesting; ptrace may be a bit of overkill. Thanks.


Often I need to kill processes, and wait until they're dead. I want to send a SIGTERM first, then a SIGKILL if it fails. And I want this process to be automated. That's it.
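Roughly this, in other words (just a sketch of that behaviour, not the actual kms scripts; the 10-second grace period is an arbitrary choice):

  term_then_kill() {
    kill -TERM "$1" 2>/dev/null
    for _ in $(seq 1 10); do                 # give it up to ~10s to exit cleanly
      kill -0 "$1" 2>/dev/null || return 0   # gone: no need to escalate
      sleep 1
    done
    kill -KILL "$1" 2>/dev/null              # still around: hit it hard
    while kill -0 "$1" 2>/dev/null; do       # block until the pid disappears
      sleep 1
    done
  }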


I really want to like this. But SIGKILL is such a special instrument in the administrator's toolbox that I don't want it automatically sent after a short timeout. Moreover I don't think anyone should want that. This tool will hide or allow you to ignore important issues with your system.

I do unequivocally like the blocking behaviour though, I might implement that for myself.

Edit: so just something like this:

  function killwait() {
    # send the default SIGTERM, then block until the pid is gone
    kill "$1"
    while ps -p "$1" &>/dev/null; do
      sleep 1
    done
  }


I appreciate the sentiment, but if you think about it... If a process can't be killed arbitrarily at any point during its execution then it's definitely not behaving in a "transactional" way (as in: sort-of-like-ACID-except-the-durability-may-not-be-that-important-as-long-as-consistency-and-atomicity-is-preserved). You really want processes to behave "transactionally", otherwise you'll be really really screwed when the power/UPSs fail.

(I'm ignoring the fact that a "kill -9" actually still flushes buffers and such, but that's tangential to the point I'm trying to make.)

EDIT: In summary: I'd rather find out sooner rather than later :).


It seems you still don't want the advertised tool though, because you prefer to just send SIGKILL immediately.

Environments differ though; the software I run will do worse on power failures/SIGKILL than on SIGTERM. I would prefer it to be more resilient, embrace a crash-only design, etc. But my world suffers from practicalities, so I deal with it by being careful with SIGKILL.


What they want is a fuzzer: send SIGTERM and/or SIGKILL with arbitrary durations between them.


I'd recommend adding some kind of max-retries mechanism, because otherwise, if the process ignores SIGTERM for instance, you can wait forever.


I personally prefer that utilities like this just do one simple task and perform it well. Adding a timeout would just complicate things unnecessarily, compared to the alternative of pressing Ctrl-C when you feel you've waited long enough.


And again you'll get this:

> This tool will hide or allow you to ignore important issues with your system.


So instead of two kill commands, we now have two scripts to run, depending on name or pid?

Just put everything in one script - these scripts are shorter than an init boilerplate section - and use an option to switch. Do process names by default, then perhaps use a -p arg to switch the script into 'pid' mode. Bash's 'getopts' is pretty easy to implement.

As it stands, you currently have to waste mental power to figure out whether you want the 'n' or 'p' version anyway, so may as well just make it an arg.
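Something along these lines (a sketch of the suggested interface, not the actual scripts):

  mode=name                                  # process names by default
  while getopts "p" opt; do
    case "$opt" in
      p) mode=pid ;;                         # -p switches to pid mode
      *) echo "usage: kms [-p] <name-or-pid>" >&2; exit 2 ;;
    esac
  done
  shift $((OPTIND - 1))
  # from here on, $mode says whether "$1" is a name or a pid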


Is the functionality even needed? Doesn't take much to type..

ksm `pidof cupsd`


A lot of these bash scripts could be replaced with Go programs. It doesn't take much code for a bash script to get ugly.


Sure they could, but then you would need different compiled versions for different architectures, and this is kind of OTT for the task it accomplishes as it is!


This would be an incentive to package the code, which would mean being able to check it against SHAs. There are a lot of Go programs on Homebrew now. I'd like to see a trend of smaller ones getting added there.


Or why not make them #!/bin/sh compatible instead? I don't have bash on my systems, for instance.


In my opinion, current process handling is braindead, both in POSIX and Windows. That could be "fixed" with an additional identifier that is unique for the OS uptime, in order to allow other processes to track instances, not just the ones doing the fork.


As you point out, using the pid alone is not reliable. FWIW, Solaris and illumos systems have process contracts that allow you to reliably track processes[0]. If the child forks, the new child is still in the same contract (unless configured otherwise). And either the child or the parent (the watcher) can restart and then return to the normal state (parent watching child processes).

Solving this problem requires kernel support because the kernel is generally the only fault domain in the system that cannot crash while the system is still running.

Edit: forgot the link the first time. [0] http://illumos.org/man/4/process


Linux has this capability in recent kernel versions (3.x?). I don't know what the implementation is called (prctl?), but DragonflyBSD added this in procctl(2) in release 4.0. FreeBSD already had a procctl(2) and added the same functionality to be compatible.

http://lists.dragonflybsd.org/pipermail/commits/2014-Novembe...

So now FreeBSD and DragonflyBSD have plans to utilize this exactly like Illumos and build lightweight process supervision without the need for something as disruptive as systemd.


Thanks to everybody for the feedback. Standalone executable scripts that are both OS X and Linux compatible are now available.


> Those should be available on almost any Linux system, even on minimal installs:

> [..]

> Bash

Maybe I'm nitpicking here, but why _should_ Bash be available?


Those should be available on almost any Linux system, even on minimal installs:

Linux


Tautologies aren't false.



