
When should I not kill -9 a process? - yiedyie
http://unix.stackexchange.com/q/8916/22558
======
btmorex
As is often the case with stackoverflow answers, all of them are wrong in
different ways. You should only kill -9 when every other signal the program is
likely to respond to has not worked. kill -9 is likely to leave the program in
a state that requires manual intervention, especially if that program is a
database.

If you're a developer, before you kill -9 a program send SIGTERM (i.e. kill
without args or kill -15). If the program does not respond, run gdb -p <pid>
and then "thread apply all bt" before killing it. At the very least, you
should get a good idea of why it was not responding to other signals.
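
A minimal sketch of that escalation in POSIX sh, assuming you own the target process. The function name `term_then_kill` and the 10 second default are made up for illustration, not taken from any tool:

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM first, escalate to SIGKILL only after a
# grace period. Returns 0 if the process went away on its own, 1 if SIGKILL
# had to be sent.
term_then_kill() {
    pid=$1
    grace=${2:-10}      # seconds to wait before escalating (assumed default)

    # If the signal cannot be delivered, treat the process as already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 delivers no signal; it only checks that the pid exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0

    # A developer might capture stacks here before killing, e.g.
    #   gdb -p "$pid" -batch -ex 'thread apply all bt'
    echo "pid $pid still alive after ${grace}s, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    return 1
}
```

Something like `term_then_kill 12345 300` would give a slow daemon five minutes before the hammer comes out.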

~~~
TillE
You can't really corrupt a database that easily, can you? That's half the
point of using a database, so you have transactions, etc.

~~~
btmorex
not corrupt != everything is peachy

At the very least, you need to be prepared for a possibly very long replay of
logs. Also, a huge number of people run databases in a configuration that
doesn't make those guarantees. For example, many people will run mysql (esp.
less performant slaves) with innodb_flush_log_at_trx_commit = 0 for
performance with the understanding that a failure might require manual fixes.

~~~
seunosewa
A failure with that parameter will only cause transactions committed in the
last second to be lost. The DB won't require manual fixes.

~~~
spudlyo
_The DB won't require manual fixes._

Losing a transaction that was committed upstream can lead to a bunch of manual
fixes in replicated database clusters. What happens if rows inserted in that
lost transaction are eventually updated? Replication breaks, you get paged,
and now you're faced with either manually fixing the DB consistency error or
rebuilding the whole slave. Woe be unto you if this host is a replication
hub with a bunch of slaves hanging off it.

------
jeffdavis
To anyone saying that you shouldn't "kill -9" a process, or that you should do
some song-and-dance first: kill -9 is exactly what the OOM (out-of-memory)
killer on linux does when memory is short. Typically, the application has no
good way to even know that memory is short, because linux radically
overcommits memory and still won't return a NULL from a malloc().

So, software should be written to assume it might be killed if you want to
have a robust system.

By the way, using a small fixed amount of memory is no defense, or at least
not in all kernel versions. The "badness" heuristic function used to find the
victim could end up counting the same byte of memory many times:

[http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/](http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/)

~~~
falcolas
Just because a program which is designed to prevent kernel panics due to OOM
kill -9s a process, doesn't mean that you as a sysadmin should.

kill -15 typically leaves processes in a properly shut down state, which in
terms of databases alone means that they will start up without a recovery
process (which can be a 20-30 minute operation sometimes). That alone makes
waiting a few minutes for a running process to respond to a kill -15
worthwhile.

~~~
jeffdavis
I was commenting more about how software should be written than what an admin
should do.

For admins, I wouldn't fault them much for kill -9, I would fault the software
more if it led to anything more than an inconvenience. But sure, it's wise to
use -15 or whatever as long as it works.

------
jrockway
Fiber optic cables are dug up by backhoes. Hard drives randomly fail. RAM is
corrupted by cosmic rays. Racks lose power. CPU fans stop spinning.

If your process relies on not being kill -9'd, then you might as well quit
programming and go buy a lottery ticket.

~~~
vacri
This is a bit like saying that since your car can do emergency braking, you
should always do emergency braking. Some processes clean up after themselves
more neatly, or finish the current run of what they're doing first. It's
dependent on what it is that you're stopping.

~~~
mcintyre1994
It seems more like saying "children, cyclists, animals, other drivers. If your
car can't do emergency braking you might as well stop driving and buy a
lottery ticket" to me. Which IMO is pretty reasonable. They don't say anything
about when to use kill -9 (all examples are outside user control) just that it
should be survivable.

~~~
al_gore
I get what you're going for, but it's pretty much openly legal to kill
pedestrians with cars (in the US, at least). "No criminality suspected"!

~~~
aaronem
There's a little more to the United States than the city of New York. I grant
it is an uncommonly large city, but it's not _that_ large.

------
justizin
Over 15 years ago, as a teenager, I taught Linux / UNIX Admin courses, and
worked as a consultant advising folks, and in the late 90s I was very adamant
that you should never -9 anything unless you know exactly what you are doing.

As infrastructures have grown, and I have managed large applications involving
tens to hundreds, often over a thousand servers, I have grown to accept that a
power supply can fail and a node can disappear from the network, and it's even
possible that none of its components, including its drives, will ever work
again. I've never _really_ experienced such a catastrophic failure, but it's a
lot easier to sleep at night if you just assume that.

kill -9 should never be worse than pulling the power plug, which is what
Netflix's chaos monkey always tries to simulate.

We all have to live on a continuum of how much of that we can survive, but if
you always assume abrupt failure, it'll be pretty tough to give you a bad day.

~~~
vacri
_kill -9 should never be worse than pulling the power plug_

No, it shouldn't, but just pulling the power plug isn't exactly recommended
behaviour, either. There aren't many admins out there who will happily yank
the power cord out of their desktops when they want to power it down.

Last night I spent several hours getting a server back into gear after a
'pulled power plug' event. A friend's rack was affected by a (seven-hour!)
substation power outage, and it wasn't on a UPS, so it didn't shut down
cleanly. Eventually we were able to coax the server to boot again (had to
remove all USB devices in the process, including internal ones), and the
problem was a corrupted MBR. Make a rescue usb stick, boot into that, finally
diagnose the problem, cat a new MBR onto it, and tada, fixed. Let's just say
that I don't find the argument "shouldn't be worse than just pulling the plug"
to be particularly comforting at the moment :)

~~~
userbinator
Were you writing to the MBR somehow when it died? If not, that looks like
really badly designed hardware to me.

I've had abrupt shutdowns happen on laptops (one had a particularly loose
battery...), desktops, and servers, and although I have encountered corrupt
files and filesystems, I've never had any of them corrupt the MBR.

~~~
vacri
It's unclear what caused the MBR problems - the power outage was a couple of
days before, and the system seemed to come back okay. My friend was busy with
work and a cursory check had it clear, but yesterday things started acting
funny, he logged in and the load was 40 and rising before it became
unresponsive to his diagnostics. This machine had been running happily for
quite some time before the powerout event (it's basically just a kvm host, the
fun stuff is on the guests), so it's particularly puzzling. Somehow the MBR
was overwritten with a syslinux one, and he says that box had never had
syslinux used on it (extlinux, yes). The root cause will become evident at
some point, it just needs some head-scratching time.

~~~
userbinator
I almost read your last sentence as "The root kit will become evident at some
point", because that's what came to mind with those symptoms. I'd check for an
infection.

------
eknkc
We write all of our server code with kill -9 in mind. Basically, everything we
have can be killed with -9 without any problems. It needs some cleanup code
for leftover files or things like that, and use of atomic operations here and
there. But then you are ready for all kinds of hardware issues.
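
One way this can look in POSIX sh, as a rough sketch. The helper names and the `.tmp.` suffix convention are assumptions for illustration, not what the commenter actually runs:

```shell
#!/bin/sh
# Hypothetical helpers for kill -9 tolerant file handling.

# Write stdin to "$1" atomically: a reader sees either the old file or the
# complete new one, never a half-written file, even if we are killed midway.
atomic_write() {
    target=$1
    tmp="$target.tmp.$$"
    cat > "$tmp" || return 1
    # Renaming within one filesystem replaces the target in a single step.
    mv -f "$tmp" "$target"
}

# Run at startup: remove temp files left behind by a previous killed run.
cleanup_leftovers() {
    dir=$1
    rm -f -- "$dir"/*.tmp.*
}
```

Usage would be along the lines of `generate_state | atomic_write /var/lib/myapp/state` (both names made up), with `cleanup_leftovers /var/lib/myapp` called once on startup. Note this only covers the process being killed; surviving a power cut also needs the data flushed to disk before the rename.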

~~~
mcintyre1994
Are there resources on how to deal with this that you'd recommend? Is it even
an issue in a higher level programming language or will the issue be
abstracted away by a language above say, C?

~~~
ZeroGravitas
I googled for an old LWN article on "crash-only" software and found this
request for similar resources:

[http://stackoverflow.com/questions/2405172/resources-about-c...](http://stackoverflow.com/questions/2405172/resources-about-crash-safe-and-fault-tolerance-programming)

------
Theodores
Well, there is always this scenario: when you can't even use CTRL+ALT+F2 to
get to some type of terminal and only the power button, held in for ten
seconds, will do. That's when you should not 'kill -9'.

I have heard the best practice advice for many years and I think that the 'you
should send some friendly signal first' is not universally what works out
best. For instance, if your Chrome browser is getting out of hand and the
system is permanently doing some 96% wait for some reason, a gentle killing of
Chrome will take ages and, when it restarts, you might get some but not all of
your tabs back. With a killall -s 9 you can be back to work quickly with all
your tabs (and underlying swappiness problem hopefully resolved).

------
kator
I think caution is still key, if you don't know what's going on Slow is Fast
here. Several years ago while I was on an airplane flying to spend a nice
vacation break with my family my admin partner tried to shutdown a MySQL db
the "right way". He logged in and ran a mysqladmin shutdown and waited for a
while. Not sure how long he waited but he claimed it was a "long time". Since
it felt like there was no response to the command he assumed the database was
hung and issued a kill -9 on all the mysql processes.

Sadly, what he failed to check was disk IO stats. This MySQL setup had heavy
innodb table usage and settings that were deliberately set for more
performance than reliability (large buffers, delayed commits etc.). What was
going on was normal: MySQL was flushing everything to disk and to logs and was
most likely going to stop without a problem.

He didn't look at the facts at hand, the disk IO was still going, MySQL was
mostly writing to the log files, users were not being let in so the db was
doing an orderly shutdown. Instead with the adrenalin pumping he felt he had
“waited a long time” and issued the kill -9 and corrupted the InnoDB logs and
tables beyond all recognition.

I landed at the airport to five frantic voicemails because this db was the
core of a bunch of high profile sites and he was up to his ears in phone calls
from the client. I had to spend the first 9 hours of my vacation with my kids
playing in the background while I sweated it out on a laptop over a crappy
connection that kept dropping me.

Yes I know, MySQL should have been able to handle the "power out" but this
event was made worse because he started the shutdown, we had a deliberately
fragile implementation, he didn't check the slaves so we didn't have a clean
fall back and meanwhile he "waited a long time" but never checked the process
to see what it was doing.

I use kill -9 (-KILL) all the time, but I do it where I know it's needed. Most
of the time kill just works and if it doesn't that should give you pause to
think carefully about what you'll do next. Slow is Fast and Fast is Slow, if
you quickly do something radical like kill -9 or init 6 or 10 second power
button crash then you may be spending the rest of your day cleaning up. Slow
down a bit, look, listen and gather facts about the situation then make an
informed decision. At least if you do all that and the rest of your day is
still ruined you won't have that nagging feeling you shot yourself in the foot
and you can talk intelligently to your client or boss about steps you took to
avoid the situation.
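
For the "gather facts" step, here is a rough Linux-only sketch of checking whether a process is still writing before you reach for -9. It assumes `/proc/<pid>/io` exists and is readable (usually only for your own processes or as root); the function names are made up:

```shell
#!/bin/sh
# Hypothetical helpers; Linux only, relies on /proc/<pid>/io.

# Print the write_bytes counter from a /proc/<pid>/io style file.
write_bytes() {
    awk '$1 == "write_bytes:" { print $2 }' "$1"
}

# Succeed if the process pushed more bytes to storage during the sample
# window. Returns 2 if the counter could not be read.
still_writing() {
    pid=$1
    window=${2:-5}      # seconds between the two samples (assumed default)
    before=$(write_bytes "/proc/$pid/io") || return 2
    sleep "$window"
    after=$(write_bytes "/proc/$pid/io") || return 2
    [ "$after" -gt "$before" ] 2>/dev/null || return 1
}
```

If `still_writing <pid> 10` succeeds, the process is probably still flushing and worth waiting on. It is only one data point, not proof that the shutdown is healthy.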

My failure I suppose was that I hadn't explained to him that it was typical
for the db to take upwards of 5 to 8 minutes to cleanly shutdown. Which gets
to a second topic, documentation for production systems is essential, when the
fire is on too many mistakes can be made because of "knowledge gaps" between
team members. Needless to say after this incident I wrote extensive
documentation for the team so the next time I was "on vacation" I could
actually be "on vacation". :-)

~~~
marcosdumay
I'm aware it may be an irritating question to answer, and I'd understand if
you don't bother, but I have to ask it because I did never understand why
people do such things...

Was the extra performance in any way worth it?

~~~
kator
This was a core cluster db for about 150 web sites that had high profiles and
user traffic. It was a long time ago and the client didn't want to pay for 20
more boxes to do something more durable. These things are always a balance
between durability and speed. In this case the client wanted cheap speed.

Sadly it worked very well except when people used a big hammer on it. We had
replication slaves but in this incident the slaves were not replicating and
somehow the script checking them wasn't alerting. But even so the admin didn't
check before bashing the keyboard and thus we were left with manual
reconstruction from backups and other sundry sources of information.

~~~
marcosdumay
Thanks for the answer. Looks like Murphy's law was in action that day.

So, that was by the client's demand. You got to save 20 machines with it, I'm
impressed, and now I understand it better why you did it.

------
servowire
Don't exaggerate the impact of a kill -9. It can be safe if a homegrown
application has a bug or there is a hardware failure preventing a clean reboot
(like a locked IO to a disk that is no longer in order).

Sure it will mess up some things, but when management is pushing to limit the
downtime of, for instance, a golden-image provisioned Linux machine, I'd kill
it off no problem.

Now when we are talking a hardware box running some form of Oracle/mySQL - no,
don't use -9 indeed.

------
captainmuon
The advice to never ever use `kill -9` is too strong. It's fine to use if you
know what the program you are killing does.

In my case, processes I have to `kill -9` are

- programs that only read data files (or write to dispensable files)

- programs that I know don't react to SIGTERM (there is no cleanup logic, but
still something makes them swallow SIGTERM)

- often they are simple tools (e.g. ls) that become wedged in a system call
or in kernel code (when trying to access a bad NFS share)

- in the other cases they are in-house programs that are either badly
written, or too complex (the worst offender is CERN's ROOT, if it becomes
wedged you have to `kill` and `kill -9` several processes it spawns), or where
we don't care enough to fix them

Interestingly, there seem to be some cases where even `kill -9` doesn't help.
What I do then is to freeze the process with Ctrl+Z (Ctrl+C doesn't work of
course), and then `killall -9 $(jobs -p); fg`.

Actually, I have one program I routinely call with `program; killall -9 $(jobs
-p); fg` and end it with Ctrl+Z. Sad, but true.

(Of course, if your process is a database or a GUI tool or something, then all
the standard wisdom against `kill -9` applies.)

------
staunch
One example of where you might not want to kill -9 (-KILL) is a web server,
like nginx. If you kill -QUIT (-3) nginx it will do a graceful shutdown. Nginx
closes its listen sockets while allowing existing clients to finish. Many
other daemons have similarly friendly behavior if you give them a chance to
shutdown gracefully. They should degrade safely (if not nicely) with -KILL
(-9) as well.

------
nailer
Protip: rather than remembering magic numbers, use the actual names:

    
    
        kill -HUP <PID>
        kill -KILL <PID>
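
If you only remember the number, `kill -l` can translate it. A small sketch; the wrapper name `sig_name` is made up, `kill -l <number>` itself is standard:

```shell
#!/bin/sh
# Hypothetical wrapper around the standard `kill -l` lookup.
sig_name() {
    kill -l "$1"
}

sig_name 9      # prints KILL
sig_name 15     # prints TERM
```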

------
vram22
Interesting post and HN comment thread, I'll have to read it in full (or at
least the good parts). Anyway, here's a loosely related post that may be of
interest, from my blog:

Unix one-liner to kill a hanging Firefox process:

[http://jugad2.blogspot.in/2008/09/unix-one-liner-to-kill-han...](http://jugad2.blogspot.in/2008/09/unix-one-liner-to-kill-hanging-firefox.html)

It had an interesting thread of comments in which both others and I
participated, and at least I learnt some things.

------
spullara
If your application doesn't work properly when it is kill -9'd, you just don't
have a reliable application.

------
tambourine_man
[http://youtube.com/watch?v=Fow7iUaKrq4](http://youtube.com/watch?v=Fow7iUaKrq4)

kill -9!

------
cies
My naive answer: If a normal `kill` did not do the trick.

At least that's what I do.

I guess a process is given more space to "clean up after itself" with a normal
`kill`; where a `kill -9` forces it to die.

Anyway; I don't know the exact answer -- will come back later to read a wiser
person's answer. :)
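
To make the "clean up after itself" difference concrete, here is a small sh sketch. The worker, its scratch file and the name `run_worker` are all invented for the demo:

```shell
#!/bin/sh
# Hypothetical demo: start a background worker that creates a scratch file
# and removes it when it receives SIGTERM. SIGKILL cannot be trapped, so
# after kill -9 the scratch file is simply left behind.
run_worker() {
    sh -c '
        scratch=$1
        cleanup() { rm -f "$scratch"; exit 0; }
        trap cleanup TERM
        : > "$scratch"
        while :; do sleep 1; done
    ' worker "$1" >/dev/null 2>&1 &
    worker_pid=$!       # pid of the background worker, for the caller
}
```

After `run_worker /tmp/demo.scratch`, a plain `kill "$worker_pid"` lets the trap run and the file disappears within a second or so. With `kill -9 "$worker_pid"` the handler never runs and the file stays.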

------
oleganza
If the process is not designed to survive the crash then it's more like a
bug. I'd rather encourage everyone to design programs in a robust way, so that
they can clean up after themselves upon relaunch.

~~~
falcolas
And what about when that cleanup takes longer than a proper shutdown?

~~~
oleganza
Good point. Can you give me an example?

~~~
falcolas
MySQL. Rebuilding the state from the transaction logs takes significantly
longer than writing the memory state to disk.

------
mmphosis
[https://github.com/mmphosis/kill](https://github.com/mmphosis/kill)

------
fensipens
When should I not use signal names in conjunction with kill?

Never.

~~~
mturmon
Earlier implementations of the Unix kill command did not allow names, only
numbers, so many people (deeply familiar with Unix) know the numbers as well
as or better than the names. Plus, it's shorter.

~~~
fensipens
What earlier implementations? The initial import of /bin/kill into the NetBSD
source-tree accepted signal names and that was 21 years and 2 months ago. Same
with FreeBSD and their commit message even implies that signal names were
allowed in the original 4.4BSD-Lite source.

WRT shorter: Magic numbers don't just suck in programming.

~~~
mturmon
21 years ago was 1993. I learned Unix a decade before that.

7th Edition AT&T Unix did not allow signal names
([http://plan9.bell-labs.com/7thEdMan/v7vol1.pdf](http://plan9.bell-labs.com/7thEdMan/v7vol1.pdf),
search for "extreme prejudice" -- I still remember many of the little gags in
the early manpages).

That was a mainstream release in the mid-1980s. Even the basic utilities like
kill(1) were incompatible back then, so if you worked on both BSD and AT&T
systems, it was easier to use the compatible subset.

***

Regarding magic numbers: in general, yes, to be avoided. But my usual use of
kill -9 is in exasperation, from the command line, and clarity for others is
not a priority. I admit, in a script, kill -HUP is to be preferred to kill -1.
But even in a script, I'd say kill -9. This usage thing seems to be complex.

------
smegel
I do it all the time. In production. _While_ testing my code.

~ The most interesting man in the world.

~~~
Aardwolf
While. I see what you did there :)

