
Reading /proc/pid/cmdline can hang forever - omnibrain
http://rachelbythebay.com/w/2014/10/27/ps/
======
npsimons
This is interesting, but not nearly as bad as it sounds:

 _No, the problems start when you disable the kernel's OOM killer and try to
act on your own._

Surprise, surprise: when you disable the time-tested built-in mechanisms that
have been put to use in a multitude of cases that boggles the mind and try to
implement your own, you're gonna have a bad time.

------
chubot
So I guess she is talking about this case:

    
    
        1) something like systemd starts a process in a cgroup, to limit memory usage
        2) it disables the kernel OOM killer, and expects to
           receive notification itself and kills the offending process
        3) systemd crashes?  Thus it can no longer process the
           notifications, so you have a child process which is at
           the memory limit but never killed.
    

This is a great example of why an init system should be as simple as possible
and not die under any circumstance :) Why would your init system die? If you
use something like daemontools, it is designed to not even allocate memory
after startup, because that could fail.
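For reference, the disable-and-notify mechanism from step 2 of the scenario
above looks roughly like this under cgroup v1 (a sketch only; the cgroup path
is hypothetical, registration needs appropriate privileges, and the function
names are mine):

```python
import os

def oom_registration_line(event_fd, control_fd):
    # cgroup.event_control expects "<eventfd> <fd of memory.oom_control>"
    return f"{event_fd} {control_fd}"

def watch_cgroup_oom(cg):
    """Disable the kernel OOM killer for the cgroup-v1 memory cgroup at
    directory `cg` and block until the group hits its limit.  If the
    process running this dies, the cgroup's tasks stay stuck at the
    limit forever -- the failure mode the article describes."""
    efd = os.eventfd(0)  # Python 3.10+
    ocfd = os.open(os.path.join(cg, "memory.oom_control"), os.O_RDONLY)
    with open(os.path.join(cg, "cgroup.event_control"), "w") as ec:
        ec.write(oom_registration_line(efd, ocfd))
    with open(os.path.join(cg, "memory.oom_control"), "w") as oc:
        oc.write("1")  # oom_kill_disable = 1
    os.read(efd, 8)    # blocks until an OOM event fires
    # ...now userspace must pick a victim and kill() it itself.
```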

Was she talking about systemd? Which process in systemd does the killing? Is
it systemd-nspawn or something? (I'm not using systemd now.)

~~~
ars
Is systemd going to become like the NSA leak? Every article no matter what
will somehow become about it?

This has nothing to do with systemd.

And systemd's init doesn't die anyway; all the extra stuff runs in separate
processes, and process 1 itself is indeed very simple.

~~~
chubot
systemd is the most notable process manager that has cgroups support, and the
article is talking about process managers that "fail".

It doesn't matter if it's not PID 1 that fails. The article is saying: if
the process that is supposed to receive OOM notification and kill the process
fails, you will observe symptoms that are very hard to diagnose.

PID 1 dying is of course really bad, but if other processes in the init system
die, your system can still be hosed, as pointed out here.

~~~
ars
> systemd is the most notable process manager that has cgroups support

So? Not everything is about systemd.

systemd does not disable the OOM killer, so this has nothing to do with
systemd.

You can create cgroups yourself you know - you don't need systemd for that.

~~~
glandium
> So? Not everything is about systemd.

It's time for an equivalent to Godwin's law for systemd.

------
peterwwillis
'ps' can hang for all sorts of reasons. If a process's executable lives on an
NFS mount that's hung, 'ps' will stat the target of the 'exe' link instead of
the link itself and wait forever for stat to return. Solution? Read
/proc/<pid>/status for the program name and lstat the exe link (there's really
no reason to stat the target in the first place).
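A sketch of that approach in Python (function names are mine; only metadata
the kernel already has in memory is touched, so nothing here can block on a
dead NFS server):

```python
import os

def proc_name(pid):
    """Get the program name from the "Name:" line of /proc/<pid>/status;
    this never touches the executable's inode on disk."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Name:"):
                return line.split(None, 1)[1].strip()

def exe_target(pid):
    # readlink()/lstat() operate on the symlink itself, not on the
    # (possibly unreachable) file it points to, so they return immediately.
    return os.readlink(f"/proc/{pid}/exe")
```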

~~~
andreasvc
Why is that? Doesn't it make more sense to have a deadline of, say, 10 seconds
after which the stat/read/etc will simply fail? I remember trying to fix this
for broken NFS mounts and not succeeding.

~~~
peterwwillis
Basically, in order to maintain the integrity of the filesystem state, it is
assumed that all NFS operations are only temporarily unavailable, and the
system generally waits forever for the server to respond. If the kernel
interrupted the operation with the client, the client might decide to act in a
way that negatively affects the state of the filesystem.

Of course there's no reason for 'ps' not to build in its own timeout for i/o.
It could cause premature failure on loaded boxes, but it wouldn't hurt
anything.
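Such a timeout could be sketched with SIGALRM (names are mine; note the
caveat from above: on a hard NFS mount a thread stuck in uninterruptible
sleep won't receive the signal, so this only helps for interruptible waits):

```python
import os
import signal

class StatTimeout(Exception):
    pass

def stat_with_timeout(path, seconds=10):
    """os.stat() that raises StatTimeout instead of waiting forever.
    Only usable from the main thread, and only effective when the
    blocked syscall is in an interruptible sleep."""
    def on_alarm(signum, frame):
        raise StatTimeout(path)
    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)
    try:
        return os.stat(path)
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)
```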

------
IgorPartola
So in the short term this can be bad. In the long term, I don't think it
matters. Basically, there are four cases:

1. oomkiller is on, and has no bugs in it. Great. We are all set.

2. oomkiller is off, process manager is on, and is bug free. Ditto.

3. oomkiller is on, but has bugs in it. Bad shit happens.

4. oomkiller is off, process manager is on, but has bugs. Bad shit happens.

Basically, the question is this: is it possible to develop a process manager
that does the job of the oomkiller with code quality as high as the Linux
kernel's? I am guessing the answer is yes, so we will always be hitting cases
1 and 2, or at least we'd be hitting cases 3 and 4 with roughly the same
frequency.

Now, it's possible that a process manager can fail for reasons other than
bugs. Not knowing enough about how cgroups, oom, etc. work, I can't say
whether you can write a 100% reliable (assuming it is 100% bug free) process
manager without having kernel-level access. Perhaps there is something in the
architecture of the whole thing that would prevent that.

~~~
npsimons
_Basically, the question is this: is it possible to develop a process manager
that would do the job of oomkiller that has as high code quality as the Linux
kernel. I am guessing the answer is yes_

I'm not trying to put the kernel devs on a pedestal or anything, but how can
you compare a tool that's been tested in a myriad of use cases in literally
everything from microcontrollers to clusters and big iron to something that's
much less mature and probably won't see a tenth of the use cases? My bet is
that you'll see more and more problems as people try to reinvent the wheel
with their own "not-oom-killers", and blame it on the Linux kernel (i.e., 99% of
problems will be case 4, not nearly the same frequency). In that respect this
article is incredibly informative as it will hopefully point people to check
out their own code first. And the beauty is, if they _do_ come up with a
better oom-killer, they can always contribute a patch to the kernel.

~~~
IgorPartola
Well, that's why I said that in the long term this might be solved. There is
nothing as good right now, but in principle it could exist.

Secondly, I doubt that a process manager with support for cgroups would need
to run on microcontrollers. At least up to this point, I have not seen many
microcontrollers running Linux containers.

Lastly, a hybrid solution could be good: process manager identifies what
processes to kill and in what order, oomkiller does the killing.
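That hybrid already has a hook in the kernel: userspace can bias the OOM
killer's victim selection per process through /proc/<pid>/oom_score_adj,
while leaving the actual killing to the kernel. A minimal sketch (the
function name is mine):

```python
import os

def set_oom_priority(pid, adj):
    """Bias the kernel OOM killer for `pid`: adj ranges from -1000
    ("never kill me") up to 1000 ("kill me first").  An unprivileged
    process may only raise its own value, never lower it."""
    if not -1000 <= adj <= 1000:
        raise ValueError(adj)
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(adj))

# e.g. a process manager marks a low-priority worker as first to go:
#     set_oom_priority(worker_pid, 900)
```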

~~~
npsimons
My point about running on microcontrollers is that the Linux kernel (and oom-
killer in particular) has been tested on more use cases than most developers
usually consider. This testing/usage has revealed mistaken design decisions
and bugs. Very few process managers created from scratch will have been tested
on nearly as many use cases and will naturally not be as robust.

I do agree with you about the hybrid solution, with a twist: since the oom-
killer has been so widely tested, I would think that if someone needed
different performance parameters, it might behoove them to start with the oom-
killer and tune it, modifying the source if necessary, instead of re-inventing
the wheel from scratch.

------
ambrop7
Would it be hard to fix this specifically in the kernel? I don't know the
details of the implementation, but I suppose those regions of memory holding
the required information for /proc to work could be made always accessible,
even if a process has hit its memory limit.

~~~
Someone1234
Then it isn't a limit at all, it is a suggestion.

If something running within a cgroup can exceed what the cgroup allows then
the cgroup is completely worthless as a concept.

Plus you have to assume some cgroups might be TRYING to DoS the machine
they're running on (e.g. shared hosting). If you give them a way to bypass the
cgroup's protections and use up additional memory, then they WILL take out the
machine.

~~~
ambrop7
That the process is being memory limited shouldn't make it impossible for
another process to read its cmdline. The cmdline data of the process should
already be in memory.

EDIT: I suspect that this problem is a consequence of a silly design in which
the process's own argv holds the cmdline. Apparently you can change the
process name by changing argv; see:

[http://www.uofr.net/~greg/processname.html](http://www.uofr.net/~greg/processname.html)

This seems to allow any process to mess with what the kernel sees as the
cmdline. I hope there's not a "more" serious security issue hiding.

This is more of a suspicion than fact, I'm planning to look into the kernel
code to get more info on that.
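One detail worth noting here: the kernel keeps two names. The short "comm"
name (the "Name:" field in /proc/<pid>/status, settable via
prctl(PR_SET_NAME)) lives kernel-side, while /proc/<pid>/cmdline is read out
of the process's own argv memory, which is exactly why cmdline reads can
block but status reads can't. A sketch via ctypes (function names are mine;
PR_SET_NAME = 15 comes from <linux/prctl.h>):

```python
import ctypes

PR_SET_NAME = 15  # from <linux/prctl.h>

def set_comm(name):
    """Set the kernel-side short process name (truncated to 15 bytes).
    This changes the "Name:" line in /proc/self/status but NOT
    /proc/self/cmdline, which is paged in from the process's argv."""
    libc = ctypes.CDLL(None, use_errno=True)
    buf = ctypes.create_string_buffer(name.encode()[:15])
    if libc.prctl(PR_SET_NAME, buf, 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NAME) failed")

def get_comm():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Name:"):
                return line.split(None, 1)[1].strip()
```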

------
hijinks
I'm running into this issue which I asked about here

[http://serverfault.com/questions/640248/ps-aux-hanging-on-hi...](http://serverfault.com/questions/640248/ps-aux-hanging-on-high-cpu-io-with-java-processes)

I've yet to find the solution. It's a pretty much stock CentOS 6 install, so
no cgroups. The box is pretty much IO resource starved, and for the life of me
I can't figure out why it only hangs going into /proc/pid# of a java process.

------
JoshTriplett
This makes no sense. With the OOM killer disabled, the kernel shouldn't _hang_
when memory allocation fails; it should return -ENOMEM. If something isn't
handling that correctly, it needs fixing. In particular, if something in the
kernel doesn't handle that correctly and propagate -ENOMEM back to userspace,
it needs fixing.

~~~
andreasvc
This is not about memory allocation. OOM occurs when too much of the allocated
memory is being _used_. For example, forking happens with copy-on-write
memory, so if I have a 3G process on a 4G machine, I can fork it 10 times
without problems. It's only when those ten processes each start to write to
their memory that the physical memory usage will quickly become too much to
handle. In the scenario in the blog post, the OOM killer is disabled, and
instead Linux can only prevent the memory from being accessed to avoid things
getting worse. I would agree that this setup does sound like it needs fixing,
but it's what we got in exchange for having cheap process forks.
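The copy-on-write behaviour described above is easy to observe (a sketch;
the buffer size is arbitrary and the function name is mine):

```python
import os

def fork_cow_demo():
    """fork() a process holding a large buffer: the fork itself costs
    almost no physical memory, because parent and child share pages
    until one of them writes (copy-on-write)."""
    buf = bytearray(50 * 1024 * 1024)  # 50 MB, shared after fork
    pid = os.fork()
    if pid == 0:
        _ = buf[0]   # reading shared pages costs nothing extra
        buf[0] = 1   # writing faults in a private copy, one page at a time
        os._exit(0)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```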

~~~
JoshTriplett
Ah, thanks for the clarification; you're right, there's no way around that,
short of blocking forks and speculative allocations unless there's enough
memory to back them. That's theoretically possible, but extremely harsh, given
the common case of fork/exec for instance.

~~~
andreasvc
I think it would be worthwhile to find a way around it, but it would require
all applications to be conservative about their allocations. For example,
instead of forking, sharing memory between processes could become something
that has to be explicitly requested. It's probably never going to happen,
given that these primitives such as fork and malloc are so ingrained, but when
you consider what Linux is being used for nowadays, from mission critical
servers to satellites, it might make sense to make things more robust at the
expense of compatibility.

~~~
EdiX
> For example, instead of forking, sharing memory between processes could
become something that has to be explicitly requested

A fine-grained version of fork, clone, already exists. But it's not just
fork: call stacks are also part of this problem.

~~~
andreasvc
How are stacks part of the problem? Given tail-call optimization stacks
shouldn't grow unbounded, right?

~~~
EdiX
I didn't see this until now. The problem is that when you spawn a new thread,
space for its stack needs to be allocated. Since the stack can't easily be
reallocated in C/C++, it also needs to be "big enough". The current
implementation allocates several megabytes of stack for each thread; since
memory accounting isn't strict, this doesn't create problems, because unused
stack doesn't really take up space. If you start accounting memory strictly,
you will also have to be stricter about allocating stacks for new threads,
limiting yourself to just a page or two wherever possible. This is not an easy
task.
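Python exposes the same knob through threading.stack_size(); shrinking the
per-thread reservation is exactly the kind of tightening strict accounting
would force (a sketch; 256 KiB is an arbitrary choice and the function name
is mine):

```python
import threading

def run_with_small_stack(fn):
    """Spawn a thread with a 256 KiB stack instead of the multi-megabyte
    default reservation.  Under lazy accounting the untouched part of the
    default stack costs nothing; under strict accounting every thread
    would be charged its full reservation up front."""
    threading.stack_size(256 * 1024)
    result = []
    t = threading.Thread(target=lambda: result.append(fn()))
    t.start()
    t.join()
    threading.stack_size(0)  # restore the platform default
    return result[0]
```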

------
dded
Is this a hard problem for kernel devs to solve? Or could it be as simple as
using a timeout when the kernel's OOM killer is disabled? That is, when a
cgroup is at the limit, the kernel waits for some finite amount of time and
then starts killing things itself. Or would this cause other problems?

~~~
npsimons
The oom-killer is the kernel devs' solution; if you're disabling it, you should
know better, and more importantly, you're on your own. Also, the kind of
people who would disable the oom-killer are probably the same kind who
wouldn't want the kernel messing around with their cgroup; in many senses,
that just looks like another oom-killer, only tuned to wait a bit longer.

------
dmitrygr
TLDR: if you run out of RAM you're permitted to allocate, things that need a
RAM allocation may malfunction. Ok then.

