
Docker and the PID 1 zombie reaping problem - ohcoder
http://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/
======
KaiserPro
I know this will cost me a lot of karma, but surely this is the point where you
say, let's just use a hypervisor?

Yes, I know it's much easier to just use Docker, but at the point that you have
to write a new init, doesn't that strike you as a sign that the thing you're
using isn't easy anymore?

Things like "make sure we don't lose syslog" make me shudder.

Yes, supposedly there is an argument that Docker is faster. However, as a
build engineer I've seen little evidence of that.

We use VMware with throwaway build disks. This means that a dev can destroy a
machine, and once rebooted it's back to a known default state.

The best part is that we have collectd on the VMware host, and in the guest,
to collect metrics on CPU, memory and IO.

Because we have decent disks, and lots of memory, we don't hit any
limitations. For heavy builds we have NFS-rooted servers with non-persistent
disks: much faster, and they require almost no thought.

~~~
bitwize
This is actually why we all need to support systemd. It is already an init
system, and a sophisticated one; now that it has integrated container support
it's only a matter of time before systemd's pid 1 assumes zombie reaping
duties for the host and all container images.

People who criticize systemd fail to realize that it provides solutions for
many problems under Linux, and even for problems with the Unix model itself, and
has widespread support from companies, distributions, and other stakeholders.

~~~
KaiserPro
Yes, it provides a supervisord-type mechanism, that's great, and so is
dependency handling, but it's all outweighed by the following:

The horrid shite that is journald. Whoever thought it was a brilliant idea
to smoosh grep, tail and vi together is someone who's never had to debug
anything hung over.

Seriously, why in living fuck do I have to use your shitty tool to look at
system logs? The whole point of Linux is that you're supposed to be able to use
anything to open anything. Sure, it's a stupid idea to use most tools, but that's
not your decision.

Seriously, do I have to learn a completely new syntax to find all the things
that happened at 10:23:04 three days ago? Everywhere else I can use
vim/nano/emacs/awk/grep/sed.

But here? No, I'm too stupid to understand those tools, here use this one. We've
thought of every use case, and if you're having trouble it's because you're
either a hater, an idiot, a sysadmin, or too old.

You can tell that the people who write this shit are not the poor fuckers who
have to tune, support, debug, migrate or use it.

So no, I'm not going to "get behind systemd" until journal is fit for purpose.

~~~
grifferz
If you're running a syslogd then the logs also get stored there, so you can
continue to use the tools you are familiar with if that is more important to
you than possibly considering different tools that may work better.

~~~
KaiserPro
Much as I like the implication that I've not tried the journal, or that I'm
immune to new ideas, thanks, thanks very much.

Look, my job is to implement new ideas. I run a datacenter that has 15 PB of
tier 1 storage, and 28,000 CPUs.

The one thing I care about most is this: getting home to see my children.

The only time I look at logs is if something terrible has gone wrong.
Everything else is done by sourcing metrics directly from the components
(collectd + Graphite).

We do ship logs, and yes we have Elasticsearch, but that's far too heavy and
far too high-latency to be useful.

So the only real time that I look at logs is when something is horrifically
broken, or we can't figure out from the graphs what has gone wrong.

At that time, I do not want to have to remember new syntax to get at what
should be text files. (Yes, signed logs, but let's be honest, that's why you
ship logs.)

If I got _more_ functionality, then yes I'll be onboard.

However, I can't use bash to talk to the journal; it only has a C API, which
smacks of its being designed for phones and super-integrated HPC, not the 95%
use case.

Because it uses binary blobs (and yes, I do love binary, just not for logs),
it's a pain in the arse to inspect using a scripting language.

if journalctl _COMM=sshd | tail | grep "something" then;

Really nice. Also, what the fuck is with starting a switch with an _?
An underscore indicates a hidden function. Seriously, why is it different from
every other log inspection tool?

~~~
the_why_of_y
It's not a switch, it's a match argument, and the underscore prefix indicates
that the field is "trusted": the kernel retrieved the value and the logging
application is unable to provide "fake" data for the field in case it has been
compromised.

------
shykes
Hi, Docker author. Just wanted to say that we are aware of the problem,
understand the underlying mechanics, and plan to fix it. In the meantime
application-level fixes like this can definitely help.

One slight difficulty is that some applications _do_ want to be pid1. While
99% of the ecosystem wants the problem of being pid1 to be handled natively by
Docker, the remaining 1% (people who write init systems, or platforms tied to
a particular init system) emphatically wants the opposite. So we need to make
sure both modes of operation are available, and deal with a dilemma: do we
change the default behavior, potentially breaking images expecting to be run
as pid1? Or do we keep a default which breaks 99% of applications, requiring
an explicit setting to opt into the correct behavior?

Anybody with opinions on the matter is welcome to voice them on #docker-dev /
freenode, or the dev mailing list, we'd love to hear from you.

~~~
FooBarWidget
Having a config option is a good idea. I vote for changing the default
behavior because Docker has so much growth potential left. Leaving the default
as-is is problematic for the majority of people, both current and future. I
think that breaking it is acceptable because a lot of the people who use
Docker in production -- especially those that are expecting to be PID 1 -- are
early adopters and will forgive you for the slight breakage.

~~~
Florin_Andrei
Yes, change it now, before it's everywhere. The longer you wait, the worse the
problem becomes.

------
sandGorgon
I'm not sure writing yet another PID 1 tool is such a good idea. I have been
using Docker in production for a few months now and I use supervisord as PID
1. I am able to hook it up to most app servers (Python, Rails, etc.) as well as
rsyslogd. Logging is a big pain on Docker, and you should think of your
logging strategy before your PID 1. How are you rotating your logs, and is your
PID 1 restarting your app on failure? For us, supervisord does all that.

We run syslog and logrotate inside the container (using a volume) and it works
fairly well for us. Plus supervisord comes with a pretty kickass tool -
supervisorctl - and the conf file is fairly well documented.
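For concreteness, a setup along these lines can be sketched as a small supervisord config. This is a hypothetical illustration, not the commenter's actual file; the program names and commands are assumptions. The key points are `nodaemon=true` (so supervisord stays in the foreground as PID 1) and `autorestart=true` (so it restarts the app on failure):

```ini
; Hypothetical supervisord.conf sketch for running supervisord as PID 1
[supervisord]
nodaemon=true                  ; stay in the foreground so supervisord remains PID 1

[program:app]
command=gunicorn myapp.wsgi    ; hypothetical app server command
autorestart=true               ; restart the app on failure

[program:rsyslogd]
command=rsyslogd -n            ; -n keeps rsyslogd in the foreground
autorestart=true
```

Running multiple foreground programs under one supervisor like this is what makes a single-container, multi-process setup manageable.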

Optimizing the RAM/CPU usage of supervisord and rsyslogd is not very critical
- they don't use too much.

I actually learned a lot from baseimage - but I don't know why they went with
runit rather than supervisord.

Would really, really, REALLY love to learn to do this with systemd.

EDIT: ahh... forgot that their script was based on runit. My mistake - but I
don't think I will change over from supervisord anytime soon.

~~~
detaro
Supervisord's docs say explicitly that it is not meant to be used as PID 1.
Does it still perform all the necessary functions?

~~~
sandGorgon
Yes - we use Django, Rails, Postgres, Redis, syslogd and cron. In general it
works very well - we use Puma for the Rails server and Gunicorn for Python, and
we are able to control them very well.

------
justincormack
Surely writing a minimal init in C (or Go if you like) is better than using
Python... it is not a lot of code, and frankly I would trust it more not to die
and take your container down with it. Plus Python has a chunk of dependencies
you don't need in a minimal container.

~~~
FooBarWidget
Lots of system tools are written in Python nowadays. If you use a Debian,
Ubuntu or CentOS base image then there's a good chance that Python is
preinstalled because some system tool depends on it.

It was also much easier to write the tool in Python than in C. Writing it in C
would have taken much longer, and at the end of the day we wanted to get work
done instead of worrying about a few MBs, which only take 1 second to download
and which cost $0.001 of disk space.

~~~
mborch
It's trivial to write this program in C and it is a waste of memory to load up
50 MB of Python just to accomplish this simple task.

~~~
FooBarWidget
Actually, the Python process eats less than 5 MB of RAM. Most other stuff on
the system eats more RAM.

Writing my_init took much more time than you think, and I do not regret
choosing Python. my_init not only handles process reaping, but also
environment variable handling and a bunch of other stuff. Writing text parsers
in C is a huge pain (libc has no regexps that are as easy to use as Python's).

I do not oppose rewriting it in C, but given limited manpower and the tiny
benefits it's not high on my todo list either. If you are worried about
resource usage, then I would very much welcome contributions for rewriting it
in C.

~~~
edwintorok
Why not use already existing init? runit seems small enough:
[http://smarden.org/runit/](http://smarden.org/runit/)

Or if you want small have a look here for minimal PID 1 "So how should init be
done right?" [http://ewontfix.com/14/](http://ewontfix.com/14/)

~~~
FooBarWidget
Baseimage-docker _does_ use Runit. my_init calls Runit.

~~~
edwintorok
Why not use
[http://smarden.org/runit/runit-init.8.html](http://smarden.org/runit/runit-init.8.html)
and run my_init _from_ runit?

------
rubiquity
I think needing to run an init system with Docker is a solution to a problem
you've created for yourself. From reading this post and other posts on your
blog about Docker it sounds like you are `docker start`-ing containers that
have been `docker stop`-ped. In my experience, using `docker start` isn't a
best practice and I design my containers in a way that I won't need to restart
a stopped container.

If you use Docker containers as a form of lightweight immutable infrastructure
you won't care about zombie processes because you won't be restarting stopped
containers. Instead you just `docker run` every time you want to run a new
container. The example of a corrupt file is moot because your containers
shouldn't be writing to the file system with the intention of that file being
around for long. This also applies to databases, which I'm not fond of running
in containers at all (setting aside highly distributed databases, perhaps).

~~~
ekimekim
The problem isn't your main Docker process dying (in that case I believe
dockerd itself reaps the process); the problem is when your service spawns
child processes as part of normal operation.

As an example, suppose we have a simple web server that executes CGI scripts
written in bash (I know this is kinda contrived, but it illustrates my point).
I make a request to the server, and it runs a bash script. The script runs grep.
While grep is executing, the web server decides bash has taken too long and
SIGKILLs it. Then grep is adopted by your PID 1, in this case your web server.
But the web server doesn't know it's there, and so never calls wait(). You now
have a zombie process that will last until you stop the container.
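The lingering-zombie effect described above is easy to reproduce. This Linux-specific sketch (my own illustration, not from the thread) shows a child that has exited but, because nobody has called wait() on it yet, remains in the process table with state `Z`:

```python
import os
import time

def state_of(pid):
    """Return the one-letter state code from /proc/<pid>/status (Linux)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("State:"):
                return line.split()[1]  # e.g. 'Z' for zombie

child = os.fork()
if child == 0:
    os._exit(0)          # child terminates immediately
time.sleep(0.2)          # give the kernel a moment to register the exit
zombie_state = state_of(child)  # 'Z': the dead child still occupies a slot
os.waitpid(child, 0)     # reaping it removes the zombie from the table
```

In a container whose PID 1 never calls wait(), that last line simply never happens, and the `Z` entries accumulate.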

------
jrelsasser
FWIW I have written a very, very small /sbin/init replacement for docker
containers. It runs one script on startup and another on shutdown, supplied by
the container author:
[https://github.com/jre/dinit](https://github.com/jre/dinit)

------
Alupis
The terminology used to describe child processes can often seem comical. I
recall my systems programming classes, where the professor would shout "The
parent must kill all the children!" -- it was good for a few chuckles.

More seriously, the issues raised in this article are real, and I do believe
the author has done a fine job breaking down the UNIX process hierarchy and its
defined responsibilities.

I'd say it's a shame, as this is yet one more important
stability/scalability/security/performance item the Docker team has not
thought of (or maybe they're busy writing more ecosystem apps), but I do
believe this is a container problem at large, seeing that the official App
Container Spec has not addressed it either.

For those criticizing the need for "yet another pid 1", well, there doesn't
seem to be any way around it that can guarantee the same outcome. Yes, it's
understandable having reservations about Python being the tool of choice for
such an init system, and perhaps C would be a better choice allowing more
fine-grained control and guarantees. My guess is the article author was most
familiar with Python and wanted to have a POC for the write-up, which is
satisfactory.

------
koffiezet
As I posted on Reddit:

A simpler and better solution for this problem is: do not let apps run in the
background (double-fork/detach) in containers. Most applications that
background by default can override this with a command-line flag. And
certainly don't run programs that execute other programs that background. Keep
it simple, and try to limit yourself to one process per container. Sure, there
might be situations where this cannot be avoided, but in 99% of situations
this is not a problem.

Yes, you have to be aware of this potential problem, but I am currently
running tons of containers, of which only a few run supervisord (which handles
reaping too) to run multiple processes within one container, because it was
the sane thing to do. Stale zombie processes on my Docker hosts? None. These
precautions, in my opinion, are only required for badly behaving software. And
yes - there is software like that.

------
dypsilon
Is there a ticket in the docker project? I would like to follow the progress
on this one.

Everything I could find was this ticket:
[https://github.com/docker/docker/issues/5305](https://github.com/docker/docker/issues/5305)

------
tjholowaychuk
Isn't this effectively a non-issue? It's literally just a tiny entry in a
process table that's kept around, no? Or does it also cause Docker to behave
oddly?

~~~
FooBarWidget
There are two issues. The first is Murphy's law. One zombie process is probably
harmless, but if the number grows then you'll hit kernel limits and the
container cannot create further processes.

Second, zombie processes that don't go away can also interfere with software
that checks for the existence of processes. For example, [the Phusion Passenger
application server](https://www.phusionpassenger.com/)
manages processes. It restarts processes when they crash. Crash detection is
implemented by parsing the output of `ps`, and by sending a 0 signal to the
process ID. Zombie processes are displayed in `ps` and respond to the 0
signal, so Phusion Passenger thinks the process is still alive even though it
has terminated.

~~~
davidw
> Crash detection is implemented by parsing the output of `ps`

Would there be anything to be gained by actually rooting around in /proc
yourselves, rather than relying on ps?

~~~
FooBarWidget
`ps` is implemented by parsing /proc, so the results you get are the same.

Having said that, /proc/xxx/status does contain a flag that tells you whether
a process is a zombie or not. So on Linux, Phusion Passenger parses that file
as an extra check. But not all software does this.
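Both checks described here can be sketched in a few lines. This is my own simplified, Linux-specific illustration (not Phusion Passenger's code): the signal-0 test is fooled by a zombie, while reading the `State:` field of `/proc/<pid>/status` catches it:

```python
import os
import time

def is_alive_naive(pid):
    """Liveness check via signal 0 -- succeeds even for zombies."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False

def is_zombie(pid):
    """Check the State: field of /proc/<pid>/status (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("State:"):
                return line.split()[1] == "Z"
    return False

# A freshly exited but unreaped child passes the naive check...
child = os.fork()
if child == 0:
    os._exit(0)
time.sleep(0.2)
naive = is_alive_naive(child)  # True, even though the process is dead
extra = is_zombie(child)       # True: the /proc check catches it
os.waitpid(child, 0)           # clean up
```

Software that relies only on the signal-0 style of check is exactly what lingering zombies confuse.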

------
amalag
How does this compare to minit or runit ?

------
emeraldd
Why write your own mini-init instead of using something like supervisord?

~~~
masklinn
Supervisord is not an init and must not be run as pid1, that's explicitly
mentioned in its documentation[0]:

> It shares some of the same goals of programs like launchd, daemontools, and
> runit. Unlike some of these programs, it is not meant to be run as a
> substitute for init as “process id 1”.

There are micro-inits you could use instead though, s6 should be runnable as
pid1 for instance.

[0]
[http://supervisord.org/index.html?highlight=init](http://supervisord.org/index.html?highlight=init)

~~~
emeraldd
I'm a little confused here, especially considering that the Docker site lists
supervisord explicitly, with instructions on how to use it...

see:
[https://docs.docker.com/articles/using_supervisord/](https://docs.docker.com/articles/using_supervisord/)

~~~
masklinn
There's no problem with running supervisord, inside or outside of docker,
there's a problem running supervisord as an init (pid1). You can run openrc or
systemd or s6 or runit or minit or whatever and use supervisord on top of
that, you just shouldn't use supervisord directly as pid1.

------
digi_owl
Sysv heavyweight?

~~~
_delirium
It's got a reasonable amount of stuff in it. The Debian version of sysvinit
(v2.88) is about 10,000 lines of C code. Probably not a huge deal, but it does
more than just reaping zombie processes, if that's really all you want.

~~~
pdw
The actual init command is only about 3000 lines, the rest is for auxiliary
programs that you probably wouldn't need for Docker. And 10k lines is still
tiny compared to Python :)

------
m00dy
>“As long as a zombie is not removed from the system via a wait, it will
consume a slot in the kernel process table, and if this table fills, it will
not be possible to create further processes.”

To the author:

Do you know how many processes a kernel can handle?

    root@lisa2:~# cat /proc/sys/kernel/pid_max
    32768

FYI.

~~~
ekimekim
Suppose a service has a timed routine that spawns a child every 10 seconds.
Unfortunately, this child has a bug and starts leaking zombies to PID 1.
Congratulations, you now have roughly 91 hours until your entire system stops
being able to spawn new processes.
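The back-of-the-envelope arithmetic, assuming one leaked zombie per 10 seconds against the default pid_max of 32768 quoted above:

```python
# One zombie every 10 seconds against the default process-table limit.
pid_max = 32768            # default /proc/sys/kernel/pid_max
seconds_per_zombie = 10
hours_until_full = pid_max * seconds_per_zombie / 3600.0
# roughly 91 hours, i.e. a few days -- not long for a long-running host
```

The rate is of course made up for illustration; a chattier service leaking several zombies per second fills the table correspondingly faster.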

~~~
m00dy
You could have a nuclear war near the datacenter where you host that broken
software & container. I think that has a higher probability of happening in
the real world.

------
vezzy-fnord
This is nice and all, but the way the authors wrote this article gives off
this feeling of hubris that they've uncovered some critical yet technically
obscure problem.

They haven't. This is Unix programming 101 right here. Even if the Docker
community has been generally ignorant of this, the problem is not any less
glaring or basic.

I'm not sure if my_init is really an "init system", though. I'd call it an
"init daemon", since it handles the bare minimum of waiting on its children. A
system would imply greater complexity.

~~~
FooBarWidget
Author of the article here. Of course it's basic Unix programming 101...
from _our_ point of view! You and I think it's basic because we're so
knowledgeable in this area. But this does not hold for the rest of the world.
99 out of 100 people that I've spoken to have absolutely no idea about this
problem. I used to link to the Wikipedia article and thought that people would
get it. They didn't. People only started getting it once I gave 15-minute
talks about the problem and described things with diagrams. Lots of people
suggested that I should blog about it, and so I did.

~~~
vezzy-fnord
There is no relative point of view here. This is simply an egregious
educational failure where application developers have failed to understand the
underpinnings of how their platform handles processes at a high level.

Then again, this is how the industry works, I suppose. You rediscover an old
paradigm or basic property of your system that has been abstracted out,
present it as new, deliver it in a shiny package and reap the fame.

Sorry, I was mostly ticked at how you presented this issue in such an alarmist
and "problem-reaction-solution" tone. It's just humorous.

